Difference between revisions of "Google Season of Docs 2022/Organize and Update Apertium User Documentation"

From Apertium
m (Protected "Google Season of Docs 2022/Organize and Update Apertium User Documentation": Locked for Google's review ([Edit=Allow only administrators] (indefinite) [Move=Allow only administrators] (indefinite)))
 
(10 intermediate revisions by 2 users not shown)
Line 2: Line 2:
 
== About Apertium ==
 
== About Apertium ==
   
Apertium is a free and open source machine translation and language technology platform. We have over 500 languages and pairs, maintained using 15+ different tools.
+
Apertium (current version 3.8, first release 2004) is a free and open source (mainly GPLv3) rule-based machine translation and language technology platform. We have over 500 languages and pairs, maintained using 15+ different tools, with contributors from all around the globe.
   
 
== About the project ==
 
== About the project ==
Line 15: Line 15:
 
=== The solution ===
 
=== The solution ===
   
  +
Following the 4-part division proposed by https://documentation.divio.com into Reference, Tutorials, How-to Guides, and Explanations, this project will gather and reorganize existing documentation into a single, easily-located, authoritative source to replace the existing hodge-podge of often unmaintained fragments.
The solution to the above problem is to create updated documentation for all pipeline modules and/or a full tutorial.
 
   
  +
The majority of existing documents will fall under Reference and Tutorials, which will then be expanded and updated to reflect the current state of all the commonly used components of a translation system.
Ideally documentation on a given tool will exist in a single place, and a full tutorial will also have a single unified source. One possibility is to generate one set of docs from another, or from a single unified source. For example, if we want tools to be documented in both their GitHub repos and on the wiki, we should generate one set of documentation from the other (or a third source). If we want a full tutorial to be on the wiki but also available in PDF format, then we should designate one source as the original and generate the others from them.
 
  +
  +
How-to Guides and Explanations, on the other hand, will be gathered and those that are outdated will be corrected, but expansion of this material will primarily take the form of examples and guidelines for future contributors.
   
 
=== The scope ===
 
=== The scope ===
   
 
* Overview of the Apertium platform
 
* Overview of the Apertium platform
* All stages of the Apertium pipeline
+
* Reference documentation and tutorials for all stages of the Apertium pipeline
  +
* Organized collection of how-to guides and background material
* The main approaches to and tools for each stage
 
   
 
=== Measuring success ===
 
=== Measuring success ===
   
Unfortunately, the only metric we have is how many people contact us either via mailing list or IRC, and that number has fallen drastically during the Covid-19 pandemic. But from both feedback and direct questioning, we know contributors (potential and current) manage to find incorrect or outdated documentation.
+
Unfortunately, the only metric we have is how many people contact us either via mailing list or IRC, and that number has fallen drastically during the Covid-19 pandemic. But from both feedback and direct questioning, we know contributors (potential and current) manage to find incorrect or outdated documentation, or are puzzled about as-yet undocumented features and behavior.
   
 
So the way we would measure success is that the number of contributors somehow winding up following an old tutorial drops close to zero.
 
So the way we would measure success is that the number of contributors somehow winding up following an old tutorial drops close to zero.
   
  +
=== Timeline ===
== Existing Documentation ==
 
   
  +
Our technical writer is an active contributor who is very familiar with the various components of Apertium. We estimate that this project will take 3-4 months. A conservative timeline is given below.
=== Formal Descriptions ===
 
 
{| class="wikitable" border="1"
 
|-
 
! Source
 
! Mostly Complete
 
! Partial
 
|-
 
| [https://wiki.apertium.org/w/images/d/d0/Apertium2-documentation.pdf 2.0 docs]
 
|
 
* stream format
 
* transfer
 
* monodix
 
* bidix
 
|
 
* tagger
 
* lrx
 
* format handling
 
|-
 
| wiki
 
|
 
* recursive
 
* anaphora
 
|
 
* separable
 
* makefiles and modes
 
|-
 
| github
 
|
 
* lexd
 
|
 
|-
 
| external sources
 
|
 
* HFST (probably don't redo)
 
* CG3 (link to, don't redo)
 
|
 
|}
 
 
missing:
 
 
* build scripts (filter-rules, etc)
 
* spellchecker
 
* postgenerator?
 
 
=== Tutorials ===
 
 
Even things in the "substantive" column will likely need a fair amount of work for the purposes of this project.
 
 
{| class="wikitable" border="1"
 
|-
 
! Source
 
! Substantive
 
! Fragmentary
 
|-
 
| Apertium wiki
 
|
 
* monodix
 
* bidix
 
* init
 
|
 
* transfer
 
* recursive
 
* anaphora
 
|-
 
| [[User:Firespeaker]]'s course wiki
 
|
 
* lexd
 
* bidix
 
* lrx
 
* recursive
 
|
 
* CG3
 
|}
 
 
missing:
 
 
* HFST
 
* tagger
 
* separable
 
 
== Timeline ==
 
 
This follows the 4-part division of https://documentation.divio.com
 
   
 
{| class="wikitable" border="1"
 
{| class="wikitable" border="1"
Line 211: Line 130:
   
 
== Budget ==
 
== Budget ==
  +
  +
{| class="wikitable" border="1"
  +
|-
  +
! Budget item
  +
! Amount
  +
|-
  +
| Paying technical writer
  +
| $6000
  +
|-
  +
! TOTAL:
  +
! $6000
  +
|}
  +
  +
We considered adding a $500 "just in case" post, but we can't imagine anything else to cover. We've never paid org mentors, and we don't need to restore from ancient archives or broken hardware - and even if we did, it'd likely be faster to just rewrite that part.
  +
  +
== Additional information ==
  +
  +
Apertium has participated in Google Summer of Code 12 times: 2009, 2010, 2011, 2012, 2013, 2014, 2016, 2017, 2018, 2019, 2020, and 2021.
  +
  +
The technical writer participated in GSoC as a student in 2019 and 2021, and as a mentor in 2020.
  +
  +
== Appendix: Survey of existing documentation ==
  +
  +
=== Formal descriptions ===
  +
  +
{| class="wikitable" border="1"
  +
|-
  +
! Source
  +
! Mostly Complete
  +
! Partial
  +
|-
  +
| [https://wiki.apertium.org/w/images/d/d0/Apertium2-documentation.pdf 2.0 docs]
  +
|
  +
* stream format
  +
* transfer
  +
* monodix
  +
* bidix
  +
|
  +
* tagger
  +
* lrx
  +
* format handling
  +
|-
  +
| wiki
  +
|
  +
* recursive
  +
* anaphora
  +
* regtest
  +
|
  +
* separable
  +
* makefiles and modes
  +
|-
  +
| github
  +
|
  +
* lexd
  +
|
  +
|-
  +
| external sources
  +
|
  +
* HFST (probably don't redo)
  +
* CG3 (link to, don't redo)
  +
|
  +
|}
  +
  +
missing:
  +
  +
* common build scripts (filter-rules, etc)
  +
* postgenerator?
  +
  +
=== Tutorials ===
  +
  +
{| class="wikitable" border="1"
  +
|-
  +
! Source
  +
! Substantive
  +
! Fragmentary
  +
|-
  +
| Apertium wiki
  +
|
  +
* monodix
  +
* bidix
  +
* init
  +
|
  +
* transfer
  +
* recursive
  +
* anaphora
  +
|-
  +
| [[User:Firespeaker]]'s course wiki
  +
|
  +
* lexd
  +
* bidix
  +
* lrx
  +
* recursive
  +
|
  +
* CG3
  +
|}
  +
  +
missing:
  +
  +
* HFST
  +
* tagger
  +
* separable
  +
* regtest

Latest revision as of 17:59, 25 March 2022

About Apertium

Apertium (current version 3.8, first release 2004) is a free and open source (mainly GPLv3) rule-based machine translation and language technology platform. We have over 500 languages and pairs, maintained using 15+ different tools, with contributors from all around the globe.

About the project

The problem

Apertium's wiki and other documentation are out of date, poorly organized, not visible enough, and just plain not user-friendly.

This ranges from documentation of individual tools that no longer reflects their current state to our best how-to guides describing how things were done a decade ago. Documentation is scattered across the Apertium wiki, individual GitHub repos, an out-of-date PDF "Book", and even published papers and third-party sites.

As a result, new users and contributors waste time reading out-of-date materials, and even long-time contributors have no way of learning about changes to the tools they use.

The solution

Following the four-part division proposed by https://documentation.divio.com (Reference, Tutorials, How-to Guides, and Explanations), this project will gather and reorganize existing documentation into a single, easily located, authoritative source that replaces the existing hodge-podge of often unmaintained fragments.

The majority of existing documents will fall under Reference and Tutorials, which will then be expanded and updated to reflect the current state of all the commonly used components of a translation system.

How-to Guides and Explanations, on the other hand, will be gathered, and those that are outdated will be corrected; expansion of this material will primarily take the form of examples and guidelines for future contributors.

The scope

  • Overview of the Apertium platform
  • Reference documentation and tutorials for all stages of the Apertium pipeline
  • Organized collection of how-to guides and background material

Measuring success

Unfortunately, the only metric we have is the number of people who contact us via the mailing list or IRC, and that number has fallen drastically during the Covid-19 pandemic. From both feedback and direct questioning, however, we know that contributors (potential and current) still end up finding incorrect or outdated documentation, or are puzzled by as-yet undocumented features and behavior.

We would therefore measure success by whether the number of contributors who wind up following an old tutorial drops close to zero.

Timeline

Our technical writer is an active contributor who is very familiar with the various components of Apertium. We estimate that this project will take 3-4 months. A conservative timeline is given below.

Phase 1: Reference

Week 1 (May 1-7)
Goal: Gather and convert existing documentation
Details:
  • Set up repo for canonical copy
  • Copy all existing docs to canonical repo
  • Delete outdated info
Deliverable: Single canonical source containing existing info

Weeks 2-4 (May 8-28)
Goal: Fill in gaps in formal docs
Deliverable: Up-to-date formal documentation of main pipeline modules and common build scripts

Phase 2: Tutorials

Weeks 5-7 (May 29-June 18)
Goal: Dictionary tutorials
Details:
  • Basic introduction to shell and common Apertium-related commands
  • Guidance for selecting arguments for apertium-init
  • Instructions for going from a linguistic paradigm to monodix/lexc/lexd
  • Introduction to twol
Deliverable: Information sufficient to get a beginner set up and contributing to lexicons

Weeks 8-9 (June 19-July 2)
Goal: Transfer tutorials
Details:
  • How to go from a word-order or agreement difference to a working transfer rule in either formalism
Deliverable: Systematic tutorial for writing transfer rules

Weeks 10-12 (July 3-23)
Goal: Other tutorials
Details:
  • Lexical selection
  • Training a tagger
  • Writing CG rules
  • Anaphora resolution
  • Separable
Deliverable: End-to-end tutorial for the translation pipeline

Phase 3: Explanation

Weeks 13-14 (July 24-August 6)
Goal: Theoretical background
Details:
  • RBMT
  • FSTs
  • other things, if time
Deliverable: Introductions to why Apertium uses the technology that it does

Phase 4: How-to guides and code structure

Weeks 15-17 (August 7-27)
Goal: How-to and code
Details:
  • Write a few how-to guides and make it easy to add more
  • For each core repo:
    • Document listing the general purpose of each source file
    • Doc-comment for each noteworthy function
    • Outline of the operation and control flow of each class corresponding to an executable
Deliverable: Guidelines for contributing to the code

Budget

Budget item               Amount
Paying technical writer   $6000
TOTAL                     $6000

We considered adding a $500 "just in case" line item, but we can't imagine anything else it would need to cover. We've never paid org mentors, and we don't need to restore from ancient archives or broken hardware - and even if we did, it'd likely be faster to just rewrite that part.

Additional information

Apertium has participated in Google Summer of Code 12 times: 2009, 2010, 2011, 2012, 2013, 2014, 2016, 2017, 2018, 2019, 2020, and 2021.

The technical writer participated in GSoC as a student in 2019 and 2021, and as a mentor in 2020.

Appendix: Survey of existing documentation

Formal descriptions

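Source: 2.0 docs (https://wiki.apertium.org/w/images/d/d0/Apertium2-documentation.pdf)
  Mostly Complete:
    • stream format
    • transfer
    • monodix
    • bidix
  Partial:
    • tagger
    • lrx
    • format handling

Source: wiki
  Mostly Complete:
    • recursive
    • anaphora
    • regtest
  Partial:
    • separable
    • makefiles and modes

Source: github
  Mostly Complete:
    • lexd

Source: external sources
  Mostly Complete:
    • HFST (probably don't redo)
    • CG3 (link to, don't redo)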

missing:

  • common build scripts (filter-rules, etc)
  • postgenerator?

Tutorials

Source: Apertium wiki
  Substantive:
    • monodix
    • bidix
    • init
  Fragmentary:
    • transfer
    • recursive
    • anaphora

Source: User:Firespeaker's course wiki
  Substantive:
    • lexd
    • bidix
    • lrx
    • recursive
  Fragmentary:
    • CG3

missing:

  • HFST
  • tagger
  • separable
  • regtest