Google Season of Docs 2022/Organize and Update Apertium User Documentation
About Apertium
Apertium (current version 3.8, first release 2004) is a free and open source (mainly GPLv3) rule-based machine translation and language technology platform. We have over 500 languages and pairs, maintained using 15+ different tools, with contributors from all around the globe.
About the project
The problem
Apertium's wiki and other documentation are out of date, poorly organized, not visible enough, and just plain not user-friendly.
This ranges from documentation of individual tools not reflecting their current state, to our best how-to guides reflecting how things were done a decade ago. Documentation is scattered between the Apertium wiki, individual GitHub repos, an out-of-date PDF "Book", and even published papers and third-party sites.
The result is that new users and contributors waste time reading out-of-date materials, and even long-time contributors have no way of learning about changes to the tools they use.
The solution
Following the 4-part division proposed by https://documentation.divio.com into Reference, Tutorials, How-to Guides, and Explanations, this project will gather and reorganize existing documentation into a single, easily-located, authoritative source to replace the existing hodge-podge of often unmaintained fragments.
The majority of existing documents will fall under Reference and Tutorials, which will then be expanded and updated to reflect the current state of all the commonly used components of a translation system.
How-to Guides and Explanations, on the other hand, will be gathered and any outdated ones corrected, but expansion of this material will primarily take the form of examples and guidelines for future contributors.
The scope
- Overview of the Apertium platform
- Reference documentation and tutorials for all stages of the Apertium pipeline
- Organized collection of how-to guides and background material
Measuring success
Unfortunately, the only metric we have is how many people contact us via the mailing list or IRC, and that number has fallen drastically during the COVID-19 pandemic. But from both feedback and direct questioning, we know that contributors (potential and current) keep finding incorrect or outdated documentation, or are puzzled by as-yet undocumented features and behavior.
We would therefore measure success by whether the number of contributors who end up following an outdated tutorial drops close to zero.
Timeline
Our technical writer is an active contributor who is very familiar with the various components of Apertium. We estimate that this project will take 3-4 months. A conservative timeline is given below.
Time Period | Goal | Details | Deliverable |
---|---|---|---|
Phase 1: Reference | |||
Week 1, May 1-7 | Gather and convert existing documentation | | Single canonical source containing existing info |
Weeks 2-4, May 8-28 | Fill in gaps in formal docs | | Up-to-date formal documentation of main pipeline modules and common build scripts |
Phase 2: Tutorials | |||
Weeks 5-7, May 29-June 18 | Dictionary tutorials | | Information sufficient to get a beginner set up and contributing to lexicons |
Weeks 8-10, June 19-July 2 | Transfer tutorials | | Systematic tutorial for writing transfer rules |
Weeks 11-13, July 3-23 | Other tutorials | | End-to-end tutorial for the translation pipeline |
Phase 3: Explanation | |||
Weeks 14-15, July 24-August 6 | Theoretical background | | Introductions to why Apertium uses the technology that it does |
Phase 4: How-to guides and code structure | |||
Weeks 16-18, August 7-27 | How-to and code | | Guidelines for contributing to the code |
Budget
Budget item | Amount |
---|---|
Paying technical writer | $6000 |
TOTAL: | $6000 |
We considered adding a $500 "just in case" line item, but we can't imagine anything else it would need to cover. We've never paid org mentors, and we don't need to restore from ancient archives or broken hardware - and even if we did, it'd likely be faster to just rewrite that part.
Additional information
Apertium has participated in Google Summer of Code 12 times: 2009, 2010, 2011, 2012, 2013, 2014, 2016, 2017, 2018, 2019, 2020, and 2021.
The technical writer participated in GSoC as a student in 2019 and 2021, and as a mentor in 2020.
Appendix: Survey of existing documentation
Formal descriptions
Source | Mostly Complete | Partial |
---|---|---|
2.0 docs | | |
wiki | | |
GitHub | | |
external sources | | |
Missing:
- common build scripts (filter-rules, etc)
- postgenerator?
Tutorials
Source | Substantive | Fragmentary |
---|---|---|
Apertium wiki | | |
User:Firespeaker's course wiki | | |
Missing:
- HFST
- tagger
- separable
- regtest