Ideas for Google Summer of Code/FieldWorks data extraction

From Apertium


Revision as of 13:09, 20 March 2020 by TommiPirinen (talk | contribs) (flex data as corpus)



FieldWorks stores much of the data we need for building a monodix (an Apertium monolingual dictionary).

Things we might be able to get:

Lexicon entries
Morphology
Bidix entries
Reference corpus / gold standards
Anything else that might be extractable from glossed text

Coding Challenge

Write a script that reads a FieldWorks file and outputs the headword and part of speech of each lexicon entry.

Downloading FieldWorks and making up your own data to test this is fine (you'll probably end up doing a lot of it over the course of the project).
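
As a rough starting point, here is a minimal sketch of the challenge. It assumes the lexicon has first been exported from FieldWorks as LIFT XML, not read from the native project file, whose layout is different and is not handled here. It also assumes that the headword is the first form under `lexical-unit` and that the part of speech is the `value` attribute of each sense's `grammatical-info` element. The function name `extract_entries` and the sample data are invented for illustration.

```python
import xml.etree.ElementTree as ET

# Invented sample data in LIFT-style XML (assumption: a real export has
# the same entry / lexical-unit / sense / grammatical-info layout).
SAMPLE_LIFT = """
<lift version="0.13">
  <entry id="blik_1">
    <lexical-unit><form lang="xx"><text>blik</text></form></lexical-unit>
    <sense id="s1"><grammatical-info value="Noun"/></sense>
  </entry>
  <entry id="florp_1">
    <lexical-unit><form lang="xx"><text>florp</text></form></lexical-unit>
    <sense id="s2"><grammatical-info value="Verb"/></sense>
    <sense id="s3"><grammatical-info value="Verb"/></sense>
  </entry>
  <entry id="na_1">
    <lexical-unit><form lang="xx"><text>na</text></form></lexical-unit>
  </entry>
</lift>
"""


def extract_entries(root):
    """Yield (headword, part_of_speech) pairs for each entry under root."""
    for entry in root.iter("entry"):
        # Headword: the first <text> under <lexical-unit>, if present.
        text = entry.find("lexical-unit/form/text")
        headword = "".join(text.itertext()).strip() if text is not None else ""
        # Part of speech: one <grammatical-info value="..."> per sense.
        tags = [gi.get("value", "")
                for gi in entry.findall("sense/grammatical-info")]
        if not tags:
            tags = [""]  # entry with no sense or no part of speech
        # dict.fromkeys drops repeated tags while keeping their order.
        for pos in dict.fromkeys(tags):
            yield headword, pos


if __name__ == "__main__":
    for headword, pos in extract_entries(ET.fromstring(SAMPLE_LIFT)):
        print(f"{headword}\t{pos}")
```

To run this on an exported file instead of the sample string, `ET.parse(path).getroot()` can be passed to `extract_entries`. Entries with several distinct parts of speech produce one output line per part of speech; whether that is the right behaviour for building a monodix is an open design choice.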

Links

http://software.sil.org/fieldworks/resources/tutorial/lexicon/
- Description of lexicon features
http://software.sil.org/fieldworks/resources/tutorial/grammar/
- Links to morphological stuff
http://downloads.sil.org/FieldWorks/WW-ConceptualIntro/ConceptualIntroduction.htm
- Long list of data we might be able to get
https://github.com/sillsdev/FieldWorks
- FieldWorks internals (might need this to figure out formats, but hopefully not)
