Task ideas for Google Code-in/Scrape inflection information from Wiktionary

Latest revision as of 12:10, 26 May 2023

Objective

The objective of this task is to convert tables of inflectional information on Wiktionary into a format useful for Apertium, e.g. speling format.


Example

For example, take the Bulgarian noun вода. The Wiktionary page for вода includes its Bulgarian inflection information in a table that looks something like:

           Singular  Plural
indefinite вода      води
definite   водата    водите
vocative   водо      води
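
As a rough illustration of the first step, a table like this can be read into a nested dictionary with Python's standard `html.parser` module. This is only a sketch: the `SAMPLE_HTML` markup, the `TableRows` class and the `parse_inflection_table` function are assumptions made for this example. Real Wiktionary tables carry extra markup (links, transliterations, CSS classes) and vary between languages, so a real scraper would need more cleaning.

```python
from html.parser import HTMLParser


class TableRows(HTMLParser):
    """Collect the text of each <th>/<td> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []     # finished rows, each a list of cell strings
        self._row = None   # cells of the row currently being read
        self._cell = None  # text fragments of the cell currently being read

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("th", "td") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("th", "td") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


# Hypothetical, heavily simplified markup standing in for a Wiktionary table.
SAMPLE_HTML = """
<table>
<tr><th></th><th>Singular</th><th>Plural</th></tr>
<tr><th>indefinite</th><td>вода</td><td>води</td></tr>
<tr><th>definite</th><td>водата</td><td>водите</td></tr>
<tr><th>vocative</th><td>водо</td><td>води</td></tr>
</table>
"""


def parse_inflection_table(html):
    """Return {row label: {column label: form}} for a simple two-axis table."""
    parser = TableRows()
    parser.feed(html)
    parser.close()
    header, *body = parser.rows
    columns = header[1:]  # skip the empty top-left corner cell
    return {row[0]: dict(zip(columns, row[1:])) for row in body}


table = parse_inflection_table(SAMPLE_HTML)
print(table["definite"]["Plural"])
```

Keeping the row and column labels as dictionary keys means the later conversion step does not depend on cell positions.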

The equivalent in speling format would be:

вода; вода; sg.ind; n.f
вода; водата; sg.def; n.f
вода; водо; sg.voc; n.f
вода; води; pl.ind; n.f
вода; водите; pl.def; n.f
вода; води; pl.voc; n.f

Here n.f means "noun, feminine" (this information will also typically be on the Wiktionary page).
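
The conversion above can be sketched in a few lines of Python. The names `NUMBER_TAGS`, `CASE_TAGS`, `VODA` and `to_speling` are made up for this illustration, and the label-to-tag mappings cover only the labels in the вода table. Other languages and parts of speech would need their own mappings.

```python
# Assumed mappings from the table's labels to the tags used in the example.
NUMBER_TAGS = {"Singular": "sg", "Plural": "pl"}
CASE_TAGS = {"indefinite": "ind", "definite": "def", "vocative": "voc"}

# The inflection table from the example, keyed by row label, then column label.
VODA = {
    "indefinite": {"Singular": "вода", "Plural": "води"},
    "definite": {"Singular": "водата", "Plural": "водите"},
    "vocative": {"Singular": "водо", "Plural": "води"},
}


def to_speling(lemma, table, pos):
    """Build 'lemma; surface form; tags; part of speech' lines from a table."""
    lines = []
    # Singular forms first, then plural, matching the order in the example.
    for number, number_tag in NUMBER_TAGS.items():
        for row, row_tag in CASE_TAGS.items():
            form = table[row][number]
            lines.append(f"{lemma}; {form}; {number_tag}.{row_tag}; {pos}")
    return lines


speling_lines = to_speling("вода", VODA, "n.f")
print("\n".join(speling_lines))
```

The part of speech (`n.f` here) is passed in separately because it comes from elsewhere on the Wiktionary page, not from the inflection table itself.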


Note: for most parts of speech, the fourth column holds only the part of speech, and all other sub-tags go in the third column. For example, adjectives look like:

vacker; vackert; abs.ind.sg.nt; adj
vacker; vackra; abs.pl; adj
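
To make the column layout concrete, here is a small sketch that splits such lines back into their four fields. The `SpelingEntry` type and the `parse_speling_line` function are hypothetical names. The separators (a semicolon between columns, dots between tags) are inferred from the examples on this page, not from a formal specification of the format.

```python
from typing import List, NamedTuple


class SpelingEntry(NamedTuple):
    lemma: str
    surface: str
    tags: List[str]  # third column: inflectional sub-tags
    pos: List[str]   # fourth column: part of speech (plus e.g. gender for nouns)


def parse_speling_line(line):
    """Split one 'lemma; surface; tags; pos' line into a SpelingEntry."""
    fields = [field.strip() for field in line.split(";")]
    if len(fields) != 4:
        raise ValueError(f"expected 4 columns, got {len(fields)}: {line!r}")
    lemma, surface, tags, pos = fields
    return SpelingEntry(lemma, surface, tags.split("."), pos.split("."))


adjective = parse_speling_line("vacker; vackert; abs.ind.sg.nt; adj")
noun = parse_speling_line("вода; водата; sg.def; n.f")
print(adjective.tags, adjective.pos)
print(noun.tags, noun.pos)
```

A check like this is a cheap way to catch malformed lines in scraper output before handing it to other tools.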


Resources