Difference between revisions of "User:David Nemeskey/GSOC progress 2013"
Revision as of 10:42, 18 June 2013
Tasks
XML format
Compiler
Code
- Read the fomacg code[1] and understand what it does.
- Add comments and do a little refactoring (e.g. a separate function for creating @*word@* FSTs).
- Check and test whether it works on all grammars; fix it if it doesn't.
Research
- Decrease the number of rules applied (see in the proposal).
- When and how can rules be merged?
Miscellaneous / Extra
Hungarian CG grammar
Write a simple CG grammar for Hungarian, somewhere around 50-150 rules.
- Read Pasi Tapanainen's The Constraint Grammar Parser CG-2.
- Read the contents of cg_material.zip.
- Install Apertium, VISL CG3 and a language pair (cy-en)
- Study the CG grammar of an Apertium language.
- Write a Hungarian grammar that covers the sentences in this sample file
- TODOs:
  - add a sentence to the rasskaz file for the "az a" construct.
  - prevb disambiguation
  - interjections (after e.g. "ez")
  - verb-noun agreement (person, number, definite -- vt/vi)
  - should a predeterminer be followed by a determiner and not "nem"?
Hunmorph converter
Write a converter from ocamorph's output to Apertium's format.
- Again, use the sentences in this sample file as reference.
- While a C-based converter would definitely be possible, I opted for a foma-based (xfst -- lexc?) implementation, so that this task also serves as practice.
- TODOs:
  - some analyses are repeated in the output. The problem is in ocamorph rather than in the conversion script, so this is a workaround and not a real fix: I wrote a Python script that discards repeated analyses.
  - hunmorph_to_apertium.foma does not handle compounds (+)
  - two-phase conversion (handle special words first, then the general part)
  - "mikor<CONJ>" => <cnjadv>, "mikor<ADV>" => <itg>
  - "hova" => <itg>
  - "így<CONJ>" => <cnjadv>
  - "ekkor<ADV>" => <cnjadv>?
  - "aztán<CONJ>" => <cnjadv>
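The two-phase idea and the removal of repeated analyses can be sketched in Python roughly as follows. This is only an illustration, not the foma converter itself: it assumes each analysis is a string of the form `lemma<TAG>`, the function names `convert` and `convert_all` are hypothetical, and the `GENERAL` table is a placeholder rather than the real tag mapping. The `SPECIAL` entries are the confirmed cases from the list above; "ekkor" is left out because it is still marked with a question mark.

```python
import re

# Phase 1: word-specific overrides, taken from the TODO list above.
# Key: (lemma, ocamorph tag); a tag of None matches any tag.
SPECIAL = {
    ("mikor", "CONJ"): "cnjadv",
    ("mikor", "ADV"): "itg",
    ("hova", None): "itg",
    ("így", "CONJ"): "cnjadv",
    ("aztán", "CONJ"): "cnjadv",
}

# Phase 2: general tag mapping. Placeholder values chosen for illustration,
# NOT the actual table used by hunmorph_to_apertium.foma.
GENERAL = {"CONJ": "cnjcoo", "ADV": "adv", "NOUN": "n"}

# Assumed input shape: exactly one lemma followed by one <TAG>.
ANALYSIS = re.compile(r"^(?P<lemma>[^<>]+)<(?P<tag>[^<>]+)>$")


def convert(analysis):
    """Convert one 'lemma<TAG>' analysis: special words first, then the general part."""
    match = ANALYSIS.match(analysis)
    if match is None:
        raise ValueError(f"unexpected analysis format: {analysis!r}")
    lemma, tag = match.group("lemma"), match.group("tag")
    new_tag = SPECIAL.get((lemma, tag)) or SPECIAL.get((lemma, None))
    if new_tag is None:
        # Unknown tags are simply lowercased in this sketch.
        new_tag = GENERAL.get(tag, tag.lower())
    return f"{lemma}<{new_tag}>"


def convert_all(analyses):
    """Drop repeated analyses (keeping first-seen order), then convert each one."""
    unique = list(dict.fromkeys(analyses))
    return [convert(a) for a in unique]


if __name__ == "__main__":
    print(convert_all(["mikor<CONJ>", "mikor<CONJ>", "mikor<ADV>", "hova<ADV>"]))
```

Keeping the special words in a separate table means the general mapping never has to know about individual lemmas, which is the point of handling them in a first phase.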
ATT -> lttoolbox compiler
Write an ATT FST format reader for lttoolbox. This is useful practice for moving from foma to lttoolbox. Since lttoolbox lacks some of the functionality needed, the compiler will most likely stay in foma, but lttoolbox might work as the runtime component.
- ATT format
- "<spectie> the ATT->lttoolbox thing should be a simple as : beer = t.insertSingleTransduction(alphabet(L'e',L'e'), beer);"
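To make the reading step concrete, here is a minimal Python sketch of parsing the ATT text format into a list of transitions and a set of final states. It does not use lttoolbox at all; the function name `read_att` is hypothetical, weights are ignored, and treating `@0@` as the epsilon symbol is an assumption about the tool that wrote the file. In an actual lttoolbox reader, each parsed transition would presumably be inserted one at a time, along the lines of the `insertSingleTransduction` call quoted above.

```python
def read_att(text, epsilon="@0@"):
    """Parse ATT-format text into (transitions, final_states).

    Transition lines have 4 tab-separated fields (source, target, input,
    output) plus an optional weight; final-state lines have the state
    number plus an optional weight. Weights are ignored in this sketch.
    """
    transitions = []
    finals = set()
    for raw in text.splitlines():
        line = raw.rstrip("\r\n")
        if not line:
            continue  # skip blank lines
        fields = line.split("\t")
        if len(fields) in (1, 2):
            # Final state, optionally followed by a weight.
            finals.add(int(fields[0]))
        elif len(fields) in (4, 5):
            source, target = int(fields[0]), int(fields[1])
            # Represent epsilon as the empty string.
            sym_in = "" if fields[2] == epsilon else fields[2]
            sym_out = "" if fields[3] == epsilon else fields[3]
            transitions.append((source, target, sym_in, sym_out))
        else:
            raise ValueError(f"malformed ATT line: {line!r}")
    return transitions, finals


if __name__ == "__main__":
    sample = "0\t1\tb\tb\n1\t2\te\t@0@\n2\n"
    print(read_att(sample))
```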
References
- Hulden, Mans. 2011. Constraint Grammar parsing with left and right sequential finite transducers. In: Proceedings of the 9th International Workshop on Finite State Methods and Natural Language Processing, pages 39-47.
- András Kornai, Péter Rebrus, Péter Vajda, Péter Halácsy, András Rung, Viktor Trón. 2004. Általános célú morfológiai elemző kimeneti formalizmusa (The output formalism of a general-purpose morphological analyzer). In: Proceedings of the 2nd Hungarian Computational Linguistics Conference.