User:David Nemeskey/GSOC progress 2013

Tasks

XML format

See User:David_Nemeskey/CG_XML_brainstorming.

Rule applier

Rough execution flow:

Load the rules
1. DELIMITERS must also be converted to an FST and saved.
2. Rule organization:
  - separate files in a directory?
  - one file per section?
  - one big file (needs another "index" file then)
Read the input stream cohort by cohort
- StreamReader class in fomacg_stream_reader.h
- Read until a delimiter is found (see above)
Convert the cohorts from the Apertium stream format to fomacg's format
- Converter in fomacg_converter.h
- apertium_to_fomacg.foma
- conversion from wchar_t to utf-8 char
Apply rules in the order of the sections they are defined in
Convert the cohorts back to the Apertium stream format
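
The reading step above can be sketched roughly as follows. This is a minimal Python illustration, not the actual StreamReader from fomacg_stream_reader.h: the names read_cohorts and read_windows are hypothetical, the hard-coded delimiter set merely stands in for the DELIMITERS FST, and backslash escapes in the Apertium stream format are ignored for brevity.

```python
import re

# Assumption: each cohort appears as ^surface/reading1/reading2$ in the stream.
# Escaped characters (e.g. \^ or \/) are NOT handled in this sketch.
COHORT_RE = re.compile(r"\^(.*?)\$")

# Hypothetical stand-in for the DELIMITERS set of the grammar.
DELIMITERS = {".", "!", "?"}


def read_cohorts(stream_text):
    """Yield (surface form, list of readings) for every cohort in the text."""
    for match in COHORT_RE.finditer(stream_text):
        surface, *readings = match.group(1).split("/")
        yield surface, readings


def read_windows(stream_text, delimiters=DELIMITERS):
    """Group cohorts into windows, closing a window at each delimiter."""
    window = []
    for surface, readings in read_cohorts(stream_text):
        window.append((surface, readings))
        if surface in delimiters:
            yield window
            window = []
    if window:  # trailing cohorts without a closing delimiter
        yield window
```

Each window yielded by read_windows would then be converted to fomacg's format and passed to the rules, section by section.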

Compiler

Code

Read the fomacg code^[1] and understand what it does.
Add comments and do a little refactoring (e.g. a separate function for creating @*word@* FSTs).
Check and test whether it works on all grammars; fix it if it doesn't.

Development

DELIMITERS must also be converted to an FST and saved.
- I have run into a foma bug, which also corrupted the output of all rules that contained lemma tags. I have patched the code on my side and opened a bug report (two, actually).

TODO?'s

cg-comp is not as strict as fomacg about the position part of contextual tests, e.g. it accepts 1* as well as *1. Loosen up fomacg?
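
One way to loosen the check would be to normalize positions before they are parsed. The sketch below is only an illustration of that idea in Python, not code from fomacg or cg-comp: normalize_position is a hypothetical helper, and the accepted shapes (up to two asterisks on either side of a signed offset, with an optional trailing C) are an assumption about which variants are worth supporting.

```python
import re

# Assumed shapes: optional scan marker(s) before OR after a signed offset,
# optionally followed by the "careful" flag C.
POSITION_RE = re.compile(r"(\*{0,2})(-?\d+)(\*{0,2})(C?)")


def normalize_position(pos):
    """Rewrite a position such as '1*' into the strict form '*1'."""
    match = POSITION_RE.fullmatch(pos)
    if match is None:
        raise ValueError("not a recognized position: %r" % pos)
    lead, offset, trail, careful = match.groups()
    if lead and trail:
        raise ValueError("asterisks on both sides: %r" % pos)
    return (lead or trail) + offset + careful
```

With this in place, both *1 and 1* would reach the strict parser as *1, so the rest of the compiler would not need to change.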

Research

Decrease the number of rules applied (see in the proposal).
When and how can rules be merged?

Miscellaneous / Extra

Hungarian CG grammar

Write a simple CG grammar for Hungarian, somewhere around 50-150 rules.

Read Pasi Tapanainen's The Constraint Grammar Parser CG-2.
Read the contents of cg_material.zip.
Install Apertium, VISL CG3 and a language pair (cy-en)
Study the CG grammar of an Apertium language.
Write a Hungarian grammar that covers the sentences in this sample file
- The tags will be based on those in KR-code^[2]. See the next task.
TODOs:
- add a sentence to the rasskaz file for the "az a" construct.
- prevb disambiguation
The file is here.

Hunmorph converter

Write a converter from ocamorph's output to Apertium's format.

Again, use the sentences in this sample file as reference.
While a C-based converter would definitely be possible, I opted for a foma-based (xfst -- lexc?) implementation, so that this task also serves for practice.
TODOs:
- some analyses are repeated in the output: fix them! -- Not a real fix, because the problem was in ocamorph rather than in the script, but I wrote a Python script that discards repeated analyses.
- hunmorph_to_apertium.foma does not handle compounds (+)
- two-phase conversion (handle special words first, then the general part)
- a few non-general conversions:
  - "mikor<CONJ>" => <cnjadv>, "mikor<ADV>" => <itg>
  - "hova" => <itg>
  - "így<CONJ>" => <cnjadv>
  - "ekkor<ADV>" => <cnjadv>?
  - "aztán<CONJ>" => <cnjadv>
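
The deduplication mentioned in the TODO list could look something like the sketch below. It is not the script itself, only a minimal illustration under the assumption that the analyses of one word are available as a list of strings; the function name and the sample analyses are made up for the example.

```python
def discard_repeated(analyses):
    """Drop repeated analyses while keeping the first occurrence and the order."""
    # dict preserves insertion order, so this keeps the original ordering.
    return list(dict.fromkeys(analyses))


# Hypothetical analyses for a single word; the tag strings are illustrative.
sample = ["mikor<CONJ>", "mikor<ADV>", "mikor<CONJ>"]
unique = discard_repeated(sample)
```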
Apparently a pure lexc/foma based implementation wasn't possible (at least with the one line -- one word output format I decided to request from hunmorph). The reason is that the tagsets of hunmorph and Apertium do not match exactly, so I needed to add exception rules for certain words and lexical categories. However, flookup returns all possible analyses, so in such cases it returned both the exceptional and the regular translation. The current solution consists of four files:
- kr_to_apertium_spec.lexc contains the words / lexical categories that need special treatment (vbser, prns, "ez"/"az", etc.)
- kr_to_apertium.lexc contains the rest of the categories, i.e. those whose translation was straightforward
- kr_to_apertium.foma is a simple foma script that writes the previous two into the binary file kr_to_apertium.fst
- hunmorph_to_apertium.cpp loads the two FSTs from the binary file and applies them to the readings. First it tries to parse the reading with the _spec FST; if that fails, it falls back to the general one. This mechanism ensures that every reading gets exactly one translation, and that it is the correct one.
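
The fallback logic can be illustrated with a small Python sketch. Plain dictionaries stand in for the two FSTs here, so this is not how hunmorph_to_apertium.cpp is written: all function names are hypothetical, and the tag pairs in GENERAL_TAGS are invented for the example (only the two "mikor" entries come from the list of non-general conversions above).

```python
import re

# Stand-in for the _spec FST: word-specific conversions.
SPEC = {
    "mikor<CONJ>": "mikor<cnjadv>",
    "mikor<ADV>": "mikor<itg>",
}

# Stand-in for the general FST: a tag-by-tag mapping (illustrative values only).
GENERAL_TAGS = {"CONJ": "cnjcoo", "ADV": "adv", "NOUN": "n"}


def apply_spec(reading):
    """Return the special-case translation, or None if there is none."""
    return SPEC.get(reading)


def apply_general(reading):
    """Translate every tag with the general mapping, or return None on failure."""
    lemma = reading.partition("<")[0]
    tags = re.findall(r"<([^<>]+)>", reading)
    try:
        return lemma + "".join("<%s>" % GENERAL_TAGS[tag] for tag in tags)
    except KeyError:
        return None


def convert(reading):
    """Try the special cases first and fall back to the general mapping."""
    result = apply_spec(reading)
    if result is None:
        result = apply_general(reading)
    return result
```

Applying the two mappings in sequence, rather than as one combined transducer, is what prevents a reading such as mikor<CONJ> from receiving both the exceptional and the regular translation.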

ATT -> lttoolbox compiler

Write an ATT FST format reader for lttoolbox. This is useful practice for moving from foma to lttoolbox. Since lttoolbox lacks some of the functionality needed, the compiler will most likely stay in foma, but lttoolbox might work as the runtime component.

ATT format
"<spectie> the ATT->lttoolbox thing should be a simple as : beer = t.insertSingleTransduction(alphabet(L'e',L'e'), beer);"
- The actual implementation turned out to be a little more difficult. Anyway, the code is here.
- The transducer we used for testing, kaz.att, didn't work with lt-proc, so Fran told me to create two transducers instead of one: one for words (main) and one for punctuation (final).
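
For orientation, the reading side of such a converter can be sketched in a few lines of Python. This is not the lttoolbox code referred to above: parse_att is a hypothetical function, and it assumes the common tab-separated ATT layout (transition lines with source, target, input and output symbols plus an optional weight; final-state lines with a state and an optional weight) with @0@ as the epsilon symbol.

```python
EPSILON = "@0@"  # assumption: epsilon as written by foma-style ATT output


def parse_att(text):
    """Parse ATT text into a list of transitions and a set of final states."""
    transitions = []
    finals = set()
    for line in text.splitlines():
        if not line.strip():
            continue  # skip empty lines
        fields = line.split("\t")
        if len(fields) in (1, 2):
            # final state, optionally followed by a weight (ignored here)
            finals.add(int(fields[0]))
        elif len(fields) in (4, 5):
            source, target, in_sym, out_sym = fields[:4]
            transitions.append((
                int(source),
                int(target),
                "" if in_sym == EPSILON else in_sym,
                "" if out_sym == EPSILON else out_sym,
            ))
        else:
            raise ValueError("unexpected ATT line: %r" % line)
    return transitions, finals
```

Each parsed transition would then correspond to one insertion into the lttoolbox transducer, along the lines of the call quoted above.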

References

[1] Hulden, Mans. 2011. Constraint Grammar parsing with left and right sequential finite transducers. In: Proceedings of the 9th International Workshop on Finite State Methods and Natural Language Processing, pages 39--47.

[2] András Kornai, Péter Rebrus, Péter Vajda, Péter Halácsy, András Rung, Viktor Trón. 2004. Általános célú morfológiai elemző kimeneti formalizmusa (The output formalism of a general-purpose morphological analyzer). In: Proceedings of the 2nd Hungarian Computational Linguistics Conference.