User:David Nemeskey/GSOC progress 2013

Revision as of 16:17, 6 September 2013

Tasks

XML format

See User:David_Nemeskey/CG_XML_brainstorming.

Rule applier

Rough execution flow:

  1. Load the rules
    1. DELIMITERS must also be converted to an FST and saved.
    2. Rule organization:
      • separate files in a directory?
      • one file per section?
      • one big file (needs another "index" file then)
  2. Read the input stream cohort by cohort
    • StreamReader class in fomacg_stream_reader.h
    • Read until a delimiter is found (see above)
  3. Convert the cohorts from the Apertium stream format to fomacg's format
    • Converter in fomacg_converter.h
    • apertium_to_fomacg.foma
    • conversion from wchar_t to utf-8 char
  4. Apply rules in the order of the sections they are defined in
  5. Convert the cohorts back to the Apertium stream format

Compiler

Code

  • Read the fomacg code[1] and understand what it does.
  • Add comments and do a little refactoring (e.g. a separate function for creating @*word@* FSTs).
  • Check and test if it works on all grammars; fix it if it doesn't.

Development

  1. DELIMITERS must also be converted to an FST and saved.
    • I have run into a foma bug, which also corrupted the output of all rules that contained lemma tags. I have patched the code on my side and opened a bug report (two, actually).
  2. fomacg treats all tags as symbols. Usually this is OK, but for lemmas, it may not be optimal
    1. it blows up the number of symbols (if not the transducer) => turns out the transducer is smaller this way;
    2. might be slower to parse? Or faster => turns out faster: if we use exploded strings instead, fomacg_proc becomes about 33% slower.
  3. If one tag is a postfix of another, and the longer tag is not mentioned in the rule, fomacg might trigger the wrong rule, believing that it sees the shorter tag. E.g. if there are a "predet " and a "det " tag, and the rule is SELECT (det), it will select predets as well. This is because in fomacg, the symbol that corresponds to a tag is its name plus a space; in other words, only the right boundary of a symbol is checked.
    1. Change the symbols from "tag " to "<tag> " to ensure the "postfix property", except for word forms and lemmas, which are already enclosed in quotes.
    2. Keep as much of Apertium's format as possible?
  4. Insert ">>>" and "<<<" before and after the sentence, respectively.
  5. Make the rules fail-fast?

Speed problems

This first version of fomacg_proc, where rules are separate FSTs applied in order, has some speed issues. With the 40-rule CG-2 subset of my Hungarian grammar, it disambiguates a 13760-token test corpus (the rasskaz sentences 32x concatenated) in 6.4 seconds. Under the same conditions, cg-proc needs 0.3 seconds.

Summary of the speed measurements (details below)
  1. apply_down
    1. original rules, strcmp-based application check: 6.4 seconds
    2. deleteX rules, strlen-based application check: 5.9 seconds
    3. deleteX rules, FSA (apply_down)-based application check: 9 - 11 seconds
    4. deleteX rules, FSA (apply_detmin_fsa)-based check: 1.47 seconds
    5. deleteX rules, FSA (apply_detmin_fsa)-based check, smallest-first condition trees: 1.35 seconds, running time: 0.95 seconds.
    6. deleteX rules, FSA (apply_detmin_fsa)-based check, smallest-first condition trees, apply_detmin_fst conversion: 1.3 seconds, running time: 0.87 seconds.
    7. deleteX rules, FSA (custom_detmin_fsa)-based check, smallest-first condition trees, apply_detmin_fst conversion: 1.1 seconds, running time: 0.75 seconds.
    8. deleteX rules, FSA (custom_detmin_fsa)-based check, smallest-first condition trees, apply_detmin_fst conversion both ways: 1 second, running time: 0.655 seconds.
  2. fsm_compose
    1. simple rules, fsm_intersect-based application check: 45 seconds
    2. deleteX rules, fsa->statecount-based application check: 28.3 seconds
Table 0. Running times for different languages.
  Method (ID in list above)   hun init   hun run   br init   br run
  1                           0s         6.4s      N/A       N/A
  2                           0s         5.9s      N/A       N/A
  3                           0s         9-11s     N/A       N/A
  4                           0s         1.47s     N/A       N/A
  5                           0.37s      0.95s     N/A       N/A
  6                           0.37s      0.87s     7.9s      4.9s
  7                           0.37s      0.75s     7.9s      3.54s
  8                           0.37s      0.655s    7.9s      3.13s
Simple ideas to speed it up
  1. do not use symbols for tags, etc (see above): execution time grows by 33%
  2. compose nets on the same priority level: the FSTs become too big too fast -- on some simpler rules:
    • 1 rule: 3.5 kB. 44 states, 176 arcs
    • 2 rules: 446.0 kB. 1965 states, 28482 arcs
    • 3 rules: 4.9 MB. 17303 states, 320664 arcs
    • 4 rules: 61.5 MB. 163945 states, 4029156 arcs
  3. union nets on the same priority level: the nets don't grow as fast as with compose, but are still not usable:
    • 1 rule: 3.5 kB. 44 states, 176 arcs
    • 2 rules: 47.3 kB. 202 states, 2957 arcs
    • 3 rules: 238.8 kB. 856 states, 15208 arcs
    • 4 rules: 1.3 MB. 3988 states, 82261 arcs
    • 5 rules: 7.2 MB. 20296 states, 471188 arcs
    • 6 rules: 53.5 MB. 131639 states, 3504119 arcs
Clever ideas
  1. Use fail-fast FSTs. The idea was that these could speed up the computation because inapplicable rules don't have to copy the whole string over; they can return on the first error. I have tried to augment the ConstrainR and ConstrainS pattern in two ways:
    1. compose $"$A$ " at the end: worked, but the rule FSTs became larger and slower;
    2. instead of deleting $1$ and #1# from the string, I composed ["$1$ " => _ MarkedReading & NotMarkedReading] & $"$1$ " at the beginning. This didn't work, the rules only recognized cohorts, not sentences. And if I change the rule to ?* rule ?*, it becomes non-deterministic.
    3. Maybe this idea wasn't that clever after all -- the rule might have to read the whole sentence anyway...
  2. Do not pass the sentence to the rule, just the current word (+ the left & right contexts).
    1. Unfortunately, this does not remove the need to read the sentence over and over again, because of right contexts: a rule can be enabled by another one that changes something in a later word.
  3. Instead of applying the rules to the sentence string, convert the string to a (linear) FSA and compose the rules with it.
    1. Unfortunately, this is even slower: on one of my machines, it takes 47 seconds instead of 6.5! Why? It could be because:
      1. All fsm_ functions are destructive. However, about half of the time (most, if the TODO above is implemented), I needed the old transducer so I had to fsm_copy it.
      2. Why indeed? According to callgrind (note: the numbers are not entirely accurate, because there was a bug in the program when I measured them. fomacg_to_fsa could be around 13%):
        1. 28.52% fomacg_to_fsa == fsm_concat
        2. 23.01% fsm_intersect
        3. 43.19% fsm_compose
      3. If we go deeper:
        1. 24.01% fsm_merge_sigma (called by all three)
        2. 24.01% fsm_minimize (called 82.87% of the time by fsm_concat, and 16.05% by fsm_intersect)
      4. What can we take home from this?
        1. About 50% of the time is spent in fsm_compose, which we cannot avoid. It might be possible, however, to write a custom fsm_compose that exploits the fact that one of its inputs is not a general FST, but a linear FSA.
        2. fsm_intersect takes about 50% less than fsm_compose (both are called once per rule), so trying the intersection-based method looks promising (though the sentence won't be linear in that case)
        3. The overhead of fsm_minimize could be avoided entirely by using a custom fsm_concat to construct the sentence. It is a linear automaton, so no minimisation is needed.
    2. Sometimes apply down'ing a rule works, but composing it to the sentence FSA doesn't. This was a bug in fomacg.
    3. Currently we are at 28.3 seconds
      1. fsm_concat_custom() does not call fsm_minimize().
      2. We use rules that delete the #X# readings.
  4. (Mans's suggestion): Create an automaton for each rule that doesn't change anything, but only accepts the sentence the associated rule would change. "This automaton can be determinized and minimized and applied *much* faster than the corresponding transducer."
    1. If the automaton is just used to check the applicability of a single rule, execution time goes up to 9 seconds. :(
      1. According to callgrind, apply_follow_next_arc is called 1,358,508 times when we used only FSTs, and 2,589,103 times with the new method.
      2. In the original test set (1/32nd of the performance test corpus), there were 3541 attempted rule applications, 79 of which were successful. That means 3541 FST applications in the first case, and 3541 FSA + 79 FST applications in the second.
      3. This is especially interesting if we know that the total number of states in the rule FSTs is 3,467, while in the condition FSAs it is only 1,597. The total number of arcs in the FSAs also seems to be about half of that in the FSTs. So why is the FSA-based check slower?
    2. If, however, we use the new apply_detmin_fsa function to execute a deterministic minimal FSA (even with linear transition search), without all the cruft in apply_net, we are down to 1.45 seconds.
      1. We have to find a good place for apply_detmin_fsa.
    3. It could be possible to check the applicability of a number of rules with a single automaton. Researching this right now...
      1. The best speed could probably be achieved by building FSAs from groups of conditions whose sizes are powers of two. In this way, we could find the rule to apply in log(G) checks, thereby breaking the linearity barrier.
      2. However, I have found that the limit where the FSAs are still of manageable size is about 1000 states. This (depending on the number of states of the original FSA) is usually reached after unioning 3-6 FSAs. This means that while log(G) is out of reach, we can still reduce our running time to about 1/3-1/6th of what it is now.
      3. Tests of different ways of building condition trees
        1. Smallest-first: always merge the two smallest trees.
  5. TODO Index rules by their required tags (word-form, target, C conditions, etc.), and skip those that cannot possibly be activated by the sentence.
Table 1. The effect of different methods of grouping conditions on running time
  Condition        Total running time   FST loading   FST application   Conditions checked   Tree size   File size
  Flat (no tree)   1.476s               0.041s        1.435s            107,488              N/A         76 kB
  Smallest first   1.355s               0.355s        0.9s              42,336               2-4 rules   4.4 MB
  Fixed height     2.070s               1.167s        0.84s             36,832               4 rules     12 MB
Table 2. Running the algorithm on different languages
  Language        Rules   FSTs   FST file size   Foma memory (loading)   Foma memory   VISL-GC binary   Init    Running time   VISL-GC running time
  Hungarian       35      95     3.6MB           ?                       2.7MB         8kB              0.41s   0.94s*         0.3s
  Breton-French   226     602    39MB            1GB                     666MB         36kB             7.9s    4.9s           0.8s
  sme-fin         1172    3149   514MB           3550MB**                2476MB        184kB
  * Smallest first
  ** Hit physical memory limit
Hacky hacks
  1. An apply_detmin_fst() instead of apply_down() for the apertium <-> fomacg converters
  2. Sigma optimisation
    1. apply_create_sigmatch() is responsible for about 54% of apply_detmin_fsa()'s time. Basically we read the input string each time we run a condition FSA.
    2. What we can do:
      1. Roll our own custom_create_sigmatch() that splits the input at spaces and treats each word/tag/etc. as a single symbol, not as a character sequence.
        1. Currently, only words/tags/etc. that occur in the rules are handled this way. Extending this idea to unknown tags (those whose characters are classified as IDENTITY_SYMBOL) results in fewer transitions in the FSA, and only one (failed) look-up in the sigma trie for each unknown tag, as opposed to one for each of its letters.
        2. Table 3 below summarizes the results: we have achieved a 16-36% speed-up and decreased the running time on the Hungarian corpus to 0.75s.
      2. Sigma unification:
        1. Create an FSA "$0$ " "#BOC# " "| " "#0# " "#EOC# ". Its sigma will be the "canonical" one.
        2. Move around sigmas in all condition FSAs (and rule FSTs?) so that the symbols above have the same codes.
        3. At runtime, create a dictionary from all the symbols not in the canonical set.
        4. Translate the input sentence to sigmas only once; before running a rule FSA, look up all IDENTITY_SYMBOLs in the dictionary and map them to their code in the FSA's sigma.
  3. apply_down() still accounts for a rather large percentage of the running time. This is because it is called not only by apply_rules(), but by the code that converts the apertium stream format to fomacg's format and vice versa.
    1. The latter FST is definitely deterministic. Using the new apply_detmin_fst() function instead of apply_down() decreased the running time of the Hungarian corpus from 0.94s to 0.87s.
    2. Isn't the rule FST deterministic as well?! ... 'course not.
Table 3. Improvements due to custom_detmin_fsa()
The numbers were achieved by using the smallest-first tree building method.
  Language        Avg. running time w/ apply_detmin_fsa   Avg. running time w/ custom_detmin_fsa   Speed-up
  Hungarian       0.868s                                  0.745s                                   16.5%
  Breton-French   4.82s                                   3.54s                                    36%

TODO?'s

  1. cg-comp is not as strict as fomacg about the position part in contextual tests, e.g. it accepts 1*, not just *1. Loosen up fomacg?

Research

  • Decrease the number of rules applied (see in the proposal).
  • When and how can rules be merged?

Miscellaneous / Extra

Hungarian CG grammar

Write a simple CG grammar for Hungarian, somewhere around 50-150 rules.

  • Read Pasi Tapanainen's The Constraint Grammar Parser CG-2.
  • Read the contents of cg_material.zip.
  • Install Apertium, VISL CG3 and a language pair (cy-en)
  • Study the CG grammar of an Apertium language.
  • Write a Hungarian grammar that covers the sentences in this sample file
    • The tags will be based on those in KR-code[2]. See the next task.
  • TODOs:
    • add a sentence to the rasskaz file for the "az a" construct.
    • prevb disambiguation
  • The file is here.

Hunmorph converter

Write a converter from ocamorph's output to Apertium's format.

  • Again, use the sentences in this sample file as reference.
  • While a C-based converter would definitely be possible, I opted for a foma-based (xfst -- lexc?) implementation, so that this task also serves for practice.
  • TODOs:
    • some analyses are repeated in the output: fix them! -- Not a real fix: the problem wasn't in the script but in ocamorph, so I wrote a Python script that discards repeated analyses.
    • hunmorph_to_apertium.foma does not handle compounds (+)
    • two-phase conversion (handle special words first, then the general part)
    • a few non-general conversions:
      • "mikor<CONJ>" => <cnjadv>, "mikor<ADV>" => <itg>
      • "hova" => <itg>
      • "így<CONJ>" => <cnjadv>
      • "ekkor<ADV>" => <cnjadv>?
      • "aztán<CONJ>" => <cnjadv>
  • Apparently a pure lexc/foma based implementation wasn't possible (at least with the one line -- one word output format I decided to ask from hunmorph). The reason is that the tagsets of hunmorph and Apertium do not match exactly, and therefore I needed to add exceptional rules for certain words and lexical categories. However, flookup returns all possible analyses, so in this case it returned both the exceptional and the regular translation. The current solution consists of four files:
    • kr_to_apertium_spec.lexc contains the words / lexical categories that need special treatment (vbser, prns, "ez"/"az", etc.)
    • kr_to_apertium.lexc contains the rest of the categories, i.e. those whose translation was straightforward
    • kr_to_apertium.foma is a simple foma script that writes the previous two into the binary file kr_to_apertium.fst
    • hunmorph_to_apertium.cpp loads the two FSTs from the binary file and applies them to the readings. First it tries to parse the reading with the _spec FST; if that fails, it falls back to the general one. This mechanism ensures that each reading gets exactly one translation, and the correct one.

ATT -> lttoolbox compiler

Write an ATT FST format reader for lttoolbox. A useful practice for moving from foma to lttoolbox. Since lttoolbox lacks some of the functionality needed, the compiler will most likely stay in foma, but lttoolbox might work as the runtime component.

  • ATT format
  • "<spectie> the ATT->lttoolbox thing should be a simple as : beer = t.insertSingleTransduction(alphabet(L'e',L'e'), beer);"
    • The actual implementation turned out to be a little more difficult. Anyway, the code is here.
    • The transducer we used for testing, kaz.att, didn't work with lt-proc, so Fran told me to create two transducers instead of one: one for words (main) and one for punctuation (final).

References

  1. Hulden, Mans. 2011. Constraint Grammar parsing with left and right sequential finite transducers. In: Proceedings of the 9th International Workshop on Finite State Methods and Natural Language Processing, pages 39--47.
  2. András Kornai, Péter Rebrus, Péter Vajda, Péter Halácsy, András Rung, Viktor Trón. 2004. Általános célú morfológiai elemző kimeneti formalizmusa (The output formalism of a general-purpose morphological analyzer). In: Proceedings of the 2nd Hungarian Computational Linguistics Conference.