Difference between revisions of "User:David Nemeskey/GSOC progress 2013"

From Apertium
Jump to navigation Jump to search
 
(104 intermediate revisions by the same user not shown)
Line 2: Line 2:


=== XML format ===
=== XML format ===

See [[User:David_Nemeskey/CG_XML_brainstorming]].

=== Rule applier ===

Rough execution flow:

# Load the rules
## DELIMITERS must also be converted to an FST and saved.
## Rule organization:
##* separate files in a directory?
##* one file per section?
##* one big file (needs another "index" file then)
# Read the input stream cohort by cohort
#* ''<code>StreamReader</code> class in ''fomacg_stream_reader.h
#* Read until a delimiter is found (see above)
# Convert the cohorts from the Apertium stream format to fomacg's format
#* ''<code>Converter</code> in ''fomacg_converter.h
#* ''apertium_to_fomacg.foma''
#* conversion from <code>wchar_t</code> to ''utf-8'' <code>char</code>
# Apply rules in the order of the sections they are defined in
# Convert the cohorts back to the Apertium stream format


=== Compiler ===
=== Compiler ===

==== Code ====

* Read the [https://apertium.svn.sourceforge.net/svnroot/apertium/branches/fomacg/ fomacg] code<ref>Hulden, Mans. 2011. Constraint Grammar parsing with left and right sequential finite transducers. In: Proceedings of the 9th International Workshop on Finite State Methods and Natural Language Processing, pages 39--47.</ref> and understand what it does.
* ''Add comments do a little refactoring (e.g. separate function for creating ''@*word@*'' FSTs).''
* Check and test if it works on all grammars, fix it if it isn't.

==== Development ====

# ''DELIMITERS must also be converted to an FST and saved.''
#* ''I have run into a foma bug, which also corrupted the output of all rules that contained lemma tags. I have patched the code on my side and opened a [http://code.google.com/p/foma/issues/detail?id=49 bug report] ([http://code.google.com/p/foma/issues/detail?id=50 two, actually]).''
# '' ''fomacg'' treats all tags as symbols. Usually this is OK, but for lemmas, it may not be optimal''
## ''it blows up the number of symbols (if not the transducer) => turns out the transducer is smaller this way;''
## ''might be slower to parse? Or faster => faster, if we use exploded strings, ''fomacg_proc'' becomes about 33% slower.''
# ''If one tag is a postfix of another, and the longer tag is not mentioned in the rule, ''fomacg'' might trigger the wrong rule, believing that it sees the shorter tag. E.g. if there is a <code>"predet "</code> and a <code>"det "</code> tag, and the rule is <code>SELECT (det)</code>, it will also select <code>predet</code>s as well. This is because in ''fomacg'', the symbol that corresponds to a tag is its name plus a space; in other words, only the right boundary of a symbol is checked.''
## ''Change the symbols from <code>"tag "</code> to <code>"<tag> "</code> to ensure the "postfix property", except for word forms and lemmas, which are already enclosed in quotes.''
## Keep as much of Apertium's format as possible?
# ''Insert ">>>" and "<<<" to before and after the sentence, respectively.''
# Make the rules fail-fast?

==== Speed problems ====

This first version of ''fomacg_proc'', where rules are separate FSTs applied in order, has some speed issues. With the 40-rule CG-2 subset of my Hungarian grammar, it disambiguates a 13760-token test corpus (the ''rasskaz'' sentences 32x concatenated) in 6.4 seconds. Under the same conditions, ''cg-proc'' needs 0.3 seconds.

===== Speed measurements (see below for details) summary =====
# <code>apply_down</code>
## original rules, <code>strcmp</code>-based application check: 6.4 seconds
## '''deleteX rules''', '''<code>strlen</code>'''-based application check: 5.9 seconds
## deleteX rules, '''FSA (<code>apply_down</code>)'''-based application check: 9 - 11 seconds
## deleteX rules, FSA ('''<code>apply_detmin_fsa</code>''')-based check: 1.47 seconds
## deleteX rules, FSA (<code>apply_detmin_fsa</code>)-based check, '''smallest-first condition trees''': 1.35 seconds, running time: 0.95 seconds.
## deleteX rules, FSA (<code>apply_detmin_fsa</code>)-based check, smallest-first condition trees, '''<code>apply_detmin_fst</code> conversion''': 1.3 seconds, running time: 0.87 seconds.
## deleteX rules, FSA ('''<code>custom_detmin_fsa</code>''')-based check, smallest-first condition trees, <code>apply_detmin_fst</code> conversion: 1.1 seconds, running time: 0.75 seconds.
## deleteX rules, FSA (<code>custom_detmin_fsa</code>)-based check, smallest-first condition trees, <code>apply_detmin_fst</code> conversion '''both ways''': 1 seconds, running time: 0.655 seconds.
## deleteX rules, FSA ('''string vector only''' <code>custom_detmin_fsa</code>)-based check, smallest-first condition trees, <code>apply_detmin_fst</code> conversion both ways: 1 seconds, running time: 0.648 seconds (but Breton gained a lot from this, see table!).
## deleteX rules, FSA ('''<code>common_detmin_fsa</code> with set''')-based check, smallest-first condition trees, <code>apply_detmin_fst</code> conversion both ways: 1 seconds, running time: '''0.623s''' seconds (but again, Breton gained a lot more from this).
## deleteX rules, FSA (<code>common_detmin_fsa</code> with '''vector''')-based check, smallest-first condition trees, <code>apply_detmin_fst</code> conversion both ways: '''0.93''' seconds, running time: 0.559s seconds (again, Breton... you get the idea).
## deleteX rules, FSA (<code>common_detmin_fsa</code> with '''vector''')-based check, '''<code>common_apply_down</code>-based rule application''', smallest-first condition trees, <code>apply_detmin_fst</code> conversion both ways: 0.72 seconds, running time: 0.346s seconds.
## deleteX rules, FSA (<code>common_detmin_fsa</code> with '''vector''')-based check, <code>common_apply_down</code>-based rule application, '''<code>common_create_sigmatch</code>-based string->int conversion''', smallest-first condition trees, <code>apply_detmin_fst</code> conversion both ways: 0.655 seconds, running time: 0.285s seconds (''note that '''this running time is less than the full execution time of VISL-GC''', so we're on the right track! Unfortunately, we are not there with Breton yet.'')
## Same as above, but after having optimised the input stream reader code and getting rid of deques: '''0.615''' seconds, running time: '''0.245s''' seconds.
# <code>fsm_compose</code>
## simple rules, <code>fsm_intersect</code>-based application check: 45 seconds
## deleteX rules, <code>fsa->statecount</code>-based application check: '''28.3''' seconds

<div align="center">
{| class="wikitable"
|+ align="bottom" | ''Table 0. Running times for different languages.''
! Method (ID in list above)
! hun init
! hun run
! br init
! br run
|-
|1
|0s
|6.4s
|N/A
|N/A
|-
|2
|0s
|5.9s
|N/A
|N/A
|-
|3
|0s
|9-11s
|N/A
|N/A
|-
|4
|0s
|1.47s
|N/A
|N/A
|-
|5
|'''0.37s'''
|0.95s
|N/A
|N/A
|-
|6
|0.37s
|0.87s
|'''7.9s'''
|4.9s
|-
|7
|0.37s
|0.75s
|7.9s
|3.54s
|-
|8
|0.37s
|0.655s
|7.9s
|3.13s
|-
|9
|0.37s
|0.648s
|7.9s
|2.73s
|-
|10
|0.37s
|0.623s
|7.9s
|2.12s
|-
|11
|0.37s
|0.559s
|7.9s
|1.566s
|-
|12
|0.37s
|0.346s
|7.9s
|1.088s
|-
|13
|0.37s
|0.285s
|7.9s
|0.96s
|-
|14
|0.37s
|'''0.245s'''
|7.9s
|'''0.916s'''
|}
</div>

===== Simple ideas to speed it up =====
# <span style="color:#c07070">do not use symbols for tags, etc (see above): </span><span style="color:#900000">''execution time '''grows''' by 33%''</span>
# <span style="color:#c07070"><code>compose net</code>s on the same priority level: ''the FSTs become too big too fast -- on some simpler rules:''
## rule: 3.5 kB. 44 states, 176 arcs
## rules: 446.0 kB. 1965 states, 28482 arcs
## rules: 4.9 MB. 17303 states, 320664 arcs
## rules: 61.5 MB. 163945 states, 4029156 arcs</span>
# <span style="color:#c07070"><code>union net</code>s on the same priority level: ''the nets don't grow as fast as with <code>compose</code>, but still not useable:''
## rule: 3.5 kB. 44 states, 176 arcs
## rules: 47.3 kB. 202 states, 2957 arcs
## rules: 238.8 kB. 856 states, 15208 arcs
## rules: 1.3 MB. 3988 states, 82261 arcs
## rules: 7.2 MB. 20296 states, 471188 arcs
## rules: 53.5 MB. 131639 states, 3504119 arcs</span>

===== Clever ideas =====
# <span style="color:#c07070">''Use fail-fast FSTs. The idea was that these could speed up the computation because unapplicable rules don't have to copy the whole string over, they can return on the first error. I have tried to augment the <code>ConstrainR</code> and <code>ConstrainS</code> pattern in two ways:
## compose <code>$"$A$ "</code> at the end: worked, but the rule FSTs became larger and slower;''
## ''instead of deleting <code>$1$</code> and <code>#1#</code> from the string, I composed <code>["$1$ " => _ MarkedReading & NotMarkedReading] & $"$1$ "</code> at the beginning. This didn't work, the rules only recognized cohorts, not sentences. And if I change the rule to <code>?* rule ?*</code>, it becomes non-deterministic.''
## ''Maybe this idea wasn't that clever after all -- the rule might have to read the whole sentence anyway...''</span>
# Do not pass the sentence to the rule, just the current word (+ the left & right contexts).
## Unfortunately, this does not mitigate the need of reading the sentence repeatedly over-and-over, because of right contexts: a rule can be enabled by another that changes something in a later word.
# <span style="color:#c07070">''Instead of applying the rules to the sentence string, convert the string to an (linear) FSA and compose the rules with it.''</span
## <span style="color:#900000">'''''Unfortunately, this is even slower: on one of my machines, instead of 6.5 seconds, it takes 47!'''</span> <span style="color:#c07070">Why? It could be because:''
### ''All <code>fsm_</code> functions are destructive. However, about half of the time (most, if the TODO above is implemented), I needed the old transducer so I had to <code>fsm_copy</code> it.''
### ''Why indeed? According to callgrind ('''note: the numbers are not entirely accurate, because there was a bug in the program when I measured them. <code>fomacg_to_fsa</code> could be around 13%'''):''
#### ''28.52% <code>fomacg_to_fsa</code> == <code>fsm_concat</code>''
#### ''23.01% <code>fsm_intersect</code>''
#### ''43.19% <code>fsm_compose</code>''
### ''If we go deeper:''
#### ''24.01% <code>fsm_merge_sigma</code> (called by all three)''
#### ''24.01% <code>fsm_minimize</code> (called 82.87% of the time by <code>fsm_concat</code>, and 16.05% by <code>fsm_intersect</code>)''
### ''What can we take home from this?''
#### ''About 50% of the time is spent in <code>fsm_compose</code>, which we cannot avoid. It might be possible, however, to write a custom <fsm_compose> that exploits the fact that one of its input is not a general FST, but a linear FSA.''
#### ''<code>fsm_intersect</code> takes about 50% less than fsm_compose</code> (both are called once per rule), so trying the intersection-based method looks promising (though the sentence won't be linear in that case)''
#### ''The overhead of <code>fsm_minimize</code> could be avoided entirely by using a custom <code>fsm_concat</code> to construct the sentence. It is a linear automaton, so no minimisation is needed.''
## ''Sometimes ''apply down'''ing a rule works, but composing it to the sentence FSA doesn't. This was a bug in fomacg.''
## ''Currently we are at 28.3 seconds''
### ''<code>fsm_concat_custom()</code> does not call <code>fsm_minimize()</code>.''
### ''We use rules that delete the <tt>#X#</tt> readings.''</span>
# <span style="color:#009000">''Mans's original rules replaced the <code>#0# </code> in front of filtered readings with an <code>#X# </code>. However, nothing prevents us from actually ''removing'' those readings... since this way the sentence becomes shorter with each rule applied, subsequent rule applications / checks become faster. I call this new behavior "'''deleteX'''", and it is realised by composing the transducer ''cg_DeleteX.foma'' that deletes the <code>#X# </code>'d readings to all rules.''
## As can be seen in Table 0, this is indeed the case.
## The situation gets a bit more tricky if we replace string-based <code>apply_down()</code> with <code>common_apply_down()</code> (see above).
### ''As we are dealing with numbers now, it turns out that we would now be better off with getting back to replacing <code>#0# </code> with <code>#X# </code>, as that wouldn't require creating and filling a new <code>vector</code> of <code>Symbol</code>s.''
### However, <code>callgrind</code> tells us that we spend 71.23% of our time checking rule conditions, and only about 6.75% applying rules; and the performance of the former is, again, heavily affected by the length of the sentence.
### ''So it may just be that we are still better off with deleteX rules. However, if we manage to decrease the number of condition checks, we might reconsider dropping them.''
## ''As it turns out, the ''cg_DeleteX.foma'' is not enough; we are left with <code>#X# -> 0</code> branches in our FSTs that are never traversed; yet, they take up space.
### ''The culprits were the ''initial filter'' pattern and the <code>cg_match_cohort_careful()</code> function in fomacg. Modifying these got us rid of the unnecessary branches. However, even though it isn't used anywhere now, <code>#X# </code> remained in the sigma of the FSTs; apparently ''foma'' does not remove unused symbols from the alphabet of the transducer created via composition.''
### ''The effect of removing said branches can be seen in Table 5. As I could not recreate the condition trees I am using for the performance tests, I had to create new ones. I sorted the rules in alphabetical order, and created complete binary trees from them (3-level trees for Hungarian, 2-level trees for Breton). Therefore, the numbers in Table 5 cannot be directly compared with those in Table 0; however, the "before" numbers correspond to row 14 of Table 0.''</span>
# <span style="color:#009000">''(Mans's suggestion): Create an automaton for each rule that doesn't change anything, but only accepts the sentence the associated rule would change. "This automaton can be determinized and minimized and applied *much* faster than the corresponding transducer."''</span>
## <span style="color:#c07070">If the automaton is just used to check the applicability of a single rule, </span><span style="color:#900000">'''execution time goes up to 9 seconds'''. :(</span>
### <span style="color:#c07070">''According to ''callgrind'', <code>apply_follow_next_arc</code> is called 1,358,508 times when we used only FTSs, and 2,589,103 times with the new method.''</span>
### <span style="color:#c07070">''In the original test set (1/32nd of the performance test corpus), there were 3541 attempts at rule applications, 79 of which were successful.That means 3541 FST application in the first case, and 3541 FSA + 79 FST applications in the second.''</span>
### <span style="color:#c07070">''This is especially interesting if we know that the total number of states in the rule FSTs is 3,467, while in the condition FSAs, it is only 1,597. The total number of arcs in the FSAs also seem to be about half of that in the FSTs. So why?''</span>
## <span style="color:#007000">''If, however, we use the new <code>apply_detmin_fsa</code> function to execute a deterministic minimal FSA (even with linear transition search), without all the cruft in <code>apply_net</code>, we are down to '''1.45 seconds'''.''
## <span style="color:#909000">''It could be possible to check the applicability of a number of rules with a single automaton. Researching this right now...''
### ''The best speed could probably be achieved by building FSAs from conditions numbering the powers of two. In this way, we could find the rule to apply in ''<tt>log(G)</tt>'' checks, thereby breaking the linearity barrier.''
### ''However, I have found that the limit where the FSAs are still of managable size is about 1000 states. This (depending on the number of states of the original FSA) is usually reached after unioning 3-6 FSAs. This means that while ''<tt>log(G)</tt>'' is out of reach, we can still reduce our running time to about 1/3-1/6th of what it is now.''
### ''Tests of different ways of building condition trees''
#### ''Flat: no trees are used''
#### ''Smallest-first: always merge the two smallest trees.''
#### ''Fixed height: full binary trees with a fixed number of levels. In Table 1, we use trees with 3 levels.''</span>
# Compress the condition FSAs, so that we can check more rules with a single FSA application.
## Try Automata Mista (Huet, 2003)
## Inter-automata bits (such as "skip this reading")
## Similar rule(parts), e.g. <code>SELECT np IF "Mari";</code> and <code>SELECT n IF "török";</code>: the FSA is the same, the only difference are the tags
### flag diacritics?
### push-down automata?
# '''TODO''' Index rules by their required tags (word-form, target, C conditions, etc.), and skip those that cannot possible be activated by the sentence.

<div align="center">
{| class="wikitable"
|+ align="bottom" | ''Table 1. The effect of different methods of grouping conditions on running time''
! Condition
! Total running time
! FST loading
! FST application
! Conditions checked
! Tree size
! File size
|-
|Flat (no tree)
|1.476s
|0.041s
|1.435s
|107,488
|N/A
|76 kB
|-
|Smallest first
|1.355s
|<span style="color:#b00000">'''0.355s'''</span>
|0.9s
|42,336
|2-4 rules
|<span style="color:#b00000">'''4.4 MB'''</span>
|-
|Fixed height
|2.070s
|<span style="color:#b00000">'''1.167s'''</span>
|'''0.84s'''
|36,832
|4 rules
|<span style="color:#b00000">'''12 MB'''</span>
|}

{| class="wikitable"
|+ align="bottom" | ''Table 2. Running the algorithm on different languages<br>*Smallest first<br>**Hit physical memory limit''
! Language
! Number of rules
! Number of FSTs
! Size of FST file
! Memory consumed by foma when loading
! Memory consumed by foma
! Size of VISL-GC binary
! Initialisation
! Running time
! VISL-GC running time
|-
|Hungarian
|35
|95
|3.6MB
|?
|2.7MB
|8kB
|0.41s
|0.94s*
|0.3s
|-
|Breton-French
|226
|602
|39MB
|1GB
|666MB
|36kB
|7.9s
|4.9s
|0.8s
|-
|sme-fin
|1172
|3149
|<span style="color:#b00000">'''514MB'''</span>
|3550MB**
|2476MB
|184kB
|}
</div>

===== Hacky hacks =====

# <span style="color:#007000">''An <code>apply_detmin_fst()</code> instead of <code>apply_down()</code> for the apertium <-> fomacg converters''</span>
# Sigma optimisation
## <code>apply_create_sigmatch()</code> is responsible for about 54% of <code>apply_detmin_fsa()</code>'s time. Basically we read the input string each time we run a condition FSA.
## What we can do:
### <span style="color:#007000">''Roll our own <code>custom_create_sigmatch()</code> that splits the input at spaces and treates each word/tag/etc. as a single symbol, not as a character sequence.''
#### ''Currently, only words/tags/etc. that occur in the rules are handled this way. Extending this idea to unknown tags (those, whose characters are classified as <code>IDENTITY_SYMBOL</code>) results in fewer transitions in the FSA, and only one (failed) look-up in the sigma trie for each unknown tag, as opposed to one for each of its letters.''</span>
#### ''Furthermore, while the original <code>custom_detmin_fsa()</code> and <code>custom_create_sigmatch()</code> required that, apart from the symbol vector, the original string is passed to them too, we can actually get away without the latter. Instead of using a <code>sigmatch</code> table of the size of the string, we can use one of the size of the vector, which is much shorter. Maybe because of this (fewer memory re-allocations), this simple modification to the functions resulted in '''14.5%''' speed increase for the Breton grammar.''
#### Table 3 below summarizes the results: we have achieved a '''16-36% speed-up''' and decreased the running time on the Hungarian corpus to '''0.75s'''. The last two columns shows the effect of factoring in the f2a direction of hack #1, as well as the modification to <code>custom_detmin_fsa()</code> described above, which allowed a '''further 1.5-14.5% speed-up'''.
### Sigma unification:
#### <span style="color:#007000">''Create an FSA that contains all symbols from all rules and conditions. Its sigma will be the "canonical"/"universal" one.''</span>
#### <span style="color:#007000">''Move around sigmas in all condition FSAs and rule FSTs so that the symbols above have the same codes.''</span>
#### <span style="color:#007000">''Write a <code>common_detmin_fsa()</code> that, if it sees a symbol unknown to the FSA in state, it treats it as an <code>IDENTITY_SYMBOL</code> if it can. There are several ways to achieve this end:</span>
##### <span style="color:#007000">''Have a <code>sigma</code> set for all FSAs; if an input symbol is not in the set, replace it. Note that for this use case, <code>unordered_set</code> is '''slower''' than <code>set</code> because the number of elements in a single set is small.''</span>
##### Have a mapping that translates all symbols in the universal sigma to the local one. For symbols in the local sigma this will be identity; for the rest, <code>IDENTITY</code> :)
##### <span style="color:#007000">''The latter can be implemented with a <code>vector</code>, which is much faster than a <code>map</code>, though its memory requirements are higher (<code>O(|conditions| * |sigma|)</code>).''</span>
##### Results are in Table 4, below.
#### <span style="color:#007000">''Let the universal FSA create the sigmatch table, and pass its <code>apply_handle</code> to the others in <code>common_detmin_fsa()</code>. This enables us to convert the symbols to sigma ids only once per rule application.''</span>
### <span style="color:#007000">''Get rid of strings for good:''</span>
#### <span style="color:#007000">''Write an <code>common_apply_down()</code> that''</span>
##### <span style="color:#007000">''handles nondeterminism (with the restriction that for an input there is only a single output)''</span>
##### <span style="color:#007000">''both the input and output are sigma ids''</span>
#### <span style="color:#007000">''Because we currently delete filtered readings instead of just <code>#X# </code>'ing them,''</span>
##### <span style="color:#007000">''we have to pass a sequence of <code>(sigma, symbol)</code> pairs to it, or''</span>
##### <span style="color:#007000">''actually, we pass a sequence of <code>Symbol</code>s, which store the sigma id, its position in the string and its length to avoid string creation overhead.''</span>
##### we can just bring back the <code>#X# </code>.
### <span style="color:#007000">''See Table 0 for the results; for the Hungarian, we beat VISL-GC a bit; for Breton, we need to find another 20%.''</span>
# <code>apply_down()</code> still accounts for a rather large percentage of the running time. This is because it is called not only by <code>apply_rules()</code>, but by the code that converts the apertium stream format to fomacg's format and vice versa.
## <span style="color:#007000">''The latter FST is definitely deterministic. Using the new <code>apply_detmin_fst()</code> function instead of <code>apply_down()</code> decreased the running time of the Hungarian corpus '''from 0.94s to 0.87s'''.''</span>
## <span style="color:#c07070">''Isn't the rule FST deterministic as well?!</span> ... <span style="color:#900000">'course not.''</span>

<div align="center">
{| class="wikitable"
|+ align="bottom" | ''Table 3. Improvements due to custom_detmin_fsa()<br>The numbers were achieved by using the smallest-first tree building method.''<br>''(*) see above; additionally, with hack #1 in the f2a direction too.''<br>
''(**) compared to the regular <code>custom_detmin_fsa</code>, but with hack #1.''
! Language
! <code>apply_detmin_fsa()</code>
! <code>custom_detmin_fsa()</code>
! Speed-up
! <code>custom_detmin_fsa()</code>(*)
! Speed-up(**)
|-
|Hungarian
|0.868s
|0.745s
|16.5%
|0.648s
|1.5%
|-
|Breton-French
|4.82s
|3.54s
|36%
|2.73s
|'''14.5%'''
|}
</div>

<div align="center">
{| class="wikitable"
|+ align="bottom" | ''Table 4. Improvements due to common_detmin_fsa()<br>The numbers were achieved by using the smallest-first tree building method.''
! Language
! <code>custom_detmin_fsa()</code>
! <code>common_detmin_fsa() w/ set</code>
! Speed-up
! <code>custom_detmin_fsa() w/ vector</code>
! Speed-up
|-
|Hungarian
|0.648s
|0.623s
|4%
|0.559s
|16%
|-
|Breton-French
|2.73s
|2.12s
|29%
|1.566s
|'''74%'''
|}
</div>

<div align="center">
{| class="wikitable"
|+ align="bottom" | ''Table 5. Improvements in memory usage and running time due to removing <code>#X# </code> from the initial filter pattern.''
! Language
! Before -- memory
! Before -- initialisation
! Before -- runtime
! After -- memory
! After -- initialisation
! After -- runtime
|-
|Hungarian
|7.1MB
|0.721s
|0.238s
|6.5MB
|0.658s
|0.232s
|-
|Breton-French
|7.4MB
|1.814s
|1.105s
|6MB
|1.67s
|1.007s
|}
</div>

==== Size problems ====

As mentioned above, union'ing the rule condition FSAs is a good way to speed up the rule application process. Unfortunately, the resulting automata may be exponential in size, making FSAs that check many rules at once unfeasible. Unless, of course, we can reduce their memory footprint. This section documents the experiments we conducted to towards this goal.

# Automata Mista
## Due to Gérard Huet
### Huet, G. (2004). Automata mista. Verification: Theory and Practice, 271-273.
### Huet, G. (2005). A functional toolkit for morphological and phonological processing, application to a Sanskrit tagger. Journal of Functional Programming, 15(4), 573-614.
### While the code implementating AM is included in the papers, it is not immediately usable and is rather hard to read, because
#### the code is in ML (with the occasional OCaml library call), which isn't very intuitive
#### it is written in a functional style, which, when translated line-by-line to Python, leads to abysmal performance
#### the author couldn't have been bothered to explain what certain function arguments or variables mean
## Promise
### Huet reports huge savings; for example, the Sanskrit morphological analyser, an FSA of 200,000 states, could be represented by a trie of 7,500 nodes.
### Then again, FSAs derived from word lists have a structure that gives them naturally to tries. Whether the method fares similarly well on generic FSAs is yet to be seen.
## Implementation
### <span style="color:#009000">''I have translated the caml code to Python. If it seems promising, I'll be rewriting it in C++.''</span>
## Evaluation
### <span style="color:#009000">''I fed the vocabulary of the Longman dictionary (more specifically, the strangely short variant that we have) to it. From the 47,051 tokens (words, numbers, g@rb@ge), the trie builder built a raw trie (with only forward sharing) of 150,222 nodes. Running the sharing algorithm (2/p9) on it reduced the number of nodes to 42,540 -- a 71.68% percent reduction.''</span>
### <span style="color:#009000">''Of course, due to how the trie is built (there is an empty node at the end of each path), about 47050 of those nodes are trivial to merge; this decreases the reduction to a more humble '''58.77%'''.''</span>
### <span style="color:#009000">''Collapsing nonaccepting, nonbranching trie paths (not described in the papers) results in a further reduction of '''54.8%''', bringing down the number of nodes to 19,213. However, I don't see a way to decrease this number further, and the compression level is not nearly the same as the one reported for the Sanskrit analyzer.''</span>
### <span style="color:#009000">''At the same time, foma creates an 42,540-state automaton from the same word list, which hints at the not very surprising fact that "sharing" is basically the same as minimising a deterministic automaton. As such, the only advantage AM can boast of is that states can be collapsed, but I am not sure that we are better off, due to the additional bookkeeping required (sequence of letters on arcs).''</span>
# Decrease the alphabet
## We cannot hope for large savings from this one, but <code>| </code> seems to be completely superfluous: the end of a reading is clearly marked by the beginning of the next one (<code>#0# </code>) or the end of the cohort (<code>#EOC# </code>).
### There is a possibility that this change might make the rules more complicated; still, we should try it.
## It is not necessary to save all FSAs to disk. In case of a full binary tree with three levels, nodes 3, 5 and 7 (BFS ordering) are superfluous: if node 1 matches the sentence, but node 2 does not, it is obvious that node 3 would, and so on. The rule applier has already been built with this idea in mind, but the redundant trees are still save to file and read into memory.
# <span style="color:#009000">''Delete the sigma tries from the condition automata and rules.''</span>
## <span style="color:#009000">''See table 6. for results on some of the rule sets. It is obvious from the data that the reduction is greater for rule sets where the condition trees are shallow; as the conditions are merged, the growth in the number of states and transitions becomes more substantial -- in the extreme case the sigma of a joined condition might not grow at all, but the size of the state space always does considerably.''</span>

<div align="center">
{| class="wikitable"
|+ align="bottom" | ''Table 6. Improvements in memory usage due to removing the sigma trie from the condition and rule FSMs.<br>Memory consumption is measured as percentages of the 4GB system memory.''
! Rule set
! Before
! After
! Reduction
|-
|apertium-br-fr.br-fr_cg2_orig.fst
|23.5%
|17.8%
|24.25%
|-
|apertium-br-fr.br-fr_cg2_l2.fst
|9.6%
|4.4%
|54.16%
|-
|apertium-br-fr.br-fr_cg2_l1.fst
|5.1%
|1.3%
|74.5%
|-
|hun_cg2_l1.fst
|0.5%
|0.1%
|80%
|-
|hun_cg2_l3.fst
|2.1%
|1.5%
|28.57%
|-
|apertium-sme-fin.fin-sme_cg2_l1.fst
|21%
|4.1%
|80.47%
|-
|apertium-sme-fin.fin-sme_cg2_l2.fst
|32.3%
|8.9%
|72.44%
|}
</div>

==== TODO?'s ====

# ''cg-comp'' is not so strict as ''fomacg'' about the position part in contextual tests, e.g. it accepts <code>1*</code>, not just <code>*1</code>. Loosen up ''fomacg''?

==== Research ====

* Decrease the number of rules applied (see [[User:David_Nemeskey/GSOC_proposal_2013#Detailed work plan|in the proposal]]).
* When and how can rules be merged?


=== Miscellaneous / Extra ===
=== Miscellaneous / Extra ===
Line 9: Line 517:
==== Hungarian CG grammar ====
==== Hungarian CG grammar ====


Write a simple CG grammar for Hungarian, somewhere around 50-150 rules.
''Write a simple CG grammar for Hungarian, somewhere around 50-150 rules.''


* Read Pasi Tapnainen's The Constraint Grammar Parser CG-2.
* ''Read Pasi Tapnainen's The Constraint Grammar Parser CG-2.''
* Read the contents of cg_material.zip.
* Read the contents of cg_material.zip.
* ''Install Apertium, VISL CG3 and a language pair (cy-en)''
* Study the CG grammar of an Apertium language.
* Study the CG grammar of an Apertium language.
* Write a Hungarian grammar that covers the sentences in [https://apertium.svn.sourceforge.net/svnroot/apertium/incubator/apertium-hun-eng/texts/rasskaz.hun.txt this sample file]
* ''Write a Hungarian grammar that covers the sentences in [https://apertium.svn.sourceforge.net/svnroot/apertium/incubator/apertium-hun-eng/texts/rasskaz.hun.txt this sample file]''
* The tags will be based on those in KR-code<ref>András Kornai, Péter Rebrus, Péter Vajda, Péter Halácsy, András Rung, Viktor Trón. 2004. Általános célú morfológiai elemző kimeneti formalizmusa (The output formalism of a general-purpose morphological analyzer). In: Proceedings of the 2nd Hungarian Computational Linguistics Conference.</ref>. See the [[#Hunmorph converter|next task]].
** ''The tags will be based on those in KR-code<ref>András Kornai, Péter Rebrus, Péter Vajda, Péter Halácsy, András Rung, Viktor Trón. 2004. Általános célú morfológiai elemző kimeneti formalizmusa (The output formalism of a general-purpose morphological analyzer). In: Proceedings of the 2nd Hungarian Computational Linguistics Conference.</ref>. See the [[#Hunmorph converter|next task]].''
* TODOs:
** ''add a sentence to the rasskaz file for the "az a" construct.''
** ''prevb disambiguation''
* ''The file is [https://svn.code.sf.net/p/apertium/svn/incubator/apertium-hun-eng/dev/hun.rlx here].''


==== Hunmorph converter ====
==== Hunmorph converter ====


Write a converter from ocamorph's output to Apertium's format.
''Write a converter from [[Hunmorph|ocamorph]]'s output to Apertium's format.''


* Again, use the sentences in [https://apertium.svn.sourceforge.net/svnroot/apertium/incubator/apertium-hun-eng/texts/rasskaz.hun.txt this sample file] as reference.
* ''Again, use the sentences in [https://apertium.svn.sourceforge.net/svnroot/apertium/incubator/apertium-hun-eng/texts/rasskaz.hun.txt this sample file] as reference.''
* While a C-based converter would definitely be possible, I opted for a foma-based (xfst -- lexc?) implementation, so that this task also serves for practice.
* ''While a C-based converter would definitely be possible, I opted for a foma-based (xfst -- lexc?) implementation, so that this task also serves for practice.''
* TODOs:
** ''some analyses are repeated in the output: fix them! -- Not a real fix, because the problem wasn't in the script, but in ocamorph, but I wrote a python script that discards repeated analyses.''
** ''hunmorph_to_apertium.foma does not handle compounds (+)''
** two-phase conversion (handle special words first, then the general part)
** ''a few non-general conversions:''
*** ''"mikor<CONJ>" => <cnjadv>, "mikor<ADV>" => <itg>''
*** ''"hova" => <itg>''
*** ''"így<CONJ>" => <cnjadv>''
*** ''"ekkor<ADV>" => <cnjadv>?''
*** ''"aztán<CONJ>" => <cnjadv>''
* ''Apparently a pure lexc/foma based implementation wasn't possible (at least with the one line -- one word output format I decided ask from hunmorph). The reason is that the tagset of hunmorph and Apertium does not match exactly, and therefore I needed to add exceptional rules for certain words and lexical categories. However, ''<code>flookup</code>'' returns all possible analyses, so in this case it returned both the exceptional and the regular translation. The current solution consists of four files:''
** ''[https://svn.code.sf.net/p/apertium/svn/incubator/apertium-hun-eng/dev/kr_to_apertium_spec.lexc kr_to_apertium_spec.lexc] contains the words / lexical categories that need special treatment (''<code>vbser</code>'', ''<code>prn</code>''s, "''ez''"/"''az''", etc.)''
** ''[https://svn.code.sf.net/p/apertium/svn/incubator/apertium-hun-eng/dev/kr_to_apertium.lexc kr_to_apertium.lexc] contains the rest of the categories, i.e. whose the translation was straightforward''
** ''[https://svn.code.sf.net/p/apertium/svn/incubator/apertium-hun-eng/dev/kr_to_apertium.foma kr_to_apertium.foma] is a simple foma script that writes the previous two into the binary file'' <code>kr_to_apertium.fst</code>
** ''[https://svn.code.sf.net/p/apertium/svn/incubator/apertium-hun-eng/dev/hunmorph_to_apertium.cpp hunmorph_to_apertium.cpp] loads the two FSTs from the binary file and applies them to the readings. First it tries to parse the reading with the ''<code>_spec</code>'' FST; if it fails, it reverts back to the general one. This mechanism ensures that all readings get only one translation, and it is also the correct one.''


==== ATT -> lttoolbox compiler ====
==== ATT -> lttoolbox compiler ====


Write an ATT FST format reading for lttoolbox. A useful practice for moving from foma to lttoolbox. Since lttoolbox lacks some of the functionaty needed, the compiler will most likely stay in foma, but lttoolbox might work as the runtime component.
''Write an ATT FST format reading for lttoolbox. A useful practice for moving from foma to lttoolbox. Since lttoolbox lacks some of the functionaty needed, the compiler will most likely stay in foma, but lttoolbox might work as the runtime component.''


* [[ATT format]]
* ''[[ATT format]]''
* "<spectie> the ATT->lttoolbox thing should be a simple as : beer = t.insertSingleTransduction(alphabet(L'e',L'e'), beer);"
* ''"<spectie> the ATT->lttoolbox thing should be a simple as : beer = t.insertSingleTransduction(alphabet(L'e',L'e'), beer);"''
** ''The actual implementation turned out to be a little more difficult. Anyway, the code is [https://svn.code.sf.net/p/apertium/svn/incubator/apertium-hun-eng/dev/att_to_ltt/att_to_ltt.cpp here].
** ''The transducer we used for testing, [https://svn.code.sf.net/p/apertium/svn/incubator/apertium-hun-eng/dev/att_to_ltt/kaz.att kaz.att], didn't work with <tt>lt-proc</tt>, so Fran told me to create two transducers instead of one: one for words ('''main''') and one for punctuation ('''final''').''


== References ==
== References ==

Latest revision as of 16:27, 29 October 2013

Tasks[edit]

XML format[edit]

See User:David_Nemeskey/CG_XML_brainstorming.

Rule applier[edit]

Rough execution flow:

  1. Load the rules
    1. DELIMITERS must also be converted to an FST and saved.
    2. Rule organization:
      • separate files in a directory?
      • one file per section?
      • one big file (needs another "index" file then)
  2. Read the input stream cohort by cohort
    • StreamReader class in fomacg_stream_reader.h
    • Read until a delimiter is found (see above)
  3. Convert the cohorts from the Apertium stream format to fomacg's format
    • Converter in fomacg_converter.h
    • apertium_to_fomacg.foma
    • conversion from wchar_t to utf-8 char
  4. Apply rules in the order of the sections they are defined in
  5. Convert the cohorts back to the Apertium stream format

Compiler[edit]

Code[edit]

  • Read the fomacg code[1] and understand what it does.
  • Add comments do a little refactoring (e.g. separate function for creating @*word@* FSTs).
  • Check and test if it works on all grammars, fix it if it isn't.

Development[edit]

  1. DELIMITERS must also be converted to an FST and saved.
    • I have run into a foma bug, which also corrupted the output of all rules that contained lemma tags. I have patched the code on my side and opened a bug report (two, actually).
  2. fomacg treats all tags as symbols. Usually this is OK, but for lemmas, it may not be optimal
    1. it blows up the number of symbols (if not the transducer) => turns out the transducer is smaller this way;
    2. might be slower to parse? Or faster => faster, if we use exploded strings, fomacg_proc becomes about 33% slower.
  3. If one tag is a postfix of another, and the longer tag is not mentioned in the rule, fomacg might trigger the wrong rule, believing that it sees the shorter tag. E.g. if there is a "predet " and a "det " tag, and the rule is SELECT (det), it will also select predets as well. This is because in fomacg, the symbol that corresponds to a tag is its name plus a space; in other words, only the right boundary of a symbol is checked.
    1. Change the symbols from "tag " to "<tag> " to ensure the "postfix property", except for word forms and lemmas, which are already enclosed in quotes.
    2. Keep as much of Apertium's format as possible?
  4. Insert ">>>" and "<<<" to before and after the sentence, respectively.
  5. Make the rules fail-fast?

Speed problems[edit]

This first version of fomacg_proc, where rules are separate FSTs applied in order, has some speed issues. With the 40-rule CG-2 subset of my Hungarian grammar, it disambiguates a 13760-token test corpus (the rasskaz sentences 32x concatenated) in 6.4 seconds. Under the same conditions, cg-proc needs 0.3 seconds.

Speed measurements (see below for details) summary[edit]
  1. apply_down
    1. original rules, strcmp-based application check: 6.4 seconds
    2. deleteX rules, strlen-based application check: 5.9 seconds
    3. deleteX rules, FSA (apply_down)-based application check: 9 - 11 seconds
    4. deleteX rules, FSA (apply_detmin_fsa)-based check: 1.47 seconds
    5. deleteX rules, FSA (apply_detmin_fsa)-based check, smallest-first condition trees: 1.35 seconds, running time: 0.95 seconds.
    6. deleteX rules, FSA (apply_detmin_fsa)-based check, smallest-first condition trees, apply_detmin_fst conversion: 1.3 seconds, running time: 0.87 seconds.
    7. deleteX rules, FSA (custom_detmin_fsa)-based check, smallest-first condition trees, apply_detmin_fst conversion: 1.1 seconds, running time: 0.75 seconds.
    8. deleteX rules, FSA (custom_detmin_fsa)-based check, smallest-first condition trees, apply_detmin_fst conversion both ways: 1 seconds, running time: 0.655 seconds.
    9. deleteX rules, FSA (string vector only custom_detmin_fsa)-based check, smallest-first condition trees, apply_detmin_fst conversion both ways: 1 seconds, running time: 0.648 seconds (but Breton gained a lot from this, see table!).
    10. deleteX rules, FSA (common_detmin_fsa with set)-based check, smallest-first condition trees, apply_detmin_fst conversion both ways: 1 seconds, running time: 0.623s seconds (but again, Breton gained a lot more from this).
    11. deleteX rules, FSA (common_detmin_fsa with vector)-based check, smallest-first condition trees, apply_detmin_fst conversion both ways: 0.93 seconds, running time: 0.559s seconds (again, Breton... you get the idea).
    12. deleteX rules, FSA (common_detmin_fsa with vector)-based check, common_apply_down-based rule application, smallest-first condition trees, apply_detmin_fst conversion both ways: 0.72 seconds, running time: 0.346s seconds.
    13. deleteX rules, FSA (common_detmin_fsa with vector)-based check, common_apply_down-based rule application, common_create_sigmatch-based string->int conversion, smallest-first condition trees, apply_detmin_fst conversion both ways: 0.655 seconds, running time: 0.285s seconds (note that this running time is less than the full execution time of VISL-GC, so we're on the right track! Unfortunately, we are not there with Breton yet.)
    14. Same as above, but after having optimised the input stream reader code and getting rid of deques: 0.615 seconds, running time: 0.245s seconds.
  2. fsm_compose
    1. simple rules, fsm_intersect-based application check: 45 seconds
    2. deleteX rules, fsa->statecount-based application check: 28.3 seconds
Table 0. Running times for different languages.
Method (ID in list above) hun init hun run br init br run
1 0s 6.4s N/A N/A
2 0s 5.9s N/A N/A
3 0s 9-11s N/A N/A
4 0s 1.47s N/A N/A
5 0.37s 0.95s N/A N/A
6 0.37s 0.87s 7.9s 4.9s
7 0.37s 0.75s 7.9s 3.54s
8 0.37s 0.655s 7.9s 3.13s
9 0.37s 0.648s 7.9s 2.73s
10 0.37s 0.623s 7.9s 2.12s
11 0.37s 0.559s 7.9s 1.566s
12 0.37s 0.346s 7.9s 1.088s
13 0.37s 0.285s 7.9s 0.96s
14 0.37s 0.245s 7.9s 0.916s
Simple ideas to speed it up[edit]
  1. do not use symbols for tags, etc (see above): execution time grows by 33%
  2. compose nets on the same priority level: the FSTs become too big too fast -- on some simpler rules:
    1. rule: 3.5 kB. 44 states, 176 arcs
    2. rules: 446.0 kB. 1965 states, 28482 arcs
    3. rules: 4.9 MB. 17303 states, 320664 arcs
    4. rules: 61.5 MB. 163945 states, 4029156 arcs
  3. union nets on the same priority level: the nets don't grow as fast as with compose, but still not useable:
    1. rule: 3.5 kB. 44 states, 176 arcs
    2. rules: 47.3 kB. 202 states, 2957 arcs
    3. rules: 238.8 kB. 856 states, 15208 arcs
    4. rules: 1.3 MB. 3988 states, 82261 arcs
    5. rules: 7.2 MB. 20296 states, 471188 arcs
    6. rules: 53.5 MB. 131639 states, 3504119 arcs
Clever ideas[edit]
  1. Use fail-fast FSTs. The idea was that these could speed up the computation because unapplicable rules don't have to copy the whole string over, they can return on the first error. I have tried to augment the ConstrainR and ConstrainS pattern in two ways:
    1. compose $"$A$ " at the end: worked, but the rule FSTs became larger and slower;
    2. instead of deleting $1$ and #1# from the string, I composed ["$1$ " => _ MarkedReading & NotMarkedReading] & $"$1$ " at the beginning. This didn't work, the rules only recognized cohorts, not sentences. And if I change the rule to ?* rule ?*, it becomes non-deterministic.
    3. Maybe this idea wasn't that clever after all -- the rule might have to read the whole sentence anyway...
  2. Do not pass the sentence to the rule, just the current word (+ the left & right contexts).
    1. Unfortunately, this does not mitigate the need of reading the sentence repeatedly over-and-over, because of right contexts: a rule can be enabled by another that changes something in a later word.
  3. Instead of applying the rules to the sentence string, convert the string to an (linear) FSA and compose the rules with it.</span
    1. Unfortunately, this is even slower: on one of my machines, instead of 6.5 seconds, it takes 47! Why? It could be because:
      1. All fsm_ functions are destructive. However, about half of the time (most, if the TODO above is implemented), I needed the old transducer so I had to fsm_copy it.
      2. Why indeed? According to callgrind (note: the numbers are not entirely accurate, because there was a bug in the program when I measured them. fomacg_to_fsa could be around 13%):
        1. 28.52% fomacg_to_fsa == fsm_concat
        2. 23.01% fsm_intersect
        3. 43.19% fsm_compose
      3. If we go deeper:
        1. 24.01% fsm_merge_sigma (called by all three)
        2. 24.01% fsm_minimize (called 82.87% of the time by fsm_concat, and 16.05% by fsm_intersect)
      4. What can we take home from this?
        1. About 50% of the time is spent in fsm_compose, which we cannot avoid. It might be possible, however, to write a custom <fsm_compose> that exploits the fact that one of its input is not a general FST, but a linear FSA.
        2. fsm_intersect takes about 50% less than fsm_compose (both are called once per rule), so trying the intersection-based method looks promising (though the sentence won't be linear in that case)
        3. The overhead of fsm_minimize could be avoided entirely by using a custom fsm_concat to construct the sentence. It is a linear automaton, so no minimisation is needed.
    2. Sometimes apply down'ing a rule works, but composing it to the sentence FSA doesn't. This was a bug in fomacg.
    3. Currently we are at 28.3 seconds
      1. fsm_concat_custom() does not call fsm_minimize().
      2. We use rules that delete the #X# readings.
  4. Mans's original rules replaced the #0# in front of filtered readings with an #X# . However, nothing prevents us from actually removing those readings... since this way the sentence becomes shorter with each rule applied, subsequent rule applications / checks become faster. I call this new behavior "deleteX", and it is realised by composing the transducer cg_DeleteX.foma that deletes the #X# 'd readings to all rules.
    1. As can be seen in Table 0, this is indeed the case.
    2. The situation gets a bit more tricky if we replace string-based apply_down() with common_apply_down() (see above).
      1. As we are dealing with numbers now, it turns out that we would now be better off with getting back to replacing #0# with #X# , as that wouldn't require creating and filling a new vector of Symbols.
      2. However, callgrind tells us that we spend 71.23% of our time checking rule conditions, and only about 6.75% applying rules; and the performance of the former is, again, heavily affected by the length of the sentence.
      3. So it may just be that we are still better off with deleteX rules. However, if we manage to decrease the number of condition checks, we might reconsider dropping them.
    3. As it turns out, the cg_DeleteX.foma is not enough; we are left with #X# -> 0 branches in our FSTs that are never traversed; yet, they take up space.
      1. The culprits were the initial filter pattern and the cg_match_cohort_careful() function in fomacg. Modifying these got us rid of the unnecessary branches. However, even though it isn't used anywhere now, #X# remained in the sigma of the FSTs; apparently foma does not remove unused symbols from the alphabet of the transducer created via composition.
      2. The effect of removing said branches can be seen in Table 5. As I could not recreate the condition trees I am using for the performance tests, I had to create new ones. I sorted the rules in alphabetical order, and created complete binary trees from them (3-level trees for Hungarian, 2-level trees for Breton). Therefore, the numbers in Table 5 cannot be directly compared with those in Table 0; however, the "before" numbers correspond to row 14 of Table 0.
  5. (Mans's suggestion): Create an automaton for each rule that doesn't change anything, but only accepts the sentence the associated rule would change. "This automaton can be determinized and minimized and applied *much* faster than the corresponding transducer."
    1. If the automaton is just used to check the applicability of a single rule, execution time goes up to 9 seconds. :(
      1. According to callgrind, apply_follow_next_arc is called 1,358,508 times when we used only FTSs, and 2,589,103 times with the new method.
      2. In the original test set (1/32nd of the performance test corpus), there were 3541 attempts at rule applications, 79 of which were successful.That means 3541 FST application in the first case, and 3541 FSA + 79 FST applications in the second.
      3. This is especially interesting if we know that the total number of states in the rule FSTs is 3,467, while in the condition FSAs, it is only 1,597. The total number of arcs in the FSAs also seem to be about half of that in the FSTs. So why?
    2. If, however, we use the new apply_detmin_fsa function to execute a deterministic minimal FSA (even with linear transition search), without all the cruft in apply_net, we are down to 1.45 seconds.
    3. It could be possible to check the applicability of a number of rules with a single automaton. Researching this right now...
      1. The best speed could probably be achieved by building FSAs from conditions numbering the powers of two. In this way, we could find the rule to apply in log(G) checks, thereby breaking the linearity barrier.
      2. However, I have found that the limit where the FSAs are still of managable size is about 1000 states. This (depending on the number of states of the original FSA) is usually reached after unioning 3-6 FSAs. This means that while log(G) is out of reach, we can still reduce our running time to about 1/3-1/6th of what it is now.
      3. Tests of different ways of building condition trees
        1. Flat: no trees are used
        2. Smallest-first: always merge the two smallest trees.
        3. Fixed height: full binary trees with a fixed number of levels. In Table 1, we use trees with 3 levels.
  6. Compress the condition FSAs, so that we can check more rules with a single FSA application.
    1. Try Automata Mista (Huet, 2003)
    2. Inter-automata bits (such as "skip this reading")
    3. Similar rule(parts), e.g. SELECT np IF "Mari"; and SELECT n IF "török";: the FSA is the same, the only difference are the tags
      1. flag diacritics?
      2. push-down automata?
  7. TODO Index rules by their required tags (word-form, target, C conditions, etc.), and skip those that cannot possible be activated by the sentence.
Table 1. The effect of different methods of grouping conditions on running time
Condition Total running time FST loading FST application Conditions checked Tree size File size
Flat (no tree) 1.476s 0.041s 1.435s 107,488 N/A 76 kB
Smallest first 1.355s 0.355s 0.9s 42,336 2-4 rules 4.4 MB
Fixed height 2.070s 1.167s 0.84s 36,832 4 rules 12 MB
Table 2. Running the algorithm on different languages
*Smallest first
**Hit physical memory limit
Language Number of rules Number of FSTs Size of FST file Memory consumed by foma when loading Memory consumed by foma Size of VISL-GC binary Initialisation Running time VISL-GC running time
Hungarian 35 95 3.6MB ? 2.7MB 8kB 0.41s 0.94s* 0.3s
Breton-French 226 602 39MB 1GB 666MB 36kB 7.9s 4.9s 0.8s
sme-fin 1172 3149 514MB 3550MB** 2476MB 184kB
Hacky hacks[edit]
  1. An apply_detmin_fst() instead of apply_down() for the apertium <-> fomacg converters
  2. Sigma optimisation
    1. apply_create_sigmatch() is responsible for about 54% of apply_detmin_fsa()'s time. Basically we read the input string each time we run a condition FSA.
    2. What we can do:
      1. Roll our own custom_create_sigmatch() that splits the input at spaces and treates each word/tag/etc. as a single symbol, not as a character sequence.
        1. Currently, only words/tags/etc. that occur in the rules are handled this way. Extending this idea to unknown tags (those, whose characters are classified as IDENTITY_SYMBOL) results in fewer transitions in the FSA, and only one (failed) look-up in the sigma trie for each unknown tag, as opposed to one for each of its letters.
        2. Furthermore, while the original custom_detmin_fsa() and custom_create_sigmatch() required that, apart from the symbol vector, the original string is passed to them too, we can actually get away without the latter. Instead of using a sigmatch table of the size of the string, we can use one of the size of the vector, which is much shorter. Maybe because of this (fewer memory re-allocations), this simple modification to the functions resulted in 14.5% speed increase for the Breton grammar.
        3. Table 3 below summarizes the results: we have achieved a 16-36% speed-up and decreased the running time on the Hungarian corpus to 0.75s. The last two columns shows the effect of factoring in the f2a direction of hack #1, as well as the modification to custom_detmin_fsa() described above, which allowed a further 1.5-14.5% speed-up.
      2. Sigma unification:
        1. Create an FSA that contains all symbols from all rules and conditions. Its sigma will be the "canonical"/"universal" one.
        2. Move around sigmas in all condition FSAs and rule FSTs so that the symbols above have the same codes.
        3. Write a common_detmin_fsa() that, if it sees a symbol unknown to the FSA in state, it treats it as an IDENTITY_SYMBOL if it can. There are several ways to achieve this end:
          1. Have a sigma set for all FSAs; if an input symbol is not in the set, replace it. Note that for this use case, unordered_set is slower than set because the number of elements in a single set is small.
          2. Have a mapping that translates all symbols in the universal sigma to the local one. For symbols in the local sigma this will be identity; for the rest, IDENTITY :)
          3. The latter can be implemented with a vector, which is much faster than a map, though its memory requirements are higher (O(|conditions| * |sigma|)).
          4. Results are in Table 4, below.
        4. Let the universal FSA create the sigmatch table, and pass its apply_handle to the others in common_detmin_fsa(). This enables us to convert the symbols to sigma ids only once per rule application.
      3. Get rid of strings for good:
        1. Write an common_apply_down() that
          1. handles nondeterminism (with the restriction that for an input there is only a single output)
          2. both the input and output are sigma ids
        2. Because we currently delete filtered readings instead of just #X# 'ing them,
          1. we have to pass a sequence of (sigma, symbol) pairs to it, or
          2. actually, we pass a sequence of Symbols, which store the sigma id, its position in the string and its length to avoid string creation overhead.
          3. we can just bring back the #X# .
      4. See Table 0 for the results; for the Hungarian, we beat VISL-GC a bit; for Breton, we need to find another 20%.
  3. apply_down() still accounts for a rather large percentage of the running time. This is because it is called not only by apply_rules(), but by the code that converts the apertium stream format to fomacg's format and vice versa.
    1. The latter FST is definitely deterministic. Using the new apply_detmin_fst() function instead of apply_down() decreased the running time of the Hungarian corpus from 0.94s to 0.87s.
    2. Isn't the rule FST deterministic as well?! ... 'course not.
Table 3. Improvements due to custom_detmin_fsa()
The numbers were achieved by using the smallest-first tree building method.

(*) see above; additionally, with hack #1 in the f2a direction too.
(**) compared to the regular custom_detmin_fsa, but with hack #1.
Language apply_detmin_fsa() custom_detmin_fsa() Speed-up custom_detmin_fsa()(*) Speed-up(**)
Hungarian 0.868s 0.745s 16.5% 0.648s 1.5%
Breton-French 4.82s 3.54s 36% 2.73s 14.5%
Table 4. Improvements due to common_detmin_fsa()
The numbers were achieved by using the smallest-first tree building method.
Language custom_detmin_fsa() common_detmin_fsa() w/ set Speed-up custom_detmin_fsa() w/ vector Speed-up
Hungarian 0.648s 0.623s 4% 0.559s 16%
Breton-French 2.73s 2.12s 29% 1.566s 74%
Table 5. Improvements in memory usage and running time due to removing #X# from the initial filter pattern.
Language Before -- memory Before -- initialisation Before -- runtime After -- memory After -- initialisation After -- runtime
Hungarian 7.1MB 0.721s 0.238s 6.5MB 0.658s 0.232s
Breton-French 7.4MB 1.814s 1.105s 6MB 1.67s 1.007s

Size problems[edit]

As mentioned above, union'ing the rule condition FSAs is a good way to speed up the rule application process. Unfortunately, the resulting automata may be exponential in size, making FSAs that check many rules at once unfeasible. Unless, of course, we can reduce their memory footprint. This section documents the experiments we conducted to towards this goal.

  1. Automata Mista
    1. Due to Gérard Huet
      1. Huet, G. (2004). Automata mista. Verification: Theory and Practice, 271-273.
      2. Huet, G. (2005). A functional toolkit for morphological and phonological processing, application to a Sanskrit tagger. Journal of Functional Programming, 15(4), 573-614.
      3. While the code implementating AM is included in the papers, it is not immediately usable and is rather hard to read, because
        1. the code is in ML (with the occasional OCaml library call), which isn't very intuitive
        2. it is written in a functional style, which, when translated line-by-line to Python, leads to abysmal performance
        3. the author couldn't have been bothered to explain what certain function arguments or variables mean
    2. Promise
      1. Huet reports huge savings; for example, the Sanskrit morphological analyser, an FSA of 200,000 states, could be represented by a trie of 7,500 nodes.
      2. Then again, FSAs derived from word lists have a structure that gives them naturally to tries. Whether the method fares similarly well on generic FSAs is yet to be seen.
    3. Implementation
      1. I have translated the caml code to Python. If it seems promising, I'll be rewriting it in C++.
    4. Evaluation
      1. I fed the vocabulary of the Longman dictionary (more specifically, the strangely short variant that we have) to it. From the 47,051 tokens (words, numbers, g@rb@ge), the trie builder built a raw trie (with only forward sharing) of 150,222 nodes. Running the sharing algorithm (2/p9) on it reduced the number of nodes to 42,540 -- a 71.68% percent reduction.
      2. Of course, due to how the trie is built (there is an empty node at the end of each path), about 47050 of those nodes are trivial to merge; this decreases the reduction to a more humble 58.77%.
      3. Collapsing nonaccepting, nonbranching trie paths (not described in the papers) results in a further reduction of 54.8%, bringing down the number of nodes to 19,213. However, I don't see a way to decrease this number further, and the compression level is not nearly the same as the one reported for the Sanskrit analyzer.
      4. At the same time, foma creates an 42,540-state automaton from the same word list, which hints at the not very surprising fact that "sharing" is basically the same as minimising a deterministic automaton. As such, the only advantage AM can boast of is that states can be collapsed, but I am not sure that we are better off, due to the additional bookkeeping required (sequence of letters on arcs).
  2. Decrease the alphabet
    1. We cannot hope for large savings from this one, but | seems to be completely superfluous: the end of a reading is clearly marked by the beginning of the next one (#0# ) or the end of the cohort (#EOC# ).
      1. There is a possibility that this change might make the rules more complicated; still, we should try it.
    2. It is not necessary to save all FSAs to disk. In case of a full binary tree with three levels, nodes 3, 5 and 7 (BFS ordering) are superfluous: if node 1 matches the sentence, but node 2 does not, it is obvious that node 3 would, and so on. The rule applier has already been built with this idea in mind, but the redundant trees are still save to file and read into memory.
  3. Delete the sigma tries from the condition automata and rules.
    1. See table 6. for results on some of the rule sets. It is obvious from the data that the reduction is greater for rule sets where the condition trees are shallow; as the conditions are merged, the growth in the number of states and transitions becomes more substantial -- in the extreme case the sigma of a joined condition might not grow at all, but the size of the state space always does considerably.
Table 6. Improvements in memory usage due to removing the sigma trie from the condition and rule FSMs.
Memory consumption is measured as percentages of the 4GB system memory.
Rule set Before After Reduction
apertium-br-fr.br-fr_cg2_orig.fst 23.5% 17.8% 24.25%
apertium-br-fr.br-fr_cg2_l2.fst 9.6% 4.4% 54.16%
apertium-br-fr.br-fr_cg2_l1.fst 5.1% 1.3% 74.5%
hun_cg2_l1.fst 0.5% 0.1% 80%
hun_cg2_l3.fst 2.1% 1.5% 28.57%
apertium-sme-fin.fin-sme_cg2_l1.fst 21% 4.1% 80.47%
apertium-sme-fin.fin-sme_cg2_l2.fst 32.3% 8.9% 72.44%

TODO?'s[edit]

  1. cg-comp is not so strict as fomacg about the position part in contextual tests, e.g. it accepts 1*, not just *1. Loosen up fomacg?

Research[edit]

  • Decrease the number of rules applied (see in the proposal).
  • When and how can rules be merged?

Miscellaneous / Extra[edit]

Hungarian CG grammar[edit]

Write a simple CG grammar for Hungarian, somewhere around 50-150 rules.

  • Read Pasi Tapnainen's The Constraint Grammar Parser CG-2.
  • Read the contents of cg_material.zip.
  • Install Apertium, VISL CG3 and a language pair (cy-en)
  • Study the CG grammar of an Apertium language.
  • Write a Hungarian grammar that covers the sentences in this sample file
    • The tags will be based on those in KR-code[2]. See the next task.
  • TODOs:
    • add a sentence to the rasskaz file for the "az a" construct.
    • prevb disambiguation
  • The file is here.

Hunmorph converter[edit]

Write a converter from ocamorph's output to Apertium's format.

  • Again, use the sentences in this sample file as reference.
  • While a C-based converter would definitely be possible, I opted for a foma-based (xfst -- lexc?) implementation, so that this task also serves for practice.
  • TODOs:
    • some analyses are repeated in the output: fix them! -- Not a real fix, because the problem wasn't in the script, but in ocamorph, but I wrote a python script that discards repeated analyses.
    • hunmorph_to_apertium.foma does not handle compounds (+)
    • two-phase conversion (handle special words first, then the general part)
    • a few non-general conversions:
      • "mikor<CONJ>" => <cnjadv>, "mikor<ADV>" => <itg>
      • "hova" => <itg>
      • "így<CONJ>" => <cnjadv>
      • "ekkor<ADV>" => <cnjadv>?
      • "aztán<CONJ>" => <cnjadv>
  • Apparently a pure lexc/foma based implementation wasn't possible (at least with the one line -- one word output format I decided ask from hunmorph). The reason is that the tagset of hunmorph and Apertium does not match exactly, and therefore I needed to add exceptional rules for certain words and lexical categories. However, flookup returns all possible analyses, so in this case it returned both the exceptional and the regular translation. The current solution consists of four files:
    • kr_to_apertium_spec.lexc contains the words / lexical categories that need special treatment (vbser, prns, "ez"/"az", etc.)
    • kr_to_apertium.lexc contains the rest of the categories, i.e. whose the translation was straightforward
    • kr_to_apertium.foma is a simple foma script that writes the previous two into the binary file kr_to_apertium.fst
    • hunmorph_to_apertium.cpp loads the two FSTs from the binary file and applies them to the readings. First it tries to parse the reading with the _spec FST; if it fails, it reverts back to the general one. This mechanism ensures that all readings get only one translation, and it is also the correct one.

ATT -> lttoolbox compiler[edit]

Write an ATT FST format reading for lttoolbox. A useful practice for moving from foma to lttoolbox. Since lttoolbox lacks some of the functionaty needed, the compiler will most likely stay in foma, but lttoolbox might work as the runtime component.

  • ATT format
  • "<spectie> the ATT->lttoolbox thing should be a simple as : beer = t.insertSingleTransduction(alphabet(L'e',L'e'), beer);"
    • The actual implementation turned out to be a little more difficult. Anyway, the code is here.
    • The transducer we used for testing, kaz.att, didn't work with lt-proc, so Fran told me to create two transducers instead of one: one for words (main) and one for punctuation (final).

References[edit]

  1. Hulden, Mans. 2011. Constraint Grammar parsing with left and right sequential finite transducers. In: Proceedings of the 9th International Workshop on Finite State Methods and Natural Language Processing, pages 39--47.
  2. András Kornai, Péter Rebrus, Péter Vajda, Péter Halácsy, András Rung, Viktor Trón. 2004. Általános célú morfológiai elemző kimeneti formalizmusa (The output formalism of a general-purpose morphological analyzer). In: Proceedings of the 2nd Hungarian Computational Linguistics Conference.