== Tasks ==

=== XML format ===
See [[User:David_Nemeskey/CG_XML_brainstorming]].
=== Rule applier ===
Rough execution flow:
# Load the rules
## DELIMITERS must also be converted to an FST and saved.
## Rule organization:
##* separate files in a directory?
##* one file per section?
##* one big file (needs another "index" file then)
# Read the input stream cohort by cohort
#* <code>StreamReader</code> class in ''fomacg_stream_reader.h''
#* Read until a delimiter is found (see above)
# Convert the cohorts from the Apertium stream format to fomacg's format
#* <code>Converter</code> in ''fomacg_converter.h''
#* ''apertium_to_fomacg.foma''
#* conversion from <code>wchar_t</code> to UTF-8 <code>char</code>
# Apply rules in the order of the sections they are defined in
# Convert the cohorts back to the Apertium stream format
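In code, the flow might look roughly like this. This is a minimal sketch: all names (<code>RuleFst</code>, <code>load_rules</code>, <code>read_until_delimiter</code>, <code>to_fomacg_utf8</code>, <code>to_apertium</code>) are illustrative stand-ins, not the actual API; the real classes live in ''fomacg_stream_reader.h'' and ''fomacg_converter.h'' and differ in detail.

<pre>
// Minimal sketch of the execution flow above; all names are stand-ins.
#include <iostream>
#include <string>
#include <vector>

struct RuleFst {                                  // one compiled CG rule
  std::string apply(const std::string& s) const;  // identity if the rule doesn't fire
};

std::vector<RuleFst> load_rules(const std::string& path);       // step 1
bool read_until_delimiter(std::wistream& in, std::wstring& s);  // step 2
std::string to_fomacg_utf8(const std::wstring& s);              // step 3
std::wstring to_apertium(const std::string& s);                 // step 5

int main() {
  std::vector<RuleFst> rules = load_rules("hun.fst");
  std::wstring raw;
  while (read_until_delimiter(std::wcin, raw)) {  // one sentence at a time
    std::string s = to_fomacg_utf8(raw);          // wchar_t -> UTF-8 + format
    for (const RuleFst& r : rules)                // step 4: section order
      s = r.apply(s);
    std::wcout << to_apertium(s);
  }
}
</pre>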
=== Compiler ===

==== Code ====
* Read the [https://apertium.svn.sourceforge.net/svnroot/apertium/branches/fomacg/ fomacg] code<ref>Hulden, Mans. 2011. Constraint Grammar parsing with left and right sequential finite transducers. In: Proceedings of the 9th International Workshop on Finite State Methods and Natural Language Processing, pages 39--47.</ref> and understand what it does.
* ''Add comments and do a little refactoring (e.g. a separate function for creating ''@*word@*'' FSTs).''
* Check and test whether it works on all grammars; fix it if it doesn't.
==== Development ====
# ''DELIMITERS must also be converted to an FST and saved.''
#* ''I have run into a foma bug, which also corrupted the output of all rules that contained lemma tags. I have patched the code on my side and opened a [http://code.google.com/p/foma/issues/detail?id=49 bug report] ([http://code.google.com/p/foma/issues/detail?id=50 two, actually]).''
# ''fomacg'' treats all tags as symbols. Usually this is OK, but for lemmas it may not be optimal:
## ''it blows up the number of symbols (if not the transducer) => turns out the transducer is smaller this way;''
## ''might be slower to parse? Or faster => faster; if we use exploded strings, ''fomacg_proc'' becomes about 33% slower.''
# ''If one tag is a postfix of another, and the longer tag is not mentioned in the rule, ''fomacg'' might trigger the wrong rule, believing that it sees the shorter tag. E.g. if there is a <code>"predet "</code> and a <code>"det "</code> tag, and the rule is <code>SELECT (det)</code>, it will select <code>predet</code>s as well. This is because in ''fomacg'', the symbol that corresponds to a tag is its name plus a space; in other words, only the right boundary of a symbol is checked.''
## ''Change the symbols from <code>"tag "</code> to <code>"<tag> "</code> to ensure the "postfix property", except for word forms and lemmas, which are already enclosed in quotes.''
## Keep as much of Apertium's format as possible?
# ''Insert ">>>" and "<<<" before and after the sentence, respectively.''
# Make the rules fail-fast?
==== Speed problems ====
This first version of ''fomacg_proc'', where rules are separate FSTs applied in order, has some speed issues. With the 40-rule CG-2 subset of my Hungarian grammar, it disambiguates a 13,760-token test corpus (the ''rasskaz'' sentences concatenated 32 times) in 6.4 seconds. Under the same conditions, ''cg-proc'' needs 0.3 seconds.
===== Speed measurements summary (see below for details) =====
# <code>apply_down</code>
## original rules, <code>strcmp</code>-based application check: 6.4 seconds
## '''deleteX rules''', '''<code>strlen</code>'''-based application check: 5.9 seconds
## deleteX rules, '''FSA (<code>apply_down</code>)'''-based application check: 9-11 seconds
## deleteX rules, FSA ('''<code>apply_detmin_fsa</code>''')-based check: 1.47 seconds
## deleteX rules, FSA (<code>apply_detmin_fsa</code>)-based check, '''smallest-first condition trees''': 1.35 seconds, running time: 0.95 seconds.
## deleteX rules, FSA (<code>apply_detmin_fsa</code>)-based check, smallest-first condition trees, '''<code>apply_detmin_fst</code> conversion''': 1.3 seconds, running time: 0.87 seconds.
## deleteX rules, FSA ('''<code>custom_detmin_fsa</code>''')-based check, smallest-first condition trees, <code>apply_detmin_fst</code> conversion: 1.1 seconds, running time: 0.75 seconds.
## deleteX rules, FSA (<code>custom_detmin_fsa</code>)-based check, smallest-first condition trees, <code>apply_detmin_fst</code> conversion '''both ways''': 1 second, running time: 0.655 seconds.
## deleteX rules, FSA ('''string vector only''' <code>custom_detmin_fsa</code>)-based check, smallest-first condition trees, <code>apply_detmin_fst</code> conversion both ways: 1 second, running time: 0.648 seconds (but Breton gained a lot from this, see table!).
## deleteX rules, FSA ('''<code>common_detmin_fsa</code> with set''')-based check, smallest-first condition trees, <code>apply_detmin_fst</code> conversion both ways: 1 second, running time: '''0.623 seconds''' (but again, Breton gained a lot more from this).
## deleteX rules, FSA (<code>common_detmin_fsa</code> with '''vector''')-based check, smallest-first condition trees, <code>apply_detmin_fst</code> conversion both ways: '''0.93''' seconds, running time: 0.559 seconds (again, Breton... you get the idea).
## deleteX rules, FSA (<code>common_detmin_fsa</code> with vector)-based check, '''<code>common_apply_down</code>-based rule application''', smallest-first condition trees, <code>apply_detmin_fst</code> conversion both ways: 0.72 seconds, running time: 0.346 seconds.
## deleteX rules, FSA (<code>common_detmin_fsa</code> with vector)-based check, <code>common_apply_down</code>-based rule application, '''<code>common_create_sigmatch</code>-based string->int conversion''', smallest-first condition trees, <code>apply_detmin_fst</code> conversion both ways: 0.655 seconds, running time: 0.285 seconds (''note that '''this running time is less than the full execution time of VISL-GC''', so we're on the right track! Unfortunately, we are not there with Breton yet.'')
## Same as above, but after optimising the input stream reader code and getting rid of the deques: '''0.615''' seconds, running time: '''0.245 seconds'''.
# <code>fsm_compose</code>
## simple rules, <code>fsm_intersect</code>-based application check: 45 seconds
## deleteX rules, <code>fsa->statecount</code>-based application check: '''28.3''' seconds
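To make the <code>strcmp</code>/<code>strlen</code> distinction in items 1.1-1.2 concrete, here is a minimal sketch of the application check using foma's C apply API (<code>apply_init</code>/<code>apply_down</code> from ''fomalib.h''); it is a sketch, not the actual fomacg_proc code. Since a firing deleteX rule always shortens the sentence, comparing lengths suffices; the original rules require a full string comparison.

<pre>
// Sketch only: handle initialisation and error handling omitted.
#include <cstring>
extern "C" {
#include <fomalib.h>   // struct apply_handle, apply_down()
}

// Returns true if the rule FST behind 'h' changed the sentence.
bool rule_applied(apply_handle* h, char* sentence, bool delete_x) {
  char* out = apply_down(h, sentence);   // first output, or NULL on reject
  if (out == nullptr) return false;
  return delete_x ? std::strlen(out) < std::strlen(sentence)   // 1.2
                  : std::strcmp(out, sentence) != 0;           // 1.1
}
</pre>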
<div align="center">
{| class="wikitable"
|+ align="bottom" | ''Table 0. Running times for different languages.''
! Method (ID in list above)
! hun init
! hun run
! br init
! br run
|-
|1
|0s
|6.4s
|N/A
|N/A
|-
|2
|0s
|5.9s
|N/A
|N/A
|-
|3
|0s
|9-11s
|N/A
|N/A
|-
|4
|0s
|1.47s
|N/A
|N/A
|-
|5
|'''0.37s'''
|0.95s
|N/A
|N/A
|-
|6
|0.37s
|0.87s
|'''7.9s'''
|4.9s
|-
|7
|0.37s
|0.75s
|7.9s
|3.54s
|-
|8
|0.37s
|0.655s
|7.9s
|3.13s
|-
|9
|0.37s
|0.648s
|7.9s
|2.73s
|-
|10
|0.37s
|0.623s
|7.9s
|2.12s
|-
|11
|0.37s
|0.559s
|7.9s
|1.566s
|-
|12
|0.37s
|0.346s
|7.9s
|1.088s
|-
|13
|0.37s
|0.285s
|7.9s
|0.96s
|-
|14
|0.37s
|'''0.245s'''
|7.9s
|'''0.916s'''
|}
</div>
===== Simple ideas to speed it up =====
# <span style="color:#c07070">do not use symbols for tags, etc. (see above): </span><span style="color:#900000">''execution time '''grows''' by 33%''</span>
# <span style="color:#c07070"><code>compose net</code>s on the same priority level: ''the FSTs become too big too fast -- on some simpler rules:''
## rule: 3.5 kB, 44 states, 176 arcs
## rules: 446.0 kB, 1965 states, 28482 arcs
## rules: 4.9 MB, 17303 states, 320664 arcs
## rules: 61.5 MB, 163945 states, 4029156 arcs</span>
# <span style="color:#c07070"><code>union net</code>s on the same priority level: ''the nets don't grow as fast as with <code>compose</code>, but still not usable:''
## rule: 3.5 kB, 44 states, 176 arcs
## rules: 47.3 kB, 202 states, 2957 arcs
## rules: 238.8 kB, 856 states, 15208 arcs
## rules: 1.3 MB, 3988 states, 82261 arcs
## rules: 7.2 MB, 20296 states, 471188 arcs
## rules: 53.5 MB, 131639 states, 3504119 arcs</span>
===== Clever ideas =====
# <span style="color:#c07070">''Use fail-fast FSTs. The idea was that these could speed up the computation because inapplicable rules don't have to copy the whole string over; they can return on the first error. I have tried to augment the <code>ConstrainR</code> and <code>ConstrainS</code> pattern in two ways:''
## ''compose <code>$"$A$ "</code> at the end: worked, but the rule FSTs became larger and slower;''
## ''instead of deleting <code>$1$</code> and <code>#1#</code> from the string, I composed <code>["$1$ " => _ MarkedReading & NotMarkedReading] & $"$1$ "</code> at the beginning. This didn't work: the rules only recognized cohorts, not sentences. And if I change the rule to <code>?* rule ?*</code>, it becomes non-deterministic.''
## ''Maybe this idea wasn't that clever after all -- the rule might have to read the whole sentence anyway...''</span>
# Do not pass the sentence to the rule, just the current word (+ the left & right contexts).
## Unfortunately, this does not mitigate the need to read the sentence over and over again, because of right contexts: a rule can be enabled by another that changes something in a later word.
# <span style="color:#c07070">''Instead of applying the rules to the sentence string, convert the string to a (linear) FSA and compose the rules with it.''</span>
## <span style="color:#900000">'''''Unfortunately, this is even slower: on one of my machines, instead of 6.5 seconds, it takes 47!'''''</span> <span style="color:#c07070">''Why? It could be because:''
### ''All <code>fsm_</code> functions are destructive. However, about half of the time (most, if the TODO above is implemented), I needed the old transducer, so I had to <code>fsm_copy</code> it.''
### ''Why indeed? According to callgrind ('''note: the numbers are not entirely accurate, because there was a bug in the program when I measured them; <code>fomacg_to_fsa</code> could be around 13%'''):''
#### ''28.52% <code>fomacg_to_fsa</code> == <code>fsm_concat</code>''
#### ''23.01% <code>fsm_intersect</code>''
#### ''43.19% <code>fsm_compose</code>''
### ''If we go deeper:''
#### ''24.01% <code>fsm_merge_sigma</code> (called by all three)''
#### ''24.01% <code>fsm_minimize</code> (called 82.87% of the time by <code>fsm_concat</code>, and 16.05% by <code>fsm_intersect</code>)''
### ''What can we take home from this?''
#### ''About 50% of the time is spent in <code>fsm_compose</code>, which we cannot avoid. It might be possible, however, to write a custom <code>fsm_compose</code> that exploits the fact that one of its inputs is not a general FST, but a linear FSA.''
#### ''<code>fsm_intersect</code> takes about 50% less time than <code>fsm_compose</code> (both are called once per rule), so trying the intersection-based method looks promising (though the sentence won't be linear in that case).''
#### ''The overhead of <code>fsm_minimize</code> could be avoided entirely by using a custom <code>fsm_concat</code> to construct the sentence. It is a linear automaton, so no minimisation is needed.''
## ''Sometimes <code>apply down</code>'ing a rule works, but composing it with the sentence FSA doesn't. This was a bug in fomacg.''
## ''Currently we are at 28.3 seconds.''
### ''<code>fsm_concat_custom()</code> does not call <code>fsm_minimize()</code>.''
### ''We use rules that delete the <tt>#X#</tt> readings.''</span>
# <span style="color:#009000">''Mans's original rules replaced the <code>#0# </code> in front of filtered readings with an <code>#X# </code>. However, nothing prevents us from actually ''removing'' those readings; since this way the sentence becomes shorter with each rule applied, subsequent rule applications / checks become faster. I call this new behavior "'''deleteX'''", and it is realised by composing the transducer ''cg_DeleteX.foma'', which deletes the <code>#X# </code>'d readings, onto all rules.''
## As can be seen in Table 0, this is indeed the case.
## The situation gets a bit more tricky if we replace string-based <code>apply_down()</code> with <code>common_apply_down()</code> (see above).
### ''As we are dealing with numbers now, it turns out that we would be better off going back to replacing <code>#0# </code> with <code>#X# </code>, as that wouldn't require creating and filling a new <code>vector</code> of <code>Symbol</code>s.''
### However, <code>callgrind</code> tells us that we spend 71.23% of our time checking rule conditions, and only about 6.75% applying rules; and the performance of the former is, again, heavily affected by the length of the sentence.
### ''So it may just be that we are still better off with deleteX rules. However, if we manage to decrease the number of condition checks, we might reconsider dropping them.''
## ''As it turns out, ''cg_DeleteX.foma'' is not enough; we are left with <code>#X# -> 0</code> branches in our FSTs that are never traversed, yet take up space.''
### ''The culprits were the ''initial filter'' pattern and the <code>cg_match_cohort_careful()</code> function in fomacg. Modifying these got rid of the unnecessary branches. However, even though it isn't used anywhere now, <code>#X# </code> remained in the sigma of the FSTs; apparently ''foma'' does not remove unused symbols from the alphabet of a transducer created via composition.''
### ''The effect of removing said branches can be seen in Table 5. As I could not recreate the condition trees I am using for the performance tests, I had to create new ones. I sorted the rules in alphabetical order and created complete binary trees from them (3-level trees for Hungarian, 2-level trees for Breton). Therefore, the numbers in Table 5 cannot be directly compared with those in Table 0; however, the "before" numbers correspond to row 14 of Table 0.''</span>
# <span style="color:#009000">''(Mans's suggestion): Create an automaton for each rule that doesn't change anything, but only accepts the sentences the associated rule would change. "This automaton can be determinized and minimized and applied *much* faster than the corresponding transducer."''</span>
## <span style="color:#c07070">If the automaton is just used to check the applicability of a single rule, </span><span style="color:#900000">'''execution time goes up to 9 seconds'''. :(</span>
### <span style="color:#c07070">''According to ''callgrind'', <code>apply_follow_next_arc</code> is called 1,358,508 times when we used only FSTs, and 2,589,103 times with the new method.''</span>
### <span style="color:#c07070">''In the original test set (1/32nd of the performance test corpus), there were 3541 attempts at rule applications, 79 of which were successful. That means 3541 FST applications in the first case, and 3541 FSA + 79 FST applications in the second.''</span>
### <span style="color:#c07070">''This is especially interesting if we know that the total number of states in the rule FSTs is 3,467, while in the condition FSAs it is only 1,597. The total number of arcs in the FSAs also seems to be about half of that in the FSTs. So why?''</span>
## <span style="color:#007000">''If, however, we use the new <code>apply_detmin_fsa</code> function to execute a deterministic minimal FSA (even with linear transition search), without all the cruft in <code>apply_net</code>, we are down to '''1.45 seconds'''.''</span>
## <span style="color:#909000">''It could be possible to check the applicability of a number of rules with a single automaton. Researching this right now...''
### ''The best speed could probably be achieved by building FSAs from groups of conditions whose sizes are powers of two. In this way, we could find the rule to apply in ''<tt>log(G)</tt>'' checks, thereby breaking the linearity barrier.''
### ''However, I have found that the limit where the FSAs are still of manageable size is about 1000 states. This (depending on the number of states of the original FSAs) is usually reached after unioning 3-6 FSAs. This means that while ''<tt>log(G)</tt>'' is out of reach, we can still reduce our running time to about 1/3-1/6 of what it is now.''
### ''Tests of different ways of building condition trees (see the sketch after this list):''
#### ''Flat: no trees are used.''
#### ''Smallest-first: always merge the two smallest trees.''
#### ''Fixed height: full binary trees with a fixed number of levels. In Table 1, we use trees with 3 levels.''</span>
# Compress the condition FSAs, so that we can check more rules with a single FSA application.
## Try Automata Mista (Huet, 2003)
## Inter-automata bits (such as "skip this reading")
## Similar rule(part)s, e.g. <code>SELECT np IF "Mari";</code> and <code>SELECT n IF "török";</code>: the FSA is the same, the only difference is the tags
### flag diacritics?
### push-down automata?
# '''TODO''' Index rules by their required tags (word-form, target, C conditions, etc.), and skip those that cannot possibly be activated by the sentence.
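As a concrete illustration of the smallest-first strategy, here is a sketch of how the condition trees could be built and queried. <code>TreeNode</code>, <code>matches()</code> and the function names are illustrative; only the foma calls (<code>fsm_union</code>, <code>fsm_determinize</code>, <code>fsm_minimize</code>, <code>fsm_copy</code>) and the <code>statecount</code> field are real.

<pre>
// Sketch: smallest-first condition trees, built Huffman-style.
#include <memory>
#include <queue>
#include <vector>
extern "C" {
#include <fomalib.h>
}

bool matches(fsm* fsa, const char* sentence);  // assumed: runs the FSA over the sentence

struct TreeNode {
  fsm* fsa;                     // union of all conditions below this node
  int rule_id = -1;             // leaf: the rule this condition belongs to
  TreeNode* left = nullptr;
  TreeNode* right = nullptr;
};

// Always merge the two trees whose root FSAs have the fewest states.
TreeNode* build_smallest_first(std::vector<TreeNode*> leaves) {
  auto bigger = [](TreeNode* a, TreeNode* b) {
    return a->fsa->statecount > b->fsa->statecount;
  };
  std::priority_queue<TreeNode*, std::vector<TreeNode*>, decltype(bigger)>
      pq(bigger);
  for (TreeNode* leaf : leaves) pq.push(leaf);
  while (pq.size() > 1) {
    TreeNode* a = pq.top(); pq.pop();
    TreeNode* b = pq.top(); pq.pop();
    TreeNode* parent = new TreeNode;
    // fsm_ functions are destructive, hence the copies.
    parent->fsa = fsm_minimize(fsm_determinize(
        fsm_union(fsm_copy(a->fsa), fsm_copy(b->fsa))));
    parent->left = a;
    parent->right = b;
    pq.push(parent);
  }
  return pq.top();
}

// Naive descent: if a node's FSA rejects the sentence, no rule below
// it can fire, so the whole subtree is skipped.
void find_applicable_rules(const TreeNode* n, const char* sentence,
                           std::vector<int>& hits) {
  if (n == nullptr || !matches(n->fsa, sentence)) return;
  if (n->rule_id >= 0) hits.push_back(n->rule_id);
  find_applicable_rules(n->left, sentence, hits);
  find_applicable_rules(n->right, sentence, hits);
}
</pre>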
<div align="center">
{| class="wikitable"
|+ align="bottom" | ''Table 1. The effect of different methods of grouping conditions on running time''
! Condition grouping
! Total running time
! FST loading
! FST application
! Conditions checked
! Tree size
! File size
|-
|Flat (no tree)
|1.476s
|0.041s
|1.435s
|107,488
|N/A
|76 kB
|-
|Smallest first
|1.355s
|<span style="color:#b00000">'''0.355s'''</span>
|0.9s
|42,336
|2-4 rules
|<span style="color:#b00000">'''4.4 MB'''</span>
|-
|Fixed height
|2.070s
|<span style="color:#b00000">'''1.167s'''</span>
|'''0.84s'''
|36,832
|4 rules
|<span style="color:#b00000">'''12 MB'''</span>
|}
{| class="wikitable"
|+ align="bottom" | ''Table 2. Running the algorithm on different languages.<br>* Smallest-first trees.<br>** Hit the physical memory limit.''
! Language
! Number of rules
! Number of FSTs
! Size of FST file
! Memory consumed by foma when loading
! Memory consumed by foma
! Size of VISL-GC binary
! Initialisation
! Running time
! VISL-GC running time
|-
|Hungarian
|35
|95
|3.6MB
|?
|2.7MB
|8kB
|0.41s
|0.94s*
|0.3s
|-
|Breton-French
|226
|602
|39MB
|1GB
|666MB
|36kB
|7.9s
|4.9s
|0.8s
|-
|sme-fin
|1172
|3149
|<span style="color:#b00000">'''514MB'''</span>
|3550MB**
|2476MB
|184kB
|
|
|
|}
</div>
===== Hacky hacks =====
# <span style="color:#007000">''An <code>apply_detmin_fst()</code> instead of <code>apply_down()</code> for the apertium <-> fomacg converters''</span>
# Sigma optimisation
## <code>apply_create_sigmatch()</code> is responsible for about 54% of <code>apply_detmin_fsa()</code>'s time. Basically, we re-read the input string each time we run a condition FSA.
## What we can do:
### <span style="color:#007000">''Roll our own <code>custom_create_sigmatch()</code> that splits the input at spaces and treats each word/tag/etc. as a single symbol, not as a character sequence.''</span>
#### ''Currently, only words/tags/etc. that occur in the rules are handled this way. Extending this idea to unknown tags (those whose characters are classified as <code>IDENTITY_SYMBOL</code>) results in fewer transitions in the FSA, and only one (failed) look-up in the sigma trie for each unknown tag, as opposed to one for each of its letters.''
#### ''Furthermore, while the original <code>custom_detmin_fsa()</code> and <code>custom_create_sigmatch()</code> required that, apart from the symbol vector, the original string be passed to them too, we can actually get away without the latter. Instead of using a <code>sigmatch</code> table of the size of the string, we can use one of the size of the vector, which is much shorter. Maybe because of this (fewer memory re-allocations), this simple modification to the functions resulted in a '''14.5%''' speed increase for the Breton grammar.''
#### Table 3 below summarizes the results: we have achieved a '''16-36% speed-up''' and decreased the running time on the Hungarian corpus to '''0.75s'''. The last two columns show the effect of factoring in the f2a direction of hack #1, as well as the modification to <code>custom_detmin_fsa()</code> described above, which allowed a '''further 1.5-14.5% speed-up'''.
### Sigma unification (see the sketch after this list):
#### <span style="color:#007000">''Create an FSA that contains all symbols from all rules and conditions. Its sigma will be the "canonical"/"universal" one.''</span>
#### <span style="color:#007000">''Move around the sigmas in all condition FSAs and rule FSTs so that the symbols above have the same codes.''</span>
#### <span style="color:#007000">''Write a <code>common_detmin_fsa()</code> that, if it sees a symbol unknown to the FSA in a state, treats it as an <code>IDENTITY_SYMBOL</code> if it can. There are several ways to achieve this end:''</span>
##### <span style="color:#007000">''Have a <code>sigma</code> set for all FSAs; if an input symbol is not in the set, replace it. Note that for this use case, <code>unordered_set</code> is '''slower''' than <code>set</code>, because the number of elements in a single set is small.''</span>
##### Have a mapping that translates all symbols in the universal sigma to the local one. For symbols in the local sigma this will be the identity; for the rest, <code>IDENTITY</code> :)
##### <span style="color:#007000">''The latter can be implemented with a <code>vector</code>, which is much faster than a <code>map</code>, though its memory requirements are higher (<code>O(|conditions| * |sigma|)</code>).''</span>
##### Results are in Table 4, below.
#### <span style="color:#007000">''Let the universal FSA create the sigmatch table, and pass its <code>apply_handle</code> to the others in <code>common_detmin_fsa()</code>. This enables us to convert the symbols to sigma ids only once per rule application.''</span>
### <span style="color:#007000">''Get rid of strings for good:''</span>
#### <span style="color:#007000">''Write a <code>common_apply_down()</code> that''</span>
##### <span style="color:#007000">''handles nondeterminism (with the restriction that for an input there is only a single output), and''</span>
##### <span style="color:#007000">''whose input and output are both sigma ids.''</span>
#### <span style="color:#007000">''Because we currently delete filtered readings instead of just <code>#X# </code>'ing them,''</span>
##### <span style="color:#007000">''we have to pass a sequence of <code>(sigma, symbol)</code> pairs to it, or''</span>
##### <span style="color:#007000">''actually, we pass a sequence of <code>Symbol</code>s, which store the sigma id, its position in the string and its length, to avoid string creation overhead; or''</span>
##### we can just bring back the <code>#X# </code>.
### <span style="color:#007000">''See Table 0 for the results; for Hungarian, we beat VISL-GC a bit; for Breton, we need to find another 20%.''</span>
# <code>apply_down()</code> still accounts for a rather large percentage of the running time. This is because it is called not only by <code>apply_rules()</code>, but also by the code that converts the apertium stream format to fomacg's format and vice versa.
## <span style="color:#007000">''The latter FST is definitely deterministic. Using the new <code>apply_detmin_fst()</code> function instead of <code>apply_down()</code> decreased the running time on the Hungarian corpus '''from 0.94s to 0.87s'''.''</span>
## <span style="color:#c07070">''Isn't the rule FST deterministic as well?!''</span> ... <span style="color:#900000">'' 'course not.''</span>
<div align="center">
{| class="wikitable"
|+ align="bottom" | ''Table 3. Improvements due to <code>custom_detmin_fsa()</code>.<br>The numbers were achieved by using the smallest-first tree building method.<br>(*) As above; additionally, with hack #1 in the f2a direction too.<br>(**) Compared to the regular <code>custom_detmin_fsa()</code>, but with hack #1.''
! Language
! <code>apply_detmin_fsa()</code>
! <code>custom_detmin_fsa()</code>
! Speed-up
! <code>custom_detmin_fsa()</code>(*)
! Speed-up(**)
|-
|Hungarian
|0.868s
|0.745s
|16.5%
|0.648s
|1.5%
|-
|Breton-French
|4.82s
|3.54s
|36%
|2.73s
|'''14.5%'''
|}
</div>

<div align="center">
{| class="wikitable"
|+ align="bottom" | ''Table 4. Improvements due to <code>common_detmin_fsa()</code>.<br>The numbers were achieved by using the smallest-first tree building method.''
! Language
! <code>custom_detmin_fsa()</code>
! <code>common_detmin_fsa()</code> w/ set
! Speed-up
! <code>common_detmin_fsa()</code> w/ vector
! Speed-up
|-
|Hungarian
|0.648s
|0.623s
|4%
|0.559s
|16%
|-
|Breton-French
|2.73s
|2.12s
|29%
|1.566s
|'''74%'''
|}
</div>

<div align="center">
{| class="wikitable"
|+ align="bottom" | ''Table 5. Improvements in memory usage and running time due to removing <code>#X# </code> from the initial filter pattern.''
! Language
! Before -- memory
! Before -- initialisation
! Before -- runtime
! After -- memory
! After -- initialisation
! After -- runtime
|-
|Hungarian
|7.1MB
|0.721s
|0.238s
|6.5MB
|0.658s
|0.232s
|-
|Breton-French
|7.4MB
|1.814s
|1.105s
|6MB
|1.67s
|1.007s
|}
</div>
==== Size problems ====
As mentioned above, union'ing the rule condition FSAs is a good way to speed up the rule application process. Unfortunately, the resulting automata may grow exponentially in size, making FSAs that check many rules at once unfeasible -- unless, of course, we can reduce their memory footprint. This section documents the experiments we conducted towards this goal.
# Automata Mista
## Due to Gérard Huet
### Huet, G. (2004). Automata mista. Verification: Theory and Practice, 271-273.
### Huet, G. (2005). A functional toolkit for morphological and phonological processing, application to a Sanskrit tagger. Journal of Functional Programming, 15(4), 573-614.
### While the code implementing AM is included in the papers, it is not immediately usable and is rather hard to read, because
#### the code is in ML (with the occasional OCaml library call), which isn't very intuitive;
#### it is written in a functional style, which, when translated line-by-line to Python, leads to abysmal performance;
#### the author couldn't be bothered to explain what certain function arguments or variables mean.
## Promise
### Huet reports huge savings; for example, the Sanskrit morphological analyser, an FSA of 200,000 states, could be represented by a trie of 7,500 nodes.
### Then again, FSAs derived from word lists have a structure that lends itself naturally to tries. Whether the method fares similarly well on generic FSAs remains to be seen.
## Implementation
### <span style="color:#009000">''I have translated the caml code to Python. If it seems promising, I'll rewrite it in C++.''</span>
## Evaluation
### <span style="color:#009000">''I fed the vocabulary of the Longman dictionary (more specifically, the strangely short variant that we have) to it. From the 47,051 tokens (words, numbers, g@rb@ge), the trie builder built a raw trie (with only forward sharing) of 150,222 nodes. Running the sharing algorithm (2/p9) on it reduced the number of nodes to 42,540 -- a 71.68% reduction.''</span>
### <span style="color:#009000">''Of course, due to how the trie is built (there is an empty node at the end of each path), about 47,050 of those nodes are trivial to merge; this decreases the reduction to a more humble '''58.77%'''.''</span>
### <span style="color:#009000">''Collapsing nonaccepting, nonbranching trie paths (not described in the papers) results in a further reduction of '''54.8%''', bringing the number of nodes down to 19,213. However, I don't see a way to decrease this number further, and the compression level is not nearly the same as the one reported for the Sanskrit analyzer.''</span>
### <span style="color:#009000">''At the same time, foma creates a 42,540-state automaton from the same word list, which hints at the not very surprising fact that "sharing" is basically the same as minimising a deterministic automaton. As such, the only advantage AM can boast of is that states can be collapsed, but I am not sure that we are better off, due to the additional bookkeeping required (the sequence of letters on arcs).''</span>
# Decrease the alphabet
## We cannot hope for large savings from this one, but <code>| </code> seems to be completely superfluous: the end of a reading is clearly marked by the beginning of the next one (<code>#0# </code>) or the end of the cohort (<code>#EOC# </code>).
### There is a possibility that this change might make the rules more complicated; still, we should try it.
## It is not necessary to save all FSAs to disk. In the case of a full binary tree with three levels, nodes 3, 5 and 7 (BFS ordering) are superfluous: if node 1 matches the sentence but node 2 does not, it is obvious that node 3 would, and so on (see the sketch after this list). The rule applier has already been built with this idea in mind, but the redundant trees are still saved to file and read into memory.
# <span style="color:#009000">''Delete the sigma tries from the condition automata and rules.''</span>
## <span style="color:#009000">''See Table 6 for results on some of the rule sets. It is obvious from the data that the reduction is greater for rule sets where the condition trees are shallow; as the conditions are merged, the growth in the number of states and transitions becomes more substantial -- in the extreme case, the sigma of a joined condition might not grow at all, but the size of the state space always grows considerably.''</span>
<div align="center">
{| class="wikitable"
|+ align="bottom" | ''Table 6. Improvements in memory usage due to removing the sigma trie from the condition and rule FSMs.<br>Memory consumption is measured as a percentage of the 4GB system memory.''
! Rule set
! Before
! After
! Reduction
|-
|apertium-br-fr.br-fr_cg2_orig.fst
|23.5%
|17.8%
|24.25%
|-
|apertium-br-fr.br-fr_cg2_l2.fst
|9.6%
|4.4%
|54.16%
|-
|apertium-br-fr.br-fr_cg2_l1.fst
|5.1%
|1.3%
|74.5%
|-
|hun_cg2_l1.fst
|0.5%
|0.1%
|80%
|-
|hun_cg2_l3.fst
|2.1%
|1.5%
|28.57%
|-
|apertium-sme-fin.fin-sme_cg2_l1.fst
|21%
|4.1%
|80.47%
|-
|apertium-sme-fin.fin-sme_cg2_l2.fst
|32.3%
|8.9%
|72.44%
|}
</div>
==== TODO?'s ====
# ''cg-comp'' is not as strict as ''fomacg'' about the position part in contextual tests; e.g. it accepts <code>1*</code>, not just <code>*1</code>. Loosen up ''fomacg''?
==== Research ====

==== Hungarian CG grammar ====
''Write a simple CG grammar for Hungarian, somewhere around 50-150 rules.''
* ''Read Pasi Tapanainen's The Constraint Grammar Parser CG-2.''
* ''Install Apertium, VISL CG3 and a language pair (cy-en)''
* Study the CG grammar of an Apertium language.
* ''Write a Hungarian grammar that covers the sentences in [https://apertium.svn.sourceforge.net/svnroot/apertium/incubator/apertium-hun-eng/texts/rasskaz.hun.txt this sample file]''
** ''The tags will be based on those in KR-code<ref>András Kornai, Péter Rebrus, Péter Vajda, Péter Halácsy, András Rung, Viktor Trón. 2004. Általános célú morfológiai elemző kimeneti formalizmusa (The output formalism of a general-purpose morphological analyzer). In: Proceedings of the 2nd Hungarian Computational Linguistics Conference.</ref>. See the [[#Hunmorph converter|next task]].''
* TODOs:
** ''add a sentence to the rasskaz file for the "az a" construct''
** ''prevb disambiguation''
* ''The file is [https://svn.code.sf.net/p/apertium/svn/incubator/apertium-hun-eng/dev/hun.rlx here].''
==== Hunmorph converter ====
* TODOs:
** ''some analyses are repeated in the output: fix them! -- Not a real fix (the problem wasn't in the script, but in ocamorph), but I wrote a Python script that discards repeated analyses.''
** ''hunmorph_to_apertium.foma does not handle compounds (+)''
** two-phase conversion (handle special words first, then the general part)
** ''a few non-general conversions:''
*** ''"mikor<CONJ>" => <cnjadv>, "mikor<ADV>" => <itg>''
*** ''"hova" => <itg>''
*** ''"így<CONJ>" => <cnjadv>''
*** ''"ekkor<ADV>" => <cnjadv>?''
*** ''"aztán<CONJ>" => <cnjadv>''
* ''Apparently a pure lexc/foma based implementation wasn't possible (at least with the one line -- one word output format I decided to ask from hunmorph). The reason is that the tagsets of hunmorph and Apertium do not match exactly, so I needed to add exceptional rules for certain words and lexical categories. However, <code>flookup</code> returns all possible analyses, so in this case it returned both the exceptional and the regular translation. The current solution consists of four files:''
** ''[https://svn.code.sf.net/p/apertium/svn/incubator/apertium-hun-eng/dev/kr_to_apertium_spec.lexc kr_to_apertium_spec.lexc] contains the words / lexical categories that need special treatment (<code>vbser</code>, <code>prn</code>s, "''ez''"/"''az''", etc.)''
** ''[https://svn.code.sf.net/p/apertium/svn/incubator/apertium-hun-eng/dev/kr_to_apertium.lexc kr_to_apertium.lexc] contains the rest of the categories, i.e. those whose translation was straightforward''
** ''[https://svn.code.sf.net/p/apertium/svn/incubator/apertium-hun-eng/dev/kr_to_apertium.foma kr_to_apertium.foma] is a simple foma script that writes the previous two into the binary file <code>kr_to_apertium.fst</code>''
** ''[https://svn.code.sf.net/p/apertium/svn/incubator/apertium-hun-eng/dev/hunmorph_to_apertium.cpp hunmorph_to_apertium.cpp] loads the two FSTs from the binary file and applies them to the readings. First it tries to parse the reading with the <code>_spec</code> FST; if that fails, it falls back to the general one. This mechanism ensures that each reading gets only one translation, and that it is the correct one. A sketch of the fallback logic follows.''
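The try-special-then-general logic might look like this. A minimal sketch using foma's apply API, not the actual contents of ''hunmorph_to_apertium.cpp''; the fallback-to-input behaviour for untranslatable readings is an assumption.

<pre>
// Sketch of the two-FST fallback: try the exceptional (_spec)
// transducer first and only consult the general one if it fails.
// The two apply handles are assumed to be initialised from the
// FSTs stored in kr_to_apertium.fst.
#include <string>
extern "C" {
#include <fomalib.h>
}

std::string translate_reading(apply_handle* spec, apply_handle* general,
                              char* reading) {
  char* out = apply_down(spec, reading);   // exceptional words/categories
  if (out == nullptr)
    out = apply_down(general, reading);    // regular translation
  return out != nullptr ? std::string(out)
                        : std::string(reading);   // pass through unknowns
}
</pre>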
==== ATT -> lttoolbox compiler ====
''Write an ATT FST format reader for lttoolbox. A useful practice for moving from foma to lttoolbox. Since lttoolbox lacks some of the functionality needed, the compiler will most likely stay in foma, but lttoolbox might work as the runtime component.''
* ''[[ATT format]]''
* ''"<spectie> the ATT->lttoolbox thing should be a simple as : beer = t.insertSingleTransduction(alphabet(L'e',L'e'), beer);"''
** ''The actual implementation turned out to be a little more difficult. Anyway, the code is [https://svn.code.sf.net/p/apertium/svn/incubator/apertium-hun-eng/dev/att_to_ltt/att_to_ltt.cpp here].''
** ''The transducer we used for testing, [https://svn.code.sf.net/p/apertium/svn/incubator/apertium-hun-eng/dev/att_to_ltt/kaz.att kaz.att], didn't work with <tt>lt-proc</tt>, so Fran told me to create two transducers instead of one: one for words ('''main''') and one for punctuation ('''final''').''
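The quoted one-liner extends a path state by state; for a whole word it might be used like this. A sketch based only on the calls in the quote plus <code>Transducer::setFinal()</code>; the assumption that a fresh <code>Transducer</code>'s initial state is state 0 is mine, and the real reader in ''att_to_ltt.cpp'' has to handle arbitrary ATT source/target states, so it is more involved.

<pre>
// Sketch: build a linear lttoolbox path that accepts "beer" and
// outputs it unchanged, in the spirit of the quoted IRC line.
#include <lttoolbox/alphabet.h>
#include <lttoolbox/transducer.h>

int main() {
  Alphabet alphabet;
  Transducer t;
  int state = 0;                // initial state of a fresh Transducer (assumption)
  for (wchar_t c : L"beer") {
    if (c == L'\0') break;
    // alphabet(in, out) yields the code of the symbol pair; each call
    // to insertSingleTransduction creates and returns a new state.
    state = t.insertSingleTransduction(alphabet(c, c), state);
  }
  t.setFinal(state);            // mark the end of the word as accepting
}
</pre>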
== References ==
<references/>
Latest revision as of 16:27, 29 October 2013
Contents
Tasks[edit]
XML format[edit]
See User:David_Nemeskey/CG_XML_brainstorming.
Rule applier[edit]
Rough execution flow:
- Load the rules
- DELIMITERS must also be converted to an FST and saved.
- Rule organization:
- separate files in a directory?
- one file per section?
- one big file (needs another "index" file then)
- Read the input stream cohort by cohort
StreamReader
class in fomacg_stream_reader.h- Read until a delimiter is found (see above)
- Convert the cohorts from the Apertium stream format to fomacg's format
Converter
in fomacg_converter.h- apertium_to_fomacg.foma
- conversion from
wchar_t
to utf-8char
- Apply rules in the order of the sections they are defined in
- Convert the cohorts back to the Apertium stream format
Compiler[edit]
Code[edit]
- Read the fomacg code[1] and understand what it does.
- Add comments do a little refactoring (e.g. separate function for creating @*word@* FSTs).
- Check and test if it works on all grammars, fix it if it isn't.
Development[edit]
- DELIMITERS must also be converted to an FST and saved.
- I have run into a foma bug, which also corrupted the output of all rules that contained lemma tags. I have patched the code on my side and opened a bug report (two, actually).
- fomacg treats all tags as symbols. Usually this is OK, but for lemmas, it may not be optimal
- it blows up the number of symbols (if not the transducer) => turns out the transducer is smaller this way;
- might be slower to parse? Or faster => faster, if we use exploded strings, fomacg_proc becomes about 33% slower.
- If one tag is a postfix of another, and the longer tag is not mentioned in the rule, fomacg might trigger the wrong rule, believing that it sees the shorter tag. E.g. if there is a
"predet "
and a"det "
tag, and the rule isSELECT (det)
, it will also selectpredet
s as well. This is because in fomacg, the symbol that corresponds to a tag is its name plus a space; in other words, only the right boundary of a symbol is checked.- Change the symbols from
"tag "
to"<tag> "
to ensure the "postfix property", except for word forms and lemmas, which are already enclosed in quotes. - Keep as much of Apertium's format as possible?
- Change the symbols from
- Insert ">>>" and "<<<" to before and after the sentence, respectively.
- Make the rules fail-fast?
Speed problems[edit]
This first version of fomacg_proc, where rules are separate FSTs applied in order, has some speed issues. With the 40-rule CG-2 subset of my Hungarian grammar, it disambiguates a 13760-token test corpus (the rasskaz sentences 32x concatenated) in 6.4 seconds. Under the same conditions, cg-proc needs 0.3 seconds.
Speed measurements (see below for details) summary[edit]
apply_down
- original rules,
strcmp
-based application check: 6.4 seconds - deleteX rules,
strlen
-based application check: 5.9 seconds - deleteX rules, FSA (
apply_down
)-based application check: 9 - 11 seconds - deleteX rules, FSA (
apply_detmin_fsa
)-based check: 1.47 seconds - deleteX rules, FSA (
apply_detmin_fsa
)-based check, smallest-first condition trees: 1.35 seconds, running time: 0.95 seconds. - deleteX rules, FSA (
apply_detmin_fsa
)-based check, smallest-first condition trees,apply_detmin_fst
conversion: 1.3 seconds, running time: 0.87 seconds. - deleteX rules, FSA (
custom_detmin_fsa
)-based check, smallest-first condition trees,apply_detmin_fst
conversion: 1.1 seconds, running time: 0.75 seconds. - deleteX rules, FSA (
custom_detmin_fsa
)-based check, smallest-first condition trees,apply_detmin_fst
conversion both ways: 1 seconds, running time: 0.655 seconds. - deleteX rules, FSA (string vector only
custom_detmin_fsa
)-based check, smallest-first condition trees,apply_detmin_fst
conversion both ways: 1 seconds, running time: 0.648 seconds (but Breton gained a lot from this, see table!). - deleteX rules, FSA (
common_detmin_fsa
with set)-based check, smallest-first condition trees,apply_detmin_fst
conversion both ways: 1 seconds, running time: 0.623s seconds (but again, Breton gained a lot more from this). - deleteX rules, FSA (
common_detmin_fsa
with vector)-based check, smallest-first condition trees,apply_detmin_fst
conversion both ways: 0.93 seconds, running time: 0.559s seconds (again, Breton... you get the idea). - deleteX rules, FSA (
common_detmin_fsa
with vector)-based check,common_apply_down
-based rule application, smallest-first condition trees,apply_detmin_fst
conversion both ways: 0.72 seconds, running time: 0.346s seconds. - deleteX rules, FSA (
common_detmin_fsa
with vector)-based check,common_apply_down
-based rule application,common_create_sigmatch
-based string->int conversion, smallest-first condition trees,apply_detmin_fst
conversion both ways: 0.655 seconds, running time: 0.285s seconds (note that this running time is less than the full execution time of VISL-GC, so we're on the right track! Unfortunately, we are not there with Breton yet.) - Same as above, but after having optimised the input stream reader code and getting rid of deques: 0.615 seconds, running time: 0.245s seconds.
- original rules,
fsm_compose
- simple rules,
fsm_intersect
-based application check: 45 seconds - deleteX rules,
fsa->statecount
-based application check: 28.3 seconds
- simple rules,
Method (ID in list above) | hun init | hun run | br init | br run |
---|---|---|---|---|
1 | 0s | 6.4s | N/A | N/A |
2 | 0s | 5.9s | N/A | N/A |
3 | 0s | 9-11s | N/A | N/A |
4 | 0s | 1.47s | N/A | N/A |
5 | 0.37s | 0.95s | N/A | N/A |
6 | 0.37s | 0.87s | 7.9s | 4.9s |
7 | 0.37s | 0.75s | 7.9s | 3.54s |
8 | 0.37s | 0.655s | 7.9s | 3.13s |
9 | 0.37s | 0.648s | 7.9s | 2.73s |
10 | 0.37s | 0.623s | 7.9s | 2.12s |
11 | 0.37s | 0.559s | 7.9s | 1.566s |
12 | 0.37s | 0.346s | 7.9s | 1.088s |
13 | 0.37s | 0.285s | 7.9s | 0.96s |
14 | 0.37s | 0.245s | 7.9s | 0.916s |
Simple ideas to speed it up[edit]
- do not use symbols for tags, etc (see above): execution time grows by 33%
compose net
s on the same priority level: the FSTs become too big too fast -- on some simpler rules:- rule: 3.5 kB. 44 states, 176 arcs
- rules: 446.0 kB. 1965 states, 28482 arcs
- rules: 4.9 MB. 17303 states, 320664 arcs
- rules: 61.5 MB. 163945 states, 4029156 arcs
union net
s on the same priority level: the nets don't grow as fast as withcompose
, but still not useable:- rule: 3.5 kB. 44 states, 176 arcs
- rules: 47.3 kB. 202 states, 2957 arcs
- rules: 238.8 kB. 856 states, 15208 arcs
- rules: 1.3 MB. 3988 states, 82261 arcs
- rules: 7.2 MB. 20296 states, 471188 arcs
- rules: 53.5 MB. 131639 states, 3504119 arcs
Clever ideas[edit]
- Use fail-fast FSTs. The idea was that these could speed up the computation because unapplicable rules don't have to copy the whole string over, they can return on the first error. I have tried to augment the
ConstrainR
andConstrainS
pattern in two ways:- compose
$"$A$ "
at the end: worked, but the rule FSTs became larger and slower; - instead of deleting
$1$
and#1#
from the string, I composed["$1$ " => _ MarkedReading & NotMarkedReading] & $"$1$ "
at the beginning. This didn't work, the rules only recognized cohorts, not sentences. And if I change the rule to?* rule ?*
, it becomes non-deterministic. - Maybe this idea wasn't that clever after all -- the rule might have to read the whole sentence anyway...
- compose
- Do not pass the sentence to the rule, just the current word (+ the left & right contexts).
- Unfortunately, this does not mitigate the need of reading the sentence repeatedly over-and-over, because of right contexts: a rule can be enabled by another that changes something in a later word.
- Instead of applying the rules to the sentence string, convert the string to an (linear) FSA and compose the rules with it.</span
- Unfortunately, this is even slower: on one of my machines, instead of 6.5 seconds, it takes 47! Why? It could be because:
- All
fsm_
functions are destructive. However, about half of the time (most, if the TODO above is implemented), I needed the old transducer so I had tofsm_copy
it. - Why indeed? According to callgrind (note: the numbers are not entirely accurate, because there was a bug in the program when I measured them.
fomacg_to_fsa
could be around 13%):- 28.52%
fomacg_to_fsa
==fsm_concat
- 23.01%
fsm_intersect
- 43.19%
fsm_compose
- 28.52%
- If we go deeper:
- 24.01%
fsm_merge_sigma
(called by all three) - 24.01%
fsm_minimize
(called 82.87% of the time byfsm_concat
, and 16.05% byfsm_intersect
)
- 24.01%
- What can we take home from this?
- About 50% of the time is spent in
fsm_compose
, which we cannot avoid. It might be possible, however, to write a custom <fsm_compose> that exploits the fact that one of its input is not a general FST, but a linear FSA. fsm_intersect
takes about 50% less than fsm_compose (both are called once per rule), so trying the intersection-based method looks promising (though the sentence won't be linear in that case)- The overhead of
fsm_minimize
could be avoided entirely by using a customfsm_concat
to construct the sentence. It is a linear automaton, so no minimisation is needed.
- About 50% of the time is spent in
- All
- Sometimes apply down'ing a rule works, but composing it to the sentence FSA doesn't. This was a bug in fomacg.
- Currently we are at 28.3 seconds
fsm_concat_custom()
does not callfsm_minimize()
.- We use rules that delete the #X# readings.
- Unfortunately, this is even slower: on one of my machines, instead of 6.5 seconds, it takes 47! Why? It could be because:
- Mans's original rules replaced the
#0#
in front of filtered readings with an#X#
. However, nothing prevents us from actually removing those readings... since this way the sentence becomes shorter with each rule applied, subsequent rule applications / checks become faster. I call this new behavior "deleteX", and it is realised by composing the transducer cg_DeleteX.foma that deletes the#X#
'd readings to all rules.- As can be seen in Table 0, this is indeed the case.
- The situation gets a bit more tricky if we replace string-based
apply_down()
withcommon_apply_down()
(see above).- As we are dealing with numbers now, it turns out that we would now be better off with getting back to replacing
#0#
with#X#
, as that wouldn't require creating and filling a newvector
ofSymbol
s. - However,
callgrind
tells us that we spend 71.23% of our time checking rule conditions, and only about 6.75% applying rules; and the performance of the former is, again, heavily affected by the length of the sentence. - So it may just be that we are still better off with deleteX rules. However, if we manage to decrease the number of condition checks, we might reconsider dropping them.
- As we are dealing with numbers now, it turns out that we would now be better off with getting back to replacing
- As it turns out, the cg_DeleteX.foma is not enough; we are left with
#X# -> 0
branches in our FSTs that are never traversed; yet, they take up space.- The culprits were the initial filter pattern and the
cg_match_cohort_careful()
function in fomacg. Modifying these got us rid of the unnecessary branches. However, even though it isn't used anywhere now,#X#
remained in the sigma of the FSTs; apparently foma does not remove unused symbols from the alphabet of the transducer created via composition. - The effect of removing said branches can be seen in Table 5. As I could not recreate the condition trees I am using for the performance tests, I had to create new ones. I sorted the rules in alphabetical order, and created complete binary trees from them (3-level trees for Hungarian, 2-level trees for Breton). Therefore, the numbers in Table 5 cannot be directly compared with those in Table 0; however, the "before" numbers correspond to row 14 of Table 0.
- The culprits were the initial filter pattern and the
- (Mans's suggestion): Create an automaton for each rule that doesn't change anything, but only accepts the sentence the associated rule would change. "This automaton can be determinized and minimized and applied *much* faster than the corresponding transducer."
- If the automaton is just used to check the applicability of a single rule, execution time goes up to 9 seconds. :(
- According to callgrind,
apply_follow_next_arc
is called 1,358,508 times when we used only FTSs, and 2,589,103 times with the new method. - In the original test set (1/32nd of the performance test corpus), there were 3541 attempts at rule applications, 79 of which were successful.That means 3541 FST application in the first case, and 3541 FSA + 79 FST applications in the second.
- This is especially interesting if we know that the total number of states in the rule FSTs is 3,467, while in the condition FSAs, it is only 1,597. The total number of arcs in the FSAs also seem to be about half of that in the FSTs. So why?
- According to callgrind,
- If, however, we use the new
apply_detmin_fsa
function to execute a deterministic minimal FSA (even with linear transition search), without all the cruft inapply_net
, we are down to 1.45 seconds. - It could be possible to check the applicability of a number of rules with a single automaton. Researching this right now...
- The best speed could probably be achieved by building FSAs from conditions numbering the powers of two. In this way, we could find the rule to apply in log(G) checks, thereby breaking the linearity barrier.
- However, I have found that the limit where the FSAs are still of managable size is about 1000 states. This (depending on the number of states of the original FSA) is usually reached after unioning 3-6 FSAs. This means that while log(G) is out of reach, we can still reduce our running time to about 1/3-1/6th of what it is now.
- Tests of different ways of building condition trees
- Flat: no trees are used
- Smallest-first: always merge the two smallest trees.
- Fixed height: full binary trees with a fixed number of levels. In Table 1, we use trees with 3 levels.
- If the automaton is just used to check the applicability of a single rule, execution time goes up to 9 seconds. :(
- Compress the condition FSAs, so that we can check more rules with a single FSA application.
- Try Automata Mista (Huet, 2003)
- Inter-automata bits (such as "skip this reading")
- Similar rule(parts), e.g.
SELECT np IF "Mari";
andSELECT n IF "török";
: the FSA is the same, the only difference are the tags- flag diacritics?
- push-down automata?
- TODO Index rules by their required tags (word-form, target, C conditions, etc.), and skip those that cannot possible be activated by the sentence.
Condition | Total running time | FST loading | FST application | Conditions checked | Tree size | File size |
---|---|---|---|---|---|---|
Flat (no tree) | 1.476s | 0.041s | 1.435s | 107,488 | N/A | 76 kB |
Smallest first | 1.355s | 0.355s | 0.9s | 42,336 | 2-4 rules | 4.4 MB |
Fixed height | 2.070s | 1.167s | 0.84s | 36,832 | 4 rules | 12 MB |
Language | Number of rules | Number of FSTs | Size of FST file | Memory consumed by foma when loading | Memory consumed by foma | Size of VISL-GC binary | Initialisation | Running time | VISL-GC running time |
---|---|---|---|---|---|---|---|---|---|
Hungarian | 35 | 95 | 3.6MB | ? | 2.7MB | 8kB | 0.41s | 0.94s* | 0.3s |
Breton-French | 226 | 602 | 39MB | 1GB | 666MB | 36kB | 7.9s | 4.9s | 0.8s |
sme-fin | 1172 | 3149 | 514MB | 3550MB** | 2476MB | 184kB |
Hacky hacks[edit]
- An
apply_detmin_fst()
instead ofapply_down()
for the apertium <-> fomacg converters - Sigma optimisation
apply_create_sigmatch()
is responsible for about 54% ofapply_detmin_fsa()
's time. Basically we read the input string each time we run a condition FSA.- What we can do:
- Roll our own
custom_create_sigmatch()
that splits the input at spaces and treates each word/tag/etc. as a single symbol, not as a character sequence.- Currently, only words/tags/etc. that occur in the rules are handled this way. Extending this idea to unknown tags (those, whose characters are classified as
IDENTITY_SYMBOL
) results in fewer transitions in the FSA, and only one (failed) look-up in the sigma trie for each unknown tag, as opposed to one for each of its letters. - Furthermore, while the original
custom_detmin_fsa()
andcustom_create_sigmatch()
required that, apart from the symbol vector, the original string is passed to them too, we can actually get away without the latter. Instead of using asigmatch
table of the size of the string, we can use one of the size of the vector, which is much shorter. Maybe because of this (fewer memory re-allocations), this simple modification to the functions resulted in 14.5% speed increase for the Breton grammar. - Table 3 below summarizes the results: we have achieved a 16-36% speed-up and decreased the running time on the Hungarian corpus to 0.75s. The last two columns shows the effect of factoring in the f2a direction of hack #1, as well as the modification to
custom_detmin_fsa()
described above, which allowed a further 1.5-14.5% speed-up.
- Currently, only words/tags/etc. that occur in the rules are handled this way. Extending this idea to unknown tags (those, whose characters are classified as
* Sigma unification (a sketch of the <code>vector</code>-based variant follows this list):
** Create an FSA that contains all symbols from all rules and conditions. Its sigma will be the "canonical"/"universal" one.
** Move around the sigmas in all condition FSAs and rule FSTs so that the symbols above have the same codes.
** Write a <code>common_detmin_fsa()</code> that, when it sees a symbol unknown to the FSA in the current state, treats it as an <code>IDENTITY_SYMBOL</code> if it can. There are several ways to achieve this end:
*** Have a <code>sigma</code> set for all FSAs; if an input symbol is not in the set, replace it. Note that for this use case, <code>unordered_set</code> is slower than <code>set</code>, because the number of elements in a single set is small.
*** Have a mapping that translates all symbols in the universal sigma to the local one. For symbols in the local sigma this will be the identity; for the rest, <code>IDENTITY</code> :)
*** The latter can be implemented with a <code>vector</code>, which is much faster than a <code>map</code>, though its memory requirements are higher (<code>O(|conditions| * |sigma|)</code>).
*** Results are in Table 4, below.
** Let the universal FSA create the sigmatch table, and pass its <code>apply_handle</code> to the others in <code>common_detmin_fsa()</code>. This enables us to convert the symbols to sigma ids only once per rule application.
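A sketch of the <code>vector</code>-based translation table (all names assumed): built once per condition FSA after sigma unification, it replaces the per-symbol <code>set</code>/<code>map</code> look-ups with a single indexing operation.

<source lang="cpp">
// Sketch: table[universal id] == local id, or IDENTITY if the FSA does not
// know the symbol. One such vector per condition FSA, hence the
// O(|conditions| * |sigma|) memory cost mentioned above.
#include <cstddef>
#include <string>
#include <unordered_map>
#include <vector>

const int IDENTITY = 2;  // foma's reserved id for @_IDENTITY_SYMBOL_@

std::vector<int> build_translation(
    const std::vector<std::string>& universal_sigma,            // id -> symbol
    const std::unordered_map<std::string, int>& local_sigma) {  // symbol -> id
  std::vector<int> table(universal_sigma.size(), IDENTITY);
  for (std::size_t u = 0; u < universal_sigma.size(); ++u) {
    auto it = local_sigma.find(universal_sigma[u]);
    if (it != local_sigma.end()) table[u] = it->second;
  }
  return table;
}
</source>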
* Get rid of strings for good (a sketch of the <code>Symbol</code> record appears below):
** Write a <code>common_apply_down()</code> that
*** handles nondeterminism (with the restriction that for an input there is only a single output), and
*** uses sigma ids for both its input and its output.
** Because we currently delete filtered readings instead of just <code>#X#</code>'ing them,
*** we have to pass it a sequence of <code>(sigma, symbol)</code> pairs, or
*** actually, we pass a sequence of <code>Symbol</code>s, which store the sigma id, its position in the string and its length, to avoid string creation overhead; or
*** we can just bring back the <code>#X#</code>.
* See Table 0 for the results; for Hungarian, we beat VISL-GC a bit; for Breton, we need to find another 20%.
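For illustration, here is what such a <code>Symbol</code> record might look like (the field names are assumptions); readings can then be passed around as vectors of <code>Symbol</code>s without creating any substrings.

<source lang="cpp">
// Sketch of the Symbol record described above (field names assumed).
#include <cstddef>

struct Symbol {
  int sigma;           // symbol id in the unified (universal) sigma
  std::size_t pos;     // where the symbol starts in the cohort string
  std::size_t length;  // its length, so the text can be recovered if needed
};
</source>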
* Roll our own <code>apply_down()</code>: it still accounts for a rather large percentage of the running time, because it is called not only by <code>apply_rules()</code>, but also by the code that converts the Apertium stream format to fomacg's format and vice versa (see the sketch below).
** The latter FST is definitely deterministic. Using the new <code>apply_detmin_fst()</code> function instead of <code>apply_down()</code> decreased the running time on the Hungarian corpus from 0.94s to 0.87s.
** Isn't the rule FST deterministic as well?! ... 'course not.
{| class="wikitable"
|+ Table 3: Effect of <code>custom_detmin_fsa()</code> and <code>custom_create_sigmatch()</code>
! Language !! <code>apply_detmin_fsa()</code> !! <code>custom_detmin_fsa()</code> !! Speed-up !! <code>custom_detmin_fsa()</code> (*) !! Speed-up (**)
|-
| Hungarian || 0.868s || 0.745s || 16.5% || 0.648s || 1.5%
|-
| Breton-French || 4.82s || 3.54s || 36% || 2.73s || 14.5%
|}
{| class="wikitable"
|+ Table 4: Effect of sigma unification
! Language !! <code>custom_detmin_fsa()</code> !! <code>common_detmin_fsa()</code> w/ <code>set</code> !! Speed-up !! <code>common_detmin_fsa()</code> w/ <code>vector</code> !! Speed-up
|-
| Hungarian || 0.648s || 0.623s || 4% || 0.559s || 16%
|-
| Breton-French || 2.73s || 2.12s || 29% || 1.566s || 74%
|}
{| class="wikitable"
! Language !! Memory (before) !! Initialisation (before) !! Runtime (before) !! Memory (after) !! Initialisation (after) !! Runtime (after)
|-
| Hungarian || 7.1MB || 0.721s || 0.238s || 6.5MB || 0.658s || 0.232s
|-
| Breton-French || 7.4MB || 1.814s || 1.105s || 6MB || 1.67s || 1.007s
|}
=== Size problems ===
As mentioned above, unioning the rule condition FSAs is a good way to speed up the rule application process. Unfortunately, the resulting automata may be exponential in size, making FSAs that check many rules at once infeasible -- unless, of course, we can reduce their memory footprint. This section documents the experiments we conducted towards this goal.
* Automata Mista
** Due to Gérard Huet:
*** Huet, G. (2004). Automata mista. Verification: Theory and Practice, 271-273.
*** Huet, G. (2005). A functional toolkit for morphological and phonological processing, application to a Sanskrit tagger. Journal of Functional Programming, 15(4), 573-614.
** While the code implementing AM is included in the papers, it is not immediately usable and is rather hard to read, because
*** the code is in ML (with the occasional OCaml library call), which isn't very intuitive;
*** it is written in a functional style, which, when translated line-by-line to Python, leads to abysmal performance;
*** the author couldn't be bothered to explain what certain function arguments or variables mean.
** Promise:
*** Huet reports huge savings; for example, the Sanskrit morphological analyser, an FSA of 200,000 states, could be represented by a trie of 7,500 nodes.
*** Then again, FSAs derived from word lists have a structure that lends itself naturally to tries. Whether the method fares similarly well on generic FSAs remains to be seen.
** Implementation:
*** I have translated the caml code to Python; if it seems promising, I'll rewrite it in C++ (a C++ sketch of the sharing step follows this list).
** Evaluation:
*** I fed the vocabulary of the Longman dictionary (more specifically, the strangely short variant that we have) to it. From the 47,051 tokens (words, numbers, g@rb@ge), the trie builder built a raw trie (with only forward sharing) of 150,222 nodes. Running the sharing algorithm (2/p9) on it reduced the number of nodes to 42,540 -- a 71.68% reduction.
*** Of course, due to how the trie is built (there is an empty node at the end of each path), about 47,050 of those nodes are trivial to merge; this decreases the reduction to a more humble 58.77%.
*** Collapsing nonaccepting, nonbranching trie paths (not described in the papers) results in a further reduction of 54.8%, bringing the number of nodes down to 19,213. However, I don't see a way to decrease this number further, and the compression level is not nearly the same as the one reported for the Sanskrit analyzer.
*** At the same time, foma creates a 42,540-state automaton from the same word list, which hints at the not very surprising fact that "sharing" is basically the same as minimising a deterministic automaton. As such, the only advantage AM can boast is that states can be collapsed, but I am not sure we are better off, due to the additional bookkeeping required (sequences of letters on arcs).
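For illustration, a C++ sketch of the sharing step (all names assumed; Huet's original is ML): identical subtrees are merged bottom-up by hash-consing, i.e. two nodes are unified whenever they have the same accepting status and the same children -- which is exactly the equivalence used when minimising a trie-shaped DFA.

<source lang="cpp">
// Sketch: bottom-up sharing of a trie via hash-consing. Each distinct
// (accepting, children) signature receives one shared node id; the number
// of distinct ids at the end is the size of the shared structure.
#include <map>
#include <memory>
#include <utility>
#include <vector>

struct Node {
  bool accepting = false;
  std::map<char, std::unique_ptr<Node>> children;
};

using Signature = std::pair<bool, std::vector<std::pair<char, int>>>;

int share(const Node& node, std::map<Signature, int>* interned) {
  Signature sig{node.accepting, {}};
  for (const auto& child : node.children)
    sig.second.emplace_back(child.first, share(*child.second, interned));
  // Nodes with the same signature collapse into the same shared node.
  return interned->emplace(sig, static_cast<int>(interned->size()))
      .first->second;
}
</source>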
* Decrease the alphabet:
** We cannot hope for large savings from this one, but <code>|</code> seems to be completely superfluous: the end of a reading is clearly marked by the beginning of the next one (<code>#0#</code>) or the end of the cohort (<code>#EOC#</code>).
*** There is a possibility that this change might make the rules more complicated; still, we should try it.
* It is not necessary to save all FSAs to disk. In the case of a full binary tree with three levels, nodes 3, 5 and 7 (BFS ordering) are superfluous: if node 1 matches the sentence but node 2 does not, it is obvious that node 3 would, and so on. The rule applier has already been built with this idea in mind, but the redundant trees are still saved to file and read into memory.
* Delete the sigma tries from the condition automata and rules.
** See Table 6 for results on some of the rule sets. It is obvious from the data that the reduction is greater for rule sets where the condition trees are shallow; as the conditions are merged, the growth in the number of states and transitions becomes more substantial -- in the extreme case, the sigma of a joined condition might not grow at all, but the size of the state space always grows considerably.
{| class="wikitable"
|+ Table 6: Effect of deleting the sigma tries
! Rule set !! Before !! After !! Reduction
|-
| apertium-br-fr.br-fr_cg2_orig.fst || 23.5% || 17.8% || 24.25%
|-
| apertium-br-fr.br-fr_cg2_l2.fst || 9.6% || 4.4% || 54.16%
|-
| apertium-br-fr.br-fr_cg2_l1.fst || 5.1% || 1.3% || 74.5%
|-
| hun_cg2_l1.fst || 0.5% || 0.1% || 80%
|-
| hun_cg2_l3.fst || 2.1% || 1.5% || 28.57%
|-
| apertium-sme-fin.fin-sme_cg2_l1.fst || 21% || 4.1% || 80.47%
|-
| apertium-sme-fin.fin-sme_cg2_l2.fst || 32.3% || 8.9% || 72.44%
|}
=== TODO?'s ===

* cg-comp is not as strict as fomacg about the position part of contextual tests; e.g. it accepts <code>1*</code>, not just <code>*1</code>. Loosen up fomacg?
=== Research ===

* Decrease the number of rules applied (see the proposal).
* When and how can rules be merged?
=== Miscellaneous / Extra ===

==== Hungarian CG grammar ====
Write a simple CG grammar for Hungarian, somewhere around 50-150 rules.

* Read Pasi Tapanainen's ''The Constraint Grammar Parser CG-2''.
* Read the contents of cg_material.zip.
* Install Apertium, VISL CG3 and a language pair (cy-en).
* Study the CG grammar of an Apertium language.
* Write a Hungarian grammar that covers the sentences in this sample file.
* TODOs:
** add a sentence to the rasskaz file for the "az a" construct;
** prevb disambiguation.
* The file is here.
==== Hunmorph converter ====

Write a converter from ocamorph's output to Apertium's format.

* Again, use the sentences in this sample file as reference.
* While a C-based converter would definitely be possible, I opted for a foma-based (xfst -- lexc?) implementation, so that this task also serves as practice.
* TODOs:
** some analyses are repeated in the output: fix them! -- Not a real fix (the problem wasn't in the script, but in ocamorph), but I wrote a python script that discards repeated analyses.
** hunmorph_to_apertium.foma does not handle compounds (+)
** two-phase conversion (handle special words first, then the general part)
** a few non-general conversions:
*** "mikor<CONJ>" => <cnjadv>, "mikor<ADV>" => <itg>
*** "hova" => <itg>
*** "így<CONJ>" => <cnjadv>
*** "ekkor<ADV>" => <cnjadv>?
*** "aztán<CONJ>" => <cnjadv>
* Apparently a pure lexc/foma-based implementation wasn't possible (at least with the one line -- one word output format I decided to ask from hunmorph). The reason is that the tagsets of hunmorph and Apertium do not match exactly, and therefore I needed to add exceptional rules for certain words and lexical categories. However, <code>flookup</code> returns all possible analyses, so in this case it returned both the exceptional and the regular translation. The current solution consists of four files (a sketch of the fallback logic follows this list):
** kr_to_apertium_spec.lexc contains the words / lexical categories that need special treatment (<code>vbser</code>, <code>prn</code>s, "ez"/"az", etc.);
** kr_to_apertium.lexc contains the rest of the categories, i.e. those whose translation was straightforward;
** kr_to_apertium.foma is a simple foma script that writes the previous two into the binary file <code>kr_to_apertium.fst</code>;
** hunmorph_to_apertium.cpp loads the two FSTs from the binary file and applies them to the readings. First it tries to parse the reading with the <code>_spec</code> FST; if that fails, it falls back to the general one. This mechanism ensures that every reading gets exactly one translation, and that it is the correct one.
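A simplified sketch of that fallback logic; <code>apply_down()</code> is foma's real API call, but the function shape, names and error handling here are assumptions, not the actual hunmorph_to_apertium.cpp code.

<source lang="cpp">
// Sketch: try the _spec FST first; fall back to the general FST.
// The apply_handles are assumed to have been set up with apply_init().
#include <string>
#include "fomalib.h"

std::string convert_reading(struct apply_handle* spec,
                            struct apply_handle* general,
                            const std::string& reading) {
  // apply_down() returns NULL if the FST cannot parse the input.
  char* result = apply_down(spec, const_cast<char*>(reading.c_str()));
  if (result == NULL)  // not a special word: use the general FST
    result = apply_down(general, const_cast<char*>(reading.c_str()));
  return result != NULL ? std::string(result) : reading;  // pass through
}
</source>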
==== ATT -> lttoolbox compiler ====

Write an ATT FST format reader for lttoolbox. A useful practice for moving from foma to lttoolbox. Since lttoolbox lacks some of the functionality needed, the compiler will most likely stay in foma, but lttoolbox might work as the runtime component.

* ATT format
* "<spectie> the ATT->lttoolbox thing should be as simple as: <code>beer = t.insertSingleTransduction(alphabet(L'e',L'e'), beer);</code>"
=== References ===

# Hulden, Mans. 2011. Constraint Grammar parsing with left and right sequential finite transducers. In: Proceedings of the 9th International Workshop on Finite State Methods and Natural Language Processing, pages 39--47.
# Kornai, András, Péter Rebrus, Péter Vajda, Péter Halácsy, András Rung and Viktor Trón. 2004. Általános célú morfológiai elemző kimeneti formalizmusa (The output formalism of a general-purpose morphological analyzer). In: Proceedings of the 2nd Hungarian Computational Linguistics Conference.