Task ideas for Google Code-in/Intersection of ATT format transducers

From Apertium

{{TOCD}}
 
Latest revision as of 14:01, 17 March 2020

'''This has been implemented; see the tool [[lt-trim]].'''

==Objective==

The objective of these tasks is to write code to intersect two finite-state transducers. One transducer is a morphological dictionary; the other is a bilingual dictionary, which is converted into prefixes.

The intersection of the morphological dictionary with the prefixes of the bilingual dictionary will give us the set of strings in the morphological dictionary which have translations in the bilingual dictionary.

==Example==

===Input===

Monolingual transducer:
<pre>
0 1 b b
1 2 e e
1 3 a a
1 4 u u
2 5 e e
2 6 d d
3 6 t t
4 6 g g
5 6 r r
6 7 ε <n>
6 8 s <n>
7 9 ε <sg>
8 9 ε <pl>
9
</pre>

Bilingual transducer:
<pre>
0 1 b o
0 2 b s
1 3 e h
2 4 a a
3 5 d e
4 6 t g
5 7 <n> <n>
6 8 <n> u
8 9 ε z
9 10 ε a
10 11 ε r
11 7 ε <n>
7
</pre>

===Output===

Trimmed monolingual transducer:
<pre>
0 1 b b
1 2 e e
1 3 a a
2 4 d d
3 4 t t
4 5 s <n>
4 6 ε <n>
5 7 ε <pl>
6 7 ε <sg>
7
</pre>
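ATT-style tables like the ones above can be loaded into a very simple structure. A minimal Python sketch (the whitespace-separated four-column layout, the final-state lines, and the dict layout are assumptions for illustration; real ATT files are tab-separated and may carry weights):

```python
def parse_att(text):
    """Parse an ATT-style transition table like the examples above.

    Four-field lines are transitions (source, target, input, output);
    single-field lines mark final states. State 0 is assumed initial.
    """
    transitions, finals = [], set()
    for line in text.strip().splitlines():
        fields = line.split()
        if len(fields) == 1:
            finals.add(int(fields[0]))
        elif len(fields) >= 4:
            src, tgt, insym, outsym = fields[:4]
            transitions.append((int(src), int(tgt), insym, outsym))
    return {"start": 0, "finals": finals, "transitions": transitions}

# Example: a tiny one-transition acceptor for "b"
fst = parse_att("0 1 b b\n1")
```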

===Steps===

[[Image:imonodix.png|thumb|400px|center]]
[[Image:prefixes.png|thumb|400px|center|The bilingual dictionary (or one side of it). Note that the loops at the end show that these are ''prefixes'', so will match any prefix of a lexical unit. '''Note: this loop should have the symbols of the monodix!''']]
[[Image:trimmed.png|thumb|400px|center]]

# Load both transducers.
# Convert the bilingual dictionary transducer into prefixes by selecting only one side of it and adding a loop at each final state.
# Perform the intersection.

==Tasks==

===Implementation in python===

# Come up with a data structure for storing the transducers.
# Write a function to convert a transducer to a prefix transducer (e.g. make the final states loop back over any symbol).
# Write the intersection algorithm.
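A rough sketch of these steps, under several simplifying assumptions: a transducer is a dict of a start state, final states, and (source, target, symbol) transitions over one side only; the prefix loop uses a wildcard symbol instead of the real monodix alphabet; and ε-transitions let one machine advance alone. This illustrates the product construction, not the lttoolbox implementation:

```python
EPSILON = "ε"
ANY = "<ANY>"  # wildcard for the prefix loop (stands in for the monodix symbols)

def make_prefix(fst):
    """Step 2: make every final state loop back over any symbol."""
    loops = [(f, f, ANY) for f in fst["finals"]]
    return {"start": fst["start"], "finals": set(fst["finals"]),
            "transitions": fst["transitions"] + loops}

def intersect(a, b):
    """Step 3: product construction, walking both machines in lockstep."""
    start = (a["start"], b["start"])
    seen, stack = {start}, [start]
    finals, out = set(), []

    def visit(state, trans):
        out.append(trans)
        if state not in seen:
            seen.add(state)
            stack.append(state)

    while stack:
        p, q = stack.pop()
        if p in a["finals"] and q in b["finals"]:
            finals.add((p, q))
        for src, tgt, sym in a["transitions"]:
            if src != p:
                continue
            if sym == EPSILON:  # a moves alone on ε
                visit((tgt, q), ((p, q), (tgt, q), sym))
                continue
            for bsrc, btgt, bsym in b["transitions"]:
                if bsrc == q and bsym in (sym, ANY):
                    visit((tgt, btgt), ((p, q), (tgt, btgt), sym))
        for bsrc, btgt, bsym in b["transitions"]:
            if bsrc == q and bsym == EPSILON:  # b moves alone on ε
                visit((p, btgt), ((p, q), (p, btgt), EPSILON))
    return {"start": start, "finals": finals, "transitions": out}
```

States of the result are pairs of original states; a pair is final only when both machines accept, which is exactly what trims paths that have no counterpart in the bidix prefixes.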

===Implementation in C++===

# Look up the <code>Transducer</code> class in [[lttoolbox]].
# Write a function to convert a transducer to a prefix transducer (e.g. make the final states loop back over any symbol).
# Write the intersection algorithm as a method.

===Front-end program in C++===

# Write a command-line program in the style of <code>lt-comp</code> and <code>lt-proc</code> that loads two transducers and performs the intersection.

===Compound support in python===

Languages like German write compound nouns without spaces, and our analysers allow going "back to the start" if they have seen a full noun but there is still more of the word to read. So if ''zeugnis'', ''verweigerungs'' and ''recht'' are in the analyser and marked as compoundable, ''zeugnisverweigerungsrecht'' would be possible to analyse. We mark an analysis as compoundable by putting a special tag on it: &lt;compound-R&gt; for words that can be the final part, or &lt;compound-only-L&gt; for words that can be a non-final part. As a simpler example, say "aabbcc" is a compound of the nouns "aa", "bb" and "cc"; the analyser would then have this transition table:

<pre>
0 1 a a
1 4 a a
0 2 b b
2 4 b b
0 3 c c
3 6 c c
4 5 ε <n>
5 9 ε <compound-only-L>
6 7 ε <n>
7 9 ε <compound-R>
9
</pre>

Before words reach the bidix, they are split on the compound borders, so the bidix would see aa&lt;n&gt; bb&lt;n&gt; cc&lt;n&gt;, i.e. a corresponding bidix that has all the entries of the analyser has this transition table:

<pre>
0 1 a a
1 4 a a
0 2 b b
2 4 b b
0 3 c c
3 4 c c
4 5 ε <n>
5
</pre>

The compound tag can appear anywhere in the analysis, including before all other tags, so intersection shouldn't remove a path until it has seen a non-compound tag.
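At the string level, the splitting on compound borders can be sketched like this (assuming the compound tags appear in the analysis exactly as in the transition table above; in the real pipeline this happens on transducer paths, where the intersection must likewise keep a path alive until a non-compound tag is seen):

```python
import re

# The two compound tags described above.
COMPOUND_TAGS = ("<compound-only-L>", "<compound-R>")

def split_compound(analysis):
    """Split an analysis on compound borders, dropping the compound
    tags, so each part can be looked up in the bidix separately."""
    pattern = "|".join(re.escape(tag) for tag in COMPOUND_TAGS)
    return [part for part in re.split(pattern, analysis) if part]

# "aabbcc" analysed as three compounded nouns:
split_compound("aa<n><compound-only-L>bb<n><compound-only-L>cc<n><compound-R>")
# -> ["aa<n>", "bb<n>", "cc<n>"]
```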

===Compound support in C++===

===#-type multiword support in python===

The '''#''' is used for "multiwords" (words that consist of several words) when there is inflection in the middle of the word and an invariant part at the end.

In the analyser, the # appears after the inflected part has been fully read, before the invariant part. So if we want both "coffee with milk" and "coffees with milk" as possible single-unit inputs, the analyser would have these possible outputs:

<pre>coffee<n><sg># with milk
coffee<n><pl># with milk</pre>

But in the bidix, the invariant part is represented before the tags, i.e.

<pre>coffee# with milk<n><sg>
coffee# with milk<n><pl></pre>

To intersect the analyser with the bidix, we actually need to intersect it with a transformed bidix that has the #-part moved after the tags.

So the task is to take a transducer and transform it into one where any string starting with # is moved after the tags.

(This would then be input to the intersection.)
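As a string-level illustration of that transformation (the function name and the regular expression are assumptions for the sketch; the real task operates on the transducer itself, not on strings):

```python
import re

def move_lemq(bidix_form):
    """Move the invariant '#'-part of a bidix-style string after the
    tags, so it lines up with the analyser's output for intersection."""
    m = re.match(r"(.*?)(#[^<]*)(<.*)", bidix_form)
    if not m:
        return bidix_form  # no '#'-part or no tags: nothing to move
    lemma, lemq, tags = m.groups()
    return lemma + tags + lemq

move_lemq("coffee# with milk<n><sg>")
# -> "coffee<n><sg># with milk"
```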

===#-type multiword support in C++===

==Further reading==

* [http://www.cs.um.edu.mt/gordon.pace/Research/Software/Relic/Transformations/FSA/intersection.html Intersection of Finite State Automata]
* [https://svn.code.sf.net/p/apertium/svn/branches/lt-trim lt-trim (implemented with HFST)]
* [[Talk:Automatically trimming a monodix]]

[[Category:Tasks for Google Code-in|Intersection of ATT format transducers]]