Latest revision as of 19:10, 5 April 2012

GSoC application: Rule-based finite-state disambiguation

Hrvoje Peradin

hperadin@gmail.com

krvoje on IRC: #apertium

Why is it you are interested in machine translation?

It's a perfect combination of computer science and linguistics. I am fascinated by languages, both natural and artificial. With MT, it fascinates me to see a message carried across a language barrier purely by a process of computation. While the results are rarely perfect, it takes down communication barriers and opens up new opportunities for learning and communication.

Why is it that you are interested in the Apertium project?

I worked on a language pair in last year's GSoC, and it gave me great insight into rule-based NLP. It was an invaluable chance to do real-life work on an immensely interesting topic and to create an open-source resource, a great experience that taught me a lot about software development and NLP, and it also gave me the theme for my master's thesis. So I could say Apertium has a special place in my heart, and I would love to continue working on it.

Which of the published tasks are you interested in?

Writing the module for rule-based finite-state disambiguation.

Why should Google and Apertium sponsor it?

The module is intended to supplement the current bigram tagger and Constraint Grammar by implementing constraint-based disambiguation in a finite-state manner. Since the most common disambiguation rules can be expressed in a finite-state way, this will greatly improve the speed of disambiguation and will be beneficial when working with large texts.

How and whom will it benefit in society?

It will provide a fast tool for rule-based disambiguation, which will enable faster processing of larger corpora and potentially help improve translation quality in any Apertium language pair.

What do you plan to do?

I will design an XML formalism for writing disambiguation rules, a validator for it, a compiler that represents the rules as a finite-state transducer, and a processor that applies the rules to an Apertium input stream. The syntax will mostly resemble that of Apertium's transfer module or LanguageTool's disambiguator, since it is the most natural syntax for the approach I will be taking.

The compiler and the processor will be written in C++, using the lttoolbox API and based on the designs of Apertium's transfer module and lexical selection module. My initial intention was to use the lttoolbox API for designing and compiling the example rules as well. While it provides a great formalism for morphological transducers, it proved less flexible and expressive than I needed for writing the examples. Since expressing disambiguation rules requires more complicated regular expressions than lttoolbox currently supports, the first part of my work will be a module for compiling a larger subset of the finite-state calculus to lttoolbox transducers. The formalism will be compatible with foma/xfst.
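As a rough illustration of what compiling a regular-expression calculus involves, here is a minimal Thompson-style construction for a toy subset (literals, concatenation, `|`, `*`, parentheses), sketched in Python. It is illustrative only: the real module would parse the foma/xfst calculus and target lttoolbox's transducer classes, and every name below is mine, not lttoolbox's.

```python
# Sketch: compile a toy regex subset into an epsilon-NFA
# (Thompson construction) and match by NFA simulation.

class NFA:
    def __init__(self):
        self.trans = []          # per-state dict: symbol -> set(states); None = epsilon
        self.start = self.new_state()
        self.accept = None

    def new_state(self):
        self.trans.append({})
        return len(self.trans) - 1

    def add_edge(self, src, sym, dst):
        self.trans[src].setdefault(sym, set()).add(dst)

def compile_regex(pattern):
    nfa = NFA()
    pos = 0

    def parse_alt():                      # alt := concat ('|' concat)*
        nonlocal pos
        s1, e1 = parse_concat()
        while pos < len(pattern) and pattern[pos] == '|':
            pos += 1
            s2, e2 = parse_concat()
            s, e = nfa.new_state(), nfa.new_state()
            for bs, be in ((s1, e1), (s2, e2)):
                nfa.add_edge(s, None, bs)
                nfa.add_edge(be, None, e)
            s1, e1 = s, e
        return s1, e1

    def parse_concat():                   # concat := star+
        nonlocal pos
        s, e = parse_star()
        while pos < len(pattern) and pattern[pos] not in '|)':
            s2, e2 = parse_star()
            nfa.add_edge(e, None, s2)
            e = e2
        return s, e

    def parse_star():                     # star := atom '*'?
        nonlocal pos
        if pattern[pos] == '(':
            pos += 1
            s, e = parse_alt()
            pos += 1                      # skip ')'
        else:
            s, e = nfa.new_state(), nfa.new_state()
            nfa.add_edge(s, pattern[pos], e)
            pos += 1
        if pos < len(pattern) and pattern[pos] == '*':
            pos += 1
            nfa.add_edge(s, None, e)      # skip the loop entirely
            nfa.add_edge(e, None, s)      # or repeat it
        return s, e

    s, e = parse_alt()
    nfa.add_edge(nfa.start, None, s)
    nfa.accept = e
    return nfa

def closure(nfa, states):
    """Epsilon-closure of a set of states."""
    stack, seen = list(states), set(states)
    while stack:
        q = stack.pop()
        for r in nfa.trans[q].get(None, ()):
            if r not in seen:
                seen.add(r)
                stack.append(r)
    return seen

def matches(nfa, word):
    current = closure(nfa, {nfa.start})
    for ch in word:
        nxt = set()
        for q in current:
            nxt |= nfa.trans[q].get(ch, set())
        current = closure(nfa, nxt)
    return nfa.accept in current
```

A real implementation would additionally handle transducer pairs, composition, and the other calculus operators, which is where most of the work lies.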

Another important issue is the performance and complexity of the resulting transducers, so I will do a literature review before the coding period begins and design the implementation so that transducer size does not blow up. Regarding minimization, both foma and lttoolbox use Brzozowski's minimization algorithm (foma also uses Hopcroft's), so those will be my starting points.
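For reference, Brzozowski's algorithm minimizes an automaton by reversing it, determinizing, reversing again, and determinizing again. A small expository sketch in Python (the tuple encoding of automata and the helper names are mine, purely for illustration; lttoolbox and foma of course use their own representations):

```python
# Automaton = (states, alphabet, transitions, starts, finals),
# with transitions as a set of (src, symbol, dst) triples.

def reverse(aut):
    """Swap edge directions and exchange start/final states."""
    states, alpha, trans, starts, finals = aut
    return (states, alpha, {(d, a, s) for (s, a, d) in trans},
            set(finals), set(starts))

def determinize(aut):
    """Accessible subset construction (no dead/sink state added)."""
    states, alpha, trans, starts, finals = aut
    start = frozenset(starts)
    dstates, worklist, dtrans = {start}, [start], set()
    while worklist:
        cur = worklist.pop()
        for a in alpha:
            nxt = frozenset(d for (s, sym, d) in trans
                            if sym == a and s in cur)
            if nxt:
                dtrans.add((cur, a, nxt))
                if nxt not in dstates:
                    dstates.add(nxt)
                    worklist.append(nxt)
    dfinals = {q for q in dstates if q & finals}
    return (dstates, alpha, dtrans, {start}, dfinals)

def brzozowski(aut):
    """det(rev(det(rev(A)))) is the minimal DFA for A's language."""
    return determinize(reverse(determinize(reverse(aut))))

def accepts(dfa, word):
    """Run a (deterministic) automaton on a word."""
    states, alpha, trans, starts, finals = dfa
    (cur,) = starts
    for ch in word:
        nxts = [d for (s, a, d) in trans if s == cur and a == ch]
        if not nxts:
            return False
        (cur,) = nxts
    return cur in finals
```

Brzozowski's method is simple and works directly on NFAs, but the intermediate determinization can be exponential in the worst case, which is exactly the size blow-up the literature review is meant to address.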

Although I am already quite familiar with the lttoolbox modules for compiling and operating on transducers, which are the parts my work will focus on, the API is rather sparsely documented, and it is hard to predict how much time certain tasks will consume. For this reason I have assigned a larger time frame to the work involving the API.

The disambiguation module will be documented with use-case examples of varying complexity in various languages, based on my current experience with disambiguating Croatian texts with Constraint Grammar.

Examples of rule formalism

The following are a few examples of the proposed XML formalism, together with a rudimentary transducer implementation in the finite-state calculus of foma[1]. The transducers operate on simple patterns, and although they validate the Apertium stream format, no account was taken of the robustness rule of Constraint Grammar (keeping the last remaining reading). The source code of the script is at http://wiki.apertium.org/wiki/User:Krvoje/Foma_script_for_testing_finite-state_disambiguation, or can be checked out from my folder on SVN, https://apertium.svn.sourceforge.net/svnroot/apertium/branches/gsoc2012/krvoje/fst/.
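For concreteness, the stream data the examples below operate on can be parsed with a few lines of Python. This is a hedged sketch covering only the plain subset of the Apertium stream format used here (`^surface/reading/...$` with `<tag>`s, no escapes or superblanks); the helper names are mine.

```python
import re

# One lexical unit: ^surface/reading(/reading)*$
LU = re.compile(r'\^([^/$]+)((?:/[^/$]+)+)\$')
TAG = re.compile(r'<([^<>]+)>')

def parse_stream(text):
    """Parse a plain Apertium-style stream into a list of
    (surface, readings) pairs; a reading is (lemma, [tags])."""
    units = []
    for m in LU.finditer(text):
        surface = m.group(1)
        readings = []
        for reading in m.group(2).split('/')[1:]:
            lemma = reading.split('<', 1)[0]
            tags = TAG.findall(reading)
            readings.append((lemma, tags))
        units.append((surface, readings))
    return units
```

For example, `parse_stream("^na/na<pr><acc>/na<pr><loc>$")` yields one unit with surface `na` and two readings, which is the shape of input the rules below disambiguate.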

A removal rule

<rule n="preposition_not_accusative">
    <lu>
        <tag n="pr"/>
        <tag n="acc"/>
        <remove>
            <tag n="acc"/> <!-- Delete all readings containing <acc> from this LU -->
        </remove>
    </lu>
    <lu>
        <tag n="n"/>
        <tag n="loc"/>
    </lu>
</rule>

This rule matches the pattern ["LU containing the tag <pr> and the tag <acc>"] ["LU containing the tags <n> and <loc>"] and removes from the first LU all readings containing the tag <acc>.

Code snippet from the foma script:

define RuleRemoveAccusative [ [Preposition & Accusative & Locative] .o. MarkSuspicious(Contains({<acc>})) ]
                            [ Noun & Locative ] ;

The transducer matches two LUs with tags as described in the XML formalism, and the first match is passed by composition to the MarkSuspicious transducer, which marks readings containing <acc> as pending removal. (Preposition, Accusative, etc. are transducers that match an LU containing the respective tag; they are defined earlier for legibility.)

The entire transducer applying this rule to the stream is built as the following composition:

ValidInput .o. PrepareInput .o. RuleRemoveAccusative .o. Remove .o. CleanMarks ;

Here ValidInput validates the input stream against the Apertium stream format, PrepareInput assigns temporary tags, and the Remove and CleanMarks transducers respectively remove the readings marked for removal and clean up the temporary tags.
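To make the intended runtime behavior concrete, here is a hypothetical Python sketch of what the processor should do for this rule, operating on already-parsed cohorts rather than the actual transducer machinery. A cohort is a (surface, readings) pair with each reading a set of tags; the function name and the LU-matching interpretation (some reading carries all the listed tags) are my assumptions, and the sketch also keeps the CG robustness rule that the foma script omits.

```python
def apply_removal(sentence):
    """If an LU with a <pr>+<acc> reading is immediately followed by an
    LU with an <n>+<loc> reading, remove the <acc> readings from the
    first LU (never deleting its last remaining reading)."""
    out = []
    for i, (surface, readings) in enumerate(sentence):
        nxt = sentence[i + 1] if i + 1 < len(sentence) else None
        if (any({'pr', 'acc'} <= r for r in readings)
                and nxt is not None
                and any({'n', 'loc'} <= r for r in nxt[1])):
            kept = [r for r in readings if 'acc' not in r]
            if kept:                     # robustness: keep the last reading
                readings = kept
        out.append((surface, readings))
    return out
```

For example, the ambiguous preposition "na" with readings {pr, acc} and {pr, loc}, followed by a noun with a {n, loc} reading, loses its accusative reading.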

A select rule example

<rule n="noun_is_locative">
    <lu>
        <tag n="pr"/>
        <tag n="loc"/>
    </lu>
    <lu>
        <tag n="n"/>
        <tag n="loc"/>
        <tag n="dat"/>
        <select>
            <tag n="loc"/> <!-- Delete all readings except the ones containing <loc> from this LU -->
        </select>
    </lu>
</rule>

The rule matches the pattern ["LU containing <pr> and <loc>"] ["LU containing <n>, <loc> and <dat>"], and selects all readings with <loc> in the second LU.

The foma snippet for this rule is:

define RuleSelectLocative [ Preposition & Locative ]
                          [ [Noun & Locative & Dative] .o. MarkSelection(Contains({<loc>})) ] ;

Here the second match is given to the transducer MarkSelection, which marks all readings containing <loc> to be kept, and all other readings for removal.

The entire rule is similarly built as the composition:

 ValidInput .o. PrepareInput .o. RuleSelectLocative .o. Remove .o. CleanMarks ;
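The select rule can be sketched in the same hypothetical style as the removal rule: with cohorts as (surface, readings) pairs and readings as tag sets, a processor would keep only the <loc> readings of the noun after a locative preposition, falling back to all readings if none match (CG's robustness rule). The matching interpretation is my assumption, not a fixed part of the formalism.

```python
def apply_select(sentence):
    """After an LU with a <pr>+<loc> reading, keep only the <loc>
    readings of a noun ambiguous between <loc> and <dat>."""
    out = []
    for i, (surface, readings) in enumerate(sentence):
        prev = sentence[i - 1] if i > 0 else None
        if (any({'n', 'loc'} <= r for r in readings)
                and any({'n', 'dat'} <= r for r in readings)
                and prev is not None
                and any({'pr', 'loc'} <= r for r in prev[1])):
            kept = [r for r in readings if 'loc' in r]
            if kept:                     # robustness: keep at least one reading
                readings = kept
        out.append((surface, readings))
    return out
```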

Combination of both rules

These two rules can easily be combined into one more compact rule:

<rule n="combined_rule">
    <lu>
        <tag n="pr"/>
        <tag n="loc"/>
        <tag n="acc"/>
        <remove>
            <tag n="acc"/> <!-- Delete all readings containing <acc> from this LU -->
        </remove>
    </lu>
    <lu>
        <tag n="n"/>
        <tag n="loc"/>
        <tag n="dat"/>
        <select>
            <tag n="loc"/> <!-- Delete all readings except the ones containing <loc> from this LU -->
        </select>
    </lu>
</rule>

This rule performs a similar match, with the transducers for <remove> and <select> composed with their respective cohorts, yielding a more compact representation than the former two rules:

define RuleCombine [ [Preposition & Locative & Accusative] .o. MarkSuspicious(Contains({<acc>})) ]
                   [ [Noun & Locative & Dative] .o. MarkSelection(Contains({<loc>})) ] ;

The entire rule is again a composition with auxiliary transducers.

 ValidInput .o. PrepareInput .o. RuleCombine .o. Remove .o. CleanMarks ;

Comments on the rule formalism

The rule formalism will be tailored to my current practical experience with disambiguation (my master's thesis is a CG for Croatian). It will thus contain functionality for defining structures analogous to the categories in apertium-transfer, making it easier to express morphological agreement, as well as elements for long-distance matching, logical and set operations, etc.

My main role models are LanguageTool[2] and Apertium's lexical selection tools[3], as well as Apertium's transfer module, all of which use a similar pattern/action paradigm, so the syntax will closely resemble theirs. The aim is not to mimic Constraint Grammar, but to make the formalism as expressive as possible within these constraints.

Work already done

Community bonding period

- written the program for the coding challenge

- started familiarising myself with lttoolbox and wrote a small program that composes strings and regexes into an FST

- implemented example disambiguation rules in the calculus of foma

Work to do

Before the coding period:

- explore the lttoolbox API in more detail

- write a simple prototype that implements a hardcoded rule (e.g. preposition-based case disambiguation for Serbo-Croatian)

- do a review of the literature on minimization algorithms for FSTs

The coding period:

- Week 1-2: Writing prototype rules in foma, testing different approaches and minimization algorithms to see which give the best performance

- Week 3-4: Writing and testing additions to lttoolbox for compiling FSTs from the finite-state calculus, based on the test results

- Deliverable #1: The lttoolbox module for parsing a finite-state calculus (compatible with foma and xfst) and compiling FSTs.

- Week 5: Writing the description of the XML formalism and a validator for it, and expressing the rules in the finite-state calculus.

- Deliverable #2: A complete XML formalism for expressing finite-state disambiguation rules, with validation and formulation of rules in finite-state calculus.

- Week 6-7: Writing the compiler for the XML format with the new lttoolbox modules, and a stream processor for the compiled rules

- Week 8-9: Writing rule examples and testing the compiler and the stream processor with them

- Week 10: Writing use-case examples, based on the final format of the XML formalism

- Week 11: Regression testing

- Week 12: Final cleanup of the code and the documentation

- Deliverable #3: The complete disambiguation system, with a compiler, a processor, and the documentation containing use-case examples.

Non-GSoC activities

Lectures for 12 hours a week, no other fixed commitments at present.

Bio

I am a graduate student of Computer Science and Mathematics at the Faculty of Science, University of Zagreb.

During my courses I have worked with C/C++, Python, C#, Java, JavaScript, PHP + CSS + HTML, XML, SQL, Coq... Besides Coq, I also have basic knowledge of functional programming through Haskell and the GF formalism. Currently I am writing my master's thesis on disambiguation of Croatian with Constraint Grammar.

I worked on the language pair apertium-sh-mk during GSoC 2011, and was a mentor for Google Code-in 2011 on several tasks involving that and similar language pairs.

Regarding the technologies used in machine translation, I have taken courses covering finite-state machines, context-free grammars (implementing a parser with yacc+flex), and machine learning.

References
[1] foma: http://code.google.com/p/foma/

[2] LanguageTool: http://www.languagetool.org/

[3] Constraint-based lexical selection module: http://wiki.apertium.org/wiki/Constraint-based_lexical_selection_module

- Janne Peltonen: A Finite State Constraint Grammar Parser, http://dspace.utlib.ee/dspace/bitstream/handle/10062/19300/peltonen.pdf

- Mans Hulden: Constraint Grammar parsing with left and right sequential finite transducers, http://aclweb.org/anthology-new/W/W11/W11-4406.pdf

- Rojas, Forcada, Sánchez: Construcción i minimización eficiente de transductores de letras a partir de diccionarios con paradigmas, http://www.sepln.org/revistaSEPLN/revista/35/07.pdf

- Karttunen: The Replace Operator, http://www.xrce.xerox.com/content/download/20522/147557/mltt-95-03.pdf

- Kempe, Karttunen: Parallel Replacement in Finite State Calculus, http://acl.ldc.upenn.edu/C/C96/C96-2105.pdf