Integration and tagset conversion with Giellatekno

Latest revision as of 17:56, 12 March 2016

A common setup for language pairs nowadays is to use transducers from Giellatekno together with pair-specific data in Apertium. This setup is tricky because a lot of machinery is involved in the tagset conversion.

Assuming you are using giella-xxx, giella-yyy and apertium-xxx-yyy, which files are relevant?

Files

Giellatekno side

  • giella-xxx/tools/mt/apertium:
    This is where all the relabelled transducers live.
    • giella-xxx/tools/mt/apertium/tagsets:
      These are the files used for relabelling the transducers.
      • giella-xxx/tools/mt/apertium/tagsets/README.txt:
        This README file explains the different files in the folder.
      • giella-xxx/tools/mt/apertium/tagsets/modify-tags.regex:
        modify-tags.regex is used for 1:1, 1:n and n:1 tag conversions, for example to change <sg3> to <p3><sg>. It is also used for context-sensitive replacement, for example changing ,<clb> to ,<cm> for specific lemmas.
      • giella-xxx/tools/mt/apertium/tagsets/gt2apertium.cg3relabel:
        gt2apertium.cg3relabel is used for converting the CG file tags to Apertium tags.
        The format is "MAP (GTTag) (apertiumtag);". You can also map one tag to several (or to any CG set, really), as in "MAP (GTTag) (apertiumtag1 apertiumtag2) OR (apertiumtag3);", but you cannot map multiple tags to one in a single statement.

Note that 1:1 relabellings that only lowercase the tag and turn / into _ are handled by a script, so most tags don't need to go into modify-tags.regex. However, some 1:1 tags are added or changed outside the lexc, e.g. the "VV" tags used on derivation sources, and these do need to go in modify-tags.regex.
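
To make the automatic step concrete, here is a minimal Python sketch of what such a 1:1 relabelling amounts to for a single tag. It is not the actual Giellatekno script. The function name auto_relabel is hypothetical, and the conversion from the +Tag form to the <tag> form is an assumption based on the lookup output shown in the troubleshooting section (+CLB on the Giellatekno side, angle-bracket tags on the Apertium side). The example tags are illustrative only.

```python
def auto_relabel(gt_tag: str) -> str:
    """Emulate the generated 1:1 relabelling for one Giellatekno-style tag."""
    # Assumption: Giellatekno tags are written with a leading "+", e.g. "+Sg".
    name = gt_tag.lstrip("+")
    # The two changes named in the text: lowercase, and "/" becomes "_".
    name = name.lower().replace("/", "_")
    # Assumption: the Apertium side wraps the tag name in angle brackets.
    return f"<{name}>"


print(auto_relabel("+Sg"))       # <sg>
print(auto_relabel("+Sem/Hum"))  # <sem_hum>
```

Anything that cannot be expressed this way (merging tags, splitting tags, context-dependent changes) is what modify-tags.regex is for.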

Apertium side

  • The Makefile.am trimming rule .deps/%.autobil.prefixes needs to mention all the equivalences between regular PoS tags and the ex-derivation PoS tags that appear when a word changes part of speech in derivation. For example, a verb→noun derivation is tagged ex_vblex instead of vblex, so the Makefile needs to include [ "<vblex>" -> [ "<vblex>" | "<ex_vblex>" ] ] in the line that creates .deps/$*.derivpos.hfst. This hopefully shouldn't need changing too often.
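
If several parts of speech need such an equivalence, the fragments all follow the same pattern. The following Python sketch only generates the text of each fragment. The helper name derivpos_rule is hypothetical, only vblex comes from the example above (the other tags in the loop are placeholders), and how the fragments are combined in the Makefile line is not shown here.

```python
def derivpos_rule(pos: str) -> str:
    """Build the regex fragment that lets <pos> also match <ex_pos>."""
    return f'[ "<{pos}>" -> [ "<{pos}>" | "<ex_{pos}>" ] ]'


# "vblex" is the example from the text; "n" and "adj" are hypothetical additions.
for pos in ["vblex", "n", "adj"]:
    print(derivpos_rule(pos))
```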

Writing tag relabelling files

modify-tags.regex

This file is for tag changes that go beyond simple lowercasing and turning / into _.

[
  [ %<clb%> -> %<cm%> || %, _ ] .o.
  [ %<n%> %<prop%> -> %<np%> ]
] ;

This sample changes the sequence <n><prop> to <np>, and the sequence ,<clb> to ,<cm>. You can also split tags off into separate words. For example:

[
  [ %<qst%> -> %+ k o %<qst%> ]
] ;

This would turn <qst> into +ko<qst>.

NOTE: modify-tags.regex is run after the generated lowercasing (and / to _) conversions (see the generated file apertium.relabel), so the rules must match the already-lowercased tags.
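
The effect of these rules on individual analysis strings can be emulated in plain Python. This is only a sketch: the real file is compiled into a transducer and composed with the analyser, whereas the code below applies ordinary regular expressions to strings. The three rules from the two samples above are combined into one list for illustration, and the names RULES and modify_tags are hypothetical.

```python
import re

# One (pattern, replacement) pair per replace rule; the list order mirrors
# the order in which the rules are composed with .o. in the regex file.
RULES = [
    (re.compile(r"(?<=,)<clb>"), "<cm>"),  # <clb> -> <cm>, only after a comma
    (re.compile(r"<n><prop>"), "<np>"),    # n:1 conversion
    (re.compile(r"<qst>"), "+ko<qst>"),    # split a tag off into a word
]


def modify_tags(analysis: str) -> str:
    """Apply every rule in order to one analysis string."""
    for pattern, replacement in RULES:
        analysis = pattern.sub(replacement, analysis)
    return analysis


print(modify_tags(",<clb>"))             # ,<cm>
print(modify_tags("Oslo<n><prop><sg>"))  # Oslo<np><sg>
print(modify_tags(",+CLB"))              # ,+CLB
```

The last call shows why the ordering matters: a rule written against <clb> does nothing to a tag that has not yet been through the lowercasing step.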


gt2apertium.cg3r

This file relabels the compiled CG. The "input" side should be single tags that look exactly as they do in the cg3 source file, and the "output" side should be one or more Apertium tags. Sets are possible on the "output" side, but only single tags on the "input" side.

The format is "MAP input output;", e.g. "MAP (Sg) (sg);".

A more complicated example: "MAP (N) (np) OR (n);" turns all N tags in the compiled CG into a set of (np) or (n), since the Giellatekno N tag covers both proper and common nouns.


This file needs to contain all single-tag changes as well; there are no auto-generated relabellings.
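
The statement format can be illustrated with a small Python reader that splits a MAP statement into its single input tag and its output alternatives. This is a sketch for understanding and checking the format only, not the tool that performs the relabelling, and it assumes each statement sits on one line. The names MAP_RE, ALT_RE and parse_map are hypothetical.

```python
import re

# A single input tag in parentheses, then the output side, then ";".
MAP_RE = re.compile(r"MAP\s+\(([^\s()]+)\)\s+(.+?)\s*;")
# One parenthesised tag list on the output side.
ALT_RE = re.compile(r"\(([^()]*)\)")


def parse_map(statement: str):
    """Return (input_tag, [[output tags], ...]) for one MAP statement."""
    match = MAP_RE.fullmatch(statement.strip())
    if match is None:
        # Also reached when the input side holds more than one tag.
        raise ValueError(f"not a single-tag MAP statement: {statement!r}")
    source_tag, output = match.groups()
    alternatives = []
    for part in re.split(r"\s+OR\s+", output):
        alt = ALT_RE.fullmatch(part.strip())
        if alt is None:
            raise ValueError(f"malformed output side: {part!r}")
        alternatives.append(alt.group(1).split())
    return source_tag, alternatives


print(parse_map("MAP (Sg) (sg);"))        # ('Sg', [['sg']])
print(parse_map("MAP (N) (np) OR (n);"))  # ('N', [['np'], ['n']])
```

A statement with several tags on the input side, such as "MAP (N Sg) (n);", is rejected with a ValueError, matching the restriction described above.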

Testing and troubleshooting

Getting the tags right often takes a lot of time and patience. Here are some tips for working out which file you need to look in.

Apertium side

Check the trimmed analyser

$ echo , | hfst-lookup xxx-yyy.automorf.hfst 
,	,<cm>	0,000000

If this comes out as an unknown word, then the fault is probably in your bilingual dictionary apertium-xxx-yyy.xxx-yyy.dix. You can grep for the analysis using something like:

$ lt-expand apertium-xxx-yyy.xxx-yyy.dix | grep '<cm>'
,<cm>:,<cm>

If the word comes out but has the wrong tag, then the fault is probably in one of the Giellatekno relabel scripts; see the files in giella-xxx/tools/mt/apertium/tagsets described above.

Check the untrimmed analyser

$ echo , | hfst-lookup .deps/xxx.automorf.hfst
,	,<cm>	0,000000

If the word comes out with the wrong tag here, then again the fault is probably in one of the Giellatekno relabel scripts; see the files in giella-xxx/tools/mt/apertium/tagsets described above.

Giellatekno side

Check the relabelled analyser

$ echo , | hfst-lookup tools/mt/apertium/analyser-mt-apertium-desc.yyy.hfstol 
,	,<cm>	0,000000

If the word comes out with the wrong tag here, then again the fault is probably in one of the Giellatekno relabel scripts; see the files in giella-xxx/tools/mt/apertium/tagsets described above.

If you have fixed the relabel scripts and it still isn't working, then it could be because you are not calling the relabel scripts in tools/mt/apertium/tagsets/Makefile.am, or it could be because you have some multicharacter symbol which is not declared in src/morphology/root.lexc.

Check the unrelabelled analyser

$ echo , | hfst-lookup src/analyser-gt-desc.hfstol
,	,+CLB	0,000000

If the word comes out with the wrong tag here, then the fault is probably in one of the lexc source files, which you can find in giella-xxx/src/morphology/.
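
The checks above can be condensed into a small decision helper. The Python sketch below parses one tab-separated lookup line of the kind shown in the examples and suggests where to look next. It does not call hfst-lookup itself. The function names are hypothetical, and treating an analysis ending in "+?" as an unknown word is an assumption about the lookup output, not something stated on this page.

```python
def parse_lookup_line(line: str):
    """Split one lookup output line into (surface, analysis, weight)."""
    surface, analysis, weight = line.rstrip("\n").split("\t")
    # The examples show a decimal comma ("0,000000"), so normalise it.
    return surface, analysis, float(weight.replace(",", "."))


def where_to_look(trimmed_analysis: str, expected_tag: str, raw_analysis_ok: bool) -> str:
    """Suggest which files to inspect, following the checks described above."""
    # Assumption: an unknown word is reported with a trailing "+?".
    if trimmed_analysis.endswith("+?"):
        return "bilingual dictionary"
    if expected_tag in trimmed_analysis:
        return "nothing to fix"
    # raw_analysis_ok: whether the unrelabelled analyser gave the right tag.
    if not raw_analysis_ok:
        return "lexc sources in src/morphology"
    return "relabel files in tools/mt/apertium/tagsets"


surface, analysis, weight = parse_lookup_line(",\t,<clb>\t0,000000")
print(where_to_look(analysis, "<cm>", raw_analysis_ok=True))
```

With these inputs the helper points at the relabel files, because the word is known, the tag is wrong, and the unrelabelled analysis was reported as correct.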

Pro tips

  • Don't do tag substitution in your constraint grammar. That means not using SUBSTITUTE rules to change tags. If you are missing a reading that you would like to select, add it to your morphological analyser.
  • Do convert to Apertium-style tags and analysis style wherever humanly feasible.