Difference between revisions of "Lexikograf.sh"

Latest revision as of 00:10, 18 March 2012

Intention

This little script aims to facilitate adding new words to dictionaries of a language pair which uses HFST.

Usage

It is intended to be used in the following way.

Suppose that the morphological transducer for Tatar in the Tatar-Bashkir pair doesn't recognize the word 'укытучы', which means "teacher". We have to:

  1. add this word to apertium-tt-ba.tt.lexc,
  2. translate it in apertium-tt-ba.tt-ba.dix[1] and
  3. add the translation to apertium-tt-ba.ba.lexc. For languages as closely related as Tatar and Bashkir, most likely all the categories and the names of the continuation classes will remain the same.
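
Before generating entries, it can help to check whether the stem is already present in the bidix or in the Bashkir lexc file, as note 1 suggests. The following is a minimal sketch of such a check, not part of lexikograf.sh; the function names `stem_in_file` and `check_stem` are hypothetical, and the search is a plain substring match, so the result still needs a human look.

```shell
#!/bin/bash
# Sketch only: stem_in_file and check_stem are hypothetical helper names.

# Succeed if the stem occurs anywhere in the file.
# Plain substring search: a stem contained in a longer word also matches.
stem_in_file() {
    grep -qF -- "$1" "$2"
}

# Print one status line per file for the given stem.
check_stem() {
    local stem=$1 f
    shift
    for f in "$@"; do
        if stem_in_file "$stem" "$f"; then
            printf '%s: already in %s\n' "$stem" "$f"
        else
            printf '%s: missing from %s\n' "$stem" "$f"
        fi
    done
}
```

It could be called as `check_stem укытучы apertium-tt-ba.tt-ba.dix apertium-tt-ba.ba.lexc` from the directory of the language pair.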

The directory containing lexikograf.sh should also contain a file called add-them.txt, to which you add words in exactly the same form as you would add them to the lexc file of the given language. In our case:

укытучы NLEX ; [2]

After that, running lexikograf.sh generates three text files: to-tatar-lexc.txt, to-bidix.txt and to-bashkir-lexc.txt. In our case, each file will contain only one line, respectively:

!укытучы NLEX ; ! ""

<e><p><l>укытучы<s n="n"/></l><r><s n="n"/></r></p></e> [3]

! NLEX ; ! "укытучы"

After that, one has to do the following:

  • add the Bashkir translations in to-bidix.txt (and additional tags if they don't match entirely, e.g. a different gender or a difference in the transitivity of verbs; this is hard to imagine in the Tatar-Bashkir pair, actually):
    • <e><p><l>укытучы<s n="n"/></l><r>уҡытыусы<s n="n"/></r></p></e>
  • add this translation to to-bashkir-lexc.txt. Again, you only need to write the word once: the next script, duplicate.sh, will take care of it and turn the entries !укытучы NLEX ; ! "" and !уҡытыусы NLEX ; ! "укытучы" into !укытучы:укытучы NLEX ; ! "" and !уҡытыусы:уҡытыусы NLEX ; ! "укытучы" respectively. These lines remain commented out so that you cannot add them to the actual .lexc files without proofreading them.
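
The duplication step described in the last bullet can be sketched as follows. The real duplicate.sh is not shown on this page, so this is only an assumption about how it might work; the function name `duplicate_stem` is hypothetical, and stems containing an escaped space (`% `) are not handled here.

```shell
#!/bin/bash
# Sketch only: duplicate_stem is a hypothetical name, not the real duplicate.sh.
# Turns   !уҡытыусы NLEX ; ! "укытучы"
# into    !уҡытыусы:уҡытыусы NLEX ; ! "укытучы"
# The output line stays commented out with a leading "!".
duplicate_stem() {
    local line=$1 body stem rest
    body=${line#\!}        # drop the leading comment mark, if any
    stem=${body%% *}       # everything up to the first space
    rest=${body#* }        # everything after the first space
    # Leave the line alone if there is no stem yet, no space at all,
    # or the stem is already written as upper:lower.
    if [[ -z $stem || $body != *" "* || $stem == *:* ]]; then
        printf '%s\n' "$line"
    else
        printf '!%s:%s %s\n' "$stem" "$stem" "$rest"
    fi
}
```

A whole file could then be processed with `while IFS= read -r l; do duplicate_stem "$l"; done < to-bashkir-lexc.txt`.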

Code

Here is the script itself. If you like the idea and think that it could indeed be useful, feel free to modify it: define some variables so that it can be used with any language pair, etc. Be bold :)

#!/bin/bash

# This little tool aims to facilitate adding new stems to the dictionaries of a language pair using HFST
# by reducing the amount of typing work
# see http://wiki.apertium.org/wiki/Lexikograf.sh for more details

# for every line of add-them.txt
#do

## match the first word (=initial letters till the first space) of the line and store it in $one
### the first space can be escaped by a %. In such case skip it and match all letters till the second space

## match rest of the line to $rest

# print '$LINE ! ""' to to-tatar-lexc.txt

# print '<e><p><l>$one<s n="n"/></l><r><s n="n"/></r></p></e>' to to-bidix.txt

# print '$rest ! "$one"' to to-bashkir-lexc.txt

# duplicate words in lexc entries and comment them out    
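
The outline above could be turned into working bash roughly as follows. This is a sketch under assumptions, not the author's implementation: the function names `split_entry` and `generate` and the regular expression are made up for illustration, the bidix tag is hard-coded to `n` as in the example, and the output lines are written already commented out so that they match the sample lines in the Usage section.

```shell
#!/bin/bash
# Sketch only: function names, the regex and the hard-coded "n" tag are assumptions.

# Split one add-them.txt line into $one (the stem) and $rest (everything after it).
# A space escaped as "% " is kept inside the stem.
split_entry() {
    local re='^(([^ %]|%.)+) +(.*)$'
    [[ $1 =~ $re ]] || return 1
    one=${BASH_REMATCH[1]}
    rest=${BASH_REMATCH[3]}
}

# Read the input file and write the three output files in the current directory.
generate() {
    local input=${1:-add-them.txt} line one rest
    : > to-tatar-lexc.txt
    : > to-bidix.txt
    : > to-bashkir-lexc.txt
    while IFS= read -r line || [[ -n $line ]]; do
        split_entry "$line" || continue   # skip blank or malformed lines
        printf '!%s ! ""\n' "$line" >> to-tatar-lexc.txt
        printf '<e><p><l>%s<s n="n"/></l><r><s n="n"/></r></p></e>\n' "$one" >> to-bidix.txt
        printf '! %s ! "%s"\n' "$rest" "$one" >> to-bashkir-lexc.txt
    done < "$input"
}
```

Calling `generate` in the directory that holds add-them.txt would overwrite the three output files. Stems with an escaped space are copied into the bidix entry unchanged, which would still need manual editing.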

Notes

  1. We are adding a word which is not recognized by the morphological analyzer of Tatar, and is therefore not present in tt.lexc. But this does not necessarily mean that this word and its translation are also missing from tt-ba.dix and ba.lexc. So it is a good idea to check this first and only then generate the entries.
  2. Actually, укытучы:укытучы Ninfl ; is a better way of writing this, and the statement above that you should add stems to add-them.txt in exactly the same form as you would add them to the actual .lexc file is not entirely true. But writing them this way obviously saves time. Don't worry, lexikograf.sh will generate entries in the preferred format.
  3. The tags used in tt-ba.dix should be matched against the Root lexicon names, so that the script knows which symbol it has to put in the bidix inside <s n="?">. A simpler solution would be to leave these tags empty, but this is definitely not a good idea.
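
The matching mentioned in note 3 might be sketched as a small lookup from continuation class to bidix symbol. Only the NLEX to `n` pair comes from the example on this page; the function names `bidix_tag` and `bidix_entry` are hypothetical, and any further classes would have to be filled in from the actual .lexc files.

```shell
#!/bin/bash
# Sketch only: bidix_tag and bidix_entry are hypothetical names.

# Map a lexc continuation class to the symbol used in <s n="..."/>.
# Only NLEX -> n is taken from this page; add other classes by hand.
bidix_tag() {
    case $1 in
        NLEX) printf 'n\n' ;;
        *)    printf 'unknown class: %s\n' "$1" >&2; return 1 ;;
    esac
}

# Build a bidix entry with an empty right side for a stem and its class.
bidix_entry() {
    local stem=$1 tag
    tag=$(bidix_tag "$2") || return 1
    printf '<e><p><l>%s<s n="%s"/></l><r><s n="%s"/></r></p></e>\n' "$stem" "$tag" "$tag"
}
```

Failing on an unknown class, rather than writing an empty tag, keeps incomplete entries out of to-bidix.txt.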