This little script aims to facilitate adding new words to dictionaries of a language pair which uses HFST.
It is intended to be used in the following way.
Suppose that the morphological transducer for Tatar in the Tatar-Bashkir pair doesn't recognize the word 'укытучы', which means "teacher". We have to:
- add this word to the Tatar lexc file (tt.lexc)
- translate it in the bilingual dictionary (tt-ba.dix)
- add the translation to apertium-tt-ba.ba.lexc.
For such closely related languages as Tatar and Bashkir, most likely all the categories and the names of continuation classes will remain the same.
In the directory containing lexikograf.sh there should be a file called add-them.txt, to which you add words in exactly the same form as you would add them to the lexc file of a given language, in our case:
укытучы NLEX ;
After that, if you run lexikograf.sh, it will generate three text files: to-tatar-lexc.txt, to-bidix.txt and to-bashkir-lexc.txt. In our case, each file will contain only one line:
!укытучы NLEX ; ! ""
<e><p><l>укытучы<s n="n"/></l><r><s n="n"/></r></p></e>
! NLEX ; ! "укытучы"
After that, one obviously has to do the following:
- add Bashkir translations in to-bidix.txt (and additional tags if they don't match entirely, e.g. different gender or a difference in the transitivity of verbs, although that is hard to imagine in the Tatar-Bashkir pair):
<e><p><l>укытучы<s n="n"/></l><r>уҡытыусы<s n="n"/></r></p></e>
- add this translation to to-bashkir-lexc.txt. Again, you need to write the word only once; the next script, duplicate.sh, will take care of it and turn the entries
!укытучы NLEX ; ! ""
and
!уҡытыусы NLEX ; ! "укытучы"
into
!укытучы:укытучы NLEX ; ! ""
and
!уҡытыусы:уҡытыусы NLEX ; ! "укытучы"
respectively. These lines remain commented out so that you cannot add them to the actual .lexc files without proofreading them.
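The duplication step described above can be sketched with a single sed substitution. This is only an illustration of the idea, not the actual duplicate.sh: the function name duplicate is hypothetical, and it assumes that every entry starts with the comment mark "!" and that a % escapes the character after it, so an escaped space stays part of the stem.

```shell
# duplicate [FILE...]: rewrite '!stem CLASS ; ...' as '!stem:stem CLASS ; ...'.
# Hypothetical helper, not the original duplicate.sh.
# The stem is everything between the leading '!' and the first unescaped space;
# '%' followed by any character (for example '% ') counts as part of the stem.
# Lines with no stem yet (such as '! NLEX ; ! "..."') are left untouched.
duplicate() {
    sed -E 's/^!(([^ %]|%.)+) /!\1:\1 /' "$@"
}
```

Called as duplicate to-tatar-lexc.txt to-bashkir-lexc.txt, it prints the rewritten entries to standard output and leaves the files themselves unchanged, so the result can still be proofread before it goes anywhere.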
Here is the script itself. If you like the idea and think that it could indeed be useful, feel free to modify it: define some variables so that it can be used with any language pair, etc. Be bold :)
#!/bin/bash

# This little tool aims to facilitate adding new stems to dictionaries of a language pair using HFST
# by reducing the amount of typing work
# see http://wiki.apertium.org/wiki/Lexikograf.sh for more details

# for every line of add-them.txt
#do
## match the first word (=initial letters till the first space) of the line and store it in $one
### the first space can be escaped by a %. In such case skip it and match all letters till the second space
## match rest of the line to $rest
# print '$LINE ! ""' to to-tatar-lexc.txt
# print '<e><p><l>$one<s n="n"/></l><r><s n="n"/></r></p></e>' to to-bidix.txt
# print '$rest ! "$one"' to to-bashkir-lexc.txt

# duplicate words in lexc entries and comment them out
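The commented steps can be turned into working shell along these lines. This is a minimal sketch, not the original implementation: the function names split_entry and generate are made up for illustration, the n tag is hard-coded as in the comments, and a % before a space is assumed to escape that space wherever it occurs in the stem. Multiword stems are copied into the bidix entry as they are, which would still need manual cleanup.

```shell
# split_entry LINE: put the stem into $one and the remainder into $rest.
# Hypothetical helper. A space preceded by % does not end the stem.
split_entry() {
    one=""
    rest=$1
    while :; do
        case $rest in
            *" "*) ;;                                # a space is still left
            *) one=$one$rest; rest=""; return ;;     # no separator at all
        esac
        head=${rest%%" "*}      # text before the next space
        rest=${rest#*" "}       # text after it
        one=$one$head
        case $head in
            *%) one="$one " ;;                 # escaped space: keep it in the stem
            *)  rest=" $rest"; return ;;       # real separator: stem is complete
        esac
    done
}

# generate FILE: append one commented-out line per entry to each output file.
generate() {
    while IFS= read -r line || [ -n "$line" ]; do
        [ -n "$line" ] || continue             # skip empty lines
        split_entry "$line"
        printf '!%s ! ""\n' "$line" >> to-tatar-lexc.txt
        printf '<e><p><l>%s<s n="n"/></l><r><s n="n"/></r></p></e>\n' "$one" >> to-bidix.txt
        printf '!%s ! "%s"\n' "$rest" "$one" >> to-bashkir-lexc.txt
    done < "$1"
}
```

Running generate add-them.txt in the directory that holds add-them.txt appends to the three output files, so entries from earlier runs are kept until you delete the files yourself.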
- We are adding a word which is not recognized by the morphological analyzer of Tatar and is therefore not present in tt.lexc. But this does not necessarily mean that the word and its translation are also missing from tt-ba.dix and ba.lexc. So it would be a good idea to check that first and only then generate entries.
- укытучы:укытучы Ninfl ; is a better way of writing this, so the statement above that you should add stems to add-them.txt in exactly the same form as you would add them to the actual .lexc file is not absolutely true. But writing them the short way obviously saves time. Don't worry, lexikograf.sh will generate entries in the preferred format.
- The tags used in tt-ba.dix should be matched against Root Lexicon names, so that the script knows which symbol it has to put in the bidix inside <s n="?">. A simpler solution would be to leave these tags empty, but that is definitely not good.
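Two of the points above lend themselves to a short sketch: checking whether a stem is already present in the other files, and looking up the bidix tag from the lexicon name. All function names here are hypothetical. The presence check is a plain substring search, so it can report false positives, and the only mapping taken from this page is NLEX to n; any other lexicon name would have to be added by hand.

```shell
# missing_from STEM FILE...: print every FILE that has no line containing STEM.
# Crude check: STEM is searched for as a fixed substring.
missing_from() {
    stem=$1
    shift
    for f in "$@"; do
        grep -qF -- "$stem" "$f" || printf '%s\n' "$f"
    done
}

# tag_for LEXICON: print the bidix symbol for a lexicon name, or fail.
# Only NLEX -> n comes from the example on this page; extend the list as needed.
tag_for() {
    case $1 in
        NLEX) printf 'n\n' ;;
        *)    return 1 ;;
    esac
}

# bidix_entry STEM LEXICON: print an entry with the right side left for the translation.
# Refuses to write an entry with an empty tag.
bidix_entry() {
    tag=$(tag_for "$2") || { printf 'no tag known for %s\n' "$2" >&2; return 1; }
    printf '<e><p><l>%s<s n="%s"/></l><r><s n="%s"/></r></p></e>\n' "$1" "$tag" "$tag"
}
```

For example, missing_from укытучы tt-ba.dix ba.lexc lists the files that still need an entry, and bidix_entry укытучы NLEX prints the same line that lexikograf.sh writes to to-bidix.txt.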