Lexikograf.sh

From Apertium
Revision as of 22:04, 17 March 2012 by Ilnar.salimzyan

Intention

This little script aims to facilitate adding new words to dictionaries of a language pair which uses HFST.

Usage

It is intended to be used in the following way.

Suppose that the morphological transducer for Tatar in the Tatar-Bashkir pair doesn't recognise the word 'укытучы', which means "teacher". We have to:

  1. add this word to apertium-tt-ba.tt.lexc,
  2. translate it in apertium-tt-ba.tt-ba.dix and
  3. add the translation to apertium-tt-ba.ba.lexc. For languages as closely related as Tatar and Bashkir, most likely all the categories and the names of continuation classes will remain the same.

The directory containing lexikograf.sh should also contain a file called add-them.txt, to which you add words in exactly the same form as you would add them to the lexc file of the given language, in our case:

укытучы NLEX ; [1]

After that, running lexikograf.sh generates three text files: to-tatar-lexc.txt, to-bidix.txt and to-bashkir-lexc.txt. In our case, each file will contain only one line, respectively:

!укытучы NLEX ; ! ""

<e><p><l>укытучы<s n="n"/></l><r><s n="n"/></r></p></e>

! NLEX ; ! "укытучы"

After that, one obviously has to do the following:

  • add the Bashkir translations in to-bidix.txt (and additional tags if they don't match entirely, e.g. different gender, difference in transitivity of verbs etc., although that is hard to imagine in the Tatar-Bashkir pair):
    • <e><p><l>укытучы<s n="n"/></l><r>уҡытыусы<s n="n"/></r></p></e>
  • add this translation to to-bashkir-lexc.txt. Again, you only have to write the word once; the next script, duplicate.sh, will take care of the rest and turn the entries !укытучы NLEX ; ! "" and !уҡытыусы NLEX ; ! "укытучы" into !укытучы:укытучы NLEX ; ! "" and !уҡытыусы:уҡытыусы NLEX ; ! "укытучы" respectively. These lines remain commented out so that you cannot add them to the actual .lexc files without proofreading them.
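
The duplication step described above could be sketched roughly as follows. duplicate.sh itself is not shown on this page, so the function name duplicate_stems and the sed expression are assumptions made for illustration, not the actual script. The sketch treats everything between the leading ! and the first unescaped space as the stem ("% " counts as an escaped space, as in lexc) and leaves lines alone if the stem is still empty or already contains a colon:

```shell
# Hypothetical stand-in for duplicate.sh (name and regex are assumptions).
# Reads commented-out lexc entries on stdin and rewrites
#   !stem CLASS ; ...   as   !stem:stem CLASS ; ...
# Lines with an empty stem or an existing "stem:stem" pair are left untouched.
duplicate_stems() {
    sed -E 's/^!(([^ %:]|%.)+) /!\1:\1 /'
}

printf '%s\n' '!укытучы NLEX ; ! ""' '!уҡытыусы NLEX ; ! "укытучы"' | duplicate_stems
# prints:
# !укытучы:укытучы NLEX ; ! ""
# !уҡытыусы:уҡытыусы NLEX ; ! "укытучы"
```

The output is still commented out, so the entries have to be proofread and uncommented by hand before they go into the real .lexc files.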

Code

Here is the script itself. If you like the idea and think that it can indeed be useful, feel free to modify it: it was written by an absolute newbie and won't work with any other language pair without modification (defining some variables etc.). Be bold :)

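#!/bin/bash

# This little tool aims to facilitate adding new stems to dictionaries of a language pair using HFST
# by reducing the amount of typing work
# see http://wiki.apertium.org/wiki/Lexikograf.sh for more details

# Minimal sketch; assumes add-them.txt is in the current directory, that every stem is a noun
# (tag n), and that the three output files may be overwritten on each run
: > to-tatar-lexc.txt; : > to-bidix.txt; : > to-bashkir-lexc.txt
re='^(([^ %]|%.)*)(.*)$'   # group 1: first word, group 3: rest of the line

# for every line of add-them.txt
while IFS= read -r LINE || [ -n "$LINE" ]
do
    [ -z "$LINE" ] && continue   # skip empty lines

    ## match the first word (=initial letters till the first space) of the line and store it in $one
    ### a space can be escaped by a %. In such case skip it and match all letters till the next unescaped space
    [[ $LINE =~ $re ]]
    one=${BASH_REMATCH[1]}

    ## match rest of the line to $rest
    rest=${BASH_REMATCH[3]}

    # print '!$LINE ! ""' to to-tatar-lexc.txt
    printf '!%s ! ""\n' "$LINE" >> to-tatar-lexc.txt

    # print '<e><p><l>$one<s n="n"/></l><r><s n="n"/></r></p></e>' to to-bidix.txt
    printf '<e><p><l>%s<s n="n"/></l><r><s n="n"/></r></p></e>\n' "$one" >> to-bidix.txt

    # print '!$rest ! "$one"' to to-bashkir-lexc.txt
    printf '!%s ! "%s"\n' "$rest" "$one" >> to-bashkir-lexc.txt
done < add-them.txt

# the lexc entries are written commented out (leading '!');
# duplicating the words (word:word) is done afterwards by duplicate.sh, see Usage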

Notes

  1. Actually, укытучы:укытучы Ninfl ; is a better way of writing this, and the statement above that you should add stems to add-them.txt in exactly the same form as you would add them to the actual .lexc file is not entirely true. But writing them this way obviously saves time. Don't worry, lexikograf.sh will generate entries in the preferred format.