Difference between revisions of "English and Kazakh"
Revision as of 14:26, 22 November 2013
Starting work on Apertium English to Kazakh
These notes are basically for Anel, Aizhan and Assem who have started to develop this language pair... And Aida too...
Installing what is needed
Operating System
Install a suitable GNU/Linux system such as Debian, Ubuntu, Mint...
Install build essentials, etc.
Open a terminal window and type
sudo apt-get install subversion build-essential g++ pkg-config gawk libxml2 \
libxml2-dev libxml2-utils xsltproc flex automake autoconf libtool libpcre3-dev \
cmake libicu-dev libboost-dev libgoogle-perftools-dev bison libreadline-dev zlib1g-dev
Enter your password and wait until the packages are downloaded and installed.
If you don't already have a directory for sources, make one in your home directory and enter it:
cd ~
mkdir Source
cd Source
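Before building anything, it can be worth confirming that the tools installed above are actually on the PATH. The following is only a minimal sketch: the function name missing_tools and the selection of commands to check are assumptions for illustration, not part of Apertium.

```shell
# Print the name of every command given as an argument that is not on PATH.
# Always returns 0, so it is safe inside scripts that run with "set -e".
missing_tools() {
    for tool in "$@"; do
        if ! command -v "$tool" >/dev/null 2>&1; then
            printf '%s\n' "$tool"
        fi
    done
    return 0
}

# Commands provided by the packages installed above (the selection is an assumption).
missing_tools svn g++ make pkg-config gawk xsltproc flex bison autoconf automake cmake
```

If nothing is printed, all the listed commands were found; any name that is printed points to a package that probably did not install correctly.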
Install HFST
This language pair uses the Helsinki Finite State Toolkit for Kazakh generation, so we need to install it, and its dependencies. (But OpenFST is now included with HFST, so there is no longer a need to install OpenFST separately.)
Install Foma
- Main article: Foma
svn checkout http://foma.googlecode.com/svn/trunk/foma/ foma
cd foma
make
sudo make install
cd ..
Install HFST
- Main article: HFST
svn co https://svn.code.sf.net/p/hfst/code/trunk/hfst3
cd hfst3/
./autogen.sh
scripts/generate-cc-files.sh # It's OK if this step fails
./configure --enable-lexc --with-foma --disable-tagger --enable-proc
make
sudo make install
sudo ldconfig
cd ..
Troubleshooting
When running "make" with old autotools (pre-1.14?), you may get:
make[5]: *** No rule to make target `xre_parse.hh', needed by `xre_lex.ll'. Stop.
Run scripts/generate-cc-files.sh and then make again.
Install VISLCG3
- Main article: Apertium and Constraint Grammar
svn co http://beta.visl.sdu.dk/svn/visl/tools/vislcg3/trunk vislcg3
cd vislcg3
./cmake.sh
make -j3
sudo make install
cd ..
Download apertium, lttoolbox and eng-kaz data from SVN
- Main article: Minimal installation from SVN
cd ~/Source
svn co https://svn.code.sf.net/p/apertium/svn/trunk/lttoolbox
svn co https://svn.code.sf.net/p/apertium/svn/trunk/apertium
svn co https://svn.code.sf.net/p/apertium/svn/trunk/apertium-lex-tools
svn co https://svn.code.sf.net/p/apertium/svn/trunk/apertium-tools
svn co https://svn.code.sf.net/p/apertium/svn/languages/apertium-kaz
svn co https://svn.code.sf.net/p/apertium/svn/incubator/apertium-eng-kaz
Compile and install lttoolbox
cd lttoolbox/
PKG_CONFIG_PATH=/usr/local/lib/pkgconfig ./autogen.sh
make
sudo make install
sudo ldconfig
Compile and install apertium
cd ..
cd apertium/
PKG_CONFIG_PATH=/usr/local/lib/pkgconfig ./autogen.sh
make
sudo make install
sudo ldconfig
Compile and install apertium-lex-tools
cd ..
cd apertium-lex-tools
PKG_CONFIG_PATH=/usr/local/lib/pkgconfig ./autogen.sh
make
sudo make install
sudo ldconfig
Install Kazakh language
cd ..
cd apertium-kaz
./autogen.sh
make
Install English--Kazakh language pair data from incubator
cd ..
cd apertium-eng-kaz/
PKG_CONFIG_PATH=/usr/local/lib/pkgconfig ./autogen.sh --with-lang2=$HOME/Source/apertium-kaz
make
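Once the pair has compiled, a quick way to see whether it works is to push one sentence through it. The sketch below assumes that the build created a translation mode called eng-kaz in the pair directory (the name is inferred from the directory name and may differ). The function try_translate and the APERTIUM_BIN variable are hypothetical helpers, not part of Apertium.

```shell
# Translate one sentence with the freshly compiled pair, without installing it.
# APERTIUM_BIN is a hypothetical override for the name of the apertium command.
try_translate() {
    bin="${APERTIUM_BIN:-apertium}"
    if ! command -v "$bin" >/dev/null 2>&1; then
        echo "$bin not found; was 'sudo make install' run in apertium/?" >&2
        return 127
    fi
    # -d points apertium at the directory holding the compiled data and modes;
    # the mode name "eng-kaz" is an assumption based on the directory name.
    echo "$1" | "$bin" -d "$HOME/Source/apertium-eng-kaz" eng-kaz
}

# Usage: try_translate "I am from Kazakhstan"
```

Any Kazakh output at all, even a poor translation, suggests that the pipeline is wired up correctly.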
Troubleshooting
If you get:
lt-comp: error while loading shared libraries: liblttoolbox3-3.2.so.0: cannot open shared object file: No such file or directory
Then you should do:
sudo ldconfig
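To see whether the linker cache now knows about the library, its contents can be listed and searched. This is a minimal sketch: lib_listed is a hypothetical helper name, and the /sbin/ldconfig path is an assumption that holds on Debian-like systems.

```shell
# Succeed if the library listing read from standard input mentions $1.
lib_listed() {
    grep -q -- "$1"
}

# ldconfig -p prints the libraries currently in the dynamic linker cache.
# /sbin is spelled out because it is not always on a normal user's PATH.
if /sbin/ldconfig -p 2>/dev/null | lib_listed liblttoolbox; then
    echo "liblttoolbox is visible to the dynamic linker"
else
    echo "liblttoolbox not found; check /usr/local/lib and run 'sudo ldconfig' again"
fi
```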
Browse SVN
Here you can look at changes that have been made:
http://sourceforge.net/p/apertium/svn/HEAD/tree/incubator/apertium-eng-kaz/
Contact
IRC
Open up XChat (normally "Programs -> Internet -> XChat IRC") and type:
/server irc.freenode.net
/join #apertium
/join #hfst
To install xchat:
sudo apt-get install xchat
In Windows:
http://www.silverex.org/download/
Chat logs/archives: http://alpha.visl.sdu.dk/~tino/pisg/freenode/logs/
Mailing list
Email: apertium-turkic@lists.sourceforge.net
http://blog.gmane.org/gmane.science.linguistics.turkic.mt
November 2013 to-do list
- Check the constraint-grammar file for strange rules and also rules that may not be correct. Try to understand the rules we have.
- Write .t1x and .t2x code to deal with case (capitalization)
- Make sure all new rules have the correct superblank management
- That is, where a rule reorders items, all superblanks in the reordered area should go before the reordered area
 
- The copula, t2x rules: past copula with PP ("I was from Kazakhstan"), negative copula, adverbial adjective phrases in copula "The man is very large"
- NPs with adverbs, particularly "very" ("Three very beautiful children")
- 1-word chunks to always provide a translation for any English word
- even for stranded prepositions, so that we have a translation for "in" similar to "inside" etc., as if they were adverbs
 
- noun-noun compounds in NPs and PPs: add the most frequent ones to t1x (hard to do in t2x as prepositions are solved in t1x)
- interrogative sentences: Yes/no (-ba) and informative ("Where is my Kazakh dictionary?") → probably work for t2x
- 'which' as a determiner → precedes noun ("which house" → "қайсы үй") or genitive construct ("үйдің қайсысы")
 
- relatives (simple relatives: "the book that I wrote", "the book which I wrote" → "Мен жазған кітап"; adverbial relatives "when he came" → "Ол келгенде" [uses locative!])
- check some changes made to punctuation regular expressions in the English dictionary to solve mismatches with the Kazakh dictionary
- "-ing" is hard (check the appropriate section in Tagging_guidelines_for_English). It causes problems in "I like playing football" vs. "I like flying birds". Transitivity could be a clue? What to do in t1x and what in t2x? See also the famous ambiguity "Flying planes can be dangerous". Try to get as much as possible done with CG rules.
- Negative pronouns ("yesh" forms)
- write lexical selection to generate "yesh-" forms from "any-" forms in negatives or "bir" forms in questions, e.g. "do" "not" vblex.inf "anything" (dictionaries should be populated with alternatives)
 
- Choosing auxiliaries for present continuous ("be" → "bol" (default), "zhatyr", "otyr", etc.)
- deciding t1x versus t2x:
- NPs and PPs in t1x as long as possible (hard design choice, tedious work, code repetition, but...)
 
- Some adjective phrases like "num "years old"".
- Comparative constructs (more ADJ than NP → NP-dat karaganda ADJ-comp)
- Adverbial phrases: think on how to treat them similarly to PP in .t2x ("very quickly" is not that different from "in the park" when it comes to t2x reordering)
- (Partly done) Pseudo-modals "finish", "start", "love", "hate", "enjoy", "like", which take -ing and sometimes "to". Three possibilities to deal with them: (1) a long def-cat, (2) def-list and tests in rules, and (3) Jim's <exception> (dangerous!). Route: change new-gen-simple-verb macro with a def-list, to generate VP_psmod, translate "-ing" into NPs (as they take case, etc.) and write .t2x rules. Careful: "-ing" disambiguation is not too good.
What to take care of when writing rules
What are these systems going to be used for and how does this affect design?
It is quite unlikely that a system like this will ever be used for postediting the output into publishable text. The output is quite unlikely to be useful, particularly for sentences longer than a few words, as it would be very difficult to get the right word order.
It might be used, however, for:
- (1) interactive MT as in http://www.dlsi.ua.es/bbcat/?slang=eng&tlang=kaz ?
- (2) fuzzy-match repair: when a translator using a computer-aided translation system gets a very good fuzzy match from a translation memory, MT output can be used intelligently to find which parts of the target side need to be changed and actually change them (a thesis at the Universitat d'Alacant). This is because short segments may get very good translations.
- (3) assimilation or gisting (understanding what a text is about); the evaluation of this may be tricky but some Apertiumers have had interesting ideas: http://journal.sepln.org/sepln/ojs/ojs/index.php/pln/article/view/4867
Indeed, evaluation may be tricky in general.
Development should take these applications into account:
- (1) and (2): getting good translations for short segments (2, 3, 4 words) can be very helpful here
- (3): the idea here would be not to pay attention to features that do not impair understanding (e.g., English articles can be deleted; "of Kazakh constituents order acceptable may be", etc.). Good translations for short phrases (linguistically motivated segments) could be the key here
Questions we had, open issues
- morphology of reflexives ("Öz") → Mikel has to talk to Jonathan and Ilnar to make it work as in apertium-kir or as bir-bir reciprocals. Make morphology describe the real morphotactics of these forms.
- why do we have gender in Kazakh morphologies when gender is not represented? Make morphology describe the real morphotactics of these forms??? (Fran gave reasons for not doing so, check apertium-turkic)
November 2013 work done
Aida will complete and document this list
- Regression test is completed with new sentences 426/426
- Structural transfer
- Reported speech sentences
- Conditionals (first and second)
- "be" + adjective in present and past : "You are/were beautiful"
- "be" + PP: "I am from Kazakhstan"
- Rule for demonstrative pronouns
- Rule for negative pronouns ("nothing", "nobody", "anything" (only in negative sentences)), changing the verb to negative
- Def-list of pseudo-modal verbs (I LIKE/ENJOY/LOVE/HATE/START/FINISH playing) in .t1x and choosing them in gen-simple-verb macro
- Rule for "-ing" words as NP<subst> in .t1x, for example, "I like playing".
- "ing" + NP in .t2x
- Rule for "would"(.t1x) + NP(.t2x)
 
- Lexical selection
- One rule for "residence"
- Rules in CG
 
- Dictionary work
- Put some country and city names into apertium-eng-kaz.eng-kaz.dix as NP-TOP
- Added missing pronouns to bilingual dictionary
- Corrected verbs that were wrongly marked iv to tv (and tv to iv)
- Changed "would" <inf> to <past> in eng.dix
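A regression test like the one mentioned in the list above can be approximated with a small shell loop. This is only a sketch under stated assumptions: the tab-separated "input, expected output" file format and the function name run_regression are illustrative, and the regression tests for this pair may be organized differently.

```shell
# Run every "input<TAB>expected" line of a file through a translation command
# and print "passed/total". The file format and the function name are
# assumptions for illustration, not the layout used by the pair itself.
run_regression() {
    tests_file="$1"
    shift
    passed=0
    total=0
    tab="$(printf '\t')"
    while IFS="$tab" read -r input expected; do
        [ -n "$input" ] || continue
        total=$((total + 1))
        # The remaining arguments are the command that translates standard input.
        output="$(printf '%s\n' "$input" | "$@")"
        if [ "$output" = "$expected" ]; then
            passed=$((passed + 1))
        else
            printf 'FAIL: %s -> %s (expected %s)\n' "$input" "$output" "$expected" >&2
        fi
    done < "$tests_file"
    printf '%s/%s\n' "$passed" "$total"
}

# Usage (file name and mode name are assumptions):
#   run_regression tests.tsv apertium -d . eng-kaz
```

Failing sentences go to standard error, so the summary on standard output stays easy to compare between runs.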
 
Old stuff scheduled for removal
Some of this information is outdated and needs work, but make sure that everything is there before removing this part.
Postpositions
Apparently Kazakh has 5 kinds of postpositions, according to the case of the NP they follow. Some of those following genitive may be interpreted as "nouns" with a case, such as
бақшаның астында
garden-of bottom-in
garden.gen bottom.loc
"under the garden"
where астын is roughly the noun "bottom", much as in Basque "ortu-a-ren azpi-an" "azpi" is a noun.
With nominative (or base form)
Check this list:
- арқылы through
- туралы about
- секілді similarly to
- жөнінде about
With genitive
- астынан from below
- астында below (bottom-its-in)
- жанынан from beside (side-its-from)
- жанында beside (side-its-in)
With dative
- қарай (towards)
- арналған (intended for)
With ablative
- кейін behind, after
With instrumental
- қатар beside
- бірге together with
Starting work on Apertium Kazakh to English
General ideas
Try to translate as literally as possible in the first prototypes (do not have too many .t2x rules)
Make the most of existing CG-based PoS tagging (wait for instructions on how to use the apertium-kaz.kaz.rlx in apertium-kaz)
Detecting NPs and PPs
There is a lot of stuff in apertium-eng-kaz.kaz-eng.t1x already! We have to study it and check the following.
Main kinds of NPs:
- accusative and nominative → no preposition
- genitive → two solutions: N's N or N of N (attention genitive chains)
- dative → what should one do? (tricky)
- locative, ablative (make list) → PPs
- what to do with possessives (particularly 1st and 2nd person) to avoid double possessives in sentences with "mening", etc.
- Менің бақшам → my garden of me (!)
 
 
Composition of NPs: n, adj n, num n, num adj n, ...
Things to take care of:
- Decide if noun-based postpositions: artynda, ustinde, keyin, etc. will be detected in t1x. A list of lemmas would be necessary, or changes to bilingual dictionaries
- Make sure we generate plurals for numbers
- articles (use third-person possessives as a hint to generate definite articles)
- kalaning baqshasy → the garden of city
 
Detecting VPs
- simple verbs: decide on reasonable equivalents
- some may be hard to decide, such as generating future simple in English different from present
- whether to generate present or past perfect: only present perfect
 
- compound forms based on zhatyr, otyr, etc.
- generating negatives (have negative VPs detected separately or use logic (choose) inside t1x rules)
- gender in third-person pronouns (including 'öz' reflexives)
Loose list of problems
- constructions based on non-finite verbs (participles, etc.) (the problem of generating tense)
- reinserting the verb to be when the copula is missing
November 2013 work done
Aida will complete this part
- Regression test added
- Transfer
- Continuous tenses, simple tenses, negatives
- Subject pronouns (gender still an open issue)
- Nouns and adjectives
- Deleted n.attr from adjective definition
 
 
- Dictionaries
- changed "жатқан жоқ" in kaz.lexc and eng-kaz.dix to vaux-negative to catch negative present continuous
- added some words
 

