Latest revision as of 06:26, 27 May 2021

1 Starting work on Apertium English to Kazakh
2 Starting work on Apertium Kazakh to English
- 2.1 General ideas
3 Aida Sundetova's GSoC 2014: Adopting an unreleased English-Kazakh language pair
4 Work done before November
- 4.1 Results
  - 4.1.1 Vocabulary
- 4.2 Future work
5 November 2014 to do list
- 5.1 Example of transfer with apertium-eng-kaz
  - 5.1.1 Need to do
- 5.2 Work to do generally for English-Kazakh

Starting work on Apertium English to Kazakh

These notes are basically for Anel, Aizhan and Assem who have started to develop this language pair... And Aida too...

Installing what is needed

Operating System

Install a suitable GNU/Linux system such as Debian, Ubuntu, Mint...

Install vislcg3, hfst, apertium, lttoolbox essentials, etc.

Open a terminal window and type

curl -sS https://apertium.projectjj.com/apt/install-nightly.sh | sudo bash

sudo apt-get -f install locales build-essential \
 automake subversion pkg-config \
 gawk libtool apertium-all-dev

Enter your password when asked, and wait until the packages are downloaded and installed.
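
Once the packages are installed, a quick check that the command-line tools are actually on the PATH can save confusion later. The sketch below is only an illustration: `missing_tools` is a hypothetical helper name, and which tools to check (for example `apertium`, `lt-comp` and `lt-proc`) is an assumption you can adjust.

```shell
# Hypothetical helper: print every command name given as an argument
# that cannot be found on the PATH (prints nothing if all are present).
missing_tools() {
  for tool in "$@"; do
    command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
  done
}

# Example call (an empty result means everything was found):
#   missing_tools apertium lt-comp lt-proc
```

If a name is printed, the corresponding package is probably not installed or not on the PATH yet.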

If you don't already have a directory for sources, make one in your home directory and enter it:

cd ~
mkdir Source
cd Source

Download apertium-tools, apertium-kaz and apertium-eng-kaz from GitHub

Main article: Minimal installation from SVN

cd ~/Source
git clone https://github.com/apertium/apertium-tools.git
git clone https://github.com/apertium/apertium-kaz.git
git clone https://github.com/apertium/apertium-eng-kaz.git

Install Kazakh language

cd ~/Source/apertium-kaz
PKG_CONFIG_PATH=/usr/local/lib/pkgconfig ./autogen.sh
make

Install English--Kazakh language pair data from staging

cd ..
cd apertium-eng-kaz/
PKG_CONFIG_PATH=/usr/local/lib/pkgconfig ./autogen.sh --with-lang2=$HOME/Source/apertium-kaz
make

Troubleshooting

If you get:

lt-comp: error while loading shared libraries: liblttoolbox3-3.2.so.0: cannot open shared object file: No such file or directory

Then you should do:

sudo ldconfig

Browse SVN

Here you can look at changes that have been made:

http://sourceforge.net/p/apertium/svn/HEAD/tree/staging/apertium-eng-kaz/

Contact

IRC

Open up XChat (normally "Programs -> Internet -> XChat IRC") and type:

/server irc.oftc.net
/join #apertium
/join #hfst

To install xchat:

sudo apt-get install xchat

In Windows:

http://www.silverex.org/download/

Chat logs/archives: http://alpha.visl.sdu.dk/~tino/pisg/freenode/logs/

Mailing list

Email: apertium-turkic@lists.sourceforge.net

http://blog.gmane.org/gmane.science.linguistics.turkic.mt

November 2013 to-do list

Check the constraint-grammar file for strange rules and also rules that may not be correct. Try to understand the rules we have.
Write .t1x and .t2x code to deal with case (capitalization)
Make sure all new rules have the correct superblank management
- That is, where a rule reorders items, all superblanks in the reordered area should go before the reordered area
The copula, t2x rules: past copula with PP ("I was from Kazakhstan"), negative copula, adverbial adjective phrases in copula "The man is very large"
- added rule for past copula with PP
- rule with negative copula for PP in past and present
- solved by adding rule "preadv + adj" as AdjP
NPs with adverbs, particularly "very" ("Three very beautiful children")
- solved by adding rule "preadv + adj" and "preadv"
1-word chunks to always provide a translation for any English word
- solved partly by adding the prepositions before, behind, below, in, on, after, towards, through, under as adverbs, and rules in constraint grammar.
- even for stranded prepositions, so that we have a translation for "in" similar to "inside" etc., as if they were adverbs
noun-noun compounds in NPs and PPs: add the most frequent ones to t1x (hard to do in t2x as prepositions are solved in t1x)
- Solved by adding rules to t1x:
  - noun1 noun2
  - adjec noun1 noun2
  - det num noun1 noun2
  - num noun1 noun2
  - prep num noun1 noun2
  - num adjec noun1 noun2
  - det num adjec noun1 noun2
  - prep num adjec noun1 noun2
  - prep det num adjec noun1 noun2
interrogative sentences: Yes/no (-ba) and informative ("Where is my Kazakh dictionary?") → probably work for t2x
- special questions were done by adding rules:
  - Where/When + NP or PP, but not for WHAT
  - Simple questions Do, Did + NP or PP
  - open: how to write a rule for "Are/Were ...?" questions, since the rule for "Noun is/was" has the same pattern
- 'which' as a determiner → precedes noun ("which house" → "қайсы үй") or genitive construct ("үйдің қайсысы")
- "which" as adv-itg, "which house do you like?"
relatives (simple relatives: "the book that I wrote", "the book which I wrote" → "Мен жазған кітап"; adverbial relatives "when he came" → "Ол келгенде" [uses locative!])
- added rules for simple relatives(that,which)
- added rule for "when" and "which"
check some changes made to punctuation regular expressions in the English dictionary to solve mismatches with the Kazakh dictionary
"-ing" is hard (check the appropriate section in Tagging_guidelines_for_English). This gives problems in "I like playing football" vs. "I like flying birds". Transitivity could be a clue? What to do in t1x and what in t2x? Also "Flying planes can be dangerous", a famous ambiguity. Try to get as much as possible done with CG rules.
Negative pronouns ("yesh" forms)
- write lexical selection to generate "yesh-" forms from "any-" forms in negatives or "bir" forms in questions, e.g. "do" "not" vblex.inf "anything" (dictionaries should be populated with alternatives)
  - lexical selection for "anything" as "yeshnarse" for negative sentences and as "bir narse" for affirmative sentences.
Choosing auxiliaries for present continuous ("be" → "bol" (default), "zhatyr", "otyr" ,etc.)
- can't be solved by lexical selection
deciding t1x versus t2x:
- NPs and PPs in t1x as long as possible (hard design choice, tedious work, code repetition, but...)
Some adjective phrases like "num "years old"".
- added as NP phrase "num years old" - "* жаста"
Comparative constructs (more ADJ than NP → NP-dat karaganda ADJ-comp)
- is done, by 4 rules in t2x
- added comparative and superlative adj, in the biggest city as PP
Adverbial phrases: think on how to treat them similarly to PP in .t2x ("very quickly" is not that different from "in the park" when it comes to t2x reordering)
(Partly done) Pseudo-modals "finish", "start", "love", "hate", "enjoy", "like", which take -ing and sometimes "to". Three possibilities to deal with them: (1) a long def-cat, (2) a def-list and tests in rules, and (3) Jim's <exception> (dangerous!). Route: change the new-gen-simple-verb macro with a def-list to generate VP_psmod, translate "-ing" forms into NPs (as they take case, etc.) and write .t2x rules. Careful: -ing disambiguation is not very good.
Collect parallel kaz-eng corpora!

What to take care of when writing rules

What are these systems going to be used for and how does this affect design?

It is quite unlikely that a system like this will ever be used for postediting the output into publishable text. The output is quite unlikely to be useful, particularly for sentences longer than a few words, as it would be very difficult to get the right word order.

It might be used, however, for:

(1) interactive MT as in http://www.dlsi.ua.es/bbcat/?slang=eng&tlang=kaz ?
(2) fuzzy-match repair: when a translator using a computer-aided translation system gets a very good fuzzy match from a translation memory, MT output can be used intelligently to find which parts of the target side need to be changed and to actually change them (a thesis at the Universitat d'Alacant). This is because short segments may get very good translations.
(3) assimilation or gisting (understanding what a text is about); the evaluation of this may be tricky but some Apertiumers have had interesting ideas: http://journal.sepln.org/sepln/ojs/ojs/index.php/pln/article/view/4867

Indeed, evaluation may be tricky in general.

Development should take these applications into account:

(1) and (2): getting good translations for short segments (2, 3, 4 words) can be very helpful here
(3): the idea here would be not to pay attention to features that do not impair understanding (e.g., English articles can be deleted; "of Kazakh constituents order acceptable may be", etc.). Good translations for short phrases (linguistically motivated segments) could be the key here.

Questions we had, open issues

morphology of reflexives ("Öz") → Mikel has to talk to Jonathan and Ilnar to make it work as in apertium-kir or as bir-bir reciprocals. Make morphology describe the real morphotactics of these forms.
why do we have gender in Kazakh morphologies when gender is not represented? Make morphology describe the real morphotactics of these forms??? (Fran gave reasons for not doing so, check apertium-turkic)

November 2013 work done

Aida will complete and document this list

Regression test is completed with new sentences 426/426

Structural transfer
- Reported speech sentences
- Conditionals(First and Second)
- "be" + adjective in present and past : "You are/were beautiful"
- "be" + PP: "I am from Kazakhstan"
- Rule for demonstrative pronouns
- Rule for negative pronouns ("nothing", "nobody", "anything" (only in negative sentences)) that changes the verb to negative
- Def-list of pseudo-modal verbs (I LIKE/ENJOY/LOVE/HATE/START/FINISH playing) in .t1x, chosen in the gen-simple-verb macro
- Rule for "-ing" words as NP<subst> in .t1x, for example, "I like playing".
- "ing" + NP in .t2x
- Rule for "would"(.t1x) + NP(.t2x)

Lexical selection
- One rule for "residence"
- Rules in CG

Dictionary work
- Put some country and city names into apertium-eng-kaz.eng-kaz.dix as NP-TOP
- Added missing pronouns to bilingual dictionary
- Corrected verbs wrongly marked as iv that should be tv (and tv that should be iv)
- Changed "would" <inf> to <past> in eng.dix

Old stuff scheduled for removal

Some of this information is outdated and needs work, but make sure that everything is there before removing this part.

Postpositions

Apparently Kazakh has 5 kinds of postpositions, according to the case of the NP they follow. Some of those following genitive may be interpreted as "nouns" with a case, such as

бақшаның астында

garden-of bottom-in

garden.gen bottom.loc

"under the garden"

where астын is roughly the noun "bottom", much as in Basque "ortu-a-ren azpi-an" "azpi" is a noun.

With nominative (or base form)

Check this list:

арқылы through
туралы about
секілді similarly to
жөнінде about

With genitive

астынан from below
астында below (bottom-its-in)
жанынан from beside (side-its-from)
жанында beside (side-its-in)

With dative

қарай (towards)
арналған (intended for)

With ablative

кейін behind, after

With instrumental

қатар beside
бірге together with

Starting work on Apertium Kazakh to English

General ideas

Try to translate as literally as possible in the first prototypes (do not have too many .t2x rules)

Make the most of existing CG-based PoS tagging (wait for instructions on how to use the apertium-kaz.kaz.rlx in apertium-kaz)

Detecting NPs and PPs

There is a lot of stuff in apertium-eng-kaz.kaz-eng.t1x already! We have to study, and check the following.

Main kinds of NPs:

accusative and nominative → no preposition
genitive → two solutions: N's N or N of N (watch out for genitive chains)
dative → what should one do? (tricky)
locative, ablative (make list) → PPs
- what to do with possessives (particularly 1st and 2nd person) to avoid double possessives in sentences with "mening", etc.
  - Менің бақшам → my garden of me (!)

Composition of NPs: n, adj n, num n, num adj n, ...

Things to take care of:

Decide if noun-based postpositions: artynda, ustinde, keyin, etc. will be detected in t1x. A list of lemmas would be necessary, or changes to bilingual dictionaries
Make sure we generate plurals for numbers
articles (use third-person possessives as a hint to generate definite articles)
- kalaning baqshasy → the garden of city

Detecting VPs

simple verbs: decide on reasonable equivalents
- some may be hard to decide, such as generating future simple in English different from present
- whether to generate present or past perfect: only present perfect
compound forms based on zhatyr, otyr, etc.
generating negatives (have negative VPs detected separately or use logic (choose) inside t1x rules)
gender in third-person pronouns (including 'öz' reflexives)

Loose list of problems

constructions based on non-finite verbs (participles, etc.) (the problem of generating tense)
reinserting the verb to be when the copula is missing

November 2013 work done

Aida will complete this part

Regression test added
Transfer
- Continuous tenses, simple tenses, negatives
- Subject pronouns (gender still an open issue)
- Nouns and adjectives
  - Deleted n.attr from adjective definition

Dictionaries
- changed "жатқан жоқ" in kaz.lexc and eng-kaz.dix to vaux-negative to catch negative present continuous
- added some words

Aida Sundetova's GSoC 2014: Adopting an unreleased English-Kazakh language pair

Workplan

First work plan I prepared for proposal: http://wiki.apertium.org/wiki/User:Aida/Application

Before coding was started:

Total stems in apertium-eng-kaz.eng-kaz.dix: 3660

Chunk rules: 118
Interchunk rules: 99
Postchunk-and-cleanup rules: 6
CG rules: 202

New plan

Under the new plan, we focused on adding vocabulary from 4 corpora. Please see: http://wiki.apertium.org/wiki/English_and_Kazakh/Work_plan_(GSOC_2014)

Results

Vocabulary

Coverage of corpora now:

SETimes: 92.32%

EuroParl: 96.18%

NewsCommentary: 93.99%

Total stems in apertium-eng-kaz.eng-kaz.dix: 11071
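
The page does not say how these coverage figures were computed. One common way to estimate coverage is to count how many tokens in the analyser output receive at least one analysis. The sketch below assumes the Apertium stream format, in which an unknown token is printed as `^word/*word$`; the function name `naive_coverage` is hypothetical and the result is only a rough estimate.

```shell
# Hypothetical helper: read analyser output (Apertium stream format) on stdin
# and print the percentage of tokens that received at least one analysis.
# Assumption: unknown tokens look like ^word/*word$.
naive_coverage() {
  grep -o '\^[^$]*\$' |
    LC_ALL=C awk '
      { total++ }              # every ^...$ unit counts as one token
      /\/\*/ { unknown++ }     # "/*" marks a token without an analysis
      END { if (total) printf "%.2f\n", 100 * (total - unknown) / total }'
}
```

To use it, pipe the output of the morphological analyser for a corpus into `naive_coverage`.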

Transfer rules

The transfer rules that were needed were written while translating the texts taken for the coding challenge and the midterm evaluation; some cleaning and single-word rules were also written while cleaning testvoc.

Written rules:
- For single-word: vbmod, subs-ing, be-vblex, num-year, det-which
- Constructions for adv/adjec + verb: adjec to inf-verb, adv-itg to inf-verb, have + adv + been + verb-pp
- For years, "after 1920", etc. translating as "1920 жылДАН кейін": prep num-years.
- Rules for "unknown" words: if a word is not in the dictionary, the rules cannot match, so phrases like "the hargle house" → "*hargle үй" need their own patterns: det unknown noun; unknown (a single unknown word, translated as an NP); unknown noun2; prep det unknown adjec noun; prep det unknown noun; sup-adjec unknown nom.
- Interchunk rules
- Cleaning rules for pronouns, adjectives.

Testvoc

Tue Aug 5 22:02:59 ALMT 2014

POS	Total	Clean	Clean %
n	31166	31166	100
vblex	9317	9317	100
adj	2269	2269	100
np	1410	1410	100
adv	1236	1236	100
prn	172	172	100
pr	107	107	100
abbr	78	78	100
num	63	63	100
det	62	62	100
vaux	51	51	100
cnjadv	34	34	100
vbmod	26	26	100
vbser	24	24	100
ij	23	23	100
cnjcoo	19	19	100
cnjsub	16	16	100
vbhaver	12	12	100
rel	4	4	100
preadv	2	2	100
guio	1	1	100
cm	1	1	100

Work done before November

Progress is not so big :)

Results

Vocabulary

Coverage of corpora now:

SETimes: 92.93%

EuroParl: 96.76%

NewsCommentary: 96.62%

Total stems in apertium-eng-kaz.eng-kaz.dix: 13359

Some interchunk rules added
Cleaning # from the EuroParl corpus (not finished).
Correcting some errors, such as the generation of wrong attributes like <pp>, etc.

Future work

Cleaning all # from EuroParl
Solving the problem that "Are/Am" questions get the same morphological analysis as the statement "I AM a doctor": ^vP_q<VPQ><aor>{ }$ ^obj-pron<NP><sg><p2><PXD><CD>{^сіз<prn><pers><p2><2><4><5>$}$ ^nP_ger<NP><PD><ND><ger><PXD><CD>{^ойна<v><tv><4><5><3><2><6>$}$^sent<Q_mark>{^?<sent>$}$^sent<SENT>{^.<sent>$}$
Something is wrong with the regression tests
Correcting some errors, such as the generation of wrong attributes like <pp>, etc.

November 2014 to do list

Example of transfer with apertium-eng-kaz

The small children were playing in the park
Det Adj N Vbe Vger Prep Art N

Chunker [.t1x] (pattern of lexical form→action)

[NP Det Adj N]    [VP Vbe Vger]   [PP Prep Art N]

- Output

[NP Adj N] [VP V-"п" Vaux-отыр] [PP N+Postp]

Need to do

^detart-adjec-nom<NP><pl><p3><PXD><CD>{ ^кішкентай<adj>$ ^бала<n><2><4><5>$}$

^pers-verb<VP><ND><PD><ifi><PXD><NXD><CD>{ ^ойна<v><tv><prc_perf>$ ^отыр<vaux><6><4><5><3><2><7>$}$ 

→ why 6 and not 7? (question for developers)
→ we need to repair this

Interchunk[.t2x]

NP VP PP → NP PP VP

^detart-adjec-nom<NP><pl><p3><PXD><CD>{ ^кішкентай<adj>$ ^бала<n><2><4><5>$}$  
^prep-detart-noun<PP><sg><p3><PXD><loc>{  ^саябақ<n><2><4><5>$}$
^pers-verb<VP><pl><p3><ifi><PXD><NXD>{ ^ойна<v><tv><prc_perf>$ ^отыр<vaux><6><4><5><3><2><7>$}$

Postchunk[.t3x]

"instantiate labels + remove syntax"

^кішкентай<adj>$ 
 ^бала<n><pl><PXD><CD>$    
 ^саябақ<n><sg><PXD><loc>$  
 ^ойна<v><tv><prc_perf>$ 
 ^отыр<vaux><NXD><ifi><PXD><p3><pl>$
 ^.<sent>$

Cleanup[.t4x]

Select default values for PXD, CD, NXD, ... Remove <sg>

^кішкентай<adj>$ 
^бала<n><pl><nom>$    
^саябақ<n><loc>$  
^ойна<v><tv><prc_perf>$ 
^отыр<vaux><ifi><p3><pl>$
^.<sent>$
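
What the cleanup stage does to the postchunk output above can be mimicked on plain text. This is only a toy illustration with `sed`, not how the .t4x file is implemented: the function name `toy_cleanup` is hypothetical, and the defaults (drop `<PXD>` and `<NXD>`, turn `<CD>` into `<nom>`, remove `<sg>`) are simply read off the example above.

```shell
# Toy imitation of the cleanup step on the text stream (illustration only).
toy_cleanup() {
  sed -e 's/<PXD>//g' \
      -e 's/<NXD>//g' \
      -e 's/<CD>/<nom>/g' \
      -e 's/<sg>//g'
}

# Example call:
#   printf '%s\n' '^бала<n><pl><PXD><CD>$' | toy_cleanup
```

Applied to the six postchunk lines above, this reproduces the cleanup output shown, which makes it easier to see which placeholder each default replaces.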

Number of rules today:

$ grep "</rule>" apertium-eng-kaz.eng-kaz.t[1234]x | wc -l
309
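
To see how that total is spread over the transfer stages, the same count can be broken down per file. This is a minimal sketch: `count_rules` is a hypothetical helper name, and like the command above it counts lines containing `</rule>`, so it assumes each closing tag sits on its own line.

```shell
# Hypothetical helper: print "<file> <count>" for every file given,
# followed by a final "total <count>" line.
count_rules() {
  total=0
  for f in "$@"; do
    # grep -c exits non-zero when it counts 0 lines, hence "|| true"
    n=$(grep -c '</rule>' "$f" || true)
    printf '%s %s\n' "$f" "$n"
    total=$((total + n))
  done
  printf 'total %s\n' "$total"
}

# Example call:
#   count_rules apertium-eng-kaz.eng-kaz.t[1234]x
```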

Work to do generally for English-Kazakh

Lexical selection: lots of work (and cleaning) to do
- DONE for noun, adj, adv

Getting all nouns in bidix with more than 1 translation (or repeated)

grep ">[a-zA-Z]\+<s n=\"n\"" apertium-eng-kaz.eng-kaz.dix | sed 's/.*[>]\([a-zA-Z]\+[<]s n=\"n\"\).*/\1/g' | 
fgrep -v ">" | sort | uniq -c | grep "[2-9] "

2× → 249
3× → 50
4× → 15
5× → 2
6× → 5

respect, reputation, possession etc.
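
A variant of the pipeline above that is easier to read can be sketched as follows. It assumes one bidix entry per line with the English side written as `<l>lemma<s n="n"/>`; the function name `repeated_nouns` is hypothetical. It filters on the count with `awk` rather than `grep "[2-9] "`, so counts of 10 or 11 are not missed.

```shell
# Hypothetical helper: list noun lemmas on the left side of the bidix
# that occur in more than one entry, as "<count> <lemma>".
# Assumption: entries look like <e><p><l>lemma<s n="n"/></l>...</p></e>
repeated_nouns() {
  grep -o '<l>[a-zA-Z]\+<s n="n"' "$1" |
    sed -e 's/^<l>//' -e 's/<s n="n"$//' |
    sort | uniq -c |
    awk '$1 > 1 { print $1, $2 }'
}

# Example call:
#   repeated_nouns apertium-eng-kaz.eng-kaz.dix
```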

Transfer

- We need to treat "Be N verb" questions like "Am I a doctor?" by deferring copula generation to .t2x or later
- We need a rule for "the girl's mother" which is like the rule for "girl's mother" but with an additional determiner (typical example of rule writing by cutting-and-pasting).
  - DONE: det n1's n2; prep n1's n2; prep det n1's n2
- The problem of indirect objects without a preposition (He told Mrs. Doyle).
  - maybe by looking at verbs that can do this
  - look at 1980's Oxford Advanced Learner's Dictionary and see if A.S. Hornby's verb patterns are of any help (VPx)
- It is a good idea to have NP chunks that are given-name + family-name and similar constructs
  - DONE for np-ant + np-cog.
  - Have to think about constructions: the Head of State Nursultan Nazarbayev

Miscellaneous

- "On the way to the hospital" needs to be translated with an adverbial construction with "go". (емханаға бара жатқанда)
- What happens if one wants to change case at .t2x level? Maybe leave it for .t3x
- What to do with proper nouns? Recognize (!), tag, and transliterate?
  - What happens if they go through in Latin (possible .twol rules for Latin vowels: e.g. Kilkenny-де but Carlow-да).
  - Another possibility (Aida): detect unknown capitalized words (possible?). We tried with regular expressions but they do not seem to work in the apertium-eng-kaz.eng.dix unless they are added to the bilingual and the intersection code notices them (careful: cyclic!!!). It does not seem to be related to the -w switch of lt-proc. There is some rubbish in the dictionary now for testing.

To compare

http://www.sanasoft.kz/online/translater/

http://itranslate4.eu (uses Trident)
