Difference between revisions of "Kazakh and Tatar/Diary"

Latest revision as of 21:50, 19 July 2012

Monday, 28th May 2012

Checking & refactoring clitics

Some clitics appear only after certain forms (e.g. "шы<mod_foc>" in Kazakh, which expresses politeness, attaches only to 2nd person singular forms). And vice versa: some forms can take only certain clitics (in Tatar, imperative forms take only "чы" and "сана").

I moved the above clitics into a separate lexicon and linked the imperative forms to it, so there is no overgeneration now (and life is a bit easier for spectie's "testvocing" PCs).

In Tatar some new clitics were added as well.

Tuesday, 29th May 2012

Checking & refactoring clitics (cont.)

The question of whether %+ғана%<postadv%>:% %{G%}ана # ; ! "only" in the CLIT continuation class was correct led to a discussion about whether we should handle the harmonization of such words in the transducer (that is, matching them to the previous word) or whether the post-generator can take care of it.

Another thing is that some Tatar modal particles do not vary depending on the previous word (e.g. "бит"), but I have put them into the CLIT continuation class anyway (as all the other modal particles were there). This might be wrong.

I learned a lot of new stuff :), but the possible changes in the CLIT lexicon were left for later.

Some work on postadverbs

See Postadverbs

Wednesday, 30th May 2012

Had to study for a "zachet" (a pass/fail exam), not much done, but:

Went over numerals again, some additions

Started categorizing postpositions depending on what case they govern

Their "case-governance" often differs between the two languages, so some transfer rules will be required.

I'll need help to set up coverage-measuring scripts and to learn how I can testvoc only certain POS's.

Also, I think I need another story :) Testing things on a parallel text well before the midterm is a good idea anyway.

Wednesday, 30th May 2012

Postpositions

Categorized Kazakh postpositions and translated them into Tatar. Wasn't sure about four or five, and put them down in dev/bidix/postpositions.todo.txt. Also in kaz.lexc there are some postpositions which seem to be POS-miscategorized (see ! To be checked).

Categorized postpositions in tat.lexc, added some more stuff.

Tomorrow I am going to translate these additional Tatar ones (to boost the coverage - they seem to be quite frequent!) and work on transfer rules if they are needed.

Thursday, 31st May 2012 - Sunday, 3rd June 2012

Let's review what has been happening to apertium-kaz-tat in the last few days (with every new day it is getting harder to remember the changes).

I haven't written here for a couple of days because I wanted to finish the POS's I had to finish according to the workplan, and only then happily announce it. What I learned from this is to keep track of changes as they occur: reconstructing even just a few days takes more time than summing up what you have done in a few short sentences right after the "working hours".

Postpositions (cont.)

Translated Tatar postpositions and categorized them according to the cases they require. Rule(s) to handle the case-governance differences are still to be written.

Coverage

Now I have a script to measure the trimmed coverage (thanks Unhammer!; see dev/trimmed-coverage.sh). Added some of the top unrecognized words. Translated country names from kaz.lexc (they were easy to translate and I thought that they could increase the coverage significantly, because the testing corpora are either news or WP).
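The contents of dev/trimmed-coverage.sh are not shown here, so the following is only a rough sketch of the underlying arithmetic, not that script. It assumes the corpus has already been analysed and split into one token per line, and that tokens without an analysis look like ^word/*word$; the function name coverage and the sample tokens are made up for illustration.

```shell
# Sketch: percentage of tokens that received at least one analysis.
# Assumption: one analysed token per line; unknown tokens contain "/*".
coverage() {
    awk '
        /\/\*/ { unknown++ }   # token with no analysis
        { total++ }
        END {
            if (total == 0) { print "0.00"; exit }
            printf "%.2f\n", 100 * (total - unknown) / total
        }'
}

# Made-up sample: three known tokens and one unknown token.
printf '%s\n' '^a/a<n>$' '^b/b<v>$' '^c/*c$' '^d/d<adj>$' | coverage
```

Sorting the unknown tokens by frequency with the same kind of filter would give the "top unrecognized words" list mentioned above.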

A good idea is to retrieve translations for things like toponyms from interwiki links (see "Getting cheap bilingual dictionary entries" on the Building dictionaries page).

Monday, 4th June 2012

Conjunctions

Translated what already was in lexc's, added some more. Wasn't sure about ones in dev/bidix/conjunctions.todo.txt, gonna go back to them later (they are rather archaic).


There was a question about how I should classify conjunctions (co-ordinating, sub-ordinating or adverbial). The first two are distinguished by all grammars I have, so I mostly just followed them while adding stems into CCLEX and CSLEX.

I might have misunderstood "adverbial conjunctions" [although I think I haven't :)]. The words that landed in this lexicon aren't pure conjunctions but are derived from other parts of speech (<prn>, <prn>+<post>, <det><n> etc.). They connect two sentences, but also appear as a part of one of them (in contrast to co-ordinating and sub-ordinating conjunctions, which are autonomous) and semantically stand in for the other sentence <s>and can appear in the middle of the sentence (separated by a comma)</s>.

If I understood it correctly, adverbial conjunctions are what Tatar grammars call "мөнәсәбәтле сүзләр" (Russian "относительные слова", i.e. "relative words"); see pp. 341-351 of Volume 2 of the Academic Grammar.

I also plan to add all the other so-called "transition words" to this lexicon (e.g. "беренчедән" = "firstly" etc.).


The grammars I have also describe the conjunctive use of interrogative-demonstrative pronoun pairs (calling them "союзные слова", i.e. "conjunctive words"). E.g.: "Кем эшләми, шул ашамый" ("Whoever does not work does not eat"). But I think that adding all these pronouns once more as correlative conjunctions wouldn't make any sense for the Kazakh-Tatar pair (or any other Turkic pair).

Tuesday, 5th June 2012

Not much coding, played around with some scripts: installed apertium-dixtools, tried iw-word.sh. Added some pending translations into the head-files.

Wednesday, 6th June 2012

Checked adverbs (not all yet) in kaz.lexc for miscategorizations (some "adjective-adverbs" landed there), translated them and added them to the bilingual dictionary.

Tomorrow the right side of the entries will hopefully land in tat.lexc (sort them with apertium-dixtools sort -right and then export with some sed one-liners).
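The sed one-liners themselves are not recorded in this diary. As an illustration only, an export step of this kind could look like the sketch below. It assumes one bidix <e> entry per line with a single <r> element; the function name right_side and the sample entry are hypothetical, and turning the result into real tat.lexc lines would still require choosing the right continuation classes by hand.

```shell
# Sketch: pull the right-hand lemma and its first tag out of bidix entries.
# Assumption: one <e> entry per line, with exactly one <r> element.
right_side() {
    sed -n 's|.*<r>\([^<]*\)<s n="\([^"]*\)"/>.*|\1 \2|p'
}

# Made-up sample entry; prints the lemma and its tag separated by a space.
printf '%s\n' '<e><p><l>сөз<s n="n"/></l><r>сүз<s n="n"/></r></p></e>' | right_side
```

Lines that do not match the pattern (comments, section headers, entries without a tag) are silently skipped because of sed -n with the p flag.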

Friday, 13th July 2012

TODO

  1. fix @ and # errors in the kaz.crp.txt translation;
  2. revise TODO's here in the wiki, put everything in a single place (General Todo);
  3. put all regression tests in one page;
  4. write a rule for sequence [{prn.pers,prn.dem} + post_governing_nom/gen];
  5. in kaz-tat.t1x, take care of instrumental case of all substantivized things (add them to category "nom" or add a separate category for them);
  6. testvoc pronouns[1] and grep for errors.

Saturday, 14th July 2012

Was translating pending nouns of kaz.lexc, will add them tomorrow.

Thursday, 19th July 2012

Over the last few days I was adding more stems; today I finally tried to testvoc pronouns. The commands given on the Testvoc page work fine, except that hfst-fst2strings output lots of zeros and nothing else (it might be only on my computer though). To make it work, I had to comment out everything except pronouns in the Root lexicon. I got 3934 lines for personal pronouns and around 56 000 lines for all pronouns, so it seems to be all right.

Then I uncommented everything in the Root lexicon again, compiled and typed this:

cat /tmp/pronouns.exp | cut -f2 -d':' | sed 's/^/^/g' | sed 's/$/$/g' | apertium-pretransfer | lt-proc -b kaz-tat.autobil.bin | apertium-transfer -b apertium-kaz-tat.kaz-tat.t1x kaz-tat.t1x.bin | apertium-transfer -n apertium-kaz-tat.kaz-tat.t2x kaz-tat.t2x.bin > /tmp/pronouns.kaz-tat.exp
cat /tmp/pronouns.kaz-tat.exp | hfst-proc -d kaz-tat.autogen.hfst > /tmp/pronouns.tat.exp
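To make the first stages of that pipeline easier to follow, here is a small sketch of what the cut and sed steps do to a single line. The input line and its tags are made up, and it is an assumption that the second colon-separated field of /tmp/pronouns.exp holds the analysis; the function name to_lexical_units is hypothetical.

```shell
# Sketch of the first pipeline stages, on made-up input.
# Assumption: each line is "field1:field2" and field2 is the analysis.
to_lexical_units() {
    # Keep the second field, then wrap it as ^...$ so that the later
    # stages see it as one lexical unit of the Apertium stream format.
    cut -f2 -d':' | sed 's/^/^/; s/$/$/'
}

printf '%s\n' 'мен:мен<prn><pers><p1><sg><nom>' | to_lexical_units
```

The two sed substitutions only prepend a literal ^ and append a literal $ to every line; they do not change the analysis itself.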

Fixed some minor issues, recompiled and checked whether it had helped by running the commands above again.

The remaining #-errors in pronouns are there because tat.lexc has no personal copula suffixes yet.
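Counting the remaining errors can be sketched as follows. This is not a command taken from the diary: the helper name count_marked and the sample lines are made up, and it relies on the usual Apertium convention that @ marks a lemma missing from the bilingual dictionary and # marks a form the generator could not produce.

```shell
# Sketch: count output lines carrying a given error mark.
count_marked() {
    # $1 is the mark to look for ('@' or '#'); input comes from stdin.
    # grep -c prints 0 but exits with status 1 when nothing matches,
    # hence the "|| true".
    grep -c -F -e "$1" || true
}

# Made-up generator output: two #-errors and one @-error.
sample='мин
#син
@ол
#без'
printf '%s\n' "$sample" | count_marked '#'
printf '%s\n' "$sample" | count_marked '@'
```

On a real run the input would be /tmp/pronouns.tat.exp instead of the sample variable.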

  1. On my stand-alone PC (which is much more powerful than my old laptop) - I am installing Xubuntu there right now