Difference between revisions of "Talk:Turkic-Turkic translator"

Latest revision as of 18:54, 12 May 2012

Lexicon trimming

<spectre> also, some 'corner cases' for the lexicon scraper
<spectre> when you have stems in a continuation lexicon, e.g. demonstratives in kazakh, personal pronouns in other places
<spectre> they will all get included disregarding the bilingual dictionary
<spectre> another example: when one language has a word that another doesn't have (e.g. a case that turns into a postposition, 
          but the postposition doesn't have an equivalent in the other language, e.g. it is inserted by transfer)
<spectre> (same goes for "particles")
<spectre> for verbs (this one can be fixed by having a correspondence between tags/continuation lexica) 
          e.g. when you have a verb which is both tv/iv but only one in the bidix
<spectre> another example: when you have an entry like foo<adj>:bar<n><attr> in the bilingual dictionary, but no entry 
          for foo<adj><subst>:bar<n> (or foo<adj><subst>:baz<n>) then there will be errors
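The last corner case above can be checked mechanically. The following is a minimal sketch, assuming the bilingual dictionary has already been read into a list of `(source lemma, source tags, target lemma, target tags)` tuples; the function name `missing_subst_entries` and the tuple layout are hypothetical and not part of any existing tool.

```python
def missing_subst_entries(bidix):
    """Return source adjectives that map to <n><attr> but have no <adj><subst> entry.

    `bidix` is a list of (src_lemma, src_tags, trg_lemma, trg_tags) tuples
    (an assumed, simplified representation of the bilingual dictionary).
    """
    # Adjectives that already have an explicit <adj><subst> entry.
    with_subst = {src for src, src_tags, _, _ in bidix
                  if src_tags == ("adj", "subst")}
    # Adjectives translated as an attributive noun but lacking that entry.
    flagged = {src for src, src_tags, _, trg_tags in bidix
               if src_tags == ("adj",)
               and trg_tags == ("n", "attr")
               and src not in with_subst}
    return sorted(flagged)


# Placeholder entries in the style of the example above.
bidix = [
    ("foo", ("adj",), "bar", ("n", "attr")),
    ("qux", ("adj",), "quux", ("n", "attr")),
    ("qux", ("adj", "subst"), "quux", ("n",)),
]
missing = missing_subst_entries(bidix)
```

Here `foo` would be reported, because it has an `<adj>` to `<n><attr>` entry but nothing for `<adj><subst>`, while `qux` has both.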

Testvoc

We probably need to work out a way to run the testvoc in a reasonable amount of time. Here are some suggestions:

* Create sub-lexicons, each of which runs just one category through the testvoc process.
* Treat clitics like 'mi', 'i' etc. separately, and not as attached. In Turkish this would reduce the size of the categories that can take these clitics by at least a factor of 6.
* Because of how the lexicons are laid out, we could try some kind of continuation-based testvoc.

Idea:

* Read in the lexc file and, starting from the stems and reading up, make a list of the combinations of continuation lexicons, e.g. V-TV V-FIN-COMMON V-NONFIN ...
* Make a hash relating each combination of continuation lexicons to a list of stems.
* Do this for both lexc files.
* Match up the combinations via the bilingual dictionary, e.g. "foo" N -- "bar" N; "baz" N-NOPOS -- "barm" N
* Then, for each pair of matched combinations, make a list of the stem pairs that the bilingual dictionary links.
* Select $n$ pairs at random from each of these lists and expand them.
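The steps of this idea can be sketched roughly as follows. This is only an illustration under stated assumptions: the lexc files are assumed to have been reduced already to `(stem, continuation lexicon)` pairs, the bilingual dictionary to `(source stem, target stem)` pairs, and all function names and stems are hypothetical. Walking further up the chain of continuation lexicons and the final expansion step (which would need the compiled transducers) are not shown.

```python
import random
from collections import defaultdict


def continuation_classes(entries):
    """Map each stem to the sorted tuple of continuation lexicons it points to."""
    by_stem = defaultdict(set)
    for stem, cont in entries:
        by_stem[stem].add(cont)
    return {stem: tuple(sorted(conts)) for stem, conts in by_stem.items()}


def stems_by_combination(entries):
    """Hash each combination of continuation lexicons to the stems using it."""
    table = defaultdict(list)
    for stem, combo in continuation_classes(entries).items():
        table[combo].append(stem)
    return dict(table)


def pair_combinations(src_entries, trg_entries, bidix):
    """Group bidix stem pairs by (source combination, target combination)."""
    src = continuation_classes(src_entries)
    trg = continuation_classes(trg_entries)
    pairs = defaultdict(list)
    for s, t in bidix:
        if s in src and t in trg:  # skip stems missing from either lexicon
            pairs[(src[s], trg[t])].append((s, t))
    return dict(pairs)


def sample_pairs(pairs, n, seed=0):
    """Pick up to n stem pairs at random from each combination pair."""
    rng = random.Random(seed)
    return {combo: rng.sample(items, min(n, len(items)))
            for combo, items in pairs.items()}


# Placeholder data in the style of the example above.
src_entries = [("foo", "N"), ("baz", "N-NOPOS"), ("qux", "N"),
               ("zot", "V-TV"), ("zot", "V-IV")]
trg_entries = [("bar", "N"), ("barm", "N"), ("quux", "N"), ("zat", "V-TV")]
bidix = [("foo", "bar"), ("baz", "barm"), ("qux", "quux"), ("zot", "zat")]

pairs = pair_combinations(src_entries, trg_entries, bidix)
sampled = sample_pairs(pairs, n=1)
```

Each key of `sampled` is one pairing of continuation-lexicon combinations, so only $n$ stem pairs per pairing would need to be expanded instead of every stem in the lexicon.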

