Difference between revisions of "Talk:Turkic-Turkic translator"
Latest revision as of 18:54, 12 May 2012
Lexicon trimming
<spectre> also, some 'corner cases' for the lexicon scraper
<spectre> when you have stems in a continuation lexicon, e.g. demonstratives in kazakh, personal pronouns in other places
<spectre> they will all get included disregarding the bilingual dictionary
<spectre> another example: when one language has a word that another doesn't have (e.g. a case that turns into a postposition, but the postposition doesn't have an equivalent in the other language, e.g. is inserted by transfer)
<spectre> (same goes for "particles")
<spectre> for verbs (this one can be fixed by having a correspondence between tags/continuation lexica) e.g. when you have a verb which is both tv/iv but only one in the bidix
<spectre> another example: when you have an entry like foo<adj>:bar<n><attr> in the bilingual dictionary, but no entry for foo<adj><subst>:bar<n> (or foo<adj><subst>:baz<n>) then there will be errors
Testvoc
We probably need to work out a way to run the testvoc in a reasonable amount of time. Here are some suggestions:
- Create sub-lexicons that each run just one category through the testvoc process.
- Treat clitics like 'mi', 'i' etc. separately rather than as attached. In Turkish this would reduce the size of the categories that can take these clitics by a factor of _at least_ 6.
- Because of how the lexicons are laid out, we could try some kind of continuation-based testvoc.
Idea:
- Read in the lexc file, and from the stems, reading up, make a list of the combinations of continuation lexicons, e.g.
V-TV V-FIN-COMMON V-NONFIN ...
- Make a hash relating each combination of continuation lexicons to a list of stems
- Do this for both lexc files
- Match up the combinations via the bilingual dictionary, e.g.
"foo" N -- "bar" N; "baz" N-NOPOS -- "barm" N
- Then, for each pairing of a source-side combination of continuation lexicons with a target-side one, make a list of the stem pairs that the bilingual dictionary links.
- Select n at random from each of the pairs and expand them.
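
The steps above could be sketched roughly as follows. This is only an illustration under several assumptions: the lexc files are well formed, stems are the entries of lexicons named in a caller-supplied set (`stem_lexicons`, a hypothetical parameter), lexc escapes such as `%!` are ignored, and the bilingual dictionary has already been reduced to plain (source stem, target stem) pairs. All function names are made up for the example, and the final "expand" step (running the sampled pairs through the analysers and generators) is left out.

```python
import random
import re
from collections import defaultdict

# One lexc entry: "form CONT ;" or "upper:lower CONT ;" (simplified).
ENTRY = re.compile(r'^(?P<form>\S+)\s+(?P<cont>\S+)\s*;')


def continuation_classes(lexc_text, stem_lexicons):
    """Map each stem to the sorted tuple of continuation lexicons it points to."""
    conts = defaultdict(set)
    current = None
    for raw in lexc_text.splitlines():
        line = raw.split('!', 1)[0].strip()  # drop lexc comments
        if not line:
            continue
        if line.startswith('LEXICON'):
            current = line.split()[1]
            continue
        match = ENTRY.match(line)
        if match and current in stem_lexicons:
            stem = match.group('form').split(':', 1)[0]
            conts[stem].add(match.group('cont'))
    return {stem: tuple(sorted(c)) for stem, c in conts.items()}


def group_by_class(classes):
    """Hash: combination of continuation lexicons -> list of stems."""
    groups = defaultdict(list)
    for stem, combo in classes.items():
        groups[combo].append(stem)
    return groups


def pair_classes(src_classes, trg_classes, bidix):
    """Group bidix stem pairs by (source combination, target combination)."""
    pairs = defaultdict(list)
    for src, trg in bidix:
        if src in src_classes and trg in trg_classes:
            pairs[(src_classes[src], trg_classes[trg])].append((src, trg))
    return pairs


def sample_pairs(pairs, n, seed=0):
    """Pick at most n stem pairs at random from each group."""
    rng = random.Random(seed)
    return {key: rng.sample(group, min(n, len(group)))
            for key, group in pairs.items()}


# Toy data mirroring the "foo"/"bar" example above.
SRC_LEXC = """
LEXICON Nouns
foo N ;
baz N-NOPOS ;
"""
TRG_LEXC = """
LEXICON Nouns
bar N ;
barm N ;
"""
BIDIX = [("foo", "bar"), ("baz", "barm")]

src_classes = continuation_classes(SRC_LEXC, {"Nouns"})
trg_classes = continuation_classes(TRG_LEXC, {"Nouns"})
pairs = pair_classes(src_classes, trg_classes, BIDIX)
sampled = sample_pairs(pairs, n=1)
```

Keying on a sorted tuple means two stems land in the same group only if they share exactly the same set of continuation lexicons, which is what lets a small random sample stand in for the whole group. Stems that live inside continuation lexicons (the demonstrative and pronoun corner cases mentioned under Lexicon trimming) would not be picked up by this sketch.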