User:Firespeaker/TODO

Things for selimcan

Things for spectie

Things for hector2

  • tests/a_0.yaml at apertium-cv-tr : For some reason, {а} in the present-tense suffix isn't deleted in verbs like вула (ending in а).
This was occurring due to an undocumented exception in the {а} deletion rule, which I commented out. If this exception was there for a reason, those forms should be added to a new section of the a_0.yaml file and the issue should be "reopened" on this list. —Firespeaker 00:51, 1 August 2012 (UTC)
Oh yeah, the "fix" occurred in r39860. --Firespeaker 01:02, 1 August 2012 (UTC)
Thanks. I still haven't found out why the fix was done. --Hèctor Alòs i Font 05:29, 1 August 2012 (UTC)
  • tests/ger1.yaml at apertium-cv-tr : For some reason, м is deleted in ger1 (%>м%{А%}), which it shouldn't be, but м is not deleted in e.g. <neg><pres> (%>м%{А%}с%>т).
I see nothing wrong in ger1.yaml. If there are forms that aren't working right, could you add them to the yaml file? —Firespeaker 01:00, 1 August 2012 (UTC)
Not really. Two forms are generated, I don't know why. One is the good one, but the other (without м) is odd:
[PASS] вула<v><tv><ger1> => вулама
[FAIL] вула<v><tv><ger1> => unexpected results: вулаа
--Hèctor Alòs i Font 05:29, 1 August 2012 (UTC)
The problem is in lexc. I don't know yet why ger1 and ger10 get mixed up, but it has nothing to do with twol.--Hèctor Alòs i Font 10:15, 7 August 2012 (UTC)
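A minimal sketch of the check behind the two report lines quoted above, in Python: it compares the set of expected surface forms with the set of generated ones. This is not the project's actual test runner; the function name and the "missing" label are hypothetical, and only the [PASS] and "unexpected results" line shapes are taken from the output quoted in this item.

```python
def check_generation(analysis, expected, generated):
    """Build report lines for one analysis, in the PASS/FAIL style quoted above.

    expected and generated are sets of surface forms.
    The "missing" label is a made-up name for forms that were expected
    but not generated; it does not come from the real tool.
    """
    lines = []
    for form in sorted(expected & generated):
        lines.append(f"[PASS] {analysis} => {form}")
    for form in sorted(expected - generated):
        lines.append(f"[FAIL] {analysis} => missing: {form}")
    unexpected = sorted(generated - expected)
    if unexpected:
        joined = ", ".join(unexpected)
        lines.append(f"[FAIL] {analysis} => unexpected results: {joined}")
    return lines


# The ger1 case from this item: one good form plus one surplus form.
report = check_generation("вула<v><tv><ger1>", {"вулама"}, {"вулама", "вулаа"})
for line in report:
    print(line)
```

A surplus form such as вулаа shows up as a failure even though the correct form is also generated, which is why a stray lexc path is reported by the tests.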
  • tests/пенсионер.yaml : After the 3rd person affix (ӗ), front vowels should be used, but the {RUS} tag blocks vowel harmony for the whole word. It should block it only until {ӗ} is found.
Genitive fixed in r40191, still working on dative —Firespeaker 06:19, 9 August 2012 (UTC)
Dative fixed in r40232 (and the wiki's back up!) —Firespeaker 20:47, 14 August 2012 (UTC)
Great! --Hèctor Alòs i Font 10:04, 15 August 2012 (UTC)
  • tests/кала.yaml : In some tenses the last vowel of the root is deleted, even if it's а or e and the tense (or person) affix begins with {Ӑ}. To avoid new archiphonemes that would have to be added to zillions of rules, I've created a new pseudo-archiphoneme {del2}, similar to {del}. A couple of rules in twol should do all the work (search for del2). The problem is that when these two rules are uncommented, no px3sp words are recognized, and the verbal forms with del2 are not generated either. The compiler doesn't show any rule conflict, and I can't see one either. A few px3sp forms have been added to кала.yaml to make the problem easier to follow. There's also a file tests/кил.yaml, which works well, as the verb doesn't end in a vowel.
    I've put this problem at the top of the pending list, as it affects lots of words.
    • I finally decided that it's cleaner to create an archiphoneme {Ӑ2} (which is also needed for px2pl), and I'm working on it.--Hèctor Alòs i Font 07:27, 23 August 2012 (UTC)
What's the difference between {Ӑ} and {Ӑ2}? How is each used? —Firespeaker 08:35, 1 September 2012 (UTC)
In fact I renamed {Ӑ2} to the more explicit {Ӑdel}. {Ӑ} is the existing archiphoneme, which can be ӑ, ӗ or ø. In fact it is often deleted (ø) when it's in contact with a "hard" vowel. The new archiphoneme {Ӑdel} can be ӑ or ӗ, but is never ø (so it's like {А}, whereas {Ӑ} behaves more or less like {а}; the names should probably show the similarity better). The difference can be seen, e.g., by comparing лаша<n><gen> = лаша{w}>{Ӑ}н = лашан with лаша<n><px2pl><nom> = лаша>{dup}{Ӑdel}р = лашӑр.--Hèctor Alòs i Font 13:23, 1 September 2012 (UTC)
I don't really like the name (the style of the name or its content: it seems like it means the other one), but I see the need for it now and can work with it as needed. Thanks for the explanation. —Firespeaker 05:11, 2 September 2012 (UTC)
In fact, I'm thinking of another way to deal with it, and maybe rework (read: clean up / simplify) some of the other existing phonology. I think a phantom archiphoneme that blocks (OR allows) deletion of the archiphoneme might be more elegant. But I'll have to think about it later. —Firespeaker 05:13, 2 September 2012 (UTC)
In fact, this was the approach I had in mind at the beginning, but I couldn't get it to work, so I used another one, more in line with what has been done so far (and it worked, which is what matters most to me).
It would be good to change the tags to something more explicit. For example, {RUS} in fact has two meanings: no vowel harmony and no consonant duplication; but not all "Russian" words can have {RUS}, because a subset of them do have vowel harmony. So the name is confusing and doesn't clearly present the features of the tag. On the other hand, my priority is not an ideal morphological analyzer, but something that works (we are in a software project with benchmarks at fixed dates). So I'd rather not spend time redoing things that are working instead of solving actual problems, unless the redoing is necessary for solving those current problems.--Hèctor Alòs i Font 09:14, 12 September 2012 (UTC)
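The {Ӑ} / {Ӑdel} contrast explained earlier in this thread can be illustrated with a toy Python sketch. It is not the twol implementation: the vowel classes, the function names and the harmony rule (take the last vowel of the stem) are simplifying assumptions, and {w}, {dup} and the morpheme boundary > are left out. It is only meant to reproduce the two лаша examples given above.

```python
# Assumed vowel classes for the toy harmony rule (not taken from cv.twol).
BACK_VOWELS = set("аӑуыо")
FRONT_VOWELS = set("еӗиӳэ")
VOWELS = BACK_VOWELS | FRONT_VOWELS


def reduced_vowel_for(stem):
    """Pick ӑ or ӗ from the last vowel of the stem (ӑ if no vowel is found)."""
    for ch in reversed(stem):
        if ch in FRONT_VOWELS:
            return "ӗ"
        if ch in BACK_VOWELS:
            return "ӑ"
    return "ӑ"


def attach(stem, suffix):
    """Attach a suffix that may start with {Ӑ} or {Ӑdel}."""
    if suffix.startswith("{Ӑdel}"):
        # {Ӑdel} is never ø: it always surfaces, and a stem-final vowel
        # is deleted instead.
        rest = suffix[len("{Ӑdel}"):]
        vowel = reduced_vowel_for(stem)
        if stem and stem[-1] in VOWELS:
            stem = stem[:-1]
        return stem + vowel + rest
    if suffix.startswith("{Ӑ}"):
        # {Ӑ} can be ø: here it is dropped after any stem-final vowel,
        # which is a simplification of the real conditions.
        rest = suffix[len("{Ӑ}"):]
        if stem and stem[-1] in VOWELS:
            return stem + rest
        return stem + reduced_vowel_for(stem) + rest
    return stem + suffix


print(attach("лаша", "{Ӑ}н"))     # genitive example from the thread
print(attach("лаша", "{Ӑdel}р"))  # px2pl example from the thread
```

In this sketch the only difference between the two archiphonemes is which vowel gives way when two vowels meet, which is the behaviour the explanation above describes.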
  • tests/чикӗ.yaml, tests/училище.yaml : There are strange vowel harmony errors in some cases of the px2sg. In some cases the rule "Vowel harmony for archiphoneme {У}" is applied correctly, but in others it isn't. I've tried adding a couple of lines (search for "училище" in the twol file), but they didn't solve the problem.
  • tests/хӑю.yaml : As in tests/ту.yaml (which works perfectly), there is a у/ӑв alternation in this word. The problem is that (because of the Russian orthography) ю has to be split, and a position has to be found for a й that doesn't exist in the lexical form. That's why, for instance, adding ю to the rule "в surfaces in у/ӳ > ӑв/ӗв before vowel (2)" may not solve that part of the problem. One solution could be to add something at the end of this kind of word in lexc, but that may cause problems in twol (fortunately there is no vowel harmony in this case). A very dirty trick could be to use the morpheme boundary symbol for that.
I made a certain amount of progress on this (r41127), but had some comments. Primarily, this problem should be solvable using a simple and elegant set of rules, but I've had to treat almost every circumstance of this phonology differently. Cf. the following forms:

  • хӑю<n><px1sg><nom> → хӑю>{dup}{Ӑ}м → хӑйӑвӑм
  • хӑю<n><px2sg><nom> → хӑю>{dup}{У}{н2} → хӑйӑву
  • хӑю<n><px3sg><nom> → хӑю>{dup}{в}{ӗ}{н} → хӑйӑвӗ

It seems that (of these examples at least) only <px3sg> was really set up to deal with this sort of phonology (i.e., the extra {в}). So with the other forms, I'm using > and {dup} for the ӑ and в, respectively. (Btw, are there any front-vowel forms, where ӳ becomes ӗв?)
Yes, there are, and they are fully regular, e.g. http://wiki.apertium.org/wiki/Пӳ (but, of course, there is no letter for /йӳ/).--Hèctor Alòs i Font 17:30, 12 September 2012 (UTC)
Because of all these extra rules (and how complicated twol and lexc of Chuvash have become), it takes a long time to compile. It's gone from almost 17 minutes to almost 20 minutes to compile on my machine while I work on it. This is getting really bad. It can take an hour to tweak a rule if I don't get it exactly right the first time, and by that point I've lost my train of thought. Do you (or Fran?) have any ideas on what we can do to improve this situation? One part of it will have to be to clean up the twol file, which won't be fun, but is there anything else that can be done in the meantime? —Firespeaker 05:22, 2 September 2012 (UTC)
It takes less time on my machine, but it's a problem anyway. Maybe we should handle part of the morphological paradigms in lexc? I don't like this kind of solution: it seems that we are near 90%, it would mean a lot of work, and I don't have time for it. And of course, phonological rules explain better how the language works.--Hèctor Alòs i Font 17:30, 12 September 2012 (UTC)
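The у/ӑв alternation and the splitting of ю discussed in the хӑю item can be sketched in Python as follows. This is a toy string rewrite, not twol: it works on surface suffixes (ӑм, у, ӗ, as read off the three forms listed above) rather than on archiphonemes, and the function name and vowel list are assumptions made for the illustration.

```python
# Assumed list of vowel letters that can start a suffix.
VOWELS = set("аӑеӗиоуӳыэ")


def attach_surface_suffix(stem, suffix):
    """Attach a surface suffix, applying у > ӑв and ӳ > ӗв before a vowel.

    A stem-final ю is treated as й + у, so it becomes йӑв before a vowel.
    Before a consonant, or with an empty suffix, the stem is left as it is.
    """
    if not suffix or suffix[0] not in VOWELS:
        return stem + suffix
    if stem.endswith("ю"):
        return stem[:-1] + "йӑв" + suffix
    if stem.endswith("у"):
        return stem[:-1] + "ӑв" + suffix
    if stem.endswith("ӳ"):
        return stem[:-1] + "ӗв" + suffix
    return stem + suffix


# The three possessive forms of хӑю listed in this item.
for surface_suffix in ("ӑм", "у", "ӗ"):
    print(attach_surface_suffix("хӑю", surface_suffix))
```

Treating ю as a single unit that rewrites to йӑв sidesteps the question of where the й lives in the lexical form, which is the difficulty described above; whether something similar can be expressed cleanly in twol is left open here.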

questions/requests for hector2

  • What is special about пуртӑ? As far as I can tell, it behaves as one would expect for a noun ending in ӑ that has gemination. —Firespeaker 07:56, 18 September 2012 (UTC)
    • The gemination of т. There is no gemination in CCӑ, only in VCӑ. The exception is пуртӑ.--Hèctor Alòs i Font 09:14, 18 September 2012 (UTC)
Ah. I think we're going to move gemination to something conditioned by lexc. Trying to set up phonological triggers for it is too complex and has been causing too many problems. Marking nouns that get gemination in lexc is simple. —Firespeaker 07:07, 19 September 2012 (UTC)
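A rough Python sketch of the idea of marking gemination lexically, with the phonological pattern kept only as a default. The pattern (gemination in VCӑ, none in CCӑ) and the пуртӑ exception come from the reply above; the утӑ entry reflects the {nodup} mark in the lexc line quoted further down this page, whose correctness is still being questioned there. The names and the dictionary layout are hypothetical.

```python
# Assumed list of vowel letters; every other letter counts as a consonant here.
VOWELS = set("аӑеӗиоуӳыэюя")

# Per-lemma marks that override the phonological default.
GEMINATION_OVERRIDES = {
    "пуртӑ": True,   # CCӑ, but geminates (the exception named above)
    "утӑ": False,    # VCӑ, but marked {nodup} in the quoted lexc entry
}


def default_gemination(noun):
    """Phonological default: gemination in VCӑ, none in CCӑ."""
    if len(noun) < 3 or noun[-1] != "ӑ":
        return False
    consonant, before = noun[-2], noun[-3]
    return consonant not in VOWELS and before in VOWELS


def geminates(noun):
    """Lexical mark first, phonological default otherwise."""
    return GEMINATION_OVERRIDES.get(noun, default_gemination(noun))


for noun in ("пуртӑ", "утӑ"):
    print(noun, default_gemination(noun), geminates(noun))
```

If gemination is marked on every noun in lexc, as suggested above, the default function disappears and only the lexical marks remain.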
  • Add forms and correct the existing ones at Nouns ending in о. --Firespeaker 08:05, 18 September 2012 (UTC)
    • I'll do it. In fact, the rule is quite simple: words ending in unstressed o, u or a behave like words ending in ӑ, except for gemination (and the "dictionary" form, which keeps the Russian orthography). A subrule, which complicates this general one, is that the 3rd person differs for words that are invariable in Russian. Also, радио seems to be special: it behaves as if the o were stressed (it seems that some Russian words are assimilated by moving the stress to the final vowel). I'll write down the examples, change the wiki, etc.--Hèctor Alòs i Font 19:59, 18 September 2012 (UTC)
  • Could we have pages for the following irregular nouns, especially the first two? —Firespeaker 08:30, 18 September 2012 (UTC)
  • Could we have a page for the following regular noun? —Firespeaker 08:48, 18 September 2012 (UTC)
    • хӗв
      • Here it is! There's also a yaml file. I am not sure yet about px2sg.loc and px2sg.abl, but all other forms should be correct.
        • Awesome, thanks!
          • In principle the px2sg forms have already been revised. However, I'm not really sure about px2sg.dat. The px2sg forms are seldom used, and there were zillions of errors in px2sg.loc and px2sg.abl, even after 2-3 revisions (among others by philology teachers). In px2sg.dat, {У} is deleted in some cases, but I'm not sure what the rule is (I haven't analyzed it yet), and I guess the informants may have made mistakes. For these rarely used forms, where a few words that don't differ from the others in any way fail to follow the rules, I finally decided to follow the general rule and not to trust the informants.--Hèctor Alòs i Font 19:03, 20 September 2012 (UTC)
  • Could I have yaml files for the following words? —Firespeaker 08:46, 18 September 2012 (UTC)
Given the following line in the twol, is the gemination in утӑ right? —Firespeaker 07:39, 21 September 2012 (UTC)
утӑ:ут%{nodup%}ӑ N1 ; ! "сено"	! exception according to И.П. Павлов 1974: 18
      • You can use the script tests/gen_yaml_mot.sh, which creates a yaml for a given noun from the wiki.yaml file (which I regularly read from the wiki, when I change something in it). As above, I still have some doubts about px2sg.loc and px2sg.abl, but all other forms should be correct. Атте should be correct in all forms.--Hèctor Alòs i Font 19:07, 18 September 2012 (UTC)
        • See above about px2sg (by the way, I see that both атте and анне have the px2sg.loc and px2sg.abl forms that I corrected everywhere else. As they are irregular words, I don't dare correct them... It could be argued that personal suffixes really are used with these words, which gives some more confidence).--Hèctor Alòs i Font 19:03, 20 September 2012 (UTC)

General cv.twol TODO list

  • gemination
  • ӳ:ӗв, у:ӑв
  • Nouns ending in о
  • <px2sg><dat> of nouns
  • clean up twol conflicts