Talk:German to English

From Apertium
Revision as of 22:47, 2 December 2011 by Elaichi (talk | contribs) (→‎Progress)

Getting started

What's the best approach to start adding entries to the German monodix?

A good way would be to start by writing a script to download Wiktionary entries for German nouns and convert them into speling format, e.g.
http://en.wiktionary.org/wiki/Bett#Declension
http://en.wiktionary.org/wiki/Haus#Declension
Bett; Bett; sg.nom; n.nt
Bett; Bettes; sg.gen; n.nt
Bett; Betts; sg.gen; n.nt
Bett; Bett; sg.dat; n.nt
Bett; Bett; sg.acc; n.nt
Bett; Betten; pl.nom; n.nt
Bett; Betten; pl.gen; n.nt 
Bett; Betten; pl.dat; n.nt
Bett; Betten; pl.acc; n.nt
Haus; Haus; sg.nom; n.nt
Haus; Hauses; sg.gen; n.nt
Haus; Haus; sg.gen; n.nt
Haus; Haus; sg.dat; n.nt
Haus; Haus; sg.acc; n.nt
Haus; Häuser; pl.nom; n.nt
Haus; Häuser; pl.gen; n.nt
Haus; Häusern; pl.dat; n.nt
Haus; Häuser; pl.acc; n.nt
There are around 15,000 entries in the category German nouns, so that should be a good start. - Francis Tyers 07:13, 18 October 2011 (UTC)
Another thing you can do is make lists of closed-category words that don't inflect (e.g. prepositions, conjunctions) and also of abbreviations. - Francis Tyers 07:15, 18 October 2011 (UTC)
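
A minimal sketch of the conversion step suggested above, assuming the declension table has already been scraped into a dictionary. The dictionary layout and the function name speling_lines are hypothetical; only the output format follows the Bett example.

```python
# Sketch: turn a declension table into speling-format lines.
# The input layout is an assumption, not what a Wiktionary scraper returns.

def speling_lines(lemma, forms, pos):
    """Yield 'lemma; surface; tags; pos' lines, one per surface form."""
    for tags, surfaces in forms.items():
        for surface in surfaces:
            yield f"{lemma}; {surface}; {tags}; {pos}"

# Hand-written input for "Bett", copied from the example on this page.
bett = {
    "sg.nom": ["Bett"],
    "sg.gen": ["Bettes", "Betts"],
    "sg.dat": ["Bett"],
    "sg.acc": ["Bett"],
    "pl.nom": ["Betten"],
    "pl.gen": ["Betten"],
    "pl.dat": ["Betten"],
    "pl.acc": ["Betten"],
}

for line in speling_lines("Bett", bett, "n.nt"):
    print(line)
```

Keeping a list of surface forms per slot lets alternative forms such as Bettes/Betts come out as separate lines.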

Order of symbols

Francis, what should be the expected order of the symbols in the morphological analysis? Let's say we are analyzing "Apfel", is it <POS><gender><case><number> or <POS><gender><number><case>? I guess it should also output all the possible cases, e.g.:

Apfel<n><m><nom><sg>
Apfel<n><m><acc><sg>
Apfel<n><m><dat><sg>
<PoS><gender><number><case> - for lack of a better phrase, that's the order of inherency, plus it's easier to work with. Much easier. -- Jimregan 15:39, 19 October 2011 (UTC)
Also, listing 'viele' as the plural of 'ein' is dubious, and will more than likely cause problems. Treat them as separate words -- Jimregan 15:53, 19 October 2011 (UTC)
I started a stub at Tag_order on this, but it's not very complete. --unhammer 07:05, 20 October 2011 (UTC)
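
To make the agreed order concrete, here is a small sketch that builds an analysis string in <PoS><gender><number><case> order from the speling fields used earlier on this page. The function name analysis is hypothetical, and the n.m tag pair for Apfel is assumed by analogy with n.nt.

```python
def analysis(lemma, speling_tags, speling_pos):
    """Build an analysis string in <PoS><gender><number><case> order.

    speling_tags is e.g. 'sg.nom' (number.case) and speling_pos is
    e.g. 'n.m' (PoS.gender), as in the speling lines on this page.
    """
    pos, gender = speling_pos.split(".")
    number, case = speling_tags.split(".")
    return lemma + "".join(f"<{tag}>" for tag in (pos, gender, number, case))

for tags in ("sg.nom", "sg.acc", "sg.dat"):
    print(analysis("Apfel", tags, "n.m"))
```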

Here's a repository with some initial progress (sorry for the delay, I was out of town last week):

https://github.com/elaichi/apertium-de-en-dev

There are fewer nouns than expected because my script only got the ones with the de-noun template and not the infl|de|noun template.

Nice, there are 4187 in total, with 282 paradigms. - Francis Tyers 22:15, 25 October 2011 (UTC)
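
One possible way to catch both templates mentioned above is a single regular expression over the page wikitext. This is only a sketch: the exact spacing and parameters inside the templates are assumptions, and the function name is hypothetical.

```python
import re

# Matches the opening of either template named above. Tolerating
# whitespace around the pipes is an assumption about the wikitext.
NOUN_TEMPLATE = re.compile(r"\{\{\s*(de-noun|infl\s*\|\s*de\s*\|\s*noun)\b")

def has_german_noun_template(wikitext):
    """Return True if the page text uses de-noun or infl|de|noun."""
    return NOUN_TEMPLATE.search(wikitext) is not None
```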

Auxiliary verbs

Question: In "Basic German" by Schenke the only two auxiliary verbs are "sein" and "haben", while 'the six modal verbs in German' are "dürfen", "können", "müssen", "sollen", "wollen", "mögen". I guess that the correct treatment in Apertium is something like this:

bin/sein<vbser><pres><p1><sg>
bist/sein<vbser><pres><p2><sg>
ist/sein<vbser><pres><p3><sg>
sind/sein<vbser><pres><p1><pl>
sind/sein<vbser><pres><p3><pl>
habe/haben<vbhaver><pres><p1><sg>
hast/haben<vbhaver><pres><p2><sg>
hat/haben<vbhaver><pres><p3><sg>
haben/haben<vbhaver><pres><p1><pl>
haben/haben<vbhaver><pres><p3><pl>
haben/haben<vbhaver><inf>
...

and mark those six modal verbs with the vbmod tag. But doing this would leave the vbaux tag unused. Is this correct?

Does "werden" classify as vbaux?
We use vaux rather than vbaux, but yes, I guess it can be. - Francis Tyers 22:03, 25 October 2011 (UTC)
Sorry, that's a typo; I meant vaux. So the question is, should "werden" be the only verb in German with this tag?
Sure, that's fine. - Francis Tyers 01:27, 26 October 2011 (UTC)
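
The classification settled on in this thread can be summarised as a small lookup. This is a sketch only: the function name is hypothetical, and the fallback tag vblex for ordinary lexical verbs is an assumption that was not discussed above.

```python
# The six modal verbs listed in the question above.
MODAL_VERBS = {"dürfen", "können", "müssen", "sollen", "wollen", "mögen"}

def verb_class_tag(lemma):
    """Return the verb-class tag for a German verb lemma."""
    if lemma == "sein":
        return "vbser"
    if lemma == "haben":
        return "vbhaver"
    if lemma == "werden":
        return "vaux"
    if lemma in MODAL_VERBS:
        return "vbmod"
    return "vblex"  # assumption: everything else is an ordinary verb

for lemma in ("sein", "haben", "werden", "können", "gehen"):
    print(f"{lemma}<{verb_class_tag(lemma)}>")
```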

Personal pronouns

Question: Regarding personal pronouns, I looked at the way it's done in Icelandic and it seems that the correct treatment would be something like this:

ich/ich<prn><p1><mf><sg><nom>  
mich/ich<prn><p1><mf><sg><acc> 
mir/ich<prn><p1><mf><sg><dat>  

du/du<prn><p2><mf><sg><nom>    
dich/du<prn><p2><mf><sg><acc>  
dir/du<prn><p2><mf><sg><dat>   

er/er<prn><p3><m><sg><nom>     
ihn/er<prn><p3><m><sg><acc>    
ihm/er<prn><p3><m><sg><dat>    
...

That is, as opposed to using prpers as in:

I/prpers<prn><subj><p1><mf><sg>
me/prpers<prn><obj><p1><mf><sg>

is this correct?

Yes, that's fine. This stuff is really easy to change later anyway. - Francis Tyers 22:03, 25 October 2011 (UTC)


Adjectives

German adjectives are confusing. Any tips on how to treat them would be appreciated :)

Here is an example declension for grün.
grün; grün; pst.m.sg.pred; adj
grün; grüner; pst.m.sg.nom.sta; adj
grün; grüne; pst.f.sg.nom.sta; adj
grün; grünes; pst.nt.sg.nom.sta; adj
grün; grüne; pst.mfn.pl.nom.sta; adj
...
grün; grüne; pst.m.sg.nom.vei; adj
grün; grüne; pst.f.sg.nom.vei; adj
grün; grüne; pst.nt.sg.nom.vei; adj
grün; grünen; pst.mfn.pl.nom.vei; adj
...
grün; grüner; pst.m.sg.nom.mix; adj
grün; grüne; pst.f.sg.nom.mix; adj
grün; grünes; pst.nt.sg.nom.mix; adj
grün; grünen; pst.mfn.pl.nom.mix; adj
...
grün; grünerer; comp.m.sg.nom.sta; adj
grün; grünster; sup.m.sg.nom.sta; adj
...
This is how I would suggest doing it in speling to start off with. - Francis Tyers 13:14, 26 October 2011 (UTC)
I guess we then need those extra tags to indicate the declension type? sta (strong), vei (weak), mix (mixed). Note that these tags are not in Tags, but maybe that's just not updated (or this hasn't occurred in previous languages). Elaichi 13:46, 26 October 2011 (UTC)
sta and vei are from Icelandic, mix I made up for German. :) - Francis Tyers 08:45, 27 October 2011 (UTC)
Ok, let's use those tags then. By the way, I've been using enwiktionary-20111016-pages-meta-current.xml to extract stuff; just to be safe, that's the one I should use, right? Also, it seems that you already have scripts to parse wiktionaries; would you mind sharing them? - Elaichi 13:28, 27 October 2011 (UTC)
I don't really have anything generic. I just do screenscraping on the HTML. User:AureiAnimus might have something. - Francis Tyers 13:53, 27 October 2011 (UTC)
I have written some scripts based on the ones that were lying around; they are in svn at /trunk/apertium-tools/wiktionary. Depending on the HTML you have, you'll need to alter them: I based wikExtractionary on what you get when you use index.php&action=render to fetch the pages. Also, the version in there is for Dutch nouns. It shouldn't be too hard to modify it, but I figure the rich German morphology will cause some trouble. AureiAnimus 18:29, 6 November 2011 (UTC)
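
As an illustration of the sta/vei/mix scheme, the sketch below generates the positive nominative lines for an adjective. The endings are read off the grün example above and cover the nominative only. The function name is hypothetical, and the approach assumes the ending attaches directly to the lemma, which will not hold for every adjective.

```python
# Nominative endings only, taken from the grün example on this page.
# Declension type -> gender/number slot -> ending.
NOM_ENDINGS = {
    "sta": {"m.sg": "er", "f.sg": "e", "nt.sg": "es", "mfn.pl": "e"},
    "vei": {"m.sg": "e", "f.sg": "e", "nt.sg": "e", "mfn.pl": "en"},
    "mix": {"m.sg": "er", "f.sg": "e", "nt.sg": "es", "mfn.pl": "en"},
}

def positive_nominative(lemma):
    """Return speling lines for the positive nominative forms of lemma."""
    lines = []
    for decl, slots in NOM_ENDINGS.items():
        for slot, ending in slots.items():
            lines.append(f"{lemma}; {lemma}{ending}; pst.{slot}.nom.{decl}; adj")
    return lines

for line in positive_nominative("grün"):
    print(line)
```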

Ordinals

Question: should ordinal numbers be treated as determiners or adjectives? Some examples in other languages:

fifth<det><ord><sp>  # english
quinto<det><ord><m><sg>  # spanish
fimmti<adj><ord><m><sg><nom>  # icelandic
vijfde<det><ord><sp>  # dutch
It doesn't really matter either way. What do the traditional grammars say? - Francis Tyers 22:03, 25 October 2011 (UTC)
Can they have both strong/weak endings, or only one or the other? - Francis Tyers 22:19, 25 October 2011 (UTC)
According to canoo, they can have both strong and weak endings, and they are considered adjectives.
Great, so go with that then :) (ps. you can sign your posts with ~~~~) - Francis Tyers 13:05, 26 October 2011 (UTC)

Contractions

Question: how should prepositional articles (preposition + article) be treated? e.g.

am = an + dem
aufs = auf + das
beim = bei + dem
im = in + dem
vom = von + dem

my guess is that it should follow the treatment in other languages, e.g.

al/a<pr>+el<det><def><m><sg>  # spanish
del/de<pr>+el<det><def><m><sg>  # spanish
au/à<pr>+le<det><def><m><sg>  # french

is this correct?

Yes, this is correct. But I wouldn't bother to do this in the speling file. Use the speling file mainly for the open categories. - Francis Tyers 22:03, 25 October 2011 (UTC)
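
Since these contractions form a closed set, a hand-maintained table may be enough. The sketch below only splits the five contractions listed above into preposition and article; it does not produce tagged analyses, and the names CONTRACTIONS and expand_tokens are hypothetical.

```python
# The five contractions from the question above: surface -> (prep, article).
CONTRACTIONS = {
    "am": ("an", "dem"),
    "aufs": ("auf", "das"),
    "beim": ("bei", "dem"),
    "im": ("in", "dem"),
    "vom": ("von", "dem"),
}

def expand_tokens(tokens):
    """Replace each known contraction with its preposition and article."""
    expanded = []
    for token in tokens:
        parts = CONTRACTIONS.get(token.lower())
        if parts is None:
            expanded.append(token)
        else:
            expanded.extend(parts)
    return expanded

print(expand_tokens(["im", "Haus"]))
```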

Progress

Just added some words from non-inflecting categories (using the speling format) to the GitHub repository. Got the idea of non-inflecting words from Canoo. Elaichi 22:03, 1 December 2011 (UTC)

Nice, but I'm not sure that entries like

zu; zum; det.def.nt.sg.dat; pr
zu; zum; det.def.m.sg.dat; pr
zu; zur; det.def.f.sg.dat; p

are supported by the speling script; you might need to add these manually in the end. Any work on verbs yet? - Francis Tyers 08:00, 2 December 2011 (UTC)

Thanks for spotting that! Regarding verbs... not yet. I tried parsing the Wiktionary templates a while back, but didn't have enough time. I'll look into it some more. Elaichi 22:47, 2 December 2011 (UTC)