Talk:German to English

From Apertium
Revision as of 16:49, 12 September 2012 by Elaichi (talk | contribs) (→‎Progress)

Getting started

What's the best approach to start adding entries to the German monodix?

A good way would be to start by writing a script that downloads Wiktionary entries for German nouns and converts them into speling format, e.g.
http://en.wiktionary.org/wiki/Bett#Declension
http://en.wiktionary.org/wiki/Haus#Declension
Bett; Bett; sg.nom; n.nt
Bett; Bettes; sg.gen; n.nt
Bett; Betts; sg.gen; n.nt
Bett; Bett; sg.dat; n.nt
Bett; Bett; sg.acc; n.nt
Bett; Betten; pl.nom; n.nt
Bett; Betten; pl.gen; n.nt 
Bett; Betten; pl.dat; n.nt
Bett; Betten; pl.acc; n.nt
Haus; Haus; sg.nom; n.nt
Haus; Hauses; sg.gen; n.nt
Haus; Haus; sg.dat; n.nt
Haus; Haus; sg.acc; n.nt
Haus; Häuser; pl.nom; n.nt
Haus; Häuser; pl.gen; n.nt
Haus; Häusern; pl.dat; n.nt
Haus; Häuser; pl.acc; n.nt
There are around 15,000 entries in the category German nouns, so that should be a good start. - Francis Tyers 07:13, 18 October 2011 (UTC)
Another thing you can do is make lists of closed category words that don't inflect (E.g. prepositions, conjunctions) and also of abbreviations. - Francis Tyers 07:15, 18 October 2011 (UTC)
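As a rough illustration of the target format, here is a minimal Python sketch that turns an already-extracted declension table into speling lines. The function name `to_speling` and the dictionary layout are assumptions made for this example; fetching and parsing the Wiktionary pages is deliberately left out.

```python
def to_speling(lemma, forms, pos):
    """Return speling lines of the shape: lemma; surface form; inflection tags; PoS tags."""
    lines = []
    for tags, surfaces in forms.items():
        # A slot may have several surface variants (e.g. Bettes / Betts).
        for surface in surfaces:
            lines.append(f"{lemma}; {surface}; {tags}; {pos}")
    return lines


# Hypothetical input structure: one key per number/case slot.
bett = {
    "sg.nom": ["Bett"],
    "sg.gen": ["Bettes", "Betts"],
    "sg.dat": ["Bett"],
    "sg.acc": ["Bett"],
    "pl.nom": ["Betten"],
    "pl.gen": ["Betten"],
    "pl.dat": ["Betten"],
    "pl.acc": ["Betten"],
}

for line in to_speling("Bett", bett, "n.nt"):
    print(line)
```

Keeping a list of variants per slot means nouns with two genitive forms need no special casing.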

Order of symbols

Francis, what should be the expected order of the symbols in the morphological analysis? Let's say we are analyzing "Apfel", is it <POS><gender><case><number> or <POS><gender><number><case>? I guess it should also output all the possible cases, e.g.:

Apfel<n><m><nom><sg>
Apfel<n><m><acc><sg>
Apfel<n><m><dat><sg>
<PoS><gender><number><case> - for lack of a better phrase, that's the order of inherency, plus it's easier to work with. Much easier. -- Jimregan 15:39, 19 October 2011 (UTC)
Also, listing 'viele' as the plural of 'ein' is dubious, and will more than likely cause problems. Treat them as separate words -- Jimregan 15:53, 19 October 2011 (UTC)
I started a stub at Tag_order on this, but it's not very complete. --unhammer 07:05, 20 October 2011 (UTC)
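To make the <PoS><gender><number><case> order concrete, here is a small Python sketch that converts one speling line into an analysis string. It assumes the four-field speling layout used elsewhere on this page and simply puts the PoS field before the inflection field; `speling_to_analysis` is a hypothetical helper, not part of the speling tools.

```python
def speling_to_analysis(line):
    """Turn 'lemma; surface; infl.tags; pos.tags' into 'surface/lemma<tag>...'."""
    lemma, surface, infl, pos = [field.strip() for field in line.split(";")]
    # PoS and gender come from the last field, number and case from the third.
    tags = pos.split(".") + infl.split(".")
    return f"{surface}/{lemma}" + "".join(f"<{tag}>" for tag in tags)


print(speling_to_analysis("Apfel; Apfel; sg.acc; n.m"))
```

With nouns written as `sg.acc; n.m`, the recommended order falls out of the concatenation without any reordering step.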

Here's a repository with some initial progress (sorry for the delay, I was out of town last week):

https://github.com/elaichi/apertium-de-en-dev

There are fewer nouns than expected because my script only got the ones with the de-noun template and not the infl|de|noun template.

Nice, there are 4187 in total, with 282 paradigms. - Francis Tyers 22:15, 25 October 2011 (UTC)
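One way to pick up both template styles mentioned above is a single regular expression over the raw wikitext. This is only a sketch: it detects the presence of a de-noun or infl|de|noun template and makes no assumptions about their parameters; the function name is hypothetical.

```python
import re

# Matches '{{de-noun' or '{{infl|de|noun' when followed by '|' or '}}'.
NOUN_TEMPLATE = re.compile(
    r"\{\{\s*(?:de-noun|infl\s*\|\s*de\s*\|\s*noun)\s*(?=[|}])"
)


def has_german_noun_template(wikitext):
    """Return True if the page text contains either German noun template."""
    return NOUN_TEMPLATE.search(wikitext) is not None


print(has_german_noun_template("==German==\n{{infl|de|noun}}"))
```

The lookahead keeps the pattern from matching longer template names that merely start with the same text.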

Auxiliary verbs

Question: In "Basic German" by Schenke the only two auxiliary verbs are "sein" and "haben", while 'the six modal verbs in German' are "dürfen", "können", "müssen", "sollen", "wollen", "mögen". I guess that the correct treatment in Apertium is something like this:

bin/sein<vbser><pres><p1><sg>
bist/sein<vbser><pres><p2><sg>
ist/sein<vbser><pres><p3><sg>
sind/sein<vbser><pres><p1><pl>
sind/sein<vbser><pres><p3><pl>
habe/haben<vbhaver><pres><p1><sg>
hast/haben<vbhaver><pres><p2><sg>
hat/haben<vbhaver><pres><p3><sg>
haben/haben<vbhaver><pres><p1><pl>
haben/haben<vbhaver><pres><p3><pl>
haben/haben<vbhaver><inf>
...

and mark those six modal verbs with the vbmod tag. But doing this would leave the vbaux tag unused. Is this correct?

Does "werden" classify as vbaux?
We use vaux, but I guess it can be. - Francis Tyers 22:03, 25 October 2011 (UTC)
Sorry, that's a typo; I meant vaux. So the question is, should "werden" be the only verb in German with this tag?
Sure, that's fine. - Francis Tyers 01:27, 26 October 2011 (UTC)
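The outcome of this thread can be summarised as a small lookup. The Python sketch below encodes the tags agreed above (vbser, vbhaver, vbmod, vaux); the fallback tag vblex for all other verbs is an assumption that was not discussed here, and `verb_tag` is a hypothetical helper.

```python
MODALS = {"dürfen", "können", "müssen", "sollen", "wollen", "mögen"}


def verb_tag(lemma):
    """Return the verb-class tag for a German verb lemma."""
    if lemma == "sein":
        return "vbser"
    if lemma == "haben":
        return "vbhaver"
    if lemma == "werden":
        return "vaux"
    if lemma in MODALS:
        return "vbmod"
    return "vblex"  # assumption: default tag for ordinary lexical verbs


for lemma in ("sein", "haben", "werden", "müssen"):
    print(lemma, verb_tag(lemma))
```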

Personal pronouns

Question: Regarding personal pronouns, I looked at the way it's done in Icelandic and it seems that the correct treatment would be something like this:

ich/ich<prn><p1><mf><sg><nom>  
mich/ich<prn><p1><mf><sg><acc> 
mir/ich<prn><p1><mf><sg><dat>  

du/du<prn><p2><mf><sg><nom>    
dich/du<prn><p2><mf><sg><acc>  
dir/du<prn><p2><mf><sg><dat>   

er/er<prn><p3><m><sg><nom>     
ihn/er<prn><p3><m><sg><acc>    
ihm/er<prn><p3><m><sg><dat>    
...

that is, as opposed to using prpers, as in:

I/prpers<prn><subj><p1><mf><sg>
me/prpers<prn><obj><p1><mf><sg>

is this correct?

Yes, that's fine. This stuff is really easy to change later anyway. - Francis Tyers 22:03, 25 October 2011 (UTC)


Adjectives

German adjectives are confusing, any tips on how to treat them would be appreciated :)

Here is an example declension for grün.
grün; grün; pst.m.sg.pred; adj
grün; grüner; pst.m.sg.nom.sta; adj
grün; grüne; pst.f.sg.nom.sta; adj
grün; grünes; pst.nt.sg.nom.sta; adj
grün; grüne; pst.mfn.pl.nom.sta; adj
...
grün; grüne; pst.m.sg.nom.vei; adj
grün; grüne; pst.f.sg.nom.vei; adj
grün; grüne; pst.nt.sg.nom.vei; adj
grün; grünen; pst.mfn.pl.nom.vei; adj
...
grün; grüner; pst.m.sg.nom.mix; adj
grün; grüne; pst.f.sg.nom.mix; adj
grün; grünes; pst.nt.sg.nom.mix; adj
grün; grünen; pst.mfn.pl.nom.mix; adj
...
grün; grünerer; comp.m.sg.nom.sta; adj
grün; grünster; sup.m.sg.nom.sta; adj
...
This is how I would suggest to do it in speling to start off with. - Francis Tyers 13:14, 26 October 2011 (UTC)
I guess we then need those extra tags to indicate the declension type? sta (strong), vei (weak), mix (mixed). Note that these tags are not in Tags, but maybe that's just not updated (or this hasn't occurred in previous languages). Elaichi 13:46, 26 October 2011 (UTC)
sta and vei are from Icelandic, mix I made up for German. :) - Francis Tyers 08:45, 27 October 2011 (UTC)
Ok, let's use those tags then. By the way, I've been using enwiktionary-20111016-pages-meta-current.xml to extract stuff; just to be safe, that's the one I should use, right? Also, it seems that you already have scripts to parse wiktionaries; would you mind sharing them? - Elaichi 13:28, 27 October 2011 (UTC)
I don't really have anything generic. I just do screenscraping on the HTML. User:AureiAnimus might have something. - Francis Tyers 13:53, 27 October 2011 (UTC)
I have written some scripts based on the ones which were lying around; they are in svn at /trunk/apertium-tools/wiktionary. Depending on the HTML you have, you'll need to alter them: I based wikExtractionary on what you get when you use index.php&action=render to fetch the pages. Also, the version in there is for Dutch nouns. It shouldn't be too hard to modify, but I figure the rich German morphology will cause some trouble. AureiAnimus 18:29, 6 November 2011 (UTC)

Ordinals

Question: should ordinal numbers be treated as determiners or adjectives? Some examples in other languages:

fifth<det><ord><sp>  # english
quinto<det><ord><m><sg>  # spanish
fimmti<adj><ord><m><sg><nom>  # icelandic
vijfde<det><ord><sp>  # dutch
It doesn't really matter either way. What do the traditional grammars say? - Francis Tyers 22:03, 25 October 2011 (UTC)
Can they have both strong and weak endings, or only one or the other? - Francis Tyers 22:19, 25 October 2011 (UTC)
According to canoo, they can have both strong and weak endings, and they are considered adjectives.
Great, so go with that then :) (ps. you can sign your posts with ~~~~) - Francis Tyers 13:05, 26 October 2011 (UTC)

Contractions

Question: how should prepositional articles (preposition + article) be treated? e.g.

am = an + dem
aufs = auf + das
beim = bei + dem
im = in + dem
vom = von + dem

my guess is that it should follow the treatment in other languages, e.g.

al/a<pr>+el<det><def><m><sg>  # spanish
del/de<pr>+el<det><def><m><sg>  # spanish
au/à<pr>+le<det><def><m><sg>  # french

is this correct?

Yes, this is correct. But I wouldn't bother to do this in the speling file. Use the speling file mainly for the open categories. - Francis Tyers 22:03, 25 October 2011 (UTC)
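Since these contractions form a closed list, a hand-written table is enough. The following Python sketch only covers the five examples given above and splits a token into preposition plus article form; the names `CONTRACTIONS` and `expand` are made up for illustration, and no claim is made about how the dictionary itself should encode them.

```python
# Only the contractions listed in this thread; a real list would be longer.
CONTRACTIONS = {
    "am": ("an", "dem"),
    "aufs": ("auf", "das"),
    "beim": ("bei", "dem"),
    "im": ("in", "dem"),
    "vom": ("von", "dem"),
}


def expand(token):
    """Return (preposition, article) for a known contraction, else (token,)."""
    return CONTRACTIONS.get(token.lower(), (token,))


print(expand("im"))
print(expand("Haus"))
```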

Progress

Just added some words in non-inflecting categories (using the speling format) to the github repository. Got the idea of non-inflecting words from Canoo. Elaichi 22:03, 1 December 2011 (UTC)

Nice, but I'm not sure that entries like

zu; zum; det.def.nt.sg.dat; pr
zu; zum; det.def.m.sg.dat; pr
zu; zur; det.def.f.sg.dat; pr

are supported by the speling script; you might need to add these manually in the end. Any work on verbs yet? - Francis Tyers 08:00, 2 December 2011 (UTC)

Thanks for spotting that! Regarding verbs... not yet. Tried parsing the wiktionary templates a while back, but didn't have enough time. I'll look some more into it. Elaichi 22:47, 2 December 2011 (UTC)
The best place to look for how I'd probably do the verbs is the Dutch dictionary in the Afrikaans and Dutch pair. :) - Francis Tyers 00:31, 3 December 2011 (UTC)

Now that we have a fairly complete full-form list (look into the github repository), what's the workflow for using the speling tools to finally get a monodix? Elaichi 17:14, 3 January 2012 (UTC)

The best thing to start with is nouns. The workflow is basically: (0) split-speling.py, (1) speling-paradigms.py, (2) paradigm-chopper.py, (3) check. The tools are here. If you need any help using them, let me know. - Francis Tyers 00:18, 5 January 2012 (UTC)

Once we have a monodix, it'll be time to work on the PoS tagger. How is the disambiguation going to work for the cases? e.g. "Der Apfel ist rot." (Apfel<n><m><sg><nom>), "Ich esse den Apfel." (Apfel<n><m><sg><acc>). Is apertium-tagger capable of this? Or perhaps using Constraint Grammar? - Elaichi 03:52, 27 January 2012 (UTC)

I would recommend Constraint Grammar. :) - Francis Tyers 09:49, 27 January 2012 (UTC)
Is there something similar using CG for other Germanic languages (or languages with cases)? I found a nice example here, but based on it, this seems like a tough task: basically to come up with a rule-based parser for German, right? Elaichi 23:06, 6 February 2012 (UTC)
Well, it would work basically like Icelandic I guess; there is a CG in the Icelandic pair, or you could also look at the Slavic languages. It's not that hard a task: what you're trying to do is remove whatever ambiguity you can so that the statistical tagger has an easier time, not make a perfect disambiguator for German. :) - Francis Tyers 01:40, 7 February 2012 (UTC)
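To illustrate what "removing the ambiguity you can" might look like for case, here is a toy Python sketch of a single agreement constraint. It is not Constraint Grammar syntax and not taken from any existing pair; the reading dictionaries and the function name are assumptions for the example. It keeps only the noun readings whose number and case agree with some reading of the preceding determiner.

```python
def filter_by_agreement(det_readings, noun_readings):
    """Keep noun readings that agree in (number, case) with the determiner."""
    allowed = {(r["number"], r["case"]) for r in det_readings}
    kept = [r for r in noun_readings if (r["number"], r["case"]) in allowed]
    # Like a safe removal rule: never discard the last remaining reading.
    return kept if kept else noun_readings


# "den" is ambiguous between masculine accusative singular and dative plural.
den = [
    {"number": "sg", "case": "acc"},
    {"number": "pl", "case": "dat"},
]
# "Apfel" is ambiguous between three singular cases.
apfel = [
    {"number": "sg", "case": "nom"},
    {"number": "sg", "case": "acc"},
    {"number": "sg", "case": "dat"},
]

print(filter_by_agreement(den, apfel))
```

For "Ich esse den Apfel." only the singular accusative reading of Apfel survives, while a bare noun with no agreeing determiner keeps all its readings for the statistical tagger to resolve.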


Ok, guys, the github repository has what you need to compile the monodix now. It's still lacking some things like entries with sub-readings or entries handled with regular expressions, but it's a start :) Elaichi 16:49, 12 September 2012 (UTC)