Talk:German to English
Latest revision as of 16:49, 12 September 2012
Getting started
What's the best approach to start adding entries to the German monodix?
- A good way would be to start by writing a script to download Wiktionary entries for German nouns and convert them into speling format, e.g.
Bett; Bett; sg.nom; n.nt
Bett; Bettes; sg.gen; n.nt
Bett; Betts; sg.gen; n.nt
Bett; Bett; sg.dat; n.nt
Bett; Bett; sg.acc; n.nt
Bett; Betten; pl.nom; n.nt
Bett; Betten; pl.gen; n.nt
Bett; Betten; pl.dat; n.nt
Bett; Betten; pl.acc; n.nt
Haus; Haus; sg.nom; n.nt
Haus; Hauses; sg.gen; n.nt
Haus; Haus; sg.gen; n.nt
Haus; Haus; sg.dat; n.nt
Haus; Haus; sg.acc; n.nt
Haus; Häuser; pl.nom; n.nt
Haus; Häuser; pl.gen; n.nt
Haus; Häusern; pl.dat; n.nt
Haus; Häuser; pl.acc; n.nt
- There are around 15,000 entries in the category German nouns, so that should be a good start. - Francis Tyers 07:13, 18 October 2011 (UTC)
- Another thing you can do is make lists of closed-category words that don't inflect (e.g. prepositions, conjunctions) and also of abbreviations. - Francis Tyers 07:15, 18 October 2011 (UTC)
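The conversion step suggested above could be sketched roughly as follows. This is only an illustration: the helper name `to_speling` and the shape of the input dictionary are assumptions, and the output layout (lemma; surface form; tags; PoS) is inferred from the Bett and Haus examples rather than from the speling tools themselves. Downloading and parsing the Wiktionary pages is left out.

```python
# Hypothetical helper (not part of the speling tools): turn a declension
# table into speling-format lines of the shape "lemma; surface; tags; pos".

def to_speling(lemma, pos, forms):
    """forms maps a tag string such as 'sg.nom' to a list of surface forms."""
    lines = []
    for tags, surfaces in forms.items():
        for surface in surfaces:
            lines.append(f"{lemma}; {surface}; {tags}; {pos}")
    return lines


# Declension of "Bett" as listed in the example above.
bett = {
    "sg.nom": ["Bett"],
    "sg.gen": ["Bettes", "Betts"],
    "sg.dat": ["Bett"],
    "sg.acc": ["Bett"],
    "pl.nom": ["Betten"],
    "pl.gen": ["Betten"],
    "pl.dat": ["Betten"],
    "pl.acc": ["Betten"],
}

speling_lines = to_speling("Bett", "n.nt", bett)
print("\n".join(speling_lines))
```

Keeping a list of surface forms per tag string makes alternative forms such as Bettes/Betts come out as separate lines, which is how they appear in the example.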
Order of symbols
Francis, what should be the expected order of the symbols in the morphological analysis? Let's say we are analyzing "Apfel", is it <POS><gender><case><number> or <POS><gender><number><case>? I guess it should also output all the possible cases, e.g.:
Apfel<n><m><nom><sg>
Apfel<n><m><acc><sg>
Apfel<n><m><dat><sg>
- <PoS><gender><number><case> - for lack of a better phrase, that's the order of inherency, plus it's easier to work with. Much easier. -- Jimregan 15:39, 19 October 2011 (UTC)
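- Also, listing 'viele' as the plural of 'ein' is dubious, and will more than likely cause problems. Treat them as separate words -- Jimregan 15:53, 19 October 2011 (UTC)
- I started a stub at Tag_order on this, but it's not very complete. --unhammer 07:05, 20 October 2011 (UTC)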
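To make the agreed order concrete, here is a minimal sketch that always emits tags as <PoS><gender><number><case>, whatever order the features are supplied in. The function name `analysis` and the keyword names are made up for illustration; only the tag order comes from the discussion.

```python
# Tag order agreed in the thread: <PoS><gender><number><case>.
TAG_ORDER = ("pos", "gender", "number", "case")


def analysis(lemma, **features):
    """Build an analysis string with tags in the fixed order above.

    Hypothetical helper: features that are not supplied are skipped.
    """
    tags = "".join(f"<{features[name]}>" for name in TAG_ORDER if name in features)
    return f"{lemma}{tags}"


apfel = [
    analysis("Apfel", pos="n", gender="m", number="sg", case=c)
    for c in ("nom", "acc", "dat")
]
print("\n".join(apfel))
```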
Here's a repository with some initial progress (sorry for the delay, I was out of town last week):
https://github.com/elaichi/apertium-de-en-dev
There are fewer nouns than expected because my script only got the ones with the de-noun template and not the infl|de|noun template.
- Nice, there are 4187 in total, with 282 paradigms. - Francis Tyers 22:15, 25 October 2011 (UTC)
Auxiliary verbs
Question: In "Basic German" by Schenke the only two auxiliary verbs are "sein" and "haben", while 'the six modal verbs in German' are "dürfen", "können", "müssen", "sollen", "wollen", "mögen". I guess that the correct treatment in Apertium is something like this:
bin/sein<vbser><pres><p1><sg>
bist/sein<vbser><pres><p2><sg>
ist/sein<vbser><pres><p3><sg>
sind/sein<vbser><pres><p1><pl>
sind/sein<vbser><pres><p3><pl>
habe/haben<vbhaver><pres><p1><sg>
hast/haben<vbhaver><pres><p2><sg>
hat/haben<vbhaver><pres><p3><sg>
haben/haben<vbhaver><pres><p1><pl>
haben/haben<vbhaver><pres><p3><pl>
haben/haben<vbhaver><inf>
...
and mark those six modal verbs with the vbmod tag. But doing this would leave the vbaux tag unused; is this correct?
- Does "werden" classify as vbaux?
- We use vaux, but it can be I guess. - Francis Tyers 22:03, 25 October 2011 (UTC)
- Sorry, that's a typo; I meant vaux. So the question is, should "werden" be the only verb in German with this tag?
- Sure, that's fine. - Francis Tyers 01:27, 26 October 2011 (UTC)
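The outcome of this thread can be summarised as a small lookup, sketched below. The function name `verb_tag` is hypothetical, and the fallback to `vblex` for all other verbs is an assumption that is not stated in the discussion.

```python
# The six modal verbs named in the question.
MODALS = {"dürfen", "können", "müssen", "sollen", "wollen", "mögen"}


def verb_tag(lemma):
    """Return the verb-class tag for a German lemma, per the thread above."""
    if lemma == "sein":
        return "vbser"
    if lemma == "haben":
        return "vbhaver"
    if lemma == "werden":
        return "vaux"      # agreed: "werden" is the only verb with this tag
    if lemma in MODALS:
        return "vbmod"
    return "vblex"         # assumption: default tag for ordinary verbs


for lemma in ("sein", "haben", "werden", "können", "essen"):
    print(lemma, verb_tag(lemma))
```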
Personal pronouns
Question: Regarding personal pronouns, I looked at the way it's done in Icelandic and it seems that the correct treatment would be something like this:
ich/ich<prn><p1><mf><sg><nom>
mich/ich<prn><p1><mf><sg><acc>
mir/ich<prn><p1><mf><sg><dat>
du/du<prn><p2><mf><sg><nom>
dich/du<prn><p2><mf><sg><acc>
dir/du<prn><p2><mf><sg><dat>
er/er<prn><p3><m><sg><nom>
ihn/er<prn><p3><m><sg><acc>
ihm/er<prn><p3><m><sg><dat>
...
that is, as opposed to using prpers as in:
I/prpers<prn><subj><p1><mf><sg>
me/prpers<prn><obj><p1><mf><sg>
is this correct?
- Yes, that's fine. This stuff is really easy to change later anyway. - Francis Tyers 22:03, 25 October 2011 (UTC)
Adjectives
German adjectives are confusing, any tips on how to treat them would be appreciated :)
- Here is an example declension for grün.
grün; grün; pst.m.sg.pred; adj
grün; grüner; pst.m.sg.nom.sta; adj
grün; grüne; pst.f.sg.nom.sta; adj
grün; grünes; pst.nt.sg.nom.sta; adj
grün; grüne; pst.mfn.pl.nom.sta; adj
...
grün; grüne; pst.m.sg.nom.vei; adj
grün; grüne; pst.f.sg.nom.vei; adj
grün; grüne; pst.nt.sg.nom.vei; adj
grün; grünen; pst.mfn.pl.nom.vei; adj
...
grün; grüner; pst.m.sg.nom.mix; adj
grün; grüne; pst.f.sg.nom.mix; adj
grün; grünes; pst.nt.sg.nom.mix; adj
grün; grünen; pst.mfn.pl.nom.mix; adj
...
grün; grünerer; comp.m.sg.nom.sta; adj
grün; grünster; sup.m.sg.nom.sta; adj
...
- This is how I would suggest doing it in speling to start off with. - Francis Tyers 13:14, 26 October 2011 (UTC)
- I guess we then need those extra tags to indicate the declension type? sta (strong), vei (weak), mix (mixed). Note that these tags are not in Tags, but maybe that's just not updated (or this hasn't occurred in previous languages). Elaichi 13:46, 26 October 2011 (UTC)
- sta and vei are from Icelandic, mix I made up for German. :) - Francis Tyers 08:45, 27 October 2011 (UTC)
- Ok, let's use those tags then. By the way, I've been using enwiktionary-20111016-pages-meta-current.xml to extract stuff; just to be safe, that's the one I should use, right? Also, it seems that you already have scripts to parse wiktionaries; would you mind sharing them? - Elaichi 13:28, 27 October 2011 (UTC)
- I don't really have anything generic. I just do screenscraping on the HTML. User:AureiAnimus might have something. - Francis Tyers 13:53, 27 October 2011 (UTC)
- I have written some scripts based on the ones which were lying around; they are in svn at /trunk/apertium-tools/wiktionary. Depending on the HTML you have, you'll need to alter them: I based wikExtractionary on what you get when you use index.php&action=render to fetch the pages. Also, the version in there is for Dutch nouns. It shouldn't be too hard to modify, but I figure the rich German morphology will cause some trouble. AureiAnimus 18:29, 6 November 2011 (UTC)
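As a rough illustration of the sta/vei/mix scheme, the sketch below generates the nominative lines of the grün example from a table of endings. The endings are read off the example above; the function name and the table layout are assumptions, other cases and the predicative form are omitted, and adjectives whose stem changes when inflected would need extra handling.

```python
# Nominative endings per declension type, taken from the grün example.
# sta = strong, vei = weak, mix = mixed.
NOM_ENDINGS = {
    "sta": {"m.sg": "er", "f.sg": "e", "nt.sg": "es", "mfn.pl": "e"},
    "vei": {"m.sg": "e", "f.sg": "e", "nt.sg": "e", "mfn.pl": "en"},
    "mix": {"m.sg": "er", "f.sg": "e", "nt.sg": "es", "mfn.pl": "en"},
}


def adjective_nominatives(lemma, degree="pst"):
    """Hypothetical helper: speling lines for the nominative forms only."""
    lines = []
    for declension, endings in NOM_ENDINGS.items():
        for gender_number, ending in endings.items():
            tags = f"{degree}.{gender_number}.nom.{declension}"
            lines.append(f"{lemma}; {lemma}{ending}; {tags}; adj")
    return lines


gruen_lines = adjective_nominatives("grün")
print("\n".join(gruen_lines))
```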
Ordinals
Question: should ordinal numbers be treated as determiners or adjectives? Some examples in other languages:
fifth<det><ord><sp> # english
quinto<det><ord><m><sg> # spanish
fimmti<adj><ord><m><sg><nom> # icelandic
vijfde<det><ord><sp> # dutch
- It doesn't really matter either way. What do the traditional grammars say ? - Francis Tyers 22:03, 25 October 2011 (UTC)
- Can they have both strong/weak endings, or only one or the other ? - Francis Tyers 22:19, 25 October 2011 (UTC)
- According to canoo, they can have both strong and weak endings, and they are considered adjectives.
- Great, so go with that then :) (ps. you can sign your posts with ~~~~) - Francis Tyers 13:05, 26 October 2011 (UTC)
Contractions
Question: how should prepositional articles (preposition + article) be treated? e.g.
am = an + dem
aufs = auf + das
beim = bei + dem
im = in + dem
vom = von + dem
my guess is that it should follow the treatment in other languages, e.g.
al/a<pr>+el<det><def><m><sg> # spanish
del/de<pr>+el<det><def><m><sg> # spanish
au/à<pr>+le<det><def><m><sg> # french
is this correct?
- Yes, this is correct. But I wouldn't bother to do this in the speling file. Use the speling file mainly for the open categories. - Francis Tyers 22:03, 25 October 2011 (UTC)
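For readers unfamiliar with the joined notation used in the Spanish and French examples, this small sketch splits such an entry into its surface form and its component readings. The function name `parse_analysis` is made up, and the parsing is based only on the three example lines, not on a full description of the stream format.

```python
import re


def parse_analysis(entry):
    """Split 'surface/lemma<tag>...+lemma<tag>...' into its parts.

    Returns (surface, [(lemma, [tags]), ...]); one tuple per '+'-joined reading.
    """
    surface, joined = entry.split("/", 1)
    parts = []
    for chunk in joined.split("+"):
        lemma = chunk.split("<", 1)[0]
        tags = re.findall(r"<([^>]+)>", chunk)
        parts.append((lemma, tags))
    return surface, parts


surface, parts = parse_analysis("al/a<pr>+el<det><def><m><sg>")
print(surface, parts)
```

A German contraction such as "im" would follow the same pattern, with a preposition reading joined to a determiner reading.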
Progress
Just added some words in uninflected categories (using the speling format) to the github repository. Got the idea of uninflected words from Canoo. Elaichi 22:03, 1 December 2011 (UTC)
Nice, but I'm not sure
zu; zum; det.def.nt.sg.dat; pr
zu; zum; det.def.m.sg.dat; pr
zu; zur; det.def.f.sg.dat; p
is supported by the speling script, you might need to add these manually in the end. Any work on verbs yet ? - Francis Tyers 08:00, 2 December 2011 (UTC)
- Thanks for spotting that! Regarding verbs... not yet. Tried parsing the wiktionary templates a while back, but didn't have enough time. I'll look some more into it. Elaichi 22:47, 2 December 2011 (UTC)
- The best place to look for how I'd probably do the verbs is the Dutch dictionary in the Afrikaans and Dutch pair. :) - Francis Tyers 00:31, 3 December 2011 (UTC)
Now that we have a fairly complete full-form list (look into the github repository), what's the workflow for using the speling tools to finally get a monodix? Elaichi 17:14, 3 January 2012 (UTC)
- The best thing to start with is nouns... The workflow is basically: (0) split-speling.py, (1) speling-paradigms.py, (2) paradigm-chopper.py, (3) check. The tools are here: https://apertium.svn.sourceforge.net/svnroot/apertium/trunk/apertium-tools/speling. If you need any help using them, let me know. - Francis Tyers 00:18, 5 January 2012 (UTC)
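The idea behind going from a full-form list to paradigms can be sketched as below. This is not the actual speling-paradigms.py or paradigm-chopper.py, whose options and internals are not described here; it only shows one plausible way to group lemmas that share the same set of suffixes, with the stem taken as the longest common prefix of all forms. The function names and the reduced three-form lexicon are assumptions for illustration.

```python
import os
from collections import defaultdict


def paradigm_of(lemma, forms):
    """forms is a list of (surface, tags). Returns (stem, paradigm), where the
    paradigm is the set of (suffix, tags) pairs left after removing the stem."""
    stem = os.path.commonprefix([lemma] + [surface for surface, _ in forms])
    paradigm = frozenset((surface[len(stem):], tags) for surface, tags in forms)
    return stem, paradigm


def group_by_paradigm(lexicon):
    """Group lemmas whose forms reduce to the same (suffix, tags) set."""
    groups = defaultdict(list)
    for lemma, forms in lexicon.items():
        _, paradigm = paradigm_of(lemma, forms)
        groups[paradigm].append(lemma)
    return groups


# Toy full-form list with only three forms per noun.
lexicon = {
    "Bett": [("Bett", "sg.nom"), ("Bettes", "sg.gen"), ("Betten", "pl.nom")],
    "Hemd": [("Hemd", "sg.nom"), ("Hemdes", "sg.gen"), ("Hemden", "pl.nom")],
    "Haus": [("Haus", "sg.nom"), ("Hauses", "sg.gen"), ("Häuser", "pl.nom")],
}

groups = group_by_paradigm(lexicon)
for paradigm, lemmas in groups.items():
    print(sorted(lemmas), sorted(paradigm))
```

Note how the umlaut in Häuser shrinks the common stem of Haus to a single letter, so it ends up in a paradigm of its own. This is one reason a language with stem changes tends to produce many paradigms.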
Once we have a monodix, it'll be time to work on the PoS tagger. How is the disambiguation going to work for the cases? e.g. "Der Apfel ist rot." (Apfel<n><m><sg><nom>), "Ich esse den Apfel." (Apfel<n><m><sg><acc>). Is apertium-tagger capable of this? Or perhaps using Constraint Grammar? - Elaichi 03:52, 27 January 2012 (UTC)
- I would recommend Constraint Grammar. :) - Francis Tyers 09:49, 27 January 2012 (UTC)
- Is there something similar using CG for other Germanic languages (or languages with cases)? I found a nice example here: http://wiki.apertium.org/wiki/Apertium_and_Constraint_Grammar, but based on it, this seems like a tough task: basically to come up with a rule-based parser for German, right? Elaichi 23:06, 6 February 2012 (UTC)
- Well, it would work basically like Icelandic I guess; there is a CG in the Icelandic pair, or you could also look at the Slavic languages. It's not that hard a task: what you're trying to do is remove the ambiguity you can so that the statistical tagger has an easier time, not make a perfect disambiguator for German. :) - Francis Tyers 01:40, 7 February 2012 (UTC)
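The "remove the ambiguity you can" idea can be illustrated with a toy filter in Python. This is not Constraint Grammar syntax and not how a real CG is written; it only mimics one rule of that kind for the Apfel example: after an unambiguous article form, keep the noun readings whose case agrees, and never remove the last reading. The `DET_CASES` table covers only masculine singular articles and is an assumption made for this sketch.

```python
# Toy table: case signalled by a preceding masculine singular definite article.
# Deliberately incomplete (e.g. "der" and "den" have other readings in German).
DET_CASES = {"der": {"nom"}, "den": {"acc"}, "dem": {"dat"}}


def prune(prev_word, readings):
    """Keep readings whose final (case) tag agrees with the previous word.

    If the previous word is not in the table, or nothing would survive,
    the readings are returned unchanged, leaving the choice to a later tagger.
    """
    allowed = DET_CASES.get(prev_word.lower())
    if not allowed:
        return readings
    kept = [r for r in readings if r.rsplit("<", 1)[1].rstrip(">") in allowed]
    return kept or readings


apfel_readings = [
    "Apfel<n><m><sg><nom>",
    "Apfel<n><m><sg><acc>",
    "Apfel<n><m><sg><dat>",
]

print(prune("Der", apfel_readings))   # "Der Apfel ist rot."
print(prune("den", apfel_readings))   # "Ich esse den Apfel."
```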
Ok, guys, the github repository has what you need to compile the monodix now. It's still lacking some things like entries with sub-readings or entries handled with regular expressions, but it's a start :) Elaichi 16:49, 12 September 2012 (UTC)