Difference between revisions of "Welsh to English"

From Apertium
 
(16 intermediate revisions by the same user not shown)
Line 4: Line 4:
 
==Todo==
 
==Todo==
   
* Fix multiword verbs in bilingual dictionary -- and add ones non-existent in English dictionary to that dictionary
+
* <s>Fix multiword verbs in bilingual dictionary -- and add ones non-existent in English dictionary to that dictionary</s>
 
* Remove items which are in English dictionary but not Welsh/Bilingual
 
* Remove items which are in English dictionary but not Welsh/Bilingual
* Fix verb conjugation in the Welsh analyser
+
* <s>Fix verb conjugation in the Welsh analyser</s>
* Add restrictions in the bidix
+
* <s>Add restrictions in the bidix</s>
 
* Fix numbers
 
* Fix numbers
* Add adverbs
+
* <s>Add adverbs</s>
* More thorough handling of contractions (i'ch, a'u, ...)
+
* <s>More thorough handling of contractions (i'ch, a'u, ...) &mdash; including preblank</s>
  +
* <s>Add pre-verbal particles (basic functionality)</s>
  +
* Add adjective macro to all chunks
   
 
==Roadmap==
 
==Roadmap==
Line 25: Line 27:
 
* To be able to identify ''who'' said ''what'' to ''who''.
 
* To be able to identify ''who'' said ''what'' to ''who''.
 
* To be able to distinguish whether a particular item is interesting enough to be translated properly.
 
* To be able to distinguish whether a particular item is interesting enough to be translated properly.
* Sentences of up to 5 words should be translated reasonably well in both directions.
+
* Sentences of up to 5 words should be translated reasonably well from Welsh to English.
   
  +
;Report
===apertium-cy-en 0.5===
 
   
  +
* Coverage:
===apertium-cy-en 1.0===
 
  +
** Wikipedia (753,741 words): 85.5%
  +
** PNAW (11,684,177 words): 94%
  +
** BBC Newyddion (144,887 words): 91%
   
  +
===apertium-cy-en 0.2===
== Tagger ==
 
   
  +
* 0.1 performance and coverage for English to Welsh.
Tagger needs to be retrained to take into account new POS, e.g. "relative pronoun", "adverb"
 
   
  +
===apertium-cy-en 0.5===
===="i" as preposition====
 
Ambiguity: <code>^i/i<pr>/prpers<prn><subj><p1><mf><sg>$ ^foderneiddio/moderneiddio<vblex><inf>/moderneiddio<vblex><prs><p3><sg>$</code>
 
   
  +
* Properly capitalised sentences.
Welsh "i" (to) is getting translated as "[f]i" (I, me).
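The ambiguity is easier to see once the analyser output is split into lexical units. A minimal Python sketch, assuming the usual <code>^surface/analysis/analysis$</code> stream layout shown above and ignoring escaped characters; <code>parse_stream</code> is a hypothetical helper, not part of Apertium:

```python
import re
from typing import List, Tuple

# One lexical unit: ^surface/analysis1/analysis2...$
# (simplified: escaped "/" or "$" inside a unit are not handled)
LU = re.compile(r"\^([^/$]*)/([^$]*)\$")

def parse_stream(stream: str) -> List[Tuple[str, List[str]]]:
    """Return (surface form, list of analyses) for each lexical unit."""
    return [(surface, analyses.split("/"))
            for surface, analyses in LU.findall(stream)]

stream = ("^i/i<pr>/prpers<prn><subj><p1><mf><sg>$ "
          "^foderneiddio/moderneiddio<vblex><inf>"
          "/moderneiddio<vblex><prs><p3><sg>$")

for surface, analyses in parse_stream(stream):
    # More than one analysis means the tagger has to pick one.
    print(surface, len(analyses), analyses)
```

Both units come back with two analyses each, so the tagger has to choose for "i" and for "foderneiddio".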
 
  +
* Get the number for nouns from the appropriate place, e.g. sometimes from the det, sometimes from the noun.
   
  +
===apertium-cy-en 1.0===
if Welsh "i" occurs immediately after a verb marked as 1p sing
 
output pronoun 1p sing
 
otherwise output preposition "to"
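
The rule for "i" could be sketched like this. It is a plain Python illustration, not Apertium rule syntax; the <code>Token</code> type and the tag names in the sets are assumptions made for the example:

```python
from typing import List, NamedTuple, Set

class Token(NamedTuple):
    surface: str
    tags: Set[str]  # hypothetical tag set, e.g. {"vblex", "p1", "sg"}

VERB_TAGS = {"vblex", "vbser"}  # assumed verb categories

def resolve_i(tokens: List[Token], idx: int) -> str:
    """Pick a reading for Welsh "i" at position idx."""
    if idx > 0:
        prev = tokens[idx - 1].tags
        # Immediately after a 1p sing verb: the pronoun reading.
        if prev & VERB_TAGS and {"p1", "sg"} <= prev:
            return "prn"
    # Anywhere else: the preposition "to".
    return "pr"

# Illustrative inputs only: "gwelais i" (I saw), "cynllun i foderneiddio".
saw = [Token("gwelais", {"vblex", "p1", "sg"}), Token("i", set())]
plan = [Token("cynllun", {"n", "sg"}), Token("i", set()),
        Token("foderneiddio", {"vblex", "inf"})]
print(resolve_i(saw, 1), resolve_i(plan, 1))
```

Only the token immediately to the left is inspected, so anything intervening between verb and "i" would fall through to the preposition.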
 
 
===="o'n" - disambiguate "he" and "from"====
 
 
; mae fo'n mynd -> he isgoing
 
Fine (apart from the missing space).
 
 
Contrast:
 
; mae o'n mynd -> *is ofgoing - he is going
 
 
The elided form "o" is more common here than "fo". Following the 1.3.4 pattern above:
 
 
if Welsh "o" occurs immediately after a verb marked as 3p sing
 
output pronoun 3p sing
 
otherwise output preposition "of/from"
 
 
This is probably better than the earlier version I had here:
 
 
For Welsh pattern "verb + o"
 
output "verb + 3p sing pronoun"
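
To see where the two versions differ, here is a small Python comparison. The tag sets are assumptions for illustration, and "dod o Gaerdydd" (come from Cardiff) is my own example of a verb followed by prepositional "o":

```python
VERB_TAGS = {"vblex", "vbser"}  # assumed verb categories

def earlier_rule(prev_tags: set) -> str:
    """Earlier version: any verb + "o" gives the pronoun."""
    return "he" if prev_tags & VERB_TAGS else "of/from"

def refined_rule(prev_tags: set) -> str:
    """Refined version: only a 3p sing verb + "o" gives the pronoun."""
    if prev_tags & VERB_TAGS and {"p3", "sg"} <= prev_tags:
        return "he"
    return "of/from"

mae = {"vbser", "p3", "sg"}  # "mae o'n mynd"
dod = {"vblex", "inf"}       # "dod o Gaerdydd"

for name, tags in (("mae", mae), ("dod", dod)):
    print(name, earlier_rule(tags), refined_rule(tags))
```

Both rules give "he" after "mae", but only the refined one keeps "of/from" after the infinitive.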
 
 
====Preferential choice between verbforms====
 
 
; bydd y lamp yn rhoi golau -> *are the lamp giving light - the lamp will be giving light
 
: (and presumably we could massage this into "the lamp will give light" later, since that would be the more natural English equivalent)
 
 
A couple of things here. The most important is that the tagger chooses the less frequent imperative over the future for the verb, which presumably means that the subject shift can't take place. But even with the imperative reading, the imperative 2p sing info gets lost between interchunk and postchunk and is replaced with a (generic?) present, which gets output as "are". Odd.
 
 
(I'm assuming that "bydd" would get output as "will be", since that would be the correct English tense.)
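
One way to picture the preference is a fixed ranking over competing readings. This is only a sketch: the real tagger is statistical, and both the <code>PREFERENCE</code> list and the tag names <code>fti</code> (future) and <code>imp</code> (imperative) are assumptions here:

```python
from typing import List, Sequence

# Earlier in the list = preferred. Hypothetical ordering.
PREFERENCE = ["fti", "imp"]

def rank(analysis: Sequence[str]) -> int:
    """Lower rank wins; readings with no listed tag sort last."""
    for position, tag in enumerate(PREFERENCE):
        if tag in analysis:
            return position
    return len(PREFERENCE)

def prefer(analyses: List[Sequence[str]]) -> Sequence[str]:
    """Pick the best-ranked reading (first one wins on ties)."""
    return min(analyses, key=rank)

bydd = [
    ["bod", "vbser", "imp", "p2", "sg"],  # imperative reading
    ["bod", "vbser", "fti", "p3", "sg"],  # future reading
]
print(prefer(bydd))
```

With this ordering "bydd" comes out as future 3p sing, which is what the subject shift needs.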
 
 
== Transfer ==
 
 
<pre>
 
# Welsh
 
: Literal
 
@ Gloss (English)
 
</pre>
 
 
=== Welsh to English ===
 
 
==== Word order (VSO to SVO) ====
 
<pre>
 
# Genir pawb yn rhydd ac yn gydradd â 'i gilydd mewn urddas a hawliau.
 
: Be born everyone free and equal with each other in dignity and rights.
 
 
@ Everyone is born free and equal with each other in dignity and rights.
 
</pre>
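
A rough Python sketch of the reordering at chunk level. The chunk labels <code>VP</code>, <code>NP</code> and <code>ADJP</code> are invented for the example, and the input is the English literal gloss rather than real transfer output:

```python
from typing import List, Tuple

Chunk = Tuple[str, str]  # (hypothetical chunk label, text)

def vso_to_svo(chunks: List[Chunk]) -> List[Chunk]:
    """Swap the first verb chunk with the subject NP that follows it."""
    out = list(chunks)
    for i in range(len(out) - 1):
        if out[i][0] == "VP" and out[i + 1][0] == "NP":
            out[i], out[i + 1] = out[i + 1], out[i]
            break  # only the first verb-subject pair is moved
    return out

literal = [("VP", "is born"), ("NP", "everyone"),
           ("ADJP", "free and equal")]
print(" ".join(text for _, text in vso_to_svo(literal)))
```

Only the first verb and subject pair is touched, so later noun phrases stay where they are.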
 
==== Noun Noun -> Noun of Noun ====
 
<pre>
 
# Llywodraeth Cynulliad Cymru
 
: Government Assembly Wales ==> Government (of) Assembly (of) Wales
 
 
@ Welsh Assembly Government
 
</pre>
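
The intermediate "Government (of) Assembly (of) Wales" step might look like this in Python. The (word, pos) pairs and the single <code>n</code> tag are simplifications for the sketch; getting from there to "Welsh Assembly Government" would need more than this:

```python
from typing import List, Tuple

Tok = Tuple[str, str]  # (word, hypothetical part-of-speech tag)

def insert_of(tokens: List[Tok]) -> List[Tok]:
    """Put "of" between every pair of adjacent nouns."""
    out: List[Tok] = []
    for i, (word, pos) in enumerate(tokens):
        if i > 0 and pos == "n" and tokens[i - 1][1] == "n":
            out.append(("of", "pr"))
        out.append((word, pos))
    return out

glossed = [("Government", "n"), ("Assembly", "n"), ("Wales", "n")]
print(" ".join(word for word, _ in insert_of(glossed)))
```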
 
 
==== Noun Adjective -> Adjective Noun====
 
<pre>
 
# bachgen hapus
 
: boy happy
 
 
@ happy boy
 
 
# geneth bert
 
: girl pretty
 
 
@ pretty girl
 
</pre>
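
As a sketch, the swap is a single pass over tagged tokens. Python here rather than transfer rules, and the <code>n</code> / <code>adj</code> tags are assumed:

```python
from typing import List, Tuple

Tok = Tuple[str, str]  # (word, hypothetical part-of-speech tag)

def swap_noun_adj(tokens: List[Tok]) -> List[Tok]:
    """Turn each noun + adjective pair into adjective + noun."""
    out = list(tokens)
    i = 0
    while i < len(out) - 1:
        if out[i][1] == "n" and out[i + 1][1] == "adj":
            out[i], out[i + 1] = out[i + 1], out[i]
            i += 2  # skip past the pair just swapped
        else:
            i += 1
    return out

print(swap_noun_adj([("boy", "n"), ("happy", "adj")]))
print(swap_noun_adj([("girl", "n"), ("pretty", "adj")]))
```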
 
 
====Compound prepositions====
 
<pre>
 
<donnek> I've also thought of another wrinkle - compound prepositions
 
<spectie> i will probably need to write a rule
 
<donnek> eg ar ben (on top of)
 
<donnek> lit on head
 
<spectie> we can do a similar thing with those
 
<spectie> for example:
 
<donnek> becomes ar fy mhen (on my head, literally) = on top of me
 
<donnek> ar ei ben, ar ei phen, ar ein pennau
 
<spectie> are there many of them
 
<donnek> maybe we don't need to think about them now, but just to flag them for later
 
<spectie> if there are not many it might be worth making them multiwords
 
<donnek> how do multiwords work
 
<spectie> there are a few ways
 
<spectie> depending on if one of the words inside the multiword inflects or not
 
<donnek> that would be the case here
 
<spectie> for example "take care"
 
<spectie> "i take care of", "you take care of", "he takes care of"
 
<spectie> but "take care" is treated as one verb
 
<donnek> ok
 
</pre>
 
 
====Attributive and predicative adjectives====
 
 
<pre>
 
<spectie> its a problem with attributive/predicative
 
<donnek> it's say something (which is) nice
 
<spectie> but in english we don't distinguish between the two (at least in terms of morphology)
 
<spectie> yes
 
<spectie> in afrikaans they have a -e for attributive (e.g. feodale stelsel -- feudal system)
 
<spectie> and "the system is feudal" - "die stelsel is feodaal"
 
<spectie> donnek, aye
 
<donnek> in Welsh the second would have yn before the adj
 
<donnek> so we may not need anything to mark attrib/pred
 
</pre>
 
 
* Dywedodd rhywbeth neis wrthi = He said something nice to her
 
* Mae'r peth yno yn neis = That thing is nice
 
: Mae yr peth yno yn neis
 
* Mae'n gar neis = It is a nice car
 
: Mae yn gar neis
 
 
<pre>
 
<donnek> at first glance, we may just need a rule for rhyw+thing
 
<donnek> rhyw = some
 
<donnek> rhywbeth (something), rhywfaint (somewhat), etc
 
<donnek> rhywle (somewhere)
 
</pre>
 
 
====Possession====
 
 
<pre>
 
Mae cath 'da Bwflw
 
Bod+p3.sg.pres cath gyda Bwflw
 
Be+p3.sg.pres cat with Beefalo
 
`Beefalo has a cat'
 
</pre>
 
 
;Apertium notes
 
 
We can probably deal with this in interchunk as follows
 
 
vbbod NP1 pr_gyda NP2
 
 
->
 
 
NP2 vbhave NP1
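
The same rewrite as a Python sketch over labelled chunks. The labels are the ones used in the pattern above; the tuple layout and the function name are made up for the illustration:

```python
from typing import List, Tuple

Chunk = Tuple[str, str]  # (chunk label, text)

def possession(chunks: List[Chunk]) -> List[Chunk]:
    """Rewrite vbbod NP1 pr_gyda NP2 as NP2 vbhave NP1."""
    labels = [label for label, _ in chunks]
    if labels == ["vbbod", "NP", "pr_gyda", "NP"]:
        _, np1, _, np2 = chunks
        return [np2, ("vbhave", "have"), np1]
    return list(chunks)  # pattern not matched: leave untouched

sentence = [("vbbod", "mae"), ("NP", "cath"),
            ("pr_gyda", "'da"), ("NP", "Bwflw")]
print(possession(sentence))
```

Agreement of "have" with the new subject ("has") is not handled here and would need to happen afterwards.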
 
 
====The 'yn' particle====
 
 
 
As well as meaning 'in', 'yn' is used to form the present participle of a verb in Welsh. For example:
 
 
*dysgu = to learn
 
*yn dysgu = learning
 
 
The present tense is formed by combining 'yn' with the corresponding form of 'bod' (to be) as follows:
 
 
*Mae Beefalo yn gweithio = Beefalo is working/Beefalo works
 
 
Note: when following a vowel, yn is abbreviated to 'n, e.g.
 
 
*Mae Beefalo'n gweithio
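
A quick Python sketch of undoing the contraction before analysis. It is regex-only and assumes a plain ASCII apostrophe; it will also wrongly expand 'n where it stands for something other than "yn", so treat it as an illustration rather than a proposal:

```python
import re

# 'n directly after a vowel (w and y count as vowels in Welsh).
CONTRACTED_YN = re.compile(r"(?<=[aeiouwy])'n\b", re.IGNORECASE)

def expand_yn(text: str) -> str:
    """Rewrite vowel + 'n as vowel + " yn"."""
    return CONTRACTED_YN.sub(" yn", text)

print(expand_yn("Mae Beefalo'n gweithio"))
print(expand_yn("Mae'n gar neis"))
```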
 
 
====Genitive Phrases====
 
   
  +
* Handling of gender and number in adjectives
To form the indefinite genitive, a simple construct of <object><subject> can be used.
 
For example, "Soldiers of Wales" would be "milwyr Cymru", literally "soldiers Wales".
 
   
Definite genitives are formed with a similar construction, just with the addition of y between the object and the subject.
 
For example, "Beic y gath" = "The cat's bike" literally "bike the cat"
 
Note: feminine singular nouns undergo a soft mutation after the word "y" (hence "gath" from "cath" above).
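
The two genitive patterns can be told apart by whether "y" sits between the nouns. A toy Python sketch, with a hand-made <code>GLOSS</code> table standing in for the dictionaries (it even lists the mutated form "gath" directly, which a real analyser would not need):

```python
from typing import List

# Hypothetical mini-dictionary, for the two examples only.
GLOSS = {"milwyr": "soldiers", "Cymru": "Wales",
         "beic": "bike", "gath": "cat"}

def translate_genitive(words: List[str]) -> str:
    """Handle "N N" (indefinite) and "N y N" (definite) genitives."""
    if len(words) == 3 and words[1] in ("y", "yr"):
        head, _, owner = words
        return f"the {GLOSS[owner]}'s {GLOSS[head]}"
    if len(words) == 2:
        head, owner = words
        return f"{GLOSS[head]} of {GLOSS[owner]}"
    raise ValueError("not a two-noun genitive phrase")

print(translate_genitive(["milwyr", "Cymru"]))
print(translate_genitive(["beic", "y", "gath"]))
```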
 
   
   
 
[[Category:Discussions]]
 
[[Category:Discussions]]
[[Category:Language pairs]]
 
 
[[Category:Welsh to English]]
 
[[Category:Welsh to English]]

Latest revision as of 13:24, 10 December 2010


Todo

  • Fix multiword verbs in bilingual dictionary -- and add ones non-existent in English dictionary to that dictionary
  • Remove items which are in English dictionary but not Welsh/Bilingual
  • Fix verb conjugation in the Welsh analyser
  • Add restrictions in the bidix
  • Fix numbers
  • Add adverbs
  • More thorough handling of contractions (i'ch, a'u, ...) — including preblank
  • Add pre-verbal particles (basic functionality)
  • Add adjective macro to all chunks

Roadmap

apertium-cy-en 0.1

  • 8,000 of the highest frequency words in each dictionary.
  • Rules dealing with basic verb tenses (past, present, future)
  • Basic word re-ordering for simple phrases.
Aims and uses
  • For a non-native speaker to be able to discern the topic of a general news item.
  • To be able to identify who said what to who.
  • To be able to distinguish whether a particular item is interesting enough to be translated properly.
  • Sentences of up to 5 words should be translated reasonably well from Welsh to English.
Report
  • Coverage:
    • Wikipedia (753,741 words): 85.5%
    • PNAW (11,684,177 words): 94%
    • BBC Newyddion (144,887 words): 91%

apertium-cy-en 0.2

  • 0.1 performance and coverage for English to Welsh.

apertium-cy-en 0.5

  • Properly capitalised sentences.
  • Get the number for nouns from the appropriate place, e.g. sometimes from the det, sometimes from the noun.

apertium-cy-en 1.0

  • Handling of gender and number in adjectives