Difference between revisions of "Talk:Welsh to English"
Revision as of 14:19, 28 June 2008
- Note: Comments should not include '=' as it confuses the Wiki templating system (as I just found out myself)
- Note 2: Suggestions for part-of-speech disambiguation should go here.
English to Welsh
Macros
- This will contain chunks of rules that we need to split out to make them more maintainable
Patterns
Determiner Adjective Noun
When the determiner is indefinite, output noun + adjective. When the determiner is definite, output determiner + noun + adjective.
- Tests
(1) A red cat
- coch cath
(2) The red cat
- Y coch cath
Notes for areas to be covered
A sort of scratchpad / todo list, based on things that come up when putting phrases into the testing webform.
Conjunctive genitive
- gwallt yr eneth - *hair the girl - the hair of the girl - the girl's hair
- llaw y bachgen - *hand the boy - the hand of the boy - the boy's hand
Note that the noun phrase in English is definite - contrast "merch y meddyg" (the doctor's daughter) and "merch meddyg" (a doctor's daughter).
For an English phrase of the type "def + noun1 + of + def + noun2" or of the type "def + noun2 + 's + noun1" convert in Welsh to "noun1 + def + noun2".
- Here can noun1 be a simple noun, or can it be a noun phrase? For example "the red cat of the young boy" - Francis Tyers
- e.g.
- For the pattern det.def + noun1 + of + det.def + noun2:
- Output noun1 + det.def + noun2
- Yes, as long as you like, eg,
- cath goch bachgen bach merch ifanc bert rheolwr y banc mawr du
- the red cat of the little boy of the pretty young daughter of the manager of the big black bank
- It's only the last NP of the sequence that gets the def.det. Donnek
- Ok, so this requires a three level rule.
- t1x -> t2x SN_(the cat red) of_(of) SN_(the boy little) of_(of) SN_(the daughter young pretty) of_(of) SN_(the manager) of_(of) SN_(the bank big black)
- t2x -> t3x SN_(the cat red) SN_(the boy little) SN_(the daughter young pretty) SN_(the manager) SN_(the bank big black)
- t3x -> gen (cat red boy little daughter young pretty manager the bank big black)
- What I'll do for now is get the chunks working ('SN' -- noun phrase, and 'of'), for values of 'noun', 'det noun', 'det adj noun', 'det adj adj noun', 'det adj adj adj noun', etc. Then look at taking care of more frequent cases (e.g. the first example). Francis Tyers
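As a rough illustration of the t2x and t3x stages above, here is a minimal Python sketch (it is not Apertium transfer-rule code). It assumes each SN chunk arrives as a list of glosses already in Welsh order with its determiner first; the function name `merge_genitive_chain` and the use of English glosses are purely illustrative.

```python
def merge_genitive_chain(chunks, det="the"):
    """Join a chain of noun-phrase chunks into one genitive phrase.

    Every chunk loses its leading definite determiner except the last
    one, following the remark above that only the final NP keeps it.
    """
    merged = []
    for index, chunk in enumerate(chunks):
        is_last = index == len(chunks) - 1
        words = list(chunk)
        if not is_last and words and words[0] == det:
            words = words[1:]  # drop the determiner on non-final chunks
        merged.extend(words)
    return " ".join(merged)


# The chunks from the t2x line above, with the 'of' chunks already removed.
chain = [
    ["the", "cat", "red"],
    ["the", "boy", "little"],
    ["the", "daughter", "young", "pretty"],
    ["the", "manager"],
    ["the", "bank", "big", "black"],
]
print(merge_genitive_chain(chain))
```

With the chunks from the example, this prints the same word sequence as the t3x line: cat red boy little daughter young pretty manager the bank big black.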
For a Welsh phrase of the type "!det + noun1 + def + noun2" convert in English to "def + noun1 + of + def + noun2" or to "def + noun2 + 's + noun1".
The second noun is probably historically a genitive, but it has lost all case markers. The equivalent in Irish would be:
- ceann an chapaill - *head the of-horse (gen) - the head of the horse - the horse's head
- ceann capaill - *head of-horse (gen) - the head of a horse - a horse's head
"was"
"roedd" ([he/she/it] was) is unknown, but I seem to remember adding entries for "to be" to the dixes in the mists of time. Was I dreaming? (roedd <- yr + oedd)
- There are entries for 'bod', but 'roedd' doesn't get processed as all of the 'bod' entries start with 'b' (see this link). I will need to fix this in the analyser. If I understand you correctly, 'roedd' is a contraction of 'yr' (determiner ...) + 'oedd' (verb 'bod', past tense ...)? Francis Tyers
- Some serious errors have crept in to those entries. I've sent an amended version to you by email. You're right - roedd -> yr + oedd, but in the amended version I've sent, I've put (e.g.) "roedd" and "oedd" as alternate forms, because "Roedd" is the spoken form, and even in written Welsh you hardly ever see "Yr oedd" nowadays. Donnek
- the boy was in the garden -> *y bachgen bu yn yr ardd - bu'r bachgen yn yr ardd
Almost correct, except for word-order, and the fact that the preterite is being used instead of the imperfect ("roedd y bachgen yn yr ardd"). The preterite needs to be marked as only being used in written Welsh, and to have a lower likelihood than the imperfect. This is too rough a rule, but would do for the time being.
Marking and word-order
The above brings up a useful point about this. If the standard VSO sequence is changed to SVO (ie unchanged from the English standard), this is a marked pattern, conveying a relative clause. In written Welsh, the verb will be preceded by "a" + soft mutation, but in spoken Welsh the "a" usually disappears.
- y bachgen [a] fu yn yr ardd ddydd Llun (the boy who was in the garden on Monday)
- yr eneth [a] welodd y ci (the girl who saw the dog)
contrast
- gwelodd yr eneth y ci (the girl saw the dog)
Hmmm. Relative clauses are going to be difficult.
For Welsh pattern "noun + a + soft-mutated_verb" output English pattern "noun + who/which + verb".
- The dictionary only has 'a' down as a co-ordinating conjunction "and", does it have other meanings? - Francis Tyers
- Yes. "a" - relative "who, which" in a relative clause where the subject is the same as that of the main clause, and "a" - interrogative pre-verbal particle (eg a weles ti hwnna? - did you see that?). Both are followed by soft mutation. Note that interrogative "a" is usually omitted in speech, leaving only the mutation. - Donnek
"yn" as stative
For Welsh pattern "yn + adj" output "adj"
There is a problem here in that this pattern can also be an adverb:
- siaradodd yn hapus am ei fywyd - he talked happily about his life
For English pattern "adverb_formed_from_adj + ly" output Welsh "yn + adj"
- This second one will be difficult to do, as we don't have adverbs in the English dictionary marked as derivatives from adjectives or not. - Francis Tyers
- OK. Unfortunately, since "yn + adj" can be either an adj or an adv in Welsh, I don't even mark them separately in Eurfa - perhaps I should. Would one option be to replicate all the Welsh adj entries in Apertium by preceding them with "yn + space", and adding "-ly" to the English side? This would get the EW direction, but I don't know whether it would cause problems on the WE direction. - Donnek
The above rule has been applied (way!), but does not catch mutated adjectives ("yn" causes soft mutation):
- *tyfodd fo yn mawr -> he grew big
- tyfodd fo yn fawr -> *he grew in *fawr
- This was a dictionary error, 'fawr' did not have the initial-m paradigm. Now added. - Francis Tyers
- OK - there are a couple of others I've come across: mwy (fwy), bach (fach), gwyn (wyn). there may be a few more. - Donnek
- Taken care of the first two, 'gwyn' doesn't seem to appear in the dictionary (only as 'complaint'). does it inflect at all? - Francis Tyers
- LOL! There are some obvious words not in Eurfa, tut tut to me! gwyn (white), *gwen (in practice "wen", fem), gwynion (occasionally, plural), gwynnach (whiter), gwynnaf (whitest). There may be fem comp and super forms too, but we can ignore those. By the way, "da" also has this problem too. - Donnek
:) -- Ok, I've added gwyn/gwynnach/gwynnaf for now, adding the genders would probably mess up some rules and these are probably fairly low frequency and can be taken care of later. - Francis Tyers
Preferential choice between noun and verbform
- atebodd hi'r cwestiwn -> *answered shethe #hold an inquiry - she answered the question
proc selects 'cwestiwn' (question) - correct - and 1p pl imperative of 'cwestio' (an infrequent verb for 'hold an inquiry'). The 1p pl present would also have been a possibility, and indeed a more likely one. tagger selects the second of these.
Not sure how widespread this would be, but the tagger should give precedence to the noun choice whenever the verb form is preceded by 'y':
For Welsh pattern "[y | yr | 'r] + [noun | verb]" output "[y | yr | 'r] + [noun]"
This is not perfect, because "y | yr" can also be an indirect relative clause pronoun before a verb, but it would catch most things until we can resolve the latter point.
- gwelodd y dyn y llyfr -> *the man saw the books - the man saw the book
This is similar, but is tricksy because it is superficially correct apart from the plural. But in fact, tagger is reading "llyfr" as pres 3p sing of "llyfru" (to book). Apart from being infrequent, and therefore much less likely to appear ("bwcio" would be the usual word), Eurfa has "llyfra" as the pres 3p sing, so there may be a paradigm problem too. The above rule would throw out the verb in the meantime.
- It is currently using the aberth/u__vblex paradigm (see output here). Is this incorrect? - Francis Tyers
- The problem is that "aberthu", apart from the 'regular' "abertha" also has a written "aberth". So yes, it probably is incorrect. The problem is that a lot of less common verbs are very rarely inflected. It might have been better to use something like "gwenu" or "siomi". In the meantime, perhaps just changing "aberth" to "abertha" in the pres 3p sing will do. - Donnek
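The noun-preference rule proposed in this section can be sketched outside the tagger as a simple filter. This is only an illustration under stated assumptions: analyses are represented as hypothetical (lemma, tag) pairs, the tag names `n` and `vblex` follow the usual Apertium convention, and the function names are invented for the example. It does not handle the indirect relative "y | yr" mentioned above.

```python
ARTICLES = {"y", "yr"}


def follows_article(previous_token):
    """True if the previous token is y/yr or carries a cliticised 'r."""
    token = previous_token.lower()
    return token in ARTICLES or token.endswith("'r")


def prefer_noun(previous_token, analyses):
    """Keep only noun readings after the definite article.

    analyses is a list of (lemma, tag) pairs. If no noun reading
    exists, the list is returned unchanged rather than emptied.
    """
    if not follows_article(previous_token):
        return analyses
    nouns = [analysis for analysis in analyses if analysis[1] == "n"]
    return nouns or analyses


print(prefer_noun("y", [("llyfr", "n"), ("llyfru", "vblex")]))
print(prefer_noun("hi'r", [("cwestiwn", "n"), ("cwestio", "vblex")]))
```

Falling back to the full list when there is no noun reading keeps the filter from discarding the only available analysis.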
Number agreement of verb
- I added 'rabbits' to the dictionary, but the problem of unknown words and phrase movement is one we're experiencing in Basque too... - Francis Tyers
- OK - so it's basically an issue that you can't do much about until the word is logged. Hmm. I suppose that makes sense, since Apertium can't figure out what to do with something until it knows what it should do with it ... In a practical sense, this is going to be problematic if we demo Apertium using unseen text. Is there any way of doing some blind choosing, eg
- if this word is
- preceded by [y,yr,'r]
- we will assume it's a noun
- preceded by yn
- we will assume it's a verb
- unless a verb has been identified in the current phrase
- in which case we'll assume it's an adjective
- This might break Apertium - I don't know. In theory, though, we might be able to get relative probabilities for particular sequences from a corpus. - Donnek
I'd be reluctant to add one as we'd not be able to get the translation, on the other hand, it wouldn't cause messing up of word order. It's an open problem, and we're thinking about it :) - Francis Tyers
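For reference, the "blind choosing" heuristic suggested above amounts to a small decision function. The sketch below is hypothetical: the name `guess_category` and the string labels are invented for illustration, and nothing here corresponds to an existing Apertium feature.

```python
def guess_category(previous_token, verb_seen_in_phrase):
    """Guess a category for an unknown Welsh word from its left context.

    Returns "noun", "verb", "adjective", or None when the context
    gives no hint.
    """
    prev = previous_token.lower()
    if prev in {"y", "yr"} or prev.endswith("'r"):
        return "noun"
    if prev == "yn":
        # After 'yn', assume a verb unless the phrase already has one.
        return "adjective" if verb_seen_in_phrase else "verb"
    return None


print(guess_category("y", verb_seen_in_phrase=False))
print(guess_category("yn", verb_seen_in_phrase=True))
```

As noted in the discussion, a guess like this could preserve word order, but it still cannot supply a translation for the unknown word.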
Prepositional noun phrase should not be a subject
- cerddodd fo i'r dref -> he walked in the town
Fine, except that the preposition "i" should really be glossed as "to" ("yn y dref" would be "in the town")
Contrast:
- cerddodd i'r dref -> *the town walked in - [he/she] walked to the town
Welsh pattern "prep + det.def + noun" is never a subject phrase
and therefore the "det.def + noun" section shouldn't be shifted. (I can't think of any exceptions to this, but there may be one.)
- There was a rule to do this, I've commented it out, I think there was a reason for it, but I can't recall now. I've run the regression tests below and it doesn't seem to have broken anything. Regarding the preposition, should I change "i" to be "to" instead of "in" ? - Francis Tyers
- Re "i", yes, change it to "to". - Donnek
- The problem here was that the dictionary only had i'r → yn+yr... I've added i'r → i+yr and now it is picking the right one, although I don't know what will happen for other contexts... - Francis Tyers
- Not sure where that would have come from. The only vaguely relevant thing I can think of is "i mewn i" (into). - Donnek
- allan i'r cyfarfod -> *the meeting #exit<vblex><pres><p3> in - out to the meeting
This is similar - "in" should be "to", and should be kept with "the meeting".
However, there is another issue here, which is in effect the same as "Preferential choice between noun and verbform" above. In this case, the verb "allanu" (to exit) is being chosen instead of the much more likely "allan" (out).
- roedd o ar dy lyfr -> *was of on your books - it was on your book
1.3.9 would deal with "of", and 1.3.6 would deal with "books". Subject shift would then produce a reasonable translation.
However:
- roedd ar dy lyfr -> *your #be<vbser><past><p3> on books - (it) was on your book
Omitting the subject pronoun can happen quite frequently in speech if the subject has already been mentioned. The <sg> tag gets lost at interchunk, which means the verb can't be conjugated (this came up somewhere else, but I think it's been taken off the page - maybe it would be better just to mark the issue heading as "addressed" rather than delete it). But there is an additional issue, in that the possessive pronoun is getting treated as the subject and moved separately. So maybe we need a broader rule to say that "prep + det.def/pr.poss/whatever + noun" is an indivisible chunk, and must be dealt with as a block. No part of it would be moved in this case anyway.
- Regarding page cleanup, ok. perhaps having a separate section, and then moving sections down would be a good idea. - Francis Tyers
It would also be nice in the longer term to fill in the pronoun if it is omitted.
For Welsh pattern "verb + non-subject noun phrase" output English "verb + pronoun agreeing in number and person + non-subject noun phrase"
The NSNP could be a prepositional phrase (marked by an initial preposition), or an object phrase (marked with initial soft mutation).
"-ing" as "yn + verb"
For English pattern "subject + verb<vbser> + verb + ing" output for Welsh "verb<vbser> + subject + yn + verb"
Inflected verbs not being parsed
- aeth -> *aeth - (he/she/it) went
However, "aeth" is listed in cy.dix.xml (line 27491) as past 3p sing in the mynd_vblex paradigm, which is what "mynd" (to go) gets conjugated against (line 54444).
Ah - a bug in the segmentation.
- *myndaeth fo -> he went
- he went -> *myndaeth fe
The infinitive is getting added to the irregular forms, instead of being replaced by them.
- Yep, this is a problem in the paradigm for 'mynd', I'll need to rewrite it, fortunately it is only used once... New paradigm output here - Francis Tyers
- Fine, but the imperative forms also need "mynd" excised. - Donnek
- Done. - Francis Tyers
Insert det.indef in prepositional NP
- daeth Taid â lamp -> Grandfather came with lamp (preferably "with a lamp")
- dychwelodd y rheolwr gyda gŵr tew -> the manager returned with fat man (preferably "with a fat man")
For Welsh pattern "prep + noun" output English pattern "prep + det.indef + noun"
This should probably be working now. - Francis Tyers
Preferential choice between verbforms
- bydd y lamp yn rhoi golau -> *are the lamp giving light - the lamp will be giving light (and presumably we could massage this into "the lamp will give light" later, since that would be the more natural English equivalent)
A couple of things here. The most important is that tagger chooses the less frequent imperative out of the imperative/future choice for the verb. Presumably this then means that the subject shift can't take place. But even with the imperative choice, the imperative 2p sing info gets lost between interchunk and postchunk, and replaced with a generic? present which gets output as "are". Odd.
(I'm assuming that "bydd" would get output as "will be", since that would be the correct English tense.)
Comparative adjectives with "less/more"
- tyfodd y twnnel yn llai llachar -> *the tunnel grew small bright - the tunnel grew less bright
- tyfodd y twnnel yn fwy tywyll -> *the tunnel grew big dark - the tunnel grew more dark
For Welsh pattern "fwy/llai + adj" output English "more/less + adj"
For English pattern "more/less + adj" output Welsh "fwy/llai + adj"
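The distinction between "fwy/llai + adj" and bare "fwy/llai" can be sketched as a lookup. This is a toy illustration only: the two small dictionaries hold just the words from the examples on this page, and `gloss_comparative` is a hypothetical helper, not part of the bidix or the transfer rules.

```python
# degree word -> (gloss before an adjective, gloss when standing alone)
DEGREE_WORDS = {
    "mwy": ("more", "bigger"),
    "fwy": ("more", "bigger"),  # soft-mutated form after 'yn'
    "llai": ("less", "smaller"),
}

# A tiny adjective lexicon taken from the examples on this page.
ADJECTIVES = {"llachar": "bright", "tywyll": "dark"}


def gloss_comparative(words):
    """Gloss the words following 'yn' when they start with mwy/fwy/llai."""
    if not words or words[0] not in DEGREE_WORDS:
        return None
    analytic, synthetic = DEGREE_WORDS[words[0]]
    if len(words) > 1 and words[1] in ADJECTIVES:
        return f"{analytic} {ADJECTIVES[words[1]]}"
    return synthetic


print(gloss_comparative(["llai", "llachar"]))
print(gloss_comparative(["fwy"]))
```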
Synthetic comparative adjectives
Many of these seem to have faulty dictionary entries:
- tyfodd y twnnel yn fwy -> the tunnel grew big (should be "bigger")
- tyfodd y twnnel yn llai -> the tunnel grew small (should be "smaller")
- tyfodd y twnnel yn hirach -> the tunnel grew long (should be "longer")
- tyfodd y twnnel yn uwch -> the tunnel grew high (should be "higher")
- Dictionary error in the bidix, now fixed. - Francis Tyers
- Cool! - Donnek
Verb + preposition
Re "coolness factor" below (woop woop!), we need to cater for verbs such as "ymchwilio" which are followed by a preposition that is different from English, or where there is no preposition in English.
For example:
- ymchwilio i - research into, investigate
- siarad am - talk about
- dweud wrth - say to, tell
- gofyn am - ask for
Is there any way to get the verb+prep phrase parsed as a phrase, rather than separately? Perhaps an entry in one of the dictionaries? This would only need to be done for those phrases where the preposition differs in English and Welsh.
Not, for instance for:
- neidio dros - jump over
- cerdded i - walk to
- delio gyda - deal with
where there is a regular correlation between the meanings of the Welsh and English prepositions.
- Yes, these are multiword constructions, like for example "He became accustomed to the taste." → "cynefinodd Fe i y blas." (try it in the testing interface). Is there a way of getting a list of these? (Actually, there are many I currently need to fix in the bidix/English dict, but if you have a list I can look at them.) At the moment we only seem to have multiword verbs on the English side. - Francis Tyers
Subordinate ("reported speech") clauses with "bod" + noun
Also referring to the cool sentence, we have two sentences as follows:
- (1) roedd y Comisiwn yn ymchwilio i'r honiadau - the Commission was investigating the allegations
- (2) mae yr AS wedi methu datgan £103,000 o roddion - the MP has failed to declare £103,000 of gifts
Subordinate clauses, like the relative clauses, will be difficult. But a first stab at this might be as follows:
For Welsh pattern "[b/f]od + [det.def] + noun + [qualifiers] + wedi + verb" output English "that + [det.def] + noun + [qualifiers] + has/have (number agreeing with noun) + verb_past_participle"
- clywodd y dyn bod y trên wedi cyrraedd yn hwyr -> *the man heard be the train after arrive #late
The above rule would give "the man heard that the train has arrived late" - not perfect, since in English we would use pluperfect rather than perfect here, but a lot better.
We can extend this to another construction:
For Welsh pattern "[b/f]od + [det.def] + noun + [qualifiers] + am + verb" output English "that + [det.def] + noun + [qualifiers] + will (number agreeing with noun) + verb"
- clywodd y dyn bod y trên am gyrraedd yn hwyr -> *the man heard be the train for arrive #late
The above rule would give "the man heard that the train will arrive late" - not perfect, since in English we would use conditional rather than future here.
These could be improved if it were possible to refer back to the verb of the main clause. Thus where it is past, the subordinate would use pluperfect or conditional; where it is non-past, the subordinate would use perfect or future.
There are other varieties of subordinate clause, for which I will give separate suggestions.
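The two "bod" patterns, together with the suggested refinement of looking back at the main-clause tense, can be summarised in a short sketch. The function names and parameters are hypothetical, and the English verb forms are passed in by the caller rather than generated; this only shows the choice of auxiliary.

```python
def subordinate_auxiliary(marker, plural, main_clause_past):
    """Pick the English auxiliary for a 'bod + NP + wedi/am + verb' clause."""
    if marker == "wedi":
        if main_clause_past:
            return "had"  # pluperfect after a past main verb
        return "have" if plural else "has"  # perfect, agreeing in number
    if marker == "am":
        return "would" if main_clause_past else "will"
    raise ValueError(f"unsupported marker: {marker}")


def reported_clause(subject, marker, verb, participle,
                    plural=False, main_clause_past=False):
    """Build 'that + subject + auxiliary + verb form'."""
    aux = subordinate_auxiliary(marker, plural, main_clause_past)
    main_verb = participle if marker == "wedi" else verb
    return f"that {subject} {aux} {main_verb}"


# Without main-clause information (the first-stab rule):
print(reported_clause("the train", "wedi", "arrive", "arrived"))
# With the main verb 'clywodd' known to be past:
print(reported_clause("the train", "wedi", "arrive", "arrived",
                      main_clause_past=True))
```

The first call gives the first-stab output "that the train has arrived"; the second gives the pluperfect "that the train had arrived".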
Regression tests
- Treatment of 'is' in present tense.
- The boy is in the garden. → mae y bachgen yn yr ardd. (note: yr → 'r is an open bug)
- mae'r bachgen yn yr ardd. → the boy is in the garden.
- These are both correct (apart from the 'r), but I thought "regressions" were when you fix something and in the process break something else? Re 'r:
In Welsh pattern "aeiouwy + space + y[r]" output "aeiouwy + 'r"
- Yep, so these should be 'regression tests' :) --
- Yep, I know the pattern. The problem is that the post-generator insists on having a ~ before anything that it deals with, which would mean putting '~' before every vowel, and that would be quite difficult. If we can't fix that, another possibility would be to just use a plain transliterator to replace:
- "aeiouwy + space + ~yr + space" with "aeiouwy'r + space"
- Can you think of anything this might catch by accident? or is it a fairly safe search/replace? - Francis Tyers
- I would go with this in the meantime - I think it's pretty safe. Note that your rule can act on both 'y' and 'yr'. The system is:
- consonant + space + y + space + consonant
- consonant + space + yr + space + vowel
- vowel + 'r + space + consonant-or-vowel
- No subject shift with imperative
- gwasgwch y botwm! → squeeze the button!
- squeeze the button! → gwasgu y botwm! (note: infinitive for imperative is an open bug)
- "yn" as stative
- yn falch → proud
- yn hapus → happy
- tyfodd fo yn fawr → he grew big
- Subject shift for pronouns
- roedden nhw'n hapus → they were happy
- Number agreement of verb
- roedd y bechgyn yn hapus → the boys were happy
- roedd y cwningod yn hapus → the rabbits were happy
- gwelodd y dyn y llyfr → the man saw the book
- "yn" as "-ing"
- yn mynd → going
- yn gweld → seeing
- Conjugation of 'mynd'
- aeth fo → he went
- Comparative adjectives
- tyfodd y twnnel yn fwy -> the tunnel grew bigger
- tyfodd y twnnel yn llai -> the tunnel grew smaller
- tyfodd y twnnel yn hirach -> the tunnel grew longer
- tyfodd y twnnel yn uwch -> the tunnel grew higher
- Insertion of indefinite determiner
- daeth Taid â lamp → Grandfather came with a lamp
- dychwelodd y rheolwr gyda gŵr tew → the manager returned with a fat man
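The y / yr / 'r system described under the first regression test above can be sketched as a plain post-processing step, along the lines of the "plain transliterator" idea. This is a minimal sketch under stated assumptions: it treats only the unaccented letters aeiouwy as vowels, as in the rule quoted above, it assumes every standalone "y" or "yr" is the article, and the function name is hypothetical.

```python
VOWELS = "aeiouwy"


def contract_article(text):
    """Normalise the Welsh definite article in a whitespace-separated string.

    vowel + y/yr       -> vowel'r (attached to the previous word)
    consonant + y/yr   -> 'yr' before a vowel, otherwise 'y'
    """
    tokens = text.split()
    out = []
    for i, tok in enumerate(tokens):
        if tok.lower() in ("y", "yr") and out:
            prev = out[-1]
            nxt = tokens[i + 1] if i + 1 < len(tokens) else ""
            if prev[-1].lower() in VOWELS:
                out[-1] = prev + "'r"  # cliticise onto the previous word
                continue
            out.append("yr" if nxt[:1].lower() in VOWELS else "y")
        else:
            out.append(tok)
    return " ".join(out)


print(contract_article("mae y bachgen yn yr ardd"))
```

On the first regression test this turns "mae y bachgen yn yr ardd" into "mae'r bachgen yn yr ardd". A sentence-initial article is left untouched because there is no preceding word to attach to.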
Coolness factor
- Roedd y Comisiwn yn ymchwilio i'r honiadau bod yr AS wedi methu datgan £103,000 o roddion.
- the Commission Was investigating to the allegations be the MP after fail declare £103,000 of gifts.