Difference between revisions of "Talk:Welsh to English"
m (Reverted edits by Adamwalsh (Talk) to last revision by Francis Tyers) |
|||
(246 intermediate revisions by 3 users not shown) | |||
{{TOCD}} |
||
:''Note: Comments should not include '=' as it confuses the Wiki templating system (as I just found out myself)'' |
|||
:''Note 2: Suggestions for part-of-speech disambiguation should go [[Welsh to English#Tagger|here]].'' |
|||
;Section numbers for existing sections from [http://wiki.apertium.org/w/index.php?title=Talk:Welsh_to_English&oldid=5958 this version] — each section / topic should probably be re-numbered to remove reliance on automatic numbering |
|||
{{comment|::OK, I'll try, but I'm not entirely sure of the distinction. some of the stuff at the end of that page, for instance, is covered here. - [[User:Donnek|Donnek]]}} |
|||
---- |
|||
==English to Welsh== |
|||
* [[Talk:Welsh to English/Archive 1]] (? → 12:54, 16 July 2008 (UTC)) |
|||
===Macros=== |
|||
:''This will contain chunks of rules that we need to split out to make them more maintainable'' |
|||
---- |
|||
===Patterns=== |
|||
====Determiner Adjective Noun==== |
|||
When the determiner is indefinite, |
|||
output noun + adjective |
|||
When the determiner is definite, |
|||
output determiner + noun + adjective. |
|||
;Tests |
|||
(1) ''A red cat'' |
|||
: coch cath |
|||
(2) ''The red cat'' |
|||
: Y coch cath |
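The reordering rule above can be sketched roughly as follows. This is only an illustration in Python, not Apertium transfer-rule syntax; the function name and the "indef"/"def" labels are hypothetical, and mutation of the adjective after the noun is not handled.

```python
def reorder_det_adj_noun(det_type, det, adj, noun):
    """Reorder an English det + adj + noun phrase into the Welsh order.

    det_type is a hypothetical label: "indef" or "def".
    det is the already-translated Welsh determiner (ignored when indefinite,
    because the rule outputs no determiner in that case).
    """
    if det_type == "indef":
        # Indefinite: output noun + adjective, no determiner.
        return [noun, adj]
    if det_type == "def":
        # Definite: output determiner + noun + adjective.
        return [det, noun, adj]
    raise ValueError("unknown determiner type: %r" % det_type)


print(reorder_det_adj_noun("indef", None, "coch", "cath"))
print(reorder_det_adj_noun("def", "y", "coch", "cath"))
```

The sketch only moves words around; picking the right mutated form of the adjective would be a separate step.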
|||
:''Note: Comments should not include '=' as it confuses the Wiki templating system (as I just found out myself)'' |
|||
:''Note 2: Suggestions for part-of-speech disambiguation should go [[Welsh to English#Tagger|here]].'' |
|||
{{comment|::OK, I'll try, but I'm not entirely sure of the distinction. some of the stuff at the end of that page, for instance, is covered here. - [[User:Donnek|Donnek]]}} |
|||
:''Note 3: Comments should not include the '|' symbol either, at least within double quotes, since it too confuses the wiki.''
|||
==Notes for areas to be covered== |
|||
A sort of scratchpad / todo list, based on things that come up when putting phrases into the testing webform. |
||
===Conjunctive genitive=== |
|||
; gwallt yr eneth - *hair the girl - the hair of the girl - the girl's hair |
; ceann capaill - *head of-horse (gen) - the head of a horse - a horse's head |
||
====Welsh to English==== |
|||
A couple of things have come up: |
|||
Using a NOUN1 + DET.DEF + NOUN2, we get "the daughter of the doctor", but: |
|||
===="was"==== |
|||
:Ond mae fy nhad, mam a '''thad fy ngŵr''' wedi talu trethi ac erioed wedi defnyddio'r gwasanaeth iechyd. → But My father is, mother <s>that</s> and '''my father man''' after pay taxes and never after use the service health. |
|||
"roedd" ([he/she/it] was) is unknown, but I seem to remember adding entries for "to be" to the dixes in the mists of time. Was I dreaming? (roedd <- yr + oedd) |
|||
If we use NOUN1 + DET + NOUN2, we get "the daughter of the doctor", and: |
|||
{{comment| |
|||
::There are entries for 'bod', but 'roedd' doesn't get processed as all of the 'bod' entries start with 'b' (see [http://www.nopaste.com/p/aZI5GWX7C this link]). I will need to fix this in the analyser. If I understand you correctly, 'roedd' is a contraction of 'yr' (determiner ...) + 'oedd' (verb 'bod', past tense ...)? [[User:Francis Tyers|Francis Tyers]] |
|||
:Ond mae fy nhad, mam a '''thad fy ngŵr''' wedi talu trethi ac erioed wedi defnyddio'r gwasanaeth iechyd. → But My father is, mother <s>that</s> and '''my father of my man''' after pay taxes and never after use the service health. |
|||
}} |
|||
Is NOUN1 + DET.POS + NOUN2 reliably going to be DET.POS + NOUN2 + 's + NOUN1 (or DET.DEF + NOUN1 + of + DET.POS + NOUN2)? |
|||
{{comment|:Yes. - [[User:Donnek|Donnek]]}} |
|||
Actually, this is trickier, I guess: |
|||
;'''fy nhad, mam a thad fy ngŵr''' → my father, mother and my husband's father |
|||
We'd need to consider "DET NOUN COMMA NOUN CNJCOO NOUN DET NOUN" or at least "DET NOUN COMMA NOUN" → "my father and mother" + "NOUN DET NOUN" → "my husband's father", which would give: |
|||
;'''my father and mother and my husband's father''' |
|||
Which, although awkward, isn't as horrific as the current translation. I'd prefer to make small chunks where possible, even if (for now) it makes a worse translation, as when we are able to collapse NP CNJCOO NP → NP, in an intermediate stage, it will be cleaner.
|||
{{comment|:I think that reads well, actually. I realise there is a tension between doing something that will work for common examples, but is too much of a hack for less common examples, and doing something more comprehensive but which will take longer to finalise. All other things being equal, I would tend towards the former for 0.1, but only you can say what is the best way forward, since you know how Apertium is put together. You've probably struck a decent balance here. - [[User:Donnek|Donnek]]}} |
|||
{{comment| |
{{comment| |
||
::Ok, the output is now: |
|||
::Some serious errors have crept in to those entries. I've sent an amended version to you by email. You're right - roedd -> yr + oedd, but in the amended version I've sent, I've put (e.g.) "roedd" and "oedd" as alternate forms, because "Roedd" is the spoken form, and even in written Welsh you hardly ever see "Yr oedd" nowadays. [[User:Donnek|Donnek]] |
|||
}} |
|||
::;But my father, mother and the father of my man is after pay taxes and never after use the service health. |
|||
::Which is reasonably ok... The way I have done it is: |
|||
::In level-1, made three chunks: |
|||
::* SN(my father , mother) |
|||
::* CC(and) |
|||
::* SN(the father of my man) |
|||
::Then at level-2, I detect: |
|||
::* SV SN CC SN and transform to: SN CC SN SV. |
|||
::It doesn't seem to have caused any regressions, but might warrant further testing. - [[User:Francis Tyers|Francis Tyers]] |
|||
}} |
|||
{{comment|:::This seems to work pretty well for different phrases. The only things that break it are: |
|||
;the boy was in the garden -> *y bachgen bu yn yr ardd - bu'r bachgen yn yr ardd |
|||
::::insertion of a comma after the second noun: "mae fy nhad, mam, a thad fy ngŵr - My father, mother is, and the father of my man" |
|||
::::replacing the second noun by a conjunctive genitive: "mae fy nhad, mam fy ngŵr a thad fy ngŵr - My father, mother is my man and the father of my man" |
|||
:::I think this is probably good enough for this stage. - [[User:Donnek|Donnek]]}} |
|||
Almost correct, except for word-order, and the fact that the preterite is being used instead of the imperfect ("roedd y bachgen yn yr ardd"). The preterite needs to be marked as only being used in written Welsh, and to have a lower likelihood than the imperfect. This is too rough a rule, but would do for the time being. |
|||
===Marking and word-order=== |
|||
The above brings up a useful point about this. If the standard VSO sequence is changed to SVO (ie unchanged from the English standard), this is a marked pattern, conveying a relative clause. In written Welsh, the verb will be preceded by "a" + soft mutation, but in spoken Welsh the "a" usually disappears. |
{{comment|:::Yes. "a" - relative "who, which" in a relative clause where the subject is the same as that of the main clause, and "a" - interrogative pre-verbal particle (eg a weles ti hwnna? - did you see that?). Both are followed by soft mutation. Note that interrogative "a" is usually omitted in speech, leaving only the mutation. - [[User:Donnek|Donnek]]}} |
||
==="yn" as stative=== |
|||
For Welsh pattern "yn + adj" |
output English "noun-phrase" |
||
This is a bit complex. There are two "yn"s in Welsh: "yn" showing state or condition, or extension in time (yn hapus - happy; yn mynd - going), and "yn" the preposition showing location in a specific place (yn y tŷ - in the house (contrast: mewn tŷ - in a house); yn Nolgellau - in Dolgellau). (They are probably related historically.) The stative "yn" soft-mutates nouns and adjectives, but not verbs; the location "yn" nasal-mutates (and changes to "ym" to match an initial "m" in the following noun, eg ym Mangor - in Bangor). |
||
So - as it stands, the above will clash with 1.3.10 (change prep+noun to prep+det.indef+noun), even though "yn" the preposition will never occur before a non-specific noun (it must have specificity), and even though the above is not actually "yn" the preposition (it's "yn" the stative). We can't use the stative soft-mutation to decide, because (a) that doesn't apply to some consonant initials, and (b) other prepositions cause mutation too, and it would be overkill to check for each one. So the easiest thing is to adjust 1.3.10 to exclude "yn" as one of the prepositions that will be caught. Is it easy? I don't know :-) |
||
{{comment| |
|||
::I've added the yn "stative" to the analyser as well as the yn "preposition", but until we retrain the tagger it will not pick the former. If you could think of any rules that will choose the right one in a given context it would help (for ideas on the kinds of restrictions to these rules, see [[Tagger training#Writing a TSX file|here]] and [[TSX format|here]]). - [[User:Francis Tyers|Francis Tyers]] |
|||
}} |
|||
{{comment| |
|||
====Preferential choice between noun and verbform==== |
|||
:::The simplest would be: |
|||
:::: Welsh word "yn" is a preposition |
|||
::::: when it is followed by "det.def" or by a capitalised word |
|||
:::::: otherwise it is a stative |
|||
:::That may not be perfect, but it is good enough. I'll bear in mind the tagger pages, but it may take a while to get to that stage. - [[User:Donnek|Donnek]] |
|||
}} |
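A rough sketch of the "simplest" rule suggested above, in Python rather than in the tagger's own TSX format. The function name, the return labels, and the idea of looking only at the single following surface token are assumptions made for illustration.

```python
# Surface forms of the definite article, as listed elsewhere on this page.
DEFINITE_ARTICLES = {"y", "yr", "'r"}


def classify_yn(next_token):
    """Guess whether "yn" is the preposition or the stative particle.

    Rule from the discussion: preposition when followed by det.def or by a
    capitalised word, otherwise stative. Not perfect, but good enough for now.
    """
    if next_token in DEFINITE_ARTICLES:
        return "preposition"
    if next_token[:1].isupper():
        return "preposition"
    return "stative"


print(classify_yn("y"))          # yn y tŷ
print(classify_yn("Nolgellau"))  # yn Nolgellau
print(classify_yn("hapus"))      # yn hapus
```

This deliberately ignores mutation, for the reasons given above.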
|||
===Insert det.indef before non-definite noun=== |
|||
; atebodd hi'r cwestiwn -> *answered shethe #hold an inquiry - she answered the question |
|||
; daeth Taid â lamp -> Grandfather came with lamp (preferably "with a lamp") |
|||
proc selects 'cwestiwn' (question) - correct - and 1p pl imperative of 'cwestio' (an infrequent verb for 'hold an inquiry'). The 1p pl present would also have been a possibility, and indeed a more likely one. tagger selects the second of these. |
|||
; dychwelodd y rheolwr gyda gŵr tew -> the manager returned with fat man (preferably "with a fat man") |
|||
For Welsh pattern "prep + noun" |
|||
Not sure how widespread this would be, but the tagger should give precedence to the noun choice whenever the verb form is preceded by 'y': |
|||
output English pattern "prep + det.indef + noun" |
|||
{{comment| |
|||
For Welsh pattern "[y | yr | 'r] + [noun | verb]" |
|||
this should probably be working now, - [[User:Francis Tyers|Francis Tyers]] |
|||
output "[y | yr | 'r] + [noun]" |
|||
}} |
|||
I am tempted to retire this in favour of a broader rule: |
|||
This is not perfect, because "y | yr" can also be an indirect relative clause pronoun before a verb, but it would catch most things until we can resolve the latter point. |
|||
For Welsh pattern "non-specific noun.sg" |
|||
output English "a + non-specific noun.sg" |
|||
"Non-specific" here means a noun that is not qualified by det.def, pr.poss, etc. |
|||
; gwelodd y dyn y llyfr -> *the man saw the books - the man saw the book |
|||
; daeth car ar hyd y ffordd -> *car came along the road - a car ... |
|||
This is similar, but is tricksy because it is superficially correct apart from the plural. But in fact, tagger is reading "llyfr" as pres 3p sing of "llyfru" (to book). Apart from being infrequent, and therefore much less likely to appear ("bwcio" would be the usual word), Eurfa has "llyfra" as the pres 3p sing, so there may be a paradigm problem too. The above rule would throw out the verb in the meantime. |
|||
; mae'r athro yn licio chwarae gêm o golff -> *the teacher is liking play game he golf - the teacher likes to play a game of golf |
|||
1.3.16 would deal with "play". For "gêm o golff" we would need to prevent "*a game of a golf" (which the existing rule would in fact have produced for "o golff"). Perhaps: |
|||
For Welsh pattern "non-specific noun + o + non-specific noun" |
|||
output English pattern "a + noun + of + noun" |
|||
This would need to fire before the revised rule above, or we need some other way of sorting out the possible doubling of "a" (a a game of golf) - LOL - let it happen and then have a rule: |
|||
For English pattern "det.indef + det.indef" |
|||
output "det.indef" |
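The clean-up rule above could look something like this sketch, which works on plain English surface tokens. That is an assumption: in Apertium itself it would be written against tagged lexical units, and the names here are hypothetical.

```python
# English indefinite determiners, matched case-insensitively.
INDEF = {"a", "an"}


def collapse_double_indef(tokens):
    """Collapse det.indef + det.indef into a single det.indef."""
    out = []
    for tok in tokens:
        # Skip an indefinite determiner that directly follows another one.
        if out and tok.lower() in INDEF and out[-1].lower() in INDEF:
            continue
        out.append(tok)
    return out


print(" ".join(collapse_double_indef("a a game of golf".split())))
```

The first determiner is the one that is kept.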
|||
; gwelodd y bachgen gath yn yr ardd -> the boy saw cat in the garden (preferably "a cat") |
|||
{{comment| |
|||
::It is currently using the aberth/u__vblex paradigm (see output [http://www.nopaste.com/p/aVI2yKOdqb here]). Is this incorrect? - [[User:Francis Tyers|Francis Tyers]]}} |
|||
===Preferential choice between verbforms=== |
|||
; bydd y lamp yn rhoi golau -> *are the lamp giving light - the lamp will be giving light (and presumably we could massage this into "the lamp will give light" later, since that would be the more natural English equivalent) |
|||
A couple of things here. The most important is that tagger chooses the less frequent imperative out of the imperative/future choice for the verb. Presumably this then means that the subject shift can't take place. But even with the imperative choice, the imperative 2p sing info gets lost between interchunk and postchunk, and replaced with a generic? present which gets output as "are". Odd. |
|||
(I'm assuming that "bydd" would get output as "will be", since that would be the correct English tense.) |
|||
{{comment| |
{{comment| |
||
::I fixed this (crudely) by commenting out the imperative for "be" 2pSg (You are!). When I train the tagger next I'll see if I can take care of it there. - [[User:Francis Tyers|Francis Tyers]]}}
|||
:::The problem is that "aberthu", apart from the 'regular' "abertha" also has a written "aberth". So yes, it probably is incorrect. The problem is that a lot of less common verbs are very rarely inflected. It might have been better to use something like "gwenu" or "siomi". In the meantime, perhaps just changing "aberth" to "abertha" in the pres 3p sing will do. - [[User:Donnek|Donnek]]}} |
|||
Another example: |
|||
====Number agreement of verb==== |
|||
; roedd y bechgyn yn gallu croesi dan y ffordd -> *the boys were #be<vbmod><ger># able to #vblex><vblex><pres> under the road - the boys were able to cross under the road |
|||
{{comment|:I added 'rabbits' to the dictionary, but the problem of unknown words and phrase movement is one we're experiencing in Basque too... - [[User:Francis Tyers|Francis Tyers]]}} |
|||
Here, proc decides to ignore "croesi" (cy-en.dix line 6203) as an infinitive in favour of conjugating it as present 2p sing. The infinitive option doesn't even show! Can we add some such rule as: |
|||
{{comment|::OK - so it's basically an issue that you can't do much about until the word is logged. Hmm. I suppose that makes sense, since Apertium can't figure out what to do with something until it knows what it should do with it ... In a practical sense, this is going to be problematic if we demo Apertium using unseen text. Is there any way of doing some blind choosing, eg |
|||
When you have homographous Welsh verb options <inf> and conjugated_verb |
|||
:::: if this word is |
|||
choose <inf> unless verb is followed by pr |
|||
::::: preceded by [y,yr,'r] |
|||
:::::: we will assume it's a noun |
|||
::::: preceded by yn |
|||
:::::: we will assume it's a verb |
|||
:::::: unless a verb has been identified in the current phrase |
|||
::::::: in which case we'll assume it's an adjective |
|||
This isn't very good, because you could have the conjugated form without a pronoun, but it might deal with this to some extent.
|||
::This might break Apertium - I don't know. In theory, though, we might be able to get relative probabilities for particular sequences from a corpus. - [[User:Donnek|Donnek]]}}
|||
The example above also has some funkiness going on with "gallu" - "were be able" needs to be transformed into "were able". However, I don't know enough about how Apertium treats modal verbs to make a suggestion. |
|||
{{comment|I'd be reluctant to add one as we'd not be able to get the translation, on the other hand, it wouldn't cause messing up of word order. It's an open problem, and we're thinking about it :) - [[User:Francis Tyers|Francis Tyers]]}} |
|||
{{comment|::This is now giving: |
|||
:::the boys were being able to cross under the road |
|||
::Which is an improvement. On the other hand, "^be<vbser><past><p3><pl>$ ^be able to<vbmod><ger>$" is probably redundant :) -- Not sure how to deal with this. - [[User:Francis Tyers|Francis Tyers]] |
|||
}} |
|||
{{comment|:::What about a purely surface hack, eg: |
|||
====Prepositional noun phrase should not be a subject==== |
|||
:::: For English pattern "was/were + being + able" |
|||
:::: output "was/were + able" |
|||
:::Or a mapping to convert "be able" to "can/could"? - [[User:Donnek|Donnek]] |
|||
}} |
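The surface hack suggested above might be sketched like this. It is plain Python over English output tokens, offered only as an illustration; the function name is hypothetical and nothing here reflects how Apertium treats modal verbs internally.

```python
def drop_redundant_being(tokens):
    """Rewrite "was/were + being + able" as "was/were + able"."""
    out = []
    i = 0
    while i < len(tokens):
        if tokens[i] in ("was", "were") and tokens[i + 1:i + 3] == ["being", "able"]:
            out.append(tokens[i])
            i += 2  # skip "being"; "able" is handled on the next pass
        else:
            out.append(tokens[i])
            i += 1
    return out


sentence = "the boys were being able to cross under the road".split()
print(" ".join(drop_redundant_being(sentence)))
```

Other uses of "being" are left alone, since the pattern only fires when "able" follows.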
|||
===Comparative adjectives with "less/more"=== |
|||
; cerddodd fo i'r dref -> he walked in the town |
|||
Fine, except that the preposition "i" should really be glossed as "to" ("yn y dref" would be "in the town") |
|||
; tyfodd y twnnel yn llai llachar -> *the tunnel grew small bright - the tunnel grew less bright |
|||
Contrast: |
|||
; tyfodd y twnnel yn fwy tywyll -> *the tunnel grew big dark - the tunnel grew more dark |
||
For Welsh pattern "fwy/llai + adj" |
||
output English "more/less + adj" |
|||
For English pattern "more/less + adj" |
|||
and therefore the "det.def + noun" section shouldn't be shifted. (I can't think of any exceptions to this, but there may be one.) |
|||
output Welsh "fwy/llai + adj" |
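As a sketch of the Welsh to English direction of this rule: the key point is that "fwy/llai" should only become "more/less" when an adjective follows, since on their own they mean "bigger/smaller". The tag names ("adj", "adv") and the tuple representation below are assumptions for illustration, not the real tagset.

```python
# Welsh comparative markers and their English glosses when an adjective follows.
CY_TO_EN = {"fwy": "more", "llai": "less"}


def gloss_comparative(tagged):
    """Replace fwy/llai + adj with more/less + adj in (word, tag) pairs."""
    out = []
    for i, (word, tag) in enumerate(tagged):
        next_is_adj = i + 1 < len(tagged) and tagged[i + 1][1] == "adj"
        if word in CY_TO_EN and next_is_adj:
            out.append((CY_TO_EN[word], "adv"))
        else:
            out.append((word, tag))
    return out


print(gloss_comparative([("llai", "adj"), ("llachar", "adj")]))
print(gloss_comparative([("fwy", "adj")]))
```

The second call is left unchanged, so the synthetic comparative can still be handled by its own dictionary entry.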
|||
{{comment| |
{{comment| |
||
::I'll see if I can copy in a rule from Spanish--English for this :) - [[User:Francis Tyers|Francis Tyers]]
|||
:: There was a rule to do this, I've commented it out, I think there was a reason for it, but I can't recall now. I've run the regression tests below and it doesn't seem to have broken anything. Regarding the preposition, should I change "i" to be "to" instead of "in" ? - [[User:Francis Tyers|Francis Tyers]]}} |
|||
}} |
|||
===Synthetic comparative adjectives=== |
|||
{{comment|:::Re "i", yes, change it to "to". - [[User:Donnek|Donnek]]}} |
|||
Many of these seem to have faulty dictionary entries: |
|||
{{comment|::::The problem here was that the dictionary only had i'r → yn+yr... I've added i'r → i+yr and now it is picking the right one, although I don't know what will happen for other contexts... - [[User:Francis Tyers|Francis Tyers]]}}
|||
; tyfodd y twnnel yn fwy -> the tunnel grew big (should be "bigger") |
|||
; tyfodd y twnnel yn llai -> the tunnel grew small (should be "smaller") |
|||
; tyfodd y twnnel yn hirach -> the tunnel grew long (should be "longer") |
|||
; tyfodd y twnnel yn uwch -> the tunnel grew high (should be "higher") |
|||
{{comment| |
|||
{{comment|:::::Not sure where that would have come from. The only vaguely relevant thing I can think of is "i mewn i" (into). - [[User:Donnek|Donnek]]}} |
|||
::Dictionary error in the bidix, now fixed. - [[User:Francis Tyers|Francis Tyers]] |
|||
}} |
|||
{{comment|:::Cool! - [[User:Donnek|Donnek]] |
|||
}} |
|||
Note, also a rule needs to be written for: |
|||
; allan i'r cyfarfod -> *the meeting #exit<vblex><pres><p3> in - out to the meeting |
|||
This is similar - "in" should be "to", and should be kept with "the meeting". |
|||
; the expensive house → y tŷ drud |
|||
However, there is another issue here, which is in effect the same as "Preferential choice between noun and verbform" above. In this case, the verb "allanu" (to exit) is being chosen instead of the much more likely "allan" (out). |
|||
; the more expensive house → y tŷ drutach |
|||
; the most expensive house → y tŷ drutaf |
|||
{{comment|:You mean for the consonant change t->d? - [[User:Donnek|Donnek]]}} |
|||
; roedd o ar dy lyfr -> *was of on your books - it was on your book |
|||
1.3.9 would deal with "of", and 1.3.6 would deal with "books". Subject shift would then produce a reasonable translation. |
|||
However: |
|||
; roedd ar dy lyfr -> *your #be<vbser><past><p3> on books - (it) was on your book |
|||
Omitting the subject pronoun can happen quite frequently in speech if the subject has already been mentioned. The <sg> tag gets lost at interchunk, which means the verb can't be conjugated (this came up somewhere else, but I think it's been taken off the page - maybe it would be better just to mark the issue heading as "addressed" rather than delete it). But there is an additional issue, in that the possessive pronoun is getting treated as the subject and moved separately. So maybe we need a broader rule to say that "prep + det.def/pr.poss/whatever + noun" is an indivisible chunk, and must be dealt with as a block. No part of it would be moved in this case anyway. |
|||
{{comment| |
|||
:: Regarding page cleanup, OK. Perhaps having a separate section, and then moving sections down would be a good idea. - [[User:Francis Tyers|Francis Tyers]]}}
|||
===Subordinate ("reported speech") clauses with "bod" + noun=== |
|||
It would also be nice in the longer term to fill in the pronoun if it is omitted. |
|||
For Welsh pattern "verb + non-subject noun phrase" |
|||
output English "verb + pronoun agreeing in number and person + non-subject noun phrase" |
|||
The NSNP could be a prepositional phrase (marked by an initial preposition), or an object phrase (marked with initial soft mutation). |
|||
Also referring to the cool sentence, we have two sentences as follows: |
|||
===="-ing" as "yn + verb"==== |
|||
;(1) roedd y Comisiwn yn ymchwilio i'r honiadau - the Commission was investigating the allegations |
|||
;(2) mae yr AS wedi methu datgan £103,000 o roddion - the MP has failed to declare £103,000 of gifts |
|||
Subordinate clauses, like the relative clauses, will be difficult. But a first stab at this might be as follows: |
|||
For English pattern "subject + verb<vbser> + verb + ing" |
|||
output for Welsh "verb<vbser> + subject + yn + verb" |
|||
For Welsh pattern "[b/f]od + [det.def] + noun + [qualifiers] + wedi + verb" |
|||
output English "that + [det.def] + noun + [qualifiers] + has/have (number agreeing with noun) + verb_past_participle" |
|||
; clywodd y dyn bod y trên wedi cyrraedd yn hwyr -> *the man heard be the train after arrive #late |
|||
{{comment| |
|||
====Inflected verbs not being parsed==== |
|||
::This would be: |
|||
:::VBSER(INF) + DEFINITE_NP + wedi + VERB |
|||
; aeth -> *aeth - (he/she/it) went |
|||
:::THAT + DEFINITE_NP + HAS + VERBPP |
|||
However, "aeth" is listed in cy.dix.xml (line 27491) as past 3p sing in the mynd_vblex paradigm, which is what "mynd" (to go) gets conjugated against (line 54444). |
|||
::Where DEFINITE_NP is any noun phrase preceded by the definite article? - [[User:Francis Tyers|Francis Tyers]] |
|||
Ah - a bug in the segmentation. |
|||
}} |
|||
; *myndaeth fo -> he went |
|||
; he went -> *myndaeth fe |
|||
The infinitive is getting added to the irregular forms, instead of being replaced by them. |
|||
{{comment| |
{{comment| |
||
:::Actually, thinking about this again, it doesn't have to be definite - it just so happened that in those sentences it was. You could have something like: |
|||
::Yep, this is a problem in the paradigm for 'mynd', I'll need to rewrite it, fortunately it is only used once... New paradigm output [http://www.nopaste.com/p/alinIcSFcb here] - [[User:Francis Tyers|Francis Tyers]]}} |
|||
::::; clywodd ysbïwr bod y trên .... -> a spy heard that the train .... |
|||
:::So the NP could be "[det.def, rhyw (some), pr.poss] + [adj - eg hen] + noun + [qualifiers - adjectives, demonstratives, etc]", or it could just be "pr.subj" (clywodd fo bod y trên ...). The same applies to the "am" construction below. Another point is that the VBSER can be soft-mutated - "fod" instead of "bod". - [[User:Donnek|Donnek]]
|||
}} |
|||
{{comment| |
|||
{{comment|:::Fine, but the imperative forms also need "mynd" excised. - [[User:Donnek|Donnek]]}} |
|||
::This rule is broadly working for now. At least it is inserting the 'that', a form of 'have' and changing the verb to a pp. It is not however robust, and seems to me a bit hacky. Could you give some more examples so I can fine tune it? - [[User:Francis Tyers|Francis Tyers]] |
|||
}} |
|||
{{comment| |
|||
{{comment|::::Done. - [[User:Francis Tyers|Francis Tyers]]}} |
|||
:::Hacky? Surely not .... I did say that relative and subordinate clauses will be difficult, so we may have to refactor as we go along. An alternative to the above (which would also cover the adjective example below) would be: |
|||
:::: For Welsh pattern "[b/f]od + NP + complement" |
|||
:::: output English "that + NP + is + complement" |
|||
::: This would give: |
|||
:::; clywodd y dyn bod y trên wedi cyrraedd yn hwyr -> *the man heard that the train is after arrive #late |
|||
:::You would then have further rules to transform "is + after + verb" to "has + verbpp", and "is + for + verb" to "will + verb". (Irish and Gaidhlig have a similar construction, by the way, using "ar, air" instead of "wedi", so whatever rule bundle you use here would be transferable to that branch of Celtic too.) |
|||
:::There is also another similar construction using "ar" in place of "wedi" and "am" - this one means "about to": |
|||
====Insert det.indef in prepositional NP==== |
|||
:::; clywodd y dyn bod y trên ar gyrraedd yn hwyr -> *the man heard be the train on arrive late - the man heard that the train was about to arrive late |
|||
:::So an additional rule "is + on + verb -> was + about to + verb".
|||
:::Oh, there's another one too, with "newydd", meaning "just now": |
|||
|||
:::; clywodd y dyn bod y trên newydd gyrraedd yn hwyr -> *the man heard be the new train arrive late - the man heard that the train had just arrived late |
|||
|||
:::This one has been caught by the adjective rule, but "newydd" belongs to the VP, not the NP in this case, so we'd need some prioritisation. |
|||
:::Other examples: |
|||
|||
:::; roedd y bachgen yn dweud bod y tŷ wedi mynd ar werth -> the boy was saying that the house has gone on value |
|||
:::(fine, except that "ar werth" means "on sale" - see 1.3.19 below.)
|||
:::; dywedodd hi bod y trên yn hwyr -> *she said be the train late - she said that the train is late. |
|||
:::You could deal with this one by adding a similar rule: |
|||
::::VBSER(INF) + NP + ADJ |
|||
::::THAT + NP + VBSER + ADJ |
|||
:::; dwi'n meddwl bod y glaw wedi stopio -> *dwithinking that the rain has stopped - I think that .... |
|||
:::(We need to get the present tense of "bod" sorted out too) |
|||
:::; dywedodd yr eneth bod y siop am agor ar amser -> the girl said that the shop will open on #time |
|||
:::(I'm noticing the lack of adverbs, eg "tomorrow", "today", "afterwards", etc. I suppose the remaining bits of Eurfa need importing at some point.) |
|||
:::[[Donnek|Donnek]] |
|||
}} |
|||
The above rule would give "the man heard that the train has arrived late" - not perfect, since in English we would use pluperfect rather than perfect here, but a lot better. |
|||
We can extend this to another construction: |
|||
For Welsh pattern "[b/f]od + [det.def] + noun + [qualifiers] + am + verb" |
|||
output English "that + [det.def] + noun + [qualifiers] + will + verb" |
|||
; clywodd y dyn bod y trên am gyrraedd yn hwyr -> *the man heard be the train for arrive #late |
|||
The above rule would give "the man heard that the train will arrive late" - not perfect, since in English we would use conditional rather than future here. |
|||
{{comment|

::This now seems to be working. - [[User:Francis Tyers|Francis Tyers]]

}}
||
These could be improved if it were possible to refer back to the verb of the main clause. Thus where it is past, the subordinate would use pluperfect or conditional; where it is non-past, the subordinate would use perfect or future. |
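The tense choice described here can be written down as a small lookup. This is only a sketch: the function name and the tense labels are hypothetical, and it assumes the main-clause tense has already been reduced to "past" versus "non-past" by some earlier step.

```python
# (particle in the subordinate clause, main clause is past?) -> English tense
TENSE_TABLE = {
    ("wedi", True): "pluperfect",   # heard that the train had arrived
    ("wedi", False): "perfect",     # hears that the train has arrived
    ("am", True): "conditional",    # heard that the train would arrive
    ("am", False): "future",        # hears that the train will arrive
}


def subordinate_tense(particle, main_is_past):
    """Pick the English tense for a "bod ... wedi/am + verb" clause."""
    return TENSE_TABLE[(particle, main_is_past)]


print(subordinate_tense("wedi", True))
print(subordinate_tense("am", False))
```

The open question of when to set and unset the main-clause tense is left outside the sketch.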
|||
|||
{{comment| |
|||
|||
::We can probably set this as a variable, but what would be the triggers to set/unset the variable? - [[User:Francis Tyers|Francis Tyers]] |
|||
}} |
|||
{{comment|:::End of the clause? Full stop or comma, perhaps? - [[User:Donnek|Donnek]] |
|||
|||
}} |
|||
|||
There are other varieties of subordinate clause that I give other suggestions about. |
|||
Incidentally, in the above the det.def should be taken to include other prenominal qualifiers like possessives. |
|||
|||
===Verbal nouns / Infinitives=== |
|||
|||
;roedd y dyn yn gwerthu pethau rhad -> the man was selling cheap things |
|||
|||
;roedd fo yn palu -> he was digging |
|||
|||
Both of these are fine. |
|||
Making the verbal noun/infinitive the subject doesn't work quite so well: |
|||
|||
;roedd gwerthu pethau rhad yn hawdd -> *was sell cheap things easy - selling cheap things was easy |
|||
|||
;roedd palu yn waith caled -> *was dig in a hard work - digging was hard work |
|||
The latter would benefit from the extension of the "yn as stative" rule to nouns, as suggested in 1.3.4 above. But we also need to define the VN as a subject, so that it can be shifted. This is not easy, because the rule may cause problems with other constructions later. But we can take a stab at it. |
|||
First, we can use the infinitival form in English - "to sell cheap things was easy" and "to dig was hard work" are equivalent to the above sentences. |
|||
====Synthetic comparative adjectives==== |
|||
For Welsh pattern "verb<vblex><inf>" |
|||
Many of these seem to have faulty dictionary entries: |
|||
output English "to + verb" |
|||
; tyfodd y twnnel yn fwy -> the tunnel grew big (should be "bigger") |
|||
; tyfodd y twnnel yn llai -> the tunnel grew small (should be "smaller") |
|||
For English pattern "to + verb" |
|||
; tyfodd y twnnel yn hirach -> the tunnel grew long (should be "longer") |
|||
output Welsh "verb<vblex><inf>" |
|||
; tyfodd y twnnel yn uwch -> the tunnel grew high (should be "higher") |
|||
This allows the same rule to be used in sentences like: |
|||
;ceisiodd y dyn agor y bocs -> *the man sought open the box |
|||
which should produce "the man tried to open the box". (Can we delete the "seek" entry for "ceisio" until we have refined choices between different entries? The "try" entry is more frequent.) |
|||
{{comment|"seek" entry commented out and replaced with "try". [[User:Francis Tyers|Francis Tyers]]}} |
|||
Second, we can assume that a verbal noun phrase will occur after an inflected verb (mostly forms of "bod"). So we might try expanding the above to say: |
|||
For Welsh pattern "verb_inflected + verb<vblex><inf> + [noun phrase]" |
|||
output English "to + verb + [noun phrase] + verb_inflected" |
|||
for the first two sentences ("roedd gwerthu pethau rhad" and "roedd palu"), and: |
|||
For Welsh pattern "verb_inflected + subject + verb<vblex><inf> + noun phrase" |
|||
output English "subject + verb_inflected + to + verb + noun phrase" |
|||
for the third ("ceisiodd y dyn agor y bocs"). |
|||
This is not perfect, and I am not sure how it would cut across the existing rule for subject shift. |
|||
There are also interesting issues with nesting of infinitival subject phrases: |
|||
; roedd gwerthu pethau rhad yn hawdd yn neis -> *was sell cheap things easy nice - to sell cheap things easily was nice |
|||
; roedd gwerthu pethau rhad yn hawdd yn beth neis -> *was sell cheap things easy in a nice thing - to sell cheap things easily was a nice thing |
|||
where I'm not sure how you specify the boundaries of the noun phrase. Any views, or is that too complex for the present iteration? |
|||
{{comment| |
{{comment| |
||
::At the moment to define noun phrases we just define fixed length patterns of tags which are matched in left-to-right, longest-match way. So for example for Welsh to English: |
|||
::Dictionary error in the bidix, now fixed. - [[User:Francis Tyers|Francis Tyers]] |
|||
:::NOUN → SN_(NOUN) |
|||
:::PRNSUBJ → SN_(PRNSUBJ) |
|||
:::DET → SN_(DET) |
|||
:::DET NOUN → SN_(DET NOUN) |
|||
:::NOUN1 NOUN2 → SN_(NOUN2-NOUN1) |
|||
:::NOUN ADJ → SN_(ADJ NOUN) |
|||
:::yn ADJ → SN_(ADJ) |
|||
:::DET NOUN ADJ → SN_(DET ADJ NOUN) |
|||
:::NOUN ADJ1 ADJ2 → SN_(ADJ2 ADJ1 NOUN) |
|||
::The first is the pattern we detect in Welsh, and the second is the "chunk" that we output in English. Any suggestions on defining more of these (the most frequently occurring), or changing them would be appreciated. - [[User:Francis Tyers|Francis Tyers]] |
|||
}} |
}} |
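The left-to-right, longest-match chunking described in the comment above can be sketched roughly as follows. This is only an illustration in Python, not Apertium's transfer code: the `PATTERNS` table and the `chunk` function are hypothetical names, the tokens are assumed to be already glossed into English, and only some of the listed patterns are included.

```python
# Hypothetical pattern table: Welsh tag sequence -> order in which the
# matched positions are emitted in the English chunk.
PATTERNS = {
    ("NOUN",): (0,),
    ("DET",): (0,),
    ("DET", "NOUN"): (0, 1),
    ("NOUN", "ADJ"): (1, 0),              # NOUN ADJ -> ADJ NOUN
    ("DET", "NOUN", "ADJ"): (0, 2, 1),    # DET NOUN ADJ -> DET ADJ NOUN
    ("NOUN", "ADJ", "ADJ"): (2, 1, 0),    # NOUN ADJ1 ADJ2 -> ADJ2 ADJ1 NOUN
}
MAX_LEN = max(len(pattern) for pattern in PATTERNS)


def chunk(tokens):
    """tokens is a list of (word, tag) pairs; returns a list of reordered chunks."""
    chunks = []
    i = 0
    while i < len(tokens):
        # Try the longest window first, then shrink it.
        for n in range(min(MAX_LEN, len(tokens) - i), 0, -1):
            tags = tuple(tag for _, tag in tokens[i:i + n])
            if tags in PATTERNS:
                chunks.append([tokens[i + k][0] for k in PATTERNS[tags]])
                i += n
                break
        else:
            # No pattern starts here: pass the word through on its own.
            chunks.append([tokens[i][0]])
            i += 1
    return chunks
```

Because the longest window is tried first, "DET NOUN ADJ" is preferred over "DET NOUN" followed by a stray adjective.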
||
Note also in the above that we have the adverb problem from 1.3.4. |
|||
{{comment|:::Cool! - [[User:Donnek|Donnek]] |
|||
===Non-compositional multiword phrases=== |
|||
This section is for phrases that have to be scoped as a whole, rather than broken down to their constituent parts. Note: Will probably be marked as "adv" if nothing is given. |
|||
; <s>ar werth - on sale</s> (adv) |
|||
; <s>ar ôl y cwbl - after all</s> (adv) |
|||
; <s>erbyn hyn - by now</s> (adv) |
|||
; <s>doed a ddelo - come what may</s> (adv) |
|||
; <s>hyd yn oed - even</s> (adv) |
|||
; <s>pan fo angen - as needed</s> (adv) |
|||
; <s>wrth gwrs - of course</s> (adv) |
|||
; <s>yn ôl i - back to</s> (adv) |
|||
; <s>yn weddill i - remaining to</s> (adv) |
|||
===Superlative adjective + "oll"=== |
|||
; y rhai lleiaf oll -> *the some smallest *oll - the smallest ones of all |
|||
; yn gyntaf oll -> *first *oll - first of all |
|||
For Welsh pattern "adj.super + oll" |
|||
output English "adj.super + of all" |
|||
We need to add "oll" (all) to the dictionary, but this would still be a necessary rule. |
|||
==="rhai"=== |
|||
; y rhai bach -> *the some small - the small ones |
|||
; rhai mawr -> *some big - big ones |
|||
"rhai" (some) can be considered the plural of "un" (one). |
|||
For Welsh pattern "rhai + adj" |
|||
output English "adj + ones" |
|||
This also applies to phrases like "y rhai lleiaf oll" above. We need to convert "oll" first on the basis that it follows an adj, and then we need to convert "rhai" on the basis that it precedes an adj. |
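A rough sketch of why the order of the two conversions matters, written in Python purely for illustration. The token layout (Welsh lemma, tag, English gloss), the tag names and the function names are all assumptions made for this example, not anything taken from the Apertium rule files.

```python
def rewrite_oll(tokens):
    """Gloss "oll" as "of all" when it directly follows a superlative adjective."""
    out = list(tokens)
    for i in range(1, len(out)):
        lemma, _tag, _gloss = out[i]
        if lemma == "oll" and out[i - 1][1] == "adj.super":
            out[i] = (lemma, "adv", "of all")
    return out


def rewrite_rhai(tokens):
    """Turn "rhai + adj" into "adj + ones"."""
    out = []
    i = 0
    while i < len(tokens):
        is_rhai = tokens[i][0] == "rhai"
        if is_rhai and i + 1 < len(tokens) and tokens[i + 1][1].startswith("adj"):
            out.append(tokens[i + 1])
            out.append(("rhai", "n", "ones"))
            i += 2
        else:
            out.append(tokens[i])
            i += 1
    return out


def translate(tokens):
    # "oll" is handled first, while the adjective is still directly before it.
    return " ".join(gloss for _, _, gloss in rewrite_rhai(rewrite_oll(tokens)))
```

If `rewrite_rhai` ran first, "ones" would be sitting between the adjective and "oll", so the "adj.super + oll" pattern would no longer match.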
|||
===Lexis=== |
|||
* cwm → cirque (replace with: cwm → valley) |
|||
{{comment|::Yes, the Eurfa entry is tagged (geography) so this makes sense. But "cym" would be wrong - cwm (valley), cymau or cymoedd (valleys) - [[User:Donnek|Donnek]] |
|||
}} |
}} |
||
{{comment|:::Oops, typo :) - [[User:Francis Tyers|Francis Tyers]]}} |
|||
* Some verb lemmas are being generated incorrectly, (see [http://www.nopaste.com/p/ajmI46If0 here]) |
|||
==== Verb + preposition==== |
|||
==="cael" - get / have=== |
|||
Re "coolness factor" below (woop woop!), we need to cater for verbs such as "ymchwilio" which are followed by a preposition that is different from English, or where there is no preposition in English. |
|||
"cael" is in the dix as "get", because in most cases that is the better gloss: |
|||
For example: |
|||
;ymchwilio i - research into, investigate |
|||
;siarad am - talk about |
|||
;dweud wrth - say to, tell |
|||
;gofyn am - ask for |
|||
; mi aeth o i'r banc i gael pres allan -> *mi he went in the bank I get money *allan - he went to the bank to get money out |
|||
Is there any way to get the verb+prep phrase parsed as a phrase, rather than separately? Perhaps an entry in one of the dictionaries? This would only need to be done for those phrases where the preposition differs in English and Welsh. |
|||
Apertium's version shows some regressions (we can omit the "mi" - see comment to 1.3.27, and "allan", along with the other Eurfa adverbs, has yet to be added to the dix). First, "i" should not be glossed as "in", but "to" (see comment to 1.3.7). Second, the shortened form of pr.sub.1p.sing ("i") will *never* occur without a preceding conjugated verb, so that option has to be banned in this context. |
|||
{{comment|::Regarding the shortened prn.subj, yep, this is the biggest regression we've had and I'm currently trying to fix it. - [[User:Francis Tyers|Francis Tyers]] 14:14, 1 July 2008 (UTC)}} |
|||
Not, for instance for: |
|||
;neidio dros - jump over |
|||
; mae o'n cael mynd ar ôl y cwbl -> *he is getting go after the all - he's getting to go after all |
|||
;cerdded i - walk to |
|||
Pretty creditable attempt, but could be improved if 1.3.16 was applied (if infin is not preceded by "yn", translate as "to"+ infin). |
|||
;delio gyda - deal with |
|||
where there is a regular correlation between the meanings of the Welsh and English prepositions. |
|||
; mi gafodd yr athro radd dda o Brifysgol Bangor -> **mi the teacher got good degree of Bangor-University - the teacher got a good degree from the University of Bangor |
|||
"mi" again (1.3.27) and inserted "a" (1.3.10) would improve this, but choosing "of" or "from" for "o" will be difficult. For "Bangor-University" see 1.3.31 below. |
|||
Note that "cael" in this sentence focusses on the "getting" - focussing on possession uses a different construction: |
|||
; mae gan yr athro radd dda o Brifysgol Bangor -> *is with the teacher good degree of Bangor-University - the teacher has a good degree from the University of Bangor |
|||
"cael" is also used to create a passive construction: |
|||
; mi gafodd y dyn ei daro gan gar -> **mi the man got his strike with car - the man was struck by a car |
|||
Apertium's version is pretty close to the literal - "he got his striking by a car" - which is close to the English "he got himself struck by a car" |
|||
However, in the coolness sentence this gloss is inappropriate: |
|||
; ei bod hi'n cael perthynas â dyn llawer hŷn -> *that she is getting relation with a much older man - that she is having a relationship with a much older man |
|||
Can we try the following rule: |
|||
For Welsh pattern "cael + NP + {â, gyda}" |
|||
output English "have + NP + with" |
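As a minimal sketch of the proposed rule, assuming the lemmas following "cael" are available as a list: the function name `gloss_cael` and the fixed look-ahead `window` (a crude stand-in for "the end of the NP") are hypothetical, and the passive use of "cael" is not handled here.

```python
def gloss_cael(following_lemmas, window=6):
    """Pick "have" when â/gyda turns up shortly after "cael", otherwise "get"."""
    companions = {"â", "gyda"}
    if any(lemma in companions for lemma in following_lemmas[:window]):
        return "have"
    return "get"
```

For the coolness sentence, "cael perthynas â dyn llawer hŷn" would then come out with "have", while "i gael pres allan" keeps the default "get".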
|||
==="i" + infin=== |
|||
After most time conjunctions (eg |
|||
* cyn - before, |
|||
* ar ôl - after, |
|||
* ers - since, |
|||
* nes - until, |
|||
* erbyn - by the time that, |
|||
* wedi - after, |
|||
* wrth - while), |
|||
* rhag ofn (in case), |
|||
* er mwyn (in order to), |
|||
the conjunctive clause uses "i" + SM_infin to express person (with "i" being conjugated as appropriate), with tense being dependent on the first part of the sentence. |
|||
{{comment|::In terms of conjunctions, we currently have: |
|||
<l>a<s n "cnjcoo"/></l><r>and<s n "cnjcoo"/></r> |
|||
<l>ond<s n "cnjcoo"/></l><r>but<s n "cnjcoo"/></r> |
|||
<l>neu<s n "cnjcoo"/></l><r>or<s n "cnjcoo"/></r> |
|||
::I'll add these, although some of them already exist as prepositions (which will be fun for the tagger to distinguish between). Should I just add them all to the bidix and Welsh monodix as given above? - [[User:Francis Tyers|Francis Tyers]]}} |
|||
{{comment|:::I tend to think of them just as prepositions with "i" after them, ie a special form of prepositional phrase. But you're right, it would probably be best to tag them as conjunctions. You could get the tagger to distinguish them from prepositions by specifying that the conjunctive version will always be followed by "i" (or a conjugated form of it). - [[User:Donnek|Donnek]]}} |
|||
{{comment| |
{{comment| |
||
::Ok, I've added some others I found in my grammar too... probably some of them are variants (north/south) maybe? e.g. it has "gan" for "since". A quick other question: |
|||
:: Yes, these are multiword constructions, like for example "He became accustomed to the taste." → "cynefinodd Fe i y blas." (try it in the testing interface). Is there a way of getting a list of these? (Actually there are many I currently need to fix in the bidix/English dict, but if you have a list I can look at them.) At the moment we only seem to have multiword verbs on the English side. - [[User:Francis Tyers|Francis Tyers]] |
|||
::Until I get the tagger up to speed, will it make much difference if I identify these in the transfer rules based only on lemma as opposed to on lemma+POS ? e.g. this means that if the tagger accidentally tags "wrth" as a preposition, it could still apply it in the transfer rules below. - [[User:Francis Tyers|Francis Tyers]] |
|||
}} |
}} |
||
{{comment|:::Yes, there are others I haven't got around to yet. "gan" (because, since) and "am" (because, since) are two of them, but they use a "bod" construction similar to 1.3.26, so that's why they're not in this section. |
|||
:::Re the tagger question, I think that should cause no problems, because the key point is that all of these are followed with "i" - that's why I said above that I tend to think of them as "preposition + i construction". - [[User:Donnek|Donnek]]}} |
|||
(Note in the examples below that "i'r" is still coming up as "in the", which needs to be addressed.) |
|||
{{comment|:::I will try to compile a list of the most common, and send it to you tomorrow. - [[User:Donnek|Donnek]]}} |
|||
{{comment|::Done. This was confusing me for a while, but turns out it was down as "yn+yr" not "i+yr"! - [[User:Francis Tyers|Francis Tyers]]}} |
|||
====Subordinate ("reported speech") clauses with "bod" + noun==== |
|||
{{comment|:::The little rotter. Hopefully it has been squashed now. - [[User:Donnek|Donnek]]}} |
|||
Also referring to the cool sentence, we have two sentences as follows: |
|||
;(1) roedd y Comisiwn yn ymchwilio i'r honiadau - the Commission was investigating the allegations |
|||
;(2) mae yr AS wedi methu datgan £103,000 o roddion - the MP has failed to declare £103,000 of gifts |
|||
; aeth y bachgen wedi i'r bws ddod - *the boy went after in the bus come - the boy went after the bus came |
|||
Subordinate clauses, like the relative clauses, will be difficult. But a first stab at this might be as follows: |
|||
; aeth y bachgen wedi iddo fo ddod -> *the boy went after to him come - the boy went after he/it came |
|||
; bydd o wedi gadael erbyn i'r llythyr gyrraedd -> *that he has left by in the letter arrive - he will have left by the time the letter arrives |
|||
For Welsh pattern "[b/f]od + [det.def] + noun + [qualifiers] + wedi + verb" |
|||
; bydd o wedi gadael erbyn iddo fo gyrraedd -> *that he has left by to him arrive - he will have left by the time he/it arrives |
|||
output English "that + [det.def] + noun + [qualifiers] + has/have (number agreeing with noun) + verb_past_participle" |
|||
; wrth i'r heddlu ddod i mewn, rhedodd y dyn i mewn i'r stryd -> *beside in the police come I in, the man ran intothe road - as the police came in, the man ran into the street |
|||
; clywodd y dyn bod y trên wedi cyrraedd yn hwyr -> *the man heard be the train after arrive #late |
|||
; wrth iddyn nhw ddod i mewn, rhedodd y dyn i mewn i'r stryd -> *beside to them come I in, the man ran intothe road - as they came in, the man ran into the street |
|||
For Welsh pattern "cyn + {"i" + NP, "i".conj + [pr.subj]} + SM_verb.infin" |
|||
The above rule would give "the man heard that the train has arrived late" - not perfect, since in English we would use pluperfect rather than perfect here, but a lot better. |
|||
output English "before + {NP, pr.subj} + verb.present/verb.past" (see below) |
|||
{{comment|::Done. - [[User:Francis Tyers|Francis Tyers]]}} |
|||
We can extend this to another construction: |
|||
For Welsh pattern "[b/f]od + [det.def] + noun + [qualifiers] + am + verb" |
|||
output English "that + [det.def] + noun + [qualifiers] + will + verb" |
|||
For Welsh pattern "wedi + {"i" + NP, "i".conj + [pr.subj]} + SM_verb.infin" |
|||
; clywodd y dyn bod y trên am gyrraedd yn hwyr -> *the man heard be the train for arrive #late |
|||
output English "after + {NP, pr.subj} + verb.present/verb.past" |
|||
{{comment|::Done. - [[User:Francis Tyers|Francis Tyers]]}} |
|||
The above rule would give "the man heard that the train will arrive late" - not perfect, since in English we would use conditional rather than future here. |
|||
For Welsh pattern "ar ôl + {"i" + NP, "i".conj + [pr.subj]} + SM_verb.infin" |
|||
These could be improved if it were possible to refer back to the verb of the main clause. Thus where it is past, the subordinate would use pluperfect or conditional; where it is non-past, the subordinate would use perfect or future. |
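The idea of referring back to the main verb could look something like the sketch below. It is an assumption-laden illustration rather than a transfer rule: `subordinate_aux`, the `AUX` table and the boolean `main_is_past` are made-up names, and the sketch only picks the English auxiliary for the "bod ... wedi/am + verb" patterns discussed above.

```python
# (marker in the Welsh clause, is the main verb past?) -> English auxiliary
AUX = {
    ("wedi", True): "had",     # pluperfect after a past main verb
    ("wedi", False): "has",    # perfect after a non-past main verb
    ("am", True): "would",     # conditional after a past main verb
    ("am", False): "will",     # future after a non-past main verb
}


def subordinate_aux(marker, main_is_past, plural=False):
    """Return the auxiliary for the subordinate clause."""
    aux = AUX[(marker, main_is_past)]
    if aux == "has" and plural:
        aux = "have"  # number agreement with the subordinate subject
    return aux
```

With this, "clywodd y dyn bod y trên wedi cyrraedd yn hwyr" would get "had" ("the man heard that the train had arrived late"), and the "am" version would get "would".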
|||
output English "after + {NP, pr.subj} + verb.present/verb.past" |
|||
For Welsh pattern "nes + {"i" + NP, "i".conj + [pr.subj]} + SM_verb.infin" |
|||
There are other varieties of subordinate clause that I give other suggestions about. |
|||
output English "until + {NP, pr.subj} + verb.present/verb.past" |
|||
For Welsh pattern "erbyn + {"i" + NP, "i".conj + [pr.subj]} + SM_verb.infin" |
|||
Incidentally, in the above the det.def should be taken to include other prenominal qualifiers like possessives. |
|||
output English "by the time that + {NP, pr.subj} + verb.present/verb.past" |
|||
For Welsh pattern "wrth + {"i" + NP, "i".conj + [pr.subj]} + SM_verb.infin" |
|||
==Regression tests== |
|||
output English "as + {NP, pr.subj} + verb.present/verb.past" |
|||
{{comment|::Done. - [[User:Francis Tyers|Francis Tyers]]}} |
|||
;Treatment of 'is' in present tense. |
|||
For Welsh pattern "rhag ofn + {"i" + NP, "i".conj + [pr.subj]} + SM_verb.infin" |
|||
* ''The boy is in the garden.'' → mae y bachgen yn yr ardd. (note: yr → 'r is an open bug) |
|||
output English "in case + {NP, pr.subj} + verb.present/verb.past" |
|||
* ''mae'r bachgen yn yr ardd.'' → the boy is in the garden. |
|||
For Welsh pattern "er mwyn + {"i" + NP, "i".conj + [pr.subj]} + SM_verb.infin" |
|||
output English "in order for + {NP, pr.subj} + verb.present/verb.past" |
|||
For Welsh pattern "ers + {"i" + NP, "i".conj + [pr.subj]} + SM_verb" |
|||
{{comment|::These are both correct (apart from the 'r), but I thought "regressions" were when you fix something and in the process break something else? Re 'r: |
|||
output English "since + {NP, pr.subj} + verb.past" |
|||
Note: the use of present or past tense depends on the main verb (except with "ers", when you really only use it in the past). If Apertium can only choose one, then use the present as default. But it would be nice to be able to switch based on the main verb - if this is past, use the past tense (preterite) in English; if it is non-past, use the present tense in English (see comments to 1.3.15). |
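The tense choice described in this note can be written down as a small decision function. This is just a sketch under assumptions: the tense labels in `PAST_TENSES` and the function name `clause_tense` are invented for the example, and `main_tense` is assumed to be whatever the system remembers about the main verb (or `None` if it cannot tell).

```python
# Hypothetical tense labels counted as "past" for the main verb.
PAST_TENSES = {"past", "imperfect", "pluperfect"}


def clause_tense(conjunction, main_tense=None):
    """Choose the English tense for the verb in an "i" + infin clause."""
    if conjunction == "ers":
        return "past"      # "ers" is really only used in the past
    if main_tense is None:
        return "present"   # default when the main verb is unknown
    return "past" if main_tense in PAST_TENSES else "present"
```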
|||
In Welsh pattern "aeiouwy + space + y[r]" |
|||
output "aeiouwy + 'r" |
|||
{{comment|::I'll try setting a variable which tracks the tense of the last conjugated verb (e.g. no inf, ger, etc.) - [[User:Francis Tyers|Francis Tyers]]}} |
|||
[[User:Donnek|Donnek]] |
|||
}} |
|||
{{comment|:::That would be great, because it makes quite a difference in terms of how the translation reads to an English speaker. - [[User:Donnek|Donnek]]}} |
|||
{{comment|:::Yep, so these should be 'regression tests' :) -- |
|||
{{comment|::::It's done, but could I get examples for each of the rules above so I can test them? - [[User:Francis Tyers|Francis Tyers]]}} |
|||
:::Yep, I know the pattern, the problem is that the post-generator insists on having a ~ before anything that it deals with -- This would mean that we have to have '~' before every vowel, which would be quite difficult. There is another possibility though, if we can't fix that, which would be to just use a plain transliterator to replace: |
|||
{{comment|:::::I'll do full versions for "cyn": |
|||
::::"aeiouwy + space + ~yr + space" with "aeiouwy'r + space" |
|||
:::::;roedd y siop wedi cau cyn i'r bachgen orffen ei ginio -> *The shop was after close before the boy completed his dinner - the shop had closed before the boy finished his dinner |
|||
:::::;roedd y siop wedi cau cyn iddo orffen ei ginio -> *The shop was after close before him completed his dinner - the shop had closed before he finished his dinner |
|||
:::::Great - the "after close" is because we don't have a rule yet for periphrastic tenses. |
|||
:::::;caeodd y siop cyn i'r bachgen orffen ei ginio -> The shop closed before the boy completed his dinner - the shop closed before the boy finished his dinner |
|||
::: Can you think of anything this might catch by accident? or is it a fairly safe search/replace? - [[User:Francis Tyers|Francis Tyers]] |
|||
:::::;caeodd y siop cyn iddo orffen ei ginio -> The shop closed before him completed his dinner - the shop closed before he finished his dinner |
|||
:::::Great. |
|||
:::::;mae'r siop yn cau cyn i'r bachgen orffen ei ginio -> *The shop is closing before the boy complete his dinner - the shop closes before the boy finishes his dinner |
|||
:::::;mae'r siop yn cau cyn iddo orffen ei ginio -> *The shop is closing before him complete his dinner - the shop closes before he finishes his dinner |
|||
:::::Fine, apart from the fact that the pr number is not being carried across to the verb. |
|||
:::::;bydd y siop yn cau cyn i'r bachgen orffen ei ginio -> *Are the shop closing before the boy #complete<vblex><imp> his dinner |
|||
:::::;bydd y siop yn cau cyn iddo orffen ei ginio -> *Are the shop closing before him #complete<vblex><imp> his dinner |
|||
:::::Not so good. The main problem is that the imperative is being chosen instead of the future, and I suspect this then mucks up your tense choice. |
|||
}} |
}} |
||
{{comment|:::Is there any way to disambiguate between imperative / future here? - [[User:Francis Tyers|Francis Tyers]]}} |
|||
{{comment| |
|||
{{comment|::::I would go with this in the meantime - I think it's pretty safe. Note that your rule can act on both 'y' and 'yr'. The system is: |
|||
:::::Note that with the pronoun versions the pr.obj is being used instead of the pr.subj. |
|||
:::::consonant + space + y + space + consonant |
|||
:::::consonant + space + yr + space + vowel |
|||
:::::vowel + 'r + space + consonant-or-vowel |
|||
:::::Similar sentences for "wedi" (after): |
|||
:::::[[User:Donnek|Donnek]] |
|||
:::::;roedd y siop wedi cau wedi i'r bachgen orffen ei ginio |
|||
}} |
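The y / yr / 'r system and the plain search-and-replace suggested above might be sketched like this. It is an illustration only, not the post-generator: `article_form` and `contract_article` are hypothetical names, and the vowel set is just the "aeiouwy" quoted in the pattern (no accented vowels, no special cases).

```python
import re

VOWELS = "aeiouwy"  # the vowel set quoted in the pattern above


def article_form(prev_word, next_word):
    """Pick the form of the definite article from its neighbours."""
    if prev_word and prev_word[-1].lower() in VOWELS:
        return "'r"   # vowel + 'r, whatever follows
    if next_word and next_word[0].lower() in VOWELS:
        return "yr"   # consonant + yr + vowel
    return "y"        # consonant + y + consonant


def contract_article(text):
    """Replace "vowel + space + y/yr + space" with "vowel'r + space"."""
    return re.sub(rf"([{VOWELS}]) yr? ", r"\1'r ", text, flags=re.IGNORECASE)
```

The replace only fires when "y" or "yr" stands as a separate word after a vowel, so "yn" and word-internal "y" are left alone.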
|||
:::::;roedd y siop wedi cau wedi iddo orffen ei ginio |
|||
:::::;caeodd y siop wedi i'r bachgen orffen ei ginio |
|||
;No subject shift with imperative |
|||
:::::;caeodd y siop wedi iddo orffen ei ginio |
|||
:::::;mae'r siop yn cau wedi i'r bachgen orffen ei ginio |
|||
* ''gwasgwch y botwm!'' → squeeze the button! |
|||
:::::;mae'r siop yn cau wedi iddo orffen ei ginio |
|||
* ''squeeze the button!'' → gwasgu y botwm! (note: infinitive for imperative is an open bug) |
|||
:::::;bydd y siop yn cau wedi i'r bachgen orffen ei ginio |
|||
;"yn" as stative |
|||
:::::;bydd y siop yn cau wedi iddo orffen ei ginio |
|||
:::::Similar sentences for "wrth" (as, while): |
|||
* ''yn falch'' → proud |
|||
:::::;roedd y siop wedi cau wrth i'r bachgen orffen ei ginio |
|||
* ''yn hapus'' → happy |
|||
:::::;roedd y siop wedi cau wrth iddo orffen ei ginio |
|||
* ''tyfodd fo yn fawr'' → he grew big |
|||
:::::;caeodd y siop wrth i'r bachgen orffen ei ginio |
|||
;Subject shift for pronouns |
|||
:::::;caeodd y siop wrth iddo orffen ei ginio |
|||
:::::;mae'r siop yn cau wrth i'r bachgen orffen ei ginio |
|||
* ''roedden nhw'n hapus'' → they were happy |
|||
:::::;mae'r siop yn cau wrth iddo orffen ei ginio |
|||
:::::;bydd y siop yn cau wrth i'r bachgen orffen ei ginio |
|||
;Number agreement of verb |
|||
:::::;bydd y siop yn cau wrth iddo orffen ei ginio |
|||
:::::- [[User:Donnek|Donnek]]}} |
|||
*''roedd y bechgyn yn hapus'' → the boys were happy |
|||
*''roedd y cwningod yn hapus'' → the rabbits were happy |
|||
*''gwelodd y dyn y llyfr'' → the man saw the book |
|||
===Conjunctive genitive with proper names=== |
|||
;"yn" as "-ing" |
|||
The sentence from 1.3.29 above: |
|||
*''yn mynd'' → going |
|||
; mi gafodd yr athro radd dda o Brifysgol Bangor -> **mi the teacher got good degree of Bangor-University - the teacher got a good degree from the University of Bangor |
|||
*''yn gweld'' → seeing |
|||
suggests the need to deal separately with proper nouns; a useful shorthand would be "any words not coming at the beginning of a sentence which are capitalised". |
|||
The "Bangor-University" is, I believe, an earlier rule intended to deal with compound nouns (we get "*woods-things" for "pethau pren" - wooden things, for instance), and I wonder whether it might be commented out for the present? |
|||
;Conjugation of 'mynd' |
|||
{{comment|Yep, actually I haven't yet implemented the rule at the very top (in fact the first rule), as I'm still considering how to do it most effectively. I will make this a priority. - [[User:Francis Tyers|Francis Tyers]]}} |
|||
*''aeth fo'' → he went |
|||
{{comment|:::Ah, I see. Yes, it would be nice to get that sorted, because it's probably the most distinctive Welsh construction vis-a-vis English. - [[User:Donnek|Donnek]]}} |
|||
;Comparative adjectives |
|||
{{comment|::::For more discussion, see [[Talk:Welsh_to_English#Welsh_to_English|1.3.1.1]]. - [[User:Francis Tyers|Francis Tyers]]}} |
|||
* ''tyfodd y twnnel yn fwy'' -> the tunnel grew bigger |
|||
* ''tyfodd y twnnel yn llai'' -> the tunnel grew smaller |
|||
* ''tyfodd y twnnel yn hirach'' -> the tunnel grew longer |
|||
* ''tyfodd y twnnel yn uwch'' -> the tunnel grew higher |
|||
In this and similar instances it would be preferable for variations of the normal genitive rule (1.3.1) to apply (a capitalised noun is by definition definite), and we can therefore suggest an additional rule: |
|||
;Insertion of indefinite determiner |
|||
For a Welsh phrase of the type "!det.def + noun1 + noun_capitalised" |
|||
* ''daeth Taid â lamp'' → Grandfather came with a lamp |
|||
output English "det.def + noun1 + of + noun_capitalised" |
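A small sketch of the rule above, operating on already-glossed tokens. Everything here is an assumption made for illustration: the `(word, tag)` token layout, the tag names "n" and "det.def", and the function name `apply_genitive`. Capitalisation is used as the test for the second noun, as in the rule as stated.

```python
def apply_genitive(tokens):
    """tokens is a list of (english_word, tag) pairs; returns the output string."""
    out = []
    i = 0
    while i < len(tokens):
        word, tag = tokens[i]
        has_next = i + 1 < len(tokens)
        after_det = i > 0 and tokens[i - 1][1] == "det.def"
        if (tag == "n" and not after_det and has_next
                and tokens[i + 1][1] == "n"
                and tokens[i + 1][0][:1].isupper()):
            # noun1 + capitalised noun -> the + noun1 + of + capitalised noun
            out += ["the", word, "of", tokens[i + 1][0]]
            i += 2
        else:
            out.append(word)
            i += 1
    return " ".join(out)
```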
|||
* ''dychwelodd y rheolwr gyda gŵr tew'' → the manager returned with a fat man |
|||
{{comment|Should we make this noun_capitalised or "proper name"? (We mark proper names separately in the dictionary and separate them into "anthroponyms", "cognomens" and "toponyms".) - [[User:Francis Tyers|Francis Tyers]]}} |
|||
{{comment|:::Ah, excellent. Would this offer a way of dealing with items like "Red Branch" below? Any of these three categories would use the rule above, but capitalised words that are not marked as one of the above would instead get the det.def added before the capitalised_noun. - [[User:Donnek|Donnek]]}} |
|||
The following is another example: |
|||
; Flwyddyn neu ddwy ar ôl priodas Cwchwlin, paratodd Bricriw Dafod Ffals wledd yn ei gastell a gwahoddodd y Brenin Conor ac arwyr y Gangen Goch. |
|||
This is very lightly edited actual text from p55 of "Cwchwlin" by Ivor Owen, a retelling of the Irish Cúchulain legend. Apertium's version of this is: |
|||
;Year or two after marriage *Cwchwlin, prepared *Bricriw *Dafod *Ffals feast in his castle that invited the King *Conor and the heroes Red Branch. |
|||
This is actually pretty good. |
|||
The rule above would deal with the following: |
|||
; ar ôl priodas Cwchwlin -> *after marriage Cwchwlin - after the marriage of Cwchwlin |
|||
{{comment|This is kind of done, if we choose a "known" name, we get: "Year or two after the marriage of Elin." - [[User:Francis Tyers|Francis Tyers]]}} |
|||
{{comment|:Well done.- [[User:Donnek|Donnek]]}} |
|||
The other example: |
|||
; ac arwyr y Gangen Goch -> *and the heroes Red Branch - and the heroes of the Red Branch |
|||
should already be caught under the second rule under 1.3.1 above, but isn't. Unfortunately, the rule in this section would give the suboptimal "and the heroes of Red Branch". It may be possible to differentiate between different types of proper name (ie those which have an English gloss and those which don't), but in the meantime, on balance, it makes sense to implement the rule in this section. |
|||
The extended rule in 1.3.10 would deal with "(a) year" and "(a) feast". |
|||
There is the unsolved issue with word-sequence for unknown words (in my view, some pragmatic response to this has to be found, because it is unlikely that Apertium will ever be dealing with a text in which it knows every word). The worst infelicity in the above is that the tagger has selected "a" as the relative particle "who, which, that" instead of the much more likely "and" (which is the correct one in this case). This should actually have been ruled out in this case because relative-a is followed by SM, while conjunctive-a is not. |
|||
{{comment|Yep, this is a tagger error and I'm working to resolve it. Unfortunately the forbid/enforce rules don't appear to be applying themselves :/ - [[User:Francis Tyers|Francis Tyers]]}} |
|||
===Preverbal particles - negative=== |
|||
As noted in 1.3.27, preverbal particles need to be addressed. Note that this and the following sections apply mainly to formal, written Welsh. |
|||
; ond ni threuliodd Cwchwlin ond ychydig o'r misoedd a oedd yn weddill iddo fo yn Nhŷ'r Bechgyn |
|||
; ->*but We spent *Cwchwlin but #a little<n><sg> of the months that was in remnant to him in the House of the Boys (p14) |
|||
; but Cwchwlin spent only a few of the months remaining to him in the Boys' House |
|||
For Welsh pattern "ni + MM_verb.inflected + subject" |
|||
output English "subject + auxiliary + not + verb" |
|||
Without the auxiliary, this would produce archaic English ("he spent not"), which would be acceptable at a pinch, but Apertium may have standard rules for producing English negatives where a simple affirmative is replaced by a periphrastic negative (I go - I don't go; I went - I didn't go; etc). |
|||
MM is the "mixed mutation" - use AM where possible, and SM everywhere else: |
|||
* c -> ch |
|||
* p -> ph |
|||
* t -> th |
|||
* g -> --- |
|||
* b -> f |
|||
* d -> dd |
|||
* ll -> l |
|||
* m -> f |
|||
* rh -> r |
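The mutation table above can be turned into a few lines of Python, which may be handy for generating test forms. This is a hedged sketch: `mixed_mutate` and the two tables are names invented here, the guard for initial digraphs such as "ch" or "dd" is an added assumption so that they are not treated as "c" or "d", and capitalisation of the first letter is not preserved.

```python
# Radical -> mutated initial, digraphs listed before single letters.
MIXED_MUTATION = [
    ("ll", "l"), ("rh", "r"),
    ("c", "ch"), ("p", "ph"), ("t", "th"),   # aspirate mutation where possible
    ("g", ""), ("b", "f"), ("d", "dd"), ("m", "f"),  # soft mutation elsewhere
]
# Initial digraphs that are separate letters and must not be mutated.
UNMUTABLE_DIGRAPHS = ("ch", "dd", "ff", "ng", "ph", "th")


def mixed_mutate(word):
    """Apply the mixed mutation to the start of a word (e.g. after "ni")."""
    lower = word.lower()
    if lower.startswith(UNMUTABLE_DIGRAPHS):
        return word
    for radical, mutated in MIXED_MUTATION:
        if lower.startswith(radical):
            return mutated + word[len(radical):]
    return word
```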
|||
; ni phrynodd yr athro'r papur -> *We the teacher bought the paper - the teacher did not buy the paper |
|||
"ni" becomes "nid" before a vowel, but not before a vowel "exposed" by the soft mutation of a "g": |
|||
; nid aeth i'r dref -> *Came went to the town - he did not go to the town |
|||
; ni welodd y bachgen y gath -> *We the boy saw the cat - the boy did not see the cat |
|||
; ni orffennodd fo'r dasg -> *We completed the task was - he did not complete the task |
|||
Note here that we seem to have a regression where the tagger is choosing an unlikely inflected form of "bod" instead of the pr.3p.sing.m. This may be related to the 1.3.29 issue with "i". |
|||
For Welsh pattern "verb_inflected + {fo, o, fe, e}" |
|||
output English "he + verb_inflected" |
|||
Can we also try a more general rule based on negation, namely: |
|||
For Welsh pattern "previous_negation + ond" |
|||
output "only" |
|||
That may not be possible, of course, in which case we'll have to try something else. |
|||
The "ni" particle is typical of formal or written Welsh. Spoken or informal Welsh tends to use "ddim" after the verb. But we can leave that till later. |
|||
{{comment|::I've added the particle, "ni" and "nid", written a couple of disambiguation rules and done a basic rule which just puts "not" after the verb. This doesn't work in most cases, but does seem to work nicely with "to be" — "nid yw fe yn dref" → "he is not in town". This ruleset is likely to be extremely fragile. - [[User:Francis Tyers|Francis Tyers]]}} |
|||
==="to be" in conditional subjunctive=== |
|||
;Roedd e'n meddwl mai gwaed anifail a welodd. → *He was thinking animal blood might be and saw. |
|||
:[Was] [he] [thinking] [might be] [blood animal] [and] [saw] |
|||
:Analysis: <code>^bod<vbser><pii><p3><sg>$ ^prpers<prn><subj><p3><m><sg>+yn<pr>$ ^meddwl<vblex><inf>$ ^bod<vbser><cns><p3><sg>$ ^gwaed<n><m><sg>$ ^anifail<n><m><sg>$ ^a<cnjcoo>$ ^gweld<vblex><past><p3><sg>$</code> |
|||
This could be something like: |
|||
:He was thinking '''that''' it '''might be''' animal blood that he saw. |
|||
Suggestions welcome. |
|||
==Regression tests== |
|||
{{main|Welsh to English/Regression tests}} |
|||
==Coolness factor== |
||
:''Disclaimer: We're not deliberately aiming the translator at crime texts, it just seems to work best with these — a subject for investigation perhaps?'' |
|||
:Roedd y Comisiwn yn ymchwilio i'r honiadau bod yr AS wedi methu datgan £103,000 o roddion. |
||
:''the Commission Was investigating the allegations that the MP has failed to declare £103,000 of gifts.'' |
||
:<span style="color: grey">"He was the Commission crookedly ymchwiliad I ' group claims be he drives ACE has failed declare he gifts."</span> (InterTran) |
|||
:Dywedodd yr heddlu fod y troseddau honedig wedi digwydd rhwng 2003 a 2007 yn Sir Benfro a Sir Gaerfyrddin. |
|||
:''the police Said that the alleged crimes have happened between 2003 and 2007 in Pembrokeshire and Carmarthenshire.'' |
|||
:<span style="color: grey">"He said he drives police force be the transgressions alleged has happened between 2003 I go 2007 crookedly Shire ble I go Shire Gaerfyrddin."</span> (InterTran) |
|||
:Mae'r heddlu hefyd yn ymchwilio i honiadau ei bod hi'n cael perthynas â dyn llawer hŷn. |
|||
:''the police Are also investigating his allegations be she getting relation with a much older man.'' |
|||
:<span style="color: grey">"He ' is being group police force also crookedly ymchwiliad I claims you go be she ' heartburn have relation he goes tight much hn & #375."</span> (InterTran) |
|||
:Mae Ymddiriedolaeth Caerdydd a'r Fro yn gwrthod dweud faint mae'r driniaeth yn costio, ond yn ôl papur newydd y Sun mae'r driniaeth cyffuriau yn costio £2,500 y mis. |
|||
:Cardiff Trust and the Region Is refusing say size the treatment is costing , but according to the newspaper Sun the treatment is drugs costing £2,500 the month. |
|||
:<span style="color: grey">"He is being Trust Cardiff I ' go group Land refusing say as many he ' is being group treatment costing , except according to paper news the Sun he ' is being group treatment ingredients costing the month."</span> (InterTran) |
|||
:Ond byddwn ni'n parhau i weithio gyda'n staff, gwirfoddolwyr, cystadleuwyr, cwsmeriaid, partneriaid, a chyrff eraill i wella ein gwasanaeth Cymraeg. |
|||
:But we will be continuing to work within a staff, volunteers, competitors, customers, partners, and other bodies to improve our Welsh service. |
|||
:<span style="color: grey">"Except we will be we ' heartburn last I work with ' heartburn staff , volunteers , competitors , customers partneriaid , I go bodies other I improve our service Welsh."</span> (InterTran) |
Latest revision as of 18:14, 10 July 2010
- Section numbers for existing sections from this version — each section / topic should probably be re-numbered to remove reliance on automatic numbering
- Talk:Welsh to English/Archive 1 (? → 12:54, 16 July 2008 (UTC))
- Note: Comments should not include '=' as it confuses the Wiki templating system (as I just found out myself)
- Note 2: Suggestions for part-of-speech disambiguation should go here.
- OK, I'll try, but I'm not entirely sure of the distinction. some of the stuff at the end of that page, for instance, is covered here. - Donnek
- Note 3: Comments should not include the '|' symbol either, at least within double quotes, since it too confuses the wiki.
Notes for areas to be covered
A sort of scratchpad / todo list, based on things that come up when putting phrases into the testing webform.
Conjunctive genitive
- gwallt yr eneth - *hair the girl - the hair of the girl - the girl's hair
- llaw y bachgen - *hand the boy - the hand of the boy - the boy's hand
Note that the noun phrase in English is definite - contrast "merch y meddyg" (the doctor's daughter) and "merch meddyg" (a doctor's daughter).
For an English phrase of the type "def + noun1 + of + def + noun2" or of the type "def + noun2 + 's + noun1" convert in Welsh to "noun1 + def + noun2".
- Here can noun1 be a simple noun, or can it be a noun phrase? For example "the red cat of the young boy" - Francis Tyers
- e.g.
- For the pattern det.def + noun1 + of + det.def + noun2:
- Output noun1 + det.def + noun2
- Yes, as long as you like, eg,
- cath goch bachgen bach merch ifanc bert rheolwr y banc mawr du
- the red cat of the little boy of the pretty young daughter of the manager of the big black bank
- It's only the last NP of the sequence that gets the def.det. Donnek
- Ok, so this requires a three level rule.
- t1x -> t2x SN_(the cat red) of_(of) SN_(the boy little) of_(of) SN_(the daughter young pretty) of_(of) SN_(the manager) of_(of) SN_(the bank big black)
- t2x -> t3x SN_(the cat red) SN_(the boy little) SN_(the daughter young pretty) SN_(the manager) SN_(the bank big black)
- t3x -> gen (cat red boy little daughter young pretty manager the bank big black)
- What I'll do for now is get the chunks working ('SN' -- noun phrase, and 'of'), for values of 'noun', 'det noun', 'det adj noun', 'det adj adj noun', 'det adj adj adj noun', etc. Then look at taking care of more frequent cases (e.g. the first example). Francis Tyers
For a Welsh phrase of the type "!det + noun1 + def + noun2" convert in English to "def + noun1 + of + def + noun2" or to "def + noun2 + 's + noun1".
The second noun is probably historically a genitive, but it has lost all case markers. The equivalent in Irish would be:
- ceann an chapaill - *head the of-horse (gen) - the head of the horse - the horse's head
- ceann capaill - *head of-horse (gen) - the head of a horse - a horse's head
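For the Welsh-to-English direction, here is a rough sketch in Python rather than in Apertium's transfer format. It assumes the input is already tagged and glossed as (english_gloss, tag) pairs; the tag names `n` and `det.def` and the function name `genitive_to_english` are made up for illustration. It only handles the simple NOUN1 + DET.DEF + NOUN2 case, not the longer chains discussed above.

```python
def genitive_to_english(tokens):
    """Rewrite NOUN1 + DET.DEF + NOUN2 as 'the NOUN1 of the NOUN2'.

    tokens: list of (english_gloss, tag) pairs; the tags are hypothetical.
    """
    out = []
    i = 0
    while i < len(tokens):
        window = [tag for _, tag in tokens[i:i + 3]]
        # The "!det" condition: noun1 must not already carry a determiner.
        preceded_by_det = i > 0 and tokens[i - 1][1].startswith("det")
        if window == ["n", "det.def", "n"] and not preceded_by_det:
            noun1, noun2 = tokens[i][0], tokens[i + 2][0]
            out.extend(["the", noun1, "of", "the", noun2])
            i += 3
        else:
            out.append(tokens[i][0])
            i += 1
    return " ".join(out)


print(genitive_to_english([("hair", "n"), ("the", "det.def"), ("girl", "n")]))
```

The "'s" variant would be a different output template on the same match; which of the two reads better probably depends on the nouns involved.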
Welsh to English
A couple of things have come up:
Using a NOUN1 + DET.DEF + NOUN2, we get "the daughter of the doctor", but:
- Ond mae fy nhad, mam a thad fy ngŵr wedi talu trethi ac erioed wedi defnyddio'r gwasanaeth iechyd. → But My father is, mother thatand my father man after pay taxes and never after use the service health.
If we use NOUN1 + DET + NOUN2, we get "the daughter of the doctor", and:
- Ond mae fy nhad, mam a thad fy ngŵr wedi talu trethi ac erioed wedi defnyddio'r gwasanaeth iechyd. → But My father is, mother thatand my father of my man after pay taxes and never after use the service health.
Is NOUN1 + DET.POS + NOUN2 reliably going to be DET.POS + NOUN2 + 's + NOUN1 (or DET.DEF + NOUN1 + of + DET.POS + NOUN2)?
- Yes. - Donnek
Actually, this is trickier, I guess:
- fy nhad, mam a thad fy ngŵr → my father, mother and my husband's father
We'd need to consider "DET NOUN COMMA NOUN CNJCOO NOUN DET NOUN" or at least "DET NOUN COMMA NOUN" → "my father and mother" + "NOUN DET NOUN" → "my husband's father", which would give:
- my father and mother and my husband's father
Which although awkward, isn't as horrific as the current translation. I'd prefer to make small chunks where possible, even if (for now) it makes a worse translation, as when we are able to collapse NP CNJCOO NP → NP, in an intermediate stage, it will be more clean.
- I think that reads well, actually. I realise there is a tension between doing something that will work for common examples, but is too much of a hack for less common examples, and doing something more comprehensive but which will take longer to finalise. All other things being equal, I would tend towards the former for 0.1, but only you can say what is the best way forward, since you know how Apertium is put together. You've probably struck a decent balance here. - Donnek
- Ok, the output is now:
- But my father, mother and the father of my man is after pay taxes and never after use the service health.
- Which is reasonably ok... The way I have done it is:
- In level-1, made three chunks:
- SN(my father , mother)
- CC(and)
- SN(the father of my man)
- Then at level-2, I detect:
- SV SN CC SN and transform to: SN CC SN SV.
- It doesn't seem to have caused any regressions, but might warrant further testing. - Francis Tyers
- This seems to work pretty well for different phrases. The only things that break it are:
- insertion of a comma after the second noun: "mae fy nhad, mam, a thad fy ngŵr - My father, mother is, and the father of my man"
- replacing the second noun by a conjunctive genitive: "mae fy nhad, mam fy ngŵr a thad fy ngŵr - My father, mother is my man and the father of my man"
- I think this is probably good enough for this stage. - Donnek
Marking and word-order
The above brings up a useful point about this. If the standard VSO sequence is changed to SVO (ie unchanged from the English standard), this is a marked pattern, conveying a relative clause. In written Welsh, the verb will be preceded by "a" + soft mutation, but in spoken Welsh the "a" usually disappears.
- y bachgen [a] fu yn yr ardd ddydd Llun (the boy who was in the garden on Monday)
- yr eneth [a] welodd y ci (the girl who saw the dog)
contrast
- gwelodd yr eneth y ci (the girl saw the dog)
Hmmm. Relative clauses are going to be difficult.
For Welsh pattern "noun + a + soft-mutated_verb" output English pattern "noun + who/which + verb".
- The dictionary only has 'a' down as a co-ordinating conjunction "and", does it have other meanings? - Francis Tyers
- Yes. "a" - relative "who, which" in a relative clause where the subject is the same as that of the main clause, and "a" - interrogative pre-verbal particle (eg a weles ti hwnna? - did you see that?). Both are followed by soft mutation. Note that interrogative "a" is usually omitted in speech, leaving only the mutation. - Donnek
"yn" as stative
For Welsh pattern "yn + adj" output English "adj"
There is a problem here in that this pattern can also be an adverb:
- siaradodd yn hapus am ei fywyd - he talked happily about his life
For English pattern "adverb_formed_from_adj + ly" output Welsh "yn + adj"
- This second one will be difficult to do, as we don't have adverbs in the English dictionary marked as derivatives from adjectives or not. - Francis Tyers
- OK. Unfortunately, since "yn + adj" can be either an adj or an adv in Welsh, I don't even mark them separately in Eurfa - perhaps I should. Would one option be to replicate all the Welsh adj entries in Apertium by preceding them with "yn + space", and adding "-ly" to the English side? This would get the EW direction, but I don't know whether it would cause problems on the WE direction. - Donnek
The above rule has been applied (way!), but does not catch mutated adjectives ("yn" causes soft mutation):
- *tyfodd fo yn mawr -> he grew big
- tyfodd fo yn fawr -> *he grew in *fawr
- This was a dictionary error, 'fawr' did not have the initial-m paradigm. Now added. - Francis Tyers
- OK - there are a couple of others I've come across: mwy (fwy), bach (fach), gwyn (wyn). there may be a few more. - Donnek
- Taken care of the first two, 'gwyn' doesn't seem to appear in the dictionary (only as 'complaint'). does it inflect at all? - Francis Tyers
- LOL! There are some obvious words not in Eurfa, tut tut to me! gwyn (white), *gwen (in practice "wen", fem), gwynion (occasionally, plural), gwynnach (whiter), gwynnaf (whitest). There may be fem comp and super forms too, but we can ignore those. By the way, "da" also has this problem too. - Donnek
:) -- Ok, I've added gwyn/gwynnach/gwynnaf for now, adding the genders would probably mess up some rules and these are probably fairly low frequency and can be taken care of later. - Francis Tyers
We could also extend this to nouns:
- roedd hi'n waith anodd -> *was in ~a #difficult<adj><sint> work - (it) was hard work
(though "work" gets lost in the second proc run).
For Welsh pattern "yn + non-place noun-phrase" output English "noun-phrase"
This is a bit complex. There are two "yn"s in Welsh: "yn" showing state or condition, or extension in time (yn hapus - happy; yn mynd - going), and "yn" the preposition showing location in a specific place (yn y tŷ - in the house (contrast: mewn tŷ - in a house); yn Nolgellau - in Dolgellau). (They are probably related historically.) The stative "yn" soft-mutates nouns and adjectives, but not verbs; the location "yn" nasal-mutates (and changes to "ym" to match an initial "m" in the following noun, eg ym Mangor - in Bangor).
So - as it stands, the above will clash with 1.3.10 (change prep+noun to prep+det.indef+noun), even though "yn" the preposition will never occur before a non-specific noun (it must have specificity), and even though the above is not actually "yn" the preposition (it's "yn" the stative). We can't use the stative soft-mutation to decide, because (a) that doesn't apply to some consonant initials, and (b) other prepositions cause mutation too, and it would be overkill to check for each one. So the easiest thing is to adjust 1.3.10 to exclude "yn" as one of the prepositions that will be caught. Is it easy? I don't know :-)
- I've added the yn "stative" to the analyser as well as the yn "preposition", but until we retrain the tagger it will not pick the former. If you could think of any rules that will choose the right one in a given context it would help (for ideas on the kinds of restrictions to these rules, see here and here). - Francis Tyers
- The simplest would be:
- Welsh word "yn" is a preposition
- when it is followed by "det.def" or by a capitalised word
- otherwise it is a stative
- That may not be perfect, but it is good enough. I'll bear in mind the tagger pages, but it may take a while to get to that stage. - Donnek
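The heuristic above is small enough to write down directly. This is only a sketch in Python, not a tagger rule: the function name `classify_yn` and the tag string `det.def` are placeholders, and it assumes we can see the word after "yn" together with its tag.

```python
def classify_yn(next_word, next_tag):
    """Guess which 'yn' we have from the token that follows it.

    Preposition when followed by a definite determiner or a capitalised
    word (e.g. a place name); stative otherwise.
    """
    if next_tag == "det.def" or next_word[:1].isupper():
        return "preposition"
    return "stative"


print(classify_yn("y", "det.def"))     # yn y tŷ
print(classify_yn("Nolgellau", "np"))  # yn Nolgellau
print(classify_yn("hapus", "adj"))     # yn hapus
```

As noted, this is not perfect: a capitalised word after a stative "yn" would be misclassified.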
Insert det.indef before non-definite noun
- daeth Taid â lamp -> Grandfather came with lamp (preferably "with a lamp")
- dychwelodd y rheolwr gyda gŵr tew -> the manager returned with fat man (preferably "with a fat man")
For Welsh pattern "prep + noun" output English pattern "prep + det.indef + noun"
this should probably be working now, - Francis Tyers
I am tempted to retire this in favour of a broader rule:
For Welsh pattern "non-specific noun.sg" output English "a + non-specific noun.sg"
"Non-specific" here means a noun that is not qualified by det.def, pr.poss, etc.
- daeth car ar hyd y ffordd -> *car came along the road - a car ...
- mae'r athro yn licio chwarae gêm o golff -> *the teacher is liking play game he golf - the teacher likes to play a game of golf
1.3.16 would deal with "play". For "gêm o golff" we would need to prevent "*a game of a golf" (which the existing rule would in fact have produced for "o golff"). Perhaps:
For Welsh pattern "non-specific noun + o + non-specific noun" output English pattern "a + noun + of + noun"
This would need to fire before the revised rule above, or we need some other way of sorting out the possible doubling of "a" (a a game of golf) - LOL - let it happen and then have a rule:
For English pattern "det.indef + det.indef" output "det.indef"
- gwelodd y bachgen gath yn yr ardd -> the boy saw cat in the garden (preferably "a cat")
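To make the "insert, then clean up doubles" idea concrete, here is a rough Python sketch. It works on English-order (word, tag) pairs; the tag names (`n.sg`, `adj`, `det.def`, `det.indef`, `pr.poss`) and both function names are invented for the example. It does not attempt the "gêm o golff" case, which would still need the separate noun + o + noun rule.

```python
# Tags that already make a noun phrase specific (hypothetical tag names).
SPECIFIERS = {"det.def", "det.indef", "pr.poss"}


def insert_indefinite(tokens):
    """Put 'a' in front of any singular noun phrase lacking a specifier."""
    out = []
    for word, tag in tokens:
        if tag == "n.sg":
            # Walk back over adjectives so 'a' lands before 'fat man'.
            start = len(out)
            while start > 0 and out[start - 1][1] == "adj":
                start -= 1
            if start == 0 or out[start - 1][1] not in SPECIFIERS:
                out.insert(start, ("a", "det.indef"))
        out.append((word, tag))
    return out


def collapse_indefinites(tokens):
    """det.indef + det.indef -> det.indef."""
    out = []
    for token in tokens:
        if out and token[1] == "det.indef" and out[-1][1] == "det.indef":
            continue
        out.append(token)
    return out


phrase = [("with", "pr"), ("fat", "adj"), ("man", "n.sg")]
print(" ".join(w for w, _ in insert_indefinite(phrase)))
```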
Preferential choice between verbforms
- bydd y lamp yn rhoi golau -> *are the lamp giving light - the lamp will be giving light (and presumably we could massage this into "the lamp will give light" later, since that would be the more natural English equivalent)
A couple of things here. The most important is that the tagger chooses the less frequent imperative out of the imperative/future choice for the verb. Presumably this then means that the subject shift can't take place. But even with the imperative choice, the imperative 2p sing info gets lost between interchunk and postchunk, and replaced with a generic? present which gets output as "are". Odd.
(I'm assuming that "bydd" would get output as "will be", since that would be the correct English tense.)
- I fixed this (crudely) by commenting out the imperative for "be" 2pSg (You are!) When I train the tagger next I'll see if i can take care of it there. - Francis Tyers
Another example:
- roedd y bechgyn yn gallu croesi dan y ffordd -> *the boys were #be<vbmod><ger># able to #vblex><vblex><pres> under the road - the boys were able to cross under the road
Here, proc decides to ignore "croesi" (cy-en.dix line 6203) as an infinitive in favour of conjugating it as present 2p sing. The infinitive option doesn't even show! Can we add some such rule as:
When you have homographous Welsh verb options <inf> and conjugated_verb choose <inf> unless verb is followed by pr
This isn't very good, because you could have the conjugated form without a pronoun, but it might deal with this to some extent.
The example above also has some funkiness going on with "gallu" - "were be able" needs to be transformed into "were able". However, I don't know enough about how Apertium treats modal verbs to make a suggestion.
- This is now giving:
- the boys were being able to cross under the road
- Which is an improvement. On the other hand, "^be<vbser><past><p3><pl>$ ^be able to<vbmod><ger>$" is probably redundant :) -- Not sure how to deal with this. - Francis Tyers
- What about a purely surface hack, eg:
- For English pattern "was/were + being + able"
- output "was/were + able"
- Or a mapping to convert "be able" to "can/could"? - Donnek
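The surface hack would amount to something like the following, shown here as a Python regular expression on the output string rather than as a transfer rule. The function name `drop_redundant_being` is just for illustration.

```python
import re

# "was/were + being + able" -> "was/were + able"
BEING_ABLE = re.compile(r"\b(was|were) being able\b")


def drop_redundant_being(sentence):
    """Remove the redundant 'being' between was/were and 'able'."""
    return BEING_ABLE.sub(r"\1 able", sentence)


print(drop_redundant_being("the boys were being able to cross under the road"))
```

Being purely string-based, it would also fire on any genuine "was being able" in the output, which is probably acceptable at this stage.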
Comparative adjectives with "less/more"
- tyfodd y twnnel yn llai llachar -> *the tunnel grew small bright - the tunnel grew less bright
- tyfodd y twnnel yn fwy tywyll -> *the tunnel grew big dark - the tunnel grew more dark
For Welsh pattern "fwy/llai + adj" output English "more/less + adj"
For English pattern "more/less + adj" output Welsh "fwy/llai + adj"
- I'll see if i can copy in a rule from Spanish--English for this :) - Francis Tyers
Synthetic comparative adjectives
Many of these seem to have faulty dictionary entries:
- tyfodd y twnnel yn fwy -> the tunnel grew big (should be "bigger")
- tyfodd y twnnel yn llai -> the tunnel grew small (should be "smaller")
- tyfodd y twnnel yn hirach -> the tunnel grew long (should be "longer")
- tyfodd y twnnel yn uwch -> the tunnel grew high (should be "higher")
- Dictionary error in the bidix, now fixed. - Francis Tyers
- Cool! - Donnek
Note, also a rule needs to be written for:
- the expensive house → y tŷ drud
- the more expensive house → y tŷ drutach
- the most expensive house → y tŷ drutaf
- You mean for the consonant change t->d? - Donnek
Subordinate ("reported speech") clauses with "bod" + noun
Also referring to the cool sentence, we have two sentences as follows:
- (1) roedd y Comisiwn yn ymchwilio i'r honiadau - the Commission was investigating the allegations
- (2) mae yr AS wedi methu datgan £103,000 o roddion - the MP has failed to declare £103,000 of gifts
Subordinate clauses, like the relative clauses, will be difficult. But a first stab at this might be as follows:
For Welsh pattern "[b/f]od + [det.def] + noun + [qualifiers] + wedi + verb" output English "that + [det.def] + noun + [qualifiers] + has/have (number agreeing with noun) + verb_past_participle"
- clywodd y dyn bod y trên wedi cyrraedd yn hwyr -> *the man heard be the train after arrive #late
- This would be:
- VBSER(INF) + DEFINITE_NP + wedi + VERB
- THAT + DEFINITE_NP + HAS + VERBPP
- Where DEFINITE_NP is any noun phrase preceded by the definite article? - Francis Tyers
- Actually, thinking about this again, it doesn't have to be definite - it just so happened that in those sentences it was. You could have something like:
- clywodd ysbïwr bod y trên .... -> a spy heard that the train ....
- So the NP could be "[det.def, rhyw (some), pr.poss] + [adj - eg hen] + noun + [qualifiers - adjectives, demonstratives, etc]", or it could just be "pr.subj" (clywodd fo bod y trên ...). The same applies to the "am" construction below. Another point is that the VBSER can be soft-mutated - "fod" instead of "bod". - Donnek
- This rule is broadly working for now. At least it is inserting the 'that', a form of 'have' and changing the verb to a pp. It is not however robust, and seems to me a bit hacky. Could you give some more examples so I can fine tune it? - Francis Tyers
- Hacky? Surely not .... I did say that relative and subordinate clauses will be difficult, so we may have to refactor as we go along. An alternative to the above (which would also cover the adjective example below) would be:
- For Welsh pattern "[b/f]od + NP + complement"
- output English "that + NP + is + complement"
- This would give:
- clywodd y dyn bod y trên wedi cyrraedd yn hwyr -> *the man heard that the train is after arrive #late
- You would then have further rules to transform "is + after + verb" to "has + verbpp", and "is + for + verb" to "will + verb". (Irish and Gaidhlig have a similar construction, by the way, using "ar, air" instead of "wedi", so whatever rule bundle you use here would be transferable to that branch of Celtic too.)
- There is also another similar construction using "ar" in place of "wedi" and "am" - this one means "about to":
- clywodd y dyn bod y trên ar gyrraedd yn hwyr -> *the man heard be the train on arrive late - the man heard that the train was about to arrive late
- So an additional rule "is + on + verb -> was + about to + verb".
- Oh, there's another one too, with "newydd", meaning "just now":
- clywodd y dyn bod y trên newydd gyrraedd yn hwyr -> *the man heard be the new train arrive late - the man heard that the train had just arrived late
- This one has been caught by the adjective rule, but "newydd" belongs to the VP, not the NP in this case, so we'd need some prioritisation.
- Other examples:
- roedd y bachgen yn dweud bod y tŷ wedi mynd ar werth -> the boy was saying that the house has gone on value
- (fine, except that "ar werth" means "on sale" - see 1.3.19 below.)
- dywedodd hi bod y trên yn hwyr -> *she said be the train late - she said that the train is late.
- You could deal with this one by adding a similar rule:
- VBSER(INF) + NP + ADJ
- THAT + NP + VBSER + ADJ
- dwi'n meddwl bod y glaw wedi stopio -> *dwithinking that the rain has stopped - I think that ....
- (We need to get the present tense of "bod" sorted out too)
- dywedodd yr eneth bod y siop am agor ar amser -> the girl said that the shop will open on #time
- (I'm noticing the lack of adverbs, eg "tomorrow", "today", "afterwards", etc. I suppose the remaining bits of Eurfa need importing at some point.)
- Donnek
The above rule would give "the man heard that the train has arrived late" - not perfect, since in English we would use pluperfect rather than perfect here, but a lot better.
We can extend this to another construction:
For Welsh pattern "[b/f]od + [det.def] + noun + [qualifiers] + am + verb" output English "that + [det.def] + noun + [qualifiers] + will + verb"
- clywodd y dyn bod y trên am gyrraedd yn hwyr -> *the man heard be the train for arrive #late
The above rule would give "the man heard that the train will arrive late" - not perfect, since in English we would use conditional rather than future here.
- This now seems to be working. - Francis Tyers
These could be improved if it were possible to refer back to the verb of the main clause. Thus where it is past, the subordinate would use pluperfect or conditional; where it is non-past, the subordinate would use perfect or future.
- We can probably set this as a variable, but what would be the triggers to set/unset the variable? - Francis Tyers
- End of the clause? Full stop or comma, perhaps? - Donnek
There are other varieties of subordinate clause that I give other suggestions about.
Incidentally, in the above the det.def should be taken to include other prenominal qualifiers like possessives.
Verbal nouns / Infinitives
- roedd y dyn yn gwerthu pethau rhad -> the man was selling cheap things
- roedd fo yn palu -> he was digging
Both of these are fine.
Making the verbal noun/infinitive the subject doesn't work quite so well:
- roedd gwerthu pethau rhad yn hawdd -> *was sell cheap things easy - selling cheap things was easy
- roedd palu yn waith caled -> *was dig in a hard work - digging was hard work
The latter would benefit from the extension of the "yn as stative" rule to nouns, as suggested in 1.3.4 above. But we also need to define the VN as a subject, so that it can be shifted. This is not easy, because the rule may cause problems with other constructions later. But we can take a stab at it.
First, we can use the infinitival form in English - "to sell cheap things was easy" and "to dig was hard work" are equivalent to the above sentences.
For Welsh pattern "verb<vblex><inf>" output English "to + verb"
For English pattern "to + verb" output Welsh "verb<vblex><inf>"
This allows the same rule to be used in sentences like:
- ceisiodd y dyn agor y bocs -> *the man sought open the box
which should produce "the man tried to open the box". (Can we delete the "seek" entry for "ceisio" until we have refined choices between different entries? The "try" entry is more frequent.)
"seek" entry commented out and replaced with "try". Francis Tyers
Second, we can assume that a verbal noun phrase will occur after an inflected verb (mostly forms of "bod"). So we might try expanding the above to say:
For Welsh pattern "verb_inflected + verb<vblex><inf> + [noun phrase]" output English "to + verb + [noun phrase] + verb_inflected"
for the first two sentences ("roedd gwerthu pethau rhad" and "roedd palu"), and:
For Welsh pattern "verb_inflected + subject + verb<vblex><inf> + noun phrase" output English "subject + verb_inflected + to + verb + noun phrase"
for the third ("ceisiodd y dyn agor y bocs").
This is not perfect, and I am not sure how it would cut across the existing rule for subject shift.
There are also interesting issues with nesting of infinitival subject phrases:
- roedd gwerthu pethau rhad yn hawdd yn neis -> *was sell cheap things easy nice - to sell cheap things easily was nice
- roedd gwerthu pethau rhad yn hawdd yn beth neis -> *was sell cheap things easy in a nice thing - to sell cheap things easily was a nice thing
where I'm not sure how you specify the boundaries of the noun phrase. Any views, or is that too complex for the present iteration?
- At the moment to define noun phrases we just define fixed length patterns of tags which are matched in left-to-right, longest-match way. So for example for Welsh to English:
- NOUN → SN_(NOUN)
- PRNSUBJ → SN_(PRNSUBJ)
- DET → SN_(DET)
- DET NOUN → SN_(DET NOUN)
- NOUN1 NOUN2 → SN_(NOUN2-NOUN1)
- NOUN ADJ → SN_(ADJ NOUN)
- yn ADJ → SN_(ADJ)
- DET NOUN ADJ → SN_(DET ADJ NOUN)
- NOUN ADJ1 ADJ2 → SN_(ADJ2 ADJ1 NOUN)
- The first is the pattern we detect in Welsh, and the second is the "chunk" that we output in English. Any suggestions on defining more of these (the most frequently occurring), or changing them would be appreciated. - Francis Tyers
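For anyone who wants to play with the idea outside Apertium, the left-to-right, longest-match behaviour over the patterns listed above can be imitated in a few lines of Python. This is only an illustration, not how the transfer modules are implemented: the tag names and the `chunk` function are made up, tokens are assumed to be (english_gloss, tag) pairs, and the "yn ADJ" pattern is left out because it is keyed on a lemma rather than a tag.

```python
# Welsh tag pattern -> order in which the matched words are output in English.
PATTERNS = {
    ("n",): (0,),
    ("prn.subj",): (0,),
    ("det",): (0,),
    ("det", "n"): (0, 1),
    ("n", "n"): (1, 0),             # NOUN1 NOUN2 -> NOUN2 NOUN1
    ("n", "adj"): (1, 0),           # NOUN ADJ -> ADJ NOUN
    ("det", "n", "adj"): (0, 2, 1),
    ("n", "adj", "adj"): (2, 1, 0),
}
MAX_LEN = max(len(p) for p in PATTERNS)


def chunk(tokens):
    """Scan left to right, always taking the longest matching pattern."""
    chunks = []
    i = 0
    while i < len(tokens):
        for length in range(min(MAX_LEN, len(tokens) - i), 0, -1):
            tags = tuple(tag for _, tag in tokens[i:i + length])
            if tags in PATTERNS:
                words = [tokens[i + k][0] for k in PATTERNS[tags]]
                chunks.append("SN_(" + " ".join(words) + ")")
                i += length
                break
        else:
            # No pattern starts here: pass the word through unchunked.
            chunks.append(tokens[i][0])
            i += 1
    return chunks


print(chunk([("the", "det"), ("cat", "n"), ("red", "adj")]))
```

Because the patterns are fixed-length, a phrase longer than the longest pattern is simply split into several chunks, which is the boundary problem raised above.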
Note also in the above that we have the adverb problem from 1.3.4.
Non-compositional multiword phrases
This section is for phrases that have to be scoped as a whole, rather than broken down to their constituent parts. Note: Will probably be marked as "adv" if nothing is given.
- ar werth - on sale (adv)
- ar ôl y cwbl - after all (adv)
- erbyn hyn - by now (adv)
- doed a ddelo - come what may (adv)
- hyd yn oed - even (adv)
- pan fo angen - as needed (adv)
- wrth gwrs - of course (adv)
- yn ôl i - back to (adv)
- yn weddill i - remaining to (adv)
Superlative adjective + "oll"
- y rhai lleiaf oll -> *the some smallest *oll - the smallest ones of all
- yn gyntaf oll -> *first *oll - first of all
For Welsh pattern "adj.super + oll" output English "adj.super + of all"
We need to add "oll" (all) to the dictionary, but this would still be a necessary rule.
"rhai"
- y rhai bach -> *the some small - the small ones
- rhai mawr -> *some big - big ones
"rhai" (some) can be considered the plural of "un" (one).
For Welsh pattern "rhai + adj" output English "adj + ones"
This also applies to phrases like "y rhai lleiaf oll" above. We need to convert "oll" first on the basis that it follows an adj, and then we need to convert "rhai" on the basis that it precedes an adj.
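The two-step order ("oll" first, then "rhai") can be sketched in Python as two passes over the tokens. Everything here is illustrative: tokens are assumed to be (welsh_word, tag, english_gloss) triples, the tag names `adj` and `adj.super` are placeholders, and `translate_rhai_oll` is not an existing function.

```python
def translate_rhai_oll(tokens):
    """tokens: list of (welsh_word, tag, english_gloss) triples."""
    # Pass 1: 'oll' directly after a superlative adjective becomes 'of all'.
    pass1 = []
    for word, tag, gloss in tokens:
        if word == "oll" and pass1 and pass1[-1][1] == "adj.super":
            pass1.append((word, "adv", "of all"))
        else:
            pass1.append((word, tag, gloss))

    # Pass 2: 'rhai' directly before an adjective becomes 'ones' after it.
    out = []
    i = 0
    while i < len(pass1):
        word, tag, gloss = pass1[i]
        nxt = pass1[i + 1] if i + 1 < len(pass1) else None
        if word == "rhai" and nxt and nxt[1].startswith("adj"):
            out.extend([nxt[2], "ones"])
            i += 2
        else:
            out.append(gloss)
            i += 1
    return " ".join(out)


print(translate_rhai_oll([
    ("y", "det.def", "the"),
    ("rhai", "det", "some"),
    ("lleiaf", "adj.super", "smallest"),
    ("oll", "adv", "all"),
]))
```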
Lexis
- cwm → cirque (replace with: cwm → valley)
- Yes, the Eurfa entry is tagged (geography) so this makes sense. But "cym" would be wrong - cwm (valley), cymau or cymoedd (valleys) - Donnek
- Oops, typo :) - Francis Tyers
- Some verb lemmas are being generated incorrectly, (see here)
"cael" - get / have
"cael" is in the dix as "get", because in most cases that is the better gloss:
- mi aeth o i'r banc i gael pres allan -> *mi he went in the bank I get money *allan - he went to the bank to get money out
Apertium's version shows some regressions (we can omit the "mi" - see comment to 1.3.27, and "allan", along with the other Eurfa adverbs, has yet to be added to the dix). "i" should not be glossed as "in", but "to" (see comment to 1.3.7). Second, the shortened form of pr.sub.1p.sing ("i") will *never* occur without a preceding conjugated verb, so that option has to be banned in this context.
- Regarding the shortened prn.subj, yep, this is the biggest regression we've had and I'm currently trying to fix it. - Francis Tyers 14:14, 1 July 2008 (UTC)
- mae o'n cael mynd ar ôl y cwbl -> *he is getting go after the all - he's getting to go after all
Pretty creditable attempt, but could be improved if 1.3.16 was applied (if infin is not preceded by "yn", translate as "to"+ infin).
- mi gafodd yr athro radd dda o Brifysgol Bangor -> **mi the teacher got good degree of Bangor-University - the teacher got a good degree from the University of Bangor
"mi" again (1.3.27) and inserted "a" (1.3.10) would improve this, but choosing "of" or "from" for "o" will be difficult. For "Bangor-University" see 1.3.31 below.
Note that "cael" in this sentence focusses on the "getting" - focussing on possession uses a different construction:
- mae gan yr athro radd dda o Brifysgol Bangor -> *is with the teacher good degree of Bangor-University - the teacher has a good degree from the University of Bangor
"cael" is also used to create a passive construction:
- mi gafodd y dyn ei daro gan gar -> **mi the man got his strike with car - the man was struck by a car
Apertium's version is pretty close to the literal - "he got his striking by a car" - which is close to the English "he got himself struck by a car"
However, in the coolness sentence this gloss is inappropriate:
- ei bod hi'n cael perthynas â dyn llawer hŷn -> *that she is getting relation with a much older man - that she is having a relationship with a much older man
Can we try the following rule:
For Welsh pattern "cael + NP + {â, gyda}" output English "have + NP + with"
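As a sketch of how that choice might look, here is the rule in Python. It assumes that what follows "cael" has already been chunked into (lemma, kind) pairs; the kind labels `NP`, `pr` and `adv` and the function name `choose_cael_gloss` are hypothetical.

```python
WITH_PREPOSITIONS = {"â", "gyda"}


def choose_cael_gloss(following):
    """Pick the English verb for 'cael' from the chunks that follow it.

    following: list of (lemma, kind) pairs after 'cael'.
    'have' when an NP is followed by â/gyda; otherwise the default 'get'.
    """
    if (
        len(following) >= 2
        and following[0][1] == "NP"
        and following[1][0] in WITH_PREPOSITIONS
    ):
        return "have"
    return "get"


print(choose_cael_gloss([("perthynas", "NP"), ("â", "pr")]))
print(choose_cael_gloss([("pres", "NP"), ("allan", "adv")]))
```

The passive use ("cael + ei + verb") would need its own rule and is not covered here.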
"i" + infin
After most time conjunctions (eg
- cyn - before,
- ar ôl - after,
- ers - since,
- nes - until,
- erbyn - by the time that,
- wedi - after,
- wrth - while),
- rhag ofn (in case),
- er mwyn (in order to),
the conjunctive clause uses "i" + SM_infin to express person (with "i" being conjugated as appropriate), with tense being dependent on the first part of the sentence.
- In terms of conjunctions, we currently have:
<l>a</l><r>and</r>
<l>ond</l><r>but</r>
<l>neu</l><r>or</r>
- I'll add these, although some of them already exist as prepositions (which will be fun for the tagger to distinguish between). Should I just add them all to the bidix and Welsh monodix as given above? - Francis Tyers
- I tend to think of them just as prepositions with "i" after them, ie a special form of prepositional phrase. But you're right, it would probably be best to tag them as conjunctions. You could get the tagger to distinguish them from prepositions by specifying that the conjunctive version will always be followed by "i" (or a conjugated form of it). - Donnek
- Ok, I've added some others I found in my grammar too... probably some of them are variants (north/south) maybe? e.g. it has "gan" for "since". A quick other question:
- Until I get the tagger up to speed, will it make much difference if I identify these in the transfer rules based only on lemma as opposed to on lemma+POS ? e.g. this means that if the tagger accidentally tags "wrth" as a preposition, it could still apply it in the transfer rules below. - Francis Tyers
- Yes, there are others I haven't got around to yet. "gan" (because, since) and "am" (because, since) are two of them, but they use a "bod" construction similar to 1.3.26, so that's why they're not in this section.
- Re the tagger question, I think that should cause no problems, because the key point is that all of these are followed with "i" - that's why I said above that I tend to think of them as "preposition + i construction". - Donnek
(Note in the examples below that "i'r" is still coming up as "in the", which needs to be addressed.)
- Done. This was confusing me for a while, but turns out it was down as "yn+yr" not "i+yr"! - Francis Tyers
- The little rotter. Hopefully it has been squashed now. - Donnek
- aeth y bachgen wedi i'r bws ddod - *the boy went after in the bus come - the boy went after the bus came
- aeth y bachgen wedi iddo fo ddod -> *the boy went after to him come - the boy went after he/it came
- bydd o wedi gadael erbyn i'r llythyr gyrraedd -> *that he has left by in the letter arrive - he will have left by the time the letter arrives
- bydd o wedi gadael erbyn iddo fo gyrraedd -> *that he has left by to him arrive - he will have left by the time he/it arrives
- wrth i'r heddlu ddod i mewn, rhedodd y dyn i mewn i'r stryd -> *beside in the police come I in, the man ran intothe road - as the police came in, the man ran into the street
- wrth iddyn nhw ddod i mewn, rhedodd y dyn i mewn i'r stryd -> *beside to them come I in, the man ran intothe road - as they came in, the man ran into the street
For Welsh pattern "cyn + {"i" + NP, "i".conj + [pr.subj]} + SM_verb.infin" output English "before + {NP, pr.subj} + verb.present/verb.past" (see below)
- Done. - Francis Tyers
For Welsh pattern "wedi + {"i" + NP, "i".conj + [pr.subj]} + SM_verb.infin" output English "after + {NP, pr.subj} + verb.present/verb.past"
- Done. - Francis Tyers
For Welsh pattern "ar ôl + {"i" + NP, "i".conj + [pr.subj]} + SM_verb.infin" output English "after + {NP, pr.subj} + verb.present/verb.past"
For Welsh pattern "nes + {"i" + NP, "i".conj + [pr.subj]} + SM_verb.infin" output English "until + {NP, pr.subj} + verb.present/verb.past"
For Welsh pattern "erbyn + {"i" + NP, "i".conj + [pr.subj]} + SM_verb.infin" output English "by the time that + {NP, pr.subj} + verb.present/verb.past"
For Welsh pattern "wrth + {"i" + NP, "i".conj + [pr.subj]} + SM_verb.infin" output English "as + {NP, pr.subj} + verb.present/verb.past"
- Done. - Francis Tyers
For Welsh pattern "rhag ofn + {"i" + NP, "i".conj + [pr.subj]} + SM_verb.infin" output English "in case + {NP, pr.subj} + verb.present/verb.past"
For Welsh pattern "er mwyn + {"i" + NP, "i".conj + [pr.subj]} + SM_verb.infin" output English "in order for + {NP, pr.subj} + verb.present/verb.past"
For Welsh pattern "ers + {"i" + NP, "i".conj + [pr.subj]} + SM_verb" output English "since + {NP, pr.subj} + verb.past"
Note: the use of present or past tense depends on the main verb (except with "ers", when you really only use it in the past). If Apertium can only choose one, then use the present as default. But it would be nice to be able to switch based on the main verb - if this is past, use the past tense (preterite) in English; if it is non-past, use the present tense in English (see comments to 1.3.15).
- I'll try setting a variable which tracks the tense of the last conjugated verb (e.g. no inf, ger, etc.) - Francis Tyers
- That would be great, because it makes quite a difference in terms of how the translation reads to an English speaker. - Donnek
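The tense-tracking idea in this exchange can be illustrated with a minimal Python sketch. It is not the transfer-rule implementation: the function names, the tense labels ("past", "pii", "pres") and the dictionary of pre-inflected English verb forms are assumptions made for the example. The conjunction glosses are the ones given in the patterns above.

```python
# Hypothetical sketch of the tense choice for conjunctive clauses.
# The tag names and helper names are assumptions, not Apertium's own API.

# Welsh conjunction -> English gloss, as listed in the patterns above.
CONJUNCTIONS = {
    "cyn": "before",
    "wedi": "after",
    "ar ôl": "after",
    "nes": "until",
    "erbyn": "by the time that",
    "wrth": "as",
    "rhag ofn": "in case",
    "er mwyn": "in order for",
    "ers": "since",
}

# Tenses of the main verb that count as "past" (assumed labels).
PAST_TENSES = {"past", "pii"}


def clause_tense(conjunction, main_verb_tense):
    """Pick the English tense for the verb in the conjunctive clause."""
    if conjunction == "ers":
        return "past"  # "ers" is really only used in the past
    if main_verb_tense in PAST_TENSES:
        return "past"  # follow a past main verb
    return "pres"  # default, also used when the main tense is unknown


def translate_clause(conjunction, subject, verb_forms, main_verb_tense):
    """Build the English clause from a gloss, a subject and inflected forms."""
    tense = clause_tense(conjunction, main_verb_tense)
    return f"{CONJUNCTIONS[conjunction]} {subject} {verb_forms[tense]}"
```

With `verb_forms = {"pres": "finishes", "past": "finished"}`, a past main verb ("caeodd") yields "before the boy finished", while a non-past main verb yields "before the boy finishes". Agreement between subject and verb is assumed to be handled by whoever supplies `verb_forms`.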
- It's done, but could I get examples for each of the rules above so I can test them? - Francis Tyers
- I'll do full versions for "cyn":
- roedd y siop wedi cau cyn i'r bachgen orffen ei ginio -> *The shop was after close before the boy completed his dinner - the shop had closed before the boy finished his dinner
- roedd y siop wedi cau cyn iddo orffen ei ginio -> *The shop was after close before him completed his dinner - the shop had closed before he finished his dinner
- Great - the "after close" is because we don't have a rule yet for periphrastic tenses.
- caeodd y siop cyn i'r bachgen orffen ei ginio -> The shop closed before the boy completed his dinner - the shop closed before the boy finished his dinner
- caeodd y siop cyn iddo orffen ei ginio -> The shop closed before him completed his dinner - the shop closed before he finished his dinner
- Great.
- mae'r siop yn cau cyn i'r bachgen orffen ei ginio -> *The shop is closing before the boy complete his dinner - the shop closes before the boy finishes his dinner
- mae'r siop yn cau cyn iddo orffen ei ginio -> *The shop is closing before him complete his dinner - the shop closes before he finishes his dinner
- Fine, apart from the fact that the pr number is not being carried across to the verb.
- bydd y siop yn cau cyn i'r bachgen orffen ei ginio -> *Are the shop closing before the boy #complete<vblex><imp> his dinner
- bydd y siop yn cau cyn iddo orffen ei ginio -> *Are the shop closing before him #complete<vblex><imp> his dinner
- Not so good. The main problem is that the imperative is being chosen instead of the future, and I suspect this then mucks up your tense choice.
- Is there any way to disambiguate between imperative / future here? - Francis Tyers
- Note that with the pronoun versions the pr.obj is being used instead of the pr.subj.
- Similar sentences for "wedi" (after):
- roedd y siop wedi cau wedi i'r bachgen orffen ei ginio
- roedd y siop wedi cau wedi iddo orffen ei ginio
- caeodd y siop wedi i'r bachgen orffen ei ginio
- caeodd y siop wedi iddo orffen ei ginio
- mae'r siop yn cau wedi i'r bachgen orffen ei ginio
- mae'r siop yn cau wedi iddo orffen ei ginio
- bydd y siop yn cau wedi i'r bachgen orffen ei ginio
- bydd y siop yn cau wedi iddo orffen ei ginio
- Similar sentences for "wrth" (as, while):
- roedd y siop wedi cau wrth i'r bachgen orffen ei ginio
- roedd y siop wedi cau wrth iddo orffen ei ginio
- caeodd y siop wrth i'r bachgen orffen ei ginio
- caeodd y siop wrth iddo orffen ei ginio
- mae'r siop yn cau wrth i'r bachgen orffen ei ginio
- mae'r siop yn cau wrth iddo orffen ei ginio
- bydd y siop yn cau wrth i'r bachgen orffen ei ginio
- bydd y siop yn cau wrth iddo orffen ei ginio
- Donnek
Conjunctive genitive with proper names
The sentence from 1.3.29 above:
- mi gafodd yr athro radd dda o Brifysgol Bangor -> **mi the teacher got good degree of Bangor-University - the teacher got a good degree from the University of Bangor
suggests the need to deal separately with proper nouns; a useful shorthand would be "any words not coming at the beginning of a sentence which are capitalised".
The "Bangor-University" is, I believe, an earlier rule intended to deal with compound nouns (I get "*woods-things" for "pethau pren" - wooden things, for instance), and I wonder whether it might be commented out for the present?
Yep, actually I haven't yet implemented the rule at the very top (in fact the first rule), as I'm still considering how to do it most effectively. I will make this a priority. - Francis Tyers
- Ah, I see. Yes, it would be nice to get that sorted, because it's probably the most distinctive Welsh construction vis-a-vis English. - Donnek
- For more discussion, see 1.3.1.1. - Francis Tyers
In this and similar instances it would be preferable for variations of the normal genitive rule (1.3.1) to apply (a capitalised noun is by definition definite), and we can therefore suggest an additional rule:
For a Welsh phrase of the type "!det.def + noun1 + noun_capitalised" output English to "det.def + noun1 + of + noun_capitalised"
Should we make this noun_capitalised or "proper name"? (We mark proper names separately in the dictionary and separate them into "anthroponyms", "cognomens" and "toponyms".) - Francis Tyers
- Ah, excellent. Would this offer a way of dealing with items like "Red Branch" below? Any of these three categories would use the rule above, but capitalised words that are not marked as one of the above would instead get the det.def added before the capitalised_noun. - Donnek
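As a rough illustration of the rule proposed above, here is a small Python sketch that works on already-glossed tokens. The `(gloss, pos)` token representation and the tag names `"det.def"`, `"n"` and `"np"` are assumptions for the example, not the actual transfer format; `"np"` stands for any of the proper-name categories mentioned.

```python
# Hypothetical sketch of "!det.def + noun1 + proper name"
# -> "det.def + noun1 + of + proper name".
# Token layout and tag names are assumed for illustration only.

def apply_proper_genitive(tokens):
    """tokens: list of (english_gloss, pos) pairs; returns an English string."""
    out = []
    i = 0
    while i < len(tokens):
        gloss, pos = tokens[i]
        nxt = tokens[i + 1] if i + 1 < len(tokens) else None
        prev_pos = tokens[i - 1][1] if i > 0 else None
        is_genitive = (
            pos == "n"
            and nxt is not None
            and nxt[1] == "np"
            and prev_pos != "det.def"  # the "!det.def" condition
        )
        if is_genitive:
            # A proper name is definite, so the head noun gets "the".
            out.extend(["the", gloss, "of", nxt[0]])
            i += 2
        else:
            out.append(gloss)
            i += 1
    return " ".join(out)
```

For the glossed tokens of "ar ôl priodas Cwchwlin", that is `[("after", "pr"), ("marriage", "n"), ("Cwchwlin", "np")]`, this returns "after the marriage of Cwchwlin". Capitalised words that are not marked as proper names would need a separate branch, as discussed above.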
The following is another example:
- Flwyddyn neu ddwy ar ôl priodas Cwchwlin, paratodd Bricriw Dafod Ffals wledd yn ei gastell a gwahoddodd y Brenin Conor ac arwyr y Gangen Goch.
This is very lightly edited actual text from p55 of "Cwchwlin" by Ivor Owen, a retelling of the Irish Cúchulain legend. Apertium's version of this is:
- Year or two after marriage *Cwchwlin, prepared *Bricriw *Dafod *Ffals feast in his castle that invited the King *Conor and the heroes Red Branch.
This is actually pretty good.
The rule above would deal with the following:
- ar ôl priodas Cwchwlin -> *after marriage Cwchwlin - after the marriage of Cwchwlin
This is kind of done: if we choose a "known" name, we get "Year or two after the marriage of Elin." - Francis Tyers
- Well done.- Donnek
The other example:
- ac arwyr y Gangen Goch -> *and the heroes Red Branch - and the heroes of the Red Branch
should already be caught under the second rule under 1.3.1 above, but isn't. Unfortunately, the rule in this section would give the suboptimal "and the heroes of Red Branch". It may be possible to differentiate between different types of proper name (ie those which have an English gloss and those which don't), but in the meantime, on balance, it makes sense to implement the rule in this section.
The extended rule in 1.3.10 would deal with "(a) year" and "(a) feast".
There is the unsolved issue with word-sequence for unknown words (in my view, some pragmatic response to this has to be found, because it is unlikely that Apertium will ever be dealing with a text in which it knows every word). The worst infelicity in the above is that the tagger has selected "a" as the relative particle "who, which, that" instead of the much more likely "and" (which is the correct one in this case). This should actually have been ruled out in this case because relative-a is followed by SM, while conjunctive-a is not.
Yep, this is a tagger error and I'm working to resolve it. Unfortunately the forbid/enforce rules don't appear to be applying themselves :/ - Francis Tyers
Preverbal particles - negative
As noted in 1.3.27, preverbal particles need to be addressed. Note that this and the following sections apply mainly to formal, written Welsh.
- ond ni threuliodd Cwchwlin ond ychydig o'r misoedd a oedd yn weddill iddo fo yn Nhŷ'r Bechgyn
- ->*but We spent *Cwchwlin but #a little<n><sg> of the months that was in remnant to him in the House of the Boys (p14)
- but Cwchwlin spent only a few of the months remaining to him in the Boys' House
For Welsh pattern "ni + MM_verb.inflected + subject" output English "subject + auxiliary + not + verb"
Without the auxiliary, this would produce archaic English ("he spent not"), which would be acceptable at a pinch, but Apertium may have standard rules for producing English negatives where a simple affirmative is replaced by a periphrastic negative (I go - I don't go; I went - I didn't go; etc).
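The periphrastic negative mentioned here (I go - I don't go; I went - I didn't go) can be sketched in a few lines of Python. This is only an illustration of English do-support under simplifying assumptions: the function name and the tense labels are hypothetical, the verb is passed in as a bare lemma, and "to be" and other auxiliaries, which take "not" directly, are not handled.

```python
# Hypothetical sketch of English do-support for a negated lexical verb.
# Tense labels ("past", "pres") are assumed for the example.

def negate(subject, lemma, tense, third_singular=False):
    """Return 'subject + auxiliary + not + verb' for a lexical verb."""
    if tense == "past":
        aux = "did"
    elif third_singular:
        aux = "does"
    else:
        aux = "do"
    # After the auxiliary the main verb appears as the bare lemma.
    return f"{subject} {aux} not {lemma}"
```

For example, `negate("the teacher", "buy", "past")` gives "the teacher did not buy", matching the target translation of "ni phrynodd yr athro".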
MM is the "mixed mutation" - use AM where possible, and SM everywhere else:
- c -> ch
- p -> ph
- t -> th
- g -> ---
- b -> f
- d -> dd
- ll -> l
- m -> f
- rh -> r
- ni phrynodd yr athro'r papur -> *We the teacher bought the paper - the teacher did not buy the paper
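The mixed mutation table above can be written down as a short Python sketch, for instance to generate test forms. It assumes lower-case input and treats "ch", "ph", "th" and "dd" as separate initial letters that do not mutate; the function and table names are made up for the example.

```python
# Sketch of the mixed mutation (MM) described above:
# aspirate mutation where possible, soft mutation everywhere else.
# Assumes lower-case input; names are illustrative only.

ASPIRATE = {"c": "ch", "p": "ph", "t": "th"}
SOFT = {"g": "", "b": "f", "d": "dd", "ll": "l", "m": "f", "rh": "r"}
MIXED = {**ASPIRATE, **SOFT}

# Digraph initials that are letters in their own right and stay unchanged.
NON_MUTATING = ("ch", "ph", "th", "dd")


def mixed_mutation(word):
    """Apply the mixed mutation to the start of a lower-case Welsh word."""
    if word.startswith(NON_MUTATING):
        return word
    # Try two-letter initials ("ll", "rh") before single letters.
    for length in (2, 1):
        prefix = word[:length]
        if prefix in MIXED:
            return MIXED[prefix] + word[length:]
    return word
```

This reproduces the forms used in the examples in this section: "prynodd" becomes "phrynodd", "treuliodd" becomes "threuliodd", and "gwelodd" becomes "welodd".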
"ni" becomes "nid" before a vowel, but not before a vowel "exposed" by the soft mutation of a "g":
- nid aeth i'r dref -> *Came went to the town - he did not go to the town
- ni welodd y bachgen y gath -> *We the boy saw the cat - the boy did not see the cat
- ni orffennodd fo'r dasg -> *We completed the task was - he did not complete the task
Note here that we seem to have a regression where the tagger is choosing an unlikely inflected form of "bod" instead of the pr.3p.sing.m. This may be related to the 1.3.29 issue with "i".
For Welsh pattern "verb_inflected + {fo, o, fe, e}" output English "he + verb_inflected"
Can we also try a more general rule based on negation, namely:
For Welsh pattern "previous_negation + ond" output "only"
That may not be possible, of course, in which case we'll have to try something else.
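One way to picture the "previous_negation + ond" rule is the Python sketch below. It is a toy illustration, not a proposal for the transfer file: it assumes the input tokens are already disambiguated (so that "ni" here is the negative particle and not the pronoun "we"), it ignores clause boundaries, and the function name is hypothetical.

```python
# Hypothetical sketch: gloss "ond" as "only" when a negative particle
# has been seen earlier, and as "but" otherwise.
# Assumes disambiguated tokens and a single clause.

NEGATIVE_PARTICLES = {"ni", "nid"}


def gloss_ond(tokens):
    """Return the English gloss chosen for each 'ond' in a token list."""
    glosses = []
    negated = False
    for token in tokens:
        word = token.lower()
        if word in NEGATIVE_PARTICLES:
            negated = True
        elif word == "ond":
            glosses.append("only" if negated else "but")
            negated = False  # one negation licenses one "only"
    return glosses
```

On the opening of the example sentence, `"ond ni threuliodd Cwchwlin ond ychydig".split()`, the first "ond" is glossed "but" and the second "only". The English verb would also have to be made affirmative ("spent only"), which this sketch does not attempt.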
The "ni" particle is typical of formal or written Welsh. Spoken or informal Welsh tends to use "ddim" after the verb. But we can leave that till later.
- I've added the particles "ni" and "nid", written a couple of disambiguation rules, and done a basic rule which just puts "not" after the verb. This doesn't work in most cases, but does seem to work nicely with "to be": "nid yw fe yn dref" → "he is not in town". This ruleset is likely to be extremely fragile. - Francis Tyers
"to be" in conditional subjunctive
- Roedd e'n meddwl mai gwaed anifail a welodd. → *He was thinking animal blood might be and saw.
- [Was] [he] [thinking] [might be] [blood animal] [and] [saw]
- Analysis:
^bod<vbser><pii><p3><sg>$ ^prpers<prn><subj><p3><m><sg>+yn<pr>$ ^meddwl<vblex><inf>$ ^bod<vbser><cns><p3><sg>$ ^gwaed<n><m><sg>$ ^anifail<n><m><sg>$ ^a<cnjcoo>$ ^gweld<vblex><past><p3><sg>$
This could be something like:
- He was thinking that it might be animal blood that he saw.
Suggestions welcome.
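For anyone wanting to inspect analyses like the one above outside the pipeline, a small Python sketch for splitting them up may help. It handles only the simple shape shown in this section, `^lemma<tag>...$` units with "+" joining sub-units; escaped characters and ambiguous readings separated by "/" are not covered, and the function name is made up for the example.

```python
import re

# Sketch of a reader for tagger output of the form ^lemma<tag><tag>$.
# Only the simple, unambiguous shape shown above is handled.

LEXICAL_UNIT = re.compile(r"\^(.*?)\$")
TAG = re.compile(r"<([^>]+)>")


def parse_analysis(stream):
    """Return a list of units; each unit is a list of (lemma, tags) pairs."""
    units = []
    for body in LEXICAL_UNIT.findall(stream):
        parts = []
        # "+" joins sub-units, e.g. a pronoun with the following "yn".
        for piece in body.split("+"):
            lemma = piece.split("<", 1)[0]
            parts.append((lemma, TAG.findall(piece)))
        units.append(parts)
    return units
```

Applied to the analysis above, the second unit comes back as `[("prpers", ["prn", "subj", "p3", "m", "sg"]), ("yn", ["pr"])]`, and the fourth shows "mai" analysed as `("bod", ["vbser", "cns", "p3", "sg"])`, which is where the "might be" comes from.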
Regression tests
- Main article: Welsh to English/Regression tests
Coolness factor
- Disclaimer: We're not deliberately aiming the translator at crime texts, it just seems to work best with these — a subject for investigation perhaps?
- Roedd y Comisiwn yn ymchwilio i'r honiadau bod yr AS wedi methu datgan £103,000 o roddion.
- the Commission Was investigating the allegations that the MP has failed to declare £103,000 of gifts.
- "He was the Commission crookedly ymchwiliad I ' group claims be he drives ACE has failed declare he gifts." (InterTran)
- Dywedodd yr heddlu fod y troseddau honedig wedi digwydd rhwng 2003 a 2007 yn Sir Benfro a Sir Gaerfyrddin.
- the police Said that the alleged crimes have happened between 2003 and 2007 in Pembrokeshire and Carmarthenshire.
- "He said he drives police force be the transgressions alleged has happened between 2003 I go 2007 crookedly Shire ble I go Shire Gaerfyrddin." (InterTran)
- Mae'r heddlu hefyd yn ymchwilio i honiadau ei bod hi'n cael perthynas â dyn llawer hŷn.
- the police Are also investigating his allegations be she getting relation with a much older man.
- "He ' is being group police force also crookedly ymchwiliad I claims you go be she ' heartburn have relation he goes tight much hn & #375." (InterTran)
- Mae Ymddiriedolaeth Caerdydd a'r Fro yn gwrthod dweud faint mae'r driniaeth yn costio, ond yn ôl papur newydd y Sun mae'r driniaeth cyffuriau yn costio £2,500 y mis.
- Cardiff Trust and the Region Is refusing say size the treatment is costing , but according to the newspaper Sun the treatment is drugs costing £2,500 the month.
- "He is being Trust Cardiff I ' go group Land refusing say as many he ' is being group treatment costing , except according to paper news the Sun he ' is being group treatment ingredients costing the month." (InterTran)
- Ond byddwn ni'n parhau i weithio gyda'n staff, gwirfoddolwyr, cystadleuwyr, cwsmeriaid, partneriaid, a chyrff eraill i wella ein gwasanaeth Cymraeg.
- But we will be continuing to work within a staff, volunteers, competitors, customers, partners, and other bodies to improve our Welsh service.
- "Except we will be we ' heartburn last I work with ' heartburn staff , volunteers , competitors , customers partneriaid , I go bodies other I improve our service Welsh." (InterTran)