Talk:Norwegian Nynorsk and Norwegian Bokmål

Revision as of 09:52, 22 September 2010 by Unhammer (talk | contribs) (old)

Notes on bokmål NP structure

Possible phrases to put in an NP slot (based on Dyvik 2000, pp. 11–13), with Apertium tags:

  • året<n>
  • et<det><ind> år<n>
  • mange<adj> år<n>
  • de<det> mange<adj> årene<n>
  • alle<det><qnt> de<det><def> mange<adj> årene<n> dine<det><pos>
    • all the many years yours
  • alle<det><qnt> disse<det><def> dine<det><pos> seksti<num> år<n> som<cnjsub> gikk<vblex>
    • all these your sixty years which went
  • alle<det><qnt> som<cnjsub> gikk<vblex>
  • mange<adj>
  • mange<adj> raske<adj>
  • *raske<adj>
    • (That is, we can't say "Gi meg raske" ("Give me quick (ones)"), but we can say "Gi meg noen raske" ("Give me some quick (ones)").)

Dyvik's analysis (based on Vangsnes 1999) looks more or less like:

NOM = { Prop | PRON | AllQP | DP | PossP | QuantP | NP } 

AllQP    →  allq ({ DP  | PossP | QuantP | NP }) 
DP       →         det ({ PossP | QuantP | NP }) 
PossP    → { Poss |  NOM<gen> } ({ QuantP | NP }) 
QuantP   →    { QP | num | art }    qnt    (NP)
NP       →                            AP*   n   (poss)  (CP)

(This is flattened a lot and disregards f-structure.) Assuming AP gives adj, we get:

[AllQP alle<allq> [DP disse<det> [PossP dine<poss> [QuantP mange<qnt> [NP gode<adj> år<n> [CP som gikk] ] ] ] ] ]

[AllQP [QuantP mange<qnt> ] ] 

and so on, but this doesn't allow "Gi meg mange raske (*som gikk)" or "Gi meg alle de raske (som gikk)", so the NP needs to be more lax on the presence of n.

Compounds

There are several issues with compounding. First of all, we need to do the decompounding analysis. This could happen by changing lt-proc (fst_processor.cc) so that unknown words are sent to a decompounding function that tries various strategies (longest-match left-to-right lookup, minimum number of cuts, etc.). That part should be relatively easy (People Have Done This Before).
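
To make the longest-match left-to-right strategy concrete, here is a minimal Python sketch. It is not lt-proc code: the function name `longest_match_split` and the toy `LEXICON` are made up for illustration, and a real version would look prefixes up in the compiled dictionary instead of a Python set.

```python
# Toy lexicon (assumption); a real decompounder would query the compiled monodix.
LEXICON = {"språk", "sivilisert", "informasjon", "samfunn"}

def longest_match_split(word, lexicon=LEXICON):
    """Greedy left-to-right split: at each position take the longest known prefix.

    Returns a list of parts, or None if some position has no known prefix.
    There is no backtracking, so a greedy choice can block a valid split.
    """
    parts = []
    pos = 0
    while pos < len(word):
        # Try the longest candidate first, then shrink.
        for end in range(len(word), pos, -1):
            if word[pos:end] in lexicon:
                parts.append(word[pos:end])
                pos = end
                break
        else:
            return None  # nothing in the lexicon starts here
    return parts

print(longest_match_split("språksivilisert"))
print(longest_match_split("informasjonssamfunn"))
```

The second call returns None because this sketch knows nothing about the epenthetic -s- between informasjon and samfunn, so epenthetics need separate handling.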

The practical problems come when trying to integrate this with disambiguation, transfer and generation in Apertium.

A naïve attempt: say we read språksivilisert informasjonssamfunn (neither of which is in the dictionary); lt-proc could output something like:

^språk/språk<n><compound><nt><pl><ind>/språk<n><compound><nt><sg><ind>$^sivilisert/sivilisere<vblex><pp>/sivilisere<adj><pp><nt><sg><ind>/sivilisere<adj><pp><mf><sg><ind>$

^informasjon/informasjon<n><compound><m><sg><ind>$^s/s<epenthetic><compound> $^samfunn/samfunn<n><nt><pl><ind>/samfunn<n><nt><sg><ind>$

If the decompounder doesn't add spaces, we don't get spaces in final output. s<epenthetic><compound> is RL, only generated by the lt-proc decompounder; probably we can have a language-specific list of possible epenthetics somewhere, like s and e for Norwegian.

This will seriously mess up CG though, since we can have e.g. verb+noun, adj+noun, adj+verb compounds, etc. Only the last part matters for disambiguation (i.e. språksivilisert is equivalent to sivilisert wrt. disambiguation), so I guess the simplest way would be to somehow make CG pretend that the first part doesn't exist, e.g. by making CG ignore words with <compound>. That should be easy, just like superblanks, although it feels like a rather arbitrary change. Actually making sure CG ignores <compound>-tagged words without changing cg-proc would be a whole lot more work.
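
A rough Python sketch of the "pretend the first part doesn't exist" idea, working on the stream format shown above. This is not how cg-proc is implemented; the helper names (`parse_units`, `is_compound_member`, `visible_to_cg`) are hypothetical, and escaping and superblanks in the stream are ignored for brevity.

```python
import re

# One lexical unit: ^surface/reading1/reading2$ (escapes not handled in this sketch).
LU = re.compile(r"\^([^$]*)\$")

def parse_units(stream):
    """Return a list of (surface, [readings]) pairs from an Apertium-style stream."""
    units = []
    for match in LU.finditer(stream):
        surface, *readings = match.group(1).split("/")
        units.append((surface, readings))
    return units

def is_compound_member(readings):
    """True if every reading of the unit carries the <compound> tag."""
    return bool(readings) and all("<compound>" in r for r in readings)

def visible_to_cg(stream):
    """Units the disambiguator would see if <compound>-tagged units were skipped."""
    return [(s, r) for s, r in parse_units(stream) if not is_compound_member(r)]

stream = ("^språk/språk<n><compound><nt><pl><ind>/språk<n><compound><nt><sg><ind>$"
          "^sivilisert/sivilisere<vblex><pp>/sivilisere<adj><pp><nt><sg><ind>$")
print(visible_to_cg(stream))
```

Only the sivilisert unit survives the filter, which is the part that matters for disambiguation.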

For transfer, we could probably do multi-stage transfer and chunk all <compound>-tagged words. The second stage does the actual transfer stuff (moving things around, concordance), using these chunks and thus ignoring any <compound>-tagged words. So we can probably get by without changing the transfer module.

Another option would be to send it to CG as ^språksivilisert/språk<n><compound><nt><pl><ind>+sivilisere<vblex><pp>/språk<n><compound><nt><sg><ind>+sivilisere<vblex><pp>/språk<n><compound><nt><pl><ind>+sivilisere<adj><pp><nt><sg><ind>/...$ and then change apertium-pretransfer to not put a space where + is. Alternatively we could just make a new symbol, e.g. = or &, to take the place of +, which pretransfer just splits without inserting a space. - Francis Tyers 11:06, 16 September 2009 (UTC)
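
As a sketch of what that splitting step might look like (this is not apertium-pretransfer code; `split_reading` and its `joiner`/`space` parameters are made up here), a joined reading can be cut into separate lexical units with or without a space between them:

```python
def split_reading(reading, joiner="+", space=True):
    """Split one joined reading into separate lexical units.

    With space=True the units are separated by a blank, as the comment above
    describes for '+'; with space=False they are emitted back to back, which
    is what a compound needs.
    """
    separator = " " if space else ""
    return separator.join("^" + part + "$" for part in reading.split(joiner))

reading = "språk<n><compound><nt><sg><ind>+sivilisere<vblex><pp>"
print(split_reading(reading))               # current behaviour for '+'
print(split_reading(reading, space=False))  # proposed behaviour for compounds

# The same function with a dedicated joiner symbol such as '&':
print(split_reading(reading.replace("+", "&"), joiner="&", space=False))
```

A dedicated joiner has the advantage that ordinary +-joined multiwords keep their space while compounds do not.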

Disambiguating between compound types

Some heuristics given by Johannessen & Hauglin:

  • Compounds are always binary – they contain two members only (apart from possible epenthetic phones).
  • Choose the analysis (or analyses) with the fewest compound members
  • Lexical compounding is preferable to compounding with epenthetic phones.
    • Note: the first part is always just a stem. We don't have "cars.sick", only "car.sick", etc. (Well, that's what they say, but then there's "savnet.melding" ("missing.message") where the first part is inflected)
  • Epenthetic -s- is preferred to lexical compounding when the -s- can be ambiguous between epenthetic use and the first letter of a verbal last member.
    • First-longest-match gets in the way of this...
  • Epenthetic -s- can only follow noun stems.
    • Sounds like a job for CG.
  • Epenthetic -s- is preferred to lexical compounding when the first member is itself a compound.
    • But that would involve marking lots of words in monodix (unless we want to do multistem compounding, which we probably don't). Most likely handled by longest-match left-to-right, though.
  • Epenthetic -s- cannot follow epenthetic -e- and vice versa.
    • This could be hardcoded into a decompounder, since if a language allows something that looks like two epenthetics, you just add that as another epenthetic in your language-specific list.
  • If two analyses have the same number of members and there is no epenthesis involved, choose the one, if any, that is a noun.
    • We can't do this, can we? That'd involve sending not only all analyses of the first longest matches, but all possible analyses into some compound disambiguation.
  • If two analyses are equal with respect to epenthesis and part of speech, and one has a first member that is itself a compound, then choose that one.
    • Most likely handled by longest-match left-to-right.
  • If the first member is unknown, choose the analysis with the longest last member.
    • Not handled by longest-match left-to-right? Or?
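
A few of the heuristics above are easy to try out in a small Python sketch: binary compounds only, epenthetic -s- only after noun stems, and lexical compounding ranked before epenthetic compounding. The lexicon with its POS labels, the epenthetics list and the function name `binary_analyses` are toy assumptions, not anything from apertium-nn-nb. The rule that prefers epenthetic -s- before a verbal last member is not implemented.

```python
# Toy lexicon mapping stems to a POS label (assumption, for illustration only).
LEXICON = {"informasjon": "n", "samfunn": "n", "språk": "n",
           "sivilisert": "adj", "dag": "n", "skatt": "n", "katt": "n"}
EPENTHETICS = ("s", "e")  # language-specific list, here for Norwegian

def binary_analyses(word, lexicon=LEXICON, epenthetics=EPENTHETICS):
    """Return (first, epenthetic, last) triples, lexical analyses first."""
    analyses = []
    for i in range(1, len(word)):
        first, rest = word[:i], word[i:]
        if first not in lexicon:
            continue
        if rest in lexicon:
            analyses.append((first, "", rest))  # plain lexical compound
        for ep in epenthetics:
            if ep == "s" and lexicon[first] != "n":
                continue  # epenthetic -s- can only follow noun stems
            if rest.startswith(ep) and rest[len(ep):] in lexicon:
                analyses.append((first, ep, rest[len(ep):]))
    # Stable sort: analyses without an epenthetic come before those with one.
    return sorted(analyses, key=lambda a: a[1] != "")

print(binary_analyses("informasjonssamfunn"))
print(binary_analyses("dagskatt"))
```

For dagskatt both dag+skatt and dag+s+katt are found, and the lexical analysis is ranked first.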

Stuff we most likely won't do:

  • Epenthetic -e- can only be attached to a stem that is monosyllabic. This does not mean that epenthetic -e- must always make up the second syllable in a word, however. Other possible stems can be prior to the stem preceding the -e-, if they do not form a compound with that stem.
  • Epenthetic -s- cannot occur after a sibilant or a final consonant sequence containing a sibilant except when the consonant sequence belongs to a compound.

Some compound stats

[1] gives a list of compounds, POS-tagged on both members and classified by epenthesis type etc. Of ~25000 compounds: ~10000 had epenthesis, 954 had an adjective as one member (about 50/50 on first and last), 327 had a verb as one member (again 50/50).

Of the 15904 non-epenthetic compounds, 674 were analysed by apertium-nn-nb. When putting nouns, adjectives and verbs into an "inconditional" section, 15889 were analysed, although many of these analyses were of course wrong. Of the 15889 naïvely decompounded analyses of non-epenthetic compounds, only 5369 were analysed as two words, so we definitely need to restrict decompounding to two-word analyses (with possible epenthetics). (After using only nouns, and removing single letters, this figure went up to 7512.)
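
For a quick sense of proportion, the figures above can be turned into percentages with a few lines of Python. The variable names are made up; the numbers are the ones quoted in the paragraph. The 7512 figure is left out since its denominator isn't stated.

```python
# Figures quoted in the paragraph above.
non_epenthetic = 15904   # non-epenthetic compounds in the list
analysed_plain = 674     # analysed by apertium-nn-nb as-is
analysed_naive = 15889   # analysed with n/adj/vblex in an "inconditional" section
two_member = 5369        # naive analyses that consist of exactly two words

def pct(part, whole):
    """Percentage of whole, rounded to one decimal."""
    return round(100 * part / whole, 1)

print(pct(analysed_plain, non_epenthetic))  # coverage without decompounding
print(pct(analysed_naive, non_epenthetic))  # coverage with naive decompounding
print(pct(two_member, analysed_naive))      # share of naive analyses that are binary
```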