Difference between revisions of "Norwegian Nynorsk and Norwegian Bokmål/arkiv"

[[Category:Norwegian Nynorsk and Norwegian Bokmål]]

Latest revision as of 10:24, 22 September 2010

Old stuff that isn't very relevant or helpful any more.

Decisions to make, variants

«Dagens BT» (time nouns and genitive)

There are already exceptions that translate «et års tid» to «eit års tid» (rather than «tida til eit år»); we could create a class of nouns that should all be translated by a different rule when they appear in the definite form:

  • (nb) dagens BT → BT i dag
  • (nb) årets festival → festivalen i år

For now this is done with a special category ngen_time_meas in transfer; since the rule for ngen_time_meas comes first in the file, the more general ngen rules further down are not tried.
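The "specific rule first, general rule as fallback" ordering can be illustrated with a minimal Python sketch. This is not how Apertium transfer is implemented; the lookup tables and function names (`TIME_MEAS`, `DEFINITE`, `GEN_TO_DEF`, `translate_ngen`) are hypothetical and only cover the examples used here.

```python
# Toy tables, assumptions for illustration only (not taken from any dictionary file).
TIME_MEAS = {"dagens": "i dag", "årets": "i år"}        # time nouns in genitive
DEFINITE = {"festival": "festivalen", "katt": "katten"}  # indefinite -> definite
GEN_TO_DEF = {"naboens": "naboen"}                       # genitive -> definite

def ngen_time_meas(gen, head):
    """Specific rule: 'dagens X' -> 'X i dag'. Returns None if it does not apply."""
    if gen not in TIME_MEAS:
        return None
    return f"{DEFINITE.get(head, head)} {TIME_MEAS[gen]}"

def ngen_general(gen, head):
    """General rule: 'X-s Y' -> 'Y-def til X-def'. Returns None if it does not apply."""
    if gen not in GEN_TO_DEF:
        return None
    return f"{DEFINITE.get(head, head)} til {GEN_TO_DEF[gen]}"

# Order matters: the first rule that matches wins, so the general rule
# is never tried for time nouns.
RULES = [ngen_time_meas, ngen_general]

def translate_ngen(gen, head):
    for rule in RULES:
        result = rule(gen, head)
        if result is not None:
            return result
    return f"{gen} {head}"  # no rule matched: leave the phrase unchanged

print(translate_ngen("dagens", "BT"))
print(translate_ngen("årets", "festival"))
print(translate_ngen("naboens", "katt"))
```

Swapping the order of `RULES` would not change the output for these three inputs, because the toy tables do not overlap; in a real rule file the general pattern would also match the time nouns, which is why the specific rule has to come first.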

LR and optional forms

I have used the nynorsk.org template in many cases, just to have something reasonably consistent to go by when choosing between suffixes and the like. --Unhammer 18:07, 6 July 2009 (UTC)

For now the LRs are set so that we have, for Nynorsk nouns:

  • tempusane (not tempora(a))
  • sagaa (not sagai)
  • dunderar (not dundrar)

for Nynorsk verbs:

  • vemdest and har vemst (not vemtest or har vems)
  • tømde and har tømt (not tømte or har tømd)

for Nynorsk adjectival participles:

  • laten and latne (mf, pl) (not lata)
  • late (nt) (not lati or lata)
  • degd (mf) and degt (nt)
  • treden (mf), trede (nt), tredne (pl/def) (not tredd/tredt/tredde, where we have the choice between the two subparadigms)

(certain participle paradigms with LRs cover several lemmas; most of these are marked «wsd todo»...)

etc.

For Bokmål nouns I have also avoided tempora etc., and have chosen

  • peppere (not pepperer or peprer)
  • kapitlet (not kapittelet)

Feel free to write on the discussion page if you disagree with the choice of forms! (For example, there may be groups of words that should not follow the rule above, while other words should.)

(See also the section on WSD problems below; some LRs are there simply because at least one lemma has to be chosen.)

bli/vart/vorte?

  • bli -r, blei, blitt ?
  • bli -r, vart, vorte ?
  • verte, vert, vart, vorte ?

(Go for frequency? In nn.dix I chose to put LR on 'litle', 'vetle' & 'lisle' since 'vesle' had the highest frequency in [avis.uib.no]; nb of course has LR on 'vesle'.)

Chosen: verte all the way, by popular demand.

Agreement inflection for participles

A quick search in the Oslo corpus of tagged Nynorsk texts suggests that this simply does not happen (check e.g. «levd/levt» followed by a noun), even though no.wikipedia says that participles should be inflected. Erik from i18n-no wants to «limit the use» but mentions that «køyrd - køyrt - køyrde and dømd - dømt - dømde» are obligatory. Implementation-wise, it might then be just as easy to introduce it for all forms? (Or perhaps it is simpler to use adjective forms for these. Until we get variants.)

mange-fleire-flest as adjective?

Oslo-Bergen-taggeren represents as adjectives anything that can have pst/comp/sup; but other language pairs have this as a determiner... so, tagging it as an adjective makes it easier to work with OBT, but perhaps harder to move between other Scandinavian languages.


Notes on Bokmål NP structure

Possible phrases to put in an NP slot (based on Dyvik 2000, pp. 11-13) with Apertium tags:

  • året<n>
  • et<det><ind> år<n>
  • mange<adj> år<n>
  • de<det> mange<adj> årene<n>
  • alle<det><qnt> de<det><def> mange<adj> årene<n> dine<det><pos>
    • all the many years yours
  • alle<det><qnt> disse<det><def> dine<det><pos> seksti<num> år<n> som<cnjsub> gikk<vblex>
    • all these your sixty years which went
  • alle<det><qnt> som<cnjsub> gikk<vblex>
  • mange<adj>
  • mange<adj> raske<adj>
  • *raske<adj>
    • (That is, we can't say "Gi meg raske." (Give me quick (ones).) but we can say "Gi meg noen raske." (Give me some quick (ones).).)

Dyvik's analysis (based on Vangsnes 1999) looks more or less like:

NOM = { Prop | PRON | AllQP | DP | PossP | QuantP | NP } 

AllQP    →  allq ({ DP  | PossP | QuantP | NP }) 
DP       →         det ({ PossP | QuantP | NP }) 
PossP    → { Poss |  NOM<gen> } ({ QuantP | NP }) 
QuantP   →    { QP | num | art }    qnt    (NP)
NP       →                            AP*   n   (poss)  (CP)

(This is flattened a lot and disregards f-structure.) Assuming AP gives adj, we get:

[AllQP alle<allq> [DP disse<det> [PossP dine<poss> [QuantP mange<qnt> [NP gode<adj> år<n> [CP som gikk] ] ] ] ] ]

[AllQP [QuantP mange<qnt> ] ] 

and so on, but this doesn't allow "Gi meg mange raske (*som gikk)" or "Gi meg alle de raske (som gikk)", so the NP needs to be more lax on the presence of n.
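As a rough illustration of what "more lax on the presence of n" could mean, here is a Python sketch that checks a flat tag sequence against the flattened grammar. It is a simplification under several assumptions: the tag names (`allq`, `det`, `poss`, `qnt`, `adj`, `n`, `cp`) follow the bracketed example rather than real Apertium tags, genitive NOM possessors and the `QP | num | art` part of QuantP are left out, and the function name `fits_nom` is made up.

```python
def fits_nom(tags, lax_noun=False):
    """Check a tag sequence against a flattened allq-det-poss-qnt-adj*-n pattern.

    With lax_noun=False the noun is required whenever an adjective or a
    relative clause (cp) is present. With lax_noun=True the noun may be
    dropped as long as some functional element precedes the adjectives.
    """
    i = 0
    functional = 0
    # Optional functional heads, in the fixed order AllQP > DP > PossP > QuantP.
    for slot in ("allq", "det", "poss", "qnt"):
        if i < len(tags) and tags[i] == slot:
            i += 1
            functional += 1
    # Any number of adjectives.
    adjs = 0
    while i < len(tags) and tags[i] == "adj":
        i += 1
        adjs += 1
    # The noun, optionally followed by a postposed possessive.
    has_noun = i < len(tags) and tags[i] == "n"
    if has_noun:
        i += 1
        if i < len(tags) and tags[i] == "poss":
            i += 1
    # Optional relative clause.
    has_cp = i < len(tags) and tags[i] == "cp"
    if has_cp:
        i += 1
    if i != len(tags):
        return False          # leftover material that the pattern cannot place
    if has_noun:
        return True
    if functional == 0:
        return False          # bare adjectives: *raske
    if adjs == 0 and not has_cp:
        return True           # the phrase stops before NP, e.g. [AllQP [QuantP mange]]
    return lax_noun           # nounless NP: only accepted in the lax variant

print(fits_nom(["qnt", "adj"]))                  # mange raske, strict
print(fits_nom(["qnt", "adj"], lax_noun=True))   # mange raske, lax
print(fits_nom(["adj"], lax_noun=True))          # *raske
```

The sketch does not model the contrast between "mange raske (*som gikk)" and "alle de raske (som gikk)"; in the lax variant a relative clause is accepted after any nounless phrase.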

Compounds

There are several issues with compounding. First of all, we need to do the decompounding analysis. This could happen by changing lt-proc (fst_processor.cc) so that unknown words are sent to a decompounding-function that tries various strategies (looking up longest-match left-to-right / minimum cuts etc.). But that should be relatively easy (People Have Done This Before).

The practical problems come when trying to integrate this with disambiguation, transfer and generation in Apertium.

A naïve attempt: say we read språksivilisert informasjonssamfunn (neither in the dictionary), lt-proc could output something like:

^språk/språk<n><compound><nt><pl><ind>/språk<n><compound><nt><sg><ind>$^sivilisert/sivilisere<vblex><pp>/sivilisere<adj><pp><nt><sg><ind>/sivilisere<adj><pp><mf><sg><ind>$

^informasjon/informasjon<n><compound><m><sg><ind>$^s/s<epenthetic><compound> $^samfunn/samfunn<n><nt><pl><ind>/samfunn<n><nt><sg><ind>$

If the decompounder doesn't add spaces, we don't get spaces in final output. s<epenthetic><compound> is RL, only generated by the lt-proc decompounder; probably we can have a language-specific list of possible epenthetics somewhere, like s and e for Norwegian.
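A minimal Python sketch of a longest-match left-to-right decompounder with a language-specific epenthetic list might look as follows. It works on plain strings and a toy lexicon (an assumption for illustration), not on lt-proc's transducers, and the function name `decompound` is made up.

```python
def decompound(word, lexicon, epenthetics=("s", "e")):
    """Split word into lexicon entries, optionally joined by epenthetics.

    Strategy: at each position try the longest lexicon match first and
    backtrack to shorter ones; an epenthetic is only tried when no lexical
    split works, never at the start or end of the word, and never directly
    after another epenthetic. Returns a list of parts, or None.
    """
    def split_from(start, allow_epenthetic):
        if start == len(word):
            return []
        # Longest lexicon match first.
        for end in range(len(word), start, -1):
            if word[start:end] in lexicon:
                rest = split_from(end, True)
                if rest is not None:
                    return [word[start:end]] + rest
        # Fall back to an epenthetic followed by further material.
        if allow_epenthetic:
            for ep in epenthetics:
                if word.startswith(ep, start) and start + len(ep) < len(word):
                    rest = split_from(start + len(ep), False)
                    if rest is not None:
                        return [ep] + rest
        return None

    return split_from(0, False)

# Toy lexicon: an assumption, standing in for the monolingual dictionary.
LEXICON = {"språk", "sivilisert", "informasjon", "samfunn"}

print(decompound("språksivilisert", LEXICON))
print(decompound("informasjonssamfunn", LEXICON))
print(decompound("xyzzy", LEXICON))
```

The sketch returns only the first split it finds and does not restrict the number of members; a real decompounder would also have to return the morphological analyses of each part.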

This will seriously mess up CG though, since we can have e.g. verb+noun/adj+noun/adj+verb compounds, etc. Only the last part of it matters for disambiguation (i.e. språksivilisert is equivalent to sivilisert with respect to disambiguation), so I guess the simplest way would be to somehow make CG pretend that the first part doesn't exist, for instance by making CG ignore words with <compound>. That should be easy, just like superblanks, although it feels like a rather arbitrary change. Actually making sure CG ignores <compound>-tagged words without changing cg-proc would be a whole lot more work.

For transfer, we could probably do multi-stage transfer and chunk all <compound>-tagged words. The second stage does the actual transfer stuff (moving things around, concordance), using these chunks and thus ignoring any <compound>-tagged words. So we can probably get by without changing the transfer module.

Another option would be to send it to CG as ^språksivilisert/språk<n><compound><nt><pl><ind>+sivilisere<vblex><pp>/språk<n><compound><nt><sg><ind>+sivilisere<vblex><pp>/språk<n><compound><nt><pl><ind>+sivilisere<adj><pp><nt><sg><ind>/...$ And then change apertium-pretransfer to not put a space where + is. Alternatively we could just make a new symbol, e.g. = or &, to take the place of +, which pretransfer just splits without inserting a space. - Francis Tyers 11:06, 16 September 2009 (UTC)
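To make the joined representation concrete, the following Python sketch builds one lexical unit from the per-part analyses by combining every reading of the first part with every reading of the second. The function name `join_analyses` is hypothetical, and the order of the readings is simply whatever `itertools.product` yields, not necessarily the order lt-proc would produce.

```python
from itertools import product

def join_analyses(surface, per_part, sep="+"):
    """Combine the analyses of each compound part into one lexical unit.

    per_part is a list with one list of analyses per compound member;
    every combination becomes a single reading joined with sep.
    """
    readings = [sep.join(combo) for combo in product(*per_part)]
    return "^" + surface + "/" + "/".join(readings) + "$"

parts = [
    ["språk<n><compound><nt><pl><ind>", "språk<n><compound><nt><sg><ind>"],
    ["sivilisere<vblex><pp>", "sivilisere<adj><pp><nt><sg><ind>"],
]
print(join_analyses("språksivilisert", parts))
# Using another separator, as suggested above, only changes the sep argument:
print(join_analyses("språksivilisert", parts, sep="="))
```

The number of readings grows as the product of the per-part ambiguity, so two parts with two and four analyses give eight readings for CG to choose between.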

Disambiguating between compound types

Some heuristics given by Johannesen & Hauglin:

  • Compounds are always binary – they contain two members only (apart from possible epenthetic phones).
  • Choose the analysis (or analyses) with the fewest compound members
  • Lexical compounding is preferable to compounding with epenthetic phones.
    • Note: the first part is always just a stem. We don't have "cars.sick", only "car.sick", etc. (Well, that's what they say, but then there's "savnet.melding" ("missing.message") where the first part is inflected)
  • Epenthetic -s- is preferred to lexical compounding when the -s- can be ambiguous between epenthetic use and the first letter of a verbal last member.
    • First-longest-match gets in the way of this...
  • Epenthetic -s- can only follow noun stems.
    • Sounds like a job for CG.
  • Epenthetic -s- is preferred to lexical compounding when the first member is itself a compound.
    • but that would involve marking lots of words in monodix... (unless we want to do multistem compounding, which we probably don't). Most likely handled by longest-match left-to-right, though
  • Epenthetic -s- cannot follow epenthetic -e- and vice versa.
    • This could be hardcoded into a decompounder, since if a language allows something that looks like two epenthetics, you just add that as another epenthetic in your language-specific list.
  • If two analyses have the same number of members and there is no epenthesis involved, choose the one, if any, that is a noun.
    • We can't do this, can we? That'd involve sending not only all analyses of the first longest matches, but all possible analyses into some compound disambiguation.
  • If two analyses are equal with respect to epenthesis and part of speech, and one has a first member that is itself a compound, then choose that one.
    • Most likely handled by longest-match left-to-right.
  • If the first member is unknown, choose the analysis with the longest last member.
    • Not handled by longest-match left-to-right? Or?
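Two of the heuristics above (fewest members, and lexical compounding before epenthesis) can be sketched as a simple ranking over candidate splits. This Python sketch is only an illustration: the representation of an analysis as a list of `(form, kind)` pairs and the candidate splits are made up, and the heuristics that prefer epenthetic -s- in specific contexts are not modelled.

```python
def rank_key(analysis):
    """Lower is better: fewest members first, then fewest epenthetics."""
    n_members = sum(1 for _, kind in analysis if kind == "stem")
    n_epenthetic = sum(1 for _, kind in analysis if kind == "epenthetic")
    return (n_members, n_epenthetic)

def best_analyses(analyses):
    """Return every analysis that shares the best rank (there may be ties)."""
    best = min(rank_key(a) for a in analyses)
    return [a for a in analyses if rank_key(a) == best]

# Hypothetical candidate splits of "informasjonssamfunn".
two_members = [("informasjon", "stem"), ("s", "epenthetic"), ("samfunn", "stem")]
three_members = [("informasjon", "stem"), ("s", "epenthetic"),
                 ("sam", "stem"), ("funn", "stem")]

print(best_analyses([three_members, two_members]))
```

Because ties are returned as a list, a later step (for instance CG, as suggested for the noun-stem constraint) would still have to pick among equally ranked analyses.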

Stuff we most likely won't do:

  • Epenthetic -e- can only be attached to a stem that is monosyllabic. This does not mean that epenthetic -e- must always make up the second syllable in a word, however. Other possible stems can be prior to the stem preceding the -e-, if they do not form a compound with that stem.
  • Epenthetic -s- cannot occur after a sibilant or a final consonant sequence containing a sibilant except when the consonant sequence belongs to a compound.

Some compound stats

[1] gives a list of compounds, POS-tagged on both members and classified by epenthesis type etc. Of ~25000 compounds: ~10000 had epenthesis, 954 had an adjective as one member (about 50/50 on first and last), 327 had a verb as one member (again 50/50).

Of the 15904 non-epenthetic compounds, 674 were analysed by apertium-nn-nb. When putting nouns, adjectives and verbs into an "inconditional" section, 15889 were analysed, although many of these analyses were of course wrong. Of the 15889 naïvely decompounded analyses of non-epenthetic compounds, only 5369 were analysed as two words, so we definitely need to restrict decompounding to two-word analyses (with possible epenthetics). (After using only nouns, and removing single letters, this figure went up to 7512.)
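For reference, the proportions behind these counts can be worked out with a few lines of Python. Only the figures quoted above are used; the 7512 figure is left out because the text does not say what total it should be compared against.

```python
# Counts quoted in the text above.
non_epenthetic = 15904     # non-epenthetic compounds in the list
analysed_by_pair = 674     # analysed by apertium-nn-nb as it was
analysed_naive = 15889     # analysed with naive decompounding
two_word = 5369            # naive analyses that consisted of two words

def pct(part, whole):
    """Percentage of whole, rounded to one decimal."""
    return round(100 * part / whole, 1)

coverage_pair = pct(analysed_by_pair, non_epenthetic)
coverage_naive = pct(analysed_naive, non_epenthetic)
two_word_share = pct(two_word, analysed_naive)

print(f"analysed without decompounding: {coverage_pair}%")
print(f"analysed with naive decompounding: {coverage_naive}%")
print(f"of those, analysed as two words: {two_word_share}%")
```

So naive decompounding analyses almost everything, but only about a third of those analyses have the expected two-member shape.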

Transfer

Genitive/possessive

A search in the Oslo-Bergen corpus of tagged Bokmål texts for genitive nouns followed by a string of at least one adjective showed that only 1 of 16258 hits had more than 4 adjectives in the string. So we only need a finite set of transfer rules to handle:

  • (nb) Min snute → Snuten min
  • (nb) Min sorte snute → Den svarte snuten min
  • (nb) Min katts snute → Snuten til katten min
  • (nb) Min gamle katts snute → Snuten til den gamle katten min
  • (nb) Min katts sorte snute → Den svarte snuten til katten min
  • (nb) Min lille gamle katts sorte snute → Den svarte snuten til den vesle gamle katten min

So far there are 3 types of rules for the possessive phrases:

  • POSGEN ADJ* NIND
    • min/naboens (sorte) katt
  • POSGEN ADJ* NGEN ADJ* NIND
    • min/naboens (sorte) katts (hvite) snute
  • DETNONPOS ADJ* NGEN ADJ* NOM
    • en (sort) katts (hvite) snute

(the last two as of yet only have single ADJ-rules, some copy-paste still todo)
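The reordering that these rules perform can be sketched in Python, independently of the adjective count. This is not the transfer rule code: the function names are made up, the words are passed in already translated and inflected (no lexical transfer or generation is done), and the determiner "den" is hard-coded, so gender and number agreement (det/dei) is an acknowledged gap.

```python
def definite_np(adjs, noun_def, poss=None):
    """Build 'den ADJ* NOUN-def (POSS)'; 'den' is only added before adjectives."""
    words = (["den"] if adjs else []) + list(adjs) + [noun_def]
    if poss:
        words.append(poss)
    return " ".join(words)

def possessive_phrase(poss, owned_adjs, owned_def, owner_adjs=None, owner_def=None):
    """POSGEN ADJ* NIND, or POSGEN ADJ* NGEN ADJ* NIND when an owner is given."""
    if owner_def is None:
        # min (sorte) snute -> (den svarte) snuten min
        return definite_np(owned_adjs, owned_def, poss)
    # min (gamle) katts (sorte) snute -> (den svarte) snuten til (den gamle) katten min
    owner = definite_np(owner_adjs or [], owner_def, poss)
    return definite_np(owned_adjs, owned_def) + " til " + owner

print(possessive_phrase("min", [], "snuten"))
print(possessive_phrase("min", ["svarte"], "snuten"))
print(possessive_phrase("min", [], "snuten", [], "katten"))
print(possessive_phrase("min", ["svarte"], "snuten", ["vesle", "gamle"], "katten"))
```

Since the adjective lists can have any length here, the sketch sidesteps the copy-paste problem; in the transfer rule files each adjective count needs its own pattern, which is why the corpus count of at most four adjectives matters.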

Passive

At the moment, we have:

  • (nb) Boken leses → Boka blir lese (pres)
  • (nb) Boken må leses → Boka må lesast (inf)
  • (nb) Boken ble lest → Boka vart lese (past)
  • (nn) Boka blir lese → Boken leses
  • (nn) Å bli lese → Å leses
  • (nn) Boka kan lesast → Boken kan leses
  • (nn) Boka vart lese → Boken ble lest

So past-tense morphological passive in Bokmål, "boken lestes", is currently not in dix (nor in Norsk Ordbank, it seems), and is low-frequency enough not to matter much yet(?). The nn=>nb transfer rule for "bli vblex" only matches present and infinitive.

Note: only Bokmål has both infinitive and present st-forms of non-st-verbs – there are no Nynorsk <pres><pst> entries (only <inf><pst>).
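The nb→nn side of this asymmetry can be summarised in a small Python sketch that maps a Bokmål st-form analysis to the Nynorsk tag sequence to generate. It is an illustration only: the function name is made up, the lemma is assumed to be already translated, and the auxiliary lemma is a parameter because the examples above use "bli" while the variant discussion settled on "verte".

```python
def nb_st_form_to_nn(lemma, tense, aux="bli"):
    """Return the Nynorsk lexical units for a Bokmål st-form of a non-st-verb.

    tense is "inf" or "pres", matching <inf><pst> and <pres><pst>.
    """
    if tense == "inf":
        # Nynorsk has <inf><pst> entries, so the st-form is kept (må leses -> må lesast).
        return [f"{lemma}<vblex><inf><pst>"]
    if tense == "pres":
        # No Nynorsk <pres><pst> entries: auxiliary plus participle (leses -> blir lese).
        return [f"{aux}<vblex><pres>", f"{lemma}<vblex><pp>"]
    raise ValueError(f"no rule for tense {tense!r}")

print(nb_st_form_to_nn("lese", "inf"))
print(nb_st_form_to_nn("lese", "pres"))
```

Past-tense forms such as "boken lestes" raise an error here, mirroring the fact that they are not in the dictionary yet.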