Difference between revisions of "Talk:Norwegian Nynorsk and Norwegian Bokmål"
Revision as of 10:01, 18 September 2009
Notes on bokmål NP structure
Possible phrases to put in an NP slot (based on Dyvik 2000, pp. 11-13), with Apertium tags:
- året<n>
- et<det><ind> år<n>
- mange<adj> år<n>
- de<det> mange<adj> årene<n>
- alle<det><qnt> de<det><def> mange<adj> årene<n> dine<det><pos>
  - all the many years yours
- alle<det><qnt> disse<det><def> dine<det><pos> seksti<num> år<n> som<cnjsub> gikk<vblex>
  - all these your sixty years which went
- alle<det><qnt> som<cnjsub> gikk<vblex>
- mange<adj>
- mange<adj> raske<adj>
- *raske<adj>
  - That is, we can't say "Gi meg raske" ("Give me quick (ones)"), but we can say "Gi meg noen raske" ("Give me some quick (ones)").
Dyvik's analysis (based on Vangsnes 1999) looks more or less like:
NOM = { Prop | PRON | AllQP | DP | PossP | QuantP | NP }
AllQP → allq ({ DP | PossP | QuantP | NP })
DP → det ({ PossP | QuantP | NP })
PossP → { Poss | NOM<gen> } ({ QuantP | NP })
QuantP → { QP | num | art } qnt (NP)
NP → AP* n (poss) (CP)
(This is flattened a lot and disregards f-structure.) Assuming AP gives adj, we get:
[AllQP alle<allq> [DP disse<det> [PossP dine<poss> [QuantP mange<qnt> [NP gode<adj> år<n> [CP som gikk] ] ] ] ] ]
[AllQP [QuantP mange<qnt> ] ]
and so on, but this doesn't allow "Gi meg mange raske (*som gikk)" or "Gi meg alle de raske (som gikk)", so the NP needs to be more lax on the presence of n.
TODO
I've put my private TODO list and general notes file (extremely messy) up at http://www.student.uib.no/~kun041/doc/apertium.html (though I doubt it's helpful for anyone but me...). Also, igrep for "todo" in apertium-nn-nb/*x. Unhammer 07:42, 8 June 2009 (UTC)
Compounds
There are several issues with compounding. First of all, we need to do the decompounding analysis. This could be done by changing lt-proc (fst_processor.cc) so that unknown words are sent to a decompounding function that tries various strategies (longest-match left-to-right lookup, minimum cuts, etc.). That part should be relatively easy (People Have Done This Before).
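As a rough illustration of the longest-match left-to-right strategy, here is a minimal Python sketch. It is not lt-proc code: the toy `LEXICON`, the `EPENTHETICS` tuple and the function name `decompound` are all made up for the example, and a real implementation would look members up in the compiled dictionary instead of a set.

```python
from typing import List, Optional

# Toy lexicon (assumption); a real analyser would consult the monodix.
LEXICON = {"språk", "sivilisert", "informasjon", "samfunn"}
# Assumed language-specific list of epenthetic phones for Norwegian.
EPENTHETICS = ("s", "e")


def decompound(word: str) -> Optional[List[str]]:
    """Split `word` into known members, trying the longest left member first.

    Returns the list of parts (epenthetic phones appear as their own
    element), or None if no split is found.
    """
    if word in LEXICON:
        return [word]
    # Longest candidate for the left member first.
    for end in range(len(word) - 1, 0, -1):
        head, rest = word[:end], word[end:]
        if head not in LEXICON:
            continue
        # Try plain lexical compounding first.
        tail = decompound(rest)
        if tail is not None:
            return [head] + tail
        # Otherwise allow one epenthetic phone between the members.
        for ep in EPENTHETICS:
            if rest.startswith(ep) and len(rest) > len(ep):
                tail = decompound(rest[len(ep):])
                if tail is not None:
                    return [head, ep] + tail
    return None


print(decompound("språksivilisert"))
print(decompound("informasjonssamfunn"))
```

The sketch returns only the first split it finds; an actual decompounder would probably have to return all candidate splits so that later stages can choose between them.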
The practical problems come when trying to integrate this with disambiguation, transfer and generation in Apertium.
A naïve attempt: say we read språksivilisert informasjonssamfunn (neither in the dictionary), lt-proc could output something like:
^språk/språk<n><compound><nt><pl><ind>/språk<n><compound><nt><sg><ind>$^sivilisert/sivilisere<vblex><pp>/sivilisere<adj><pp><nt><sg><ind>/sivilisere<adj><pp><mf><sg><ind>$ ^informasjon/informasjon<n><compound><m><sg><ind>$^s/s<epenthetic><compound> $^samfunn/samfunn<n><nt><pl><ind>/samfunn<n><nt><sg><ind>$
If the decompounder doesn't add spaces, we don't get spaces in final output. s<epenthetic><compound> is RL, only generated by the lt-proc decompounder; probably we can have a language-specific list of possible epenthetics somewhere, like s and e for Norwegian.
This will seriously mess up CG though, since we can have e.g. verb+noun, adj+noun and adj+verb compounds. Only the last part of the compound matters for disambiguation (i.e. språksivilisert is equivalent to sivilisert with respect to disambiguation), so I guess the simplest way would be to somehow make CG pretend that the first part doesn't exist, e.g. by making CG ignore words tagged <compound>. That should be easy, just like superblanks, although it feels like a rather arbitrary change. Actually making sure CG ignores <compound>-tagged words without changing cg-proc would be a whole lot more work.
For transfer, we could probably do multi-stage transfer and chunk all <compound>-tagged words. The second stage does the actual transfer stuff (moving things around, concordance), using these chunks and thus ignoring any <compound>-tagged words. So we can probably get by without changing the transfer module.
- Another option would be to send it to CG as ^språksivilisert/språk<n><compound><nt><pl><ind>+sivilisere<vblex><pp>/språk<n><compound><nt><sg><ind>+sivilisere<vblex><pp>/språk<n><compound><nt><pl><ind>+sivilisere<adj><pp><nt><sg><ind>/...$ and then change apertium-pretransfer to not put a space where + is. Alternatively we could just make a new symbol, e.g. = or &, to take the place of +, which pretransfer just splits without inserting a space. - Francis Tyers 11:06, 16 September 2009 (UTC)
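A minimal Python sketch of the proposed splitting behaviour, assuming = were chosen as the new symbol. This is not apertium-pretransfer's code; the function name `split_reading` is hypothetical, and the sketch assumes that neither + nor = occurs inside lemmas or tags.

```python
import re


def split_reading(reading: str) -> str:
    """Split one disambiguated reading into separate lexical units.

    "+" splits and inserts a space between the units; the proposed
    "=" splits without inserting a space.
    """
    # The capturing group keeps the separators: [unit, sep, unit, sep, ...]
    parts = re.split(r"([+=])", reading)
    result = "^" + parts[0] + "$"
    for sep, unit in zip(parts[1::2], parts[2::2]):
        blank = " " if sep == "+" else ""
        result += blank + "^" + unit + "$"
    return result


print(split_reading("språk<n><compound><nt><sg><ind>=sivilisere<vblex><pp>"))
print(split_reading("språk<n><compound><nt><sg><ind>+sivilisere<vblex><pp>"))
```

Using a separate symbol keeps the existing meaning of + untouched, so ordinary multiword or clitic splits would still get their space.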
Disambiguating between compound types
Some heuristics given by Johannesen & Hauglin:
- Lexical compounding is preferable to compounding with epenthetic phones.
- Epenthetic -s- is preferred to lexical compounding when the -s- can be ambiguous between epenthetic use and the first letter of a verbal last member.
- Epenthetic -s- can only follow noun stems.
- Epenthetic -s- is preferred to lexical compounding when the first member is itself a compound.
  - But that would involve marking lots of words in monodix (unless we want to do multistem compounding, which we probably don't). Most likely handled by longest-match left-to-right, though.
- Epenthetic -s- cannot follow epenthetic -e- and vice versa.
- If two analyses have the same number of members and there is no epenthesis involved, choose the one, if any, that is a noun.
- If two analyses are equal with respect to epenthesis and part of speech, and one has a first member that is itself a compound, then choose that one.
  - Most likely handled by longest-match left-to-right.
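Some of these heuristics could be sketched as a validity filter plus a ranking key over candidate analyses. The Python below is only a partial illustration: the `Member` class, the "epenthetic" label and the function names are hypothetical. The two "cannot"/"can only" rules are read as applying to directly adjacent members. The ranking covers just "lexical before epenthetic" and "prefer the noun", and ordering by member count is an extra assumption added to group equally long analyses together. The rules about ambiguous -s- and compound first members are left out.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class Member:
    form: str
    pos: str  # e.g. "n", "vblex", "adj", or "epenthetic" (hypothetical labels)


Analysis = List[Member]


def is_valid(analysis: Analysis) -> bool:
    """Apply the hard constraints on epenthetic phones."""
    for prev, cur in zip(analysis, analysis[1:]):
        if cur.pos != "epenthetic":
            continue
        if prev.pos == "epenthetic":
            return False  # -s- cannot follow -e- and vice versa
        if cur.form == "s" and prev.pos != "n":
            return False  # -s- can only follow noun stems
    return True


def rank_key(analysis: Analysis) -> Tuple[int, int, bool]:
    """Lower is better: fewer epenthetics, then fewer members, then nouns."""
    n_epenthetic = sum(m.pos == "epenthetic" for m in analysis)
    is_noun = analysis[-1].pos == "n"  # the last member decides the POS
    return (n_epenthetic, len(analysis), not is_noun)


def choose(candidates: List[Analysis]) -> Optional[Analysis]:
    """Return the preferred valid analysis, or None if none is valid."""
    valid = [a for a in candidates if is_valid(a)]
    return min(valid, key=rank_key) if valid else None
```

For example, given one noun+noun and one noun+verb analysis of the same length with no epenthesis, `choose` returns the noun+noun one.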