Difference between revisions of "Talk:Norwegian Nynorsk and Norwegian Bokmål"
== Notes on bokmål NP structure ==

Possible phrases to put in an NP slot (based on [http://www.hf.uib.no/i/LiLi/SLF/Dyvik/norsknoder.pdf Dyvik 2000], pp. 11–13), with Apertium tags:

* året<n>
* et<det><ind> år<n>
* mange<adj> år<n>
* de<det> mange<adj> årene<n>
* alle<det><qnt> de<det><def> mange<adj> årene<n> dine<det><pos>
** all the many years yours
* alle<det><qnt> disse<det><def> dine<det><pos> seksti<num> år<n> som<cnjsub> gikk<vblex>
** all these your sixty years which went
* alle<det><qnt> som<cnjsub> gikk<vblex>
* mange<adj>
* mange<adj> raske<adj>
* *raske<adj>
** (That is, we can't say "Gi meg raske." (''Give me quick (ones).'') but we can say "Gi meg noen raske." (''Give me some quick (ones).'').)

Dyvik's analysis (based on [http://www.hf.uib.no/i/LiLi/SLF/ans/Vangsnes/Contents.html Vangsnes 1999]) looks more or less like:

<pre>
NOM = { Prop | PRON | AllQP | DP | PossP | QuantP | NP }
AllQP → allq ({ DP | PossP | QuantP | NP })
DP → det ({ PossP | QuantP | NP })
PossP → { Poss | NOM<gen> } ({ QuantP | NP })
QuantP → { QP | num | art } qnt (NP)
NP → AP* n (poss) (CP)
</pre>

(This is flattened a lot and disregards f-structure.) Assuming AP gives adj, we get:

<pre>
[AllQP alle<allq> [DP disse<det> [PossP dine<poss> [QuantP mange<qnt> [NP gode<adj> år<n> [CP som gikk] ] ] ] ] ]
[AllQP [QuantP mange<qnt> ] ]
</pre>

and so on, but this doesn't allow "Gi meg mange raske (*som gikk)" or "Gi meg alle de raske (som gikk)", so the NP needs to be more lax on the presence of n.

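One rough way to try out a laxer NP is to check category sequences against a regular expression. The following is only a sketch under my own assumptions: the function name <code>fits_np_slot</code> is made up, num and art are collapsed into <code>qnt</code>, and the two headless rules (no bare adjectives; a relative clause without a noun needs <code>allq</code> or <code>det</code>) are guessed from the example phrases above, not taken from Dyvik.

```python
import re

# Flattened order: allq det poss qnt adj* n poss cp, every part optional.
# Each category is followed by a single space so tokens cannot run together.
NP_SLOT = re.compile(
    r"^(?P<allq>allq )?(?P<det>det )?(?P<poss>poss )?(?P<qnt>qnt )?"
    r"(?P<adj>(?:adj )*)(?P<n>n )?(?P<post>poss )?(?P<cp>cp )?$"
)

def fits_np_slot(cats):
    """Return True if the category sequence could fill an NP slot (sketch)."""
    if not cats:
        return False
    m = NP_SLOT.match("".join(c + " " for c in cats))
    if m is None:
        return False
    if m.group("n"):
        return True  # with a head noun, everything the pattern accepts is fine
    # Headless phrases below; these restrictions are assumptions.
    if m.group("post"):
        return False  # a postposed possessive needs a noun to follow
    if not any(m.group(g) for g in ("allq", "det", "poss", "qnt")):
        return False  # *raske: bare adjectives (or a bare clause) are out
    if m.group("cp") and not (m.group("allq") or m.group("det")):
        return False  # *mange raske som gikk
    return True

print(fits_np_slot(["qnt", "adj"]))                # mange raske
print(fits_np_slot(["adj"]))                       # *raske
print(fits_np_slot(["allq", "det", "adj", "cp"]))  # alle de raske som gikk
```

Whether the headless restrictions belong in the grammar or in a later filter is an open question; the sketch just makes the examples testable.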
== TODO ==

I've put my private TODO list and general notes file (extremely messy) up at http://www.student.uib.no/~kun041/doc/apertium.html (though I doubt it's helpful for anyone but me...). Also, igrep for "todo" in apertium-nn-nb/*x. [[User:Unhammer|Unhammer]] 07:42, 8 June 2009 (UTC)

== Compounds ==

There are several issues with [[compounds|compounding]]. First of all, we need to do the decompounding analysis. This could happen by changing lt-proc (fst_processor.cc) so that unknown words are sent to a decompounding function that tries various strategies (longest-match left-to-right lookup, minimum number of cuts, etc.). That part should be relatively easy (People Have Done This Before).

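To make the longest-match left-to-right strategy concrete, here is a minimal sketch in Python. It is not how lt-proc works: the function <code>decompound</code>, the set-based lexicon and the epenthetic list are all assumptions for illustration, and a real implementation would walk the transducer instead of slicing strings.

```python
def decompound(word, lexicon, epenthetics=("s", "e")):
    """Split an unknown word into known parts, longest left part first.

    Returns a list of parts (epenthetics included as separate items),
    or None if no split into known parts exists.
    """
    if word in lexicon:
        return [word]
    # Try the longest known prefix first, backing off to shorter ones.
    for end in range(len(word) - 1, 0, -1):
        head, rest = word[:end], word[end:]
        if head not in lexicon:
            continue
        # Plain lexical compounding: the rest must itself be analysable.
        tail = decompound(rest, lexicon, epenthetics)
        if tail is not None:
            return [head] + tail
        # Otherwise allow one epenthetic between head and the rest.
        for ep in epenthetics:
            if rest.startswith(ep) and len(rest) > len(ep):
                tail = decompound(rest[len(ep):], lexicon, epenthetics)
                if tail is not None:
                    return [head, ep] + tail
    return None

lexicon = {"språk", "sivilisert", "informasjon", "samfunn"}
print(decompound("språksivilisert", lexicon))
print(decompound("informasjonssamfunn", lexicon))
```

Because the recursion always needs a known word right after an epenthetic, two epenthetics can never end up next to each other in this sketch.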
The practical problems come when trying to integrate this with disambiguation, transfer and generation in Apertium.

A naïve attempt: say we read ''språksivilisert informasjonssamfunn'' (neither of which is in the dictionary); lt-proc could output something like:

<pre>
^språk/språk<n><compound><nt><pl><ind>/språk<n><compound><nt><sg><ind>$^sivilisert/sivilisere<vblex><pp>/sivilisere<adj><pp><nt><sg><ind>/sivilisere<adj><pp><mf><sg><ind>$
^informasjon/informasjon<n><compound><m><sg><ind>$^s/s<epenthetic><compound> $^samfunn/samfunn<n><nt><pl><ind>/samfunn<n><nt><sg><ind>$
</pre>

If the decompounder doesn't add spaces, we don't get spaces in the final output. <code>s<epenthetic><compound></code> is RL, only generated by the lt-proc decompounder; probably we can have a language-specific list of possible epenthetics somewhere, like ''s'' and ''e'' for Norwegian.

This will seriously mess up CG though, since we can have e.g. verb+noun, adj+noun and adj+verb compounds. Only the last part matters for disambiguation (i.e. ''språksivilisert'' is equivalent to ''sivilisert'' wrt. disambiguation), so I guess the simplest way would be to somehow make CG pretend that the first part doesn't exist, for instance by making CG ignore words with <code><compound></code>. Should be easy, just like superblanks, although it feels like a rather arbitrary change. Actually making sure CG ignores <code><compound></code>-tagged words ''without'' changing cg-proc would be a whole lot more work.

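As an illustration of "pretending the first part doesn't exist", the sketch below splits a stream into lexical units and drops those where every reading carries <code><compound></code>. This is not cg-proc code: the helper names are invented, and the regular expression ignores escaping and superblanks, which a real stream parser would have to handle.

```python
import re

# One lexical unit: ^surface/reading1/reading2$ (no escape handling here).
LU = re.compile(r"\^([^/$]*)/([^$]*)\$")

def parse_stream(stream):
    """Return a list of (surface, [readings]) pairs found in the stream."""
    return [(m.group(1), m.group(2).split("/")) for m in LU.finditer(stream)]

def is_compound_part(readings):
    """True if all readings are tagged as non-final compound parts."""
    return all("<compound>" in r for r in readings)

def visible_to_disambiguator(stream):
    """Units the disambiguator would see if compound parts were skipped."""
    return [(s, rs) for s, rs in parse_stream(stream)
            if not is_compound_part(rs)]

stream = ("^informasjon/informasjon<n><compound><m><sg><ind>$"
          "^s/s<epenthetic><compound> $"
          "^samfunn/samfunn<n><nt><pl><ind>/samfunn<n><nt><sg><ind>$")
for surface, readings in visible_to_disambiguator(stream):
    print(surface, readings)
```

The skipped units would still have to be carried along and re-attached after disambiguation, which is the part that resembles superblank handling.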
For transfer, we could probably do multi-stage transfer and chunk all <code><compound></code>-tagged words in the first stage. The second stage does the actual transfer work (moving things around, concordance) on these chunks, and thus ignores any <code><compound></code>-tagged words. So we can probably get by without changing the transfer module.

:Another option would be to send it to CG as <code>^språksivilisert/språk<n><compound><nt><pl><ind>+sivilisere<vblex><pp>/språk<n><compound><nt><sg><ind>+sivilisere<vblex><pp>/språk<n><compound><nt><pl><ind>+sivilisere<adj><pp><nt><sg><ind>/...$</code> and then change apertium-pretransfer to not put a space where <code>+</code> is. Alternatively we could just make a new symbol, e.g. <code>=</code> or <code>&</code>, to take the place of <code>+</code>, which pretransfer just splits on without inserting a space. - [[User:Francis Tyers|Francis Tyers]] 11:06, 16 September 2009 (UTC)

===Disambiguating between compound types===

Some heuristics given by [[Compounds#Further_reading|Johannessen & Hauglin]]:

* Compounds are always binary – they contain two members only (apart from possible epenthetic phones).
* Choose the analysis (or analyses) with the fewest compound members.
* Lexical compounding is preferable to compounding with epenthetic phones.
** Note: the first part is always just a stem. We don't have "cars.sick", only "car.sick", etc. (Well, that's what they say, but then there's "savnet.melding" ("missing.message") where the first part is inflected.)
* Epenthetic -s- is preferred to lexical compounding when the -s- can be ambiguous between epenthetic use and the first letter of a verbal last member.
** First-longest-match gets in the way of this...
* Epenthetic -s- can only follow noun stems.
** Sounds like a job for CG.
* Epenthetic -s- is preferred to lexical compounding when the first member is itself a compound.
** But that would involve marking lots of words in monodix... (unless we want to do multistem compounding, which we probably don't). Most likely handled by longest-match left-to-right, though.
* Epenthetic -s- cannot follow epenthetic -e- and vice versa.
** This could be hardcoded into a decompounder, since ''if'' a language allows something that looks like two epenthetics, you just add that as another epenthetic in your language-specific list.
* If two analyses have the same number of members and there is no epenthesis involved, choose the one, if any, that is a noun.
** We can't do this, can we? That'd involve sending not only all analyses of the first longest matches, but all possible analyses into some compound disambiguation.
* If two analyses are equal with respect to epenthesis and part of speech, and one has a first member that is itself a compound, then choose that one.
** Most likely handled by longest-match left-to-right.
* If the first member is unknown, choose the analysis with the longest last member.
** Not handled by longest-match left-to-right? Or?

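A few of the heuristics above are easy to express as a filter plus a sort key. The sketch below covers only three of them (fewest members, lexical compounding before epenthesis, and the restrictions on where epenthetics may occur); the representation of an analysis as a list of (form, pos) pairs and all names are assumptions for illustration, and the example analyses are made up.

```python
EPENTHETIC = "epenthetic"

def is_valid(analysis):
    """Reject analyses that break the hard constraints on epenthetics."""
    for i, (form, pos) in enumerate(analysis):
        if pos != EPENTHETIC:
            continue
        if i == 0 or i == len(analysis) - 1:
            return False  # an epenthetic must sit between two members
        prev_pos = analysis[i - 1][1]
        if prev_pos == EPENTHETIC:
            return False  # no epenthetic directly after another one
        if form == "s" and prev_pos != "n":
            return False  # epenthetic -s- only follows noun stems
    return True

def rank_key(analysis):
    """Fewer members first, then fewer epenthetics (lexical preferred)."""
    members = sum(1 for _, pos in analysis if pos != EPENTHETIC)
    linkers = len(analysis) - members
    return (members, linkers)

def pick_analyses(analyses):
    """Return the best-ranked valid analyses (possibly several, or none)."""
    valid = [a for a in analyses if is_valid(a)]
    if not valid:
        return []
    best = min(rank_key(a) for a in valid)
    return [a for a in valid if rank_key(a) == best]

two = [("informasjon", "n"), ("s", EPENTHETIC), ("samfunn", "n")]
three = [("informasjon", "n"), ("s", EPENTHETIC), ("sam", "adj"), ("funn", "n")]
print(pick_analyses([three, two]))
```

The heuristic that prefers epenthetic -s- before a verbal last member would need an extra tie-break on top of this key, so it is left out here.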
Stuff we most likely won't do:

* Epenthetic -e- can only be attached to a stem that is monosyllabic. This does not mean that epenthetic -e- must always make up the second syllable in a word, however. Other possible stems can be prior to the stem preceding the -e-, if they do not form a compound with that stem.
* Epenthetic -s- cannot occur after a sibilant or a final consonant sequence containing a sibilant except when the consonant sequence belongs to a compound.

===Some compound stats===

[http://www.dokpro.uio.no/bokmaal/nyord/nyord_ramme.html] gives a list of compounds, POS-tagged on both members and classified by epenthesis type etc. Of ~25000 compounds: ~10000 had epenthesis, 954 had an adjective as one member (about 50/50 on first and last), 327 had a verb as one member (again 50/50).

Of the 15904 non-epenthetic compounds, 674 were analysed by apertium-nn-nb. When putting nouns, adjectives and verbs into an "inconditional" section, 15889 were analysed, although many of these analyses were of course wrong. Of the 15889 naïvely decompounded analyses of non-epenthetic compounds, only 5369 were analysed as two words, so we definitely need to restrict decompounding to two-word analyses (with possible epenthetics). (After using only nouns, and removing single letters, this figure went up to 7512.)

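For a feel of the proportions, the counts above can be turned into percentages. This only restates the numbers already given; the 7512 figure is left out because it is not clear from the notes what the total was for that run.

```python
NON_EPENTHETIC = 15904  # non-epenthetic compounds in the list

def pct(part, whole):
    """Percentage of whole, rounded to one decimal."""
    return round(100 * part / whole, 1)

coverage_plain = pct(674, NON_EPENTHETIC)            # analysed as-is
coverage_naive = pct(15889, NON_EPENTHETIC)          # with naive decompounding
two_word_share = pct(5369, 15889)                    # of those, two-word analyses

print(coverage_plain, coverage_naive, two_word_share)  # 4.2 99.9 33.8
```

So naive decompounding gets nearly full coverage, but only about a third of its analyses have the two-member shape that the heuristics expect.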
== NB-NN ==

"Boken ble lest → Boka vart lese (past)" - better: "Boken ble lest → Boka vart lesen (past)"