Difference between revisions of "Talk:Norwegian Nynorsk and Norwegian Bokmål"

Revision as of 18:08, 15 September 2009

Notes on bokmål NP structure

Possible phrases to put in an NP slot (based on Dyvik 2000, pp. 11-13), with Apertium tags:

  • året<n>
  • et<det><ind> år<n>
  • mange<adj> år<n>
  • de<det> mange<adj> årene<n>
  • alle<det><qnt> de<det><def> mange<adj> årene<n> dine<det><pos>
    • all the many years yours
  • alle<det><qnt> disse<det><def> dine<det><pos> seksti<num> år<n> som<cnjsub> gikk<vblex>
    • all these your sixty years which went
  • alle<det><qnt> som<cnjsub> gikk<vblex>
  • mange<adj>
  • mange<adj> raske<adj>
  • *raske<adj>
    • (That is, we can't say "Gi meg raske." (Give me quick (ones).) but we can say "Gi meg noen raske." (Give me some quick (ones).).)

Dyvik's analysis (based on Vangsnes 1999) looks more or less like:

NOM = { Prop | PRON | AllQP | DP | PossP | QuantP | NP } 

AllQP    →  allq ({ DP  | PossP | QuantP | NP }) 
DP       →         det ({ PossP | QuantP | NP }) 
PossP    → { Poss |  NOM<gen> } ({ QuantP | NP }) 
QuantP   →    { QP | num | art }    qnt    (NP)
NP       →                            AP*   n   (poss)  (CP)

(This is flattened a lot and disregards f-structure.) Assuming AP gives adj, we get:

[AllQP alle<allq> [DP disse<det> [PossP dine<poss> [QuantP mange<qnt> [NP gode<adj> år<n> [CP som gikk] ] ] ] ] ]

[AllQP [QuantP mange<qnt> ] ] 

and so on, but this doesn't allow "Gi meg mange raske (*som gikk)" or "Gi meg alle de raske (som gikk)", so the NP needs to be more lax on the presence of n.

TODO

I've put my private TODO list and general notes file (extremely messy) up at http://www.student.uib.no/~kun041/doc/apertium.html (though I doubt it's helpful for anyone but me...). Also, igrep for "todo" in apertium-nn-nb/*x. Unhammer 07:42, 8 June 2009 (UTC)

Compounds

There are several issues with compounding. First of all, we need to do the decompounding analysis. This could be done by changing lt-proc (fst_processor.cc) so that unknown words are sent to a decompounding function that tries various strategies (longest-match left-to-right lookup, minimum cuts, etc.). That part should be relatively easy (People Have Done This Before).
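To make the longest-match left-to-right strategy concrete, here is a minimal sketch in Python. It is not how lt-proc works or would work: the lexicon is a toy set of surface forms, the function name decompound is made up for illustration, and the only linking elements tried are s and e.

```python
from typing import List, Optional

# Toy lexicon and linking elements; both are illustrative assumptions,
# not taken from any real Apertium dictionary.
LEXICON = {"språk", "sivilisert", "informasjon", "samfunn"}
EPENTHETICS = ("s", "e")


def decompound(word: str, lexicon=LEXICON) -> Optional[List[str]]:
    """Split `word` into known parts, preferring the longest left part.

    Returns None when no complete split exists. A linking element is
    only tried between two parts, never at the very start of `word`.
    """
    if word in lexicon:
        return [word]
    # Try the longest known prefix first, then back off to shorter ones.
    for end in range(len(word) - 1, 0, -1):
        head, rest = word[:end], word[end:]
        if head not in lexicon:
            continue
        # First try to split the remainder directly ...
        tail = decompound(rest, lexicon)
        if tail is not None:
            return [head] + tail
        # ... then try again after skipping one linking element.
        for link in EPENTHETICS:
            if rest.startswith(link) and len(rest) > len(link):
                tail = decompound(rest[len(link):], lexicon)
                if tail is not None:
                    return [head, link] + tail
    return None


print(decompound("språksivilisert"))      # ['språk', 'sivilisert']
print(decompound("informasjonssamfunn"))  # ['informasjon', 's', 'samfunn']
```

A real implementation would walk the transducer rather than a Python set, and would have to rank competing splits; this sketch just returns the first one found.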

The practical problems come when trying to integrate this with disambiguation, transfer and generation in Apertium.

A naïve attempt: say we read språksivilisert informasjonssamfunn (neither in the dictionary), lt-proc could output something like:

^språk/språk<n><compound><nt><pl><ind>/språk<n><compound><nt><sg><ind>$^sivilisert/sivilisere<vblex><pp>/sivilisere<adj><pp><nt><sg><ind>/sivilisere<adj><pp><mf><sg><ind>$

^informasjon/informasjon<n><compound><m><sg><ind>$^s/s<epenthetic><compound> $^samfunn/samfunn<n><nt><pl><ind>/samfunn<n><nt><sg><ind>$

If the decompounder doesn't add spaces, we don't get spaces in final output. s<epenthetic><compound> is RL, only generated by the lt-proc decompounder; probably we can have a language-specific list of possible epenthetics somewhere, like s and e for Norwegian.

This will seriously mess up CG though, since we can have e.g. verb+noun, adj+noun and adj+verb compounds. Only the last part matters for disambiguation (i.e. språksivilisert is equivalent to sivilisert wrt. disambiguation), so I guess the simplest way would be to somehow make CG pretend that the first part doesn't exist, e.g. by making CG ignore words tagged <compound>. That should be easy, just like superblanks, although it feels like a rather arbitrary change. Actually making sure CG ignores <compound>-tagged words without changing cg-proc would be a whole lot more work.
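As a rough illustration of "pretend that the first part doesn't exist", the sketch below filters a stream so that only the last part of each compound is left for disambiguation. It is plain Python, not cg-proc code; the helper names are hypothetical, and the regex ignores escaped characters and superblanks, which a real stream parser would have to handle.

```python
import re
from typing import List, Tuple

# One lexical unit: ^surface/analysis1/analysis2$ (escapes are ignored here).
UNIT = re.compile(r"\^(.*?)\$")


def split_units(stream: str) -> List[Tuple[str, List[str]]]:
    """Return (surface form, list of analyses) for each unit in the stream."""
    units = []
    for body in UNIT.findall(stream):
        surface, *analyses = body.split("/")
        units.append((surface, analyses))
    return units


def is_compound_part(analyses: List[str]) -> bool:
    """True if every analysis of the unit carries the <compound> tag."""
    return bool(analyses) and all("<compound>" in a for a in analyses)


def visible_to_cg(stream: str) -> List[str]:
    """Surface forms the disambiguator would see if it skipped compound parts."""
    return [s for s, a in split_units(stream) if not is_compound_part(a)]


stream = ("^informasjon/informasjon<n><compound><m><sg><ind>$"
          "^s/s<epenthetic><compound> $"
          "^samfunn/samfunn<n><nt><pl><ind>/samfunn<n><nt><sg><ind>$")
print(visible_to_cg(stream))  # ['samfunn']
```

The skipped units would of course still have to be passed through untouched to the next module, the same way superblanks are.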

For transfer, we could probably do multi-stage transfer and chunk all <compound>-tagged words. The second stage does the actual transfer stuff (moving things around, concordance), using these chunks and thus ignoring any <compound>-tagged words. So we can probably get by without changing the transfer module.
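The grouping that the first stage would do can be sketched roughly as follows. This is plain Python rather than transfer rules, the function name chunk_compounds is made up, and the stream parsing makes the simplifying assumption that there are no escaped characters or superblanks.

```python
import re
from typing import List

# One lexical unit: ^surface/analysis1/analysis2$ (escapes are ignored here).
UNIT = re.compile(r"\^(.*?)\$")


def chunk_compounds(stream: str) -> List[List[str]]:
    """Group each run of <compound>-tagged units with the unit that follows.

    Every chunk ends in the part that later stages would actually look at;
    the <compound>-tagged parts before it just ride along inside the chunk.
    """
    chunks: List[List[str]] = []
    pending: List[str] = []
    for body in UNIT.findall(stream):
        surface, *analyses = body.split("/")
        pending.append(surface)
        is_part = bool(analyses) and all("<compound>" in a for a in analyses)
        if not is_part:
            # Reached the last part of the word: close the chunk.
            chunks.append(pending)
            pending = []
    if pending:
        # Dangling compound parts with nothing after them; keep them anyway.
        chunks.append(pending)
    return chunks


stream = ("^språk/språk<n><compound><nt><sg><ind>$"
          "^sivilisert/sivilisere<adj><pp><nt><sg><ind>$ "
          "^informasjon/informasjon<n><compound><m><sg><ind>$"
          "^s/s<epenthetic><compound> $"
          "^samfunn/samfunn<n><nt><sg><ind>$")
print(chunk_compounds(stream))
# [['språk', 'sivilisert'], ['informasjon', 's', 'samfunn']]
```

In the real pipeline the chunk would carry the tags of its last part, so the second stage can do agreement on it without ever looking inside.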