Difference between revisions of "Talk:Norwegian Nynorsk and Norwegian Bokmål"
m (→Compounds) |
(Blanked the page) |
||
(16 intermediate revisions by 3 users not shown) | |||
Line 1: | Line 1: | ||
− | == Notes on bokmål NP structure == |
||
− | Possible phrases to put in an NP slot (based on [http://www.hf.uib.no/i/LiLi/SLF/Dyvik/norsknoder.pdf Dyvik 2000], pp. 11–13), with Apertium tags:
||
− | |||
− | * året<n> |
||
− | * et<det><ind> år<n> |
||
− | * mange<adj> år<n> |
||
− | * de<det> mange<adj> årene<n> |
||
− | * alle<det><qnt> de<det><def> mange<adj> årene<n> dine<det><pos> |
||
− | ** all the many years yours |
||
− | * alle<det><qnt> disse<det><def> dine<det><pos> seksti<num> år<n> som<cnjsub> gikk<vblex> |
||
− | ** all these your sixty years which went |
||
− | * alle<det><qnt> som<cnjsub> gikk<vblex> |
||
− | * mange<adj> |
||
− | * mange<adj> raske<adj> |
||
− | * *raske<adj> |
||
− | ** (That is, we can't say "Gi meg raske." (''Give me quick (ones).'') but we can say "Gi meg noen raske." (''Give me some quick (ones).'').) |
||
− | |||
− | Dyvik's analysis (based on [http://www.hf.uib.no/i/LiLi/SLF/ans/Vangsnes/Contents.html Vangsnes 1999]) looks more or less like: |
||
− | <pre> |
||
− | NOM = { Prop | PRON | AllQP | DP | PossP | QuantP | NP } |
||
− | |||
− | AllQP → allq ({ DP | PossP | QuantP | NP }) |
||
− | DP → det ({ PossP | QuantP | NP }) |
||
− | PossP → { Poss | NOM<gen> } ({ QuantP | NP }) |
||
− | QuantP → { QP | num | art } qnt (NP) |
||
− | NP → AP* n (poss) (CP) |
||
− | </pre> |
||
− | (This is flattened a lot and disregards f-structure.) Assuming AP gives adj, we get: |
||
− | <pre> |
||
− | [AllQP alle<allq> [DP disse<det> [PossP dine<poss> [QuantP mange<qnt> [NP gode<adj> år<n> [CP som gikk] ] ] ] ] ] |
||
− | |||
− | [AllQP [QuantP mange<qnt> ] ] |
||
− | </pre> |
||
− | and so on, but this doesn't allow "Gi meg mange raske (*som gikk)" or "Gi meg alle de raske (som gikk)", so the NP needs to be more lax on the presence of n. |
||
− | |||
− | == TODO == |
||
− | |||
− | I've put my private TODO list and general notes file (extremely messy) up at http://www.student.uib.no/~kun041/doc/apertium.html (though I doubt it's helpful for anyone but me...). Also, igrep for "todo" in apertium-nn-nb/*x. [[User:Unhammer|Unhammer]] 07:42, 8 June 2009 (UTC) |
||
− | |||
− | == Compounds == |
||
− | |||
− | There are several issues with compounding. First of all, we need to do the decompounding analysis. This could happen by changing lt-proc (fst_processor.cc) so that unknown words are sent to a decompounding-function that tries various strategies (looking up longest-match left-to-right / minimum cuts etc.). But that should be relatively easy (People Have Done This Before). |
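The longest-match, left-to-right strategy mentioned above could look roughly like the following. This is only a sketch of the idea in Python, not lt-proc code: the `LEXICON` set, the `EPENTHETICS` tuple and the `decompound` function are all hypothetical names made up for illustration, and a real implementation would look parts up in the compiled dictionary rather than in a set of strings.

```python
from typing import List, Optional

# Hypothetical toy lexicon; a real decompounder would query the dictionary FST.
LEXICON = {"språk", "sivilisert", "informasjon", "samfunn"}
# Linking elements assumed for Norwegian (illustrative, language-specific list).
EPENTHETICS = ("s", "e")


def decompound(word: str, lexicon=LEXICON) -> Optional[List[str]]:
    """Split `word` into known parts, trying the longest left part first.

    Returns the list of parts (including any linking element), or None
    if no full split into known parts exists.
    """
    if word in lexicon:
        return [word]
    # Try cut points from the right, so the longest known left part wins.
    for cut in range(len(word) - 1, 0, -1):
        head, rest = word[:cut], word[cut:]
        if head not in lexicon:
            continue
        # First try the remainder as-is.
        tail = decompound(rest, lexicon)
        if tail is not None:
            return [head] + tail
        # Otherwise allow one linking element between the parts.
        for ep in EPENTHETICS:
            if rest.startswith(ep) and len(rest) > len(ep):
                tail = decompound(rest[len(ep):], lexicon)
                if tail is not None:
                    return [head, ep] + tail
    return None


print(decompound("språksivilisert"))
print(decompound("informasjonssamfunn"))
```

The recursion backtracks to a shorter left part when the remainder cannot be split, so it is not purely greedy; a minimum-cuts strategy would instead compare all complete splits and keep the one with the fewest parts.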
||
− | |||
− | The practical problems come when trying to integrate this with disambiguation, transfer and generation in Apertium. |
||
− | |||
− | A naïve attempt: say we read ''språksivilisert informasjonssamfunn'' (neither of which is in the dictionary); lt-proc could then output something like:
||
− | <pre> |
||
− | ^språk/språk<n><compound><nt><pl><ind>/språk<n><compound><nt><sg><ind>$ |
||
− | ^sivilisert/sivilisere<vblex><pp>/sivilisere<adj><pp><nt><sg><ind>/sivilisere<adj><pp><mf><sg><ind>$ |
||
− | |||
− | ^informasjon/informasjon<n><compound><m><sg><ind>$ |
||
− | ^s/s<epenthetic><compound>$
||
− | ^samfunn/samfunn<n><nt><pl><ind>/samfunn<n><nt><sg><ind>$ |
||
− | </pre> |
||
− | |||
− | lt-proc would have to ignore the <code><compound></code> tags when looking up in generation (or we could use a <code>+</code> afterwards or something instead of a tag), and we could have a recompounder after generation which deletes spaces after <code><compound></code>. (<code>s<epenthetic><compound></code> is RL, only generated by the lt-proc decompounder; probably we can have a language-specific list of possible epenthetics somewhere, like ''s'' and ''e'' for Norwegian.) |
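The recompounding step after generation can be sketched as a simple text filter. The Python below assumes, purely for illustration, that generation leaves a literal `<compound>` marker directly after each non-final compound part; the marker string and the `recompound` function are hypothetical, and a `+` marker would work the same way.

```python
import re

# Assumed marker left on non-final compound parts after generation.
MARKER = "<compound>"


def recompound(text: str) -> str:
    """Delete each marker together with the whitespace that follows it."""
    return re.sub(re.escape(MARKER) + r"\s*", "", text)


generated = "språk<compound> sivilisert informasjon<compound> s<compound> samfunn"
print(recompound(generated))
```

Since the marker and its trailing space are removed together, the epenthetic ''s'' needs no special handling here: it is simply one more marked part.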
||
− | |||
− | This will seriously mess up CG though, since we can have e.g. verb+noun, adj+noun and adj+verb compounds. Only the last part of a compound matters for disambiguation (i.e. ''språksivilisert'' is equivalent to ''sivilisert'' wrt. disambiguation), so I guess the simplest way would be to make CG pretend that the first part doesn't exist, i.e. make CG ignore words tagged <code><compound></code>. Should be easy, just like superblanks, although it feels like a rather arbitrary change. Actually making sure CG ignores <code><compound></code>-tagged words ''without'' changing cg-proc would be a whole lot more work.
||
− | |||
− | For transfer, we could probably do multi-stage transfer and chunk all <code><compound></code>-tagged words. The second stage does the actual transfer stuff (moving things around, concordance), using these chunks and thus ignoring any <code><compound></code>-tagged words. So we can probably get by without changing the transfer module. |