User:Firespeaker/Steps for writing a morphological transducer
This is a short guide to the steps involved in writing a morphological transducer.
The steps below should be followed more or less in order. The process is iterative, though, and sometimes you need to jump ahead a step or two to figure out what you did wrong or missed a couple of steps back.
Document Resources
Document Morphotactics
Phonology
with clear rules for any condition, documenting variation
Word classes
Not just nouns/verbs/adjectives/etc., but types of these, how they pattern
Decide what the best formalism is
Start writing morphophonology
Add some nouns
Add some noun morphology
For example, plural. Where do cases come in relation to plural?
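The ordering question above (does case attach before or after plural?) can be tried out on paper before committing to a formalism. The following is only a rough sketch of the idea of chained continuation lexicons, loosely in the spirit of lexc. The stem "taka", the suffixes "-ri" and "-mo", the lexicon names and the tags are all invented placeholders, not data from any real language.

```python
# Toy continuation lexicons: each entry is (surface piece, analysis piece, next lexicon).
# Everything here (stem, suffixes, tags, lexicon names) is a made-up placeholder.
LEXICONS = {
    "Nouns":  [("taka", "taka<n>", "Number")],
    "Number": [("", "<sg>", "Case"), ("ri", "<pl>", "Case")],  # plural attaches first
    "Case":   [("", "<nom>", None), ("mo", "<acc>", None)],    # case follows number
}

def expand(lexicon="Nouns", surface="", analysis=""):
    """Walk the lexicon chain and yield (analysis, surface form) pairs."""
    for form, tags, next_lexicon in LEXICONS[lexicon]:
        new_surface = surface + form
        new_analysis = analysis + tags
        if next_lexicon is None:
            # End of the chain: a complete word form.
            yield new_analysis, new_surface
        else:
            yield from expand(next_lexicon, new_surface, new_analysis)

pairs = dict(expand())
for analysis, form in pairs.items():
    print(f"{analysis}\t{form}")
```

If the documentation says case comes before plural in your language, swapping the order of the "Number" and "Case" lexicons in the chain is the only change needed, which is one reason to document morphotactics before writing any of it.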
Add some verbs
Add some verbal morphology
For example, simple present tense.
Figure out more complex categories
Major evaluation
You should be evaluating all along, but at this point stop and do a major evaluation. Run your transducer over several large corpora (if available). What's missing? Major chunks of morphology? Some common words? Try to fill these in, and your coverage should jump by several percent.
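As a rough illustration of what "coverage" means here, the sketch below counts how many tokens received at least one analysis. It assumes the analyser output is in the Apertium stream format, where each token looks like ^surface/analysis1/analysis2$ and an unknown word is marked with an asterisk, as in ^blorp/*blorp$. The function name and the sample string are made up for the example, and the regular expression ignores escaped characters and other details of the full format.

```python
import re

# One lexical unit: ^surface/analyses$ (simplified; escaped characters are not handled).
UNIT = re.compile(r"\^([^/$]*)/([^$]*)\$")

def coverage(stream):
    """Return (known tokens, total tokens, fraction known) for analyser output."""
    total = 0
    known = 0
    for _surface, analyses in UNIT.findall(stream):
        total += 1
        # Unknown words have an analysis that starts with an asterisk.
        if not analyses.startswith("*"):
            known += 1
    fraction = known / total if total else 0.0
    return known, total, fraction

# Hypothetical sample output for three tokens, one of them unknown.
sample = "^cats/cat<n><pl>$ ^blorp/*blorp$ ^run/run<n><sg>/run<vblex><inf>$"
known, total, fraction = coverage(sample)
print(f"{known}/{total} tokens analysed ({fraction:.1%})")
```

This naive measure only says that a token got some analysis, not that the analysis is correct, so it complements rather than replaces reading through the output by hand.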
Disambiguation
This is a good point to stop and do a different type of evaluation. Run through the analyses of a couple of sentences (ideally several paragraphs' worth) and see if there are any words that aren't being analysed correctly. Focus on words with multiple analyses where only one is correct in the context. (Words without a correct analysis will also need work, but that's part of expanding coverage.)
Document some rules
The first step of setting up good disambiguation rules should really be to document the rules. Write some plain-text rules (e.g., on the wiki) that you can later implement. For example, is there a past tense form that sometimes makes a verb look like a similar-looking noun in the accusative case? Could this be consistently disambiguated based on the part of speech of the following word? What about the preceding word? If that doesn't work, is there a semi-consistent way to know whether one reading or the other is right? Could you implement a couple of these semi-consistent ways to get a very consistent disambiguation? Document everything you think will help.
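Before committing a documented rule to a real rule formalism, it can help to prototype it on a few hand-built sentences. The sketch below encodes one hypothetical rule of the kind described above: "if a word is ambiguous between a past-tense verb and an accusative noun, and the following word is unambiguously a verb, keep only the accusative noun reading". The rule itself, the tag names ("n", "v", "acc", "past", "pres") and the data layout are assumptions made for illustration, not claims about any particular language or tagset.

```python
def apply_rule(sentence):
    """Apply one toy context rule to a sentence.

    A sentence is a list of (surface, readings) pairs; each reading is a set of tags.
    The context is always checked against the original, unmodified sentence.
    """
    result = []
    for i, (surface, readings) in enumerate(sentence):
        next_readings = sentence[i + 1][1] if i + 1 < len(sentence) else []
        has_noun_acc = any({"n", "acc"} <= r for r in readings)
        has_verb_past = any({"v", "past"} <= r for r in readings)
        # "Unambiguously a verb": there is a next word and all its readings are verbal.
        next_is_verb = bool(next_readings) and all("v" in r for r in next_readings)
        if has_noun_acc and has_verb_past and next_is_verb:
            # Select the accusative noun reading, discarding the others.
            readings = [r for r in readings if {"n", "acc"} <= r]
        result.append((surface, readings))
    return result

# Hypothetical two-word sentence: the first word is ambiguous, the second is a verb.
sentence = [
    ("wordA", [{"n", "acc"}, {"v", "past"}]),
    ("wordB", [{"v", "pres"}]),
]
disambiguated = apply_rule(sentence)
print(disambiguated[0])
```

Running a prototype like this over a handful of real sentences is a cheap way to find out whether a rule is consistent or only semi-consistent before writing it up as final.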
Implement the rules in CG
Expand coverage
Tweak morphophonology
Add lexemes in bulk
based on frequency lists, words in a corpus not covered, things that occur to you off the top of your head
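One simple way to get a list of "words in a corpus not covered" is to rank the unknown tokens in the analyser output by frequency and add the most frequent ones first. The sketch below assumes Apertium stream format output, where unknown words appear as ^word/*word$. The function name and sample data are invented for the example, and lowercasing the surface forms is a simplification that will not suit every language or every proper noun.

```python
import re
from collections import Counter

# One lexical unit: ^surface/analyses$ (simplified; escaped characters are not handled).
UNIT = re.compile(r"\^([^/$]*)/([^$]*)\$")

def unknown_by_frequency(stream, top=20):
    """Return the most frequent unanalysed surface forms as (form, count) pairs."""
    counts = Counter(
        surface.lower()
        for surface, analyses in UNIT.findall(stream)
        if analyses.startswith("*")  # the asterisk marks an unknown word
    )
    return counts.most_common(top)

# Hypothetical sample output: "blorp" is unknown twice, "zib" once.
sample = "^blorp/*blorp$ ^cats/cat<n><pl>$ ^Blorp/*Blorp$ ^zib/*zib$"
candidates = unknown_by_frequency(sample)
for form, count in candidates:
    print(f"{count}\t{form}")
```

The resulting list still mixes missing stems with forms that are missing only because of gaps in the morphology, so it is worth skimming by hand before adding entries in bulk.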