User:Firespeaker/Steps for writing a morphological transducer

From Apertium

This is meant as a short overview of what goes into designing a morphological transducer, from familiarising yourself with the structures of the language and thinking about them as a [computational] linguist, through to a production-ready transducer which can be used as a spell checker, in a translation pair, etc. More detail about each step can be found elsewhere on this wiki; this page is just meant as a way to think about what needs to be done in a larger-picture sort of way.

This outlines steps that should be followed more or less in order; however, it can be an iterative process, and sometimes you need to go ahead a step or two to figure out what you did wrong or missed a couple steps back.

Most of the steps or substeps here make good Google Code-in (GCI) tasks, at least when scoped with reasonable numbers: "add 100 noun stems to the transducer" is a good task, but "write morphophonology" isn't. Many of them, of course, can only be completed by people who know the language fairly well (either through study or from growing up speaking it).

In general, flipping through a grammar and a dictionary can get you quite a ways with a transducer, but unless you know the language really well, or have access to a native speaker of the language who doesn't mind answering questions about things, you're going to run into severe limitations in how much progress you can make.

Document Resources

Document Morphotactics

Phonology

Write clear rules for every condition, and document any variation.

Word classes

Not just nouns, verbs, adjectives, etc., but also the subtypes of each and how they pattern.

Decide what the best formalism is

Start writing morphophonology

Add some nouns

Add some noun morphology

For example, plural. Where do cases come in relation to plural?
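The ordering question can be prototyped before committing to a formalism. The following is a minimal Python sketch of continuation classes, loosely modelled on how lexc chains lexicons; it is not lexc syntax. The stem `kat`, the suffixes `lar` and `ni`, and the tag names are all invented for illustration, and the sketch assumes that case follows plural.

```python
# Toy continuation-class lexicon. Each entry is
# (surface suffix, analysis tags, name of the next lexicon or None to stop).
# All stems, suffixes and tags are invented for illustration.
LEXICONS = {
    "Root":   [("kat", "kat<n>", "Number")],
    "Number": [("", "", "Case"), ("lar", "<pl>", "Case")],
    "Case":   [("", "<nom>", None), ("ni", "<acc>", None)],
}


def expand(lexicon="Root", surface="", analysis=""):
    """Yield (surface, analysis) pairs by walking the continuation classes."""
    for suffix, tags, next_lexicon in LEXICONS[lexicon]:
        new_surface = surface + suffix
        new_analysis = analysis + tags
        if next_lexicon is None:
            yield new_surface, new_analysis
        else:
            yield from expand(next_lexicon, new_surface, new_analysis)


forms = dict(expand())
for surface, analysis in forms.items():
    print(f"{surface}\t{analysis}")
```

If case turned out to precede plural in your language, only the order of the `Number` and `Case` links would need to change, which is the point of deciding the morphotactics first.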

Add some verbs

Add some verbal morphology

For example, simple present tense.

Figure out more complex categories

Major evaluation

You should be evaluating all along, but at this point stop and do a major evaluation. Run your transducer over several large corpora (if available). What's missing? Major chunks of morphology? Some common words? Try to fill these in, and your coverage should jump by several percent.
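As a rough illustration of what "coverage" means here, the sketch below computes naive coverage from analyser output in Apertium stream format, where each token looks like `^surface/analysis$` and an unknown word is returned with a single reading starting with `*`. The function name and the sample words are made up, and the regular expression ignores escaped characters, so treat this as a sketch rather than a replacement for existing coverage scripts.

```python
import re

# One lexical unit in Apertium stream format: ^surface/reading1/reading2$
# (simplified: escaped characters are not handled)
LEXICAL_UNIT = re.compile(r"\^([^/$^]*)/([^$]*)\$")


def coverage(analysed_text):
    """Return the fraction of tokens that received at least one analysis."""
    total = 0
    known = 0
    for _surface, analyses in LEXICAL_UNIT.findall(analysed_text):
        total += 1
        # Unknown words come back with a single reading starting with '*'.
        if not analyses.startswith("*"):
            known += 1
    return known / total if total else 0.0


# Invented sample output: two analysed tokens, two unknown tokens.
sample = "^kat/kat<n><nom>$ ^blorp/*blorp$ ^katlar/kat<n><pl><nom>$ ^zib/*zib$"
print(f"{coverage(sample):.1%}")
```

Note that this counts a token as covered if it has any analysis at all, whether or not that analysis is correct.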

Disambiguation

This is a good point to stop and do a different type of evaluation. Run through the analyses of a couple of sentences (ideally several paragraphs' worth) and see if there are any words that aren't being analysed right. For this step, focus on words with multiple analyses where only one is correct in the context. Words without a correct analysis will also need work, but that's part of expanding coverage.
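To find the words worth looking at, it can help to list only the tokens that have more than one reading. Here is a small Python sketch that does this for analyser output in Apertium stream format (`^surface/reading1/reading2$`). The sample words and tags are invented, and escaped characters are not handled.

```python
import re

# One lexical unit in Apertium stream format: ^surface/reading1/reading2$
LEXICAL_UNIT = re.compile(r"\^([^/$^]*)/([^$]*)\$")


def ambiguous_tokens(analysed_text):
    """Return (position, surface, readings) for tokens with 2+ readings."""
    result = []
    units = LEXICAL_UNIT.findall(analysed_text)
    for position, (surface, analyses) in enumerate(units):
        readings = analyses.split("/")
        if len(readings) > 1:
            result.append((position, surface, readings))
    return result


# Invented example: 'katni' is ambiguous between a noun and a verb reading.
sample = "^katni/kat<n><acc>/katni<v><past>$ ^zu/zu<post>$"
for position, surface, readings in ambiguous_tokens(sample):
    print(position, surface, readings)
```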

Document some rules

The first step of setting up good disambiguation rules should really be to document the rules. Write some plain-text rules (e.g., on the wiki) that you can later implement. For example, is there a past tense form that sometimes makes a verb look like a similar-looking noun in the accusative case? Could this be consistently disambiguated based on the part of speech of the following word? What about the preceding word? If that doesn't work, is there a semi-consistent way to know whether one reading or the other is right? How about a way to know whether a reading is wrong? Could you implement a couple of these semi-consistent ways to get a very consistent overall disambiguation? Document everything you think will help.
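Before writing a rule in a real formalism, a documented rule can be checked against a few example sentences with a throwaway prototype. The Python sketch below tries out one hypothetical rule: "remove a past-tense verb reading if the next word can only be a postposition". The rule itself, the words and the tags are all invented for illustration; like constraint grammar's usual safeguard, the sketch never removes the last remaining reading of a word.

```python
def remove_past_before_postposition(sentence):
    """Drop <v><past> readings when the next token has only <post> readings.

    `sentence` is a list of (surface, readings) pairs. A token's readings are
    left untouched if removal would leave it with no readings at all.
    """
    result = []
    for i, (surface, readings) in enumerate(sentence):
        next_readings = sentence[i + 1][1] if i + 1 < len(sentence) else []
        next_is_post = bool(next_readings) and all(
            "<post>" in r for r in next_readings
        )
        kept = [r for r in readings if not ("<v>" in r and "<past>" in r)]
        if next_is_post and kept:
            readings = kept
        result.append((surface, list(readings)))
    return result


# Invented example: the verb reading of 'katni' is removed before 'zu'.
sentence = [
    ("katni", ["kat<n><acc>", "katni<v><past>"]),
    ("zu", ["zu<post>"]),
]
print(remove_past_before_postposition(sentence))
```

A prototype like this is only for testing whether the plain-text rule holds up on examples; the rule still has to be written properly afterwards.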

Implement the rules in CG

Expand coverage

Tweak morphophonology

Add lexemes in bulk

Base these on frequency lists, words in a corpus that aren't covered, and things that occur to you off the top of your head.
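One way to combine the first two sources is to rank the uncovered word forms in a corpus by frequency and add the most frequent ones first. The sketch below assumes a hypothetical `is_covered` callable that says whether the transducer analyses a given form; here it is faked with a small set. Tokenising with `\w+` and lowercasing everything are simplifying assumptions that will not suit every language or orthography.

```python
import re
from collections import Counter


def uncovered_by_frequency(corpus_text, is_covered):
    """Return (word form, count) pairs for uncovered forms, most frequent first."""
    tokens = re.findall(r"\w+", corpus_text.lower())
    counts = Counter(t for t in tokens if not is_covered(t))
    return counts.most_common()


# Stand-in for the transducer: a set of forms we pretend are already analysed.
known_forms = {"kat", "katlar"}
corpus = "Kat zib katlar zib blorp zib blorp kat"
print(uncovered_by_frequency(corpus, known_forms.__contains__))
```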