Infrastructure discussion

From Apertium
Revision as of 21:24, 24 February 2008 by Francis Tyers (talk | contribs) (→‎Example formats: better order)

Modularity

To what extent is the Apertium system modular? To be specific: in our work we do not use agglutinative paradigms, but cascaded finite-state transducers (the Sámi languages we analyse all have too much non-concatenative morphology to be treated with such methods). We use Xerox tools, but the open-source sfst tools would be a possible alternative. Also, our disambiguation and syntactic analysis do not use HMM, but rather constraint grammar (the vislcg variety). So the question is whether the system is modular enough that one or more Apertium components can be taken out and replaced with other components. Trondtr 14:03, 24 February 2008 (UTC).

The Apertium system is implemented as a Unix pipeline. Currently, most language pairs have the following components:
  • Deformat
  • Morphological analyser
  • POS Tagger
  • Transfer (syntactic + lexical)
  • Generation
  • Post-generation
  • Reformat
The more advanced pairs replace the "Transfer" stage with a three-stage transfer, which incorporates a chunker (rule-based) and the ability to move around chunks (NP, V, etc.) rather than lexical units.
All modules communicate using text streams, so it would definitely be possible to use, say, your morphological analysis and tagging together with our format management and transfer, e.g.:
  • Apertium deformat
  • sfst morphological analysis
  • VISL constraint grammar
  • Transfer (syntactic + lexical)
  • sfst morphological generation
  • Reformat
-Francis Tyers 14:20, 24 February 2008 (UTC)
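Because every module reads a text stream and writes a text stream, swapping a module amounts to swapping one command in the chain. The following is only a minimal sketch of that idea in Python. The helper names (`run_pipeline`, `placeholder_stage`) and the stages themselves are hypothetical stand-ins, not real Apertium, sfst or vislcg commands. It also runs the stages one after another, rather than concurrently as a shell pipe would.

```python
import subprocess
import sys


def run_pipeline(stages, text):
    """Feed `text` through each command in turn, like `a | b | c` in a shell.

    `stages` is a list of argument lists; each command must read stdin
    and write stdout, which is the only contract the modules share.
    """
    data = text
    for cmd in stages:
        result = subprocess.run(
            cmd, input=data, capture_output=True, text=True, check=True
        )
        data = result.stdout
    return data


def placeholder_stage(expr):
    """Build a stand-in command (hypothetical, for illustration only).

    A real setup would list the actual analyser, tagger, transfer and
    generator commands here instead.
    """
    return [sys.executable, "-c", f"import sys; sys.stdout.write({expr})"]


# Stand-in for "analysis": upper-case the stream.
analyse = placeholder_stage("sys.stdin.read().upper()")
# Stand-in for a replacement analyser with the same stdin/stdout contract.
other_analyse = placeholder_stage("sys.stdin.read().lower()")
# Stand-in for a later stage that is kept unchanged.
transfer = placeholder_stage("sys.stdin.read().replace(' ', '_')")

if __name__ == "__main__":
    print(run_pipeline([analyse, transfer], "a b"))
    # Replacing one module leaves the rest of the chain untouched.
    print(run_pipeline([other_analyse, transfer], "A B"))
```

The later stage never needs to know which analyser produced its input, provided the text format it receives is the one it expects.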
Incompatibilities between text formats can be taken care of either in the code (e.g. by adding a mode to SFST that outputs Apertium-style analyses) or with perl/python/awk scripts. - Francis Tyers 14:23, 24 February 2008 (UTC)

Example formats

xfst
orða	orð+N+Neu+Pl+Gen+Indef
orða	orða+V+Inf
orða	orða+V+Ind+Prs+1Sg
orða	orða+V+Imp+Sg
Apertium
^orða/orða<vblex><inf>/orð<n><nt><pl><gen><ind>$
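
A conversion script of the kind mentioned above could look roughly like the following Python sketch. It assumes the input is one analysis per line in the form surface, tab, lemma+Tag+Tag, and it groups all analyses of a surface form into one `^surface/analysis/analysis$` unit. The tag table holds only the correspondences that can be read off the two examples above. The names `TAG_MAP`, `convert_tag` and `xfst_to_apertium` are hypothetical, and the lower-casing fallback for unlisted tags is a placeholder, not a real Apertium tag mapping.

```python
# Tag correspondences read off the examples above; nothing else is known here.
TAG_MAP = {
    "N": "n",
    "Neu": "nt",
    "Pl": "pl",
    "Gen": "gen",
    "Indef": "ind",
    "V": "vblex",
    "Inf": "inf",
}


def convert_tag(tag):
    """Map one input tag to an Apertium-style tag.

    ASSUMPTION: tags missing from TAG_MAP are simply lower-cased as a
    placeholder; a real script would need a complete, checked table.
    """
    return TAG_MAP.get(tag, tag.lower())


def xfst_to_apertium(lines):
    """Convert 'surface<TAB>lemma+Tag+Tag' lines to Apertium-style units.

    Analyses are grouped by surface form, in order of first appearance.
    A lemma that itself contains '+' is not handled by this sketch.
    """
    analyses = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue
        surface, analysis = line.split("\t", 1)
        lemma, *tags = analysis.split("+")
        rendered = lemma + "".join(f"<{convert_tag(t)}>" for t in tags)
        analyses.setdefault(surface, []).append(rendered)
    return [
        "^" + surface + "/" + "/".join(readings) + "$"
        for surface, readings in analyses.items()
    ]


if __name__ == "__main__":
    sample = [
        "orða\torð+N+Neu+Pl+Gen+Indef",
        "orða\torða+V+Inf",
    ]
    for unit in xfst_to_apertium(sample):
        print(unit)
```

The readings come out in input order, so the noun reading precedes the verb reading here, whereas the Apertium example above lists the verb first.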