Talk:Matxin 1.0 New Language Pair HOWTO

It would be interesting to see how the Spanish-Basque pair works in principle.

How is the dictionary set up? How are the subject relations generated?

For example:

  • etxera joaten naiz: I am going home; Me voy a casa (bad: Etxera noa)
  • etxetik etortzen da: he comes from the house; él viene de la casa (bad: Hura etxetik dator)

What kind of generator is used, and how are the grammatical forms transferred from Spanish to Basque?

A step-by-step guide would help: what happens after the user enters Me voy a casa, and how does it get transferred to Basque?

Hey, I'm planning to add this. We've got to the end of the analysis stage (more or less), next up is the transfer stage. :) When I make a guide, I like to try and go as much as possible from start to finish. - Francis Tyers 23:31, 16 June 2009 (UTC)


él viene de la casa

<?xml version='1.0' encoding='UTF-8' ?>
<corpus>
<SENTENCE ord='1' alloc='0'>
<CHUNK ord='2' alloc='4' type='grup-verb' si='top'>
  <NODE ord='2' alloc='4' form='viene' lem='venir' mi='VMIP3S0'>
  </NODE>
  <CHUNK ord='1' alloc='0' type='s-adj' si='modnomatch'>
    <NODE ord='1' alloc='0' form='él' lem='él' mi='AQ0CS0'>
    </NODE>
  </CHUNK>
  <CHUNK ord='3' alloc='10' type='sp-de' si='sp-obj'>
    <NODE ord='3' alloc='10' form='de' lem='de' mi='SPS00'>
      <NODE ord='5' alloc='16' form='casa' lem='casa' mi='NCFS000'>
        <NODE ord='4' alloc='13' form='la' lem='el' mi='DA0FS0'>
        </NODE>
      </NODE>
    </NODE>
  </CHUNK>
</CHUNK>
</SENTENCE>
</corpus>


<?xml version='1.0' encoding='UTF-8'?>
<corpus >
<SENTENCE ref='1' alloc='0'>
<CHUNK ref='2' type='adi-kat' alloc='4' si='top'>
<NODE ref='2' alloc='4' UpCase='none' lem='_etorri_' mi='VMIP3S0' pos='[ADI][SIN]'>
</NODE>
<CHUNK ref='1' type='adjs' alloc='0' si='modnomatch'>
<NODE ref='1' alloc='0' UpCase='none' lem='él' parol='AQ0CS0' unknown='transfer'>
</NODE>
</CHUNK>
<CHUNK ref='3' type='post-izls' alloc='10' si='sp-obj'>
<NODE ref='3' alloc='10' UpCase='none' lem='' prep='de'>
<NODE ref='5' alloc='16' UpCase='none' lem='etxe' pos='[IZE][ARR]' mi='[NUMS]' sem='[BIZ-]'>
<NODE ref='4' alloc='13' UpCase='none' lem='' mi='[NUMS]'>
</NODE>
</NODE>
</NODE>
</CHUNK>
</CHUNK>
</SENTENCE>

</corpus>

<?xml version='1.0' encoding='UTF-8' ?>
<corpus >
<SENTENCE ref='1' alloc='0'>
<CHUNK ref='2' type='adi-kat' alloc='4' si='top'>
<NODE ref='2' alloc='4' UpCase='none' lem='_etorri_' mi='VMIP3S0' pos='[ADI][SIN]'>
</NODE>
<CHUNK ref='1' type='adjs' alloc='0' si='modnomatch'>
<NODE ref='1' alloc='0' UpCase='none' lem='él' parol='AQ0CS0' unknown='transfer'>
</NODE>
</CHUNK>
<CHUNK ref='3' type='post-izls' alloc='10' si='sp-obj'>
<NODE ref='3' alloc='10' UpCase='none' lem='' prep='de'>
<NODE ref='5' alloc='16' UpCase='none' lem='etxe' pos='[IZE][ARR]' mi='[NUMS]' sem='[BIZ-]'>
<NODE ref='4' alloc='13' UpCase='none' lem='' mi='[NUMS]'>
</NODE>
</NODE>
</NODE>
</CHUNK>
</CHUNK>
</SENTENCE>

</corpus>

<?xml version='1.0' encoding='UTF-8' ?>
<corpus >
<SENTENCE ref='1' alloc='0'>
<CHUNK ref='2' type='adi-kat' alloc='4' si='top' length='1' trans='DU' cas='[ABS]'>
<NODE ref='2' alloc='4' UpCase='none' lem='_etorri_' mi='VMIP3S0' pos='[ADI][SIN]'>
</NODE>
<CHUNK ref='1' type='adjs' alloc='0' si='modnomatch' length='1' cas='[ABS]'>
<NODE ref='1' alloc='0' UpCase='none' lem='él' parol='AQ0CS0' unknown='transfer'>
</NODE>
</CHUNK>
<CHUNK ref='3' type='post-izls' alloc='10' si='sp-obj' length='3' cas='[ABS]'>
<NODE ref='3' alloc='10' UpCase='none' lem='' prep='de'>
<NODE ref='5' alloc='16' UpCase='none' lem='etxe' pos='[IZE][ARR]' mi='[NUMS]' sem='[BIZ-]'>
<NODE ref='4' alloc='13' UpCase='none' lem='' mi='[NUMS]'>
</NODE>
</NODE>
</NODE>
</CHUNK>
</CHUNK>
</SENTENCE>

</corpus>

<?xml version='1.0' encoding='UTF-8' ?>
<corpus >
<SENTENCE ref='1' alloc='0'>
<CHUNK ref='2' type='adi-kat' alloc='4' si='top' length='1' trans='DU' cas='[ABS]'>
<NODE ref='2' alloc='4' UpCase='none' lem='_etorri_' mi='VMIP3S0' pos='[ADI][SIN]'>
</NODE>
<CHUNK ref='1' type='adjs' alloc='0' si='modnomatch' length='1' cas='[ABS]'>
<NODE ref='1' alloc='0' UpCase='none' lem='él' parol='AQ0CS0' unknown='transfer'>
</NODE>
</CHUNK>
<CHUNK ref='3' type='post-izls' alloc='10' si='sp-obj' length='3' cas='[ABS]'>
<NODE ref='3' alloc='10' UpCase='none' lem='' prep='de'>
<NODE ref='5' alloc='16' UpCase='none' lem='etxe' pos='[IZE][ARR]' mi='[NUMS]' sem='[BIZ-]'>
<NODE ref='4' alloc='13' UpCase='none' lem='' mi='[NUMS]'>
</NODE>
</NODE>
</NODE>
</CHUNK>
</CHUNK>
</SENTENCE>

</corpus>

<?xml version='1.0' encoding='UTF-8' ?>
<corpus >
<SENTENCE ref='1' alloc='0'>
<CHUNK ref='2' type='adi-kat' alloc='4' si='top' cas='[ABS]' trans='DU' length='1'>
<NODE ref='2' alloc='4' lem='etorri' pos='[NAG]' mi='[ADT][A1][NR_HU]'>
</NODE>
<CHUNK ref='1' type='adjs' alloc='0' si='modnomatch' cas='[ABS]' length='1'>
<NODE ref='1' alloc='0' UpCase='none' lem='él' parol='AQ0CS0' unknown='transfer'>
</NODE>
</CHUNK>
<CHUNK ref='3' type='post-izls' alloc='10' si='sp-obj' cas='[ABS]' length='3'>
<NODE ref='3' alloc='10' UpCase='none' lem='' prep='de'>
<NODE ref='5' alloc='16' UpCase='none' lem='etxe' pos='[IZE][ARR]' mi='[NUMS]' sem='[BIZ-]'>
<NODE ref='4' alloc='13' UpCase='none' lem='' mi='[NUMS]'>
</NODE>
</NODE>
</NODE>
</CHUNK>
</CHUNK>
</SENTENCE>

</corpus>

<?xml version='1.0' encoding='UTF-8' ?>
<corpus >
<SENTENCE ref='1' alloc='0'>
<CHUNK ref='2' type='adi-kat' alloc='4' si='top' cas='[ABS]' trans='DU' length='1'>
<NODE ref='2' alloc='4' lem='etorri' pos='[NAG]' mi='[ADT][A1][NR_HU]'>
</NODE>
<CHUNK ref='1' type='adjs' alloc='0' si='modnomatch' cas='[ABS]' length='1'>
<NODE ref='1' alloc='0' UpCase='none' lem='él' parol='AQ0CS0' unknown='transfer'>
</NODE>
</CHUNK>
<CHUNK ref='3' type='post-izls' alloc='10' si='sp-obj' cas='[ABS]' length='3'>
<NODE ref='3' alloc='10' UpCase='none' lem='' prep='de'>
<NODE ref='5' alloc='16' UpCase='none' lem='etxe' pos='[IZE][ARR]' mi='[NUMS]' sem='[BIZ-]'>
<NODE ref='4' alloc='13' UpCase='none' lem='' mi='[NUMS]'>
</NODE>
</NODE>
</NODE>
</CHUNK>
</CHUNK>
</SENTENCE>

</corpus>

<?xml version='1.0' encoding='UTF-8'?>
<corpus >
<SENTENCE ord='1' ref='1' alloc='0'>
<CHUNK ord='2' ref='2' type='adi-kat' alloc='4' si='top' cas='[ABS]' trans='DU' length='1'>
<NODE ref='2' alloc='4' lem='etorri' pos='[NAG]' mi='[ADT][A1][NR_HU]'>
</NODE>
<CHUNK ord='0' ref='1' type='adjs' alloc='0' si='modnomatch' cas='[ABS]' length='1'>
<NODE ref='1' alloc='0' UpCase='none' lem='él' parol='AQ0CS0' unknown='transfer'>
</NODE>
</CHUNK>
<CHUNK ord='1' ref='3' type='post-izls' alloc='10' si='sp-obj' cas='[ABS]' length='3'>
<NODE ref='3' alloc='10' UpCase='none' lem='' prep='de'>
<NODE ref='5' alloc='16' UpCase='none' lem='etxe' pos='[IZE][ARR]' mi='[NUMS]' sem='[BIZ-]'>
<NODE ref='4' alloc='13' UpCase='none' lem='' mi='[NUMS]'>
</NODE>
</NODE>
</NODE>
</CHUNK>
</CHUNK>
</SENTENCE>

</corpus>

<?xml version='1.0' encoding='UTF-8' ?>
<corpus >
<SENTENCE ord='1' ref='1' alloc='0'>
<CHUNK ord='2' ref='2' type='adi-kat' alloc='4' si='top' cas='[ABS]' trans='DU' length='1'>
<NODE ord='0' ref='2' alloc='4' lem='etorri' pos='[NAG]' mi='[ADT][A1][NR_HU]'>
</NODE>
<CHUNK ord='0' ref='1' type='adjs' alloc='0' si='modnomatch' cas='[ABS]' length='1'>
<NODE ord='0' ref='1' alloc='0' UpCase='none' lem='él' parol='AQ0CS0' unknown='transfer'>
</NODE>
</CHUNK>
<CHUNK ord='1' ref='3' type='post-izls' alloc='10' si='sp-obj' cas='[ABS]' length='3'>
<NODE ord='1' ref='3' alloc='10' UpCase='none' lem='' prep='de'>
<NODE ord='0' ref='5' alloc='16' UpCase='none' lem='etxe' pos='[IZE][ARR]' mi='[NUMS]' sem='[BIZ-]'>
<NODE ord='2' ref='4' alloc='13' UpCase='none' lem='' mi='[NUMS]'>
</NODE>
</NODE>
</NODE>
</CHUNK>
</CHUNK>
</SENTENCE>

</corpus>

<?xml version='1.0' encoding='UTF-8'?>
<corpus >
<SENTENCE ord='1' ref='1' alloc='0'>
<CHUNK ord='2' ref='2' type='adi-kat' alloc='4' si='top' cas='[ABS]' trans='DU' length='1'>
<NODE form='dator' ref ='2' alloc ='4' ord='0' lem='etorri' pos='[NAG]' mi='[ADT][A1][NR_HU]'>
</NODE>
<CHUNK ord='0' ref='1' type='adjs' alloc='0' si='modnomatch' cas='[ABS]' length='1'>
<NODE form='él' ref ='1' alloc ='0' ord='0' UpCase='none' lem='él' parol='AQ0CS0' unknown='transfer'>
</NODE>
</CHUNK>
<CHUNK ord='1' ref='3' type='post-izls' alloc='10' si='sp-obj' cas='[ABS]' length='3'>
<NODE form='' ref ='3' alloc ='10' ord='1' UpCase='none' lem='' prep='de'>
<NODE form='etxe' ref ='5' alloc ='16' ord='0' UpCase='none' lem='etxe' pos='[IZE][ARR]' mi='[NUMS]' sem='[BIZ-]'>
<NODE form='' ref ='4' alloc ='13' ord='2' UpCase='none' lem='' mi='[NUMS]'>
</NODE>
</NODE>
</NODE>
</CHUNK>
</CHUNK>
</SENTENCE>

</corpus>

Result: él etxe dator
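
As a rough illustration of how the final tree above could yield this result, here is a minimal Python sketch. It assumes (this is not confirmed by the Matxin documentation quoted here) that the 'ord' attribute on each CHUNK gives the chunk's position in the target sentence, that 'ord' on each NODE gives its position inside its chunk, and that nodes with an empty 'form' are skipped. The names SAMPLE, own_nodes and linearise are hypothetical, and the sample is a trimmed copy of the last XML dump.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the last pipeline output (only the attributes used here).
SAMPLE = """<corpus>
<SENTENCE ord='1' ref='1' alloc='0'>
<CHUNK ord='2' ref='2' type='adi-kat' si='top'>
<NODE form='dator' ref='2' ord='0' lem='etorri'/>
<CHUNK ord='0' ref='1' type='adjs' si='modnomatch'>
<NODE form='él' ref='1' ord='0' lem='él'/>
</CHUNK>
<CHUNK ord='1' ref='3' type='post-izls' si='sp-obj'>
<NODE form='' ref='3' ord='1' lem=''>
<NODE form='etxe' ref='5' ord='0' lem='etxe'>
<NODE form='' ref='4' ord='2' lem=''/>
</NODE>
</NODE>
</CHUNK>
</CHUNK>
</SENTENCE>
</corpus>"""


def own_nodes(chunk):
    """Yield the NODE elements of this chunk, without entering nested CHUNKs."""
    stack = [child for child in chunk if child.tag == "NODE"]
    while stack:
        node = stack.pop()
        yield node
        stack.extend(child for child in node if child.tag == "NODE")


def linearise(xml_text):
    """Join non-empty forms, ordering chunks and then nodes by 'ord' (assumed meaning)."""
    root = ET.fromstring(xml_text)
    words = []
    for sentence in root.iter("SENTENCE"):
        chunks = sorted(sentence.iter("CHUNK"), key=lambda c: int(c.get("ord")))
        for chunk in chunks:
            nodes = sorted(own_nodes(chunk), key=lambda n: int(n.get("ord")))
            words.extend(n.get("form") for n in nodes if n.get("form"))
    return " ".join(words)


print(linearise(SAMPLE))
```

Under those assumptions the sketch reproduces the word order of the result line, with the untranslated 'él' first and the verb last.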

There is an encoding error here, which is why it is not translating 'él'. - Francis Tyers 07:53, 17 June 2009 (UTC)
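
The untranslated word can also be spotted mechanically: in the transfer output above, the node for 'él' carries unknown='transfer' and keeps its Spanish lemma. The following is a small Python sketch for listing such nodes. The function name untranslated and the trimmed SAMPLE are hypothetical, and reading the 'unknown' attribute as "this stage could not handle the node" is an assumption based only on the trace.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the first transfer-stage output shown above.
SAMPLE = """<corpus>
<SENTENCE ref='1' alloc='0'>
<CHUNK ref='2' type='adi-kat' alloc='4' si='top'>
<NODE ref='2' alloc='4' UpCase='none' lem='_etorri_' mi='VMIP3S0' pos='[ADI][SIN]'/>
<CHUNK ref='1' type='adjs' alloc='0' si='modnomatch'>
<NODE ref='1' alloc='0' UpCase='none' lem='él' parol='AQ0CS0' unknown='transfer'/>
</CHUNK>
</CHUNK>
</SENTENCE>
</corpus>"""


def untranslated(xml_text):
    """Return (ref, lemma, stage) for every NODE that has an 'unknown' attribute."""
    root = ET.fromstring(xml_text)
    return [
        (node.get("ref"), node.get("lem"), node.get("unknown"))
        for node in root.iter("NODE")
        if node.get("unknown") is not None
    ]


for ref, lemma, stage in untranslated(SAMPLE):
    print(f"node {ref}: lemma {lemma!r} left unknown at stage {stage!r}")
```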

Exactly. Since él is entered as UTF-8, and the XML encoding is also set to UTF-8, the question is: why? Muki987 09:40, 17 June 2009 (UTC)

As mentioned on the Matxin list, Freeling does not currently support input in UTF-8. You can try using iconv before and after the Analyser stage. - Francis Tyers 10:36, 17 June 2009 (UTC)
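
To make the suggested round trip concrete, here is a minimal Python sketch of the same conversion that iconv would perform. It assumes the analyser expects ISO-8859-1 (Latin-1), which the comment above does not state. The analyser itself is replaced by a pass-through placeholder, and the helper names are hypothetical.

```python
def utf8_to_latin1(data: bytes) -> bytes:
    """Re-encode UTF-8 bytes as ISO-8859-1 (assumed analyser encoding)."""
    return data.decode("utf-8").encode("iso-8859-1")


def latin1_to_utf8(data: bytes) -> bytes:
    """Re-encode ISO-8859-1 bytes as UTF-8 for the later stages."""
    return data.decode("iso-8859-1").encode("utf-8")


sentence = "él viene de la casa"
utf8_bytes = sentence.encode("utf-8")

# Convert before the analyser stage.
analyser_input = utf8_to_latin1(utf8_bytes)

# Placeholder: the real analyser would run here; this sketch passes the bytes through.
analyser_output = analyser_input

# Convert back after the analyser stage.
restored = latin1_to_utf8(analyser_output).decode("utf-8")

# What a Latin-1 reader sees if it is handed the UTF-8 bytes directly.
misread = utf8_bytes.decode("iso-8859-1")

print(restored)
print(misread)
```

Without the conversion, the two UTF-8 bytes of 'é' are read as two separate Latin-1 characters, so a dictionary lookup for 'él' would not match.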

This idea does not make sense. Freeling receives the input and passes it on to the second stage as UTF-8. It might handle it incorrectly since it does not understand it, but that is not the point here. The second stage, the translation one, refuses to translate it even though it gets the data in UTF-8. That's the point. Matxin traces things quite well. Muki987 12:04, 17 June 2009 (UTC)

You have a problem with encoding, try emailing the Matxin list. - Francis Tyers 12:40, 17 June 2009 (UTC)

Low priority wishlist

  • Abandon the 'configuration file'; give simple programs simple command-line arguments.
  • Make every module output pretty-printed XML.
  • Convert simple tab-separated file formats to XML and write DTDs for easy validation.