Difference between revisions of "Development ideas for dictionary format"

Latest revision as of 08:17, 20 February 2012

This page collects ideas for expanding the Apertium .dix format so that it could be a drop-in replacement for lexc. The .dix format already has several advantages over lexc: easy validation, a more restrictive syntax, support for multiword queues, and built-in support for analysis/generation restrictions. The problem is that it lacks some useful features that lexc has, or supports them only awkwardly. It would also be desirable to standardise some typical lexc conventions, e.g. one way of writing the morpheme boundary instead of a hundred.

Archiphonemes

Perhaps use entities?

The option of just using <s> is pretty much out; it would look like this:

<e><p><l><s n="pron"/></l><r><s n="L"/><s n="A"/><s n="G"/><s n="I"/></r></p><par n="CASE"/></e>

for the lexc entry:

%<pron%>:%>%{L%}%{I%}%{K%}%{I%} CASE ;

Something like the following might be liveable:

<e><p><l><s n="pron"/></l><r>&L;&A;&G;&I;</r></p><par n="CASE"/></e>

The entities would then be converted by the compiler into {L}{A}{G}{I} tags?
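The conversion step could be sketched roughly as follows. This is only an illustration in Python, not how the lttoolbox compiler works: the function name `expand_archiphonemes` and the convention that a single uppercase letter marks an archiphoneme entity are assumptions made for the example.

```python
import re

# Assumption: an entity whose name is one uppercase letter (&L;, &A;, ...)
# stands for an archiphoneme. Standard XML entities such as &amp; or &lt;
# do not match this pattern and are left untouched.
ARCHIPHONEME_ENTITY = re.compile(r"&([A-Z]);")


def expand_archiphonemes(text: str) -> str:
    """Rewrite &X; entity references as {X} archiphoneme symbols."""
    return ARCHIPHONEME_ENTITY.sub(r"{\1}", text)


if __name__ == "__main__":
    # The right-hand side of the entity-based entry shown above.
    print(expand_archiphonemes("&L;&A;&G;&I;"))
```

In a real implementation the entities would more likely be declared in the DTD and resolved by the XML parser itself; the regular expression here only shows the intended mapping.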

Further reading: 2.5 Entities (http://flylib.com/books/en/4.384.1.22/1/)

Morpheme boundary

Current tags:

  • <a> = "alarm"
  • <s> = "symbol"
  • <b> = "blank"
  • <j> = "join"
  • <g> = "group"

Ideally the morpheme-boundary tag would also be a single letter.

Letters still available: c d f h k m n o q t u v w x y z
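The list of free letters can be checked mechanically. The sketch below assumes that, besides the five tags listed above, the single-letter element names e, i, l, p and r are already taken in .dix; that set is an assumption of this example, not something stated on this page.

```python
import string

# Assumption: single-letter element names already in use in .dix.
# a, s, b, j, g are the tags listed above; e, i, l, p, r are assumed
# to be the remaining single-letter elements of the format.
USED_LETTERS = set("asbjg") | set("eilpr")


def available_letters(used=USED_LETTERS):
    """Return the lowercase ASCII letters not yet used as element names."""
    return [c for c in string.ascii_lowercase if c not in used]


if __name__ == "__main__":
    print(" ".join(available_letters()))
```

Under that assumption the result is the same sixteen letters as in the list above.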

Flags


@P.NEG.0@   ! Set if -{I}š is not present
@P.NEG.1@   ! Set if -{I}š is present
@D.NEG.0@   ! Disallow if -{I}š is not present
@D.NEG.1@   ! Disallow if -{I}š is present

LEXICON V-PERS-PRES ! P_6

@D.NEG.1@%<p1%>%<sg%>:@D.NEG.1@%>м # ;
@D.NEG.1@%<p2%>%<sg%>:@D.NEG.1@%>с%{I%}ң # ;

LEXICON V-INFL-FINITE
%<aor%>:%>%{E%} V-PERS-PRES ;     

LEXICON V-INFL-COMMON-SECOND

V-INFL-FINITE ;
@D.NEG.1@:@D.NEG.1@ V-INFL-NON-FINITE ;

LEXICON V-INFL-COMMON

V-INFL-COMMON-SECOND ;
@P.NEG.1@:%>%{I%}ш@P.NEG.1@ V-INFL-COMMON-SECOND ; ! Dir/LR
%<coop%>:%>%{I%}ш V-INFL-COMMON-SECOND ;


<pardef n="V-INFL-COMMON">
  <e>        <par n="V-INFL-COMMON-SECOND"/></e>
  <e r="LR"><f n="neg"><p><l><m/>&I;ш</l><r><s n="coop"/></r></p></f></e>
  <e>       <p><l><m/>&I;ш</l><r><s n="coop"/></r></p></e>
</pardef>
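To make the effect of the flags in the lexc fragment concrete, here is a minimal Python sketch of how P ("set") and D ("disallow") flags are conventionally interpreted along a path. The function name `path_is_allowed` and the string encoding of paths are made up for this example; it is not part of lexc, HFST or lttoolbox.

```python
import re

# Matches flag diacritics of the form @P.FEATURE.VALUE@ or @D.FEATURE.VALUE@.
FLAG = re.compile(r"@([PD])\.([A-Z]+)\.(\w+)@")


def path_is_allowed(path: str) -> bool:
    """Walk the flags on a path from left to right.

    P sets the feature to the value; D blocks the path if the feature
    currently has that value.
    """
    features = {}
    for op, feature, value in FLAG.findall(path):
        if op == "P":
            features[feature] = value
        elif op == "D" and features.get(feature) == value:
            return False
    return True


if __name__ == "__main__":
    # Surface side of: V-INFL-COMMON (with -{I}ш) -> aorist -> 1sg ending.
    print(path_is_allowed(">{I}ш@P.NEG.1@>{E}@D.NEG.1@>м"))
    # Same path without the -{I}ш entry that sets NEG.
    print(path_is_allowed(">{E}@D.NEG.1@>м"))
```

The first path is rejected because @P.NEG.1@ sets NEG to 1 before @D.NEG.1@ is reached; the second passes because NEG was never set.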

Phonology

Further reading

  • Anssi Yli-Jyrä (2011) "Explorations on Positionwise Flag Diacritics in Finite-State Morphology". NODALIDA
    • This paper adds flag diacritics for implementing morphophonology in a single-tape finite-state transducer (i.e. one like lttoolbox, with no intersect/compose).