Difference between revisions of "Replacement for flag diacritics"

From Apertium
 
(9 intermediate revisions by 2 users not shown)
Line 1: Line 1:
 
People like to use [[flag diacritics]] for stuff. But they are bad because they are ugly and get in the way of stuff.
 
People like to use [[flag diacritics]] for stuff. But they are bad because they are ugly and get in the way of stuff.
   
Alternative: Use symbols and finite-state operations!
+
Alternative: Use distinct symbols with well defined behaviours and finite-state operations!
   
 
We have <code>&lt;</code> and <code>&gt;</code> for morphological tags, and <code>{</code> and <code>}</code> for [[archiphonemes]] and morphological features. We add a new type of symbol with <code>[</code> and <code>]</code> for modelling morphotactic restrictions.
 
We have <code>&lt;</code> and <code>&gt;</code> for morphological tags, and <code>{</code> and <code>}</code> for [[archiphonemes]] and morphological features. We add a new type of symbol with <code>[</code> and <code>]</code> for modelling morphotactic restrictions.
   
==Example==
+
==Examples==
  +
  +
===Turkish===
   
 
<pre>
 
<pre>
 
Multichar_Symbols
 
Multichar_Symbols
   
%<v%> %<cop%> %<tv%> %<aor%> %<prog%> %<p1% %<p3%> %<sg%>
+
%<v%> %<cop%> %<tv%> %<aor%> %<prog%> %<p1%> %<p3%> %<sg%>
   
 
%[%-aor%] %[%+aor%]
 
%[%-aor%] %[%+aor%]
   
%+
+
%+ ;
   
 
LEXICON Root
 
LEXICON Root
Line 31: Line 33:
 
LEXICON V-TV
 
LEXICON V-TV
   
%<v%>%<tv%>%<aor%>%[%+aor%]:ar PERS ;
+
%<v%>%<tv%>%<aor%>%[%+aor%]:ir PERS ;
%<v%>%<tv%>%<aor%>%[%+aor%]:ar COP ;
+
%<v%>%<tv%>%<aor%>%[%+aor%]:ir COP ;
 
%<v%>%<tv%>%<prog%>%[%-aor%]:iyor COP ;
 
%<v%>%<tv%>%<prog%>%[%-aor%]:iyor COP ;
   
Line 43: Line 45:
 
Alphabet
 
Alphabet
   
b i l m i y o r u m
+
a b c d e f g h i j k l m n o p q r s t u v w x y z
   
%<v%> %<tv%> %<prog%> %<aor%> %<p1%> %<p2%> %<p3%> %<sg%> %<cop%>
+
%<v%> %<tv%> %<prog%> %<aor%> %<p1%> %<p2%> %<p3%> %<sg%> %<cop%>
   
%[%+aor%]:0 %[%-aor%]:0
+
%[%+aor%]:0 %[%-aor%]:0 ;
 
;
 
   
 
Sets
 
Sets
Line 57: Line 57:
 
Rules
 
Rules
   
"No consecutive +aor tags"
+
"No consecutive [+aor] tags"
 
%[%+aor%]:0 /<= %[%+aor%]:0 :* _ ;
 
%[%+aor%]:0 /<= %[%+aor%]:0 :* _ ;
 
</pre>
 
</pre>
Line 68: Line 68:
 
biliyorim:bil<v><tv><prog>+i<cop><aor><p1><sg>
 
biliyorim:bil<v><tv><prog>+i<cop><aor><p1><sg>
 
biliyor:bil<v><tv><prog>+i<cop><aor><p3><sg>
 
biliyor:bil<v><tv><prog>+i<cop><aor><p3><sg>
bilarim:bil<v><tv><aor><p1><sg>
+
bilirim:bil<v><tv><aor><p1><sg>
bilar:bil<v><tv><aor><p3><sg>
+
bilir:bil<v><tv><aor><p3><sg>
   
 
</pre>
 
</pre>
  +
  +
===Persian===
  +
  +
<pre>
  +
Multichar_Symbols
  +
  +
%<v%> %<tv%> %<pri%> %<cni%> %<prs%> %<p1%> %<p3%> %<sg%>
  +
  +
%[%-prs%] %[%+prs%] %[%-cni%] %[%+cni%]
  +
  +
  +
%+
  +
  +
LEXICON Root
  +
  +
Prefix ;
  +
  +
LEXICON Prefix
  +
  +
%[%+prs%]%[%-cni%]:be Verbs ;
  +
%[%-prs%]%[%+cni%]:mi Verbs ;
  +
  +
%[%-prs%]%[%-cni%]: Verbs ;
  +
  +
LEXICON PERS
  +
  +
%<p1%>%<sg%>:im # ;
  +
%<p3%>%<sg%>: # ;
  +
  +
LEXICON V-TV
  +
  +
%<v%>%<tv%>%<cni%>%[%+cni%]%[%-prs%]: PERS ;
  +
%<v%>%<tv%>%<prs%>%[%-cni%]%[%+prs%]: PERS ;
  +
%<v%>%<tv%>%<pri%>%[%-cni%]%[%-prs%]: PERS ;
  +
  +
LEXICON Verbs
  +
  +
kardan:kard V-TV ; ! ""
  +
</pre>
  +
  +
<pre>
  +
$ cat prefix-const.twol
  +
  +
Alphabet
  +
  +
a b c d e f g h i j k l m n o p q r s t u v w x y z
  +
  +
%<v%> %<tv%> %<p1%> %<p3%> %<sg%> %<pri%> %<prs%> %<cni%>
  +
  +
%[%+prs%]:0 %[%-prs%]:0 %[%+cni%]:0 %[%-cni%]:0
  +
 
;
  +
  +
Sets
  +
  +
Verb = %<v%> ;
  +
  +
Rules
  +
  +
"Match prefixes"
  +
Tx:0 /<= Ty:0 :* _ ;
  +
where
  +
Tx in ( %[%+cni%] %[%+prs%] %[%-cni%] %[%-prs%] )
  +
Ty in ( %[%-cni%] %[%-prs%] %[%+cni%] %[%+prs%] ) matched ;
  +
</pre>
  +
  +
<pre>
  +
$ hfst-lexc prefix.lexc | hfst-invert -o prefix.hfst
  +
$ hfst-twolc prefix-const.twol -o prefix-const.hfst
  +
$ hfst-compose-intersect -1 prefix.hfst -2 prefix-const.hfst | hfst-fst2strings
  +
  +
bekardim:kardan<v><tv><prs><p1><sg>
  +
bekard:kardan<v><tv><prs><p3><sg>
  +
kardim:kardan<v><tv><pri><p1><sg>
  +
kard:kardan<v><tv><pri><p3><sg>
  +
mikardim:kardan<v><tv><cni><p1><sg>
  +
mikard:kardan<v><tv><cni><p3><sg>
  +
  +
</pre>
  +
  +
== See also ==
  +
* [[Morphotactic constraints with twol]]
  +
  +
[[Category:Development]]
  +
[[Category:Flag diacritics]]

Latest revision as of 22:04, 7 February 2017

People often use flag diacritics to model morphotactic restrictions, but they are ugly and get in the way of other processing.

Alternative: Use distinct symbols with well-defined behaviours and finite-state operations!

We have < and > for morphological tags, and { and } for archiphonemes and morphological features. We add a new type of symbol with [ and ] for modelling morphotactic restrictions.
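The three bracket types can be told apart mechanically. The following is a minimal Python sketch, not part of Apertium or HFST: the helper names `classify` and `restrictions` and the sample symbol `{A}` are assumptions made for illustration only.

```python
import re

# Hypothetical helper: one multichar symbol per bracket type, else one character.
SYMBOL = re.compile(r"<[^<>]+>|\{[^{}]+\}|\[[^\[\]]+\]|.")

# Bracket type -> role, following the convention described above.
KINDS = {"<": "tag", "{": "archiphoneme/feature", "[": "restriction"}


def classify(analysis):
    """Return (symbol, kind) pairs; single characters are labelled 'char'."""
    pairs = []
    for sym in SYMBOL.findall(analysis):
        kind = KINDS.get(sym[0], "char") if len(sym) > 1 else "char"
        pairs.append((sym, kind))
    return pairs


def restrictions(analysis):
    """Keep only the [..] restriction symbols, in order of appearance."""
    return [sym for sym, kind in classify(analysis) if kind == "restriction"]
```

Because the restriction symbols have their own bracket type, a constraint grammar can refer to them without touching the ordinary tags.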

Examples

Turkish

Multichar_Symbols

%<v%> %<cop%> %<tv%> %<aor%> %<prog%> %<p1%> %<p3%> %<sg%>

%[%-aor%] %[%+aor%]

%+ ;

LEXICON Root

Verbs ; 

LEXICON PERS

%<p1%>%<sg%>:im # ;
%<p3%>%<sg%>: # ;

LEXICON COP

%+i%<cop%>%<aor%>%[%+aor%]: PERS ;

LEXICON V-TV 

%<v%>%<tv%>%<aor%>%[%+aor%]:ir PERS ;
%<v%>%<tv%>%<aor%>%[%+aor%]:ir COP ;
%<v%>%<tv%>%<prog%>%[%-aor%]:iyor COP ;

LEXICON Verbs 

bil:bil V-TV ; ! ""

Alphabet

 a b c d e f g h i j k l m n o p q r s t u v w x y z 

 %<v%> %<tv%> %<prog%> %<aor%> %<p1%> %<p2%> %<p3%> %<sg%> %<cop%>

 %[%+aor%]:0  %[%-aor%]:0  ;

Sets 

Verb = %<v%> ;

Rules 

"No consecutive [+aor] tags"
%[%+aor%]:0 /<= %[%+aor%]:0 :* _ ; 

$ hfst-lexc test.lexc | hfst-invert -o test.hfst
$ hfst-twolc test-const.twol -o const.hfst
$ hfst-compose-intersect -1 test.hfst -2 const.hfst | hfst-fst2strings 

biliyorim:bil<v><tv><prog>+i<cop><aor><p1><sg>
biliyor:bil<v><tv><prog>+i<cop><aor><p3><sg>
bilirim:bil<v><tv><aor><p1><sg>
bilir:bil<v><tv><aor><p3><sg>
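
To see why only these four strings survive, the effect of the rule can be imitated outside HFST. This is a rough Python sketch under stated assumptions: the `LEXICON` table is a hand transcription of the lexc continuation classes above, and `paths`, `allowed`, `strip_restrictions` and `generate` are hypothetical names. It simplifies the twol rule to "at most one [+aor] per path" and does not call any HFST tool.

```python
import re

# Continuation classes transcribed by hand: (lexical, surface, next lexicon).
LEXICON = {
    "Root": [("", "", "Verbs")],
    "Verbs": [("bil", "bil", "V-TV")],
    "V-TV": [
        ("<v><tv><aor>[+aor]", "ir", "PERS"),
        ("<v><tv><aor>[+aor]", "ir", "COP"),
        ("<v><tv><prog>[-aor]", "iyor", "COP"),
    ],
    "COP": [("+i<cop><aor>[+aor]", "", "PERS")],
    "PERS": [("<p1><sg>", "im", "#"), ("<p3><sg>", "", "#")],
}


def paths(lexicon, state="Root", lexical="", surface=""):
    """Enumerate every (lexical, surface) path from Root to the end marker #."""
    if state == "#":
        yield lexical, surface
        return
    for lex, surf, nxt in lexicon[state]:
        yield from paths(lexicon, nxt, lexical + lex, surface + surf)


def allowed(lexical):
    """Simplified rule: a [+aor] may not follow an earlier [+aor]."""
    return lexical.count("[+aor]") < 2


def strip_restrictions(lexical):
    """Drop the [..] symbols, as the :0 pairs in the twol alphabet do."""
    return re.sub(r"\[[^\[\]]+\]", "", lexical)


def generate(lexicon):
    """Return surface:lexical strings for all paths that pass the rule."""
    return sorted(
        f"{surf}:{strip_restrictions(lex)}"
        for lex, surf in paths(lexicon)
        if allowed(lex)
    )
```

The aorist stem followed by the copula carries two [+aor] symbols, so that path is the one filtered out; the progressive stem carries [-aor] and may take the copula.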

Persian

Multichar_Symbols

%<v%> %<tv%> %<pri%> %<cni%> %<prs%> %<p1%> %<p3%> %<sg%>

%[%-prs%] %[%+prs%] %[%-cni%] %[%+cni%]


%+

LEXICON Root

Prefix ; 

LEXICON Prefix

%[%+prs%]%[%-cni%]:be Verbs ; 
%[%-prs%]%[%+cni%]:mi Verbs ; 

%[%-prs%]%[%-cni%]: Verbs ;

LEXICON PERS

%<p1%>%<sg%>:im # ;
%<p3%>%<sg%>: # ;

LEXICON V-TV 

%<v%>%<tv%>%<cni%>%[%+cni%]%[%-prs%]: PERS ;
%<v%>%<tv%>%<prs%>%[%-cni%]%[%+prs%]: PERS ;
%<v%>%<tv%>%<pri%>%[%-cni%]%[%-prs%]: PERS ;

LEXICON Verbs

kardan:kard V-TV ; ! ""

$ cat prefix-const.twol 

Alphabet

 a b c d e f g h i j k l m n o p q r s t u v w x y z 

 %<v%> %<tv%> %<p1%> %<p3%> %<sg%> %<pri%> %<prs%> %<cni%>

 %[%+prs%]:0  %[%-prs%]:0  %[%+cni%]:0  %[%-cni%]:0 

;

Sets 

Verb = %<v%> ;

Rules 

"Match prefixes"
Tx:0 /<= Ty:0 :* _ ; 
   where 
         Tx in ( %[%+cni%] %[%+prs%] %[%-cni%] %[%-prs%] )   
         Ty in ( %[%-cni%] %[%-prs%] %[%+cni%] %[%+prs%] )  matched ; 

$ hfst-lexc prefix.lexc | hfst-invert -o prefix.hfst
$ hfst-twolc prefix-const.twol -o prefix-const.hfst
$ hfst-compose-intersect -1 prefix.hfst -2 prefix-const.hfst | hfst-fst2strings 

bekardim:kardan<v><tv><prs><p1><sg>
bekard:kardan<v><tv><prs><p3><sg>
kardim:kardan<v><tv><pri><p1><sg>
kard:kardan<v><tv><pri><p3><sg>
mikardim:kardan<v><tv><cni><p1><sg>
mikard:kardan<v><tv><cni><p3><sg>
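
The "matched" rule pairs each restriction symbol with its opposite polarity, so a path is rejected as soon as a feature appears with both + and -. A small Python sketch of that idea follows, assuming a hand transcription of the lexc entries above; the tables and the names `consistent` and `generate` are hypothetical and nothing here uses HFST.

```python
import re
from itertools import product

# (lexical, surface) pairs transcribed by hand from the lexc fragment.
PREFIXES = [("[+prs][-cni]", "be"), ("[-prs][+cni]", "mi"), ("[-prs][-cni]", "")]
STEMS = [("kardan", "kard")]
TENSES = [
    ("<v><tv><cni>[+cni][-prs]", ""),
    ("<v><tv><prs>[-cni][+prs]", ""),
    ("<v><tv><pri>[-cni][-prs]", ""),
]
PERSONS = [("<p1><sg>", "im"), ("<p3><sg>", "")]

RESTRICTION = re.compile(r"\[([+-])([^\[\]]+)\]")


def consistent(lexical):
    """Reject a path if any feature occurs with both polarities."""
    seen = {}
    for polarity, feature in RESTRICTION.findall(lexical):
        if seen.setdefault(feature, polarity) != polarity:
            return False
    return True


def generate():
    """Return surface:lexical strings for every consistent combination."""
    results = []
    for parts in product(PREFIXES, STEMS, TENSES, PERSONS):
        lexical = "".join(lex for lex, _ in parts)
        surface = "".join(surf for _, surf in parts)
        if consistent(lexical):
            results.append(f"{surface}:{RESTRICTION.sub('', lexical)}")
    return sorted(results)
```

Of the nine prefix and tense combinations only three agree on both features, which with two person endings gives the six strings listed above.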

See also

Morphotactic constraints with twol