Replacement for flag diacritics

Latest revision as of 22:04, 7 February 2017

People like to use flag diacritics to enforce morphotactic restrictions. But they are ugly, and they get in the way of ordinary finite-state operations.

Alternative: use distinct symbols with well-defined behaviours, and enforce the restrictions with finite-state operations!

We have < and > for morphological tags, and { and } for archiphonemes and morphological features. We add a new type of symbol with [ and ] for modelling morphotactic restrictions.

Examples

Turkish

Multichar_Symbols

%<v%> %<cop%> %<tv%> %<aor%> %<prog%> %<p1%> %<p3%> %<sg%>

%[%-aor%] %[%+aor%]

%+ ;

LEXICON Root

Verbs ; 

LEXICON PERS

%<p1%>%<sg%>:im # ;
%<p3%>%<sg%>: # ;

LEXICON COP

%+i%<cop%>%<aor%>%[%+aor%]: PERS ;

LEXICON V-TV 

%<v%>%<tv%>%<aor%>%[%+aor%]:ir PERS ;
%<v%>%<tv%>%<aor%>%[%+aor%]:ir COP ;
%<v%>%<tv%>%<prog%>%[%-aor%]:iyor COP ;

LEXICON Verbs 

bil:bil V-TV ; ! ""

$ cat test-const.twol

Alphabet

 a b c d e f g h i j k l m n o p q r s t u v w x y z 

 %<v%> %<tv%> %<prog%> %<aor%> %<p1%> %<p2%> %<p3%> %<sg%> %<cop%>

 %[%+aor%]:0  %[%-aor%]:0  ;

Sets 

Verb = %<v%> ;

Rules 

"No consecutive [+aor] tags"
%[%+aor%]:0 /<= %[%+aor%]:0 :* _ ;

$ hfst-lexc test.lexc | hfst-invert -o test.hfst
$ hfst-twolc test-const.twol -o const.hfst
$ hfst-compose-intersect -1 test.hfst -2 const.hfst | hfst-fst2strings 

biliyorim:bil<v><tv><prog>+i<cop><aor><p1><sg>
biliyor:bil<v><tv><prog>+i<cop><aor><p3><sg>
bilirim:bil<v><tv><aor><p1><sg>
bilir:bil<v><tv><aor><p3><sg>
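To see what the rule is doing, here is a rough sketch in plain Python (not HFST, and not how twolc works internally). It lists the six paths through test.lexc by hand, throws away any analysis in which a [+aor] tag has an earlier [+aor] tag somewhere to its left, and then deletes the bracketed tags, as the :0 pairs in the Alphabet do. The names CANDIDATES, allowed, strip_tags and constrained are made up for this illustration.

```python
import re

# Matches the bracketed constraint symbols, e.g. [+aor] or [-aor].
CONSTRAINT_TAG = re.compile(r"\[[+-][a-z0-9]+\]")

# (surface, analysis) for every path through test.lexc, written out by hand.
CANDIDATES = [
    ("biliyorim", "bil<v><tv><prog>[-aor]+i<cop><aor>[+aor]<p1><sg>"),
    ("biliyor",   "bil<v><tv><prog>[-aor]+i<cop><aor>[+aor]<p3><sg>"),
    ("bilirim",   "bil<v><tv><aor>[+aor]<p1><sg>"),
    ("bilir",     "bil<v><tv><aor>[+aor]<p3><sg>"),
    ("bilirim",   "bil<v><tv><aor>[+aor]+i<cop><aor>[+aor]<p1><sg>"),
    ("bilir",     "bil<v><tv><aor>[+aor]+i<cop><aor>[+aor]<p3><sg>"),
]


def allowed(analysis):
    """Mimic the rule: no [+aor] may have another [+aor] before it."""
    tags = CONSTRAINT_TAG.findall(analysis)
    return tags.count("[+aor]") <= 1


def strip_tags(analysis):
    """Mimic the [..]:0 pairs: constraint symbols are realised as nothing."""
    return CONSTRAINT_TAG.sub("", analysis)


def constrained(candidates):
    """Keep the allowed paths and format them like hfst-fst2strings output."""
    return [
        f"{surface}:{strip_tags(analysis)}"
        for surface, analysis in candidates
        if allowed(analysis)
    ]


if __name__ == "__main__":
    for line in constrained(CANDIDATES):
        print(line)
```

The two paths that go through both the aorist suffix and COP carry two [+aor] tags, so those are the ones that get filtered out. The four pairs that remain are the ones listed above.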

Persian

Multichar_Symbols

%<v%> %<tv%> %<pri%> %<cni%> %<prs%> %<p1%> %<p3%> %<sg%>

%[%-prs%] %[%+prs%] %[%-cni%] %[%+cni%]


%+

LEXICON Root

Prefix ; 

LEXICON Prefix

%[%+prs%]%[%-cni%]:be Verbs ; 
%[%-prs%]%[%+cni%]:mi Verbs ; 

%[%-prs%]%[%-cni%]: Verbs ;

LEXICON PERS

%<p1%>%<sg%>:im # ;
%<p3%>%<sg%>: # ;

LEXICON V-TV 

%<v%>%<tv%>%<cni%>%[%+cni%]%[%-prs%]: PERS ;
%<v%>%<tv%>%<prs%>%[%-cni%]%[%+prs%]: PERS ;
%<v%>%<tv%>%<pri%>%[%-cni%]%[%-prs%]: PERS ;

LEXICON Verbs

kardan:kard V-TV ; ! ""

$ cat prefix-const.twol

Alphabet

 a b c d e f g h i j k l m n o p q r s t u v w x y z 

 %<v%> %<tv%> %<p1%> %<p3%> %<sg%> %<pri%> %<prs%> %<cni%>

 %[%+prs%]:0  %[%-prs%]:0  %[%+cni%]:0  %[%-cni%]:0 

;

Sets 

Verb = %<v%> ;

Rules 

"Match prefixes"
Tx:0 /<= Ty:0 :* _ ; 
   where 
         Tx in ( %[%+cni%] %[%+prs%] %[%-cni%] %[%-prs%] )   
         Ty in ( %[%-cni%] %[%-prs%] %[%+cni%] %[%+prs%] )  matched ;

$ hfst-lexc prefix.lexc | hfst-invert -o prefix.hfst
$ hfst-twolc prefix-const.twol -o prefix-const.hfst
$ hfst-compose-intersect -1 prefix.hfst -2 prefix-const.hfst | hfst-fst2strings 

bekardim:kardan<v><tv><prs><p1><sg>
bekard:kardan<v><tv><prs><p3><sg>
kardim:kardan<v><tv><pri><p1><sg>
kard:kardan<v><tv><pri><p3><sg>
mikardim:kardan<v><tv><cni><p1><sg>
mikard:kardan<v><tv><cni><p3><sg>
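The "Match prefixes" rule pairs each constraint symbol with its opposite, so a symbol is rejected whenever the opposite-polarity symbol for the same feature occurs anywhere before it. Here is a rough sketch of that behaviour in plain Python (again, an illustration only, not what HFST does internally). It builds all 18 paths through prefix.lexc, keeps those where every feature has a single polarity throughout, and strips the bracketed symbols. The names PREFIXES, TENSES, PERSONS, allowed and constrained are made up for this illustration.

```python
import re
from itertools import product

# Captures polarity and feature name of symbols such as [+prs] or [-cni].
CONSTRAINT_TAG = re.compile(r"\[([+-])([a-z0-9]+)\]")

# (surface, analysis) pieces copied from the lexicons in prefix.lexc.
PREFIXES = [("be", "[+prs][-cni]"), ("mi", "[-prs][+cni]"), ("", "[-prs][-cni]")]
TENSES = [
    ("", "<v><tv><cni>[+cni][-prs]"),
    ("", "<v><tv><prs>[-cni][+prs]"),
    ("", "<v><tv><pri>[-cni][-prs]"),
]
PERSONS = [("im", "<p1><sg>"), ("", "<p3><sg>")]


def allowed(analysis):
    """Mimic the rule: no symbol may follow its opposite-polarity twin."""
    seen = {}  # feature name -> polarity of its first occurrence
    for polarity, feature in CONSTRAINT_TAG.findall(analysis):
        if seen.setdefault(feature, polarity) != polarity:
            return False
    return True


def constrained():
    """Enumerate every path, filter it, and format it like hfst-fst2strings."""
    lines = []
    for prefix, tense, person in product(PREFIXES, TENSES, PERSONS):
        surface = prefix[0] + "kard" + tense[0] + person[0]
        analysis = prefix[1] + "kardan" + tense[1] + person[1]
        if allowed(analysis):
            lines.append(f"{surface}:{CONSTRAINT_TAG.sub('', analysis)}")
    return lines


if __name__ == "__main__":
    for line in constrained():
        print(line)
```

Of the 18 paths only 6 survive: be with the prs suffix, mi with the cni suffix, and the empty prefix with the pri suffix, each in two persons. These are the same six pairs as above, although not necessarily in the same order.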

See also

* Morphotactic constraints with twol