Dependency parsing for Turkic (Apertium wiki; last edited by Firespeaker, 2024-03-06; edit summary for /* A potential option for Turkic */: "in Kyrgyz orthography, not Kazakh...")
<hr />
<div>{{TOCD}}<br />
<br />
==Introduction==<br />
<br />
<br />
For the first version we intend as shallow an analysis as the standard allows. E.g. different kinds of <code>nmod</code> (possession, adverbials, etc.) will not be distinguished. For later versions we intend to deepen the analysis.<br />
<br />
==<code>acl</code>: Clausal modifier of a noun==<br />
<br />
<code>acl</code> stands for finite and non-finite clauses that modify a nominal. The <code>acl</code> relation contrasts with the <code>advcl</code> relation, which is used for adverbial clauses that modify a predicate. The head of the <code>acl</code> relation is the noun that is modified, and the dependent is the head of the clause that modifies the noun.<br />
<br />
In Turkic, <code>acl</code> will often be used for (relative) clauses headed by verbal adjectives (<code>gpr_</code>).<br />
<br />
<pre><br />
___acl___<br />
| |<br />
Үйге жүгіретін адам мені шошытты.<br />
to.home run.GPR man me.ACC startled.<br />
<br />
"The man running home startled me." / "The man who is running home startled me."<br />
<br />
</pre><br />
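For concreteness, the tree above can be written out as (dependent, head, relation) triples over 1-based token indices, in the style of CoNLL-U. This is only an illustrative sketch: the arcs not drawn in the diagram (the dative nominal and the main-clause arguments) carry assumed labels, not ones prescribed above.<br />
<br />
```python
# Hypothetical encoding of the acl example as (dependent, head, relation)
# triples, 1-based token indices, 0 = root (CoNLL-U style).
# Tokens: 1 Үйге  2 жүгіретін  3 адам  4 мені  5 шошытты
def acl_example_arcs():
    return [
        (1, 2, "nmod"),  # Үйге -> жүгіретін (assumed label)
        (2, 3, "acl"),   # the gpr_ participle modifies the noun адам
        (3, 5, "subj"),  # адам -> шошытты (assumed)
        (4, 5, "obj"),   # мені -> шошытты (assumed)
        (5, 0, "root"),
    ]
```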
<br />
===Gerunds in indefinite genitive===<br />
<br />
We also use <code>acl</code> for gerunds that modify nouns.<br />
<br />
<pre><br />
______acl_______<br />
| |<br />
Әр адамның бейбіт жиналыстар және ассоциацияларды құру бостандығына құқығы бар.<br />
every man.GEN peaceful assemblies and association.PL.ACC build.GER.GEN freedom.SG3.DAT right.SG3 existing.<br />
<br />
"Every person has the right to freedom of [building associations] and [peaceful assemblies]."<br />
</pre><br />
<br />
===Conditionals with болса ===<br />
<br />
<br />
<br />
===Secondary predication===<br />
<br />
This relation is also used for optional depictives. The adjective is taken to modify the nominal of which it provides a secondary predication. See <code>xcomp</code> for further discussion of resultatives and depictives.<br />
<br />
==<code>advcl</code>: adverbial clause modifier==<br />
<br />
Adverbial clause modifiers (<code>advcl</code>) are subordinate clauses that are not complements. Non-complement infinitival or temporal clauses, as well as non-complement participles modifying verbs, are also marked as <code>advcl</code>.<br />
<br />
In Turkic, verbal adverbs (<code>gna_</code>) will take this label if they modify a main verb.<br />
<br />
<pre> <br />
________________________advcl_____________________________<br />
| |<br />
Ном номчааш, ол кижиниң чуртталгазын шуптузун билип алдым. <br />
book.ACC read.GNA.PAST, that person.GEN life.3.ACC all.3.ACC know.PRC.PERF make.PAST.1SG <br />
<br />
________________________advcl__________________________________<br />
| |<br />
Китапны укыгач, ул кешенең тормышы турында барысын да белдем.<br />
book.ACC read.GNA.PAST, that person.GEN life.3.NOM about all.3.ACC know.PAST.1SG<br />
<br />
"Having read the book, I found out everything about that person's life."<br />
</pre><br />
<br />
Note that unless there is a separate subject for the "subordinate" clause, the subject will be the same as for the main clause, but is not directly connected.<br />
<br />
===Comparison===<br />
<br />
We also use <code>advcl</code> for the comparator in comparison constructions like "X is bigger than Y". In Turkic, the "than Y" element is in the ablative case and depends on the adjective X.<br />
<br />
<pre><br />
<br />
<br />
</pre><br />
<br />
==<code>advmod</code>: adverb modifier==<br />
<br />
The dependency type <code>advmod</code> is used for adverb modifiers of verbs, nominals, and adverbs alike.<br />
<br />
<pre><br />
_advmod_<br />
| |<br />
Света келзе, мени удавас чедип келир деп дамчыдыңар.<br />
Sveta come.COND, I.ACC soon reach.PRC.PERF come.PRC.AOR COMP tell.IMP.PL<br />
<br />
__advmod_<br />
| |<br />
Света килсә, <name of the speaker> тиздән кайтып җитә деп әйтегез.<br />
Sveta come.COND, soon return.PRES.3sg. COMP tell.IMP.PL<br />
<br />
"If Sveta comes, tell her I'll return soon."<br />
</pre><br />
<br />
==<code>amod</code>: adjectival modifier==<br />
<br />
Nouns may take adjectival modifiers, which are marked with the dependency type <code>amod</code>. It is also possible for an adjective to take another adjective as a modifier. (Adjectives modifying other adjectives are generally expressed with -ly adverbs in English.)<br />
<br />
<pre><br />
_____amod_____<br />
| |<br />
Мергенде солун номнар бар.<br />
Mergen.LOC interesting book.PL existing.<br />
<br />
_____amod_____<br />
| |<br />
Мәргәндә кызыклы китаплар бар.<br />
Mergen.LOC interesting book.PL existing.<br />
<br />
"Mergen has some interesting books."<br />
</pre><br />
<br />
The label <code>amod</code> is also used for ordinal numbers, which may not be overtly marked as ordinals when rendered in digits.<br />
<br />
<pre><br />
___amod__ ___nmod_______<br />
| | | |<br />
1968 жылдан бастап Ширазда театр фестивалы өткізіліп тұрды.<br />
1968th year.ABL starting Shiraz.LOC theatre festival.3sg take.place.CAUS AUX.<br />
</pre><br />
<br />
It is also used for locative nouns in -DAGI.<br />
<br />
<pre><br />
<br />
<br />
</pre><br />
<br />
(Note: This is a provisional classification, pending discussion.)<br />
<br />
==*<code>appos</code>: appositional modifier==<br />
<br />
An appositional modifier of a noun is a nominal immediately following the first noun that serves to define or modify that noun. This includes material in parentheses, as well as defining abbreviations.<br />
<br />
<pre><br />
<br />
</pre><br />
<br />
==*<code>aux</code>: auxiliary==<br />
<br />
An auxiliary of a clause is a non-main verb of the clause, e.g. one of тур-, кел-, ал-, etc. When a verb is used as an auxiliary, the main verb is the participle (<code>prc_</code>).<br />
<br />
<pre><br />
<br />
</pre><br />
<br />
==<code>auxpass</code>: {{sc|unused}}==<br />
<br />
==<code>case</code>: case==<br />
<br />
The dependency type <code>case</code> is used for the postposition in postpositional phrases. The head of a postpositional phrase is the nominal, not the postposition, so that postpositional phrases are analysed similarly to nominal modifiers without a postposition (e.g. when local "cases" are used). To the same end, the type <code>case</code> is used in combination with the type <code>nmod</code>, which is also used for nominal modifiers when no adposition is present (see <code>nmod</code>).<br />
<br />
Note that <code>case</code> is not used with auxiliary nouns (sometimes called "postpositions") in the form of N¹.{{sc|gen}} N².{{sc|poss.case}}, for those <code>nmod</code> should be used (following treatment in English of prepositional constructions like "in front of").<br />
<br />
<pre><br />
_case__<br />
| |<br />
Meн кадайым-биле киноже чорук баар мен.<br />
I wife-with cinema.ALL go AUX.AOR.<br />
</pre><br />
<br />
<pre><br />
______case________<br />
| |<br />
Бис бүгү чүвени сээң чугаалааның ёзугаар кылган бис<br />
We all thing.ACC you.GEN say.GER.2SG.NOM according.to do.PAST.2PL<br />
</pre><br />
<br />
==<code>cc</code>: coordinating conjunction==<br />
<br />
For more on coordination, see the <code>conj</code> relation. A <code>cc</code> is the relation between the first conjunct and the coordinating conjunction delimiting another conjunct. (Note: different dependency grammars treat coordination differently; we take the last conjunct as the head of the coordination.)<br />
<br />
<pre><br />
___________________________conj_______<br />
| ____________cc__________________ | <br />
| | _______conj_____ | |<br />
| | | __cc__ | | |<br />
| | | | | | | |<br />
Барлық адамдар тумысынан азат және қадір-қасиеті мен құқықтары тең болып дүниеге келеді <br />
All people free and dignity and rights equal being world.DAT come.PAST<br />
<br />
"All people are born free and equal in dignity and rights."<br />
</pre><br />
<br />
==*<code>ccomp</code>: clausal complement==<br />
<br />
===Non-finite complements (with <code>acc</code>)===<br />
<br />
<pre> <br />
_____subj__ _______ccomp______ ROOT<br />
| | | | |<br />
Кейінірек ФИФА ротация принципі өзгеретінін жариялады<br />
Later FIFA rotation principle change.IMPF.ACC declare.PAST<br />
<br />
"Later FIFA declared that the rotation principle was changing."<br />
</pre><br />
<br />
===Reported speech (with де-)===<br />
<br />
<br />
<pre><br />
________conj______ _____ccomp________<br />
| || |<br />
«Төрге шық, тамақ іш », - демепті.<br />
place.DAT go.IMP food.NOM drink.IMP say.NEG.IFI.EVID.3SG<br />
<br />
"'Go back to the tör and eat!', they did not say."<br />
</pre><br />
<br />
==<code>cmpnd</code>: compound==<br />
<br />
<code>cmpnd</code> is used for noun compounds. Nouns should modify the next noun in the compound in order to respect the branching structure.<br />
<br />
Most uses of <code>attr</code> will be tagged with <code>cmpnd</code>:<br />
<br />
Nouns in the izafet construction (e.g. possessive on the final noun) should not get the <code>cmpnd</code> tag.<br />
<br />
<pre><br />
__cmpnd_ ___cmpnd___<br />
| | | |<br />
Мартан-оол март айдан сентябрь айга чедир Кызылга чурттап турган . <br />
Martan-ool March month.ABL September month.DAT until Kyzyl.DAT live.PRC.PERF sit.PAST<br />
<br />
Мартан-оол март аеннан сентябрь аена кадәр Кызылда яшәгән . <br />
Martan-ool March month.3.ABL September month.3.DAT until Kyzyl.LOC live.PAST.3SG<br />
(Note the third-person possessives, hence no cmpnd labels.)<br />
<br />
"Martan-ool was living in Kyzyl from March until September"<br />
</pre><br />
<br />
The <code>cmpnd</code> label should also be used for strings of numerals:<br />
<br />
<pre><br />
<br />
<br />
</pre><br />
<br />
==<code>conj</code>: conjunct==<br />
<br />
A conjunct is the relation between two elements connected by a coordinating conjunction, such as and, or, etc. We treat conjunctions asymmetrically: The head of the relation is the last conjunct and all the other conjuncts depend on it via the conj relation. Note that this differs from the UD practice of putting the head as the ''first'' conjunct. See [https://github.com/UniversalDependencies/docs/issues/189 here] for a discussion on this.<br />
<br />
<pre><br />
___________________________conj_______<br />
| ____________cc__________________ | <br />
| | _______conj_____ | |<br />
| | | __cc__ | | |<br />
| | | | | | | |<br />
Барлық адамдар тумысынан азат және қадір-қасиеті мен құқықтары тең болып дүниеге келеді <br />
All people free and dignity and rights equal being world.DAT come.PAST<br />
<br />
"All people are born free and equal in dignity and rights."<br />
</pre><br />
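Since most UD tooling expects head-first coordination, the two conventions can be converted mechanically. The sketch below assumes a minimal arc representation (a dict mapping each dependent id to a (head, relation) pair, invented for illustration); it handles flat conj chains only and leaves cc attachment untouched.<br />
<br />
```python
def conj_head_last_to_first(arcs):
    """Convert flat conj chains from the head-last convention used here
    to UD's head-first one. arcs: {dep: (head, rel)}, 1-based ids."""
    out = dict(arcs)
    groups = {}  # head-last conjunct -> its earlier conjuncts
    for dep, (head, rel) in arcs.items():
        if rel == "conj":
            groups.setdefault(head, []).append(dep)
    for last, deps in groups.items():
        first = min(deps)
        out[first] = arcs[last]           # external attachment moves to first
        for d in deps + [last]:
            if d != first:
                out[d] = (first, "conj")  # everything else attaches to first
    return out
```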
<br />
'''Warning''': If two sentences are joined with a comma and there is no relation between them, the relation should be <code>parataxis</code>.<br />
<br />
==<code>cop</code>: copula==<br />
<br />
A copula is the relation between the complement of a copular verb and the copular verb itself. (We normally take a copula as a dependent of its complement.)<br />
<br />
The copula is not treated as the head of a clause, but rather as a dependent of the lexical predicate.<br />
<br />
In Turkic the copula is either бол or э. Third person copula forms in the present tense are not shown in the surface forms, but may be included by morphological analysers. In the following, <code>·</code> denotes a contraction boundary which is not present in the orthography:<br />
<br />
;Aorist copula (-Ø suffix):<br />
<br />
<pre> <br />
ROOT<br />
|<br />
_subj_ |<br />
| | |<br />
Меңээ ном херек<br />
I.DAT book necessary<br />
<br />
ROOT<br />
|<br />
__subj_ | cop<br />
| | | | |<br />
Меңээ ном херек·ø<br />
I.DAT book necessary·is<br />
<br />
"I need a book"<br />
<br />
</pre><br />
<br />
;Existentials with "бар" and "чок":<br />
<br />
<pre><br />
<br />
cop<br />
| |<br />
Бо бажыңда он үш квартира бар·ø <br />
This house.LOC ten three flat existing.are<br />
</pre><br />
<br />
;Aorist copula (with personal suffix):<br />
<br />
<pre><br />
_cop_<br />
| |<br />
Кызылга ынак мен<br />
Kyzyl.DAT favourite.am<br />
<br />
"Kyzyl is my favourite"<br />
</pre><br />
<br />
;Aorist evidential copula (-DIr suffix):<br />
<br />
<pre><br />
ROOT<br />
| _______cop_____<br />
| | |<br />
__det__ __nmod_ | | ____subj___ |<br />
| | | | | | | | |<br />
Бо институттуң директору Мерген·дир<br />
This institute.GEN director.3SG Mergen·is<br />
</pre><br />
<br />
<br />
;Freestanding copula with "бол":<br />
<br />
<pre><br />
<br />
|<br />
Мээң аас-кежик чогумдан шупту чүве болган .<br />
I.GEN happiness not.1SG.ABL all thing was<br />
<br />
"all of my troubles were due to the fact that I have no joy."<br />
</pre><br />
<br />
;Use of "бол" without predicate.<br />
<br />
<pre><br />
ROOT<br />
|<br />
Эрте заманда Эрназар деген киши болуптур.<br />
</pre><br />
<br />
;Subjectless use of "бол"<br />
<br />
<pre><br />
__ccomp_ ROOT<br />
| | | <br />
шылым шегуге болмайды .<br />
</pre><br />
<br />
==<code>csubj</code>: {{sc|unused}}==<br />
<br />
==<code>csubjpass</code>: {{sc|unused}}==<br />
<br />
<br />
==*<code>x</code> (<code>dep</code>): unspecified dependency==<br />
<br />
==<code>det</code>: determiner==<br />
<br />
The relation determiner (<code>det</code>) holds between a nominal head and its determiner. Most commonly, a word of POS <code>det</code> will have the relation <code>det</code> and vice versa. <br />
<br />
<pre><br />
__det__<br />
| |<br />
Баяғыда біреу той жасапты , тойға көп кісі жиналыпты , Қожа да келіпті . <br />
Long.ago someone feast make.PAST , feast.DAT a.lot people get.together.PAST , Koža also come.PAST.EVID <br />
<br />
"A long time ago someone had a feast, a lot of people came to the feast, and Koža also came."<br />
</pre><br />
<br />
==<code>disc</code>: discourse element==<br />
<br />
This is used for interjections and other discourse words and elements (which are not clearly linked to the structure of the sentence, except in an expressive way).<br />
<br />
<pre><br />
<br />
<br />
</pre><br />
<br />
The <code>disc</code> label is used for clitic words, including the question word (ма-, ба-, etc.).<br />
<br />
<pre><br />
<br />
</pre><br />
<br />
==<code>disl</code>: {{sc|unused}}==<br />
<br />
==*<code>obj</code>: direct object==<br />
<br />
The direct object of a verb is the noun phrase that denotes the entity acted upon.<br />
<br />
In Turkic languages the direct object will be marked with either the <code>acc</code> (if definite) or <code>nom</code> (if indefinite) case.<br />
<br />
<pre><br />
<br />
<br />
</pre><br />
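A sketch of this case-based rule (the function and its return shape are invented for illustration; the tag names follow this page):<br />
<br />
```python
def direct_object_label(case_tag):
    """Label a candidate direct object from its case tag:
    accusative => definite object, nominative => indefinite object."""
    if case_tag == "acc":
        return ("obj", "definite")
    if case_tag == "nom":
        return ("obj", "indefinite")
    return None  # other cases are not direct objects
```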
<br />
==<code>expl</code>: {{sc|unused}}==<br />
<br />
<br />
==*<code>barb</code> (<code>foreign</code>): foreign words==<br />
<br />
==<code>goeswith</code>: {{sc|unused}}==<br />
<br />
==*<code>arg</code> (<code>iobj</code>): argument which is not the direct object==<br />
<br />
==*<code>list</code>: list==<br />
<br />
==*<code>mark</code>: marker==<br />
<br />
<br />
==<code>mwe</code>: {{sc|unused}}==<br />
<br />
==<code>name</code>: name==<br />
<br />
Multiword named entities are marked as <code>name</code> (e.g. Владимир Карбый-оолович Чооду): the last element (Чооду) is the head, and every other element is attached to the element to its right with the relation <code>name</code>.<br />
<br />
<pre> <br />
___________________x____________________<br />
| |<br />
| ____name__ ___name__ |<br />
| | | | | | <br />
Культура бажыңының директору Роберт Адар-оолович Аракчаа.<br />
Culture house.3SG.GEN director.3SG Robert Adar-oolovič Arakčaa.<br />
<br />
"The director of the cultural centre is Robert Adar-oolovič Arakčaa."<br />
</pre><br />
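The right-attachment rule for names is mechanical enough to generate arcs from the token span alone; a sketch (hypothetical helper, 1-based token ids):<br />
<br />
```python
def name_arcs(first, last):
    """Arcs for a multiword name over token ids first..last: each element
    attaches to the one on its right with `name`; the last element is the
    head and is attached to the rest of the tree by the caller."""
    return [(i, i + 1, "name") for i in range(first, last)]
```
<br />
For Роберт (4), Адар-оолович (5), Аракчаа (6) this yields the two name arcs in the diagram above.<br />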
<br />
==<code>neg</code>: {{sc|unused}}==<br />
<br />
==*<code>nmod</code>: nominal modifier==<br />
<br />
<br />
<code>nmod</code> is used for a noun (or noun phrase) functioning as a non-core (oblique) argument or adjunct. This means that it functionally corresponds to an adverbial when it attaches to a verb, adjective or other adverb. When attaching to a noun, it corresponds to an attribute or genitive complement (the terms are less standardized here).<br />
<br />
<pre><br />
<br />
<br />
</pre><br />
<br />
==*<code>subj</code> (<code>nsubj</code>): (nominal) subject==<br />
<br />
==<code>nsubjpass</code>: {{sc|unused}}==<br />
<br />
==*<code>nummod</code>: numeric modifier==<br />
<br />
A numeric modifier of a noun is any number phrase that serves to modify the meaning of the noun with a quantity.<br />
<br />
<pre><br />
_nummod_<br />
| |<br />
Бразилия өз жерінде чемпионатты екі рет өткізген бесінші ел болды (Мексика, Италия, Франция және Германиядан кейін).<br />
Brazil self land.3sg.LOC championship.ACC two time fifth country was (Mexico, Italy, France and Germany.ABL after)<br />
</pre><br />
<br />
Ordinals should not get this relation (see <code>amod</code>).<br />
<br />
==*<code>parataxis</code>: parataxis==<br />
<br />
<br />
===Side-by-side sentences===<br />
<br />
When two sentences share no relation but are written together as a single sentence (delimited by a comma, semicolon, or similar), we use the relation <code>parataxis</code>. <br />
<br />
<br />
<pre><br />
_________________________________________parataxis_______________________________________<br />
| |<br />
Футболдан әлем чемпионаты 2014 — ФИФА-ның 20-шы футболдан әлем чемпионаты, финалдық кезеңі 2014 жылдың 12 маусым мен 13 шілде күндері аралығында Бразилияда өтті. <br />
</pre><br />
<br />
==*<code>punct</code>: punctuation==<br />
<br />
==*<code>relcl</code>: Relative clause modifier==<br />
<br />
==<code>remnant</code>: remnant==<br />
<br />
The remnant relation is used to provide a satisfactory treatment of ellipsis (in the case of gapping and stripping, where a predicational or verbal head gets elided) without having to postulate empty nodes in the basic representation. <br />
<br />
UD adopts an analysis that notes that in ellipsis a remnant corresponds to a correlate in a preceding clause. The remnant relation connects each remnant to its correlate in the basic dependency representation. This is then a sufficient representation to reconstruct the predicate-argument structure in the enhanced representation.<br />
<br />
<pre><br />
_________remnant____________<br />
| |<br />
| _________________|___________remnant___________________________<br />
| | | |<br />
| | | |<br />
Ашылу матчы Сан-Паулуда, ал финалы Рио-де-Жанейродағы Маракана стадионында орын алды.<br />
Opening match San-Paulo.LOC, and final.3Sg Rio-de-Janeiro.LOC.ATTR Marcana stadium.LOC place.take.<br />
<br />
"The opening match took place in San Paulo and the final match took place in Rio de Janeiro's Marcana stadium."<br />
</pre><br />
<br />
==*<code>reparandum</code>: overridden disfluency==<br />
<br />
==*<code>root</code>: root==<br />
<br />
==*<code>vocative</code>: vocative==<br />
<br />
==*<code>xcomp</code>: open clausal complement==<br />
<br />
An open clausal complement (xcomp) of a verb or an adjective is a predicative or clausal complement that '''cannot have''' its own subject. The reference of the subject is necessarily determined by an argument external to the xcomp (normally by the object of the next higher clause, if there is one, or else by the subject of the next higher clause). This is often referred to as obligatory control. These complements are always non-finite, and they are complements (arguments of the higher verb or adjective) rather than adjuncts/modifiers, such as a purpose clause. The name xcomp is borrowed from Lexical-Functional Grammar.<br />
<br />
<pre><br />
<br />
_____obj_____ _____xcomp____<br />
| | | |<br />
Әлі де болса Азаматты табуға әрекет·етіп жүр .<br />
Azamat.ACC find.GER.DAT trying AUX.<br />
<br />
"... trying to find Azamat."<br />
<br />
</pre><br />
<br />
==Particular questions==<br />
<br />
===<code>conj</code> vs. <code>parataxis</code> vs. <code>remnant</code>===<br />
<br />
<br />
* <code>conj</code>:<br />
** if there is an explicit coordinator (жана, ал, биракъ, мен, etc.) then use the <code>conj</code> relation.<br />
<br />
* <code>parataxis</code>:<br />
** relation between a word and other elements, such as a sentential parenthetical or a clause after a “:” or a “;”, placed side by side without any explicit coordination, subordination, or argument relation with the head word. Parataxis is a discourse-like equivalent of coordination.<br />
** used for a pair of what could have been standalone sentences, but which are being treated together as a single sentence. They may be joined by punctuation such as a colon or comma, or not delimited by punctuation at all.<br />
** used for reported speech in the structure "xxx yyy" деп ... The "xxx yyy" is in parataxis with деп.<br />
** used for news article bylines "London (BBC)"<br />
** clause interjections.<br />
* <code>remnant</code>:<br />
<br />
=== Testing for argument status ===<br />
<br />
In Turkic languages, traditional tests for whether a constituent is the '''argument''' of a verb or the '''adjunct''' of a verb(/predicate) don't work well. Knowing whether something is an argument or an adjunct ("is required or not") is crucial in dependency grammars, since it determines whether a constituent gets e.g. an <code>obj</code> or <code>nmod</code> label. This section describes an alternative method that may perform better.<br />
<br />
==== Why traditional tests fail ====<br />
<br />
One traditional test of whether a constituent is an argument or an adjunct is the "do-so" test (*''I ate apples and oranges, and Bill did so too apples and oranges.''). Turkic and structurally similar languages, like Ewen, don't have this structure, so this test doesn't work in Turkic.<br />
<br />
Another traditional test of whether a constituent is an argument or adjunct is the grammaticality of the sentence when the constituent is left out (*''Jill really likes''). This doesn't work in Turkic either, since any argument, without exception, can be left out if it is already present [in the right way] in the discourse. While it will sound like information is missing when predicates are used on their own with none of their arguments, contexts can usually be imagined in which most of these sound perfectly grammatical.<br />
<br />
==== A potential option for Turkic ====<br />
<br />
One test that seems to show something approaching argument status is an "additional information"-as-a-new-sentence test. In this test, you leave out one of the arguments in the original sentence, and provide it in a second sentence together with a repeated copy of the verb.<br />
<br />
For example, take the following sentence:<br />
<br />
Кечээ мугалим студентке китеп(ти) берген.<br />
yesterday teacher student-to book(the) gave.<br />
<br />
Using this test, you can create the following sentences, marked for grammaticality in spoken [not literary] Kyrgyz. (''Note that the grammaticality marking refers to the whole utterance, independent of other pragmatics, and not to either of the individual sentences, whether in combination with the other or alone, or in some other context.'')<br />
<br />
Мугалим студентке китеп(ти) берген. Кечээ берген.<br />
teacher student-to book gave. yesterday gave.<br />
<br />
*Кечээ студентке китеп(ти) берген. Мугалим берген.<br />
yesterday student-to book(the) gave. Teacher gave.<br />
<br />
Кечээ мугалим китеп(ти) берген. Студентке берген.<br />
yesterday teacher book(the) gave. student-to gave.<br />
<br />
?*Кечээ мугалим студентке берген. Китеп(ти) берген.<br />
yesterday teacher student-to gave. book(the) gave.<br />
<br />
The ungrammatical sentences show that мугалим and китеп are arguments and must be included with the original sentence. The grammaticality of moving кечээ and студентке out of the original sentence shows that they are probably adjuncts.<br />
<br />
==== One approach: oblique if not core ====<br />
<br />
This approach lists some tests for core arguments, and if a given noun doesn't match one of the tests, then it's considered oblique (non-core).<br />
<br />
Tests for core include:<br />
<br />
# Is it a nominative subject?<br />
:: Justification: nominative subjects agree in person/number with the verb, whether overt or not.<br />
# Is it an accusative object?<br />
:: Justification: accusative objects can be promoted to nominative subjects when the verb is passivised.<br />
# Is it a genitive subject in a subordinate clause?<br />
:: Actual test: <br />
:: Justification:<br />
# Is it a demoted dative/accusative/ablative subject?<br />
:: Actual test: can a sentence be created in which the noun is in the nominative, the verb has one (or two) fewer causative morphemes, and the relationship between the verb and this noun is preserved?<br />
<br />
<br />
[[Category:Turkic languages]]<br />
[[Category:Dependencies]]</div>Firespeakerhttps://wiki.apertium.org/w/index.php?title=Apertium-kat/stats&diff=75507Apertium-kat/stats2024-01-03T05:33:55Z<p>Firespeaker: /* Corpora */ 7c6f3b</p>
<hr />
<div>== Corpora ==<br />
wp2023<br />
* words: <section begin=wp2023-words />33.0M<section end=wp2023-words /><br />
* coverage: ~<section begin=wp2023-coverage />41.53<section end=wp2023-coverage />%<br />
* as of: 7c6f3b</div>Firespeakerhttps://wiki.apertium.org/w/index.php?title=Turkic_languages/Ki&diff=75205Turkic languages/Ki2023-09-13T13:17:15Z<p>Firespeaker: Chv</p>
<hr />
<div>The various "ki"s in Turkic (and Mongolic!).<br />
<br />
{|class="wikitable"<br />
|-<br />
! usage !! attaches to !! resulting form !! forms !! examples<br />
|-<br />
! attributive locative<br />
| ~locative || {{tag|attr}}<br />
| <br />
* {{slc|kaz}} -DAGI<br />
* {{slc|kir}} -DAGI<br />
* {{slc|uzb}} -dagi<br />
* {{slc|tur}} -DAki<br />
* {{slc|sah}} -TAAGI<br />
* {{slc|chv}} -Tи<br />
* {{slc|khk}} dAx(ʲ)<br />
|<br />
* {{slc|kaz}} Бақша'''дағы''' ағаштар<br />
* {{slc|kir}} Бакча'''дагы''' дарактар<br />
* {{slc|uzb}} Bog'cha'''dagi''' daraxtlar<br />
* {{slc|tur}} Bahçe'''deki''' ağaçlar<br />
* {{slc|sah}} Сад'''тааҕы''' мастар<br />
* {{slc|chv}} Сад'''ри''' йываҫсем<br />
* {{slc|khk}} Цэцэрлэг '''дэхь''' моднууд<br />
|- <br />
! substantival genitive<br />
| ~genitive || {{tag|subst}}<br />
|<br />
* {{slc|kaz}} -Niki(n)<br />
* {{slc|kir}} -NIKI(n)<br />
* {{slc|uzb}} -niki<br />
* {{slc|tur}} -(n)Inki(n)<br />
* {{slc|sah}} —<br />
* {{slc|chv}} -(Ӑ)нни?<br />
* {{slc|khk}} -nIx/-Inx<br />
|<br />
* {{slc|kaz}} Сол ағаш біз'''дікі'''<br />
* {{slc|kir}} Ошол дарак биз'''дики'''<br />
* {{slc|uzb}} Shu daraxt biz'''niki'''<br />
* {{slc|tur}} Şu ağaç biz'''imki'''<br />
* {{slc|chv}} Ҫав йываҫ пир'''ĕнни'''<br />
* {{slc|khk}} Тэр мод бид'''нийх'''<br />
|-<br />
! attributive ~time adverbs<br />
| closed set of adverbs (mostly time) || {{tag|attr}}<br />
|<br />
* {{slc|kaz}} -GI<br />
* {{slc|kir}} -KI<br />
* {{slc|uzb}} -gi<br />
* {{slc|tur}} -ki/-kI<br />
* {{slc|sah}} -ŋI/-GI<br />
* {{slc|chv}} -хи<br />
* {{slc|khk}} -х<br />
|<br />
* {{slc|kaz}} бүгін'''гі''', былтыр'''ғы'''<br />
* {{slc|kir}} бүгүн'''кү''', былтыр'''кы'''<br />
* {{slc|uzb}} bugun'''gi''', bultur'''gi'''<br />
* {{slc|tur}} bugün'''kü''', geçen yıl'''ki'''<br />
* {{slc|sah}} бүгүҥ'''ҥү''', былырыыҥ'''ҥы'''<br />
* {{slc|chv}} паян'''хи''', пĕлтĕр'''хи'''<br />
* {{slc|khk}} өмнө'''х''', дараа'''х'''<br />
|-<br />
! relative thingy<br />
| finite phrase (adverb, verb) || lambda(adverb phrase)??<br />
|<br />
* {{slc|kaz}} —<br />
* {{slc|kir}} —<br />
* {{slc|uzb}} ki?<br />
* {{slc|tur}} ki<br />
* {{slc|sah}} —<br />
* {{slc|chv}} —<br />
* {{slc|khk}} —<br />
|<br />
* {{slc|tur}} Tabii ki, ...<br />
* {{slc|tur}} Dedim ki, ...<br />
* {{slc|tur}} Sanırım (ki), ...<br />
|}<br />
<br />
<br />
=== Notes ===<br />
* Khalkha -x seems not to occur with temporal adverbs as in Turkic? In some Turkic languages this usage is quite productive, cf. forms like эртең мененки (<tt>kir</tt>).<br />
* In Sakha, evidence for -ŋI is forms like бэҕэһээҥи, while evidence for -GI is forms like аныгы. In all other environments (except after vowels) it's impossible to distinguish the two (сарсыҥҥы, быйылгы, аныгыскы, etc.).<br />
* What is кэнники (<tt>sah</tt>)?<br />
* How do forms like биһиэнэ (<tt>sah</tt>) work, and can it apply to nouns? (Seems no?)<br />
* The forms биһиги, эһиги might be remnants of <tag>gen</tag><tag>subst</tag>, but they are not used that way currently.</div>
Firespeaker https://wiki.apertium.org/w/index.php?title=List_of_symbols&diff=74362 List of symbols 2023-04-11T16:08:48Z <p>Firespeaker: </p>
<hr />
<div>[[Liste de symboles|En français]] · [[Список символов|по-русски]]<br />
<br />
This page lists the symbols Apertium uses to denote part of speech and further morphological features, as well as the chunk tags used for syntactic functions and the corresponding XML tags.<br />
<br />
This page also documents alignment between Apertium morphological tags and [https://universaldependencies.org/ Universal Dependencies] [https://universaldependencies.org/u/pos/index.html POS tags] and [https://universaldependencies.org/u/feat/index.html features].<br />
<br />
<br />
{{TOCD}}<br />
This is meant to be a glossary of symbol names in alphabetical order with notes. Some of these names are specific to particular packages or language pairs, as not all languages have the same grammatical features (for example, most don't have a spatial distinction in articles).<br />
<br />
If you were wondering what the symbols #, /, @, +, ~ or * mean, read [[Apertium stream format]].<br />
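As a rough illustration of that format, here is a minimal Python sketch that splits one lexical unit into its surface form and analyses. The helper name and the example analysis (Spanish ''vino'': noun "wine" vs. preterite of ''venir'') are illustrative only, not part of any Apertium tool; see [[Apertium stream format]] for the authoritative description.

```python
import re

def parse_lu(lu: str):
    """Split one stream-format lexical unit, e.g.
    ^vino/vino<n><m><sg>/venir<vblex><ifi><p3><sg>$
    into (surface, [(lemma, [tags]), ...])."""
    assert lu.startswith('^') and lu.endswith('$'), "not a lexical unit"
    parts = lu[1:-1].split('/')
    surface, analyses = parts[0], []
    for ana in parts[1:]:
        lemma = ana.split('<', 1)[0]          # text before the first tag
        tags = re.findall(r'<([^>]+)>', ana)  # e.g. ['n', 'm', 'sg']
        analyses.append((lemma, tags))
    return surface, analyses
```

A unit with two analyses, as above, is ambiguous and would normally be resolved by a tagger later in the pipeline.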
<br />
<!-- comments following section headers are intended to make scraping this page easier --><br />
<br />
==Part-of-speech Categories== <!-- POS --><br />
<br />
{|class=wikitable<br />
! Symbol !! Gloss !! Notes !! Universal POS<br />
|-<br />
| <code>n</code> || Noun || ''see 'np' for proper noun'' || NOUN<br />
|-<br />
| <code>vblex</code> || Standard ("lexical") verb || ''see also: vbser, vbhaver, vbmod, vaux, vbdo'' || VERB<br />
|-<br />
| <code>v</code> || Standard verb || shortened form of vblex, often used in agglutinative languages || VERB<br />
|-<br />
| <code>vbmod</code> || Modal verb || || VERB<br />
|-<br />
| <code>vbser</code> || Verb "to be" || from ''ser'' (to be) || VERB or AUX<br />
|-<br />
| <code>vbhaver</code> || Verb "to have" || from ''haver'' (to have) || VERB or AUX<br />
|-<br />
| <code>vbdo</code> || Verb "to do" || Includes all eleven tenses and forms of "to do"; can also be an auxiliary verb || VERB or AUX<br />
|-<br />
| <code>vaux</code> || Auxiliary verb || [http://en.wikipedia.org/wiki/Auxiliary_verb wikipedia] || AUX<br />
|-<br />
| <code>cop</code> || Copula || [http://en.wikipedia.org/wiki/Copula_(linguistics) wikipedia]; sometimes verb-like, sometimes not || AUX<br />
|- <br />
| <code>adj</code> || Adjective || || ADJ<br />
|-<br />
| <code>adv</code> || Adverb || || ADV<br />
|-<br />
| <code>preadv</code> || Pre-adverb || || ADV<br />
|-<br />
| <code>postadv</code> || Post-adverb || || ADV<br />
|-<br />
| <code>mod</code> || Modal word || [http://dic.academic.ru/dic.nsf/lingvistic/749] || PART<br />
|-<br />
| <code>det</code> || Determiner || [http://en.wikipedia.org/wiki/Determiner_(class) wikipedia] || DET<br />
|-<br />
| <code>prn</code> || Pronoun || [http://en.wikipedia.org/wiki/Pronoun wikipedia] || PRON<br />
|-<br />
| <code>pr</code> || Preposition || [http://en.wikipedia.org/wiki/Preposition wikipedia] || ADP<br />
|-<br />
| <code>post</code> || Postposition || || ADP<br />
|-<br />
| <code>num</code> || Numeral || || NUM<br />
|-<br />
| <code>np</code> || Proper noun || From ''nom propi'' [http://en.wikipedia.org/wiki/Proper_noun wikipedia] || PROPN<br />
|-<br />
| <code>ij</code> || Interjection || [http://en.wikipedia.org/wiki/Interjection wikipedia] || INTJ<br />
|-<br />
| <code>cnjcoo</code> || Co-ordinating conjunction || [http://en.wikipedia.org/wiki/Co-ordinating_conjunction wikipedia] || CCONJ<br />
|-<br />
| <code>cnjsub</code> || Sub-ordinating conjunction || || SCONJ<br />
|-<br />
| <code>cnjadv</code> || Conjunctive adverb || [http://en.wikipedia.org/wiki/Conjunctive_adverb wikipedia] || SCONJ, ADV<br />
|-<br />
| <code>atp</code> || Attachable prefix || In [[German]], ''zusammen''- ||<br />
|-<br />
| <code>ideo</code> || Ideophone || ||<br />
|-<br />
| <code>clt</code> || Clitic || ||<br />
|}<br />
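The POS correspondences above amount to a lookup table. The sketch below is a hypothetical helper (not part of Apertium itself) covering only a few rows of the table, to show how an analysis's tag list maps onto a Universal Dependencies POS.

```python
# A few Apertium POS tags and their UD POS values, taken from the table above.
APERTIUM_TO_UPOS = {
    'n': 'NOUN', 'np': 'PROPN', 'vblex': 'VERB', 'v': 'VERB',
    'vaux': 'AUX', 'adj': 'ADJ', 'adv': 'ADV', 'det': 'DET',
    'prn': 'PRON', 'pr': 'ADP', 'post': 'ADP', 'num': 'NUM',
    'ij': 'INTJ', 'cnjcoo': 'CCONJ', 'cnjsub': 'SCONJ',
}

def upos_for(tags):
    """Return the UD POS for the first recognised tag in an analysis,
    or UD's 'X' (unanalysable material) if none is recognised."""
    for t in tags:
        if t in APERTIUM_TO_UPOS:
            return APERTIUM_TO_UPOS[t]
    return 'X'
```

In Apertium analyses the POS tag comes first, so scanning the tag list in order finds it before any inflectional tags.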
<br />
=== Punctuation === <!-- punct --><br />
<br />
{|class=wikitable<br />
! Symbol !! Gloss !! Notes !! Universal POS<br />
|-<br />
| <code>sent</code> || Sentence-ending punctuation || e.g. full stop, question mark || PUNCT<br />
|-<br />
| <code>cm</code> || Comma punctuation || , || PUNCT PunctType=Comm<br />
|-<br />
| <code>lquot</code> || Left quote || « || PUNCT PunctType=Quot PunctSide=Ini<br />
|-<br />
| <code>rquot</code> || Right quote || » || PUNCT PunctType=Quot PunctSide=Fin<br />
|-<br />
| <code>lpar</code> || Left parenthesis || ( || PUNCT PunctType=Brck PunctSide=Ini<br />
|-<br />
| <code>rpar</code> || Right parenthesis || ) || PUNCT PunctType=Brck PunctSide=Fin<br />
|- <br />
| <code>guio</code> || Hyphen || - used to connect two words into one e.g. year-long|| PUNCT PunctType=Dash<br />
|- <br />
| <code>apos</code> || Apostrophe || ' or ’ || PUNCT<br />
|- <br />
| <code>quot</code> || Quotation || " || PUNCT PunctType=Quot<br />
|- <br />
| <code>percent</code> || Percentage || % || PUNCT<br />
|- <br />
| <code>lquest</code> || Left question/exclamation mark || ¿¡ (''used in Spanish'') || PUNCT PunctSide=Ini<br />
|-<br />
| <code>clb</code> || Clause Boundary || Refers to any of the following symbols: .?;:!·… || PUNCT<br />
|-<br />
| <code>punct</code> || Punctuation || || PUNCT<br />
|}<br />
<br />
==Part-of-speech Sub-categories== <!-- subtype --><br />
<br />
===Gender=== <!-- gender --><br />
<br />
These tags are usually used with nouns. When they occur with things that agree/concord with nouns (like adjectives and verbs), they in fact constitute inflectional/grammatical tags.<br />
<br />
{|class=wikitable<br />
! Symbol !! Gloss !! Notes !! Universal features<br />
|-<br />
| <code>f</code> || Feminine || || Gender=Fem<br />
|-<br />
| <code>m</code> || Masculine || || Gender=Masc <!-- default --><br />
|-<br />
| <code>nt</code> || Neuter || || Gender=Neut<br />
|-<br />
| <code>ma</code> || Masculine (animate) || Mostly in Slavic languages || Gender=Masc<br />
|-<br />
| <code>mi</code> || Masculine (inanimate) || Mostly in Slavic languages || Gender=Masc<br />
|-<br />
| <code>mp</code> || Masculine (personal) || in Polish || Gender=Masc<br />
|-<br />
| <code>mn</code> || Masculine or neuter || || Gender=Masc,Neut<br />
|-<br />
| <code>fn</code> || Feminine or neuter || || Gender=Fem,Neut<br />
|-<br />
| <code>mf</code> || Masculine or feminine || Used when masculine and feminine have the same form || Gender=Masc,Fem<br />
|-<br />
| <code>mfn</code> || Masculine, feminine, neuter || Used when masculine, feminine, and neuter have the same form || Gender=Masc,Fem,Neut<br />
|-<br />
| <code>ut</code> || Common || From ''utrum'', found in Scandinavian languages. || Gender=Com<br />
|-<br />
| <code>un</code> || Common or neuter || As above, only common or neuter || Gender=Com,Neut<br />
|-<br />
| <code>GD</code> || Gender to be determined || || <!-- unknown --><br />
|- <br />
|}<br />
<br />
===Count/Mass=== <!-- countability --><br />
<br />
These tags are usually used with nouns, and things that agree/concord with nouns (like adjectives and verbs).<br />
<br />
{|class=wikitable<br />
! Symbol !! Gloss !! Notes !! Universal feature<br />
|-<br />
| <code>cnt</code> || Countable || ||<br />
|-<br />
| <code>unc</code> || Uncountable (mass) || ||<br />
|- <br />
|}<br />
<br />
===Animacy=== <!-- animacy --><br />
<br />
These tags are usually used with nouns, and things that agree/concord with nouns (like adjectives and verbs).<br />
<br />
{|class=wikitable<br />
! Symbol !! Gloss !! Notes !! Universal feature<br />
|-<br />
| <code>aa</code> || Animate || || Animacy=Anim<br />
|-<br />
| <code>an</code> || Animate or inanimate || || Animacy=Anim,Inan<br />
|-<br />
| <code>nn</code> || Inanimate || || Animacy=Inan<br />
|-<br />
| <code>hu</code> || Human || || Animacy=Hum<br />
|}<br />
<br />
===Adjectives=== <!-- adj_type --><br />
<br />
{|class=wikitable<br />
! Symbol !! Gloss !! Notes !! Universal feature<br />
|-<br />
| <code>sint</code> || Synthetic || "nice, nicer, nicest" is synthetic. "handsome, more handsome, the most handsome" is not. [http://en.wikipedia.org/wiki/Synthetic_language wikipedia] ||<br />
|-<br />
| <code>preadj</code> || Pre-adjective || for languages where most adjectives come after the noun (e.g. French in the eo->fr bidix) ||<br />
|-<br />
| <code>preadj_nh</code> || Pre-adjective if not human || depending on the noun, the adjective comes before or after it ||<br />
|-<br />
|}<br />
<br />
===Noun Class === <!-- n_class --><br />
<br />
{| class="wikitable" border="1"<br />
! Symbol !! Gloss !! Notes<br />
|-<br />
| <code>cl1</code> || Noun class 1 ||<br />
|-<br />
| <code>cl2</code> || Noun class 2 ||<br />
|-<br />
| <code>cl3</code> || Noun class 3 ||<br />
|-<br />
| <code>cl4</code> || Noun class 4 ||<br />
|-<br />
| <code>cl5</code> || Noun class 5 ||<br />
|-<br />
| <code>cl6</code> || Noun class 6 ||<br />
|-<br />
| <code>cl7</code> || Noun class 7 ||<br />
|-<br />
| <code>cl8</code> || Noun class 8 ||<br />
|-<br />
| <code>cl9</code> || Noun class 9 ||<br />
|-<br />
| <code>cl10</code> || Noun class 10 ||<br />
|-<br />
| <code>cl11</code> || Noun class 11 ||<br />
|-<br />
| <code>cl12</code> || Noun class 12 ||<br />
|}<br />
<br />
===Pronoun types === <!-- prn_type --><br />
<br />
{| class="wikitable" border="1"<br />
! Symbol !! Gloss !! Notes !! Universal feature<br />
|-<br />
| <code>pers</code> || Personal || || PronType=Prs<br />
|-<br />
| <code>tn</code> || Tonic (stressed) || From Spanish ''tónico'' ||<br />
|-<br />
| <code>log</code> || Logophoric || ||<br />
|-<br />
| <code>detnt</code> || Neuter determiner || POS? || DET<br />
|-<br />
| <code>predet</code> || Pre determiner || POS? || DET<br />
|-<br />
| <code>atn</code> || Atonic (unstressed) || From Spanish ''atónico'' ||<br />
|-<br />
| <code>qnt</code> || Quantifier || || PronType=Ind<br />
|-<br />
| <code>ord</code> || Ordinal || || NumType=Ord<br />
|-<br />
| <code>obj</code> || Object || || Case=Acc<br />
|-<br />
| <code>subj</code> || Subject || || Case=Nom<br />
|-<br />
| <code>pro</code> || Proclitic || ||<br />
|-<br />
| <code>enc</code> || Enclitic || ||<br />
|-<br />
| <code>acr</code> || Acronym || Not a pronoun? || Abbr=Yes<br />
|-<br />
| <code>rel</code> || Relative || || PronType=Rel<br />
|-<br />
| <code>ind</code> || Indefinite || || PronType=Ind<br />
|-<br />
| <code>itg</code> || Interrogative || || PronType=Int<br />
|-<br />
| <code>dem</code> || Demonstrative || || PronType=Dem<br />
|-<br />
| <code>def</code> || Definite || || Definite=Def<br />
|-<br />
| <code>pos</code> || Possessive || || Poss=Yes<br />
|-<br />
| <code>ref</code> || Reflexive || || Reflex=Yes<br />
|-<br />
| <code>prx</code> || Proximate || ||<br />
|-<br />
| <code>med</code> || Medial || ||<br />
|-<br />
| <code>dst</code> || Distal || ||<br />
|-<br />
| <code>expl</code> || Syntactic expletive || [https://en.wikipedia.org/wiki/Syntactic_expletive wikipedia] ||<br />
|-<br />
| <code>rec</code> || Reciprocal Pronoun || ||<br />
|-<br />
| <code>res</code> || Reciprocal Pronoun || ||<br />
|}<br />
<br />
=== Transitivity === <!-- transitivity --><br />
<br />
Used for verbs.<br />
<br />
{| class="wikitable" border="1"<br />
! Symbol !! Gloss !! Notes !! Universal feature<br />
|-<br />
| <code>tv</code> || Transitive || takes direct object in accusative case (used in Turkic) || Subcat=Tran<br />
|-<br />
| <code>iv</code> || Intransitive || does not take direct object in accusative case (used in Turkic) || Subcat=Intr<br />
|-<br />
| <code>TD</code> || Transitivity to be determined || if the sub-category is (currently) unknown || <!-- unknown --><br />
|}<br />
<br />
===Separable verbs=== <!-- separable --><br />
<br />
{|class=wikitable<br />
! Symbol !! Gloss !! Notes<br />
|-<br />
| <code>sep</code> || Separable verb || [https://en.wikipedia.org/wiki/Separable_verb wikipedia], [https://deutsch.lingolia.com/en/grammar/verbs/separable-verbs lingolia], [https://aclweb.org/anthology/P98-1078.pdf PDF]<br />
|-<br />
| <code>fs</code> || Separable verb in subordinate clause ||<br />
|-<br />
| <code>fm</code> || Separable verb in main clause ||<br />
|-<br />
|}<br />
<br />
===Proper nouns=== <!-- np_type --><br />
<br />
{|class=wikitable<br />
! Symbol !! Gloss !! Notes<br />
|-<br />
| <code>ant</code> || Anthroponym || [http://en.wikipedia.org/wiki/Anthroponym wikipedia]; it is very common to use <code>ant</code> together with <code>f</code> and <code>m</code> for traditionally gender-specific names<br />
|-<br />
| <code>top</code> || Toponym || In some language pairs without the locative case this may appear as ''loc'', although this should be changed. [http://en.wikipedia.org/wiki/Toponym wikipedia]<br />
|-<br />
| <code>hyd</code> || Hydronym || [http://en.wikipedia.org/wiki/Hydronym wikipedia]<br />
|-<br />
| <code>cog</code> || Cognomen || In normal use, surnames<br />
|-<br />
| <code>org</code> || Organisation || <br />
|-<br />
| <code>al</code> || Altres || Other, misc.<br />
|-<br />
| <code>pat</code> ||Patronymic || A name derived from the name of a father or ancestor, e.g. Johnson, O'Brien, Ivanovich.<br />
|}<br />
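For instance, in the [[Apertium stream format]] a proper-noun analysis combines <code>np</code> with one of these sub-category tags. A hypothetical sketch (tag order and inventory vary by language pair):<br />
<br />
<pre><br />
^John/John<np><ant><m>$ ^London/London<np><top>$<br />
</pre><br />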
<br />
== Inflectional morphology == <!-- infl --><br />
<br />
===Number=== <!-- number --><br />
Note: number can be a sub-category tag too, e.g. with pronouns.<br />
<br />
{|class=wikitable<br />
! Symbol !! Gloss !! Notes !! Universal feature<br />
|-<br />
| <code>sg</code> || Singular || || Number=Sing <!-- default --><br />
|-<br />
| <code>pl</code> || Plural || || Number=Plur<br />
|-<br />
| <code>sp</code> || Singular or plural || || Number=Sing,Plur<br />
|-<br />
| <code>du</code> || Dual || || Number=Dual<br />
|-<br />
| <code>ct</code> || Count || see mk-bg || Number=Count<br />
|-<br />
| <code>coll</code> || Collective || || Number=Coll<br />
|-<br />
| <code>ND</code> || Number to be determined || || <!-- unknown --><br />
|-<br />
|}<br />
<br />
<br />
===Case=== <!-- case --><br />
<br />
{|class=wikitable<br />
! Symbol !! Gloss !! Notes !! Universal feature<br />
|-<br />
| <code>nom</code> || Nominative || || Case=Nom<br />
|-<br />
| <code>acc</code> || Accusative || || Case=Acc<br />
|-<br />
| <code>dat</code> || Dative || || Case=Dat<br />
|-<br />
| <code>gen</code> || Genitive || || Case=Gen<br />
|-<br />
| <code>dg</code> || Dative and Genitive || in [[ro-es]], discouraged in new developments || Case=Dat,Gen<br />
|-<br />
| <code>voc</code> || Vocative || || Case=Voc<br />
|-<br />
| <code>abl</code> || Ablative || [http://en.wikipedia.org/wiki/Ablative wikipedia] || Case=Abl<br />
|-<br />
| <code>ins</code> || Instrumental or Instructive || [http://en.wikipedia.org/wiki/Instrumental_case wikipedia] || Case=Ins<br />
|-<br />
| <code>loc</code> || Locative || [http://en.wikipedia.org/wiki/Locative wikipedia] || Case=Loc<br />
|-<br />
| <code>prp</code> || Prepositional || [http://en.wikipedia.org/wiki/Prepositional wikipedia] ||<br />
|-<br />
| <code>tra</code> || Translative || || Case=Tra<br />
|-<br />
| <code>ill</code> || Illative || || Case=Ill<br />
|-<br />
| <code>ine</code> || Inessive || || Case=Ine<br />
|-<br />
| <code>ade</code> || Adessive || || Case=Ade<br />
|-<br />
| <code>all</code> || Allative || || Case=All<br />
|-<br />
| <code>abe</code> || Abessive || || Case=Abe<br />
|-<br />
| <code>ess</code> || Essive || || Case=Ess<br />
|-<br />
| <code>par</code> || Partitive || || Case=Par<br />
|-<br />
| <code>dis</code> || Distributive || || Case=Dis<br />
|-<br />
| <code>com</code> || Comitative || || Case=Com<br />
|-<br />
| <code>soc</code> || Sociative || || <br />
|-<br />
| <code>prl</code> || Prolative || || Case=Pro<br />
|-<br />
| <code>ses</code> || Superessive || [[Hungarian]] || Case=Sup<br />
|-<br />
| <code>sub</code> || Sublative || [[Hungarian]] || Case=Sub<br />
|-<br />
| <code>dela</code> || Delative || [[Hungarian]] || Case=Del<br />
|-<br />
| <code>term</code> || Terminative || [[Hungarian]], Estonian, ... || Case=Ter<br />
|-<br />
| <code>temp</code> || Temporal || [https://en.wikipedia.org/wiki/Temporal_case wikipedia] || Case=Tem<br />
|-<br />
| <code>obl</code> || Oblique || [https://en.wikipedia.org/wiki/Oblique_case wikipedia] || Case=Obl<br />
|-<br />
| <code>erg</code> || Ergative || [https://en.wikipedia.org/wiki/Ergative_case wikipedia] || Case=Erg<br />
|-<br />
| <code>CD</code> || Case to be determined || || <!-- unknown --><br />
|}<br />
<br />
===Voice=== <!-- voice --><br />
<br />
{|class=wikitable<br />
! Symbol !! Gloss !! Notes !! Universal feature<br />
|-<br />
| <code>actv</code> || Active voice || || Voice=Act<br />
|-<br />
| <code>pass</code> || Passive voice || Mostly used in Turkic languages. || Voice=Pass<br />
|-<br />
| <code>pasv</code> || Passive voice || Mostly used in Germanic languages. || Voice=Pass<br />
|-<br />
| <code>midv</code> || Middle voice || || Voice=Mid<br />
|-<br />
| <code>nactv</code> || Non-active voice || See Albanian. || <br />
|-<br />
| <code>caus</code> || Causative voice || see also [[#Derivations]] || Voice=Cau<br />
|-<br />
|}<br />
<br />
===Tense and mode=== <!-- tense --><br />
<br />
{|class=wikitable<br />
! Symbol !! Gloss !! Notes !! Universal features<br />
|-<br />
| <code>aff</code> || Affirmative || [https://en.wikipedia.org/wiki/Affirmation_and_negation wikipedia] || Polarity=Pos<br />
|-<br />
| <code>aor</code> || Aorist || [https://en.wikipedia.org/wiki/Aorist wikipedia] A tense in Turkic languages. || Tense=Past<br />
|-<br />
| <code>cni</code> || Conditional || Many pairs will probably use <code>cnd</code> or <code>cond</code> instead... || Mood=Cnd<br />
|-<br />
| <code>deb</code> || Debitive mode || Exclusive to Latvian ([https://en.wikipedia.org/wiki/Debitive wikipedia]) ||<br />
|-<br />
| <code>fti</code> || Future indicative || || Tense=Fut Mood=Ind<br />
|-<br />
| <code>fts</code> || Future subjunctive || || Tense=Fut Mood=Sub<br />
|-<br />
| <code>fut</code> || Future || || Tense=Fut<br />
|-<br />
| <code>ifi</code> || Past definite || from ''Pretérito perfecto o indefinido'' || Tense=Past Definite=Def<br />
|-<br />
| <code>imp</code> || Imperative || [http://www.englishlanguageguide.com/grammar/imperative.asp englishlanguageguide] || Mood=Imp<br />
|-<br />
| <code>itg</code> || Interrogative || ||<br />
|-<br />
| <code>ito</code> || Infinitive with 'to' || [[German]] || VerbForm=Inf<br />
|-<br />
| <code>lp</code> || L-participle || ||<br />
|-<br />
| <code>neg</code> || Negative || || Polarity=Neg<br />
|-<br />
| <code>nonpast</code> || Non-past || || Tense=Pres,Fut<br />
|-<br />
| <code>past</code> || Past || || Tense=Past<br />
|-<br />
| <code>pii</code> || Imperfect || from ''Pretérito imperfecto de indicativo'' [https://en.wikipedia.org/wiki/Imperfect wikipedia] || Tense=Past Mood=Ind Aspect=Imp<br />
|-<br />
| <code>pis</code> || Imperfect subjunctive || || Tense=Past Mood=Sub Aspect=Imp<br />
|-<br />
| <code>plu</code> || Pluperfect || In <code>cy-en</code> || Tense=Pqp<br />
|-<br />
| <code>pmp</code> || Pluperfect || In <code>es-gl</code> (from ''Pluscuamperfecto'') || Tense=Pqp<br />
|-<br />
| <code>pp2</code> || Past participle (???) || It's at least used in the Esperanto dictionaries for future active participles, ''ont'' (seems quite odd) || VerbForm=Part Tense=Past<br />
|-<br />
| <code>pp3</code> || Past participle (???) || It's at least used in the Esperanto dictionaries for past active participles, ''int'' (seems quite odd) || VerbForm=Part Tense=Past<br />
|-<br />
| <code>pp</code> || Past participle || [http://en.wikipedia.org/wiki/Participle wikipedia] || VerbForm=Part Tense=Past<br />
|-<br />
| <code>pprs</code> || Present participle || Also appears as <code>ppres</code> (deprecated) || VerbForm=Part Tense=Pres<br />
|-<br />
| <code>ppres</code> || Present participle || ''see also: pprs''. [http://en.wikipedia.org/wiki/Present_participle wikipedia] || Tense=Pres VerbForm=Part<br />
|-<br />
| <code>pres</code> || Present || || Tense=Pres<br />
|-<br />
| <code>pret</code> || Preterite || [https://en.wikipedia.org/wiki/Preterite Preterite] || Tense=Past<br />
|-<br />
| <code>pri</code> || Present indicative || ''see also: pres''. [http://en.wikipedia.org/wiki/Present_indicative wikipedia] || Tense=Pres Mood=Ind<br />
|-<br />
| <code>prs</code> || Present subjunctive || [http://en.wikipedia.org/wiki/Present_subjunctive wikipedia] || Tense=Pres Mood=Sub<br />
|-<br />
| <code>supn</code> || Supine || [http://en.wikipedia.org/wiki/Supine wikipedia] || VerbForm=Sup<br />
|}<br />
<br />
=== Non-finite verb forms === <!-- nonfinite --><br />
<br />
These tags are used for non-finite verb forms, which are often elsewhere called "infinitives" or "participles". See https://doi.org/10.3765/ptu.v4i1.4587 for discussion.<br />
<br />
==== Noun-like ==== <!-- verbal-nouns --><br />
<br />
{|class=wikitable<br />
! Symbol !! Gloss !! Notes !! Universal features<br />
|-<br />
| <code>ger</code> || Gerund || || VerbForm=Vnoun<br />
|-<br />
| <code>ger_aor</code> || Aorist gerund || || VerbForm=Vnoun<br />
|-<br />
| <code>ger_fut</code> || Future gerund || || VerbForm=Vnoun Tense=Fut<br />
|-<br />
| <code>ger_hab</code> || Habitual gerund || || VerbForm=Vnoun Aspect=Hab<br />
|-<br />
| <code>ger_impf</code> || Imperfect gerund || || VerbForm=Vnoun Aspect=Imp<br />
|-<br />
| <code>ger_past</code> || Past gerund || || VerbForm=Vnoun Tense=Past<br />
|-<br />
| <code>ger_perf</code> || Perfect gerund || || VerbForm=Vnoun Aspect=Perf<br />
|-<br />
| <code>ger_pres</code> || Present gerund || || VerbForm=Vnoun Tense=Pres<br />
|}<br />
<br />
==== Adjective-like ==== <!-- verbal-adjectives --><br />
<br />
{|class=wikitable<br />
! Symbol !! Gloss !! Notes !! Universal features<br />
|-<br />
| <code>gpr</code> || Verbal adjective || || VerbForm=Part<br />
|-<br />
| <code>gpr_aor</code> || Aorist verbal adjective || || VerbForm=Part<br />
|-<br />
| <code>gpr_fut</code> || Future verbal adjective || || VerbForm=Part Tense=Fut<br />
|-<br />
| <code>gpr_hab</code> || Habitual verbal adjective || || VerbForm=Part Aspect=Hab<br />
|-<br />
| <code>gpr_impf</code> || Imperfect verbal adjective || || VerbForm=Part Aspect=Imp<br />
|-<br />
| <code>gpr_past</code> || Past verbal adjective || || VerbForm=Part Tense=Past<br />
|-<br />
| <code>gpr_perf</code> || Perfect verbal adjective || || VerbForm=Part Aspect=Perf<br />
|-<br />
| <code>gpr_pres</code> || Present verbal adjective || || VerbForm=Part Tense=Pres<br />
|}<br />
<br />
==== Adverb-like ==== <!-- verbal-adverbs --><br />
<br />
{|class=wikitable<br />
! Symbol !! Gloss !! Notes !! Universal features<br />
|-<br />
| <code>gna</code> || Verbal adverb || || VerbForm=Conv<br />
|-<br />
| <code>gna_aor</code> || Aorist verbal adverb || || VerbForm=Conv<br />
|-<br />
| <code>gna_fut</code> || Future verbal adverb || || VerbForm=Conv Tense=Fut<br />
|-<br />
| <code>gna_hab</code> || Habitual verbal adverb || || VerbForm=Conv Aspect=Hab<br />
|-<br />
| <code>gna_impf</code> || Imperfect verbal adverb || || VerbForm=Conv Aspect=Imp<br />
|-<br />
| <code>gna_past</code> || Past verbal adverb || || VerbForm=Conv Tense=Past<br />
|-<br />
| <code>gna_perf</code> || Perfect verbal adverb || || VerbForm=Conv Aspect=Perf<br />
|-<br />
| <code>gna_pres</code> || Present verbal adverb || || VerbForm=Conv Tense=Pres<br />
|}<br />
<br />
==== Infinitives ==== <!-- infinitives --><br />
<br />
Generally these must occur with auxiliaries.<br />
<br />
{|class=wikitable<br />
! Symbol !! Gloss !! Notes !! Universal features<br />
|-<br />
| <code>inf</code> || Infinitive || || VerbForm=Inf<br />
|-<br />
| <code>infps</code> || Personal infinitive || Used in Portuguese, likely should be merged || VerbForm=Inf<br />
|-<br />
| <code>prc_aor</code> || Aorist participle || || VerbForm=Inf<br />
|-<br />
| <code>prc_fut</code> || Future participle || || VerbForm=Inf Tense=Fut<br />
|-<br />
| <code>prc_hab</code> || Habitual participle || || VerbForm=Inf Aspect=Hab<br />
|-<br />
| <code>prc_impf</code> || Imperfect participle || || VerbForm=Inf Aspect=Imp<br />
|-<br />
| <code>prc_past</code> || Past participle || || VerbForm=Inf Tense=Past<br />
|-<br />
| <code>prc_perf</code> || Perfect participle || || VerbForm=Inf Aspect=Perf<br />
|-<br />
| <code>prc_pres</code> || Present participle || || VerbForm=Inf Tense=Pres<br />
|}<br />
<br />
===Aspect=== <!-- aspect --><br />
{|class=wikitable<br />
! Symbol !! Gloss !! Notes !! Universal feature<br />
|-<br />
| <code>hab</code> || Habitual || || Aspect=Hab<br />
|-<br />
| <code>imperf</code> || Imperfective || Should be merged with <code>impf</code> || Aspect=Imp<br />
|-<br />
| <code>impf</code> || Imperfective || || Aspect=Imp<br />
|-<br />
| <code>perf</code> || Perfective || || Aspect=Perf<br />
|}<br />
<br />
===Person=== <!-- person --><br />
Note: person can be a sub-category tag, e.g. with pronouns.<br />
<br />
{|class=wikitable<br />
! Symbol !! Gloss !! Notes !! Universal feature<br />
|-<br />
| <code>p1</code> || First person || || Person=1<br />
|-<br />
| <code>p2</code> || Second person || || Person=2<br />
|-<br />
| <code>p3</code> || Third person || || Person=3<br />
|-<br />
| <code>impers</code> || Impersonal || Sometimes called 'autonomous' || Person=0<br />
|-<br />
| <code>past3p</code> || Past third person || In <code>rus</code> and <code>bel-rus</code>, should be 2 tags || Person=3 Tense=Past<br />
|}<br />
<br />
===Derivations=== <!-- verb_deriv --><br />
{|class=wikitable<br />
! Symbol !! Gloss !! Notes<br />
|-<br />
| <code>caus</code> || Causative ||<br />
|-<br />
| <code>ingr</code> || Ingressive || https://nn.wikipedia.org/w/index.php?title=Ingressiv<br />
|-<br />
| <code>subs</code> || Verbal noun or verbal substantive || Shortened form of ''substantive''; a noun formed from a verb<br />
|-<br />
| <code>agnt</code> || Agent noun || [https://en.wikipedia.org/wiki/Agent_noun Agent Noun]<br />
|-<br />
|}<br />
<br />
===Possession=== <!-- possessor --><br />
{|class=wikitable<br />
! Symbol !! Gloss !! Notes !! Universal feature<br />
|-<br />
| <code>px1sg</code> || First person singular possessive || e.g. in [[Turkic languages]] || Person[psor]=1 Number[psor]=Sing<br />
|-<br />
| <code>px2sg</code> || Second person singular possessive || e.g. in [[Turkic languages]] || Person[psor]=2 Number[psor]=Sing<br />
|-<br />
| <code>px3sg</code> || Third person singular possessive || e.g. in [[Turkic languages]] || Person[psor]=3 Number[psor]=Sing<br />
|-<br />
| <code>px1pl</code> || First person plural possessive || e.g. in [[Turkic languages]] || Person[psor]=1 Number[psor]=Plur<br />
|-<br />
| <code>px2pl</code> || Second person plural possessive || e.g. in [[Turkic languages]] || Person[psor]=2 Number[psor]=Plur<br />
|-<br />
| <code>px3pl</code> || Third person plural possessive || e.g. in [[Turkic languages]] || Person[psor]=3 Number[psor]=Plur<br />
|-<br />
| <code>px3sp</code> || Third person possessive singular or plural || e.g. in [[Turkic languages]] || Person[psor]=3<br />
|-<br />
|}<br />
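In the [[Apertium stream format]], a possessive tag typically follows the noun's number tag and precedes case. A hypothetical Kazakh-style sketch (exact tag order varies by language pair):<br />
<br />
<pre><br />
^үйім/үй<n><px1sg><nom>$            "my house"<br />
^үйлеріміз/үй<n><pl><px1pl><nom>$   "our houses"<br />
</pre><br />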
<br />
===Subject marking=== <!-- subject --><br />
<br />
Used e.g. in verbs that mark both subject and object; otherwise, see [[#Person]] and [[#Number]].<br />
<br />
{|class=wikitable<br />
! Symbol !! Gloss !! Notes !! Universal features<br />
|-<br />
| <code>s_sg1</code> || First person singular subject || || Number[subj]=Sing Person[subj]=1<br />
|-<br />
| <code>s_sg2</code> || Second person singular subject || || Number[subj]=Sing Person[subj]=2<br />
|-<br />
| <code>s_sg3</code> || Third person singular subject || || Number[subj]=Sing Person[subj]=3<br />
|-<br />
| <code>s_pl1</code> || First person plural subject || || Number[subj]=Plur Person[subj]=1<br />
|-<br />
| <code>s_pl2</code> || Second person plural subject || || Number[subj]=Plur Person[subj]=2<br />
|-<br />
| <code>s_pl3</code> || Third person plural subject || || Number[subj]=Plur Person[subj]=3<br />
|-<br />
|}<br />
<br />
<br />
===Object marking=== <!-- object --><br />
<br />
Used e.g. in verbs that mark both subject and object.<br />
<br />
{|class=wikitable<br />
! Symbol !! Gloss !! Notes !! Universal features<br />
|-<br />
| <code>o_sg1</code> || First person singular object || || Number[obj]=Sing Person[obj]=1<br />
|-<br />
| <code>o_sg2</code> || Second person singular object || || Number[obj]=Sing Person[obj]=2<br />
|-<br />
| <code>o_sg3</code> || Third person singular object || || Number[obj]=Sing Person[obj]=3<br />
|-<br />
| <code>o_pl1</code> || First person plural object || || Number[obj]=Plur Person[obj]=1<br />
|-<br />
| <code>o_pl2</code> || Second person plural object || || Number[obj]=Plur Person[obj]=2<br />
|-<br />
| <code>o_pl3</code> || Third person plural object || || Number[obj]=Plur Person[obj]=3<br />
|-<br />
|}<br />
<br />
===Adjectives=== <!-- adj_infl --><br />
<br />
{|class=wikitable<br />
! Symbol !! Gloss !! Notes !! Universal features<br />
|-<br />
| <code>pst</code> || Positive || || Degree=Pos<br />
|-<br />
| <code>comp</code> || Comparative || [http://en.wikipedia.org/wiki/Adjective#Attributive.2C_predicative.2C_absolute.2C_and_substantive_adjectives wikipedia] || Degree=Comp<br />
|-<br />
| <code>sup</code> || Superlative || [http://en.wikipedia.org/wiki/Adjective#Attributive.2C_predicative.2C_absolute.2C_and_substantive_adjectives wikipedia] || Degree=Sup<br />
|-<br />
| <code>attr</code> || Attributive || [http://en.wikipedia.org/wiki/Adjective#Attributive.2C_predicative.2C_absolute.2C_and_substantive_adjectives wikipedia] ||<br />
|-<br />
| <code>pred</code> || Predicative || [http://en.wikipedia.org/wiki/Adjective#Attributive.2C_predicative.2C_absolute.2C_and_substantive_adjectives wikipedia] ||<br />
|-<br />
| <code>short</code> || Short adjective ||<br />
|}<br />
<br />
===Formality=== <!-- formality --><br />
{|class=wikitable<br />
! Symbols !! Gloss !! Notes<br />
|-<br />
| <code>crd</code> || Cordial ||<br />
|-<br />
| <code>el</code> || Elite ||<br />
|-<br />
| <code>fam</code> || Familiar ||<br />
|-<br />
| <code>frm</code> || Formal ||<br />
|-<br />
| <code>infml</code> || Informal ||<br />
|-<br />
| <code>pol</code> || Polite ||<br />
|-<br />
| <code>low</code> || Low courtesy ||<br />
|-<br />
| <code>mid</code> || Mid courtesy ||<br />
|-<br />
| <code>hi</code> || High courtesy ||<br />
|}<br />
<br />
===Specificity=== <!-- specificity --><br />
{|class=wikitable<br />
! Symbols !! Gloss !! Notes<br />
|-<br />
| <code>spc</code> || Specific || Definite=Spec<br />
|-<br />
| <code>nspc</code> || Non-specific ||<br />
|}<br />
<br />
===Others=== <!-- other --><br />
{|class=wikitable<br />
! Symbol !! Gloss !! Notes<br />
|-<br />
| <code>abbr</code> || Abbreviation (e.g. ''etc., Mr.'') || Acronyms are also included (see <code>acr</code>)<br />
|-<br />
| <code>date</code> || Dates, years... ||<br />
|-<br />
| <code>email</code> || Email || Shortened form of ''electronic mail''<br />
|-<br />
| <code>file</code> || Filenames ||<br />
|-<br />
| <code>mon</code> || Money ||<br />
|-<br />
| <code>percent</code> || Percentage || e.g. 25%, 0.9%<br />
|-<br />
| <code>time</code> || Time ||<br />
|-<br />
| <code>url</code> || Web address ||<br />
|-<br />
| <code>web</code> || Links and Emails ||<br />
|-<br />
| <code>year</code> || Years ||<br />
|-<br />
| <code>maj</code> || Large script in which every letter is the same height ||<br />
|-<br />
| <code>min</code> || Small script in which every letter is the same height ||<br />
|}<br />
<br />
=== Compounds === <!-- compound --><br />
<br />
{|class=wikitable<br />
! Symbol !! Gloss !! Notes !! Universal feature <br />
|-<br />
| <code>cmp</code> || Compound Noun || ||<br />
|}<br />
<br />
==Chunk tags== <!-- chunk --><br />
<br />
{|class=wikitable<br />
! Tag !! Description<br />
|-<br />
| {{tag|SN}} || Noun phrase / noun group (''sintagma nominal'')<br />
|- <br />
| {{tag|SA}} || Adjective phrase / adjective group <br />
|-<br />
| {{tag|SV}} || Verb phrase / verb group (''sintagma verbal'')<br />
|-<br />
|}<br />
<br />
==XML tags== <!-- xml --><br />
Note: All XML tags are explained in depth in the PDF [[documentation]], see also the [https://github.com/apertium/lttoolbox/blob/master/lttoolbox/dix.dtd dix.dtd] and [https://github.com/apertium/lttoolbox/blob/master/lttoolbox/dix.rng dix.rng] files in the GitHub repository.<br />
<br />
{|class=wikitable<br />
! XML tag !! Means !! Appears in XML tags / notes / examples<br />
|-<br />
| <code>&lt;dictionary></code> || Mono- or bilingual dictionary || Top-level tag for all dictionaries<br />
|-<br />
| <code>&lt;alphabet></code> || Set of characters in the language || In <code>&lt;dictionary></code><br />
|-<br />
| <code>&lt;sdefs></code> || Symbol definitions || In <code>&lt;dictionary></code><br />
|-<br />
| <code>&lt;sdef></code> || Symbol definition || In <code>&lt;sdefs></code>. Ex: <code>&lt;sdef n="noun"/></code><br />
|-<br />
| <code>&lt;pardefs></code> || Paradigm definitions || In <code>&lt;dictionary></code>.<br />
|-<br />
| <code>&lt;pardef></code> || Paradigm definition || In <code>&lt;pardefs></code>.<br />
|-<br />
| <code>&lt;section></code> || A section of the dictionary || In <code>&lt;dictionary></code>. Ex: <code>&lt;section id="main" type="standard"></code><br />
|-<br />
| <code>&lt;e></code> || A dictionary entry (a word) || In <code>&lt;section></code> and in <code>&lt;pardef></code>.<br />
|-<br />
| <code>&lt;i></code> || Invariant (left and right side) || In <code>&lt;e></code>. Ex.: <code>&lt;i>beer&lt;/i></code><br />
|-<br />
| <code>&lt;p></code> || A pair || In <code>&lt;e></code>.<br />
|-<br />
| <code>&lt;l></code> || Left side (surface form) || In <code>&lt;p></code>. Ex.: <code>&lt;l>beer&lt;/l></code><br />
|-<br />
| <code>&lt;r></code> || Right side (lexical unit) || In <code>&lt;p></code>. Ex.: <code>&lt;r>beer&lt;s n="noun"/>&lt;s n="singular"/>&lt;/r></code><br />
|-<br />
| <code>&lt;s></code> || A lexical symbol (noun, adj..) || In <code>&lt;r></code>, <code>&lt;l></code> and <code>&lt;i></code>. Ex.: <code>&lt;s n="noun"/></code><br />
|-<br />
| <code>&lt;a></code> || Post-generator wake-up mark || In <code>&lt;r></code>, <code>&lt;l></code> and <code>&lt;i></code>. Ex.: <code>&lt;l>&lt;a/>a&lt;s ...</code> (for the a/an rule in English)<br />
|-<br />
| <code>&lt;b></code> || Blank space || In <code>&lt;r></code>, <code>&lt;l></code> and <code>&lt;i></code>. Ex.: <code>&lt;l>you're&lt;b/>welcome&lt;s ...</code> <br />
|-<br />
| <code>&lt;g></code> || Group || For [[Chunking:_A_full_example#Handling_of_multiwords_with_inner_inflection|multiwords]]<br />
|-<br />
| <code>&lt;ig></code> || Identity group || Combination of <code>&lt;i></code> and <code>&lt;g></code><br />
|-<br />
| <code>&lt;j></code> || Join || A <code>+</code> symbol in compounds<br />
|-<br />
| <code>&lt;prm></code> || Parameter || Only in [[Metadix]]<br />
|-<br />
| <code>&lt;sa></code> || Symbol Argument ??? || Only in [[Metadix]]<br />
|-<br />
| <code>&lt;t></code> || Tag or Template || In [[Apertium-separable]] <code>&lt;t></code> is any tag, in crossdix it is template (matches a single tag)<br />
|-<br />
| <code>&lt;d></code> || Delimiter || In [[Apertium-separable]] marks end-of-word<br />
|-<br />
| <code>&lt;v></code> || Variable || Only in crossdix - like + in regexes<br />
|}<br />
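Putting several of the above together, a minimal illustrative monolingual dictionary might look as follows. This is only a sketch: <code>&lt;par></code>, which references a paradigm from an entry, is defined in dix.dtd alongside the tags listed above.<br />
<br />
<pre><br />
<dictionary><br />
  <alphabet>abcdefghijklmnopqrstuvwxyz</alphabet><br />
  <sdefs><br />
    <sdef n="n"/><br />
    <sdef n="sg"/><br />
    <sdef n="pl"/><br />
  </sdefs><br />
  <pardefs><br />
    <pardef n="beer__n"><br />
      <e><p><l></l><r><s n="n"/><s n="sg"/></r></p></e><br />
      <e><p><l>s</l><r><s n="n"/><s n="pl"/></r></p></e><br />
    </pardef><br />
  </pardefs><br />
  <section id="main" type="standard"><br />
    <e lm="beer"><i>beer</i><par n="beer__n"/></e><br />
  </section><br />
</dictionary><br />
</pre><br />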
<br />
=== Transfer ===<br />
<br />
==== <clip> tag ==== <!-- clip --><br />
<br />
See the [https://wiki.apertium.org/w/images/d/d0/Apertium2-documentation.pdf documentation (pdf)], p.144 for more information.<br />
<br />
{|class=wikitable<br />
! XML attribute value !! Means !! Appears in attribute !! Notes<br />
|-<br />
| <code>whole</code> || lemma and grammatical symbols || part <br />
|-<br />
| <code>lem</code> || lemma || part<br />
|-<br />
| <code>lemh</code> || (inflected) head word of [[Chunking:_A_full_example#Handling_of_multiwords_with_inner_inflection|multiword]] || part<br />
|-<br />
| <code>lemq</code> || following queue of [[Chunking:_A_full_example#Handling_of_multiwords_with_inner_inflection|multiword]] || part<br />
|-<br />
|}<br />
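As an illustrative fragment, a transfer rule action could output the whole lexical form (lemma plus grammatical symbols) of the first matched word like this. This is a sketch; see the PDF documentation for the full set of <code>&lt;clip></code> attributes.<br />
<br />
<pre><br />
<out><br />
  <lu><br />
    <clip pos="1" side="tl" part="whole"/><br />
  </lu><br />
</out><br />
</pre><br />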
<br />
==Scraping this page==<br />
<br />
This page should be relatively scrapeable if requested with <code>?action=raw</code>.<br />
<br />
Section headers which precede tables all have <code>=</code> as the first character of the line and have a category name without spaces in a comment.<br />
<br />
Lines that define tags begin with <code>| &lt;code&gt;</code>. Splitting a line on <code>||</code> gives either 3 or 4 columns. The 4th column can be split on spaces to give UD POS tags and feature values or the word <code>or</code>. These are mixed together but features have <code>=</code> and POS tags don't. A line might be followed by a comment containing either <code>unknown</code> or <code>default</code>, which indicate a placeholder tag or a tag which is commonly used when the correct value cannot be determined, respectively.<br />
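A minimal sketch of such a parser in Python, following the line format described above (function and variable names here are our own, not from the linked scraper script):<br />

```python
import re

def parse_tag_line(line):
    """Parse one tag-definition line from the raw wiki source.

    Returns (symbol, gloss, notes, features), where features is a list
    of UD POS tags and Feature=Value strings (empty for 3-column rows).
    """
    # Drop HTML comments like <!-- default --> and surrounding whitespace.
    line = re.sub(r'<!--.*?-->', '', line).strip()
    cols = [c.strip() for c in line.lstrip('| ').split('||')]
    symbol = re.match(r'<code>(.*?)</code>', cols[0]).group(1)
    gloss = cols[1]
    notes = cols[2] if len(cols) > 2 else ''
    # POS tags and features are space-separated; the word "or" separates
    # alternative feature sets, so drop it.
    feats = [f for f in (cols[3].split() if len(cols) > 3 else []) if f != 'or']
    return symbol, gloss, notes, feats
```

Candidate lines are selected by checking that they start with <code>| &lt;code></code>, as described above.<br />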
<br />
A Python scraper script can be found at https://github.com/mr-martian/apertium-recursive-learning/blob/master/tags.py<br />
<br />
==See also==<br />
* [[Turkic lexicon|Guidelines for tag assignment (etc.) in Turkic]]<br />
* [[Tagging guidelines for Portuguese]]<br />
* [[Syntax tags]]<br />
* [[Secondary tags]]<br />
* [[Apertium stream format]]<br />
* [[User:Adverick#FreeMind_Apertium_PoS|FreeMind Apertium PoS]]<br />
<br />
[[Category:Documentation in English]]</div>Firespeaker
https://wiki.apertium.org/w/index.php?title=Constraint-based_lexical_selection_module&diff=74361
Constraint-based lexical selection module
2023-04-11T15:26:24Z
<p>Firespeaker: /* Sequences */</p>
<hr />
<div>{{TOCD}}<br />
<br />
'''apertium-lex-tools''' provides a module for compiling lexical selection rules and processing them in the pipeline. Rules can be manually written, or learnt from monolingual or parallel corpora.<br />
<br />
==Installing==<br />
Prerequisites and compilation are the same as lttoolbox and apertium, as well as (on Debian/Ubuntu) zlib1g-dev. <br />
<br />
<span style="color: #f00;">See [[Installation]], for most real operating systems you can now get pre-built packages of apertium-lex-tools (as well as other core tools) through your regular package manager.</span><br />
<br />
==Lexical transfer in the pipeline==<br />
<br />
lrx-proc runs between bidix lookup and the first stage of transfer, e.g. <br />
<pre><br />
… apertium-pretransfer | lt-proc -b kaz-tat.autobil.bin | lrx-proc kaz-tat.lrx.bin \<br />
| apertium-transfer -b apertium-kaz-tat.kaz-tat.t1x kaz-tat.t1x.bin | …<br />
</pre><br />
<br />
This is the output of <code>lt-proc -b</code> on an ambiguous bilingual dictionary:<br />
<pre><br />
[74306] ^El<det><def><f><sg>/The<det><def><f><sg>$ <br />
^estació<n><f><sg>/season<n><sg>/station<n><sg>$ ^més<preadv>/more<preadv>$ <br />
^plujós<adj><f><sg>/rainy<adj><sint><f><sg>$ <br />
^ser<vbser><pri><p3><sg>/be<vbser><pri><p3><sg>$ <br />
^el<det><def><f><sg>/the<det><def><f><sg>$ <br />
^tardor<n><f><sg>/autumn<n><sg>/fall<n><sg>$^,<cm>/,<cm>$ <br />
^i<cnjcoo>/and<cnjcoo>$ ^el<det><def><f><sg>/the<det><def><f><sg>$ <br />
^més<preadv>/more<preadv>$ ^sec<adj><f><sg>/dry<adj><sint><f><sg>$ <br />
^el<det><def><m><sg>/the<det><def><m><sg>$ <br />
^estiu<n><m><sg>/summer<n><sg>$^.<sent>/.<sent>$<br />
</pre><br />
<br />
I.e.<br />
<pre><br />
L'estació més plujosa és la tardor, i la més seca l'estiu<br />
</pre><br />
<br />
Goes to:<br />
<pre><br />
The season/station more rainy is the autumn/fall, and the more dry the summer.<br />
</pre><br />
<br />
Apertium/lttoolbox 3.3 and onwards support the -b option to lt-proc / apertium-transfer.<br />
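The stream format above can be handled with a few lines of code; here is a minimal sketch of a parser for <code>lt-proc -b</code> output (simplified: it ignores escaped characters and superblanks):<br />

```python
import re

# Simplified parser for the lt-proc -b stream shown above: each lexical
# unit is ^source/target1/target2/...$; real streams also contain
# escaped characters and superblanks, which this sketch ignores.
def parse_units(stream):
    units = []
    for m in re.finditer(r"\^([^$]*)\$", stream):
        parts = m.group(1).split("/")
        units.append((parts[0], parts[1:]))
    return units

units = parse_units("^estació<n><f><sg>/season<n><sg>/station<n><sg>$ ^més<preadv>/more<preadv>$")
# units[0][1] holds the two competing translations of "estació"
```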
<br />
==Usage==<br />
<br />
Make a simple rule file,<br />
<br />
<pre><br />
<rules><br />
<rule><br />
<match lemma="criminal" tags="adj"/><br />
<match lemma="court" tags="n.*"><select lemma="juzgado" tags="n.*"/></match><br />
</rule><br />
</rules><br />
</pre><br />
<br />
Then compile it:<br />
<br />
<pre><br />
$ lrx-comp rules.xml rules.fst<br />
1: 32@32<br />
</pre><br />
<br />
The input is the output of <code>lt-proc -b</code>,<br />
<br />
<pre><br />
$ echo "^There<adv>/Allí<adv>$ ^be<vbser><pri><p3><sg>/ser<vbser><pri><p3><sg>$ ^a<det><ind><sg>/uno<det><ind><GD><sg>$ <br />
^criminal<adj>/criminal<adj><mf>/delictivo<adj>$ <br />
^court<n><sg>/corte<n><f><sg>/cancha<n><f><sg>/juzgado<n><m><sg>/tribunal<n><m><sg>$^.<sent>/.<sent>$" | ./lrx-proc -t rules.fst <br />
1:SELECT<1>:court<n><sg>:<select>juzgado<n><ANY_TAG><br />
^There<adv>/Allí<adv>$ ^be<vbser><pri><p3><sg>/ser<vbser><pri><p3><sg>$ ^a<det><ind><sg>/uno<det><ind><GD><sg>$ <br />
^criminal<adj>/criminal<adj><mf>/delictivo<adj>$ ^court<n><sg>/juzgado<n><m><sg>$^.<sent>/.<sent>$<br />
</pre><br />
<br />
==Rule format==<br />
<br />
A rule is made up of an ordered list of:<br />
<br />
* Matches<br />
* Operations (select, remove)<br />
<br />
<pre><br />
<rule> <br />
<match lemma="el"/> <br />
<match lemma="dona" tags="n.*"> <br />
<select lemma="wife"/> <br />
</match> <br />
<match lemma="de"/><br />
</rule><br />
<br />
<rule> <br />
<match lemma="estació" tags="n.*"> <br />
<select lemma="season"/> <br />
</match> <br />
<match lemma="més"/><br />
<match lemma="plujós"/><br />
</rule><br />
<br />
<rule> <br />
<match lemma="guanyador"/><br />
<match lemma="de"/><br />
<match/><br />
<match lemma="prova" tags="n.*"> <br />
<select lemma="event"/> <br />
</match> <br />
</rule><br />
</pre><br />
<br />
===Weights===<br />
<br />
The rules compete with each other, which is why a weight is assigned to each of them. When a word has several possible translations in the dictionary, all rules are evaluated: for each possible translation, the weights of the rules whose context matches the word's use in the sentence are summed, and the translation with the highest total is chosen. For instance, consider these two rules:<br />
<br />
<pre><br />
<rule weight="0.8"><br />
<match lemma="ferotge" tags="adj.*"><select lemma="farouche"/></match><br />
</rule><br />
<rule weight="1.0"><br />
<or><br />
<match lemma="animal" tags="n.*"/><br />
<match lemma="animau" tags="n.*"/><br />
</or><br />
<match lemma="ferotge" tags="adj.*"><select lemma="féroce"/></match><br />
</rule><br />
</pre><br />
<br />
If we have "un animal ferotge", the translation "farouche" will get 0.8 points, and "féroce" will get 1.0. The latter will be chosen.<br />
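The scoring just described can be sketched as follows (hypothetical data structures; the real implementation works on compiled transducers):<br />

```python
from collections import defaultdict

# Sketch of the scoring described above: each matching rule adds its
# weight to the translation it selects; the highest total wins.
# (Hypothetical representation; lrx-proc actually uses compiled FSTs.)
matched_rules = [
    {"weight": 0.8, "select": "farouche"},  # context-free default rule
    {"weight": 1.0, "select": "féroce"},    # "animal ferotge" context rule
]

scores = defaultdict(float)
for rule in matched_rules:
    scores[rule["select"]] += rule["weight"]

best = max(scores, key=scores.get)  # "féroce" wins with 1.0 > 0.8
```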
<br />
===Operator OR===<br />
<br />
The boolean operator OR can be used, as shown in the previous example:<br />
<br />
<pre><br />
<rule weight="1.0"><br />
<or><br />
<match lemma="animal" tags="n.*"/><br />
<match lemma="animau" tags="n.*"/><br />
</or><br />
<match lemma="ferotge" tags="adj.*"><select lemma="féroce"/></match><br />
</rule><br />
</pre><br />
<br />
===Sequences===<br />
<br />
Often, the same words are used in several OR blocks. For readability and maintainability, they can be defined once in a special sequence block, for instance:<br />
<br />
<pre><br />
<def-seqs><br />
<def-seq n="jorns"><or><br />
<match lemma="diluns" tags="n.*"/><br />
<match lemma="dimars" tags="n.*"/><br />
<match lemma="dimècres" tags="n.*"/><br />
<match lemma="dimèrcs" tags="n.*"/><br />
<match lemma="dijòus" tags="n.*"/><br />
<match lemma="dijaus" tags="n.*"/><br />
<match lemma="divendres" tags="n.*"/><br />
<match lemma="divés" tags="n.*"/><br />
<match lemma="dissabte" tags="n.*"/><br />
<match lemma="dimenge" tags="n.*"/><br />
</or></def-seq><br />
<br />
<def-seq n="meses"><or><br />
<match lemma="genèr" tags="n.*"/><br />
<match lemma="genièr" tags="n.*"/><br />
<match lemma="janvièr" tags="n.*"/><br />
<match lemma="gèr" tags="n.*"/><br />
<match lemma="febrièr" tags="n.*"/><br />
<match lemma="heurèr" tags="n.*"/><br />
<match lemma="hrevèr" tags="n.*"/><br />
<match lemma="herevèr" tags="n.*"/><br />
<match lemma="herbèr" tags="n.*"/><br />
<match lemma="hiurèr" tags="n.*"/><br />
<match lemma="març" tags="n.*"/><br />
<match lemma="abrial" tags="n.*"/><br />
<match lemma="abril" tags="n.*"/><br />
<match lemma="abriu" tags="n.*"/><br />
<match lemma="abrieu" tags="n.*"/><br />
<match lemma="mai" tags="n.*"/><br />
<match lemma="junh" tags="n.*"/><br />
<match lemma="julh" tags="n.*"/><br />
<match lemma="juin" tags="n.*"/><br />
<match lemma="gulh" tags="n.*"/><br />
<match lemma="julhet" tags="n.*"/><br />
<match lemma="gulhet" tags="n.*"/><br />
<match lemma="junhsèga" tags="n.*"/><br />
<match lemma="agost" tags="n.*"/><br />
<match lemma="aost" tags="n.*"/><br />
<match lemma="setembre" tags="n.*"/><br />
<match lemma="seteme" tags="n.*"/><br />
<match lemma="octobre" tags="n.*"/><br />
<match lemma="octòbre" tags="n.*"/><br />
<match lemma="novembre" tags="n.*"/><br />
<match lemma="noveme" tags="n.*"/><br />
<match lemma="decembre" tags="n.*"/><br />
<match lemma="deceme" tags="n.*"/><br />
</or></def-seq><br />
</def-seqs><br />
</pre><br />
<br />
They have to be referenced in the rules as follows:<br />
<br />
<pre><br />
<rule weight="1.0"><br />
<or><br />
<seq n="jorns"/><br />
<seq n="meses"/><br />
<match lemma="prima" tags="n.*"/><br />
<match lemma="estiu" tags="n.*"/><br />
<match lemma="auton" tags="n.*"/><br />
<match lemma="ivèrn" tags="n.*"/><br />
</or><br />
<match lemma="passat" tags="adj.*"><select lemma="dernier"/></match><br />
</rule><br />
</pre><br />
<br />
Note that if you add a <code>&lt;def-seqs&gt;</code> section and you had only a <code>&lt;rules&gt;</code> section already, then you'll need to put both inside of an <code>&lt;lrx&gt;</code> section:<br />
<pre><br />
<lrx><br />
<def-seqs><br />
...<br />
</def-seqs><br />
<rules><br />
...<br />
</rules><br />
</lrx><br />
</pre><br />
<br />
===Operator REPEAT===<br />
<br />
Consider the translation of the Occitan word "còrn", which may be "corner" or "horn" (of an animal). As a first version, we could have:<br />
<br />
<pre><br />
<rule weight="0.8"><br />
<match lemma="còrn" tags="n.*"><select lemma="corner"/></match><br />
</rule><br />
<rule weight="1.0" ><br />
<match lemma="còrn" tags="n.*"><select lemma="horn"/></match><br />
<match lemma="de" tags="pr"/><br />
<or><br />
<seq n="animals"/><br />
</or><br />
</rule><br />
</pre><br />
<br />
But this will not match if an adjective follows "còrn" (adjectives usually follow nouns in Occitan). We could add a rule like:<br />
<br />
<pre><br />
<rule weight="1.0" ><br />
<match lemma="còrn" tags="n.*"><select lemma="horn"/></match><br />
<match tags="adj.*"/><br />
<match lemma="de" tags="pr"/><br />
<or><br />
<seq n="animals"/><br />
</or><br />
</rule><br />
</pre><br />
<br />
Using the REPEAT operator, we can express this more compactly by just expanding rule 2:<br />
<br />
<pre><br />
<rule weight="1.0" ><br />
<match lemma="còrn" tags="n.*"><select lemma="horn"/></match><br />
<repeat from="0" upto="2"><br />
<match tags="adj.*"/><br />
</repeat><br />
<match lemma="de" tags="pr"/><br />
<or><br />
<seq n="animals"/><br />
</or><br />
</rule><br />
</pre><br />
<br />
Note that this rule now even accepts two adjectives after "còrn" instead of only one, without adding a fourth rule to deal with two adjectives.<br />
<br />
And, if we think that a horn can be not only "big" but also "very big", we can improve the rule this way:<br />
<br />
<pre><br />
<rule weight="1.0" ><br />
<match lemma="còrn" tags="n.*"><select lemma="horn"/></match><br />
<repeat from="0" upto="3"><br />
<or><br />
<match tags="adv"/><br />
<match tags="adj.*"/><br />
</or><br />
</repeat><br />
<match lemma="de" tags="pr"/><br />
<or><br />
<seq n="animals"/><br />
</or><br />
</rule><br />
</pre><br />
<br />
Next, a second REPEAT block could be added between the preposition "de" and the sequence to deal with the possible existence of determiners, adjectives, etc.<br />
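Over a flattened sequence of tags, a REPEAT block behaves like a bounded repetition, comparable to a regex quantifier. A sketch of the idea (illustrative only; lrx-comp compiles rules to finite-state transducers, not regexes):<br />

```python
import re

# The REPEAT block above corresponds to a bounded repetition over the
# tag sequence: "còrn", then 0-2 adjectives, then the preposition "de".
pattern = re.compile(r"còrn/n( adj){0,2} de/pr")

assert pattern.fullmatch("còrn/n de/pr")                  # no adjective
assert pattern.fullmatch("còrn/n adj adj de/pr")          # two adjectives
assert not pattern.fullmatch("còrn/n adj adj adj de/pr")  # three: too many
```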
<br />
====REPEAT hack====<br />
<br />
Sometimes the conditions for a lexical selection are unclear. For instance, the Occitan noun "cosina" may be "(female) cousin" or "kitchen". We can decide that the latter is the most usual translation, so it will be the default. On the other hand, we will select "cousin" if there is another kinship term nearby, such as "father", "mother" or "brother". For this we can do something like:<br />
<br />
<pre><br />
<rule weight="0.8"><br />
<match lemma="cosina" tags="n.*"><select lemma="kitchen"/></match><br />
</rule><br />
<rule weight="1.0" ><br />
<match lemma="cosina" tags="n.*"><select lemma="cousin"/></match><br />
<repeat from="0" upto="4"><br />
<or><br />
<match tags=""/><br />
<match tags="*"/><br />
</or><br />
</repeat><br />
<or><br />
<seq n="familia"/><br />
</or><br />
</rule><br />
<rule weight="1.0" ><br />
<or><br />
<seq n="familia"/><br />
</or><br />
<repeat from="0" upto="4"><br />
<or><br />
<match tags=""/><br />
<match tags="*"/><br />
</or><br />
</repeat><br />
<match lemma="cosina" tags="n.*"><select lemma="cousin"/></match><br />
</rule><br />
</pre><br />
<br />
Rule 2 selects "cousin" if a family word occurs within the four following words. Rule 3 does the same, but looks up to four words before. Note the OR operator within the REPEAT: <i><nowiki><match tags="*"/></nowiki></i> matches any known word (i.e. one that gets a morphological analysis), while <i><nowiki><match tags=""/></nowiki></i> matches unknown words (i.e. without any morphological tag). Without the OR operation, the rules would try to match precisely a sequence of one unknown word followed by one known one.<br />
<br />
===Macros===<br />
<br />
A macro is a set of rules for a common purpose that can be used for several words. For instance, quite often a verb has a different translation if it is pronominal or not, or if it is transitive or not.<br />
<br />
Let's take as an example the Occitan verb "recordar", which is usually translated into French as "rappeler" ("remind"), but as a pronominal verb becomes "(se) souvenir" ("remember"). The problem is that recognising a pronominal context takes quite a few rules to check that there is a personal unstressed pronoun before (or after) the verb and that it has the same person and number as the verb. So a macro could be created like:<br />
<br />
<pre><br />
<!-- p1 = verb oci, p2 = verb fra no pron, p3 = verb fra pron --><br />
<def-macro n="verb_nopron_pron" npar="3"><br />
<rule weight="0.8"><br />
<match plemma="1" tags="vblex.*"><select plemma="2"/></match><br />
</rule><br />
<rule weight="1.0"><br />
<match tags="prn.pro.p1.*.sg"/><br />
<match plemma="1" tags="vblex.*.p1.sg"><select plemma="3"/></match><br />
</rule><br />
<rule weight="1.0"><br />
<match tags="prn.pro.p1.*.sg"/><br />
<match tags="prn.pro.*"/><br />
<match plemma="1" tags="vblex.*.p1.sg"><select plemma="3"/></match><br />
</rule><br />
<rule weight="1.0"><br />
<match plemma="1" tags="vblex.*.p1.sg"><select plemma="3"/></match><br />
<match tags="prn.enc.p1.*.sg"/><br />
</rule><br />
<rule weight="1.0"><br />
<match tags="prn.pro.p2.*.sg"/><br />
<match plemma="1" tags="vblex.*.p2.sg"><select plemma="3"/></match><br />
</rule><br />
<rule weight="1.0"><br />
<match tags="prn.pro.p2.*.sg"/><br />
<match tags="prn.pro.*"/><br />
<match plemma="1" tags="vblex.*.p2.sg"><select plemma="3"/></match><br />
</rule><br />
<rule weight="1.0"><br />
<match plemma="1" tags="vblex.*.p2.sg"><select plemma="3"/></match><br />
<match tags="prn.enc.p2.*.sg"/><br />
</rule><br />
<br />
</def-macro><br />
</pre><br />
<br />
The call to the macro should be:<br />
<br />
<pre><br />
<macro n="verb_nopron_pron"><with-param v="recordar"/><with-param v="rappeler"/><with-param v="souvenir"/></macro><br />
</pre><br />
<br />
For other verbs a call to the same macro is sufficient. The code is much more readable and maintainable than without macros.<br />
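Conceptually, the expansion performed by the compiler is simple template substitution: each <code>plemma="N"</code> is replaced by the N-th <code>with-param</code> value. A sketch (hypothetical helper; lrx-comp does this internally):<br />

```python
# Sketch of macro expansion: each plemma="N" attribute is replaced by
# the N-th argument from the <with-param/> list.
# (Illustrative; lrx-comp performs this expansion during compilation.)
template = '<match plemma="1" tags="vblex.*"><select plemma="2"/></match>'

def expand(template, params):
    out = template
    for i, value in enumerate(params, start=1):
        out = out.replace('plemma="%d"' % i, 'lemma="%s"' % value)
    return out

rule = expand(template, ["recordar", "rappeler", "souvenir"])
# → '<match lemma="recordar" tags="vblex.*"><select lemma="rappeler"/></match>'
```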
<br />
===Special cases===<br />
====Matching a capitalized word====<br />
<br />
Below, the noun "audiència" will usually be translated as "audience", but if it is written as "Audiència", "cour# d'assises" (i.e. <nowiki>cour<g><b/>d'assises</g></nowiki>) will be selected:<br />
<br />
<pre><br />
<rule weight="0.8"><br />
<match lemma="audiència" tags="n.*"><select lemma="audience"/></match><br />
</rule><br />
<rule weight="1.0"><br />
<match lemma="Audiència" tags="n.*"><select lemma="cour# d'assises"/></match><br />
</rule><br />
</pre><br />
<br />
====Matching an unknown word====<br />
<br />
Below, the noun "mossèn" will usually be translated as "curé", but if it is followed by an anthroponym (rule 2) or an unknown word (rule 3), "monseigneur" will be selected:<br />
<br />
<pre><br />
<rule weight="0.8"><br />
<match lemma="mossèn" tags="n.*"><select lemma="curé"/></match><br />
</rule><br />
<rule weight="1.0"><br />
<match lemma="mossèn" tags="n.m.sg"><select lemma="monseigneur"/></match><br />
<match tags="np.ant.*"/><br />
</rule><br />
<rule weight="1.0"><br />
<match lemma="mossèn" tags="n.m.sg"><select lemma="monseigneur"/></match><br />
<match tags=""/><br />
</rule><br />
</pre><br />
<br />
The last rule can be improved by specifying that the unknown word should be capitalized:<br />
<br />
<pre><br />
<rule weight="1.0"><br />
<match lemma="mossèn" tags="n.m.sg"><select lemma="monseigneur"/></match><br />
<match tags="" case="Aa"/><br />
</rule><br />
</pre><br />
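Assuming <code>case="Aa"</code> means "first letter uppercase, rest lowercase", the check can be pictured as follows (a sketch of the assumed semantics, not lrx-proc's implementation):<br />

```python
# Sketch of the assumed case="Aa" pattern: first letter uppercase,
# the remaining letters lowercase.
def matches_case_Aa(word):
    return len(word) > 1 and word[0].isupper() and word[1:].islower()

assert matches_case_Aa("Monsenhor")
assert not matches_case_Aa("monsenhor")
```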
<br />
=== Some more new stuff ===<br />
* [https://github.com/apertium/apertium-lex-tools/commit/8f5493f28d944c9c1591e3449f6df8c4718ab2c6 contains]<br />
* [https://github.com/apertium/apertium-lex-tools/commit/371b3740e74c27a2e6ff4278ce9286ee9e0b2319 suffix]<br />
* [https://github.com/apertium/apertium-lex-tools/pull/92 <code>glob="star"</code>]<br />
<br />
==Writing and generating rules==<br />
<br />
===Writing===<br />
{{main|How to get started with lexical selection rules}}<br />
A good way to start writing lexical selection rules is to take a corpus and search for the problem word; you can then look at how the word should be translated and the contexts it appears in. <br />
<br />
===Generating===<br />
<br />
;Parallel corpus<br />
{{main|Learning rules from parallel and non-parallel corpora}}<br />
<br />
;Monolingual corpora<br />
<br />
{{main|Running_the_monolingual_rule_learning}}<br />
<br />
==Todo and bugs==<br />
<br />
* <s>xml compiler</s><br />
* <s>compile rule operation patterns, as well as matching patterns</s><br />
* <s>make rules with gaps work</s><br />
* <s>optimal coverage</s><br />
* <s>fix bug with processing multiple sentences</s><br />
* <s>instead of having regex OR, insert separate paths/states.</s><br />
* <s>optimise the bestPath function (don't use strings to store the paths)</s><br />
* <s>autotoolsise build</s><br />
* <s>add option to compiler to spit out ATT transducers</s><br />
* <s>fix bug with outputting an extra '\n' at the end</s><br />
* <s>edit <code>transfer.cc</code> to allow input from <code>lt-proc -b</code></s><br />
* profiling and speed up<br />
** <s>why do the regex transducers have to be minimised ?</s><br />
** <s>retrieve vector of strings corresponding to paths, instead of a single string corresponding to all of the paths</s><br />
** <s>stop using string processing to retrieve rule numbers</s><br />
** <s>retrieve vector of vectors of words, not string of words from lttoolbox</s><br />
** why does the performance drop substantially with more rules ? <br />
** <s>add a pattern -> first letter map so we don't have to call recognise() with every transition</s> (didn't work so well)<br />
* <s>there is a problem with the regex recognition code: see bug1 in <code>testing</code>.</s><br />
* <s>there is a problem with two defaults next to each other; bug2 in <code>testing</code>.</s><br />
* <s>default to case insensitive ? (perhaps case insensitive for lower case, case sensitive for uppercase) -- see bug4 in <code>testing/</code>.</s><br />
* make sure that <code>-b</code> works with <code>-n</code> too.<br />
* testing<br />
* null flush<br />
* add option to processor to spit out ATT transducers<br />
* use brown clusters to merge rules with the same context, or remove parts of context from rules which are not relevant?<br />
* https://sourceforge.net/p/apertium/tickets/64/ <code><match tags="n.*"></match></code> never matches, while <code><match tags="n.*"/></code> does<br />
<br />
; Performance<br />
<br />
* 2011-12-12: 10,000 words / 97 seconds = 103 words/sec (71290 words, 14.84 sec = 4803 words/sec)<br />
* 2011-12-19: 10,000 words / 4 seconds = 2,035 words/sec (71290 words, 8 secs = 8911 words/sec)<br />
<br />
==Preparedness of language pairs==<br />
<br />
{|class="wikitable"<br />
! Pair !! LR (L) !! LR (L→R) !! Fertility !! Rules <br />
|-<br />
| <code>apertium-is-en</code> || 18,563 || 22,220 || 1.19 || 115 <br />
|-<br />
| <code>apertium-es-fr</code> || || || || <br />
|-<br />
| <code>apertium-eu-es</code> || 16,946 || 18,550 || 1.09 || 250<br />
|-<br />
| <code>apertium-eu-en</code> || || || || <br />
|-<br />
| <code>apertium-br-fr</code> || 20,489 || 20,770 || 1.01 || 256 <br />
|-<br />
| <code>apertium-mk-en</code> || 8,568 || 10,624 || 1.24 || 81<br />
|-<br />
| <code>apertium-es-pt</code> || || || || <br />
|-<br />
| <code>apertium-es-it</code> || || || || <br />
|-<br />
| <code>apertium-es-ro</code> || || || || <br />
|-<br />
| <code>apertium-en-es</code> || 267,469 || 268,522 || 1.003 || 334 <br />
|-<br />
| <code>apertium-en-ca</code> || || || || <br />
|}<br />
<br />
<br />
===Troubleshooting===<br />
If you get the message <code>lrx-comp: error while loading shared libraries: libapertium3-3.2.so.0: cannot open shared object file: No such file or directory</code> you may need to put this in your ~/.bashrc <br />
<pre><br />
export LD_LIBRARY_PATH="/usr/local/lib:$LD_LIBRARY_PATH"<br />
</pre><br />
Then open a new terminal before using lrx-comp/lrx-proc.<br />
<br />
On a 64-bit machine, the apertium-lex-tools make may fail because zlib is missing, even though you have zlib1g-dev installed. If you get the error message <code>/usr/bin/ld: cannot find -lz</code>, install the package lib32z1-dev (which will install many other dependencies); even though it provides 32-bit binaries, it is needed to compile the sources.<br />
<br />
==See also==<br />
<br />
* [[How to get started with lexical selection rules]]<br />
* [https://svn.code.sf.net/p/apertium/svn/trunk/apertium-lex-tools/ SVN Module: apertium-lex-tools]<br />
<br />
==References==<br />
<br />
* Tyers, F. M., Sánchez-Martínez, F., Forcada, M. L. (2012) "[https://rua.ua.es/dspace/bitstream/10045/27581/1/tyers12a.pdf Flexible finite-state lexical selection for rule-based machine translation]". Proceedings of the 17th Annual Conference of the European Association of Machine Translation, EAMT12 <br />
* Tyers et al [https://rua.ua.es/dspace/bitstream/10045/35848/1/thesis_FrancisMTyers.pdf#page=62 Feasible lexical selection for rule-based machine translation]<br />
* Tyers et al [https://aclanthology.org/W15-4919.pdf Unsupervised training of maximum-entropy models for lexical selection in rule-based machine translation]<br />
<br />
[[Category:Lexical selection]]<br />
[[Category:Documentation in English]]</div>Firespeakerhttps://wiki.apertium.org/w/index.php?title=Constraint-based_lexical_selection_module&diff=74360Constraint-based lexical selection module2023-04-08T04:12:29Z<p>Firespeaker: /* Matching an unknown word */</p>
<hr />
<div>{{TOCD}}<br />
<br />
'''apertium-lex-tools''' provides a module for compiling lexical selection rules and processing them in the pipeline. Rules can be manually written, or learnt from monolingual or parallel corpora.<br />
<br />
==Installing==<br />
Prerequisites and compilation are the same as lttoolbox and apertium, as well as (on Debian/Ubuntu) zlib1g-dev. <br />
<br />
<span style="color: #f00;">See [[Installation]], for most real operating systems you can now get pre-built packages of apertium-lex-tools (as well as other core tools) through your regular package manager.</span><br />
<br />
==Lexical transfer in the pipeline==<br />
<br />
lrx-proc runs between bidix lookup and the first stage of transfer, e.g. <br />
<pre><br />
… apertium-pretransfer | lt-proc -b kaz-tat.autobil.bin | lrx-proc kaz-tat.lrx.bin \<br />
| apertium-transfer -b apertium-kaz-tat.kaz-tat.t1x kaz-tat.t1x.bin | …<br />
</pre><br />
<br />
This is the output of <code>lt-proc -b</code> on an ambiguous bilingual dictionary:<br />
<pre><br />
[74306] ^El<det><def><f><sg>/The<det><def><f><sg>$ <br />
^estació<n><f><sg>/season<n><sg>/station<n><sg>$ ^més<preadv>/more<preadv>$ <br />
^plujós<adj><f><sg>/rainy<adj><sint><f><sg>$ <br />
^ser<vbser><pri><p3><sg>/be<vbser><pri><p3><sg>$ <br />
^el<det><def><f><sg>/the<det><def><f><sg>$ <br />
^tardor<n><f><sg>/autumn<n><sg>/fall<n><sg>$^,<cm>/,<cm>$ <br />
^i<cnjcoo>/and<cnjcoo>$ ^el<det><def><f><sg>/the<det><def><f><sg>$ <br />
^més<preadv>/more<preadv>$ ^sec<adj><f><sg>/dry<adj><sint><f><sg>$ <br />
^el<det><def><m><sg>/the<det><def><m><sg>$ <br />
^estiu<n><m><sg>/summer<n><sg>$^.<sent>/.<sent>$<br />
</pre><br />
<br />
I.e.<br />
<pre><br />
L'estació més plujosa és la tardor, i la més seca l'estiu<br />
</pre><br />
<br />
Goes to:<br />
<pre><br />
The season/station more rainy is the autumn/fall, and the more dry the summer.<br />
</pre><br />
<br />
Apertium/lttoolbox 3.3 and onwards support the -b option to lt-proc / apertium-transfer.<br />
<br />
==Usage==<br />
<br />
Make a simple rule file,<br />
<br />
<pre><br />
<rules><br />
<rule><br />
<match lemma="criminal" tags="adj"/><br />
<match lemma="court" tags="n.*"><select lemma="juzgado" tags="n.*"/></match><br />
</rule><br />
</rules><br />
</pre><br />
<br />
Then compile it:<br />
<br />
<pre><br />
$ lrx-comp rules.xml rules.fst<br />
1: 32@32<br />
</pre><br />
<br />
The input is the output of <code>lt-proc -b</code>,<br />
<br />
<pre><br />
$ echo "^There<adv>/Allí<adv>$ ^be<vbser><pri><p3><sg>/ser<vbser><pri><p3><sg>$ ^a<det><ind><sg>/uno<det><ind><GD><sg>$ <br />
^criminal<adj>/criminal<adj><mf>/delictivo<adj>$ <br />
^court<n><sg>/corte<n><f><sg>/cancha<n><f><sg>/juzgado<n><m><sg>/tribunal<n><m><sg>$^.<sent>/.<sent>$" | ./lrx-proc -t rules.fst <br />
1:SELECT<1>:court<n><sg>:<select>juzgado<n><ANY_TAG><br />
^There<adv>/Allí<adv>$ ^be<vbser><pri><p3><sg>/ser<vbser><pri><p3><sg>$ ^a<det><ind><sg>/uno<det><ind><GD><sg>$ <br />
^criminal<adj>/criminal<adj><mf>/delictivo<adj>$ ^court<n><sg>/juzgado<n><m><sg>$^.<sent>/.<sent>$<br />
</pre><br />
<br />
==Rule format==<br />
<br />
A rule is made up of an ordered list of:<br />
<br />
* Matches<br />
* Operations (select, remove)<br />
<br />
<pre><br />
<rule> <br />
<match lemma="el"/> <br />
<match lemma="dona" tags="n.*"> <br />
<select lemma="wife"/> <br />
</match> <br />
<match lemma="de"/><br />
</rule><br />
<br />
<rule> <br />
<match lemma="estació" tags="n.*"> <br />
<select lemma="season"/> <br />
</match> <br />
<match lemma="més"/><br />
<match lemma="plujós"/><br />
</rule><br />
<br />
<rule> <br />
<match lemma="guanyador"/><br />
<match lemma="de"/><br />
<match/><br />
<match lemma="prova" tags="n.*"> <br />
<select lemma="event"/> <br />
</match> <br />
</rule><br />
</pre><br />
<br />
===Weights===<br />
<br />
The rules compete with each other. That is why a weight is assigned to each of them. In the case of a word that has several possible translations in the dictionary, all rules are evaluated. For each possible translation, the weights of the rules that match the context of the use of the word in the sentence are added up, and the translation with the highest value is chosen. For instance, let's consider these two rules:<br />
<br />
<pre><br />
<rule weight="0.8"><br />
<match lemma="ferotge" tags="adj.*"><select lemma="farouche"/></match><br />
</rule><br />
<rule weight="1.0"><br />
<or><br />
<match lemma="animal" tags="n.*"/><br />
<match lemma="animau" tags="n.*"/><br />
</or><br />
<match lemma="ferotge" tags="adj.*"><select lemma="féroce"/></match><br />
</rule><br />
</pre><br />
<br />
If we have "un animal ferotge", the translation "farouche" will get 0.8 points, and "féroce" will get 1.0. The latter will be chosen.<br />
<br />
===Operator OR===<br />
<br />
The boolean operator OR can be used, as shown in the previous example:<br />
<br />
<pre><br />
<rule weight="1.0"><br />
<or><br />
<match lemma="animal" tags="n.*"/><br />
<match lemma="animau" tags="n.*"/><br />
</or><br />
<match lemma="ferotge" tags="adj.*"><select lemma="féroce"/></match><br />
</rule><br />
</pre><br />
<br />
===Sequences===<br />
<br />
Often, the same words are used in OR's. For readability and maintainability, they can be defined in a special sequence bloc, for instance:<br />
<br />
<pre><br />
<def-seqs><br />
<def-seq n="jorns"><or><br />
<match lemma="diluns" tags="n.*"/><br />
<match lemma="dimars" tags="n.*"/><br />
<match lemma="dimècres" tags="n.*"/><br />
<match lemma="dimèrcs" tags="n.*"/><br />
<match lemma="dijòus" tags="n.*"/><br />
<match lemma="dijaus" tags="n.*"/><br />
<match lemma="divendres" tags="n.*"/><br />
<match lemma="divés" tags="n.*"/><br />
<match lemma="dissabte" tags="n.*"/><br />
<match lemma="dimenge" tags="n.*"/><br />
</or></def-seq><br />
<br />
<def-seq n="meses"><or><br />
<match lemma="genèr" tags="n.*"/><br />
<match lemma="genièr" tags="n.*"/><br />
<match lemma="janvièr" tags="n.*"/><br />
<match lemma="gèr" tags="n.*"/><br />
<match lemma="febrièr" tags="n.*"/><br />
<match lemma="heurèr" tags="n.*"/><br />
<match lemma="hrevèr" tags="n.*"/><br />
<match lemma="herevèr" tags="n.*"/><br />
<match lemma="herbèr" tags="n.*"/><br />
<match lemma="hiurèr" tags="n.*"/><br />
<match lemma="març" tags="n.*"/><br />
<match lemma="abrial" tags="n.*"/><br />
<match lemma="abril" tags="n.*"/><br />
<match lemma="abriu" tags="n.*"/><br />
<match lemma="abrieu" tags="n.*"/><br />
<match lemma="mai" tags="n.*"/><br />
<match lemma="junh" tags="n.*"/><br />
<match lemma="julh" tags="n.*"/><br />
<match lemma="juin" tags="n.*"/><br />
<match lemma="gulh" tags="n.*"/><br />
<match lemma="julhet" tags="n.*"/><br />
<match lemma="gulhet" tags="n.*"/><br />
<match lemma="junhsèga" tags="n.*"/><br />
<match lemma="agost" tags="n.*"/><br />
<match lemma="aost" tags="n.*"/><br />
<match lemma="setembre" tags="n.*"/><br />
<match lemma="seteme" tags="n.*"/><br />
<match lemma="octobre" tags="n.*"/><br />
<match lemma="octòbre" tags="n.*"/><br />
<match lemma="novembre" tags="n.*"/><br />
<match lemma="noveme" tags="n.*"/><br />
<match lemma="decembre" tags="n.*"/><br />
<match lemma="deceme" tags="n.*"/><br />
</or></def-seq><br />
</def-seqs><br />
</pre><br />
<br />
They have to be referenced in the rules as follows:<br />
<br />
<pre><br />
<rule weight="1.0"><br />
<or><br />
<seq n="jorns"/><br />
<seq n="meses"/><br />
<match lemma="prima" tags="n.*"/><br />
<match lemma="estiu" tags="n.*"/><br />
<match lemma="auton" tags="n.*"/><br />
<match lemma="ivèrn" tags="n.*"/><br />
</or><br />
<match lemma="passat" tags="adj.*"><select lemma="dernier"/></match><br />
</rule><br />
</pre><br />
<br />
===Operator REPEAT===<br />
<br />
Imagine the translation of the Occitan word "còrn" that may be "corner" or "horn" (of an animal). We could have as a first version:<br />
<br />
<pre><br />
<rule weight="0.8"><br />
<match lemma="còrn" tags="n.*"><select lemma="corner"/></match><br />
</rule><br />
<rule weight="1.0" ><br />
<match lemma="còrn" tags="n.*"><select lemma="horn"/></match><br />
<match lemma="de" tags="pr"/><br />
<or><br />
<seq n="animals"/><br />
</or><br />
</rule><br />
</pre><br />
<br />
But this will not match if we have an adjective that follows "còrn" (usually adjectives follow the nouns in Occitan). We could add a rule like:<br />
<br />
<pre><br />
<rule weight="1.0" ><br />
<match lemma="còrn" tags="n.*"><select lemma="horn"/></match><br />
<match tags="adj.*"/><br />
<match lemma="de" tags="pr"/><br />
<or><br />
<seq n="animals"/><br />
</or><br />
</rule><br />
</pre><br />
<br />
Using the operator REPEAT we can have a more compact way just expanding rule 2:<br />
<br />
<pre><br />
<rule weight="1.0" ><br />
<match lemma="còrn" tags="n.*"><select lemma="horn"/></match><br />
<repeat from="0" upto="2"><br />
<match tags="adj.*"/><br />
</repeat><br />
<match lemma="de" tags="pr"/><br />
<or><br />
<seq n="animals"/><br />
</or><br />
</rule><br />
</pre><br />
<br />
Note that now we are even accepting two adjectives after "còrn" instead of only one (without adding a fourth rule for dealing with two adjectives).<br />
<br />
And, if we think that horn can be not only big, but also "very big", we can improve the rule this way:<br />
<br />
<pre><br />
<rule weight="1.0" ><br />
<match lemma="còrn" tags="n.*"><select lemma="horn"/></match><br />
<repeat from="0" upto="3"><br />
<or><br />
<match tags="adv"/><br />
<match tags="adj.*"/><br />
</or><br />
</repeat><br />
<match lemma="de" tags="pr"/><br />
<or><br />
<seq n="animals"/><br />
</or><br />
</rule><br />
</pre><br />
<br />
Next, a second REPEAT block could be added between the preposition "de" and the sequence to deal with the possible existence of determiners, adjectives, etc.<br />
<br />
====REPEAT hack====<br />
<br />
Sometimes, a lexical selection has unclear rules. For instance the Occitan noun "cosina" may be "(female) cousin" or "kitchen". We can decide that the latter is the most usual translation, so it will be the default. On the other hand, we will select "cousin" if there is another parent term nearby, such as "father", "mother" or "brother". For this we can do something like:<br />
<br />
<pre><br />
<rule weight="0.8"><br />
<match lemma="cosina" tags="n.*"><select lemma="kitchen"/></match><br />
</rule><br />
<rule weight="1.0" ><br />
<match lemma="cosina" tags="n.*"><select lemma="cousin"/></match><br />
<repeat from="0" upto="4"><br />
<or><br />
<match tags=""/><br />
<match tags="*"/><br />
</or><br />
</repeat><br />
<or><br />
<seq n="familia"/><br />
</or><br />
</rule><br />
<rule weight="1.0" ><br />
<or><br />
<seq n="familia"/><br />
</or><br />
<repeat from="0" upto="4"><br />
<or><br />
<match tags=""/><br />
<match tags="*"/><br />
</or><br />
</repeat><br />
<match lemma="cosina" tags="n.*"><select lemma="cousin"/></match><br />
</rule><br />
</pre><br />
<br />
Rule 2 selects "cousin" if, at most, after four words there is a family word. Rule 3 does the same, but looking at up to 4 words in front. Note the OR operator within the REPEAT: <i><nowiki><match tags="*"/></nowiki></i> matches any known word (i.e. that gets a morphological analysis), while <i><nowiki><match tags=""/></nowiki></i> matches unknown words (i.e. without any morphological tag). Without the OR operation, the rules would try to match precisely a sequence of one unknown word followed by one known one.<br />
<br />
===Macros===<br />
<br />
A macro is a set of rules for a common purpose that can be used for several words. For instance, quite often a verb has a different translation depending on whether or not it is pronominal, or whether or not it is transitive.<br />
<br />
Let's take as an example the Occitan verb "recordar", which is usually translated into French as "rappeler" ("remind"), but as a pronominal verb would be "(se) souvenir" ("remember"). The problem is that recognising a pronominal context takes quite a few rules, to establish that there is a personal unstressed pronoun before (or after) the verb and that it agrees with the verb in person and number. So a macro could be created like this:<br />
<br />
<pre><br />
<!-- p1 = verb oci, p2 = verb fra no pron, p3 = verb fra pron --><br />
<def-macro n="verb_nopron_pron" npar="3"><br />
<rule weight="0.8"><br />
<match plemma="1" tags="vblex.*"><select plemma="2"/></match><br />
</rule><br />
<rule weight="1.0"><br />
<match tags="prn.pro.p1.*.sg"/><br />
<match plemma="1" tags="vblex.*.p1.sg"><select plemma="3"/></match><br />
</rule><br />
<rule weight="1.0"><br />
<match tags="prn.pro.p1.*.sg"/><br />
<match tags="prn.pro.*"/><br />
<match plemma="1" tags="vblex.*.p1.sg"><select plemma="3"/></match><br />
</rule><br />
<rule weight="1.0"><br />
<match plemma="1" tags="vblex.*.p1.sg"><select plemma="3"/></match><br />
<match tags="prn.enc.p1.*.sg"/><br />
</rule><br />
<rule weight="1.0"><br />
<match tags="prn.pro.p2.*.sg"/><br />
<match plemma="1" tags="vblex.*.p2.sg"><select plemma="3"/></match><br />
</rule><br />
<rule weight="1.0"><br />
<match tags="prn.pro.p2.*.sg"/><br />
<match tags="prn.pro.*"/><br />
<match plemma="1" tags="vblex.*.p2.sg"><select plemma="3"/></match><br />
</rule><br />
<rule weight="1.0"><br />
<match plemma="1" tags="vblex.*.p2.sg"><select plemma="3"/></match><br />
<match tags="prn.enc.p2.*.sg"/><br />
</rule><br />
<br />
</def-macro><br />
</pre><br />
<br />
The call to the macro should be:<br />
<br />
<pre><br />
<macro n="verb_nopron_pron"><with-param v="recordar"/><with-param v="rappeler"/><with-param v="souvenir"/></macro><br />
</pre><br />
<br />
For other verbs, a call to the same macro is sufficient. The code is much more readable and maintainable than it would be without macros.<br />
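Conceptually, the compiler instantiates the macro by substituting each <code>plemma="N"</code> placeholder with the Nth <code>with-param</code> value. A rough Python sketch of that substitution (an illustration of the idea, not the actual implementation):<br />

```python
import re

def expand_macro(template_rules, params):
    """Replace plemma="N" placeholders with the N-th parameter.

    params is the list of <with-param> values, so for the call shown
    above plemma="1" becomes lemma="recordar", plemma="2" becomes
    lemma="rappeler", and so on.  Purely illustrative.
    """
    def subst(m):
        return 'lemma="%s"' % params[int(m.group(1)) - 1]
    return [re.sub(r'plemma="(\d+)"', subst, rule) for rule in template_rules]
```
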
<br />
===Special cases===<br />
====Matching a capitalized word====<br />
<br />
Below, the noun "audiència" will be usually translated as "audience", but if it is written as "Audiència", "cour# d'assises" (i.e. <nowiki>cour<g><b/>d'assises</g></nowiki>) will be elected:<br />
<br />
<pre><br />
<rule weight="0.8"><br />
<match lemma="audiència" tags="n.*"><select lemma="audience"/></match><br />
</rule><br />
<rule weight="1.0"><br />
<match lemma="Audiència" tags="n.*"><select lemma="cour# d'assises"/></match><br />
</rule><br />
</pre><br />
<br />
====Matching an unknown word====<br />
<br />
Below, the noun "mossèn" will be usually translated as "curé", but if it is followed by an anthroponym (rule 2) or an unknown word (rule 3), "monseigneur" will be elected:<br />
<br />
<pre><br />
<rule weight="0.8"><br />
<match lemma="mossèn" tags="n.*"><select lemma="curé"/></match><br />
</rule><br />
<rule weight="1.0"><br />
<match lemma="mossèn" tags="n.m.sg"><select lemma="monseigneur"/></match><br />
<match tags="np.ant.*"/><br />
</rule><br />
<rule weight="1.0"><br />
<match lemma="mossèn" tags="n.m.sg"><select lemma="monseigneur"/></match><br />
<match tags=""/><br />
</rule><br />
</pre><br />
<br />
The last rule can be improved by specifying that the unknown word should be capitalized:<br />
<br />
<pre><br />
<rule weight="1.0"><br />
<match lemma="mossèn" tags="n.m.sg"><select lemma="monseigneur"/></match><br />
<match tags="" case="Aa"/><br />
</rule><br />
</pre><br />
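The <code>case="Aa"</code> attribute matches a surface form whose first letter is uppercase and the rest lowercase. A quick Python approximation of that check (assuming this reading of the attribute):<br />

```python
def matches_case_aa(surface):
    """Approximate the case="Aa" pattern: initial capital, rest lowercase.

    Assumption: "Aa" means exactly one leading uppercase letter; other
    case patterns (e.g. all-caps) would be checked analogously.
    """
    return surface[:1].isupper() and surface[1:].islower()
```
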
<br />
=== Some more new stuff ===<br />
* [https://github.com/apertium/apertium-lex-tools/commit/8f5493f28d944c9c1591e3449f6df8c4718ab2c6 contains]<br />
* [https://github.com/apertium/apertium-lex-tools/commit/371b3740e74c27a2e6ff4278ce9286ee9e0b2319 suffix]<br />
* [https://github.com/apertium/apertium-lex-tools/pull/92 <code>glob="star"</code>]<br />
<br />
==Writing and generating rules==<br />
<br />
===Writing===<br />
{{main|How to get started with lexical selection rules}}<br />
A good way to start writing lexical selection rules is to take a corpus and search for the problem word; you can then look at how the word should be translated and at the contexts it appears in.<br />
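For example, a throwaway script along these lines (assuming a plain-text corpus, one sentence per line) can pull out the contexts of a problem word for inspection:<br />

```python
def contexts(lines, word, window=4):
    """Collect windows of up to `window` tokens around each hit of `word`."""
    hits = []
    for line in lines:
        toks = line.split()
        for i, tok in enumerate(toks):
            if tok.lower() == word:
                hits.append(" ".join(toks[max(0, i - window):i + window + 1]))
    return hits
```
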
<br />
===Generating===<br />
<br />
;Parallel corpus<br />
{{main|Learning rules from parallel and non-parallel corpora}}<br />
<br />
;Monolingual corpora<br />
<br />
{{main|Running_the_monolingual_rule_learning}}<br />
<br />
==Todo and bugs==<br />
<br />
* <s>xml compiler</s><br />
* <s>compile rule operation patterns, as well as matching patterns</s><br />
* <s>make rules with gaps work</s><br />
* <s>optimal coverage</s><br />
* <s>fix bug with processing multiple sentences</s><br />
* <s>instead of having regex OR, insert separate paths/states.</s><br />
* <s>optimise the bestPath function (don't use strings to store the paths)</s><br />
* <s>autotoolsise build</s><br />
* <s>add option to compiler to spit out ATT transducers</s><br />
* <s>fix bug with outputting an extra '\n' at the end</s><br />
* <s>edit <code>transfer.cc</code> to allow input from <code>lt-proc -b</code></s><br />
* profiling and speed up<br />
** <s>why do the regex transducers have to be minimised?</s><br />
** <s>retrieve vector of strings corresponding to paths, instead of a single string corresponding to all of the paths</s><br />
** <s>stop using string processing to retrieve rule numbers</s><br />
** <s>retrieve vector of vectors of words, not string of words from lttoolbox</s><br />
** why does the performance drop substantially with more rules?<br />
** <s>add a pattern -> first letter map so we don't have to call recognise() with every transition</s> (didn't work so well)<br />
* <s>there is a problem with the regex recognition code: see bug1 in <code>testing</code>.</s><br />
* <s>there is a problem with two defaults next to each other; bug2 in <code>testing</code>.</s><br />
* <s>default to case insensitive ? (perhaps case insensitive for lower case, case sensitive for uppercase) -- see bug4 in <code>testing/</code>.</s><br />
* make sure that <code>-b</code> works with <code>-n</code> too.<br />
* testing<br />
* null flush<br />
* add option to processor to spit out ATT transducers<br />
* use brown clusters to merge rules with the same context, or remove parts of context from rules which are not relevant?<br />
* https://sourceforge.net/p/apertium/tickets/64/ <code><match tags="n.*"></match></code> never matches, while <code><match tags="n.*"/></code> does<br />
<br />
; Performance<br />
<br />
* 2011-12-12: 10,000 words / 97 seconds = 103 words/sec (71290 words, 14.84 sec = 4803 words/sec)<br />
* 2011-12-19: 10,000 words / 4 seconds = 2,035 words/sec (71290 words, 8 secs = 8911 words/sec)<br />
<br />
==Preparedness of language pairs==<br />
<br />
{|class="wikitable"<br />
! Pair !! LR (L) !! LR (L→R) !! Fertility !! Rules <br />
|-<br />
| <code>apertium-is-en</code> || 18,563 || 22,220 || 1.19 || 115 <br />
|-<br />
| <code>apertium-es-fr</code> || || || || <br />
|-<br />
| <code>apertium-eu-es</code> || 16,946 || 18,550 || 1.09 || 250<br />
|-<br />
| <code>apertium-eu-en</code> || || || || <br />
|-<br />
| <code>apertium-br-fr</code> || 20,489 || 20,770 || 1.01 || 256 <br />
|-<br />
| <code>apertium-mk-en</code> || 8,568 || 10,624 || 1.24 || 81<br />
|-<br />
| <code>apertium-es-pt</code> || || || || <br />
|-<br />
| <code>apertium-es-it</code> || || || || <br />
|-<br />
| <code>apertium-es-ro</code> || || || || <br />
|-<br />
| <code>apertium-en-es</code> || 267,469 || 268,522 || 1.003 || 334 <br />
|-<br />
| <code>apertium-en-ca</code> || || || || <br />
|-<br />
|}<br />
<br />
<br />
===Troubleshooting===<br />
If you get the message <code>lrx-comp: error while loading shared libraries: libapertium3-3.2.so.0: cannot open shared object file: No such file or directory</code>, you may need to put this in your <code>~/.bashrc</code>:<br />
<pre><br />
export LD_LIBRARY_PATH="/usr/local/lib:$LD_LIBRARY_PATH"<br />
</pre><br />
Then open a new terminal before using <code>lrx-comp</code>/<code>lrx-proc</code>.<br />
<br />
On a 64-bit machine, building apertium-lex-tools with make may fail because zlib is missing, even though you have zlib1g-dev installed. If you get the error message <code>/usr/bin/ld: cannot find -lz</code>, install the package lib32z1-dev (which will pull in many other dependencies); even though it is a 32-bit package, it is needed to compile the sources.<br />
<br />
==See also==<br />
<br />
* [[How to get started with lexical selection rules]]<br />
* [https://svn.code.sf.net/p/apertium/svn/trunk/apertium-lex-tools/ SVN Module: apertium-lex-tools]<br />
<br />
==References==<br />
<br />
* Tyers, F. M., Sánchez-Martínez, F., Forcada, M. L. (2012) "[https://rua.ua.es/dspace/bitstream/10045/27581/1/tyers12a.pdf Flexible finite-state lexical selection for rule-based machine translation]". Proceedings of the 17th Annual Conference of the European Association of Machine Translation, EAMT12 <br />
* Tyers et al [https://rua.ua.es/dspace/bitstream/10045/35848/1/thesis_FrancisMTyers.pdf#page=62 Feasible lexical selection for rule-based machine translation]<br />
* Tyers et al [https://aclanthology.org/W15-4919.pdf Unsupervised training of maximum-entropy models for lexical selection in rule-based machine translation]<br />
<br />
[[Category:Lexical selection]]<br />
[[Category:Documentation in English]]</div>Firespeakerhttps://wiki.apertium.org/w/index.php?title=Wikipedia_Extractor&diff=74186Wikipedia Extractor2023-01-30T18:55:09Z<p>Firespeaker: </p>
<hr />
<div>{{TOCD}}<br />
== Goal ==<br />
<br />
This tool extracts the main text from XML [[Wikipedia dump]] files (at https://dumps.wikimedia.org/backup-index.html, ideally the "'''pages-articles.xml.bz2'''" file), producing a text corpus, which is useful for training unsupervised part-of-speech taggers, n-gram language models, etc.<br />
<br />
It was modified by a number of people, including by BenStobaugh during Google Code-In 2013, and can be cloned from GitHub at [https://github.com/apertium/WikiExtractor https://github.com/apertium/WikiExtractor].<br />
<br />
This version is much simpler than the old one: it automatically removes any formatting and outputs the text to a single file. To use it, simply run the following command in your terminal, where <code>dump.xml</code> is the Wikipedia dump:<br />
<br />
$ python3 WikiExtractor.py --infn dump.xml.bz2<br />
<br />
(Note: If you are on a Mac, make sure that <tt>--</tt> is really two hyphens and not an em-dash like this: &mdash;).<br />
<br />
This will run through all of the articles, get all of the text and put it in <code>wiki.txt</code>. This version also supports compression (BZip2 and Gzip), so you can use <code>dump.xml.bz2</code> or <code>dump.xml.gz</code> instead of <code>dump.xml</code>. You can also compress (BZip2) the output file by adding <code>--compress</code> to the command.<br />
<br />
You can also run <code>python3 WikiExtractor.py --help</code> to get more details.<br />
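Since the script accepts plain, BZip2-compressed or Gzip-compressed dumps, the extension-based dispatch can be sketched like this (an illustration of the behaviour described above, not the script's actual code):<br />

```python
import bz2
import gzip

def open_dump(path):
    """Open a dump file as text, picking a decompressor by extension."""
    if path.endswith(".bz2"):
        return bz2.open(path, "rt", encoding="utf-8")
    if path.endswith(".gz"):
        return gzip.open(path, "rt", encoding="utf-8")
    return open(path, encoding="utf-8")
```
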
<br />
== Steps ==<br />
<br />
Here's a simple step-by-step guide to the above.<br />
<br />
# Get the <code>WikiExtractor.py</code> script from https://github.com/apertium/WikiExtractor:<br />
#: <code>$ wget https://raw.githubusercontent.com/apertium/WikiExtractor/master/WikiExtractor.py</code><br />
# Download the Wikipedia dump for the language in question from http://dumps.wikimedia.org/backup-index.html. The below uses a hypothetical file name for a language with the <tt>xyz</tt> code:<br />
#: <code>$ wget https://dumps.wikimedia.org/xyzwiki/20230120/xyzwiki-20230120-pages-articles.xml.bz2</code><br />
# Run the script on the Wikipedia dump file:<br />
#: <code>$ python3 WikiExtractor.py --infn xyzwiki-20230120-pages-articles.xml.bz2 --compress</code><br />
<br />
This will output a file called <code>wiki.txt.bz2</code>. You will probably want to rename it to something like <code>xyz.wikipedia.20230120.txt.bz2</code>.<br />
<br />
==See also==<br />
* [[ Wikipedia dumps ]]<br />
<br />
<br />
[[Category:Resources]]<br />
[[Category:Development]]<br />
[[Category:Corpora]]<br />
[[Category:Documentation in English]]</div>Firespeakerhttps://wiki.apertium.org/w/index.php?title=Wikipedia_Extractor&diff=74185Wikipedia Extractor2023-01-30T18:53:22Z<p>Firespeaker: /* Steps */</p>
<hr />
<div>{{TOCD}}<br />
== Goal ==<br />
<br />
This tool extracts main text from xml [[Wikipedia dump]] files (at http://dumps.wikimedia.org/backup-index.html, ideally the "pages-articles" file), producing a text corpus, which is useful for training unsupervised part-of-speech taggers, n-gram language models, etc.<br />
<br />
It was modified by a number of people, including by BenStobaugh during Google Code-In 2013, and can be cloned from GitHub at [https://github.com/apertium/WikiExtractor https://github.com/apertium/WikiExtractor].<br />
<br />
This version is much simpler than the old version. This version auto-removes any formatting and only outputs the text to one file. To use it, simply use the following command in your terminal, where dump.xml is the Wikipedia dump<br />
<br />
$ python3 WikiExtractor.py --infn dump.xml.bz2<br />
<br />
(Note: If you are on a Mac, make sure that <tt>--</tt> is really two hyphens and not an em-dash like this: &mdash;).<br />
<br />
This will run through all of the articles, get all of the text and put it in <code>wiki.txt</code>. This version also supports compression (BZip2 and Gzip), so you can use <code>dump.xml.bz2</code> or <code>dump.xml.gz</code> instead of <code>dump.xml</code>. You can also compress (BZip2) the output file by adding <code>--compress</code> to the command.<br />
<br />
You can also run <code>python3 WikiExtractor.py --help</code> to get more details.<br />
<br />
== Steps ==<br />
<br />
Here's a simple step-by-step guide to the above.<br />
<br />
# Get the <code>WikiExtractor.py</code> script from https://github.com/apertium/WikiExtractor:<br />
#: <code>$ wget https://raw.githubusercontent.com/apertium/WikiExtractor/master/WikiExtractor.py</code><br />
# Download the Wikipedia dump for the language in question from http://dumps.wikimedia.org/backup-index.html. The below uses a hypothetical file name for a language with the <tt>xyz</tt> code:<br />
#: <code>$ wget https://dumps.wikimedia.org/xyzwiki/20230120/xyzwiki-20230120-pages-articles.xml.bz2</code><br />
# Run the script on the Wikipedia dump file:<br />
#: <code>$ python3 WikiExtractor.py --infn xyzwiki-20230120-pages-articles.xml.bz2 --compress</code><br />
<br />
This will output a file called <code>wiki.txt.bz2</code>. You will probably want to rename it to something like <code>xyz.wikipedia.20230120.txt.bz2</code>.<br />
<br />
==See also==<br />
* [[ Wikipedia dumps ]]<br />
<br />
<br />
[[Category:Resources]]<br />
[[Category:Development]]<br />
[[Category:Corpora]]<br />
[[Category:Documentation in English]]</div>Firespeakerhttps://wiki.apertium.org/w/index.php?title=Wikipedia_Extractor&diff=74184Wikipedia Extractor2023-01-30T18:53:06Z<p>Firespeaker: </p>
<hr />
<div>{{TOCD}}<br />
== Goal ==<br />
<br />
This tool extracts main text from xml [[Wikipedia dump]] files (at http://dumps.wikimedia.org/backup-index.html, ideally the "pages-articles" file), producing a text corpus, which is useful for training unsupervised part-of-speech taggers, n-gram language models, etc.<br />
<br />
It was modified by a number of people, including by BenStobaugh during Google Code-In 2013, and can be cloned from GitHub at [https://github.com/apertium/WikiExtractor https://github.com/apertium/WikiExtractor].<br />
<br />
This version is much simpler than the old version. This version auto-removes any formatting and only outputs the text to one file. To use it, simply use the following command in your terminal, where dump.xml is the Wikipedia dump<br />
<br />
$ python3 WikiExtractor.py --infn dump.xml.bz2<br />
<br />
(Note: If you are on a Mac, make sure that <tt>--</tt> is really two hyphens and not an em-dash like this: &mdash;).<br />
<br />
This will run through all of the articles, get all of the text and put it in <code>wiki.txt</code>. This version also supports compression (BZip2 and Gzip), so you can use <code>dump.xml.bz2</code> or <code>dump.xml.gz</code> instead of <code>dump.xml</code>. You can also compress (BZip2) the output file by adding <code>--compress</code> to the command.<br />
<br />
You can also run <code>python3 WikiExtractor.py --help</code> to get more details.<br />
<br />
== Steps ==<br />
<br />
Here's a simple step-by-step guide to the above.<br />
<br />
# Get the <code>WikiExtractor.py</code> script from https://github.com/apertium/WikiExtractor:<br />
#: <code>$ wget https://raw.githubusercontent.com/apertium/WikiExtractor/master/WikiExtractor.py</code><br />
# Download the Wikipedia dump for the language in question from http://dumps.wikimedia.org/backup-index.html. The below uses a hypothetical file name for a language with the <tt>xyz</tt> code:<br />
#: <code>$ wget https://dumps.wikimedia.org/xyzwiki/20230120/xyzwiki-20230120-pages-articles.xml.bz2</code><br />
# Run the script on the Wikipedia dump file:<br />
#: <code>$ python3 WikiExtractor.py --infn xyzwiki-20230120-pages-articles.xml.bz2 --compress</code><br />
<br />
This will output a file called <code>wiki.txt.bz2</code>. You will probably want to rename it to something like <code>xyz.wikipedia.20210620.txt.bz2</code>.<br />
<br />
==See also==<br />
* [[ Wikipedia dumps ]]<br />
<br />
<br />
[[Category:Resources]]<br />
[[Category:Development]]<br />
[[Category:Corpora]]<br />
[[Category:Documentation in English]]</div>Firespeakerhttps://wiki.apertium.org/w/index.php?title=Wikipedia_Extractor&diff=74183Wikipedia Extractor2023-01-30T18:52:42Z<p>Firespeaker: </p>
<hr />
<div>{{TOCD}}<br />
== Goal ==<br />
<br />
This tool extracts main text from xml [[Wikipedia dump]] files (at http://dumps.wikimedia.org/backup-index.html, ideally the "pages-articles" file), producing a text corpus, which is useful for training unsupervised part-of-speech taggers, n-gram language models, etc.<br />
<br />
It was modified by a number of people, including by BenStobaugh during Google Code-In 2013, and can be cloned from GitHub at [https://github.com/apertium/WikiExtractor https://github.com/apertium/WikiExtractor].<br />
<br />
This version is much simpler than the old version. This version auto-removes any formatting and only outputs the text to one file. To use it, simply use the following command in your terminal, where dump.xml is the Wikipedia dump<br />
<br />
$ python3 WikiExtractor.py --infn dump.xml.bz2<br />
<br />
(Note: If you are on a Mac, make sure that <tt>--</tt> is really two hyphens and not an em-dash like this: &mdash;).<br />
<br />
This will run through all of the articles, get all of the text and put it in <code>wiki.txt</code>. This version also supports compression (BZip2 and Gzip), so you can use <code>dump.xml.bz2</code> or <code>dump.xml.gz</code> instead of <code>dump.xml</code>. You can also compress (BZip2) the output file by adding <code>--compress</code> to the command.<br />
<br />
You can also run <code>python3 WikiExtractor.py --help</code> to get more details.<br />
<br />
== Steps ==<br />
<br />
Here's a simple step-by-step guide to the above.<br />
<br />
# Get the <code>WikiExtractor.py</code> script from https://github.com/apertium/WikiExtractor:<br />
#: <code>$ wget https://raw.githubusercontent.com/apertium/WikiExtractor/master/WikiExtractor.py</code><br />
# Download the Wikipedia dump for the language in question from http://dumps.wikimedia.org/backup-index.html. The below uses a hypothetical file name for a language with the <tt>xyz</tt> code:<br />
#: <code>$ wget https://dumps.wikimedia.org/xyzwiki/20210620/xyzwiki-20230120-pages-articles.xml.bz2</code><br />
# Run the script on the Wikipedia dump file:<br />
#: <code>$ python3 WikiExtractor.py --infn xyzwiki-20210620-pages-articles.xml.bz2 --compress</code><br />
<br />
This will output a file called <code>wiki.txt.bz2</code>. You will probably want to rename it to something like <code>xyz.wikipedia.20210620.txt.bz2</code>.<br />
<br />
==See also==<br />
* [[ Wikipedia dumps ]]<br />
<br />
<br />
[[Category:Resources]]<br />
[[Category:Development]]<br />
[[Category:Corpora]]<br />
[[Category:Documentation in English]]</div>Firespeakerhttps://wiki.apertium.org/w/index.php?title=Google_Summer_of_Code/Application_2023&diff=74180Google Summer of Code/Application 20232023-01-28T23:17:33Z<p>Firespeaker: /* Mentors */</p>
<hr />
<div>== Register org ==<br />
<br />
=== Years previously participated in GSoC ===<br />
2021, 2020, 2019, 2018, 2017, 2016, 2014, 2013, 2012, 2011, 2010, 2009<br />
<br />
== Org Profile ==<br />
<br />
=== Website URL ===<br />
[http://wiki.apertium.org http://wiki.apertium.org]<br />
<br />
=== Logo ===<br />
[https://upload.wikimedia.org/wikipedia/commons/thumb/b/b4/Apertium_logo.svg/1214px-Apertium_logo.svg.png https://upload.wikimedia.org/wikipedia/commons/thumb/b/b4/Apertium_logo.svg/1214px-Apertium_logo.svg.png]<br />
<br />
=== Tagline ===<br />
A free/open-source machine translation platform<br />
<br />
=== Primary Open Source License ===<br />
GNU General Public License version 3<br />
<br />
=== Year organisation started ===<br />
<br />
2006 (???)<br />
<br />
=== Link to source code ===<br />
<br />
https://github.com/apertium/<br />
<br />
=== Organisation categories ===<br />
<br />
* Science and medicine (healthcare, biotech, life sciences, academic research, etc.)<br />
* Other<br />
<br />
=== Organisation technologies ===<br />
C++, python, bash, XML, javascript <br />
<br />
=== Organisation topics ===<br />
machine translation, natural language processing, less-resourced languages, language technology<br />
<br />
=== Organisation description ===<br />
<br />
Apertium is a free/open-source machine translation platform, and the organisation focuses on primarily symbolic language technology for less-resourced languages.<br />
<br />
=== Contributor guidance ===<br />
<br />
https://wiki.apertium.org/wiki/Top_tips_for_GSOC_applications<br />
<br />
=== Communication Methods ===<br />
<br />
* Chat: https://wiki.apertium.org/wiki/IRC<br />
* Mailing List / Forum: apertium-stuff@lists.sourceforge.net<br />
<br />
== Organisation questionnaire ==<br />
=== Why does your org want to participate in Google Summer of Code? ===<br />
Apertium has been part of GSoC for over a decade and it has been a great experience. Apertium loves GSoC: it supports free/open-source (FOS) software as much as we do! Apertium needs GSoC: it offers an incredible opportunity (and resources!) allowing us to spread the word about our project, to attract new developers and consolidate the contribution of existing developers through mentoring, and to improve the platform in many ways: improving the engine, generating new tools and user interfaces, making Apertium available to other applications, improving the quality of the languages currently supported, adding new languages to it. Apertium loves less-resourced languages and GSoC gives an opportunity for developers speaking them to generate FOS language technologies for them. Apertium will gain: more developers getting to know FOS software and the ethos that comes with it, contributing to it, and especially contributors who are passionate about languages and computers.<br />
<br />
<br />
=== What would your org consider to be a successful GSoC program? ===<br />
<br />
<!-- New contributors, new features completed, more code written, better being able to guide new developers into open source world, etc. --><br />
<br />
A successful GSoC would see any combination of newly released language pairs, the addition of new technologies to the Apertium framework, the addition of features to our web infrastructure, and a fresh round of developers becoming excited by Apertium. We would especially be happy to see a successful project form the basis of a published academic paper and to gain new long-term contributors.<br />
<br />
=== How will you keep mentors engaged with their GSoC contributors? ===<br />
We select our mentors from among very active developers, with long-term commitment to this 18-year-old project — they are people we know well and whom we have met face-to-face at conferences, workshops, or even in daily life; some of them teach and do research at universities or work at companies using Apertium. For this reason, it is quite unlikely for mentors to disappear, since most of them have been embedded in our community for years. However, there is always the possibility that some problem comes up, so we also assign back-up mentors to all contributors, in many cases more than one back-up. If a mentor cannot continue for whatever reason, one of the backup co-mentors will take over, and one of the organisation administrators (themselves experienced GSoC mentors) will take on the role of second backup mentor. <br />
<br />
=== How will you keep your GSoC contributors on schedule to complete their projects? ===<br />
<br />
Apertium only accepts applications with a well-defined weekly schedule, clear milestones and deliverables, and, if possible, a section on risk management (risks, their probability, their severity, & mitigating actions). Applications should also plan for holidays, exams, and other absences. Contributors will be encouraged to let us know if they need to reschedule or take a break. Contributors may also need consultation when they are stuck or when personal matters interfere with their work: we will, as we have in the past, try our best to reach out to them, be open and friendly, and provide as much support as we can to help them out. We've been in situations like this too! Detailed scheduling will avoid both mentors and contributors wasting time. If a mentor reports the unscheduled disappearance of a contributor (an unexpected 72-hour silence), the contributor will be contacted by the administrators. If silence persists, their task will be frozen and we will report to Google, to proceed according to the rules of GSoC.<br />
<br />
=== How will you get your GSoC contributors involved in your community during GSoC? ===<br />
<br />
First, we encourage all prospective contributors to visit our IRC channel (irc.oftc.net#apertium) as often as possible, even before the start of the program, since that will help them find a suitable mentor and a useful project that they can work on. We advise them strongly to read our wiki pages and manuals, use our system, try to break it and fix it, and finally tell us about it. As a result, contributors get familiar with Apertium before the coding period starts, which increases their chances of ending up with a successful project. In addition, we define coding challenges for each of the proposed projects, which serve both as an entry task, and as a means for getting our contributors familiar with Apertium and involved in our community in the early stages of the program. Finally, during the coding stage, we are available to talk to our contributors on a daily basis and give them suggestions and advice when they get stuck.<br />
<br />
=== Anything else we should know? (optional) ===<br />
<br />
<br />
=== Is your organization part of any government? ===<br />
No<br />
<br />
== Program Application ==<br />
<br />
=== Ideas list ===<br />
<br />
https://wiki.apertium.org/wiki/Ideas_for_Google_Summer_of_Code<br />
<br />
=== Mentors ===<br />
(How many Mentors does your Organization have available to participate in this program?)<br />
<br />
* Daniel<br />
* Jonathan<br />
(add your names here!)<br />
<br />
=== Program Retention Survey ===<br />
<br />
(We're looking for more details on how many of your students/GSoC contributors from the above program are still active in your community today.)<br />
<br />
* Number of accepted students/contributors: 10<br />
* Number of those participants that are still active today: 1</div>Firespeakerhttps://wiki.apertium.org/w/index.php?title=Ideas_for_Google_Summer_of_Code&diff=74179Ideas for Google Summer of Code2023-01-28T21:16:57Z<p>Firespeaker: </p>
<hr />
<div>{{TOCD}}<br />
This is the ideas page for [[Google Summer of Code]], here you can find ideas on interesting projects that would make Apertium more useful for people and improve or expand our functionality.<br />
<br />
'''Current Apertium contributors''': If you have an idea please add it below, if you think you could mentor someone in a particular area, add your name to "Interested mentors" using <code><nowiki>~~~</nowiki></code>.<br />
<br />
'''Prospective GSoC contributors''': The page is intended as an overview of the kind of projects we have in mind. If one of them particularly piques your interest, please come and discuss with us on <code>#apertium</code> on <code>irc.oftc.net</code> ([[IRC|more on IRC]]), mail the [[Contact|mailing list]], or draw attention to yourself in some other way. <br />
<br />
Note that if you have an idea that isn't mentioned here, we would be very interested to hear about it.<br />
<br />
Here are some more things you could look at:<br />
<br />
* [[Top tips for GSOC applications]] <br />
* Get in contact with one of our long-serving [[List of Apertium mentors|mentors]] &mdash; they are nice, honest!<br />
* Pages in the [[:Category:Development|development category]]<br />
* Resources that could be converted or expanded in the [[incubator]]. Consider doing or improving a language pair (see [[incubator]], [[nursery]] and [[staging]] for pairs that need work)<br />
* Unhammer's [[User:Unhammer/wishlist|wishlist]]<br />
<!--* The open issues [https://github.com/search?q=org%3Aapertium&state=open&type=Issues on Github] - especially the [https://github.com/search?q=org%3Aapertium+label%3A%22good+first+issue%22&state=open&type=Issues Good First Issues]. --><br />
<br />
__TOC__<br />
<br />
If you're a prospective GSoC contributor trying to propose a topic, the recommended way is to request a wiki account and then go to <pre>http://wiki.apertium.org/wiki/User:[[your username]]/GSoC2023Proposal</pre> and click the "create" button near the top of the page. It's also nice to include <code><nowiki>[[</nowiki>[[:Category:GSoC_2023_student_proposals|Category:GSoC_2023_student_proposals]]<nowiki>]]</nowiki></code> to help organize submitted proposals.<br />
<br />
== Language Data ==<br />
<br />
Can you read or write a language other than English (and we do mean any language)? If so, you can help with one of these and we can help you figure out the technical parts.<br />
<br />
{{IdeaSummary<br />
| name = Develop a morphological analyser<br />
| difficulty = easy<br />
| length = either<br />
| skills = XML or HFST or lexd<br />
| description = Write a morphological analyser and generator for a language that does not yet have one<br />
| rationale = A key part of an Apertium machine translation system is a morphological analyser and generator. The objective of this task is to create an analyser for a language that does not yet have one.<br />
| mentors = [[User:Francis Tyers|Francis Tyers]], [[User:Firespeaker|Jonathan Washington]], [[User: Sevilay Bayatlı|Sevilay Bayatlı]], Hossep, nlhowell, [[User:Popcorndude]]<br />
| more = /Morphological analyser<br />
}}<br />
<br />
{{IdeaSummary<br />
| name = apertium-separable language-pair integration<br />
| difficulty = Medium<br />
| length = short<br />
| skills = XML, a scripting language (Python, Perl), some knowledge of linguistics and/or at least one relevant natural language<br />
| description = Choose a language you can identify as having a good number of "multiwords" in the lexicon. Modify all language pairs in Apertium to use the [[Apertium-separable]] module to process the multiwords, and clean up the dictionaries accordingly.<br />
| rationale = Apertium-separable is a newish module to process lexical items with discontiguous dependencies, an area where Apertium has traditionally fallen short. Despite all the module has to offer, many translation pairs still don't use it.<br />
| mentors = [[User:Firespeaker|Jonathan Washington]], [[User:Popcorndude]]<br />
| more = /Apertium separable<br />
}}<br />
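<br />
The kind of "discontiguous dependency" at stake can be sketched in a few lines of Python. This toy is not the [[Apertium-separable]] module (which operates on the Apertium stream using its own lsx dictionaries); it only shows the join the module performs.<br />
<br />
```python
# Toy illustration of the problem Apertium-separable addresses: a
# lexical item like "wake up" may be split by intervening words, so it
# must be matched discontiguously and rewritten as one unit.

def join_separable(tokens, first, particle, joined):
    """Rewrite 'first ... particle' (within a small window) as 'joined'."""
    out = []
    i = 0
    while i < len(tokens):
        if tokens[i] == first and particle in tokens[i + 1:i + 4]:
            j = tokens.index(particle, i + 1, i + 4)
            out.append(joined)
            out.extend(tokens[i + 1:j])  # keep the intervening words
            i = j + 1
        else:
            out.append(tokens[i])
            i += 1
    return out
```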
<br />
{{IdeaSummary<br />
| name = Bring an unreleased translation pair to releasable quality<br />
| difficulty = Medium<br />
| length = long<br />
| skills = shell scripting<br />
| description = Take an unstable language pair and improve its quality, focusing on testvoc<br />
| rationale = Many Apertium language pairs have large dictionaries and have otherwise seen much development, but are not of releasable quality. The point of this project would be to bring one translation pair to releasable quality. This would entail obtaining good naïve coverage and a clean [[testvoc]].<br />
| mentors = [[User:Firespeaker|Jonathan Washington]], [[User:Sevilay Bayatlı|Sevilay Bayatlı]], [[User:Unhammer]], [[User:hectoralos|Hèctor Alòs i Font]]<br />
| more = /Make a language pair state-of-the-art<br />
}}<br />
<br />
{{IdeaSummary<br />
| name = Develop a prototype MT system for a strategic language pair<br />
| difficulty = Medium<br />
| length = long<br />
| skills = XML, some knowledge of linguistics and of one relevant natural language <br />
| description = Create a translation pair based on two existing language modules, focusing on the dictionary and structural transfer<br />
| rationale = Choose a strategic set of languages to develop an MT system for, such that you know the target language well and morphological transducers for each language are part of Apertium. Develop an Apertium MT system by focusing on writing a bilingual dictionary and structural transfer rules. Expanding the transducers and disambiguation, and writing lexical selection rules and multiword sequences may also be part of the work. The pair may be an existing prototype, but if it's a heavily developed but unreleased pair, consider applying for "Bring an unreleased translation pair to releasable quality" instead.<br />
| mentors = [[User:Firespeaker|Jonathan Washington]], [[User:Sevilay Bayatlı| Sevilay Bayatlı]], [[User:Unhammer]], [[User:hectoralos|Hèctor Alòs i Font]]<br />
| more = /Adopt a language pair<br />
}}<br />
<br />
{{IdeaSummary<br />
| name = Add a new variety to an existing language<br />
| difficulty = easy<br />
| length = either<br />
| skills = XML, some knowledge of linguistics and of one relevant natural language <br />
| description = Add a language variety to one or more released pairs, focusing on the dictionary and lexical selection<br />
| rationale = Take a released language, and define a new language variety for it: e.g. Quebec French or Provençal Occitan. Then add the new variety to one or more released language pairs, without diminishing the quality of the pre-existing variety(ies). The objective is to facilitate the generation of varieties for languages with a weak standardisation and/or pluricentric languages.<br />
| mentors = [[User:hectoralos|Hèctor Alòs i Font]], [[User:Firespeaker|Jonathan Washington]]<br />
| more = /Add a new variety to an existing language<br />
}}<br />
<br />
{{IdeaSummary<br />
| name = Leverage and integrate language preferences into language pairs<br />
| difficulty = easy<br />
| length = short<br />
| skills = XML, some knowledge of linguistics and of one relevant natural language <br />
| description = Update language pairs with lexical and orthographical variations to leverage the new [[Dialectal_or_standard_variation|preferences]] functionality<br />
| rationale = Currently, preferences are implemented via language variant, which relies on multiple dictionaries, increasing compilation time exponentially every time a new preference gets introduced.<br />
| mentors = [[User:Xavivars|Xavi Ivars]] [[User:Unhammer]]<br />
| more = /Use preferences in SPA-CAT<br />
}}<br />
<br />
{{IdeaSummary<br />
| name = Add Capitalization Handling Module to a Language Pair<br />
| difficulty = easy<br />
| length = short<br />
| skills = XML, knowledge of some relevant natural language<br />
| description = Update a language pair to make use of the new [[Capitalization_restoration|Capitalization handling module]]<br />
| rationale = Correcting capitalization via transfer rules is tedious and error prone<br />
| mentors = [[User:Popcorndude]]<br />
| more = /Capitalization<br />
}}<br />
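<br />
As a rough illustration of what is involved (this toy is not the actual module), transfer rules traditionally copy the casing pattern of a source word onto its translation by hand, roughly like this:<br />
<br />
```python
# Toy sketch of capitalisation restoration: copy the casing pattern
# of the source word onto the translated word. The real module works
# on the Apertium stream; this only shows the core heuristic.

def restore_case(source, target):
    if source.isupper():
        return target.upper()
    if source[:1].isupper():
        return target[:1].upper() + target[1:]
    return target
```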
<br />
== Data Extraction ==<br />
<br />
A lot of the language data we need to make our analyzers and translators work already exists in other forms and we just need to figure out how to convert it. If you know of another source of data that isn't listed, we'd love to hear about it.<br />
<br />
{{IdeaSummary<br />
| name = dictionary induction from wikis<br />
| difficulty = Medium<br />
| length = either<br />
| skills = MySQL, mediawiki syntax, perl, maybe C++ or Java; Java, Scala, RDF, and DBpedia to use DBpedia extraction<br />
| description = Extract dictionaries from linguistic wikis<br />
| rationale = Wiki dictionaries and encyclopedias (e.g. omegawiki, wiktionary, wikipedia, dbpedia) contain information (e.g. bilingual equivalences, morphological features, conjugations) that could be exploited to speed up the development of dictionaries for Apertium. This task aims at automatically building dictionaries by extracting different pieces of information from wiki structures such as interlingual links, infoboxes and/or from dbpedia RDF datasets.<br />
| mentors = [[User:Firespeaker|Jonathan Washington]], [[User:Popcorndude]]<br />
| more = /Dictionary induction from wikis<br />
}}<br />
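<br />
As a hypothetical sketch of the extraction step, the snippet below pulls (headword, language, translation) triples from Wiktionary-style wikitext. It only handles the common <code><nowiki>{{t|...}}</nowiki></code> / <code><nowiki>{{t+|...}}</nowiki></code> translation templates; real entries are far messier.<br />
<br />
```python
# Hypothetical sketch: mine translation pairs from Wiktionary-style
# wikitext. Only the {{t|lang|word}} and {{t+|lang|word}} translation
# templates are handled here.
import re

T_TEMPLATE = re.compile(r'\{\{t\+?\|([a-z]{2,3})\|([^|}]+)')

def extract_translations(wikitext, headword):
    """Return (headword, lang, translation) triples from one entry."""
    return [(headword, lang, word)
            for lang, word in T_TEMPLATE.findall(wikitext)]
```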
<br />
{{IdeaSummary<br />
| name = Dictionary induction from parallel corpora / Revive ReTraTos<br />
| difficulty = Hard<br />
| length = either<br />
| skills = C++, perl, python, xml, scripting, machine learning<br />
| description = Extract dictionaries from parallel corpora<br />
| rationale = Given a pair of monolingual modules and a parallel corpus, we should be able to run a program to align tagged sentences and give us the best entries that are missing from bidix. [[ReTraTos]] did this back in 2008, but it's showing its age. We want a program which builds and runs in 2022, and does all the steps for the user.<br />
| mentors = [[User:Unhammer]], [[User:Popcorndude]]<br />
| more = /Dictionary induction from parallel corpora<br />
}}<br />
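<br />
The core idea can be sketched in a few lines: count which target lemmas co-occur with each source lemma across aligned sentence pairs and keep the strongest pairing. A real tool would use proper word alignment (e.g. IBM-model-style aligners) over tagged lemmas, not whitespace tokens; this toy only shows the shape of the problem.<br />
<br />
```python
# Toy sketch of bidix-candidate extraction from a sentence-aligned
# corpus: co-occurrence counts, then the argmax target per source lemma.
from collections import Counter

def bidix_candidates(aligned_pairs):
    cooc = Counter()
    for src_sent, trg_sent in aligned_pairs:
        for s in src_sent.split():
            for t in trg_sent.split():
                cooc[(s, t)] += 1
    best = {}
    for (s, t), n in cooc.items():
        if n > best.get(s, ('', 0))[1]:
            best[s] = (t, n)
    return {s: t for s, (t, n) in best.items()}
```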
<br />
{{IdeaSummary<br />
| name = Extract morphological data from FLEx<br />
| difficulty = hard<br />
| length = long<br />
| skills = python, XML parsing<br />
| description = Write a program to extract data from [https://software.sil.org/fieldworks/ SIL FieldWorks] and convert as much as possible to monodix (and maybe bidix).<br />
| rationale = There's a lot of potentially useful data in FieldWorks files that might be enough to build a whole monodix for some languages but it's currently really hard to use<br />
| mentors = [[User:Popcorndude|Popcorndude]], [[User:TommiPirinen|Flammie]]<br />
| more = /FieldWorks_data_extraction<br />
}}<br />
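<br />
One common FieldWorks export format is LIFT XML. As a sketch (the element names below follow the LIFT schema as we understand it &mdash; verify them against a real export), lemma and part-of-speech pairs can be mined like this:<br />
<br />
```python
# Sketch of mining a FLEx LIFT export for lemma / part-of-speech
# pairs. Element names are assumed from the LIFT schema; a real
# converter would also handle senses, glosses, and variant forms.
import xml.etree.ElementTree as ET

def lift_entries(lift_xml):
    root = ET.fromstring(lift_xml)
    out = []
    for entry in root.iter('entry'):
        form = entry.find('./lexical-unit/form/text')
        gi = entry.find('./sense/grammatical-info')
        if form is not None:
            out.append((form.text, gi.get('value') if gi is not None else None))
    return out
```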
<br />
== Tooling ==<br />
<br />
These are projects for people who would be comfortable digging through our C++ codebases (you will be doing a lot of that).<br />
<br />
{{IdeaSummary<br />
| name = Python API for Apertium<br />
| difficulty = medium<br />
| length = either<br />
| skills = C++, Python<br />
| description = Update the Python API for Apertium to expose all Apertium modes and test with all major OSes<br />
| rationale = The current Python API misses out on a lot of functionality, like phonemicisation, segmentation, and transliteration, and doesn't work for some OSes <s>like Debian</s>.<br />
| mentors = [[User:Francis Tyers|Francis Tyers]]<br />
| more = /Python API<br />
}}<br />
<br />
{{IdeaSummary<br />
| name = Robust tokenisation in lttoolbox<br />
| difficulty = Medium<br />
| length = long<br />
| skills = C++, XML, Python<br />
| description = Improve the longest-match left-to-right tokenisation strategy in [[lttoolbox]] to be fully Unicode compliant.<br />
| rationale = One of the most frustrating things about working with Apertium on texts "in the wild" is the way that the tokenisation works. If a letter is not specified in the alphabet, it is dealt with as whitespace, so e.g. you get unknown words split in two so you can end up with stuff like ^G$ö^k$ı^rmak$ which is terrible for further processing. <br />
| mentors = [[User:Francis Tyers|Francis Tyers]], [[User:TommiPirinen|Flammie]]<br />
| more = /Robust tokenisation<br />
}}<br />
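<br />
The failure mode described above is easy to reproduce in a toy Python tokeniser. This is not lttoolbox; it only mimics the treat-unknown-letters-as-whitespace behaviour:<br />
<br />
```python
# Toy reproduction of the tokenisation bug: any character not in the
# "alphabet" acts as a delimiter, so an unknown word containing such
# characters gets split -- e.g. "Gökırmak" with an ASCII-only alphabet
# becomes G / k / rmak, i.e. the ^G$ö^k$ı^rmak$ split described above.

def alphabet_tokenise(text, alphabet):
    tokens, cur = [], ''
    for ch in text:
        if ch in alphabet:
            cur += ch
        else:
            if cur:
                tokens.append(cur)
            cur = ''
    if cur:
        tokens.append(cur)
    return tokens
```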
<br />
{{IdeaSummary<br />
| name = rule visualization tools<br />
| difficulty = Medium<br />
| length = long<br />
| skills = python? javascript? XML<br />
| description = make tools to help visualize the effect of various rules<br />
| rationale = TODO see https://github.com/Jakespringer/dapertium for an example<br />
| mentors = [[User:Firespeaker|Jonathan Washington]], [[User:Sevilay Bayatlı|Sevilay Bayatlı]], [[User:Popcorndude]]<br />
| more = /Visualization tools<br />
}}<br />
<br />
{{IdeaSummary<br />
| name = Extend Weighted transfer rules<br />
| difficulty = Medium<br />
| length = either<br />
| skills = C++, python<br />
| description = The weighted transfer module is already applied to the chunker transfer rules. The idea here is to extend that module so it also applies to interchunk and postchunk transfer rules. <br />
| rationale = As a resource see https://github.com/aboelhamd/Weighted-transfer-rules-module<br />
| mentors = [[User: Sevilay Bayatlı|Sevilay Bayatlı]]<br />
| more = /Make a module <br />
}}<br />
<br />
{{IdeaSummary<br />
| name = Automatic Error-Finder / Backpropagation<br />
| difficulty = Hard<br />
| length = long<br />
| skills = python?<br />
| description = Develop a tool to locate the approximate source of translation errors in the pipeline.<br />
| rationale = Being able to generate a list of probable error sources automatically makes it possible to prioritize issues by frequency, frees up developer time, and is a first step towards automated generation of better rules.<br />
| mentors = [[User:Popcorndude]]<br />
| more = /Backpropagation<br />
}}<br />
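<br />
One simple starting point, sketched below with hypothetical stage names: capture the output of each pipeline stage and report the first stage whose output already contains the offending token.<br />
<br />
```python
# Sketch of the error-localisation idea: walk the pipeline stages in
# order and return the first one whose output contains the bad token.
# Real localisation needs fuzzier matching than exact substring search.

def first_bad_stage(stage_outputs, bad_token):
    """stage_outputs: ordered list of (stage_name, output_text)."""
    for name, text in stage_outputs:
        if bad_token in text:
            return name
    return None
```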
<br />
{{IdeaSummary<br />
| name = Localization (l10n/i18n) of Apertium tools<br />
| difficulty = Medium<br />
| length = short<br />
| skills = C++<br />
| description = All our command line tools are currently hardcoded as English-only and it would be good if this were otherwise. [https://github.com/apertium/organisation/issues/28#issuecomment-803474833 Coding Challenge]<br />
| rationale = ...<br />
| mentors = [[User:Tino_Didriksen|Tino Didriksen]]<br />
| more = [https://github.com/apertium/organisation/issues/28 GitHub]<br />
}}<br />
<br />
{{IdeaSummary<br />
| name = Language Server Protocol<br />
| difficulty = medium<br />
| length = short<br />
| skills = any programming language<br />
| description = Build a [https://microsoft.github.io/language-server-protocol/ Language Server] for the various Apertium rule formats<br />
| rationale = We have some static analysis tools and syntax highlighters already and it would be great if we could combine and expand them to support more text editors.<br />
| mentors = [[User:Popcorndude]]<br />
| more = /Language Server Protocol<br />
}}<br />
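<br />
The base protocol itself is small: each JSON-RPC message is preceded by a Content-Length header. A minimal framing helper in Python (server logic, capabilities, and the actual Apertium analyses are all out of scope here):<br />
<br />
```python
# Minimal sketch of LSP base-protocol framing: a Content-Length header,
# a blank line, then the JSON-RPC payload.
import json

def frame_message(payload):
    body = json.dumps(payload)
    return f'Content-Length: {len(body.encode("utf-8"))}\r\n\r\n{body}'
```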
<br />
{{IdeaSummary<br />
| name = WASM Compilation<br />
| difficulty = hard<br />
| length = long<br />
| skills = C++, Javascript<br />
| description = Compile the pipeline modules to WASM and provide JS wrappers for them.<br />
| rationale = There are situations where it would be nice to be able to run the entire pipeline in the browser<br />
| mentors = [[User:Tino Didriksen|Tino Didriksen]]<br />
| more = /WASM<br />
}}<br />
<br />
== Web ==<br />
<br />
If you know Python and JavaScript, here are some ideas for improving our [https://apertium.org website]. Some of these should be fairly short, and it would be a good idea to talk to the mentors about doing a couple of them together.<br />
<br />
{{IdeaSummary<br />
| name = Web API extensions<br />
| difficulty = medium<br />
| length = short<br />
| skills = Python<br />
| description = Update the web API for Apertium to expose all Apertium modes <br />
| rationale = The current Web API misses out on a lot of functionality, like phonemicisation, segmentation, transliteration, and paradigm generation.<br />
| mentors = [[User:Francis Tyers|Francis Tyers]], [[User:Firespeaker|Jonathan Washington]], [[User:Xavivars|Xavi Ivars]]<br />
| more = /Apertium APY<br />
}}<br />
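<br />
For orientation, APy's existing <code>/translate</code> endpoint takes <code>langpair</code> and <code>q</code> parameters. The helper below only builds such a request URL so it runs offline; the host is a placeholder and 2737 is just APy's commonly used default port.<br />
<br />
```python
# Sketch of a client-side helper for APy's /translate endpoint.
# The base URL is an assumption; only the URL is constructed here.
from urllib.parse import urlencode

def apy_translate_url(base, src, trg, text):
    query = urlencode({'langpair': f'{src}|{trg}', 'q': text})
    return f'{base}/translate?{query}'
```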
<br />
{{IdeaSummary<br />
| name = Website Improvements: Misc<br />
| difficulty = Medium<br />
| length = short<br />
| skills = html, css, js, python<br />
| description = Improve elements of Apertium's web infrastructure<br />
| rationale = Apertium's website infrastructure [[Apertium-html-tools]] and its supporting API [[APy|Apertium APy]] have numerous open issues. This project would entail choosing a subset of open issues and features that could realistically be completed in the summer. You're encouraged to speak with the Apertium community to see which features and issues are the most pressing.<br />
| mentors = [[User:Firespeaker|Jonathan Washington]], [[User:Xavivars|Xavi Ivars]]<br />
| more = /Website improvements<br />
}}<br />
<br />
{{IdeaSummary<br />
| name = Website Improvements: Dictionary Lookup<br />
| difficulty = Medium<br />
| length = short<br />
| skills = html, css, js, python<br />
| description = Finish implementing dictionary lookup mode in Apertium's web infrastructure<br />
| rationale = Apertium's website infrastructure [[Apertium-html-tools]] and its supporting API [[APy|Apertium APy]] have numerous open issues, including half-completed features like dictionary lookup. This project would entail completing the dictionary lookup feature. Some additional features which would be good to work on include automatic reverse lookups (so that a user has a better understanding of the results), grammatical information (such as the gender of nouns or the conjugation paradigms of verbs), and information about MWEs. See [https://github.com/apertium/apertium-html-tools/issues/105 the open issue on GitHub].<br />
| mentors = [[User:Firespeaker|Jonathan Washington]], [[User:Xavivars|Xavi Ivars]], [[User:Popcorndude]]<br />
| more = /Website improvements<br />
}}<br />
<br />
{{IdeaSummary<br />
| name = Website Improvements: Spell checking<br />
| difficulty = Medium<br />
| length = short<br />
| skills = html, js, css, python<br />
| description = Add a spell-checking interface to Apertium's web tools<br />
| rationale = [[Apertium-html-tools]] has seen some prototypes for spell-checking interfaces (all in stale PRs and branches on GitHub), but none have ended up being quite ready to integrate into the tools. This project would entail polishing up or recreating an interface, and making sure [[APy]] has a mode that allows access to Apertium voikospell modules. The end result should be a slick, easy-to-use interface for proofing text, with intuitive underlining of text deemed to be misspelled and intuitive presentation and selection of alternatives. See [https://github.com/apertium/apertium-html-tools/issues/390 the open issue on GitHub].<br />
| mentors = [[User:Firespeaker|Jonathan Washington]], [[User:Xavivars|Xavi Ivars]]<br />
| more = /Spell checker web interface<br />
}}<br />
<br />
{{IdeaSummary<br />
| name = Website Improvements: Suggestions<br />
| difficulty = Medium<br />
| length = short<br />
| skills = html, css, js, python<br />
| description = Finish implementing a suggestions interface for Apertium's web infrastructure<br />
| rationale = Some work has been done to add a "suggestions" interface to Apertium's website infrastructure [[Apertium-html-tools]] and its supporting API [[APy|Apertium APy]], whereby users can suggest corrected translations. This project would entail finishing that feature. There are some related [https://github.com/apertium/apertium-html-tools/issues/55 issues] and [https://github.com/apertium/apertium-html-tools/pull/252 PRs] on GitHub.<br />
| mentors = [[User:Firespeaker|Jonathan Washington]], [[User:Xavivars|Xavi Ivars]]<br />
| more = /Website improvements<br />
}}<br />
<br />
{{IdeaSummary<br />
| name = Website Improvements: Orthography conversion interface<br />
| difficulty = Medium<br />
| length = short<br />
| skills = html, js, css, python<br />
| description = Add an orthography conversion interface to Apertium's web tools<br />
| rationale = Several Apertium language modules (like Kazakh, Kyrgyz, Crimean Tatar, and Hñähñu) have orthography conversion modes in their mode definition files. This project would be to expose those modes through [[APy|Apertium APy]] and provide a simple interface in [[Apertium-html-tools]] to use them.<br />
| mentors = [[User:Firespeaker|Jonathan Washington]], [[User:Xavivars|Xavi Ivars]]<br />
| more = /Website improvements<br />
}}<br />
<br />
{{IdeaSummary<br />
| name = Add support for NMT to web API<br />
| difficulty = Medium<br />
| length = long<br />
| skills = python, NMT<br />
| description = Add support for a popular NMT engine to Apertium's web API<br />
| rationale = Currently Apertium's web API, [[APy|Apertium APy]], supports only Apertium language modules. But the front end could just as easily interface with an API that supports trained NMT models. The point of the project is to add support for one popular NMT package (e.g., translateLocally/Bergamot, OpenNMT or JoeyNMT) to the APy.<br />
| mentors = [[User:Firespeaker|Jonathan Washington]], [[User:Xavivars|Xavi Ivars]]<br />
| more = <br />
}}<br />
<br />
== Integrations ==<br />
<br />
In addition to incorporating data from other projects, it would be nice if we could also make our data useful to them.<br />
<br />
{{IdeaSummary<br />
| name = OmniLingo and Apertium<br />
| difficulty = medium<br />
| length = either<br />
| skills = JS, Python<br />
| description = OmniLingo is a language learning system for practicing listening comprehension using Apertium data. There is a lot of text processing involved (for example tokenisation) that could be aided by Apertium tools. <br />
| rationale = <br />
| mentors = [[User:Francis Tyers|Francis Tyers]]<br />
| more = /OmniLingo<br />
}}<br />
<br />
{{IdeaSummary<br />
| name = Support for Enhanced Dependencies in UD Annotatrix<br />
| difficulty = medium<br />
| length = either<br />
| skills = NodeJS<br />
| description = UD Annotatrix is an annotation interface for Universal Dependencies, but does not yet support enhanced dependencies<br />
| rationale = <br />
| mentors = [[User:Francis Tyers|Francis Tyers]]<br />
| more = /Morphological analyser<br />
}}<br />
<br />
<!--<br />
This one was done, but could do with more work. Not sure if it's a full gsoc though?<br />
<br />
{{IdeaSummary<br />
| name = User-friendly lexical selection training<br />
| difficulty = Medium<br />
| skills = Python, C++, shell scripting<br />
| description = Make it so that training/inference of lexical selection rules is a more user-friendly process<br />
| rationale = Our lexical selection module allows for inferring rules from corpora and word alignments, but the procedure is currently a bit messy, with various scripts involved that require lots of manual tweaking, and many third party tools to be installed. The goal of this task is to make the procedure as user-friendly as possible, so that ideally only a simple config file would be needed, and a driver script would take care of the rest.<br />
| mentors = [[User:Unhammer|Unhammer]], [[User:Mlforcada|Mikel Forcada]]<br />
| more = /User-friendly lexical selection training<br />
}}<br />
--><br />
<br />
{{IdeaSummary<br />
| name = UD and Apertium integration<br />
| difficulty = Entry level<br />
| length = short<br />
| skills = python, javascript, HTML, (C++)<br />
| description = Create a range of tools for making Apertium compatible with Universal Dependencies<br />
| rationale = Universal Dependencies is a fast-growing project aimed at creating a unified annotation scheme for treebanks. This includes both part-of-speech and morphological features. Their annotated corpora could be extremely useful for Apertium for training models for translation. In addition, Apertium's rule-based morphological descriptions could be useful for software that relies on Universal Dependencies.<br />
| mentors = [[User:Francis Tyers]], [[User:Firespeaker| Jonathan Washington]], [[User:Popcorndude]]<br />
| more = /UD and Apertium integration <br />
}}<br />
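<br />
As one small, concrete piece of such integration, an Apertium lexical unit can be mapped onto the first six CoNLL-U columns. The tag mapping below is a tiny stand-in; a real converter needs a full Apertium-to-UD tag table.<br />
<br />
```python
# Sketch: convert one Apertium lexical unit, e.g. ^dogs/dog<n><pl>$,
# into the first six CoNLL-U columns (ID FORM LEMMA UPOS XPOS FEATS).
# TAG_MAP is illustrative only, not a complete Apertium->UD mapping.
import re

TAG_MAP = {'n': ('NOUN', None), 'pl': (None, 'Number=Plur'),
           'sg': (None, 'Number=Sing'), 'vblex': ('VERB', None)}

def lu_to_conllu(lu, index=1):
    m = re.match(r'\^([^/]+)/([^<$]+)((?:<[^>]+>)*)\$', lu)
    surface, lemma = m.group(1), m.group(2)
    tags = re.findall(r'<([^>]+)>', m.group(3))
    upos, feats = 'X', []
    for t in tags:
        pos, feat = TAG_MAP.get(t, (None, None))
        if pos:
            upos = pos
        if feat:
            feats.append(feat)
    return '\t'.join([str(index), surface, lemma, upos,
                      '_', '|'.join(feats) or '_'])
```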
<br />
[[Category:Development]]<br />
[[Category:Google Summer of Code]]</div>
<hr />
<div>{{TOCD}}<br />
This is the ideas page for [[Google Summer of Code]], here you can find ideas on interesting projects that would make Apertium more useful for people and improve or expand our functionality.<br />
<br />
'''Current Apertium contributors''': If you have an idea please add it below, if you think you could mentor someone in a particular area, add your name to "Interested mentors" using <code><nowiki>~~~</nowiki></code>.<br />
<br />
'''Prospective GSoC contributors''': The page is intended as an overview of the kind of projects we have in mind. If one of them particularly piques your interest, please come and discuss with us on <code>#apertium</code> on <code>irc.oftc.net</code> ([[IRC|more on IRC]]), mail the [[Contact|mailing list]], or draw attention to yourself in some other way. <br />
<br />
Note that if you have an idea that isn't mentioned here, we would be very interested to hear about it.<br />
<br />
Here are some more things you could look at:<br />
<br />
* [[Top tips for GSOC applications]] <br />
* Get in contact with one of our long-serving [[List of Apertium mentors|mentors]] &mdash; they are nice, honest!<br />
* Pages in the [[:Category:Development|development category]]<br />
* Resources that could be converted or expanded in the [[incubator]]. Consider doing or improving a language pair (see [[incubator]], [[nursery]] and [[staging]] for pairs that need work)<br />
* Unhammer's [[User:Unhammer/wishlist|wishlist]]<br />
<!--* The open issues [https://github.com/search?q=org%3Aapertium&state=open&type=Issues on Github] - especially the [https://github.com/search?q=org%3Aapertium+label%3A%22good+first+issue%22&state=open&type=Issues Good First Issues]. --><br />
<br />
__TOC__<br />
<br />
If you're a student trying to propose a topic, the recommended way is to request a wiki account and then go to <pre>http://wiki.apertium.org/wiki/User:[[your username]]/GSoC2023Proposal</pre> and click the "create" button near the top of the page. It's also nice to include <code><nowiki>[[</nowiki>[[:Category:GSoC_2023_student_proposals|Category:GSoC_2023_student_proposals]]<nowiki>]]</nowiki></code> to help organize submitted proposals.<br />
<br />
== Language Data ==<br />
<br />
Can you read or write a language other than English (and we do mean any language)? If so, you can help with one of these and we can help you figure out the technical parts.<br />
<br />
{{IdeaSummary<br />
| name = Develop a morphological analyser<br />
| difficulty = easy<br />
| length = either<br />
| skills = XML or HFST or lexd<br />
| description = Write a morphological analyser and generator for a language that does not yet have one<br />
| rationale = A key part of an Apertium machine translation system is a morphological analyser and generator. The objective of this task is to create an analyser for a language that does not yet have one.<br />
| mentors = [[User:Francis Tyers|Francis Tyers]], [[User:Firespeaker|Jonathan Washington]], [[User: Sevilay Bayatlı|Sevilay Bayatlı]], Hossep, nlhowell, [[User:Popcorndude]]<br />
| more = /Morphological analyser<br />
}}<br />
<br />
{{IdeaSummary<br />
| name = apertium-separable language-pair integration<br />
| difficulty = Medium<br />
| length = short<br />
| skills = XML, a scripting language (Python, Perl), some knowledge of linguistics and/or at least one relevant natural language<br />
| description = Choose a language you can identify as having a good number of "multiwords" in the lexicon. Modify all language pairs in Apertium to use the [[Apertium-separable]] module to process the multiwords, and clean up the dictionaries accordingly.<br />
| rationale = Apertium-separable is a newish module to process lexical items with discontinguous dependencies, an area where Apertium has traditionally fallen short. Despite all the module has to offer, many translation pairs still don't use it.<br />
| mentors = [[User:Firespeaker|Jonathan Washington]], [[User:Popcorndude]]<br />
| more = /Apertium separable<br />
}}<br />
<br />
{{IdeaSummary<br />
| name = Bring an unreleased translation pair to releasable quality<br />
| difficulty = Medium<br />
| length = long<br />
| skills = shell scripting<br />
| description = Take an unstable language pair and improve its quality, focusing on testvoc<br />
| rationale = Many Apertium language pairs have large dictionaries and have otherwise seen much development, but are not of releasable quality. The point of this project would be bring one translation pair to releasable quality. This would entail obtaining good naïve coverage and a clean [[testvoc]].<br />
| mentors = [[User:Firespeaker|Jonathan Washington]], [[User:Seviay Bayatlı|Sevilay Bayatlı]], [[User:Unhammer]], [[User:hectoralos|Hèctor Alòs i Font]]<br />
| more = /Make a language pair state-of-the-art<br />
}}<br />
<br />
{{IdeaSummary<br />
| name = Develop a prototype MT system for a strategic language pair<br />
| difficulty = Medium<br />
| length = long<br />
| skills = XML, some knowledge of linguistics and of one relevant natural language <br />
| description = Create a translation pair based on two existing language modules, focusing on the dictionary and structural transfer<br />
| rationale = Choose a strategic set of languages to develop an MT system for, such that you know the target language well and morphological transducers for each language are part of Apertium. Develop an Apertium MT system by focusing on writing a bilingual dictionary and structural transfer rules. Expanding the transducers and disambiguation, and writing lexical selection rules and multiword sequences may also be part of the work. The pair may be an existing prototype, but if it's a heavily developed but unreleased pair, consider applying for "Bring an unreleased translation pair to releasable quality" instead.<br />
| mentors = [[User:Firespeaker|Jonathan Washington]], [[User:Sevilay Bayatlı| Sevilay Bayatlı]], [[User:Unhammer]], [[User:hectoralos|Hèctor Alòs i Font]]<br />
| more = /Adopt a language pair<br />
}}<br />
<br />
{{IdeaSummary<br />
| name = Add a new variety to an existing language<br />
| difficulty = easy<br />
| length = either<br />
| skills = XML, some knowledge of linguistics and of one relevant natural language <br />
| description = Add a language variety to one or more released pairs, focusing on the dictionary and lexical selection<br />
| rationale = Take a released language, and define a new language variety for it: e.g. Quebec French or Provençal Occitan. Then add the new variety to one or more released language pairs, without diminishing the quality of the pre-existing variety(ies). The objective is to facilitate the generation of varieties for languages with a weak standardisation and/or pluricentric languages.<br />
| mentors = [[User:hectoralos|Hèctor Alòs i Font]], [[User:Firespeaker|Jonathan Washington]]<br />
| more = /Add a new variety to an existing language<br />
}}<br />
<br />
{{IdeaSummary<br />
| name = Leverage and integrate language preferences into language pairs<br />
| difficulty = easy<br />
| length = short<br />
| skills = XML, some knowledge of linguistics and of one relevant natural language <br />
| description = Update language pairs with lexical and orthographical variations to leverage the new [[Dialectal_or_standard_variation|preferences]] functionality<br />
| rationale = Currently, preferences are implemented via language variant, which relies on multiple dictionaries, increasing compilation time exponentially every time a new preference gets introduced.<br />
| mentors = [[User:Xavivars|Xavi Ivars]] [[User:Unhammer]]<br />
| more = /Use preferences in SPA-CAT<br />
}}<br />
<br />
{{IdeaSummary<br />
| name = Add Capitalization Handling Module to a Language Pair<br />
| difficulty = easy<br />
| length = short<br />
| skills = XML, knowledge of some relevant natural language<br />
| description = Update a language pair to make use make use of the new [[Capitalization_restoration|Capitalization handling module]]<br />
| rationale = Correcting capitalization via transfer rules is tedious and error prone<br />
| mentors = [[User:Popcorndude]]<br />
| more = /Capitalization<br />
}}<br />
<br />
== Data Extraction ==<br />
<br />
A lot of the language data we need to make our analyzers and translators work already exists in other forms and we just need to figure out how to convert it. If you know of another source of data that isn't listed, we'd love to hear about it.<br />
<br />
{{IdeaSummary<br />
| name = dictionary induction from wikis<br />
| difficulty = Medium<br />
| length = either<br />
| skills = MySQL, mediawiki syntax, perl, maybe C++ or Java; Java, Scala, RDF, and DBpedia to use DBpedia extraction<br />
| description = Extract dictionaries from linguistic wikis<br />
| rationale = Wiki dictionaries and encyclopedias (e.g. omegawiki, wiktionary, wikipedia, dbpedia) contain information (e.g. bilingual equivalences, morphological features, conjugations) that could be exploited to speed up the development of dictionaries for Apertium. This task aims at automatically building dictionaries by extracting different pieces of information from wiki structures such as interlingual links, infoboxes and/or from dbpedia RDF datasets.<br />
| mentors = [[User:Firespeaker|Jonathan Washington]], [[User:Popcorndude]]<br />
| more = /Dictionary induction from wikis<br />
}}<br />
<br />
{{IdeaSummary<br />
| name = Dictionary induction from parallel corpora / Revive ReTraTos<br />
| difficulty = Hard<br />
| length = either<br />
| skills = C++, perl, python, xml, scripting, machine learning<br />
| description = Extract dictionaries from parallel corpora<br />
| rationale = Given a pair of monolingual modules and a parallel corpus, we should be able to run a program to align tagged sentences and give us the best entries that are missing from bidix. [[ReTraTos]] (from 2008) did this back in 2008, but it's from 2008. We want a program which builds and runs in 2022, and does all the steps for the user.<br />
| mentors = [[User:Unhammer]], [[User:Popcorndude]]<br />
| more = /Dictionary induction from parallel corpora<br />
}}<br />
<br />
{{IdeaSummary<br />
| name = Extract morphological data from FLEx<br />
| difficulty = hard<br />
| length = long<br />
| skills = python, XML parsing<br />
| description = Write a program to extract data from [https://software.sil.org/fieldworks/ SIL FieldWorks] and convert as much as possible to monodix (and maybe bidix).<br />
| rationale = There's a lot of potentially useful data in FieldWorks files that might be enough to build a whole monodix for some languages but it's currently really hard to use<br />
| mentors = [[User:Popcorndude|Popcorndude]], [[User:TommiPirinen|Flammie]]<br />
| more = /FieldWorks_data_extraction<br />
}}<br />
<br />
== Tooling ==<br />
<br />
These are projects for people who would be comfortable digging through our C++ codebases (you will be doing a lot of that).<br />
<br />
{{IdeaSummary<br />
| name = Python API for Apertium<br />
| difficulty = medium<br />
| length = either<br />
| skills = C++, Python<br />
| description = Update the Python API for Apertium to expose all Apertium modes and test with all major OSes<br />
| rationale = The current Python API misses out on a lot of functionality, like phonemicisation, segmentation, and transliteration, and doesn't work on some OSes <s>like Debian</s>.<br />
| mentors = [[User:Francis Tyers|Francis Tyers]]<br />
| more = /Python API<br />
}}<br />
<br />
{{IdeaSummary<br />
| name = Robust tokenisation in lttoolbox<br />
| difficulty = Medium<br />
| length = long<br />
| skills = C++, XML, Python<br />
| description = Improve the longest-match left-to-right tokenisation strategy in [[lttoolbox]] to be fully Unicode compliant.<br />
| rationale = One of the most frustrating things about working with Apertium on texts "in the wild" is the way tokenisation works. If a letter is not specified in the alphabet, it is treated as whitespace, so unknown words get split in two and you can end up with output like ^G$ö^k$ı^rmak$, which is terrible for further processing. <br />
| mentors = [[User:Francis Tyers|Francis Tyers]], [[User:TommiPirinen|Flammie]]<br />
| more = /Robust tokenisation<br />
}}<br />
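The alphabet problem is easy to illustrate. The toy tokenisers below are not lttoolbox code, just a sketch of the difference between splitting on a fixed alphabet and using Unicode letter properties.<br />

```python
# Any character not in the compiled alphabet acts as a token boundary in the
# current behaviour; a Unicode-aware tokeniser would use character properties
# instead of a fixed list.

ALPHABET = set("abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ")

def naive_tokens(text):
    """Split on anything outside ALPHABET -- mangles 'Gökırmak'."""
    tokens, cur = [], ""
    for ch in text:
        if ch in ALPHABET:
            cur += ch
        else:
            if cur:
                tokens.append(cur)
            cur = ""
    if cur:
        tokens.append(cur)
    return tokens

def unicode_tokens(text):
    """Treat any Unicode letter as word material."""
    tokens, cur = [], ""
    for ch in text:
        if ch.isalpha():
            cur += ch
        else:
            if cur:
                tokens.append(cur)
            cur = ""
    if cur:
        tokens.append(cur)
    return tokens

print(naive_tokens("Gökırmak"))    # → ['G', 'k', 'rmak'] -- split at ö and ı
print(unicode_tokens("Gökırmak"))  # → ['Gökırmak']
```

The real task also has to preserve longest-match behaviour and the ^...$ stream format, which this sketch ignores.<br />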
<br />
{{IdeaSummary<br />
| name = rule visualization tools<br />
| difficulty = Medium<br />
| length = long<br />
| skills = python? javascript? XML<br />
| description = make tools to help visualize the effect of various rules<br />
| rationale = TODO see https://github.com/Jakespringer/dapertium for an example<br />
| mentors = [[User:Firespeaker|Jonathan Washington]], [[User:Sevilay Bayatlı|Sevilay Bayatlı]], [[User:Popcorndude]]<br />
| more = /Visualization tools<br />
}}<br />
<br />
{{IdeaSummary<br />
| name = Extend Weighted transfer rules<br />
| difficulty = Medium<br />
| length = either<br />
| skills = C++, python<br />
| description = The weighted transfer module is currently applied only to the chunker transfer rules. The idea here is to extend it to interchunk and postchunk transfer rules as well. <br />
| rationale = As a resource see https://github.com/aboelhamd/Weighted-transfer-rules-module<br />
| mentors = [[User: Sevilay Bayatlı|Sevilay Bayatlı]]<br />
| more = /Make a module <br />
}}<br />
<br />
{{IdeaSummary<br />
| name = Automatic Error-Finder / Backpropagation<br />
| difficulty = Hard<br />
| length = long<br />
| skills = python?<br />
| description = Develop a tool to locate the approximate source of translation errors in the pipeline.<br />
| rationale = Being able to generate a list of probable error sources automatically makes it possible to prioritize issues by frequency, frees up developer time, and is a first step towards automated generation of better rules.<br />
| mentors = [[User:Popcorndude]]<br />
| more = /Backpropagation<br />
}}<br />
<br />
{{IdeaSummary<br />
| name = Localization (l10n/i18n) of Apertium tools<br />
| difficulty = Medium<br />
| length = short<br />
| skills = C++<br />
| description = All our command line tools are currently hardcoded as English-only and it would be good if this were otherwise. [https://github.com/apertium/organisation/issues/28#issuecomment-803474833 Coding Challenge]<br />
| rationale = ...<br />
| mentors = [[User:Tino_Didriksen|Tino Didriksen]]<br />
| more = [https://github.com/apertium/organisation/issues/28 GitHub]<br />
}}<br />
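The C++ tools would presumably use gettext/libintl for this; the same pattern can be sketched with Python's standard-library <code>gettext</code> (the domain name and message below are invented). The key property is the fallback: with no compiled catalog installed, the English msgid comes through unchanged, so untranslated builds keep working.<br />

```python
import gettext

# With no .mo catalogs installed, gettext falls back to the msgid, so
# English stays the default -- the same pattern libintl gives C++ tools.
t = gettext.translation("apertium", localedir="/nonexistent", fallback=True)
_ = t.gettext

def usage():
    # Hypothetical message; real tools would wrap their existing strings.
    return _("USAGE: lt-proc [options] fst_file [input_file [output_file]]")

print(usage())
```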
<br />
{{IdeaSummary<br />
| name = Language Server Protocol<br />
| difficulty = medium<br />
| length = short<br />
| skills = any programming language<br />
| description = Build a [https://microsoft.github.io/language-server-protocol/ Language Server] for the various Apertium rule formats<br />
| rationale = We have some static analysis tools and syntax highlighters already and it would be great if we could combine and expand them to support more text editors.<br />
| mentors = [[User:Popcorndude]]<br />
| more = /Language Server Protocol<br />
}}<br />
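At the wire level, a language server just exchanges JSON-RPC payloads with Content-Length framing, so any language with a JSON library can implement one. A minimal sketch of framing a response (field names follow the LSP specification; the capability shown is arbitrary):<br />

```python
import json

# LSP messages are JSON-RPC payloads preceded by HTTP-style headers.
def frame(payload: dict) -> bytes:
    """Serialise a payload with LSP Content-Length framing."""
    body = json.dumps(payload).encode("utf-8")
    return b"Content-Length: %d\r\n\r\n" % len(body) + body

msg = frame({"jsonrpc": "2.0", "id": 1,
             "result": {"capabilities": {"hoverProvider": True}}})
print(msg.decode("utf-8").split("\r\n\r\n")[0])  # the header line
```

In practice one would build on an existing LSP library rather than hand-rolling the framing; the sketch just shows how little protocol machinery is involved.<br />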
<br />
{{IdeaSummary<br />
| name = WASM Compilation<br />
| difficulty = hard<br />
| length = long<br />
| skills = C++, Javascript<br />
| description = Compile the pipeline modules to WASM and provide JS wrappers for them.<br />
| rationale = There are situations where it would be nice to be able to run the entire pipeline in the browser<br />
| mentors = [[User:Tino Didriksen|Tino Didriksen]]<br />
| more = /WASM<br />
}}<br />
<br />
== Web ==<br />
<br />
If you know Python and JavaScript, here are some ideas for improving our [https://apertium.org website]. Some of these should be fairly short, and it would be a good idea to talk to the mentors about doing a couple of them together.<br />
<br />
{{IdeaSummary<br />
| name = Web API extensions<br />
| difficulty = medium<br />
| length = short<br />
| skills = Python<br />
| description = Update the web API for Apertium to expose all Apertium modes <br />
| rationale = The current Web API misses out on a lot of functionality, like phonemicisation, segmentation, transliteration, and paradigm generation.<br />
| mentors = [[User:Francis Tyers|Francis Tyers]], [[User:Firespeaker|Jonathan Washington]], [[User:Xavivars|Xavi Ivars]]<br />
| more = /Apertium APY<br />
}}<br />
<br />
{{IdeaSummary<br />
| name = Website Improvements: Misc<br />
| difficulty = Medium<br />
| length = short<br />
| skills = html, css, js, python<br />
| description = Improve elements of Apertium's web infrastructure<br />
| rationale = Apertium's website infrastructure [[Apertium-html-tools]] and its supporting API [[APy|Apertium APy]] have numerous open issues. This project would entail choosing a subset of open issues and features that could realistically be completed in the summer. You're encouraged to speak with the Apertium community to see which features and issues are the most pressing.<br />
| mentors = [[User:Firespeaker|Jonathan Washington]], [[User:Xavivars|Xavi Ivars]]<br />
| more = /Website improvements<br />
}}<br />
<br />
{{IdeaSummary<br />
| name = Website Improvements: Dictionary Lookup<br />
| difficulty = Medium<br />
| length = short<br />
| skills = html, css, js, python<br />
| description = Finish implementing dictionary lookup mode in Apertium's web infrastructure<br />
| rationale = Apertium's website infrastructure [[Apertium-html-tools]] and its supporting API [[APy|Apertium APy]] have numerous open issues, including half-completed features like dictionary lookup. This project would entail completing the dictionary lookup feature. Some additional features which would be good to work on include automatic reverse lookups (so that a user has a better understanding of the results), grammatical information (such as the gender of nouns or the conjugation paradigms of verbs), and information about MWEs. See [https://github.com/apertium/apertium-html-tools/issues/105 the open issue on GitHub].<br />
| mentors = [[User:Firespeaker|Jonathan Washington]], [[User:Xavivars|Xavi Ivars]], [[User:Popcorndude]]<br />
| more = /Website improvements<br />
}}<br />
<br />
{{IdeaSummary<br />
| name = Website Improvements: Spell checking<br />
| difficulty = Medium<br />
| length = short<br />
| skills = html, js, css, python<br />
| description = Add a spell-checking interface to Apertium's web tools<br />
| rationale = [[Apertium-html-tools]] has seen some prototypes for spell-checking interfaces (all in stale PRs and branches on GitHub), but none have ended up being quite ready to integrate into the tools. This project would entail polishing up or recreating an interface, and making sure [[APy]] has a mode that allows access to Apertium voikospell modules. The end result should be a slick, easy-to-use interface for proofing text, with intuitive underlining of text deemed to be misspelled and intuitive presentation and selection of alternatives. See [https://github.com/apertium/apertium-html-tools/issues/390 the open issue on GitHub].<br />
| mentors = [[User:Firespeaker|Jonathan Washington]], [[User:Xavivars|Xavi Ivars]]<br />
| more = /Spell checker web interface<br />
}}<br />
<br />
{{IdeaSummary<br />
| name = Website Improvements: Suggestions<br />
| difficulty = Medium<br />
| length = short<br />
| skills = html, css, js, python<br />
| description = Finish implementing a suggestions interface for Apertium's web infrastructure<br />
| rationale = Some work has been done to add a "suggestions" interface to Apertium's website infrastructure [[Apertium-html-tools]] and its supporting API [[APy|Apertium APy]], whereby users can suggest corrected translations. This project would entail finishing that feature. There are some related [https://github.com/apertium/apertium-html-tools/issues/55 issues] and [https://github.com/apertium/apertium-html-tools/pull/252 PRs] on GitHub.<br />
| mentors = [[User:Firespeaker|Jonathan Washington]], [[User:Xavivars|Xavi Ivars]]<br />
| more = /Website improvements<br />
}}<br />
<br />
{{IdeaSummary<br />
| name = Website Improvements: Orthography conversion interface<br />
| difficulty = Medium<br />
| length = short<br />
| skills = html, js, css, python<br />
| description = Add an orthography conversion interface to Apertium's web tools<br />
| rationale = Several Apertium language modules (like Kazakh, Kyrgyz, Crimean Tatar, and Hñähñu) have orthography conversion modes in their mode definition files. This project would be to expose those modes through [[APy|Apertium APy]] and provide a simple interface in [[Apertium-html-tools]] to use them.<br />
| mentors = [[User:Firespeaker|Jonathan Washington]], [[User:Xavivars|Xavi Ivars]]<br />
| more = /Website improvements<br />
}}<br />
<br />
{{IdeaSummary<br />
| name = Add support for NMT to web API<br />
| difficulty = Medium<br />
| length = long<br />
| skills = python, NMT<br />
| description = Add support for a popular NMT engine to Apertium's web API<br />
| rationale = Currently Apertium's web API, [[APy|Apertium APy]], supports only Apertium language modules. But the front end could just as easily interface with an API that supports trained NMT models. The point of the project is to add support for one popular NMT package (e.g. translateLocally/Bergamot, OpenNMT or JoeyNMT) to APy.<br />
| mentors = [[User:Firespeaker|Jonathan Washington]], [[User:Xavivars|Xavi Ivars]]<br />
| more = <br />
}}<br />
<br />
== Integrations ==<br />
<br />
In addition to incorporating data from other projects, it would be nice if we could also make our data useful to them.<br />
<br />
{{IdeaSummary<br />
| name = OmniLingo and Apertium<br />
| difficulty = medium<br />
| length = either<br />
| skills = JS, Python<br />
| description = OmniLingo is a language learning system for practicing listening comprehension using Apertium data. There is a lot of text processing involved (for example tokenisation) that could be aided by Apertium tools. <br />
| rationale = <br />
| mentors = [[User:Francis Tyers|Francis Tyers]]<br />
| more = /OmniLingo<br />
}}<br />
<br />
{{IdeaSummary<br />
| name = Support for Enhanced Dependencies in UD Annotatrix<br />
| difficulty = medium<br />
| length = either<br />
| skills = NodeJS<br />
| description = UD Annotatrix is an annotation interface for Universal Dependencies, but does not yet support enhanced dependencies<br />
| rationale = <br />
| mentors = [[User:Francis Tyers|Francis Tyers]]<br />
| more = /Morphological analyser<br />
}}<br />
<br />
<!--<br />
This one was done, but could do with more work. Not sure if it's a full gsoc though?<br />
<br />
{{IdeaSummary<br />
| name = User-friendly lexical selection training<br />
| difficulty = Medium<br />
| skills = Python, C++, shell scripting<br />
| description = Make it so that training/inference of lexical selection rules is a more user-friendly process<br />
| rationale = Our lexical selection module allows for inferring rules from corpora and word alignments, but the procedure is currently a bit messy, with various scripts involved that require lots of manual tweaking, and many third party tools to be installed. The goal of this task is to make the procedure as user-friendly as possible, so that ideally only a simple config file would be needed, and a driver script would take care of the rest.<br />
| mentors = [[User:Unhammer|Unhammer]], [[User:Mlforcada|Mikel Forcada]]<br />
| more = /User-friendly lexical selection training<br />
}}<br />
--><br />
<br />
{{IdeaSummary<br />
| name = UD and Apertium integration<br />
| difficulty = Entry level<br />
| length = short<br />
| skills = python, javascript, HTML, (C++)<br />
| description = Create a range of tools for making Apertium compatible with Universal Dependencies<br />
| rationale = Universal Dependencies is a fast-growing project aimed at creating a unified annotation scheme for treebanks. This includes both part-of-speech and morphological features. Their annotated corpora could be extremely useful to Apertium for training translation models. In addition, Apertium's rule-based morphological descriptions could be useful for software that relies on Universal Dependencies.<br />
| mentors = [[User:Francis Tyers]], [[User:Firespeaker| Jonathan Washington]], [[User:Popcorndude]]<br />
| more = /UD and Apertium integration <br />
}}<br />
<br />
[[Category:Development]]<br />
[[Category:Google Summer of Code]]</div>Firespeakerhttps://wiki.apertium.org/w/index.php?title=Category:GSoC_2023_student_proposals&diff=74177Category:GSoC 2023 student proposals2023-01-28T21:13:49Z<p>Firespeaker: Created page with "2023 2023"</p>
<hr />
<div>[[Category:Google Summer of Code|2023]]<br />
[[Category:Student proposals for the Google Summer of Code|2023]]</div>Firespeakerhttps://wiki.apertium.org/w/index.php?title=Ideas_for_Google_Summer_of_Code&diff=74176Ideas for Google Summer of Code2023-01-28T21:13:30Z<p>Firespeaker: </p>
<hr />
<div>{{TOCD}}<br />
This is the ideas page for [[Google Summer of Code]]. Here you can find ideas for interesting projects that would make Apertium more useful for people and improve or expand our functionality. If you have an idea, please add it below; if you think you could mentor someone in a particular area, add your name to "Interested mentors" using <nowiki>~~~</nowiki> <br />
<br />
The page is intended as an overview of the kind of projects we have in mind. If one of them particularly piques your interest, please come and discuss with us on <code>#apertium</code> on <code>irc.oftc.net</code>, mail the [[Contact|mailing list]], or draw attention to yourself in some other way. <br />
<br />
Note that, if you have an idea that isn't mentioned here, we would be very interested to hear about it.<br />
<br />
Here are some more things you could look at:<br />
<br />
* [[Top tips for GSOC applications]] <br />
* Get in contact with one of our long-serving [[List of Apertium mentors|mentors]] &mdash; they are nice, honest!<br />
* Pages in the [[:Category:Development|development category]]<br />
* Resources that could be converted or expanded in the [[incubator]]. Consider doing or improving a language pair (see [[incubator]], [[nursery]] and [[staging]] for pairs that need work)<br />
* Unhammer's [[User:Unhammer/wishlist|wishlist]]<br />
<!--* The open issues [https://github.com/search?q=org%3Aapertium&state=open&type=Issues on Github] - especially the [https://github.com/search?q=org%3Aapertium+label%3A%22good+first+issue%22&state=open&type=Issues Good First Issues]. --><br />
<br />
__TOC__<br />
<br />
If you're a student trying to propose a topic, the recommended way is to request a wiki account and then go to <pre>http://wiki.apertium.org/wiki/User:[[your username]]/GSoC2023Proposal</pre> and click the "create" button near the top of the page. It's also nice to include <code><nowiki>[[</nowiki>[[:Category:GSoC_2023_student_proposals|Category:GSoC_2023_student_proposals]]<nowiki>]]</nowiki></code> to help organize submitted proposals.<br />
<br />
== Language Data ==<br />
<br />
Can you read or write a language other than English (and we do mean any language)? If so, you can help with one of these and we can help you figure out the technical parts.<br />
<br />
{{IdeaSummary<br />
| name = Develop a morphological analyser<br />
| difficulty = easy<br />
| length = either<br />
| skills = XML or HFST or lexd<br />
| description = Write a morphological analyser and generator for a language that does not yet have one<br />
| rationale = A key part of an Apertium machine translation system is a morphological analyser and generator. The objective of this task is to create an analyser for a language that does not yet have one.<br />
| mentors = [[User:Francis Tyers|Francis Tyers]], [[User:Firespeaker|Jonathan Washington]], [[User: Sevilay Bayatlı|Sevilay Bayatlı]], Hossep, nlhowell, [[User:Popcorndude]]<br />
| more = /Morphological analyser<br />
}}<br />
<br />
{{IdeaSummary<br />
| name = apertium-separable language-pair integration<br />
| difficulty = Medium<br />
| length = short<br />
| skills = XML, a scripting language (Python, Perl), some knowledge of linguistics and/or at least one relevant natural language<br />
| description = Choose a language you can identify as having a good number of "multiwords" in the lexicon. Modify all language pairs in Apertium to use the [[Apertium-separable]] module to process the multiwords, and clean up the dictionaries accordingly.<br />
| rationale = Apertium-separable is a newish module to process lexical items with discontiguous dependencies, an area where Apertium has traditionally fallen short. Despite all the module has to offer, many translation pairs still don't use it.<br />
| mentors = [[User:Firespeaker|Jonathan Washington]], [[User:Popcorndude]]<br />
| more = /Apertium separable<br />
}}<br />
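To see what "discontiguous dependencies" means in practice, here is a toy lemma-level sketch (not the actual lsx format or algorithm) of rejoining a separable multiword like "take ... out" into a single unit so that later stages can treat it as one lexical item.<br />

```python
# Toy illustration of a discontiguous multiword ("take ... out"):
# apertium-separable rejoins the parts into one lexical unit so that
# transfer can handle them together. Simplified lemma-level sketch:

def join_separable(tokens, first, second):
    """Merge first...second into one unit, keeping intervening words."""
    out, i = [], 0
    while i < len(tokens):
        if tokens[i] == first and second in tokens[i + 1:]:
            j = tokens.index(second, i + 1)
            out.append(f"{first}_{second}")   # merged multiword
            out.extend(tokens[i + 1:j])       # intervening words stay put
            i = j + 1
        else:
            out.append(tokens[i])
            i += 1
    return out

print(join_separable(["take", "the", "trash", "out"], "take", "out"))
# → ['take_out', 'the', 'trash']
```

The real module works on analysed lexical units in the Apertium stream format and is driven by a dictionary of separable entries; the sketch only shows the rejoining idea.<br />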
<br />
{{IdeaSummary<br />
| name = Bring an unreleased translation pair to releasable quality<br />
| difficulty = Medium<br />
| length = long<br />
| skills = shell scripting<br />
| description = Take an unstable language pair and improve its quality, focusing on testvoc<br />
| rationale = Many Apertium language pairs have large dictionaries and have otherwise seen much development, but are not of releasable quality. The point of this project would be to bring one translation pair to releasable quality. This would entail obtaining good naïve coverage and a clean [[testvoc]].<br />
| mentors = [[User:Firespeaker|Jonathan Washington]], [[User:Sevilay Bayatlı|Sevilay Bayatlı]], [[User:Unhammer]], [[User:hectoralos|Hèctor Alòs i Font]]<br />
| more = /Make a language pair state-of-the-art<br />
}}<br />
<br />
{{IdeaSummary<br />
| name = Develop a prototype MT system for a strategic language pair<br />
| difficulty = Medium<br />
| length = long<br />
| skills = XML, some knowledge of linguistics and of one relevant natural language <br />
| description = Create a translation pair based on two existing language modules, focusing on the dictionary and structural transfer<br />
| rationale = Choose a strategic set of languages to develop an MT system for, such that you know the target language well and morphological transducers for each language are part of Apertium. Develop an Apertium MT system by focusing on writing a bilingual dictionary and structural transfer rules. Expanding the transducers and disambiguation, and writing lexical selection rules and multiword sequences may also be part of the work. The pair may be an existing prototype, but if it's a heavily developed but unreleased pair, consider applying for "Bring an unreleased translation pair to releasable quality" instead.<br />
| mentors = [[User:Firespeaker|Jonathan Washington]], [[User:Sevilay Bayatlı| Sevilay Bayatlı]], [[User:Unhammer]], [[User:hectoralos|Hèctor Alòs i Font]]<br />
| more = /Adopt a language pair<br />
}}<br />
<br />
{{IdeaSummary<br />
| name = Add a new variety to an existing language<br />
| difficulty = easy<br />
| length = either<br />
| skills = XML, some knowledge of linguistics and of one relevant natural language <br />
| description = Add a language variety to one or more released pairs, focusing on the dictionary and lexical selection<br />
| rationale = Take a released language, and define a new language variety for it: e.g. Quebec French or Provençal Occitan. Then add the new variety to one or more released language pairs, without diminishing the quality of the pre-existing variety(ies). The objective is to facilitate the generation of varieties for languages with a weak standardisation and/or pluricentric languages.<br />
| mentors = [[User:hectoralos|Hèctor Alòs i Font]], [[User:Firespeaker|Jonathan Washington]]<br />
| more = /Add a new variety to an existing language<br />
}}<br />
<br />
{{IdeaSummary<br />
| name = Leverage and integrate language preferences into language pairs<br />
| difficulty = easy<br />
| length = short<br />
| skills = XML, some knowledge of linguistics and of one relevant natural language <br />
| description = Update language pairs with lexical and orthographical variations to leverage the new [[Dialectal_or_standard_variation|preferences]] functionality<br />
| rationale = Currently, preferences are implemented via language variant, which relies on multiple dictionaries, increasing compilation time exponentially every time a new preference gets introduced.<br />
| mentors = [[User:Xavivars|Xavi Ivars]] [[User:Unhammer]]<br />
| more = /Use preferences in SPA-CAT<br />
}}<br />
<br />
{{IdeaSummary<br />
| name = Add Capitalization Handling Module to a Language Pair<br />
| difficulty = easy<br />
| length = short<br />
| skills = XML, knowledge of some relevant natural language<br />
| description = Update a language pair to make use of the new [[Capitalization_restoration|Capitalization handling module]]<br />
| rationale = Correcting capitalization via transfer rules is tedious and error prone<br />
| mentors = [[User:Popcorndude]]<br />
| more = /Capitalization<br />
}}<br />
<br />
== Data Extraction ==<br />
<br />
A lot of the language data we need to make our analyzers and translators work already exists in other forms and we just need to figure out how to convert it. If you know of another source of data that isn't listed, we'd love to hear about it.<br />
<br />
{{IdeaSummary<br />
| name = dictionary induction from wikis<br />
| difficulty = Medium<br />
| length = either<br />
| skills = MySQL, mediawiki syntax, perl, maybe C++ or Java; Java, Scala, RDF, and DBpedia to use DBpedia extraction<br />
| description = Extract dictionaries from linguistic wikis<br />
| rationale = Wiki dictionaries and encyclopedias (e.g. omegawiki, wiktionary, wikipedia, dbpedia) contain information (e.g. bilingual equivalences, morphological features, conjugations) that could be exploited to speed up the development of dictionaries for Apertium. This task aims at automatically building dictionaries by extracting different pieces of information from wiki structures such as interlingual links, infoboxes and/or from dbpedia RDF datasets.<br />
| mentors = [[User:Firespeaker|Jonathan Washington]], [[User:Popcorndude]]<br />
| more = /Dictionary induction from wikis<br />
}}<br />
<br />
{{IdeaSummary<br />
| name = Dictionary induction from parallel corpora / Revive ReTraTos<br />
| difficulty = Hard<br />
| length = either<br />
| skills = C++, perl, python, xml, scripting, machine learning<br />
| description = Extract dictionaries from parallel corpora<br />
| rationale = Given a pair of monolingual modules and a parallel corpus, we should be able to run a program to align tagged sentences and give us the best entries that are missing from bidix. [[ReTraTos]] did this back in 2008, but it hasn't been maintained since. We want a program which builds and runs in 2022, and does all the steps for the user.<br />
| mentors = [[User:Unhammer]], [[User:Popcorndude]]<br />
| more = /Dictionary induction from parallel corpora<br />
}}<br />
<br />
{{IdeaSummary<br />
| name = Extract morphological data from FLEx<br />
| difficulty = hard<br />
| length = long<br />
| skills = python, XML parsing<br />
| description = Write a program to extract data from [https://software.sil.org/fieldworks/ SIL FieldWorks] and convert as much as possible to monodix (and maybe bidix).<br />
| rationale = There's a lot of potentially useful data in FieldWorks files that might be enough to build a whole monodix for some languages but it's currently really hard to use<br />
| mentors = [[User:Popcorndude|Popcorndude]], [[User:TommiPirinen|Flammie]]<br />
| more = /FieldWorks_data_extraction<br />
}}<br />
<br />
== Tooling ==<br />
<br />
These are projects for people who would be comfortable digging through our C++ codebases (you will be doing a lot of that).<br />
<br />
{{IdeaSummary<br />
| name = Python API for Apertium<br />
| difficulty = medium<br />
| length = either<br />
| skills = C++, Python<br />
| description = Update the Python API for Apertium to expose all Apertium modes and test with all major OSes<br />
| rationale = The current Python API misses out on a lot of functionality, like phonemicisation, segmentation, and transliteration, and doesn't work on some OSes <s>like Debian</s>.<br />
| mentors = [[User:Francis Tyers|Francis Tyers]]<br />
| more = /Python API<br />
}}<br />
<br />
{{IdeaSummary<br />
| name = Robust tokenisation in lttoolbox<br />
| difficulty = Medium<br />
| length = long<br />
| skills = C++, XML, Python<br />
| description = Improve the longest-match left-to-right tokenisation strategy in [[lttoolbox]] to be fully Unicode compliant.<br />
| rationale = One of the most frustrating things about working with Apertium on texts "in the wild" is the way tokenisation works. If a letter is not specified in the alphabet, it is treated as whitespace, so unknown words get split in two and you can end up with output like ^G$ö^k$ı^rmak$, which is terrible for further processing. <br />
| mentors = [[User:Francis Tyers|Francis Tyers]], [[User:TommiPirinen|Flammie]]<br />
| more = /Robust tokenisation<br />
}}<br />
<br />
{{IdeaSummary<br />
| name = rule visualization tools<br />
| difficulty = Medium<br />
| length = long<br />
| skills = python? javascript? XML<br />
| description = make tools to help visualize the effect of various rules<br />
| rationale = TODO see https://github.com/Jakespringer/dapertium for an example<br />
| mentors = [[User:Firespeaker|Jonathan Washington]], [[User:Sevilay Bayatlı|Sevilay Bayatlı]], [[User:Popcorndude]]<br />
| more = /Visualization tools<br />
}}<br />
<br />
{{IdeaSummary<br />
| name = Extend Weighted transfer rules<br />
| difficulty = Medium<br />
| length = either<br />
| skills = C++, python<br />
| description = The weighted transfer module is currently applied only to the chunker transfer rules. The idea here is to extend it to interchunk and postchunk transfer rules as well. <br />
| rationale = As a resource see https://github.com/aboelhamd/Weighted-transfer-rules-module<br />
| mentors = [[User: Sevilay Bayatlı|Sevilay Bayatlı]]<br />
| more = /Make a module <br />
}}<br />
<br />
{{IdeaSummary<br />
| name = Automatic Error-Finder / Backpropagation<br />
| difficulty = Hard<br />
| length = long<br />
| skills = python?<br />
| description = Develop a tool to locate the approximate source of translation errors in the pipeline.<br />
| rationale = Being able to generate a list of probable error sources automatically makes it possible to prioritize issues by frequency, frees up developer time, and is a first step towards automated generation of better rules.<br />
| mentors = [[User:Popcorndude]]<br />
| more = /Backpropagation<br />
}}<br />
<br />
{{IdeaSummary<br />
| name = Localization (l10n/i18n) of Apertium tools<br />
| difficulty = Medium<br />
| length = short<br />
| skills = C++<br />
| description = All our command line tools are currently hardcoded as English-only and it would be good if this were otherwise. [https://github.com/apertium/organisation/issues/28#issuecomment-803474833 Coding Challenge]<br />
| rationale = ...<br />
| mentors = [[User:Tino_Didriksen|Tino Didriksen]]<br />
| more = [https://github.com/apertium/organisation/issues/28 GitHub]<br />
}}<br />
<br />
{{IdeaSummary<br />
| name = Language Server Protocol<br />
| difficulty = medium<br />
| length = short<br />
| skills = any programming language<br />
| description = Build a [https://microsoft.github.io/language-server-protocol/ Language Server] for the various Apertium rule formats<br />
| rationale = We have some static analysis tools and syntax highlighters already and it would be great if we could combine and expand them to support more text editors.<br />
| mentors = [[User:Popcorndude]]<br />
| more = /Language Server Protocol<br />
}}<br />
<br />
{{IdeaSummary<br />
| name = WASM Compilation<br />
| difficulty = hard<br />
| length = long<br />
| skills = C++, Javascript<br />
| description = Compile the pipeline modules to WASM and provide JS wrappers for them.<br />
| rationale = There are situations where it would be nice to be able to run the entire pipeline in the browser<br />
| mentors = [[User:Tino Didriksen|Tino Didriksen]]<br />
| more = /WASM<br />
}}<br />
<br />
== Web ==<br />
<br />
If you know Python and JavaScript, here are some ideas for improving our [https://apertium.org website]. Some of these should be fairly short, and it would be a good idea to talk to the mentors about doing a couple of them together.<br />
<br />
{{IdeaSummary<br />
| name = Web API extensions<br />
| difficulty = medium<br />
| length = short<br />
| skills = Python<br />
| description = Update the web API for Apertium to expose all Apertium modes <br />
| rationale = The current Web API misses out on a lot of functionality, like phonemicisation, segmentation, transliteration, and paradigm generation.<br />
| mentors = [[User:Francis Tyers|Francis Tyers]], [[User:Firespeaker|Jonathan Washington]], [[User:Xavivars|Xavi Ivars]]<br />
| more = /Apertium APY<br />
}}<br />
<br />
{{IdeaSummary<br />
| name = Website Improvements: Misc<br />
| difficulty = Medium<br />
| length = short<br />
| skills = html, css, js, python<br />
| description = Improve elements of Apertium's web infrastructure<br />
| rationale = Apertium's website infrastructure [[Apertium-html-tools]] and its supporting API [[APy|Apertium APy]] have numerous open issues. This project would entail choosing a subset of open issues and features that could realistically be completed in the summer. You're encouraged to speak with the Apertium community to see which features and issues are the most pressing.<br />
| mentors = [[User:Firespeaker|Jonathan Washington]], [[User:Xavivars|Xavi Ivars]]<br />
| more = /Website improvements<br />
}}<br />
<br />
{{IdeaSummary<br />
| name = Website Improvements: Dictionary Lookup<br />
| difficulty = Medium<br />
| length = short<br />
| skills = html, css, js, python<br />
| description = Finish implementing dictionary lookup mode in Apertium's web infrastructure<br />
| rationale = Apertium's website infrastructure [[Apertium-html-tools]] and its supporting API [[APy|Apertium APy]] have numerous open issues, including half-completed features like dictionary lookup. This project would entail completing the dictionary lookup feature. Some additional features which would be good to work on include automatic reverse lookups (so that a user has a better understanding of the results), grammatical information (such as the gender of nouns or the conjugation paradigms of verbs), and information about MWEs. See [https://github.com/apertium/apertium-html-tools/issues/105 the open issue on GitHub].<br />
| mentors = [[User:Firespeaker|Jonathan Washington]], [[User:Xavivars|Xavi Ivars]], [[User:Popcorndude]]<br />
| more = /Website improvements<br />
}}<br />
<br />
{{IdeaSummary<br />
| name = Website Improvements: Spell checking<br />
| difficulty = Medium<br />
| length = short<br />
| skills = html, js, css, python<br />
| description = Add a spell-checking interface to Apertium's web tools<br />
| rationale = [[Apertium-html-tools]] has seen some prototypes for spell-checking interfaces (all in stale PRs and branches on GitHub), but none have ended up being quite ready to integrate into the tools. This project would entail polishing up or recreating an interface, and making sure [[APy]] has a mode that allows access to Apertium voikospell modules. The end result should be a slick, easy-to-use interface for proofing text, with intuitive underlining of text deemed to be misspelled and intuitive presentation and selection of alternatives. See [https://github.com/apertium/apertium-html-tools/issues/390 the open issue on GitHub].<br />
| mentors = [[User:Firespeaker|Jonathan Washington]], [[User:Xavivars|Xavi Ivars]]<br />
| more = /Spell checker web interface<br />
}}<br />
<br />
{{IdeaSummary<br />
| name = Website Improvements: Suggestions<br />
| difficulty = Medium<br />
| length = short<br />
| skills = html, css, js, python<br />
| description = Finish implementing a suggestions interface for Apertium's web infrastructure<br />
| rationale = Some work has been done to add a "suggestions" interface to Apertium's website infrastructure [[Apertium-html-tools]] and its supporting API [[APy|Apertium APy]], whereby users can suggest corrected translations. This project would entail finishing that feature. There are some related [https://github.com/apertium/apertium-html-tools/issues/55 issues] and [https://github.com/apertium/apertium-html-tools/pull/252 PRs] on GitHub.<br />
| mentors = [[User:Firespeaker|Jonathan Washington]], [[User:Xavivars|Xavi Ivars]]<br />
| more = /Website improvements<br />
}}<br />
<br />
{{IdeaSummary<br />
| name = Website Improvements: Orthography conversion interface<br />
| difficulty = Medium<br />
| length = short<br />
| skills = html, js, css, python<br />
| description = Add an orthography conversion interface to Apertium's web tools<br />
| rationale = Several Apertium language modules (like Kazakh, Kyrgyz, Crimean Tatar, and Hñähñu) have orthography conversion modes in their mode definition files. This project would be to expose those modes through [[APy|Apertium APy]] and provide a simple interface in [[Apertium-html-tools]] to use them.<br />
| mentors = [[User:Firespeaker|Jonathan Washington]], [[User:Xavivars|Xavi Ivars]]<br />
| more = /Website improvements<br />
}}<br />
<br />
{{IdeaSummary<br />
| name = Add support for NMT to web API<br />
| difficulty = Medium<br />
| length = long<br />
| skills = python, NMT<br />
| description = Add support for a popular NMT engine to Apertium's web API<br />
| rationale = Currently Apertium's web API, [[APy|Apertium APy]], supports only Apertium language modules. But the front end could just as easily interface with an API that supports trained NMT models. The point of the project is to add support for one popular NMT package (e.g., translateLocally/Bergamot, OpenNMT or JoeyNMT) to the APy.<br />
| mentors = [[User:Firespeaker|Jonathan Washington]], [[User:Xavivars|Xavi Ivars]]<br />
| more = <br />
}}<br />
<br />
== Integrations ==<br />
<br />
In addition to incorporating data from other projects, it would be nice if we could also make our data useful to them.<br />
<br />
{{IdeaSummary<br />
| name = OmniLingo and Apertium<br />
| difficulty = medium<br />
| length = either<br />
| skills = JS, Python<br />
| description = OmniLingo is a language learning system for practicing listening comprehension using Apertium data. There is a lot of text processing involved (for example tokenisation) that could be aided by Apertium tools. <br />
| rationale = <br />
| mentors = [[User:Francis Tyers|Francis Tyers]]<br />
| more = /OmniLingo<br />
}}<br />
<br />
{{IdeaSummary<br />
| name = Support for Enhanced Dependencies in UD Annotatrix<br />
| difficulty = medium<br />
| length = either<br />
| skills = NodeJS<br />
| description = UD Annotatrix is an annotation interface for Universal Dependencies, but does not yet support all functionality, such as enhanced dependencies<br />
| rationale = <br />
| mentors = [[User:Francis Tyers|Francis Tyers]]<br />
| more = /Morphological analyser<br />
}}<br />
<br />
<!--<br />
This one was done, but could do with more work. Not sure if it's a full gsoc though?<br />
<br />
{{IdeaSummary<br />
| name = User-friendly lexical selection training<br />
| difficulty = Medium<br />
| skills = Python, C++, shell scripting<br />
| description = Make it so that training/inference of lexical selection rules is a more user-friendly process<br />
| rationale = Our lexical selection module allows for inferring rules from corpora and word alignments, but the procedure is currently a bit messy, with various scripts involved that require lots of manual tweaking, and many third party tools to be installed. The goal of this task is to make the procedure as user-friendly as possible, so that ideally only a simple config file would be needed, and a driver script would take care of the rest.<br />
| mentors = [[User:Unhammer|Unhammer]], [[User:Mlforcada|Mikel Forcada]]<br />
| more = /User-friendly lexical selection training<br />
}}<br />
--><br />
<br />
{{IdeaSummary<br />
| name = UD and Apertium integration<br />
| difficulty = Entry level<br />
| length = short<br />
| skills = python, javascript, HTML, (C++)<br />
| description = Create a range of tools for making Apertium compatible with Universal Dependencies<br />
| rationale = Universal dependencies is a fast-growing project aimed at creating a unified annotation scheme for treebanks. This includes both part-of-speech and morphological features. Their annotated corpora could be extremely useful for Apertium for training models for translation. In addition, Apertium's rule-based morphological descriptions could be useful for software that relies on Universal dependencies.<br />
| mentors = [[User:Francis Tyers]], [[User:Firespeaker| Jonathan Washington]], [[User:Popcorndude]]<br />
| more = /UD and Apertium integration <br />
}}<br />
<br />
[[Category:Development]]<br />
[[Category:Google Summer of Code]]</div>Firespeakerhttps://wiki.apertium.org/w/index.php?title=Ideas_for_Google_Summer_of_Code&diff=74175Ideas for Google Summer of Code2023-01-28T21:11:40Z<p>Firespeaker: looking good</p>
<hr />
<div>{{TOCD}}<br />
This is the ideas page for [[Google Summer of Code]], where you can find ideas for interesting projects that would make Apertium more useful for people and improve or expand our functionality. If you have an idea, please add it below; if you think you could mentor someone in a particular area, add your name to "Interested mentors" using <nowiki>~~~</nowiki> <br />
<br />
The page is intended as an overview of the kind of projects we have in mind. If one of them particularly piques your interest, please come and discuss with us on <code>#apertium</code> on <code>irc.oftc.net</code>, mail the [[Contact|mailing list]], or draw attention to yourself in some other way. <br />
<br />
Note that, if you have an idea that isn't mentioned here, we would be very interested to hear about it.<br />
<br />
Here are some more things you could look at:<br />
<br />
* [[Top tips for GSOC applications]] <br />
* Get in contact with one of our long-serving [[List of Apertium mentors|mentors]] &mdash; they are nice, honest!<br />
* Pages in the [[:Category:Development|development category]]<br />
* Resources that could be converted or expanded in the [[incubator]]. Consider doing or improving a language pair (see [[incubator]], [[nursery]] and [[staging]] for pairs that need work)<br />
* Unhammer's [[User:Unhammer/wishlist|wishlist]]<br />
<!--* The open issues [https://github.com/search?q=org%3Aapertium&state=open&type=Issues on Github] - especially the [https://github.com/search?q=org%3Aapertium+label%3A%22good+first+issue%22&state=open&type=Issues Good First Issues]. --><br />
<br />
__TOC__<br />
<br />
If you're a student trying to propose a topic, the recommended way is to request a wiki account and then go to <pre>http://wiki.apertium.org/wiki/User:[[your username]]/GSoC2023Proposal</pre> and click the "create" button near the top of the page. It's also nice to include <code><nowiki>[[Category:GSoC_2023_student_proposals]]</nowiki></code> to help organize submitted proposals.<br />
<br />
== Language Data ==<br />
<br />
Can you read or write a language other than English (and we do mean any language)? If so, you can help with one of these and we can help you figure out the technical parts.<br />
<br />
{{IdeaSummary<br />
| name = Develop a morphological analyser<br />
| difficulty = easy<br />
| length = either<br />
| skills = XML or HFST or lexd<br />
| description = Write a morphological analyser and generator for a language that does not yet have one<br />
| rationale = A key part of an Apertium machine translation system is a morphological analyser and generator. The objective of this task is to create an analyser for a language that does not yet have one.<br />
| mentors = [[User:Francis Tyers|Francis Tyers]], [[User:Firespeaker|Jonathan Washington]], [[User: Sevilay Bayatlı|Sevilay Bayatlı]], Hossep, nlhowell, [[User:Popcorndude]]<br />
| more = /Morphological analyser<br />
}}<br />
<br />
{{IdeaSummary<br />
| name = apertium-separable language-pair integration<br />
| difficulty = Medium<br />
| length = short<br />
| skills = XML, a scripting language (Python, Perl), some knowledge of linguistics and/or at least one relevant natural language<br />
| description = Choose a language you can identify as having a good number of "multiwords" in the lexicon. Modify all language pairs in Apertium to use the [[Apertium-separable]] module to process the multiwords, and clean up the dictionaries accordingly.<br />
| rationale = Apertium-separable is a newish module to process lexical items with discontiguous dependencies, an area where Apertium has traditionally fallen short. Despite all the module has to offer, many translation pairs still don't use it.<br />
| mentors = [[User:Firespeaker|Jonathan Washington]], [[User:Popcorndude]]<br />
| more = /Apertium separable<br />
}}<br />
<br />
{{IdeaSummary<br />
| name = Bring an unreleased translation pair to releasable quality<br />
| difficulty = Medium<br />
| length = long<br />
| skills = shell scripting<br />
| description = Take an unstable language pair and improve its quality, focusing on testvoc<br />
| rationale = Many Apertium language pairs have large dictionaries and have otherwise seen much development, but are not of releasable quality. The point of this project would be to bring one translation pair to releasable quality. This would entail obtaining good naïve coverage and a clean [[testvoc]].<br />
| mentors = [[User:Firespeaker|Jonathan Washington]], [[User:Sevilay Bayatlı|Sevilay Bayatlı]], [[User:Unhammer]], [[User:hectoralos|Hèctor Alòs i Font]]<br />
| more = /Make a language pair state-of-the-art<br />
}}<br />
<br />
{{IdeaSummary<br />
| name = Develop a prototype MT system for a strategic language pair<br />
| difficulty = Medium<br />
| length = long<br />
| skills = XML, some knowledge of linguistics and of one relevant natural language <br />
| description = Create a translation pair based on two existing language modules, focusing on the dictionary and structural transfer<br />
| rationale = Choose a strategic set of languages to develop an MT system for, such that you know the target language well and morphological transducers for each language are part of Apertium. Develop an Apertium MT system by focusing on writing a bilingual dictionary and structural transfer rules. Expanding the transducers and disambiguation, and writing lexical selection rules and multiword sequences may also be part of the work. The pair may be an existing prototype, but if it's a heavily developed but unreleased pair, consider applying for "Bring an unreleased translation pair to releasable quality" instead.<br />
| mentors = [[User:Firespeaker|Jonathan Washington]], [[User:Sevilay Bayatlı| Sevilay Bayatlı]], [[User:Unhammer]], [[User:hectoralos|Hèctor Alòs i Font]]<br />
| more = /Adopt a language pair<br />
}}<br />
<br />
{{IdeaSummary<br />
| name = Add a new variety to an existing language<br />
| difficulty = easy<br />
| length = either<br />
| skills = XML, some knowledge of linguistics and of one relevant natural language <br />
| description = Add a language variety to one or more released pairs, focusing on the dictionary and lexical selection<br />
| rationale = Take a released language, and define a new language variety for it: e.g. Quebec French or Provençal Occitan. Then add the new variety to one or more released language pairs, without diminishing the quality of the pre-existing variety(ies). The objective is to facilitate the generation of varieties for languages with a weak standardisation and/or pluricentric languages.<br />
| mentors = [[User:hectoralos|Hèctor Alòs i Font]], [[User:Firespeaker|Jonathan Washington]]<br />
| more = /Add a new variety to an existing language<br />
}}<br />
<br />
{{IdeaSummary<br />
| name = Leverage and integrate language preferences into language pairs<br />
| difficulty = easy<br />
| length = short<br />
| skills = XML, some knowledge of linguistics and of one relevant natural language <br />
| description = Update language pairs with lexical and orthographical variations to leverage the new [[Dialectal_or_standard_variation|preferences]] functionality<br />
| rationale = Currently, preferences are implemented via language variant, which relies on multiple dictionaries, increasing compilation time exponentially every time a new preference gets introduced.<br />
| mentors = [[User:Xavivars|Xavi Ivars]] [[User:Unhammer]]<br />
| more = /Use preferences in SPA-CAT<br />
}}<br />
<br />
{{IdeaSummary<br />
| name = Add Capitalization Handling Module to a Language Pair<br />
| difficulty = easy<br />
| length = short<br />
| skills = XML, knowledge of some relevant natural language<br />
| description = Update a language pair to make use of the new [[Capitalization_restoration|Capitalization handling module]]<br />
| rationale = Correcting capitalization via transfer rules is tedious and error prone<br />
| mentors = [[User:Popcorndude]]<br />
| more = /Capitalization<br />
}}<br />
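A toy sketch of what such a dedicated module does, and why it beats hand-written transfer rules: restore the source word's capitalisation pattern on the translated word. The three-way scheme below is illustrative only, not the module's actual behaviour:<br />
<br />
```python
def restore_caps(source, target):
    """Copy the source word's capitalisation pattern onto the target:
    all-caps stays all-caps, title case stays title case, else lowercase."""
    if source.isupper() and len(source) > 1:   # acronyms like "EU"
        return target.upper()
    if source[:1].isupper():                   # sentence-initial / proper nouns
        return target[:1].upper() + target[1:]
    return target.lower()

assert restore_caps("House", "casa") == "Casa"
assert restore_caps("EU", "ue") == "UE"
assert restore_caps("house", "Casa") == "casa"
```
<br />
Doing this once in a module means no transfer rule ever has to special-case sentence-initial capitals again.<br />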
<br />
== Data Extraction ==<br />
<br />
A lot of the language data we need to make our analyzers and translators work already exists in other forms and we just need to figure out how to convert it. If you know of another source of data that isn't listed, we'd love to hear about it.<br />
<br />
{{IdeaSummary<br />
| name = dictionary induction from wikis<br />
| difficulty = Medium<br />
| length = either<br />
| skills = MySQL, mediawiki syntax, perl, maybe C++ or Java; Java, Scala, RDF, and DBpedia to use DBpedia extraction<br />
| description = Extract dictionaries from linguistic wikis<br />
| rationale = Wiki dictionaries and encyclopedias (e.g. omegawiki, wiktionary, wikipedia, dbpedia) contain information (e.g. bilingual equivalences, morphological features, conjugations) that could be exploited to speed up the development of dictionaries for Apertium. This task aims at automatically building dictionaries by extracting different pieces of information from wiki structures such as interlingual links, infoboxes and/or from dbpedia RDF datasets.<br />
| mentors = [[User:Firespeaker|Jonathan Washington]], [[User:Popcorndude]]<br />
| more = /Dictionary induction from wikis<br />
}}<br />
<br />
{{IdeaSummary<br />
| name = Dictionary induction from parallel corpora / Revive ReTraTos<br />
| difficulty = Hard<br />
| length = either<br />
| skills = C++, perl, python, xml, scripting, machine learning<br />
| description = Extract dictionaries from parallel corpora<br />
| rationale = Given a pair of monolingual modules and a parallel corpus, we should be able to run a program to align tagged sentences and give us the best entries that are missing from bidix. [[ReTraTos]] did this back in 2008, but it's from 2008. We want a program which builds and runs in 2022, and does all the steps for the user.<br />
| mentors = [[User:Unhammer]], [[User:Popcorndude]]<br />
| more = /Dictionary induction from parallel corpora<br />
}}<br />
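The core idea can be sketched with toy data; here crude co-occurrence counting stands in for real word alignment, and the corpus, bidix, and threshold are all made up:<br />
<br />
```python
from collections import Counter
from itertools import product

# Toy sentence-aligned corpus of (source lemmas, target lemmas);
# in practice these come from tagged, sentence-aligned text.
corpus = [
    (["house", "big"], ["casa", "gran"]),
    (["house", "small"], ["casa", "petit"]),
    (["dog", "big"], ["gos", "gran"]),
]

# Pairs already present in the bilingual dictionary.
bidix = {("big", "gran")}

def candidate_entries(corpus, bidix, min_count=2):
    """Count source/target lemma co-occurrences (a crude stand-in for
    word alignment) and propose frequent pairs missing from bidix."""
    counts = Counter()
    for src, trg in corpus:
        counts.update(product(src, trg))
    return [(pair, n) for pair, n in counts.most_common()
            if n >= min_count and pair not in bidix]

assert candidate_entries(corpus, bidix) == [(("house", "casa"), 2)]
```
<br />
A real tool would use proper alignment probabilities rather than raw counts, but the input/output shape (aligned tagged sentences in, ranked missing bidix entries out) is the same.<br />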
<br />
{{IdeaSummary<br />
| name = Extract morphological data from FLEx<br />
| difficulty = hard<br />
| length = long<br />
| skills = python, XML parsing<br />
| description = Write a program to extract data from [https://software.sil.org/fieldworks/ SIL FieldWorks] and convert as much as possible to monodix (and maybe bidix).<br />
| rationale = There's a lot of potentially useful data in FieldWorks files that might be enough to build a whole monodix for some languages, but it's currently really hard to use.<br />
| mentors = [[User:Popcorndude|Popcorndude]], [[User:TommiPirinen|Flammie]]<br />
| more = /FieldWorks_data_extraction<br />
}}<br />
<br />
== Tooling ==<br />
<br />
These are projects for people who would be comfortable digging through our C++ codebases (you will be doing a lot of that).<br />
<br />
{{IdeaSummary<br />
| name = Python API for Apertium<br />
| difficulty = medium<br />
| length = either<br />
| skills = C++, Python<br />
| description = Update the Python API for Apertium to expose all Apertium modes and test with all major OSes<br />
| rationale = The current Python API misses out on a lot of functionality, like phonemicisation, segmentation, and transliteration, and doesn't work for some OSes <s>like Debian</s>.<br />
| mentors = [[User:Francis Tyers|Francis Tyers]]<br />
| more = /Python API<br />
}}<br />
<br />
{{IdeaSummary<br />
| name = Robust tokenisation in lttoolbox<br />
| difficulty = Medium<br />
| length = long<br />
| skills = C++, XML, Python<br />
| description = Improve the longest-match left-to-right tokenisation strategy in [[lttoolbox]] to be fully Unicode compliant.<br />
| rationale = One of the most frustrating things about working with Apertium on texts "in the wild" is the way that the tokenisation works. If a letter is not specified in the alphabet, it is dealt with as whitespace, so unknown words get split in two and you can end up with stuff like ^G$ö^k$ı^rmak$, which is terrible for further processing. <br />
| mentors = [[User:Francis Tyers|Francis Tyers]], [[User:TommiPirinen|Flammie]]<br />
| more = /Robust tokenisation<br />
}}<br />
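A much-simplified sketch of the failure mode described above: if tokens are just maximal runs of alphabet letters, any letter missing from the alphabet splits the word exactly as in the ^G$ö^k$ı^rmak$ example. This is toy code, not lttoolbox's implementation:<br />
<br />
```python
def lmlr_tokenise(text, alphabet):
    """Toy longest-match left-to-right tokenisation: maximal runs of
    alphabet letters become tokens (^...$); any other character is
    treated like whitespace/punctuation and passed through."""
    out, i = [], 0
    while i < len(text):
        j = i
        while j < len(text) and text[j] in alphabet:
            j += 1
        if j > i:                      # matched a run of known letters
            out.append("^" + text[i:j] + "$")
            i = j
        else:                          # unknown letter falls outside tokens
            out.append(text[i])
            i += 1
    return "".join(out)

ascii_letters = set("abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ")

# ö and ı are missing from the alphabet, so the word is split:
assert lmlr_tokenise("Gökırmak", ascii_letters) == "^G$ö^k$ı^rmak$"
# With the missing letters added, the word stays whole:
assert lmlr_tokenise("Gökırmak", ascii_letters | set("öı")) == "^Gökırmak$"
```
<br />
Full Unicode compliance means the second behaviour without having to enumerate every letter by hand.<br />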
<br />
{{IdeaSummary<br />
| name = rule visualization tools<br />
| difficulty = Medium<br />
| length = long<br />
| skills = python? javascript? XML<br />
| description = make tools to help visualize the effect of various rules<br />
| rationale = TODO see https://github.com/Jakespringer/dapertium for an example<br />
| mentors = [[User:Firespeaker|Jonathan Washington]], [[User:Sevilay Bayatlı|Sevilay Bayatlı]], [[User:Popcorndude]]<br />
| more = /Visualization tools<br />
}}<br />
<br />
{{IdeaSummary<br />
| name = Extend Weighted transfer rules<br />
| difficulty = Medium<br />
| length = either<br />
| skills = C++, python<br />
| description = The weighted transfer module is already applied to the chunker transfer rules. The idea here is to extend that module so that it also applies to interchunk and postchunk transfer rules. <br />
| rationale = As a resource see https://github.com/aboelhamd/Weighted-transfer-rules-module<br />
| mentors = [[User: Sevilay Bayatlı|Sevilay Bayatlı]]<br />
| more = /Make a module <br />
}}<br />
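Conceptually, a weighted rule module scores the rules whose patterns match a chunk and applies the best one. A toy sketch of that selection step (the rule format here is invented, not Apertium's):<br />
<br />
```python
# Toy weighted rule selection: among the rules whose tag pattern
# matches the input chunk, the highest-weighted rule wins.
rules = [
    {"name": "r1", "pattern": ("n", "adj"), "weight": 0.4},
    {"name": "r2", "pattern": ("n", "adj"), "weight": 0.9},
    {"name": "r3", "pattern": ("n",),       "weight": 0.7},
]

def best_rule(chunk_tags, rules):
    matching = [r for r in rules if r["pattern"] == tuple(chunk_tags)]
    if not matching:
        return None  # no rule applies; fall back to default behaviour
    return max(matching, key=lambda r: r["weight"])["name"]

assert best_rule(["n", "adj"], rules) == "r2"
assert best_rule(["v"], rules) is None
```
<br />
Extending the module means making this same selection available at the interchunk and postchunk stages, not just in the chunker.<br />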
<br />
{{IdeaSummary<br />
| name = Automatic Error-Finder / Backpropagation<br />
| difficulty = Hard<br />
| length = long<br />
| skills = python?<br />
| description = Develop a tool to locate the approximate source of translation errors in the pipeline.<br />
| rationale = Being able to generate a list of probable error sources automatically makes it possible to prioritize issues by frequency, frees up developer time, and is a first step towards automated generation of better rules.<br />
| mentors = [[User:Popcorndude]]<br />
| more = /Backpropagation<br />
}}<br />
<br />
{{IdeaSummary<br />
| name = Localization (l10n/i18n) of Apertium tools<br />
| difficulty = Medium<br />
| length = short<br />
| skills = C++<br />
| description = All our command line tools are currently hardcoded as English-only and it would be good if this were otherwise. [https://github.com/apertium/organisation/issues/28#issuecomment-803474833 Coding Challenge]<br />
| rationale = ...<br />
| mentors = [[User:Tino_Didriksen|Tino Didriksen]]<br />
| more = https://github.com/apertium/organisation/issues/28 Github<br />
}}<br />
<br />
{{IdeaSummary<br />
| name = Language Server Protocol<br />
| difficulty = medium<br />
| length = short<br />
| skills = any programming language<br />
| description = Build a [https://microsoft.github.io/language-server-protocol/ Language Server] for the various Apertium rule formats<br />
| rationale = We have some static analysis tools and syntax highlighters already and it would be great if we could combine and expand them to support more text editors.<br />
| mentors = [[User:Popcorndude]]<br />
| more = /Language Server Protocol<br />
}}<br />
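Whatever language the server itself is written in, it speaks JSON-RPC over a byte stream framed with a Content-Length header, as the LSP base protocol specifies. A minimal sketch of that framing:<br />
<br />
```python
import json

def frame(payload):
    """Encode one LSP message: a Content-Length header, a blank
    line, then the JSON-RPC body (per the LSP base protocol)."""
    body = json.dumps(payload).encode("utf-8")
    return b"Content-Length: %d\r\n\r\n" % len(body) + body

def unframe(data):
    """Decode a single framed message back into a dict."""
    header, _, rest = data.partition(b"\r\n\r\n")
    length = int(header.split(b":")[1])
    return json.loads(rest[:length])

msg = {"jsonrpc": "2.0", "id": 1, "method": "initialize",
       "params": {"capabilities": {}}}
assert unframe(frame(msg)) == msg
```
<br />
On top of this transport, the server would answer requests like diagnostics or completion for lexd, twol, CG, and transfer files.<br />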
<br />
{{IdeaSummary<br />
| name = WASM Compilation<br />
| difficulty = hard<br />
| length = long<br />
| skills = C++, Javascript<br />
| description = Compile the pipeline modules to WASM and provide JS wrappers for them.<br />
| rationale = There are situations where it would be nice to be able to run the entire pipeline in the browser<br />
| mentors = [[User:Tino Didriksen|Tino Didriksen]]<br />
| more = /WASM<br />
}}<br />
<br />
== Web ==<br />
<br />
If you know Python and JavaScript, here are some ideas for improving our [https://apertium.org website]. Some of these should be fairly short, and it would be a good idea to talk to the mentors about doing a couple of them together.<br />
<br />
{{IdeaSummary<br />
| name = Web API extensions<br />
| difficulty = medium<br />
| length = short<br />
| skills = Python<br />
| description = Update the web API for Apertium to expose all Apertium modes <br />
| rationale = The current Web API misses out on a lot of functionality, like phonemicisation, segmentation, transliteration, and paradigm generation.<br />
| mentors = [[User:Francis Tyers|Francis Tyers]], [[User:Firespeaker|Jonathan Washington]], [[User:Xavivars|Xavi Ivars]]<br />
| more = /Apertium APY<br />
}}<br />
<br />
[[Category:Development]]<br />
[[Category:Google Summer of Code]]</div>Firespeakerhttps://wiki.apertium.org/w/index.php?title=Adding_orthography_conversion_to_a_language_module&diff=74153Adding orthography conversion to a language module2023-01-12T15:40:36Z<p>Firespeaker: Created page with " == One approach == This accomplishes the following goal: * Analysis of two orthographies (by default) * Generation in orthography of choice (defaults to transducer orthogra..."</p>
<hr />
<div><br />
<br />
== One approach ==<br />
<br />
This accomplishes the following goals:<br />
* Analysis of two orthographies (by default)<br />
* Generation in orthography of choice (defaults to transducer orthography)<br />
* Conversion from orthography 1 to orthography 2<br />
<br />
What's missing/broken:<br />
* Explicit conversion of orthography 2 to orthography 1<br />
** Will work in some simple cases. Other cases will need explicit orthography conversion <code>lexd</code> and <code>twol</code> from orthography 2 to orthography 1<br />
* Orthography-specific spellrelax<br />
** See [https://github.com/apertium/apertium-krc/blob/master/Makefile.am apertium-krc's Makefile] for an example of how to do this.<br />
<br />
=== A diff ===<br />
<br />
* Replace <code>abc</code> with language code<br />
* Replace <code>ORTH1</code> with abbreviation for orthography 1<br />
* Replace <code>ORTH2</code> with abbreviation for orthography 2<br />
<br />
<pre><br />
diff --git a/Makefile.am b/Makefile.am<br />
index 3537fd7..86d9751 100644<br />
--- a/Makefile.am<br />
+++ b/Makefile.am<br />
@@ -3,6 +3,10 @@<br />
###############################################################################<br />
<br />
LANG1=abc<br />
+SCRIPT1=ORTH1<br />
+SCRIPT2=ORTH2<br />
+LANG1SCRIPT1=$(LANG1)@$(SCRIPT1)<br />
+LANG1SCRIPT2=$(LANG1)@$(SCRIPT2)<br />
BASENAME=apertium-$(LANG1)<br />
<br />
TARGETS_COMMON = \<br />
@@ -14,6 +18,12 @@ TARGETS_COMMON = \<br />
$(LANG1).autogen.att.gz \<br />
$(LANG1).autopgen.bin \<br />
$(LANG1).rlx.bin \<br />
+ $(LANG1SCRIPT1).autogen.hfst \<br />
+ $(LANG1SCRIPT2).autogen.hfst \<br />
+ $(LANG1SCRIPT1).autogen.bin \<br />
+ $(LANG1SCRIPT2).autogen.bin \<br />
+ $(LANG1).$(SCRIPT1)-$(SCRIPT2).hfst \<br />
+	$(LANG1).$(SCRIPT2)-$(SCRIPT1).hfst \<br />
$(LANG1).zhfst<br />
<br />
# This include defines goals for install-modes, .deps/.d, autobil.prefixes and .mode files:<br />
@@ -49,14 +59,73 @@ TARGETS_COMMON = \<br />
.deps/$(LANG1).LR.hfst: .deps/$(LANG1).LR.lexd.hfst .deps/$(LANG1).twol.hfst<br />
hfst-compose-intersect -1 .deps/$(LANG1).LR.lexd.hfst -2 .deps/$(LANG1).twol.hfst -o $@<br />
<br />
-$(LANG1).autogen.hfst: .deps/$(LANG1).RL.hfst<br />
+# Default autogen is SCRIPT1<br />
+<br />
+$(LANG1SCRIPT1).autogen.hfst: .deps/$(LANG1).RL.hfst<br />
hfst-fst2fst -O $< -o $@<br />
<br />
+$(LANG1).autogen.hfst: $(LANG1SCRIPT1).autogen.hfst<br />
+ cp $< $@<br />
+<br />
.deps/$(LANG1).spellrelax.hfst: $(BASENAME).$(LANG1).spellrelax .deps/.d<br />
hfst-regexp2fst -S -o $@ $<<br />
<br />
-$(LANG1).automorf.hfst: .deps/$(LANG1).LR.hfst .deps/$(LANG1).spellrelax.hfst<br />
- hfst-compose -1 $< -2 .deps/$(LANG1).spellrelax.hfst | hfst-invert | hfst-fst2fst -O -o $@<br />
+# SCRIPT2 autogen<br />
+$(LANG1SCRIPT2).autogen.hfst: .deps/$(LANG1SCRIPT1)-$(LANG1SCRIPT2).hfst .deps/$(LANG1).RL.hfst .deps/.d<br />
+ hfst-compose -1 `echo $(word 2,$^)` -2 $< | hfst-fst2fst -w -o $@<br />
+<br />
+# Base orthographic converter<br />
+<br />
+# SCRIPT1 automorf<br />
+.deps/$(LANG1SCRIPT1).automorf.hfst: .deps/$(LANG1).LR.hfst .deps/$(LANG1).spellrelax.hfst .deps/.d<br />
+ hfst-compose-intersect -1 $< -2 .deps/$(LANG1).spellrelax.hfst | hfst-invert -o $@<br />
+<br />
+# SCRIPT2 automorf<br />
+.deps/$(LANG1SCRIPT2).automorf.hfst: .deps/$(LANG1SCRIPT1)-$(LANG1SCRIPT2).hfst .deps/$(LANG1).LR.hfst .deps/$(LANG1).spellrelax.hfst .deps/.d<br />
+ hfst-compose -1 `echo $(word 2,$^)` -2 $< | hfst-compose-intersect -1 - -2 `echo $(word 3,$^)` | hfst-invert -o $@<br />
+<br />
+# automorf that analyses SCRIPT1 and SCRIPT2<br />
+$(LANG1).automorf.hfst: .deps/$(LANG1SCRIPT1).automorf.hfst .deps/$(LANG1SCRIPT2).automorf.hfst<br />
+ hfst-invert $< -o .deps/$(LANG1SCRIPT1).REVautomorf.hfst<br />
+ hfst-invert `echo $(word 2,$^)` -o .deps/$(LANG1SCRIPT2).REVautomorf.hfst<br />
+ hfst-union -1 .deps/$(LANG1SCRIPT1).REVautomorf.hfst -2 .deps/$(LANG1SCRIPT2).REVautomorf.hfst | hfst-invert | hfst-minimise | hfst-fst2fst -w -o $@<br />
+<br />
+# SCRIPT1 to SCRIPT2 transducer<br />
+.deps/$(LANG1SCRIPT1)-$(LANG1SCRIPT2).hfst: .deps/$(LANG1SCRIPT1)-$(LANG1SCRIPT2).lexd.hfst .deps/$(LANG1SCRIPT1)-$(LANG1SCRIPT2).twol.hfst .deps/.d<br />
+ hfst-compose-intersect -1 $< -2 .deps/$(LANG1SCRIPT1)-$(LANG1SCRIPT2).twol.hfst -o $@<br />
+<br />
+# compile the first stage of the SCRIPT1-SCRIPT2 transliteration transducer<br />
+.deps/$(LANG1SCRIPT1)-$(LANG1SCRIPT2).lexd.hfst: $(BASENAME).$(SCRIPT1)-$(SCRIPT2).lexd .deps/.d<br />
+ lexd $< .deps/$(LANG1SCRIPT1)-$(LANG1SCRIPT2).lexd.att<br />
+ hfst-txt2fst .deps/$(LANG1SCRIPT1)-$(LANG1SCRIPT2).lexd.att -o $@<br />
+<br />
+# compile the second stage of the SCRIPT1-SCRIPT2 transliteration transducer<br />
+.deps/$(LANG1SCRIPT1)-$(LANG1SCRIPT2).twol.hfst: $(BASENAME).$(SCRIPT1)-$(SCRIPT2).twol<br />
+ hfst-twolc $< -o $@<br />
+<br />
+# SCRIPT1 to SCRIPT2 orthographic converter<br />
+<br />
+$(LANG1).$(SCRIPT1)-$(SCRIPT2).hfst: .deps/$(LANG1SCRIPT1)-$(LANG1SCRIPT2).hfst .deps/.d<br />
+ hfst-fst2fst $< -Oo $@<br />
+<br />
+.deps/$(LANG1SCRIPT1)-$(LANG1SCRIPT2).att: .deps/$(LANG1SCRIPT1)-$(LANG1SCRIPT2).hfst .deps/.d<br />
+ hfst-fst2txt $< -o $@<br />
+<br />
+$(LANG1).$(SCRIPT1)-$(SCRIPT2).bin: .deps/$(LANG1SCRIPT1)-$(LANG1SCRIPT2).att .deps/.d<br />
+ lt-comp -H lr $< $@<br />
+<br />
+# SCRIPT2 to SCRIPT1 orthographic converter<br />
+<br />
+$(LANG1).$(SCRIPT2)-$(SCRIPT1).hfst: .deps/$(LANG1SCRIPT1)-$(LANG1SCRIPT2).hfst .deps/.d<br />
+ hfst-invert $< | hfst-fst2fst -Oo $@<br />
+<br />
+.deps/$(LANG1SCRIPT2)-$(LANG1SCRIPT1).att: .deps/$(LANG1SCRIPT1)-$(LANG1SCRIPT2).hfst .deps/.d<br />
+	hfst-invert $< | hfst-fst2txt -o $@<br />
+<br />
+$(LANG1).$(SCRIPT2)-$(SCRIPT1).bin: .deps/$(LANG1SCRIPT2)-$(LANG1SCRIPT1).att .deps/.d<br />
+ lt-comp -H lr $< $@<br />
+<br />
+# bin files of automorfs and autogens<br />
<br />
$(LANG1).autogen.att.gz: $(LANG1).autogen.hfst<br />
hfst-fst2txt $< | gzip -9 -c -n > $@<br />
@@ -64,10 +133,13 @@ $(LANG1).autogen.att.gz: $(LANG1).autogen.hfst<br />
$(LANG1).automorf.att.gz: $(LANG1).automorf.hfst<br />
hfst-fst2txt $< | gzip -9 -c -n > $@<br />
<br />
-$(LANG1).autogen.bin: $(LANG1).autogen.att.gz .deps/.d<br />
+$(LANG1SCRIPT1).autogen.bin: $(LANG1).autogen.att.gz .deps/.d<br />
zcat < $< > .deps/$(LANG1).autogen.att<br />
lt-comp lr .deps/$(LANG1).autogen.att $@<br />
<br />
+$(LANG1).autogen.bin: $(LANG1SCRIPT1).autogen.bin<br />
+ cp $< $@<br />
+<br />
$(LANG1).automorf.bin: $(LANG1).automorf.att.gz .deps/.d<br />
zcat < $< > .deps/$(LANG1).automorf.att<br />
lt-comp lr .deps/$(LANG1).automorf.att $@<br />
@@ -75,6 +147,14 @@ $(LANG1).automorf.bin: $(LANG1).automorf.att.gz .deps/.d<br />
$(LANG1).autopgen.bin: $(BASENAME).post-$(LANG1).dix<br />
lt-comp lr $< $@<br />
<br />
+$(LANG1SCRIPT2).autogen.att.gz: $(LANG1SCRIPT2).autogen.hfst<br />
+ hfst-fst2txt $< | gzip -9 -c -n > $@<br />
+<br />
+$(LANG1SCRIPT2).autogen.bin: $(LANG1SCRIPT2).autogen.att.gz .deps/.d<br />
+ zcat < $< > .deps/$(LANG1SCRIPT2).autogen.att<br />
+ lt-comp lr .deps/$(LANG1SCRIPT2).autogen.att $@<br />
+<br />
+<br />
###############################################################################<br />
## Debugging transducers (for testvoc)<br />
###############################################################################<br />
diff --git a/modes.xml b/modes.xml<br />
index 49740ed..1e152f9 100644<br />
--- a/modes.xml<br />
+++ b/modes.xml<br />
@@ -36,6 +36,38 @@<br />
</pipeline><br />
</mode><br />
<br />
+ <mode name="abc_ORTH1-gener" install="yes"><br />
+ <pipeline><br />
+ <program name="lt-proc -g"><br />
+ <file name="abc@ORTH1.autogen.bin"/><br />
+ </program><br />
+ </pipeline><br />
+ </mode><br />
+<br />
+ <mode name="abc_ORTH2-gener" install="yes"><br />
+ <pipeline><br />
+ <program name="lt-proc -g"><br />
+ <file name="abc@ORTH2.autogen.bin"/><br />
+ </program><br />
+ </pipeline><br />
+ </mode><br />
+<br />
+ <mode name="abc_ORTH1-abc_ORTH2" install="yes"><br />
+ <pipeline><br />
+ <program name="hfst-proc"><br />
+ <file name="abc.ORTH1-ORTH2.hfst"/><br />
+ </program><br />
+ </pipeline><br />
+ </mode><br />
+<br />
+ <mode name="abc_ORTH2-abc_ORTH1" install="yes"><br />
+ <pipeline><br />
+ <program name="hfst-proc"><br />
+ <file name="abc.ORTH2-ORTH1.hfst"/><br />
+ </program><br />
+ </pipeline><br />
+ </mode><br />
+<br />
<mode name="abc-tagger" install="yes"><br />
<pipeline><br />
<program name="lt-proc -w"><br />
</pre></div>Firespeakerhttps://wiki.apertium.org/w/index.php?title=Apertium_on_Mac_OS_X&diff=74103Apertium on Mac OS X2022-08-19T17:50:34Z<p>Firespeaker: /* Basic Installation Using Homebrew */</p>
<hr />
<div>[[Installation sur Mac OS X|En français]]<br />
<br />
Use either Homebrew or Macports.<br />
<br />
== Basic Installation Using Homebrew ==<br />
<br />
# Make sure '''Homebrew''' is installed<br />
## If not, you can get it from https://brew.sh<br />
# Install all dependencies by running the following in the terminal:<br />
#: <pre><br />
#:: brew install gperftools help2man pcre icu4c perl518 gawk autoconf automake pkg-config cmake wget gcc</pre><br />
#: Note that if you have a slightly older version of macOS, you'll want to use <code>apple-gcc42</code> instead of <code>gcc</code><br />
# <pre> curl https://apertium.projectjj.com/osx/install-nightly.sh | sudo zsh</pre><br />
#: Then type your password<br />
#: Note that you'll probably need to type your sudo password, and the prompt may be covered up by the curl display.<br />
#: Note also that if you have an old version of macOS, you may need <code>bash</code> instead of <code>zsh</code><br />
<br />
== Basic Installation Using Macports ==<br />
<br />
# Make sure '''XCode''' is installed<br />
## If not, download it from http://developer.apple.com/tools/xcode/<br />
## Be sure to include '''Command Line Tools'''<br />
## (http://railsapps.github.io/xcode-command-line-tools.html is a nice guide if you get stuck here)<br />
# Make sure '''Macports''' is installed<br />
## If not, download it from http://www.macports.org/install.php<br />
# Install all dependencies by running the following in the terminal:<br />
#: <pre><br />
#:: sudo port install autoconf automake expat flex \<br />
#:: gettext gperf help2man libiconv libtool \<br />
#:: libxml2 libxslt m4 ncurses p5-locale-gettext \<br />
#:: pcre perl5 pkgconfig zlib gawk icu cmake boost gperftools</pre><br />
# <pre> curl https://apertium.projectjj.com/osx/install-nightly.sh | sudo bash</pre><br />
<br />
== Language data packages ==<br />
<br />
If you've installed tools with install-nightly.sh, you can install language data with<br />
https://apertium.projectjj.com/osx/install-nightly-data.sh . <br />
<br />
First, download the script and make it executable<br />
<br />
 curl -O https://apertium.projectjj.com/osx/install-nightly-data.sh<br />
chmod +x install-nightly-data.sh<br />
<br />
If the curl command does not deliver install-nightly-data.sh, just download it via your browser (click the link). Remember to remove the .txt extension the browser may have added to the filename, then run the chmod +x command as shown.<br />
<br />
Then to install a language pair:<br />
<br />
./install-nightly-data.sh apertium-eng-deu<br />
<br />
and try it out:<br />
<br />
echo 'Hello world' | apertium eng-deu<br />
<br />
This language data is compiled '''nightly''' from git versions, but you'll have to re-run the script to get new versions.<br />
<br />
The script can only fetch what's there. You can see available language pairs / packages at:<br />
<br />
* https://apertium.projectjj.com/osx/nightly/data.php<br />
<br />
Remember, you need to ''install tools first'' (see above sections).<br />
<br />
<br />
<br />
There is also a script for '''release''' versions of pairs, https://apertium.projectjj.com/osx/install-release-data.sh , which can fetch the pairs listed at:<br />
<br />
* https://apertium.projectjj.com/osx/release/data.php<br />
<br />
== Compiling from Source ==<br />
<br />
<div style="background-color:pink; text-align:center; line-height:2.5; border: 1px solid crimson;">Are you sure this is the part you want?<br/>If you're here because one of the previous sections didn't work, log on to [[IRC]] and describe what went wrong.</div><br />
<br />
<span style="color: #f00;">You probably want [[Prerequisites for Mac OS X]] which gives you all the dev tools you might want</span> – this page has more in-depth documentation on compiling the core/dev tools from source.<br />
<br />
<br />
There are two main options for installing, "system" and "local":<br />
<br />
* [[Apertium on Mac OS X (System)]] &mdash; The fastest and easiest way is to install in your main system path (that is in <code>/usr/local</code>), choose this if you have root access on your system.<br />
* [[Apertium on Mac OS X (Local)]] &mdash; The second way is to install locally (that is in your home directory, e.g. <code>/Users/myname/Local</code>), this is more contained but slower and more difficult, choose this if you don't have root, or want to make the Apertium installation completely separated from your main system. This can be better for developers.<br />
<br />
If you do not have any experience compiling source code, you can also follow the more user-friendly Mac guide [[Apertium on Mac OS X (User)]], which walks you through every step.<br />
<br />
<br />
<br />
[[Category:Documentation]]<br />
[[Category:Installation]]<br />
[[Category:Documentation in English]]</div>Firespeakerhttps://wiki.apertium.org/w/index.php?title=Apertium_on_Mac_OS_X&diff=74102Apertium on Mac OS X2022-08-19T17:49:40Z<p>Firespeaker: /* Basic Installation Using Homebrew */</p>
<hr />
<div>[[Installation sur Mac OS X|En français]]<br />
<br />
Use either Homebrew or Macports.<br />
<br />
== Basic Installation Using Homebrew ==<br />
<br />
# Make sure '''Homebrew''' is installed<br />
## If not, you can get it from https://brew.sh<br />
# Install all dependencies by running the following in the terminal:<br />
#: <pre><br />
#:: brew install gperftools help2man pcre icu4c perl518 gawk autoconf automake pkg-config cmake wget gcc</pre><br />
#: Note that if you have a slightly older version of macOS, you'll want to use <code>apple-gcc42</code> instead of <code>gcc</code><br />
# <pre> curl https://apertium.projectjj.com/osx/install-nightly.sh | sudo zsh</pre><br />
#: Then type your password<br />
#: Note that you'll probably need to type your sudo password, and the prompt may be covered up by the curl display.<br />
#: Note also that if you have an old version of macOS, you may need <code>bash</code> instead of <code>zsh</code><br />
<br />
== Basic Installation Using Macports ==<br />
<br />
# Make sure '''XCode''' is installed<br />
## If not, download it from http://developer.apple.com/tools/xcode/<br />
## Be sure to include '''Command Line Tools'''<br />
## (http://railsapps.github.io/xcode-command-line-tools.html is a nice guide if you get stuck here)<br />
# Make sure '''Macports''' is installed<br />
## If not, download it from http://www.macports.org/install.php<br />
# Install all dependencies by running the following in the terminal:<br />
#: <pre><br />
#:: sudo port install autoconf automake expat flex \<br />
#:: gettext gperf help2man libiconv libtool \<br />
#:: libxml2 libxslt m4 ncurses p5-locale-gettext \<br />
#:: pcre perl5 pkgconfig zlib gawk icu cmake boost gperftools</pre><br />
# <pre> curl https://apertium.projectjj.com/osx/install-nightly.sh | sudo bash</pre><br />
<br />
== Language data packages ==<br />
<br />
If you've installed tools with install-nightly.sh, you can install language data with<br />
https://apertium.projectjj.com/osx/install-nightly-data.sh . <br />
<br />
First, download the script and make it executable<br />
<br />
 curl -O https://apertium.projectjj.com/osx/install-nightly-data.sh<br />
chmod +x install-nightly-data.sh<br />
<br />
If the curl command does not deliver install-nightly-data.sh, just download it via your browser (click the link). Remember to remove the .txt extension the browser may have added to the filename, then run the chmod +x command as shown.<br />
<br />
Then to install a language pair:<br />
<br />
./install-nightly-data.sh apertium-eng-deu<br />
<br />
and try it out:<br />
<br />
echo 'Hello world' | apertium eng-deu<br />
<br />
This language data is compiled '''nightly''' from git versions, but you'll have to re-run the script to get new versions.<br />
<br />
The script can only fetch what's there. You can see available language pairs / packages at:<br />
<br />
* https://apertium.projectjj.com/osx/nightly/data.php<br />
<br />
Remember, you need to ''install tools first'' (see above sections).<br />
<br />
<br />
<br />
There is also a script for '''release''' versions of pairs, https://apertium.projectjj.com/osx/install-release-data.sh , which can fetch the pairs listed at:<br />
<br />
* https://apertium.projectjj.com/osx/release/data.php<br />
<br />
== Compiling from Source ==<br />
<br />
<div style="background-color:pink; text-align:center; line-height:2.5; border: 1px solid crimson;">Are you sure this is the part you want?<br/>If you're here because one of the previous sections didn't work, log on to [[IRC]] and describe what went wrong.</div><br />
<br />
<span style="color: #f00;">You probably want [[Prerequisites for Mac OS X]] which gives you all the dev tools you might want</span> – this page has more in-depth documentation on compiling the core/dev tools from source.<br />
<br />
<br />
There are two main options for installing, "system" and "local":<br />
<br />
* [[Apertium on Mac OS X (System)]] &mdash; The fastest and easiest way is to install in your main system path (that is in <code>/usr/local</code>), choose this if you have root access on your system.<br />
* [[Apertium on Mac OS X (Local)]] &mdash; The second way is to install locally (that is in your home directory, e.g. <code>/Users/myname/Local</code>), this is more contained but slower and more difficult, choose this if you don't have root, or want to make the Apertium installation completely separated from your main system. This can be better for developers.<br />
<br />
If you do not have any experience compiling source code, you can also follow the more user-friendly Mac guide [[Apertium on Mac OS X (User)]], which walks you through every step.<br />
<br />
<br />
<br />
[[Category:Documentation]]<br />
[[Category:Installation]]<br />
[[Category:Documentation in English]]</div>Firespeakerhttps://wiki.apertium.org/w/index.php?title=How_to_use_null_flush_in_python&diff=74090How to use null flush in python2022-07-15T18:12:32Z<p>Firespeaker: </p>
<hr />
<div>Many Apertium executables have "null flush" modes (usually with <tt>-z</tt>), which allow the executable to run once, stay open, accept input, and flush the output only on a null character.<br />
<br />
Here's a simple example of how to implement a wrapper in python3 around <code>lt-proc</code> and a transducer.<br />
<br />
<pre><br />
from subprocess import Popen, PIPE<br />
<br />
things_to_transduce = ['foo', 'bar', 'hargle', 'bargle']<br />
<br />
transducer_process = Popen(["lt-proc", "-t", "-z", "transducer.bin"], stdin=PIPE, stdout=PIPE)<br />
<br />
def transduce(inputString):<br />
transducer_process.stdin.write(bytes('{}\n'.format(inputString), 'utf-8'))<br />
transducer_process.stdin.write(b'\0')<br />
transducer_process.stdin.flush()<br />
return repr(transducer_process.stdout.readline().strip(b'\0').strip(b'\n').decode())<br />
<br />
for thing_to_transduce in things_to_transduce:<br />
print(transduce(thing_to_transduce))<br />
<br />
transducer_process.stdin.close()<br />
transducer_process.wait()<br />
</pre><br />
<br />
There's a [https://github.com/apertium/lttoolbox/blob/master/tests/basictest.py more involved example in the lttoolbox tests], which adds handling for when the process hangs.<br />
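A minimal sketch of such hang-handling (not the actual test code; the names here are illustrative) is to wait on the process's stdout with a timeout before reading up to the terminating null byte. <code>cat</code> stands in for <code>lt-proc -z</code> below so the snippet is self-contained:<br />

```python
import selectors
from subprocess import Popen, PIPE

def transduce_with_timeout(proc, input_string, timeout=5.0):
    """Send input_string followed by NUL, then wait up to `timeout`
    seconds for output to appear; return the decoded output up to the
    next NUL, or None if the process appears to have hung."""
    proc.stdin.write(input_string.encode('utf-8') + b'\0')
    proc.stdin.flush()
    sel = selectors.DefaultSelector()
    sel.register(proc.stdout, selectors.EVENT_READ)
    ready = sel.select(timeout)  # block until output is readable, or time out
    sel.close()
    if not ready:
        return None  # nothing arrived in time: treat the process as hung
    out = b''
    while True:
        byte = proc.stdout.read(1)
        if byte in (b'\0', b''):  # NUL terminator or EOF
            break
        out += byte
    return out.decode('utf-8').strip()

# `cat` echoes its input, NUL included, so it acts like a trivial
# null-flush process for demonstration purposes.
echo_process = Popen(['cat'], stdin=PIPE, stdout=PIPE)
print(transduce_with_timeout(echo_process, 'hargle'))  # hargle
echo_process.stdin.close()
echo_process.wait()
```

Two simplifications to note: after <code>select()</code> reports readiness, the byte-by-byte reads could still block if the process emits output very slowly, and Python's buffered reader can hold bytes that <code>select()</code> no longer sees; the real tests are more careful about both.<br />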
<br />
<br />
[[Category:Documentation]]</div>Firespeakerhttps://wiki.apertium.org/w/index.php?title=How_to_use_null_flush_in_python&diff=74089How to use null flush in python2022-07-15T17:41:19Z<p>Firespeaker: </p>
<hr />
<div>Many Apertium executables have "null flush" modes (usually with <tt>-z</tt>), which allow the executable to run once, stay open, accept input, and flush the output only on a null character.<br />
<br />
Here's a simple example of how to implement a wrapper in python3 around <code>lt-proc</code> and a transducer.<br />
<br />
<pre><br />
from subprocess import Popen, PIPE<br />
<br />
things_to_transduce = ['foo', 'bar', 'hargle', 'bargle']<br />
<br />
transducer_process = Popen(["lt-proc", "-t", "-z", "transducer.bin"], stdin=PIPE, stdout=PIPE)<br />
<br />
def transduce(inputString):<br />
transducer_process.stdin.write(bytes('{}\n'.format(inputString), 'utf-8'))<br />
transducer_process.stdin.write(b'\0')<br />
transducer_process.stdin.flush()<br />
return repr(transducer_process.stdout.readline().strip(b'\0').strip(b'\n').decode())<br />
<br />
for thing_to_transduce in things_to_transduce:<br />
print(transduce(thing_to_transduce))<br />
<br />
transducer_process.stdin.close()<br />
transducer_process.wait()<br />
</pre><br />
<br />
<br />
[[Category:Documentation]]</div>Firespeakerhttps://wiki.apertium.org/w/index.php?title=How_to_use_null_flush_in_python&diff=74088How to use null flush in python2022-07-15T17:41:01Z<p>Firespeaker: Created page with "Many Apertium executables have "null flush" modes (usually with <tt>-z</tt>), which allows the executable to run once, stay open, accept input, and flush the output only on a..."</p>
<hr />
<div>Many Apertium executables have "null flush" modes (usually with <tt>-z</tt>), which allow the executable to run once, stay open, accept input, and flush the output only on a null character.<br />
<br />
Here's a simple example of how to implement a wrapper in python3 around <code>lt-proc</code> and a transducer.<br />
<br />
<pre><br />
from subprocess import Popen, PIPE<br />
<br />
things_to_transduce = ['foo', 'bar', 'hargle', 'bargle']<br />
<br />
transducer_process = Popen(["lt-proc", "-t", "-z", "transducer.bin"], stdin=PIPE, stdout=PIPE)<br />
<br />
def transduce(inputString):<br />
transducer_process.stdin.write(bytes('{}\n'.format(inputString), 'utf-8'))<br />
transducer_process.stdin.write(b'\0')<br />
transducer_process.stdin.flush()<br />
return repr(transducer_process.stdout.readline().strip(b'\0').strip(b'\n').decode())<br />
<br />
for thing_to_transduce in things_to_transduce:<br />
print(transduce(thing_to_transduce))<br />
<br />
transducer_process.stdin.close()<br />
transducer_process.wait()<br />
</pre></div>Firespeakerhttps://wiki.apertium.org/w/index.php?title=Google_Season_of_Docs_2022/Organize_and_Update_Apertium_User_Documentation&diff=73905Google Season of Docs 2022/Organize and Update Apertium User Documentation2022-03-10T16:29:09Z<p>Firespeaker: /* About the project */</p>
<hr />
<div><br />
== About Apertium ==<br />
<br />
== About the project ==<br />
<br />
=== The problem ===<br />
[https://wiki.apertium.org Apertium's wiki] and other documentation are out of date, poorly organized, not visible enough, and just plain not user-friendly.<br />
<br />
This ranges from documentation of individual tools not reflecting their current state, to our best how-to guides reflecting how things were done a decade ago. Documentation is scattered between the Apertium wiki, individual GitHub repos, an out-of-date PDF "Book", and even published papers and third-party sites.<br />
<br />
The result is new users and contributors wasting time reading out-of-date materials, and even long-time contributors having no way to be aware of changes to the tools they use.<br />
<br />
=== The solution ===<br />
<br />
The solution to the above problem is to create updated documentation for all pipeline modules and/or a full tutorial.<br />
<br />
Ideally, documentation on a given tool will exist in a single place, and a full tutorial will also have a single unified source. One possibility is to generate one set of docs from another, or from a single unified source. For example, if we want tools to be documented in both their GitHub repos and on the wiki, we should generate one set of documentation from the other (or a third source). If we want a full tutorial to be on the wiki but also available in PDF format, then we should designate one source as the original and generate the others from it.<br />
<br />
=== The scope ===<br />
<br />
* Overview of the Apertium platform<br />
* All stages of the Apertium pipeline<br />
* The main approaches to and tools for each stage<br />
<br />
=== Measuring success ===<br />
<br />
== Timeline ==<br />
<br />
== Budget ==</div>Firespeakerhttps://wiki.apertium.org/w/index.php?title=Google_Season_of_Docs_2022/Organize_and_Update_Apertium_User_Documentation&diff=73904Google Season of Docs 2022/Organize and Update Apertium User Documentation2022-03-10T16:28:07Z<p>Firespeaker: /* About the project */</p>
<hr />
<div><br />
== About Apertium ==<br />
<br />
== About the project ==<br />
<br />
=== The problem ===<br />
[https://wiki.apertium.org Apertium's wiki] and other documentation are out of date, poorly organized, not visible enough, and just plain not user-friendly.<br />
<br />
This ranges from documentation of individual tools not reflecting their current state, to our best how-to guides reflecting how things were done a decade ago. Documentation is scattered between the Apertium wiki, individual GitHub repos, an out-of-date PDF "Book", and even published papers and third-party sites.<br />
<br />
The result is new users and contributors wasting time reading out-of-date materials, and even long-time contributors having no way to be aware of changes to the tools they use.<br />
<br />
=== The solution ===<br />
<br />
The solution to the above problem is to create updated documentation for all pipeline modules and/or a full tutorial.<br />
<br />
Ideally, documentation on a given tool will exist in a single place, and a full tutorial will also have a single unified source. One possibility is to generate one set of docs from another, or from a single unified source. For example, if we want tools to be documented in both their GitHub repos and on the wiki, we should generate one set of documentation from the other (or a third source). If we want a full tutorial to be on the wiki but also available in PDF format, then we should designate one source as the original and generate the others from it.<br />
<br />
=== The scope ===<br />
<br />
=== Measuring success ===<br />
<br />
== Timeline ==<br />
<br />
== Budget ==</div>Firespeakerhttps://wiki.apertium.org/w/index.php?title=Google_Season_of_Docs_2022/Organize_and_Update_Apertium_User_Documentation&diff=73903Google Season of Docs 2022/Organize and Update Apertium User Documentation2022-03-10T16:23:38Z<p>Firespeaker: /* The problem */</p>
<hr />
<div><br />
== About Apertium ==<br />
<br />
== About the project ==<br />
<br />
=== The problem ===<br />
[https://wiki.apertium.org Apertium's wiki] and other documentation are out of date, poorly organized, not visible enough, and just plain not user-friendly.<br />
<br />
This ranges from documentation of individual tools not reflecting their current state, to our best how-to guides reflecting how things were done a decade ago.<br />
<br />
The result is new users and contributors wasting time reading out-of-date materials, and even long-time contributors having no way to be aware of changes to the tools they use.<br />
<br />
=== The scope ===<br />
<br />
=== Measuring success ===<br />
<br />
== Timeline ==<br />
<br />
== Budget ==</div>Firespeakerhttps://wiki.apertium.org/w/index.php?title=Google_Season_of_Docs_2022/Organize_and_Update_Apertium_User_Documentation&diff=73902Google Season of Docs 2022/Organize and Update Apertium User Documentation2022-03-10T15:54:40Z<p>Firespeaker: Created page with " == About Apertium == == About the project == === The problem === === The scope === === Measuring success === == Timeline == == Budget =="</p>
<hr />
<div><br />
== About Apertium ==<br />
<br />
== About the project ==<br />
<br />
=== The problem ===<br />
<br />
=== The scope ===<br />
<br />
=== Measuring success ===<br />
<br />
== Timeline ==<br />
<br />
== Budget ==</div>Firespeakerhttps://wiki.apertium.org/w/index.php?title=Google_Summer_of_Code/Application_2022&diff=73900Google Summer of Code/Application 20222022-02-21T15:31:39Z<p>Firespeaker: /* Mentors */</p>
<hr />
<div>== Register org ==<br />
<br />
=== Years previously participated in GSoC ===<br />
2021, 2020, 2019, 2018, 2017, 2016, 2014, 2013, 2012, 2011, 2010, 2009<br />
<br />
== Org Profile ==<br />
<br />
=== Website URL ===<br />
[http://wiki.apertium.org http://wiki.apertium.org]<br />
<br />
=== Logo ===<br />
[https://upload.wikimedia.org/wikipedia/commons/thumb/b/b4/Apertium_logo.svg/1214px-Apertium_logo.svg.png https://upload.wikimedia.org/wikipedia/commons/thumb/b/b4/Apertium_logo.svg/1214px-Apertium_logo.svg.png]<br />
<br />
=== Tagline ===<br />
A free/open-source machine translation platform<br />
<br />
=== Primary Open Source License ===<br />
GNU General Public License version 3<br />
<br />
=== Year organisation started ===<br />
<br />
2006 (???)<br />
<br />
=== Link to source code ===<br />
<br />
https://github.com/apertium/<br />
<br />
=== Organisation categories ===<br />
<br />
* Science and medicine (healthcare, biotech, life sciences, academic research, etc.)<br />
* Other<br />
<br />
=== Organisation technologies ===<br />
C++, python, bash, XML, javascript <br />
<br />
=== Organisation topics ===<br />
machine translation, natural language processing, less-resourced languages, language technology<br />
<br />
=== Organisation description ===<br />
<br />
Apertium is a free/open-source machine translation platform, and the organisation focuses on primarily symbolic language technology for less-resourced languages.<br />
<br />
=== Contributor guidance ===<br />
<br />
https://wiki.apertium.org/wiki/Top_tips_for_GSOC_applications<br />
<br />
=== Communication Methods ===<br />
<br />
* Chat: https://wiki.apertium.org/wiki/IRC<br />
* Mailing List / Forum: apertium-stuff@lists.sourceforge.net<br />
<br />
== Organisation questionnaire ==<br />
=== Why does your org want to participate in Google Summer of Code? ===<br />
Apertium has been part of GSoC for over a decade and it has been a great experience. Apertium loves GSoC: it supports free/open-source (FOS) software as much as we do! Apertium needs GSoC: it offers an incredible opportunity (and resources!) allowing us to spread the word about our project, to attract new developers and consolidate the contribution of existing developers through mentoring, and to improve the platform in many ways: improving the engine, generating new tools and user interfaces, making Apertium available to other applications, improving the quality of the languages currently supported, adding new languages to it. Apertium loves less-resourced languages and GSoC gives an opportunity for developers speaking them to generate FOS language technologies for them. Apertium will gain: more developers getting to know FOS software and the ethos that comes with it, contributing to it, and especially contributors who are passionate about languages and computers.<br />
<br />
<br />
=== What would your org consider to be a successful GSoC program? ===<br />
<br />
<!-- New contributors, new features completed, more code written, better being able to guide new developers into open source world, etc. --><br />
<br />
A successful GSoC would see any combination of newly released language pairs, the addition of new technologies to the Apertium framework, the addition of features to our web infrastructure, and a fresh round of developers becoming excited by Apertium. We would especially be happy to see a successful project form the basis of a published academic paper and to gain new long-term contributors.<br />
<br />
=== How will you keep mentors engaged with their GSoC contributors? ===<br />
We select our mentors from among very active developers, with long-term commitment to this 18-year-old project — they are people we know well and whom we have met face-to-face at conferences, workshops, or even in daily life; some of them teach and do research at universities or work at companies using Apertium. For this reason, it is quite unlikely for mentors to disappear, since most of them have been embedded in our community for years. However, there is always the possibility that some problem comes up, so we also assign back-up mentors to all contributors, in many cases more than one back-up. If a mentor cannot continue for whatever reason, one of the backup co-mentors will take over, and one of the organisation administrators (themselves experienced GSoC mentors) will take on the role of second backup mentor. <br />
<br />
=== How will you keep your GSoC contributors on schedule to complete their projects? ===<br />
<br />
Apertium only accepts applications with a well-defined weekly schedule, clear milestones and deliverables, and, if possible, a section on risk management (risks, their probability, their severity, & mitigating actions). Applications should also plan for holidays, exams, and other absences. Contributors are encouraged to let us know if they need to reschedule or take a break. Contributors may also need consultation when they are stuck or when personal matters interfere with their work: we will, as we have in the past, try our best to reach out to them, be open and friendly, and provide as much support as we can to help them out. We've been in situations like this too! Detailed scheduling will avoid both mentors and contributors wasting time. If a mentor reports the unscheduled disappearance of a contributor (an unexpected 72-hour silence), the contributor will be contacted by the administrators. If silence persists, their task will be frozen and we will report to Google, to proceed according to the rules of GSoC.<br />
<br />
=== How will you get your GSoC contributors involved in your community during GSoC? ===<br />
<br />
First, we encourage all prospective contributors to visit our IRC channel (irc.oftc.net#apertium) as often as possible, even before the start of the program, since that will help them find a suitable mentor and a useful project that they can work on. We advise them strongly to read our wiki pages and manuals, use our system, try to break it and fix it, and finally tell us about it. As a result, contributors get familiar with Apertium before the coding period starts, which increases their chances of ending up with a successful project. In addition, we define coding challenges for each of the proposed projects, which serve both as an entry task, and as a means for getting our contributors familiar with Apertium and involved in our community in the early stages of the program. Finally, during the coding stage, we are available to talk to our contributors on a daily basis and give them suggestions and advice when they get stuck.<br />
<br />
=== Anything else we should know? (optional) ===<br />
<br />
<br />
=== Is your organization part of any government? ===<br />
No<br />
<br />
== Program Application ==<br />
<br />
=== Ideas list ===<br />
<br />
https://wiki.apertium.org/wiki/Ideas_for_Google_Summer_of_Code<br />
<br />
=== Mentors ===<br />
(How many Mentors does your Organization have available to participate in this program?)<br />
<br />
* Jonathan<br />
* Xavi Ivars<br />
* Marc Riera<br />
* Daniel Swanson<br />
* Unhammer<br />
* nlhowell<br />
* Sevilay<br />
* Hèctor (if relevant project)<br />
(add your names here!)<br />
<br />
=== Program Retention Survey ===<br />
<br />
(We're looking for more details on how many of your students/GSoC contributors from the above program are still active in your community today.)<br />
<br />
* Number of accepted students/contributors: 10<br />
* Number of those participants that are still active today: 1</div>
Firespeaker
https://wiki.apertium.org/w/index.php?title=Google_Summer_of_Code/Application_2022&diff=73898
Google Summer of Code/Application 2022 (2022-02-21T14:34:15Z)
<p>Firespeaker: /* Ideas list */</p>
<hr />
<div>== Register org ==<br />
<br />
=== Years previously participated in GSoC ===<br />
2021, 2020, 2019, 2018, 2017, 2016, 2014, 2013, 2012, 2011, 2010, 2009<br />
<br />
== Org Profile ==<br />
<br />
=== Website URL ===<br />
[http://wiki.apertium.org http://wiki.apertium.org]<br />
<br />
=== Logo ===<br />
[https://upload.wikimedia.org/wikipedia/commons/thumb/b/b4/Apertium_logo.svg/1214px-Apertium_logo.svg.png https://upload.wikimedia.org/wikipedia/commons/thumb/b/b4/Apertium_logo.svg/1214px-Apertium_logo.svg.png]<br />
<br />
=== Tagline ===<br />
A free/open-source machine translation platform<br />
<br />
=== Primary Open Source License ===<br />
GNU General Public License version 3<br />
<br />
=== Year organisation started ===<br />
<br />
2006 (???)<br />
<br />
=== Link to source code ===<br />
<br />
https://github.com/apertium/<br />
<br />
=== Organisation categories ===<br />
<br />
* Science and medicine (healthcare, biotech, life sciences, academic research, etc.)<br />
* Other<br />
<br />
=== Organisation technologies ===<br />
C++, Python, Bash, XML, JavaScript<br />
<br />
=== Organisation topics ===<br />
machine translation, natural language processing, less-resourced languages, language technology<br />
<br />
=== Organisation description ===<br />
<br />
Apertium is a free/open-source machine translation platform, and the organisation focuses primarily on symbolic language technology for less-resourced languages.<br />
<br />
=== Contributor guidance ===<br />
<br />
https://wiki.apertium.org/wiki/Top_tips_for_GSOC_applications<br />
<br />
=== Communication Methods ===<br />
<br />
* Chat: https://wiki.apertium.org/wiki/IRC<br />
* Mailing List / Forum: apertium-stuff@lists.sourceforge.net<br />
<br />
== Organisation questionnaire ==<br />
=== Why does your org want to participate in Google Summer of Code? ===<br />
Apertium has been part of GSoC for over a decade and it has been a great experience. Apertium loves GSoC: it supports free/open-source (FOS) software as much as we do! Apertium needs GSoC: it offers an incredible opportunity (and resources!) allowing us to spread the word about our project, to attract new developers and consolidate the contribution of existing developers through mentoring, and to improve the platform in many ways: improving the engine, generating new tools and user interfaces, making Apertium available to other applications, improving the quality of the languages currently supported, adding new languages to it. Apertium loves less-resourced languages and GSoC gives an opportunity for developers speaking them to generate FOS language technologies for them. Apertium will gain: more developers getting to know FOS software and the ethos that comes with it, contributing to it, and especially contributors who are passionate about languages and computers.<br />
<br />
<br />
=== What would your org consider to be a successful GSoC program? ===<br />
<br />
<!-- New contributors, new features completed, more code written, better being able to guide new developers into open source world, etc. --><br />
<br />
A successful GSoC would see any combination of newly released language pairs, the addition of new technologies to the Apertium framework, the addition of features to our web infrastructure, and a fresh round of developers becoming excited by Apertium. We would especially be happy to see a successful project form the basis of a published academic paper and to gain new long-term contributors.<br />
<br />
=== How will you keep mentors engaged with their GSoC contributors? ===<br />
We select our mentors from among very active developers with a long-term commitment to this 18-year-old project — they are people we know well and whom we have met face-to-face at conferences, workshops, or even in daily life; some of them teach and do research at universities or work at companies using Apertium. For this reason, it is quite unlikely for mentors to disappear, since most of them have been embedded in our community for years. However, there is always the possibility that some problem comes up, so we also assign backup mentors to all contributors, in many cases more than one. If a mentor cannot continue for whatever reason, one of the backup co-mentors will take over, and one of the organisation administrators (themselves experienced GSoC mentors) will take on the role of second backup mentor.<br />
<br />
=== How will you keep your GSoC contributors on schedule to complete their projects? ===<br />
<br />
Apertium only accepts applications with a well-defined weekly schedule, clear milestones and deliverables, and, if possible, a section on risk management (risks, their probability, their severity, & mitigating actions). Applications should also plan for holidays, exams, and other absences. Contributors are encouraged to let us know if they need to reschedule or take a break. Contributors may also need consultation when they are stuck or when personal matters interfere with their work: we will, as we have in the past, try our best to reach out to them, be open and friendly, and provide as much support as we can to help them out. We've been in situations like this too! Detailed scheduling will avoid both mentors and contributors wasting time. If a mentor reports the unscheduled disappearance of a contributor (an unexpected 72-hour silence), the contributor will be contacted by the administrators. If silence persists, their task will be frozen and we will report to Google, to proceed according to the rules of GSoC.<br />
<br />
=== How will you get your GSoC contributors involved in your community during GSoC? ===<br />
<br />
First, we encourage all prospective contributors to visit our IRC channel (irc.oftc.net#apertium) as often as possible, even before the start of the program, since that will help them find a suitable mentor and a useful project that they can work on. We advise them strongly to read our wiki pages and manuals, use our system, try to break it and fix it, and finally tell us about it. As a result, contributors get familiar with Apertium before the coding period starts, which increases their chances of ending up with a successful project. In addition, we define coding challenges for each of the proposed projects, which serve both as an entry task, and as a means for getting our contributors familiar with Apertium and involved in our community in the early stages of the program. Finally, during the coding stage, we are available to talk to our contributors on a daily basis and give them suggestions and advice when they get stuck.<br />
<br />
=== Anything else we should know? (optional) ===<br />
<br />
<br />
=== Is your organization part of any government? ===<br />
No<br />
<br />
== Program Application ==<br />
<br />
=== Ideas list ===<br />
<br />
https://wiki.apertium.org/wiki/Ideas_for_Google_Summer_of_Code<br />
<br />
=== Mentors ===<br />
(How many Mentors does your Organization have available to participate in this program?)<br />
<br />
* Jonathan<br />
* Xavi Ivars<br />
* Marc Riera<br />
* Daniel Swanson<br />
* Unhammer<br />
* nlhowell<br />
* Sevilay<br />
(add your names here!)<br />
<br />
=== Program Retention Survey ===<br />
<br />
(We're looking for more details on how many of your students/GSoC contributors from the above program are still active in your community today.)<br />
<br />
* Number of accepted students/contributors: 10<br />
* Number of those participants that are still active today: 1</div>
Firespeaker
https://wiki.apertium.org/w/index.php?title=Google_Summer_of_Code/Application_2022&diff=73897
Google Summer of Code/Application 2022 (2022-02-21T14:33:11Z)
<p>Firespeaker: /* Mentors */</p>
<hr />
<div>== Register org ==<br />
<br />
=== Years previously participated in GSoC ===<br />
2021, 2020, 2019, 2018, 2017, 2016, 2014, 2013, 2012, 2011, 2010, 2009<br />
<br />
== Org Profile ==<br />
<br />
=== Website URL ===<br />
[http://wiki.apertium.org http://wiki.apertium.org]<br />
<br />
=== Logo ===<br />
[https://upload.wikimedia.org/wikipedia/commons/thumb/b/b4/Apertium_logo.svg/1214px-Apertium_logo.svg.png https://upload.wikimedia.org/wikipedia/commons/thumb/b/b4/Apertium_logo.svg/1214px-Apertium_logo.svg.png]<br />
<br />
=== Tagline ===<br />
A free/open-source machine translation platform<br />
<br />
=== Primary Open Source License ===<br />
GNU General Public License version 3<br />
<br />
=== Year organisation started ===<br />
<br />
2006 (???)<br />
<br />
=== Link to source code ===<br />
<br />
https://github.com/apertium/<br />
<br />
=== Organisation categories ===<br />
<br />
* Science and medicine (healthcare, biotech, life sciences, academic research, etc.)<br />
* Other<br />
<br />
=== Organisation technologies ===<br />
C++, Python, Bash, XML, JavaScript<br />
<br />
=== Organisation topics ===<br />
machine translation, natural language processing, less-resourced languages, language technology<br />
<br />
=== Organisation description ===<br />
<br />
Apertium is a free/open-source machine translation platform, and the organisation focuses primarily on symbolic language technology for less-resourced languages.<br />
<br />
=== Contributor guidance ===<br />
<br />
https://wiki.apertium.org/wiki/Top_tips_for_GSOC_applications<br />
<br />
=== Communication Methods ===<br />
<br />
* Chat: https://wiki.apertium.org/wiki/IRC<br />
* Mailing List / Forum: apertium-stuff@lists.sourceforge.net<br />
<br />
== Organisation questionnaire ==<br />
=== Why does your org want to participate in Google Summer of Code? ===<br />
Apertium has been part of GSoC for over a decade and it has been a great experience. Apertium loves GSoC: it supports free/open-source (FOS) software as much as we do! Apertium needs GSoC: it offers an incredible opportunity (and resources!) allowing us to spread the word about our project, to attract new developers and consolidate the contribution of existing developers through mentoring, and to improve the platform in many ways: improving the engine, generating new tools and user interfaces, making Apertium available to other applications, improving the quality of the languages currently supported, adding new languages to it. Apertium loves less-resourced languages and GSoC gives an opportunity for developers speaking them to generate FOS language technologies for them. Apertium will gain: more developers getting to know FOS software and the ethos that comes with it, contributing to it, and especially contributors who are passionate about languages and computers.<br />
<br />
<br />
=== What would your org consider to be a successful GSoC program? ===<br />
<br />
<!-- New contributors, new features completed, more code written, better being able to guide new developers into open source world, etc. --><br />
<br />
A successful GSoC would see any combination of newly released language pairs, the addition of new technologies to the Apertium framework, the addition of features to our web infrastructure, and a fresh round of developers becoming excited by Apertium. We would especially be happy to see a successful project form the basis of a published academic paper and to gain new long-term contributors.<br />
<br />
=== How will you keep mentors engaged with their GSoC contributors? ===<br />
We select our mentors from among very active developers with a long-term commitment to this 18-year-old project — they are people we know well and whom we have met face-to-face at conferences, workshops, or even in daily life; some of them teach and do research at universities or work at companies using Apertium. For this reason, it is quite unlikely for mentors to disappear, since most of them have been embedded in our community for years. However, there is always the possibility that some problem comes up, so we also assign backup mentors to all contributors, in many cases more than one. If a mentor cannot continue for whatever reason, one of the backup co-mentors will take over, and one of the organisation administrators (themselves experienced GSoC mentors) will take on the role of second backup mentor.<br />
<br />
=== How will you keep your GSoC contributors on schedule to complete their projects? ===<br />
<br />
Apertium only accepts applications with a well-defined weekly schedule, clear milestones and deliverables, and, if possible, a section on risk management (risks, their probability, their severity, & mitigating actions). Applications should also plan for holidays, exams, and other absences. Contributors are encouraged to let us know if they need to reschedule or take a break. Contributors may also need consultation when they are stuck or when personal matters interfere with their work: we will, as we have in the past, try our best to reach out to them, be open and friendly, and provide as much support as we can to help them out. We've been in situations like this too! Detailed scheduling will avoid both mentors and contributors wasting time. If a mentor reports the unscheduled disappearance of a contributor (an unexpected 72-hour silence), the contributor will be contacted by the administrators. If silence persists, their task will be frozen and we will report to Google, to proceed according to the rules of GSoC.<br />
<br />
=== How will you get your GSoC contributors involved in your community during GSoC? ===<br />
<br />
First, we encourage all prospective contributors to visit our IRC channel (irc.oftc.net#apertium) as often as possible, even before the start of the program, since that will help them find a suitable mentor and a useful project that they can work on. We advise them strongly to read our wiki pages and manuals, use our system, try to break it and fix it, and finally tell us about it. As a result, contributors get familiar with Apertium before the coding period starts, which increases their chances of ending up with a successful project. In addition, we define coding challenges for each of the proposed projects, which serve both as an entry task, and as a means for getting our contributors familiar with Apertium and involved in our community in the early stages of the program. Finally, during the coding stage, we are available to talk to our contributors on a daily basis and give them suggestions and advice when they get stuck.<br />
<br />
=== Anything else we should know? (optional) ===<br />
<br />
<br />
=== Is your organization part of any government? ===<br />
No<br />
<br />
== Program Application ==<br />
<br />
=== Ideas list ===<br />
(url)<br />
<br />
=== Mentors ===<br />
(How many Mentors does your Organization have available to participate in this program?)<br />
<br />
* Jonathan<br />
* Xavi Ivars<br />
* Marc Riera<br />
* Daniel Swanson<br />
* Unhammer<br />
* nlhowell<br />
* Sevilay<br />
(add your names here!)<br />
<br />
=== Program Retention Survey ===<br />
<br />
(We're looking for more details on how many of your students/GSoC contributors from the above program are still active in your community today.)<br />
<br />
* Number of accepted students/contributors: 10<br />
* Number of those participants that are still active today: 1</div>
Firespeaker
https://wiki.apertium.org/w/index.php?title=Google_Summer_of_Code/Application_2022&diff=73870
Google Summer of Code/Application 2022 (2022-02-21T05:59:33Z)
<p>Firespeaker: /* Mentors */</p>
<hr />
<div>== Register org ==<br />
<br />
=== Years previously participated in GSoC ===<br />
2021, 2020, 2019, 2018, 2017, 2016, 2014, 2013, 2012, 2011, 2010, 2009<br />
<br />
== Org Profile ==<br />
<br />
=== Website URL ===<br />
[http://wiki.apertium.org http://wiki.apertium.org]<br />
<br />
=== Logo ===<br />
[https://upload.wikimedia.org/wikipedia/commons/thumb/b/b4/Apertium_logo.svg/1214px-Apertium_logo.svg.png https://upload.wikimedia.org/wikipedia/commons/thumb/b/b4/Apertium_logo.svg/1214px-Apertium_logo.svg.png]<br />
<br />
=== Tagline ===<br />
A free/open-source machine translation platform<br />
<br />
=== Primary Open Source License ===<br />
GNU General Public License version 3<br />
<br />
=== Year organisation started ===<br />
<br />
2006 (???)<br />
<br />
=== Link to source code ===<br />
<br />
https://github.com/apertium/<br />
<br />
=== Organisation categories ===<br />
<br />
* Science and medicine (healthcare, biotech, life sciences, academic research, etc.)<br />
* Other<br />
<br />
=== Organisation technologies ===<br />
C++, Python, Bash, XML, JavaScript<br />
<br />
=== Organisation topics ===<br />
machine translation, natural language processing, less-resourced languages, language technology<br />
<br />
=== Organisation description ===<br />
<br />
Apertium is a free/open-source machine translation platform, and the organisation focuses primarily on symbolic language technology for less-resourced languages.<br />
<br />
=== Contributor guidance ===<br />
<br />
https://wiki.apertium.org/wiki/Top_tips_for_GSOC_applications<br />
<br />
=== Communication Methods ===<br />
<br />
* Chat: https://wiki.apertium.org/wiki/IRC<br />
* Mailing List / Forum: apertium-stuff@lists.sourceforge.net<br />
<br />
== Organisation questionnaire ==<br />
=== Why does your org want to participate in Google Summer of Code? ===<br />
Apertium has been part of GSoC for over a decade and it has been a great experience. Apertium loves GSoC: it supports free/open-source (FOS) software as much as we do! Apertium needs GSoC: it offers an incredible opportunity (and resources!) allowing us to spread the word about our project, to attract new developers and consolidate the contribution of existing developers through mentoring, and to improve the platform in many ways: improving the engine, generating new tools and user interfaces, making Apertium available to other applications, improving the quality of the languages currently supported, adding new languages to it. Apertium loves less-resourced languages and GSoC gives an opportunity for developers speaking them to generate FOS language technologies for them. Apertium will gain: more developers getting to know FOS software and the ethos that comes with it, contributing to it, and especially contributors who are passionate about languages and computers.<br />
<br />
<br />
=== What would your org consider to be a successful GSoC program? ===<br />
<br />
<!-- New contributors, new features completed, more code written, better being able to guide new developers into open source world, etc. --><br />
<br />
A successful GSoC would see any combination of newly released language pairs, the addition of new technologies to the Apertium framework, the addition of features to our web infrastructure, and a fresh round of developers becoming excited by Apertium. We would especially be happy to see a successful project form the basis of a published academic paper and to gain new long-term contributors.<br />
<br />
=== How will you keep mentors engaged with their GSoC contributors? ===<br />
We select our mentors from among very active developers with a long-term commitment to this 18-year-old project — they are people we know well and whom we have met face-to-face at conferences, workshops, or even in daily life; some of them teach and do research at universities or work at companies using Apertium. For this reason, it is quite unlikely for mentors to disappear, since most of them have been embedded in our community for years. However, there is always the possibility that some problem comes up, so we also assign backup mentors to all contributors, in many cases more than one. If a mentor cannot continue for whatever reason, one of the backup co-mentors will take over, and one of the organisation administrators (themselves experienced GSoC mentors) will take on the role of second backup mentor.<br />
<br />
=== How will you keep your GSoC contributors on schedule to complete their projects? ===<br />
<br />
Apertium only accepts applications with a well-defined weekly schedule, clear milestones and deliverables, and, if possible, a section on risk management (risks, their probability, their severity, & mitigating actions). Applications should also plan for holidays, exams, and other absences. Contributors are encouraged to let us know if they need to reschedule or take a break. Contributors may also need consultation when they are stuck or when personal matters interfere with their work: we will, as we have in the past, try our best to reach out to them, be open and friendly, and provide as much support as we can to help them out. We've been in situations like this too! Detailed scheduling will avoid both mentors and contributors wasting time. If a mentor reports the unscheduled disappearance of a contributor (an unexpected 72-hour silence), the contributor will be contacted by the administrators. If silence persists, their task will be frozen and we will report to Google, to proceed according to the rules of GSoC.<br />
<br />
=== How will you get your GSoC contributors involved in your community during GSoC? ===<br />
<br />
First, we encourage all prospective contributors to visit our IRC channel (freenode.net#apertium) as often as possible, even before the start of the program, since that will help them find a suitable mentor and a useful project that they can work on. We advise them strongly to read our wiki pages and manuals, use our system, try to break it and fix it, and finally tell us about it. As a result, contributors get familiar with Apertium before the coding period starts, which increases their chances of ending up with a successful project. In addition, we define coding challenges for each of the proposed projects, which serve both as an entry task, and as a means for getting our contributors familiar with Apertium and involved in our community in the early stages of the program. Finally, during the coding stage, we are available to talk to our contributors on a daily basis and give them suggestions and advice when they get stuck.<br />
<br />
=== Anything else we should know? (optional) ===<br />
<br />
<br />
=== Is your organization part of any government? ===<br />
No<br />
<br />
== Program Application ==<br />
<br />
=== Ideas list ===<br />
(url)<br />
<br />
=== Mentors ===<br />
(How many Mentors does your Organization have available to participate in this program?)<br />
<br />
Jonathan, ... (add your names here!)<br />
<br />
=== Program Retention Survey ===<br />
<br />
(We're looking for more details on how many of your students/GSoC contributors from the above program are still active in your community today.)<br />
<br />
* Number of accepted students/contributors: 10<br />
* Number of those participants that are still active today: 1</div>
Firespeaker
https://wiki.apertium.org/w/index.php?title=Google_Summer_of_Code/Application_2022&diff=73869
Google Summer of Code/Application 2022 (2022-02-21T05:59:20Z)
<p>Firespeaker: /* Mentors */</p>
<hr />
<div>== Register org ==<br />
<br />
=== Years previously participated in GSoC ===<br />
2021, 2020, 2019, 2018, 2017, 2016, 2014, 2013, 2012, 2011, 2010, 2009<br />
<br />
== Org Profile ==<br />
<br />
=== Website URL ===<br />
[http://wiki.apertium.org http://wiki.apertium.org]<br />
<br />
=== Logo ===<br />
[https://upload.wikimedia.org/wikipedia/commons/thumb/b/b4/Apertium_logo.svg/1214px-Apertium_logo.svg.png https://upload.wikimedia.org/wikipedia/commons/thumb/b/b4/Apertium_logo.svg/1214px-Apertium_logo.svg.png]<br />
<br />
=== Tagline ===<br />
A free/open-source machine translation platform<br />
<br />
=== Primary Open Source License ===<br />
GNU General Public License version 3<br />
<br />
=== Year organisation started ===<br />
<br />
2006 (???)<br />
<br />
=== Link to source code ===<br />
<br />
https://github.com/apertium/<br />
<br />
=== Organisation categories ===<br />
<br />
* Science and medicine (healthcare, biotech, life sciences, academic research, etc.)<br />
* Other<br />
<br />
=== Organisation technologies ===<br />
C++, Python, Bash, XML, JavaScript<br />
<br />
=== Organisation topics ===<br />
machine translation, natural language processing, less-resourced languages, language technology<br />
<br />
=== Organisation description ===<br />
<br />
Apertium is a free/open-source machine translation platform, and the organisation focuses primarily on symbolic language technology for less-resourced languages.<br />
<br />
=== Contributor guidance ===<br />
<br />
https://wiki.apertium.org/wiki/Top_tips_for_GSOC_applications<br />
<br />
=== Communication Methods ===<br />
<br />
* Chat: https://wiki.apertium.org/wiki/IRC<br />
* Mailing List / Forum: apertium-stuff@lists.sourceforge.net<br />
<br />
== Organisation questionnaire ==<br />
=== Why does your org want to participate in Google Summer of Code? ===<br />
Apertium has been part of GSoC for over a decade and it has been a great experience. Apertium loves GSoC: it supports free/open-source (FOS) software as much as we do! Apertium needs GSoC: it offers an incredible opportunity (and resources!) allowing us to spread the word about our project, to attract new developers and consolidate the contribution of existing developers through mentoring, and to improve the platform in many ways: improving the engine, generating new tools and user interfaces, making Apertium available to other applications, improving the quality of the languages currently supported, adding new languages to it. Apertium loves less-resourced languages and GSoC gives an opportunity for developers speaking them to generate FOS language technologies for them. Apertium will gain: more developers getting to know FOS software and the ethos that comes with it, contributing to it, and especially contributors who are passionate about languages and computers.<br />
<br />
<br />
=== What would your org consider to be a successful GSoC program? ===<br />
<br />
<!-- New contributors, new features completed, more code written, better being able to guide new developers into open source world, etc. --><br />
<br />
A successful GSoC would see any combination of newly released language pairs, the addition of new technologies to the Apertium framework, the addition of features to our web infrastructure, and a fresh round of developers becoming excited by Apertium. We would especially be happy to see a successful project form the basis of a published academic paper and to gain new long-term contributors.<br />
<br />
=== How will you keep mentors engaged with their GSoC contributors? ===<br />
We select our mentors from among very active developers with a long-term commitment to this 18-year-old project — they are people we know well and whom we have met face-to-face at conferences, workshops, or even in daily life; some of them teach and do research at universities or work at companies using Apertium. For this reason, it is quite unlikely for mentors to disappear, since most of them have been embedded in our community for years. However, there is always the possibility that some problem comes up, so we also assign backup mentors to all contributors, in many cases more than one. If a mentor cannot continue for whatever reason, one of the backup co-mentors will take over, and one of the organisation administrators (themselves experienced GSoC mentors) will take on the role of second backup mentor.<br />
<br />
=== How will you keep your GSoC contributors on schedule to complete their projects? ===<br />
<br />
Apertium only accepts applications with a well-defined weekly schedule, clear milestones and deliverables, and, if possible, a section on risk management (risks, their probability, their severity, & mitigating actions). Applications should also plan for holidays, exams, and other absences. Contributors are encouraged to let us know if they need to reschedule or take a break. Contributors may also need consultation when they are stuck or when personal matters interfere with their work: we will, as we have in the past, try our best to reach out to them, be open and friendly, and provide as much support as we can to help them out. We've been in situations like this too! Detailed scheduling will avoid both mentors and contributors wasting time. If a mentor reports the unscheduled disappearance of a contributor (an unexpected 72-hour silence), the contributor will be contacted by the administrators. If silence persists, their task will be frozen and we will report to Google, to proceed according to the rules of GSoC.<br />
<br />
=== How will you get your GSoC contributors involved in your community during GSoC? ===<br />
<br />
First, we encourage all prospective contributors to visit our IRC channel (freenode.net#apertium) as often as possible, even before the start of the program, since that will help them find a suitable mentor and a useful project that they can work on. We advise them strongly to read our wiki pages and manuals, use our system, try to break it and fix it, and finally tell us about it. As a result, contributors get familiar with Apertium before the coding period starts, which increases their chances of ending up with a successful project. In addition, we define coding challenges for each of the proposed projects, which serve both as an entry task and as a means of getting our contributors familiar with Apertium and involved in our community in the early stages of the program. Finally, during the coding stage, we are available to talk to our contributors on a daily basis and give them suggestions and advice when they get stuck.<br />
<br />
=== Anything else we should know? (optional) ===<br />
<br />
<br />
=== Is your organization part of any government? ===<br />
No<br />
<br />
== Program Application ==<br />
<br />
=== Ideas list ===<br />
(url)<br />
<br />
=== Mentors ===<br />
(How many Mentors does your Organization have available to participate in this program?)<br />
Jonathan, ...<br />
<br />
=== Program Retention Survey ===<br />
<br />
(We're looking for more details on how many of your students/GSoC contributors from the above program are still active in your community today.)<br />
<br />
* Number of accepted students/contributors: 10<br />
* Number of those participants that are still active today: 1</div>Firespeakerhttps://wiki.apertium.org/w/index.php?title=Google_Summer_of_Code/Application_2022&diff=73868Google Summer of Code/Application 20222022-02-21T05:51:28Z<p>Firespeaker: /* Communication Methods */</p>
<hr />
<div>== Register org ==<br />
<br />
=== Years previously participated in GSoC ===<br />
2021, 2020, 2019, 2018, 2017, 2016, 2014, 2013, 2012, 2011, 2010, 2009<br />
<br />
== Org Profile ==<br />
<br />
=== Website URL ===<br />
[http://wiki.apertium.org http://wiki.apertium.org]<br />
<br />
=== Logo ===<br />
[https://upload.wikimedia.org/wikipedia/commons/thumb/b/b4/Apertium_logo.svg/1214px-Apertium_logo.svg.png https://upload.wikimedia.org/wikipedia/commons/thumb/b/b4/Apertium_logo.svg/1214px-Apertium_logo.svg.png]<br />
<br />
=== Tagline ===<br />
A free/open-source machine translation platform<br />
<br />
=== Primary Open Source License ===<br />
GNU General Public License version 3<br />
<br />
=== Year organisation started ===<br />
<br />
2006 (???)<br />
<br />
=== Link to source code ===<br />
<br />
https://github.com/apertium/<br />
<br />
=== Organisation categories ===<br />
<br />
* Science and medicine (healthcare, biotech, life sciences, academic research, etc.)<br />
* Other<br />
<br />
=== Organisation technologies ===<br />
C++, Python, Bash, XML, JavaScript<br />
<br />
=== Organisation topics ===<br />
machine translation, natural language processing, less-resourced languages, language technology<br />
<br />
=== Organisation description ===<br />
<br />
Apertium is a free/open-source machine translation platform; the organisation focuses on primarily symbolic (rule-based) language technology for less-resourced languages.<br />
<br />
=== Contributor guidance ===<br />
<br />
https://wiki.apertium.org/wiki/Top_tips_for_GSOC_applications<br />
<br />
=== Communication Methods ===<br />
<br />
* Chat: https://wiki.apertium.org/wiki/IRC<br />
* Mailing List / Forum: apertium-stuff@lists.sourceforge.net<br />
<br />
== Organisation questionnaire ==<br />
=== Why does your org want to participate in Google Summer of Code? ===<br />
Apertium has been part of GSoC for over a decade and it has been a great experience. Apertium loves GSoC: it supports free/open-source (FOS) software as much as we do! Apertium needs GSoC: it offers an incredible opportunity (and resources!) allowing us to spread the word about our project, to attract new developers and consolidate the contribution of existing developers through mentoring, and to improve the platform in many ways: improving the engine, generating new tools and user interfaces, making Apertium available to other applications, improving the quality of the languages currently supported, adding new languages to it. Apertium loves less-resourced languages and GSoC gives an opportunity for developers speaking them to generate FOS language technologies for them. Apertium will gain: more developers getting to know FOS software and the ethos that comes with it, contributing to it, and especially contributors who are passionate about languages and computers.<br />
<br />
<br />
=== What would your org consider to be a successful GSoC program? ===<br />
<br />
<!-- New contributors, new features completed, more code written, better being able to guide new developers into open source world, etc. --><br />
<br />
A successful GSoC would see any combination of newly released language pairs, the addition of new technologies to the Apertium framework, the addition of features to our web infrastructure, and a fresh round of developers becoming excited by Apertium. We would especially be happy to see a successful project form the basis of a published academic paper and to gain new long-term contributors.<br />
<br />
=== How will you keep mentors engaged with their GSoC contributors? ===<br />
We select our mentors from among very active developers with a long-term commitment to this 18-year-old project — they are people we know well and whom we have met face-to-face at conferences, workshops, or even in daily life; some of them teach and do research at universities or work at companies using Apertium. For this reason, it is quite unlikely for mentors to disappear, since most of them have been embedded in our community for years. However, there is always the possibility that some problem comes up, so we also assign backup mentors to all contributors, in many cases more than one. If a mentor cannot continue for whatever reason, one of the backup co-mentors will take over, and one of the organisation administrators (themselves experienced GSoC mentors) will take on the role of second backup mentor.<br />
<br />
=== How will you keep your GSoC contributors on schedule to complete their projects? ===<br />
<br />
Apertium only accepts applications with a well-defined weekly schedule, clear milestones and deliverables, and, if possible, a section on risk management (risks, their probability, their severity, & mitigating actions). Applications should also plan for holidays, exams, and other absences. Contributors are encouraged to let us know if they need to reschedule or take a break. Contributors may also need consultation when they are stuck or when personal matters interfere with their work: we will, as we have in the past, try our best to reach out to them, be open and friendly, and provide as much support as we can to help them out. We've been in situations like this too! Detailed scheduling prevents both mentors and contributors from wasting time. If a mentor reports the unscheduled disappearance of a contributor (unexpected 72-hour silence), the contributor will be contacted by the administrators. If silence persists, their task will be frozen and we will report to Google, to proceed according to the rules of GSoC.<br />
<br />
=== How will you get your GSoC contributors involved in your community during GSoC? ===<br />
<br />
First, we encourage all prospective contributors to visit our IRC channel (freenode.net#apertium) as often as possible, even before the start of the program, since that will help them find a suitable mentor and a useful project that they can work on. We advise them strongly to read our wiki pages and manuals, use our system, try to break it and fix it, and finally tell us about it. As a result, contributors get familiar with Apertium before the coding period starts, which increases their chances of ending up with a successful project. In addition, we define coding challenges for each of the proposed projects, which serve both as an entry task and as a means of getting our contributors familiar with Apertium and involved in our community in the early stages of the program. Finally, during the coding stage, we are available to talk to our contributors on a daily basis and give them suggestions and advice when they get stuck.<br />
<br />
=== Anything else we should know? (optional) ===<br />
<br />
<br />
=== Is your organization part of any government? ===<br />
No<br />
<br />
== Program Application ==<br />
<br />
=== Ideas list ===<br />
(url)<br />
<br />
=== Mentors ===<br />
(How many Mentors does your Organization have available to participate in this program?)<br />
<br />
=== Program Retention Survey ===<br />
<br />
(We're looking for more details on how many of your students/GSoC contributors from the above program are still active in your community today.)<br />
<br />
* Number of accepted students/contributors: 10<br />
* Number of those participants that are still active today: 1</div>Firespeakerhttps://wiki.apertium.org/w/index.php?title=Google_Summer_of_Code/Application_2022&diff=73867Google Summer of Code/Application 20222022-02-21T05:51:15Z<p>Firespeaker: /* Program Retention Survey */</p>
<hr />
<div>== Register org ==<br />
<br />
=== Years previously participated in GSoC ===<br />
2021, 2020, 2019, 2018, 2017, 2016, 2014, 2013, 2012, 2011, 2010, 2009<br />
<br />
== Org Profile ==<br />
<br />
=== Website URL ===<br />
[http://wiki.apertium.org http://wiki.apertium.org]<br />
<br />
=== Logo ===<br />
[https://upload.wikimedia.org/wikipedia/commons/thumb/b/b4/Apertium_logo.svg/1214px-Apertium_logo.svg.png https://upload.wikimedia.org/wikipedia/commons/thumb/b/b4/Apertium_logo.svg/1214px-Apertium_logo.svg.png]<br />
<br />
=== Tagline ===<br />
A free/open-source machine translation platform<br />
<br />
=== Primary Open Source License ===<br />
GNU General Public License version 3<br />
<br />
=== Year organisation started ===<br />
<br />
2006 (???)<br />
<br />
=== Link to source code ===<br />
<br />
https://github.com/apertium/<br />
<br />
=== Organisation categories ===<br />
<br />
* Science and medicine (healthcare, biotech, life sciences, academic research, etc.)<br />
* Other<br />
<br />
=== Organisation technologies ===<br />
C++, Python, Bash, XML, JavaScript<br />
<br />
=== Organisation topics ===<br />
machine translation, natural language processing, less-resourced languages, language technology<br />
<br />
=== Organisation description ===<br />
<br />
Apertium is a free/open-source machine translation platform; the organisation focuses on primarily symbolic (rule-based) language technology for less-resourced languages.<br />
<br />
=== Contributor guidance ===<br />
<br />
https://wiki.apertium.org/wiki/Top_tips_for_GSOC_applications<br />
<br />
=== Communication Methods ===<br />
<br />
Chat: https://wiki.apertium.org/wiki/IRC<br />
Mailing List / Forum: apertium-stuff@lists.sourceforge.net<br />
<br />
<br />
== Organisation questionnaire ==<br />
=== Why does your org want to participate in Google Summer of Code? ===<br />
Apertium has been part of GSoC for over a decade and it has been a great experience. Apertium loves GSoC: it supports free/open-source (FOS) software as much as we do! Apertium needs GSoC: it offers an incredible opportunity (and resources!) allowing us to spread the word about our project, to attract new developers and consolidate the contribution of existing developers through mentoring, and to improve the platform in many ways: improving the engine, generating new tools and user interfaces, making Apertium available to other applications, improving the quality of the languages currently supported, adding new languages to it. Apertium loves less-resourced languages and GSoC gives an opportunity for developers speaking them to generate FOS language technologies for them. Apertium will gain: more developers getting to know FOS software and the ethos that comes with it, contributing to it, and especially contributors who are passionate about languages and computers.<br />
<br />
<br />
=== What would your org consider to be a successful GSoC program? ===<br />
<br />
<!-- New contributors, new features completed, more code written, better being able to guide new developers into open source world, etc. --><br />
<br />
A successful GSoC would see any combination of newly released language pairs, the addition of new technologies to the Apertium framework, the addition of features to our web infrastructure, and a fresh round of developers becoming excited by Apertium. We would especially be happy to see a successful project form the basis of a published academic paper and to gain new long-term contributors.<br />
<br />
=== How will you keep mentors engaged with their GSoC contributors? ===<br />
We select our mentors from among very active developers with a long-term commitment to this 18-year-old project — they are people we know well and whom we have met face-to-face at conferences, workshops, or even in daily life; some of them teach and do research at universities or work at companies using Apertium. For this reason, it is quite unlikely for mentors to disappear, since most of them have been embedded in our community for years. However, there is always the possibility that some problem comes up, so we also assign backup mentors to all contributors, in many cases more than one. If a mentor cannot continue for whatever reason, one of the backup co-mentors will take over, and one of the organisation administrators (themselves experienced GSoC mentors) will take on the role of second backup mentor.<br />
<br />
=== How will you keep your GSoC contributors on schedule to complete their projects? ===<br />
<br />
Apertium only accepts applications with a well-defined weekly schedule, clear milestones and deliverables, and, if possible, a section on risk management (risks, their probability, their severity, & mitigating actions). Applications should also plan for holidays, exams, and other absences. Contributors are encouraged to let us know if they need to reschedule or take a break. Contributors may also need consultation when they are stuck or when personal matters interfere with their work: we will, as we have in the past, try our best to reach out to them, be open and friendly, and provide as much support as we can to help them out. We've been in situations like this too! Detailed scheduling prevents both mentors and contributors from wasting time. If a mentor reports the unscheduled disappearance of a contributor (unexpected 72-hour silence), the contributor will be contacted by the administrators. If silence persists, their task will be frozen and we will report to Google, to proceed according to the rules of GSoC.<br />
<br />
=== How will you get your GSoC contributors involved in your community during GSoC? ===<br />
<br />
First, we encourage all prospective contributors to visit our IRC channel (freenode.net#apertium) as often as possible, even before the start of the program, since that will help them find a suitable mentor and a useful project that they can work on. We advise them strongly to read our wiki pages and manuals, use our system, try to break it and fix it, and finally tell us about it. As a result, contributors get familiar with Apertium before the coding period starts, which increases their chances of ending up with a successful project. In addition, we define coding challenges for each of the proposed projects, which serve both as an entry task and as a means of getting our contributors familiar with Apertium and involved in our community in the early stages of the program. Finally, during the coding stage, we are available to talk to our contributors on a daily basis and give them suggestions and advice when they get stuck.<br />
<br />
=== Anything else we should know? (optional) ===<br />
<br />
<br />
=== Is your organization part of any government? ===<br />
No<br />
<br />
== Program Application ==<br />
<br />
=== Ideas list ===<br />
(url)<br />
<br />
=== Mentors ===<br />
(How many Mentors does your Organization have available to participate in this program?)<br />
<br />
=== Program Retention Survey ===<br />
<br />
(We're looking for more details on how many of your students/GSoC contributors from the above program are still active in your community today.)<br />
<br />
* Number of accepted students/contributors: 10<br />
* Number of those participants that are still active today: 1</div>Firespeakerhttps://wiki.apertium.org/w/index.php?title=Google_Summer_of_Code/Application_2022&diff=73866Google Summer of Code/Application 20222022-02-21T05:46:32Z<p>Firespeaker: /* Program Application */</p>
<hr />
<div>== Register org ==<br />
<br />
=== Years previously participated in GSoC ===<br />
2021, 2020, 2019, 2018, 2017, 2016, 2014, 2013, 2012, 2011, 2010, 2009<br />
<br />
== Org Profile ==<br />
<br />
=== Website URL ===<br />
[http://wiki.apertium.org http://wiki.apertium.org]<br />
<br />
=== Logo ===<br />
[https://upload.wikimedia.org/wikipedia/commons/thumb/b/b4/Apertium_logo.svg/1214px-Apertium_logo.svg.png https://upload.wikimedia.org/wikipedia/commons/thumb/b/b4/Apertium_logo.svg/1214px-Apertium_logo.svg.png]<br />
<br />
=== Tagline ===<br />
A free/open-source machine translation platform<br />
<br />
=== Primary Open Source License ===<br />
GNU General Public License version 3<br />
<br />
=== Year organisation started ===<br />
<br />
2006 (???)<br />
<br />
=== Link to source code ===<br />
<br />
https://github.com/apertium/<br />
<br />
=== Organisation categories ===<br />
<br />
* Science and medicine (healthcare, biotech, life sciences, academic research, etc.)<br />
* Other<br />
<br />
=== Organisation technologies ===<br />
C++, Python, Bash, XML, JavaScript<br />
<br />
=== Organisation topics ===<br />
machine translation, natural language processing, less-resourced languages, language technology<br />
<br />
=== Organisation description ===<br />
<br />
Apertium is a free/open-source machine translation platform; the organisation focuses on primarily symbolic (rule-based) language technology for less-resourced languages.<br />
<br />
=== Contributor guidance ===<br />
<br />
https://wiki.apertium.org/wiki/Top_tips_for_GSOC_applications<br />
<br />
=== Communication Methods ===<br />
<br />
Chat: https://wiki.apertium.org/wiki/IRC<br />
Mailing List / Forum: apertium-stuff@lists.sourceforge.net<br />
<br />
<br />
== Organisation questionnaire ==<br />
=== Why does your org want to participate in Google Summer of Code? ===<br />
Apertium has been part of GSoC for over a decade and it has been a great experience. Apertium loves GSoC: it supports free/open-source (FOS) software as much as we do! Apertium needs GSoC: it offers an incredible opportunity (and resources!) allowing us to spread the word about our project, to attract new developers and consolidate the contribution of existing developers through mentoring, and to improve the platform in many ways: improving the engine, generating new tools and user interfaces, making Apertium available to other applications, improving the quality of the languages currently supported, adding new languages to it. Apertium loves less-resourced languages and GSoC gives an opportunity for developers speaking them to generate FOS language technologies for them. Apertium will gain: more developers getting to know FOS software and the ethos that comes with it, contributing to it, and especially contributors who are passionate about languages and computers.<br />
<br />
<br />
=== What would your org consider to be a successful GSoC program? ===<br />
<br />
<!-- New contributors, new features completed, more code written, better being able to guide new developers into open source world, etc. --><br />
<br />
A successful GSoC would see any combination of newly released language pairs, the addition of new technologies to the Apertium framework, the addition of features to our web infrastructure, and a fresh round of developers becoming excited by Apertium. We would especially be happy to see a successful project form the basis of a published academic paper and to gain new long-term contributors.<br />
<br />
=== How will you keep mentors engaged with their GSoC contributors? ===<br />
We select our mentors from among very active developers with a long-term commitment to this 18-year-old project — they are people we know well and whom we have met face-to-face at conferences, workshops, or even in daily life; some of them teach and do research at universities or work at companies using Apertium. For this reason, it is quite unlikely for mentors to disappear, since most of them have been embedded in our community for years. However, there is always the possibility that some problem comes up, so we also assign backup mentors to all contributors, in many cases more than one. If a mentor cannot continue for whatever reason, one of the backup co-mentors will take over, and one of the organisation administrators (themselves experienced GSoC mentors) will take on the role of second backup mentor.<br />
<br />
=== How will you keep your GSoC contributors on schedule to complete their projects? ===<br />
<br />
Apertium only accepts applications with a well-defined weekly schedule, clear milestones and deliverables, and, if possible, a section on risk management (risks, their probability, their severity, & mitigating actions). Applications should also plan for holidays, exams, and other absences. Contributors are encouraged to let us know if they need to reschedule or take a break. Contributors may also need consultation when they are stuck or when personal matters interfere with their work: we will, as we have in the past, try our best to reach out to them, be open and friendly, and provide as much support as we can to help them out. We've been in situations like this too! Detailed scheduling prevents both mentors and contributors from wasting time. If a mentor reports the unscheduled disappearance of a contributor (unexpected 72-hour silence), the contributor will be contacted by the administrators. If silence persists, their task will be frozen and we will report to Google, to proceed according to the rules of GSoC.<br />
<br />
=== How will you get your GSoC contributors involved in your community during GSoC? ===<br />
<br />
First, we encourage all prospective contributors to visit our IRC channel (freenode.net#apertium) as often as possible, even before the start of the program, since that will help them find a suitable mentor and a useful project that they can work on. We advise them strongly to read our wiki pages and manuals, use our system, try to break it and fix it, and finally tell us about it. As a result, contributors get familiar with Apertium before the coding period starts, which increases their chances of ending up with a successful project. In addition, we define coding challenges for each of the proposed projects, which serve both as an entry task and as a means of getting our contributors familiar with Apertium and involved in our community in the early stages of the program. Finally, during the coding stage, we are available to talk to our contributors on a daily basis and give them suggestions and advice when they get stuck.<br />
<br />
=== Anything else we should know? (optional) ===<br />
<br />
<br />
=== Is your organization part of any government? ===<br />
No<br />
<br />
== Program Application ==<br />
<br />
=== Ideas list ===<br />
(url)<br />
<br />
=== Mentors ===<br />
(How many Mentors does your Organization have available to participate in this program?)<br />
<br />
=== Program Retention Survey ===<br />
<br />
(We're looking for more details on how many of your students/GSoC contributors from the above program are still active in your community today.)<br />
<br />
Number of accepted students/contributors: ___<br />
Number of those participants that are still active today: ___</div>Firespeaker