https://wiki.apertium.org/w/api.php?action=feedcontributions&user=83.104.99.209&feedformat=atomApertium - User contributions [en]2024-03-29T08:28:41ZUser contributionsMediaWiki 1.34.1https://wiki.apertium.org/w/index.php?title=Welsh_to_English&diff=2020Welsh to English2007-10-11T12:10:51Z<p>83.104.99.209: </p>
<hr />
<div>{{TOCD}}<br />
<br />
==Roadmap==<br />
<br />
===apertium-cy-en 0.1===<br />
<br />
* 8,000 of the highest frequency words in each dictionary.<br />
* Rules dealing with basic verb tenses (past, present, future)<br />
* Basic word re-ordering for simple phrases.<br />
<br />
;Aims and uses<br />
<br />
* For a non-native speaker to be able to discern the topic of a general news item.<br />
* To be able to identify ''who'' said ''what'' to ''whom''.<br />
* To be able to determine whether a particular item is interesting enough to be translated properly.<br />
* Sentences of up to 5 words should be translated reasonably well in both directions.<br />
<br />
== Transfer ==<br />
<br />
<pre><br />
# Welsh<br />
: Literal<br />
@ Gloss (English)<br />
</pre><br />
<br />
=== Welsh to English ===<br />
<br />
==== Word order (VSO to SVO) ====<br />
<pre><br />
# Genir pawb yn rhydd ac yn gydradd â 'i gilydd mewn urddas a hawliau.<br />
: Be born everyone free and equal with each other in dignity and rights.<br />
<br />
@ Everyone is born free and equal with each other in dignity and rights.<br />
</pre><br />
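The reordering above can be sketched in Python as a toy illustration. Real Apertium transfer rules are written in XML and operate on morphologically analysed lexical units; the tag names here ('vb', 'subj', 'obj') are invented for the sketch.

```python
# Toy sketch of VSO -> SVO reordering on POS-tagged tokens.
# Tags are invented; this is not the Apertium transfer format.

def vso_to_svo(tokens):
    """tokens: list of (word, tag) pairs."""
    if tokens and tokens[0][1] == "vb":
        verb = tokens[0]
        # find the end of the subject noun-phrase run
        i = 1
        while i < len(tokens) and tokens[i][1] == "subj":
            i += 1
        return tokens[1:i] + [verb] + tokens[i:]
    return tokens

# "Genir pawb ..." -> subject before verb
print(vso_to_svo([("genir", "vb"), ("pawb", "subj"), ("yn rhydd", "obj")]))
```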
==== Noun Noun -> Noun of Noun ====<br />
<pre><br />
# Llywodraeth Cynulliad Cymru<br />
: Government Assembly Wales ==> Government (of) Assembly (of) Wales<br />
<br />
@ Welsh Assembly Government<br />
</pre><br />
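For the literal gloss, the pattern amounts to inserting "of" between successive nouns, as in this toy sketch (illustrative only; the idiomatic translation "Welsh Assembly Government" would need a lexicalised multiword entry):

```python
# Toy sketch of "Noun Noun -> Noun of Noun" on the literal gloss.

def noun_chain_to_of(nouns):
    return " of ".join(nouns)

print(noun_chain_to_of(["Government", "Assembly", "Wales"]))
# Government of Assembly of Wales
```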
<br />
==== Noun Adjective -> Adjective Noun====<br />
<pre><br />
# bachgen hapus<br />
: boy happy<br />
<br />
@ happy boy<br />
<br />
# geneth bert<br />
: girl pretty<br />
<br />
@ pretty girl<br />
</pre><br />
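This pattern is a straightforward two-word swap after lexical transfer, sketched here in Python. The tiny bilingual lookup table is invented for illustration; in Apertium this is the job of the bilingual dictionary plus a transfer rule.

```python
# Toy sketch: translate each word, then swap noun + adjective
# to adjective + noun. The lookup table is illustrative only.

BIDIX = {"bachgen": "boy", "hapus": "happy", "geneth": "girl", "bert": "pretty"}

def transfer_n_adj(words):
    out = [BIDIX.get(w, w) for w in words]
    if len(out) == 2:            # pattern: [noun, adjective]
        out = [out[1], out[0]]   # -> [adjective, noun]
    return " ".join(out)

print(transfer_n_adj(["bachgen", "hapus"]))  # happy boy
```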
<br />
====Compound prepositions====<br />
<pre><br />
<donnek> I've also thought of another wrinkle - compound prepositions<br />
<spectie> i will probably need to write a rule<br />
<donnek> eg ar ben (on top of)<br />
<donnek> lit on head<br />
<spectie> we can do a similar thing with those<br />
<spectie> for example:<br />
<donnek> becomes ar fy mhen (on my head, literally) = on top of me<br />
<donnek> ar ei ben, ar ei phen, ar ein pennau<br />
<spectie> are there many of them<br />
<donnek> maybe we don't need to think about them now, but just to flag them for later<br />
<spectie> if there are not many it might be worth making them multiwords<br />
<donnek> how do multiwords work<br />
<spectie> there are a few ways<br />
<spectie> depending on if one of the words inside the multiword inflects or not<br />
<donnek> that would be the case here<br />
<spectie> for example "take care"<br />
<spectie> "i take care of", "you take care of", "he takes care of"<br />
<spectie> but "take care" is treated as one verb<br />
<donnek> ok<br />
</pre><br />
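The "take care of" idea from the discussion, where one word inside the multiword inflects, can be sketched like this. The form table and tag strings are invented; Apertium encodes inflecting multiwords in its dictionary XML.

```python
# Sketch of a multiword whose first element inflects:
# "take care of" behaves as one verb, but only "take" changes form.

def inflect_multiword(head, tail, tags):
    forms = {("take", "p3.sg.pres"): "takes",
             ("take", "past"): "took"}
    return forms.get((head, tags), head) + " " + tail

print(inflect_multiword("take", "care of", "p3.sg.pres"))  # takes care of
```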
<br />
====Attributive and predicative adjectives====<br />
<br />
<pre><br />
<spectie> its a problem with attributive/predicative<br />
<donnek> it's say something (which is) nice<br />
<spectie> but in english we don't distinguish between the two (at least in terms of morphology)<br />
<spectie> yes<br />
<spectie> in afrikaans they have a -e for attributive (e.g. feodale stelsel -- feudal system) <br />
<spectie> and "the system is feudal" - "die stelsel is feodaal"<br />
<spectie> donnek, aye<br />
<donnek> in Welsh the second would have yn before the adj<br />
<donnek> so we may not need anything to mark attrib/pred<br />
<br />
* Dywedodd rhywbeth neis wrthi = He said something nice to her<br />
* Mae'r peth yno yn neis = That thing is nice <br />
* Mae'n gar neis = It is a nice car<br />
<br />
<donnek> at first glance, we may just need a rule for rhyw+thing<br />
<donnek> rhyw = some<br />
<donnek> rhywbeth (something), rhywfaint (somewhat), etc<br />
<donnek> rhywle (somewhere)<br />
</pre><br />
<br />
====Possession====<br />
<br />
<pre><br />
Mae cath 'da Bwflw<br />
Bod+p3.sg.pres cath gyda Bwflw<br />
Be+p3.sg.pres cat with Beefalo<br />
`Beefalo has a cat'<br />
</pre><br />
<br />
;Apertium notes<br />
<br />
We can probably deal with this in interchunk as follows<br />
<br />
vbbod NP1 pr_gyda NP2<br />
<br />
-><br />
<br />
NP2 vbhave NP1<br />
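The interchunk pattern can be sketched in Python. The chunk names follow the rule above; the (already translated) chunk contents and the fixed "has" form are invented for illustration.

```python
# Sketch of the interchunk rule: vbbod NP1 pr_gyda NP2 -> NP2 vbhave NP1.
# Chunks are (name, translated_text) pairs; a real rule would also
# transfer tense and agreement onto "have".

def possession_rule(chunks):
    names = [name for name, _ in chunks]
    if names == ["vbbod", "NP1", "pr_gyda", "NP2"]:
        np1, np2 = chunks[1][1], chunks[3][1]
        return [("NP2", np2), ("vbhave", "has"), ("NP1", np1)]
    return chunks

out = possession_rule([("vbbod", "is"), ("NP1", "a cat"),
                       ("pr_gyda", "with"), ("NP2", "Beefalo")])
print(" ".join(text for _, text in out))  # Beefalo has a cat
```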
<br />
====The 'yn' particle====<br />
<br />
<pre><br />
As well as meaning 'in', 'yn' is used to form the present participle of a verb in Welsh. For example:<br />
<br />
dysgu = to learn<br />
yn dysgu = learning<br />
<br />
The present tense is formed by combining 'yn' with the corresponding form of 'bod' (to be) as follows:<br />
Mae Beefalo yn gweithio = Beefalo is working/Beefalo works<br />
<br />
note: when following a vowel, yn is abbreviated to 'n, e.g.<br />
Mae Beefalo'n gweithio<br />
<br />
</pre><br />
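The bod + yn + verbnoun pattern can be sketched as follows. The two-entry lexicon is invented, and the fixed "is" ignores the agreement a real rule would handle.

```python
# Sketch of the 'yn' pattern: bod-form + yn/'n + verbnoun translates
# as subject + "is" + present participle. Lexicon is illustrative.

VERBNOUNS = {"dysgu": "learning", "gweithio": "working"}

def translate_yn_clause(subject, verbnoun):
    return f"{subject} is {VERBNOUNS[verbnoun]}"

print(translate_yn_clause("Beefalo", "gweithio"))  # Beefalo is working
```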
[[Category:Discussions]]</div>83.104.99.209https://wiki.apertium.org/w/index.php?title=Earley-based_structural_transfer_for_Apertium&diff=994Earley-based structural transfer for Apertium2007-08-03T13:46:43Z<p>83.104.99.209: </p>
<hr />
<div>Perhaps [http://en.wikipedia.org/wiki/Earley's_algorithm Earley's algorithm] for parsing context-free grammars (which has a left-to-right, longest-match philosophy, as Apertium does) could be used to perform more complex syntactic transformations; this could be useful for distant language pairs containing embedded structures.<br />
<br />
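To make the proposal concrete, here is a minimal Earley recogniser in Python. The grammar and sentence are invented, and a real transfer module would need to attach target-language actions to each completed rule rather than just answer yes/no.

```python
# Minimal Earley recogniser (predictor / scanner / completer).
# An item is (lhs, rhs, dot, origin); a symbol is a nonterminal
# if it is a key of the grammar, otherwise a terminal word.

def earley_recognise(grammar, start, words):
    chart = [set() for _ in range(len(words) + 1)]
    for rhs in grammar[start]:
        chart[0].add((start, rhs, 0, 0))
    for i in range(len(words) + 1):
        changed = True
        while changed:  # run predictor/completer to a fixpoint
            changed = False
            for lhs, rhs, dot, origin in list(chart[i]):
                if dot < len(rhs) and rhs[dot] in grammar:   # predictor
                    for prod in grammar[rhs[dot]]:
                        item = (rhs[dot], prod, 0, i)
                        if item not in chart[i]:
                            chart[i].add(item)
                            changed = True
                elif dot == len(rhs):                        # completer
                    for l2, r2, d2, o2 in list(chart[origin]):
                        if d2 < len(r2) and r2[d2] == lhs:
                            item = (l2, r2, d2 + 1, o2)
                            if item not in chart[i]:
                                chart[i].add(item)
                                changed = True
        if i < len(words):                                   # scanner
            for lhs, rhs, dot, origin in chart[i]:
                if dot < len(rhs) and rhs[dot] == words[i]:
                    chart[i + 1].add((lhs, rhs, dot + 1, origin))
    return any(lhs == start and dot == len(rhs) and origin == 0
               for lhs, rhs, dot, origin in chart[len(words)])

GRAMMAR = {"S": [("NP", "VP")],
           "NP": [("he",), ("her",)],
           "VP": [("sees", "NP")]}
print(earley_recognise(GRAMMAR, "S", ["he", "sees", "her"]))  # True
```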
==Open questions==<br />
<br />
* Currently, Apertium uses text streams to communicate. I assume this would not be possible here.<br />
* When would one call the bilingual dictionary? Apertium Level 2 calls it in the first stage.<br />
* We should check whether this has been done before.<br />
* In case there is more than one parse of a sentence, there should be a way to select the most likely.<br />
<br />
==Existing parsers==<br />
<br />
Current free-software parsers which might be worth looking at:<br />
<br />
* [http://www.agfl.cs.ru.nl/ AGFL parser] (GPL)<br />
<br />
==Further reading==<br />
<br />
* Koichi Takeda [http://66.102.9.104/search?q=cache:E-giecHfU8QJ:acl.ldc.upenn.edu/P/P96/P96-1020.pdf+earley+algorithm+%22machine+translation%22&hl=en&ct=clnk&cd=6&client=iceweasel-a Pattern-Based Context-Free Grammars for Machine Translation]<br />
:This paper proposes the use of "pattern-based" context-free grammars as a basis for building machine translation (MT) systems.<br />
*Randall Sharp and Oliver Streiter [http://66.102.9.104/search?q=cache:QJHB-s5Ze8cJ:www.iai.uni-sb.de/docs/meta93.pdf+earley+algorithm+%22machine+translation%22&hl=en&ct=clnk&cd=7&client=iceweasel-a Simplifying the Complexity of Machine Translation]</div>83.104.99.209https://wiki.apertium.org/w/index.php?title=Using_linguistic_resources&diff=993Using linguistic resources2007-08-03T13:10:26Z<p>83.104.99.209: /* Ways of storing data */</p>
<hr />
<div>{{TOCD}}<br />
<br />
This page gives a brief overview of the kind of data and resources that can be useful in building a new language pair for Apertium, and how to go about building them if they do not already exist.<br />
<br />
==What dictionaries?==<br />
<br />
Each Apertium language pair requires 3 dictionary files. For instance, for the English-Afrikaans pair, these would be:<br />
<br />
* <code>apertium-en-af.af.dix.xml</code>: a list of Afrikaans words and their variants;<br />
* <code>apertium-en-af.en.dix.xml</code>: a list of English words and their variants;<br />
* <code>apertium-en-af.en-af.dix.xml</code>: a list which maps the Afrikaans words in af.dix to their equivalent English words in en.dix<br />
<br />
These dictionary files are not discussed further on this page &mdash; more information on their layout and structure is available at the [[Apertium New Language Pair HOWTO|HOWTO]].<br />
<br />
==Collecting linguistic data==<br />
<br />
Before these files can be produced, you need a collection of linguistic data which can be inserted into them. This data might consist of wordlists, word-corpora derived from web-crawlers such as Crubadán, grammar notes, existing translations of open-source software such as KDE or GNOME, etc. Some practical suggestions on how to build some starter wordlists can be found at [[Building dictionaries]], but if you feel that this is too technical, just ask one of the Apertium team to put together something like this for you.<br />
<br />
A crucial point here is that the data must either have been gathered from scratch, or must be available under a license which is compatible with Apertium's [http://www.fsf.org/licensing/licenses/gpl.html GPL]. In other words, you ''definitely'' cannot just start copying published dictionaries or other material wholesale into your data store.<br />
<br />
It is unlikely that this data will be appropriate "as is" for use in Apertium, and it will need a greater or lesser amount of revision first. You do not have to be a first-language speaker to collect and systematise the data, but you should have a reasonable knowledge of the language, and be working in consultation with first-language speakers.<br />
<br />
It is possible to collect a small amount of linguistic data, and start testing it with Apertium. However, this is not recommended - your views of how the data should be segmented may change, leading to wasted work. Once initial contact has been made with the Apertium team, it is better to aim at collecting a sizeable wordlist (1,000-2,000 words) and coming to some preliminary decisions on how the language's sentences are structured. In the meantime, maintain contact with the Apertium team, and discuss any issues that have arisen in regard to your data collection.<br />
<br />
You will have a good idea of how the language works from your own knowledge of it, and from reviewing published materials (eg dictionaries, grammars) about it. From this you can decide on the particular elements of information that need to be noted down for each word in order to capture its meaning and variants.<br />
<br />
In many widely-spoken languages (e.g. English, German, Spanish) there may be a range of material available, and there may also be a significant number of people willing and able to work with you on collecting and systematising the data. However, for lesser-used languages (e.g. Breton, Kashubian) the amount of material and the number of helpers may be small &mdash; many of the lesser-used languages in KDE, for instance, only have one or two people working on them. If you are in this position, it is important to remember that "the best is the enemy of the good". You will not be able to create your language's equivalent of the Oxford English Dictionary overnight, and there is no point in trying. It is better to aim at a good basic foundation which will allow you to develop it later and fill in the gaps as time and manpower present themselves. So, for instance, while it might be ideal to have citation sentences for each word giving its typical use in context, this may be a luxury you cannot as yet afford.<br />
<br />
==Ways of storing data==<br />
<br />
It is easiest to start storing the words in a spreadsheet or database. Gnumeric, KSpread or OOCalc are examples of the former. Once complete, your data can be exported into a format (e.g. CSV, comma-separated values) where it can be used by other software to build the Apertium dictionaries. Databases such as PostgreSQL, MySQL and SQLite are even more attractive, provided you are familiar with them, since the data can be manipulated in various ways before exporting. Further information on the software mentioned here is [at this other page].<br />
<br />
You will then have to decide which basic information you should store for each word. For many European languages, for instance, you might consider using the following information for nouns:<br />
<br />
* base form, or lemma (usually the singular) <br />
* English meaning (assuming English is the other language of the pair)<br />
* clarification (where any enhancement of the meaning is required)<br />
* plural form<br />
* gender (eg masculine, feminine, neuter)<br />
* number (eg singular, dual, plural)<br />
* part of speech (by definition, this will be "noun")<br />
* source (where you got the word).<br />
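Such a record is easy to keep as one spreadsheet or CSV row per word, which can later be exported for dictionary building. A minimal sketch using the fields above (the example entry, Welsh "cath" = cat, plural "cathod", feminine, is illustrative):

```python
# One noun record, stored as a CSV row under the fields listed above.

import csv
import io

FIELDS = ["lemma", "english", "clarification", "plural",
          "gender", "number", "pos", "source"]
row = ["cath", "cat", "", "cathod", "f", "sg", "noun", "fieldwork"]

buf = io.StringIO()
writer = csv.writer(buf)
writer.writerow(FIELDS)
writer.writerow(row)
print(buf.getvalue().strip())
```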
<br />
Each base form should have a one-to-one relationship with its meaning in the target language. So, for instance, in Welsh, rather than have:<br />
<br />
:pres - money, brass<br />
<br />
we would have:<br />
<br />
:pres - money<br />
:pres - brass<br />
<br />
This is to allow easier manipulation of the data (for example, with this format it is easier to turn your Welsh-English wordlist into an English-Welsh one).<br />
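For example, with one meaning per entry, reversing a Welsh-English list into an English-Welsh one is a one-line transformation:

```python
# One-to-one entries make the wordlist trivially reversible.

cy_en = [("pres", "money"), ("pres", "brass"), ("cath", "cat")]
en_cy = [(en, cy) for cy, en in cy_en]
print(en_cy)  # [('money', 'pres'), ('brass', 'pres'), ('cat', 'cath')]
```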
<br />
The meaning in your target language should be kept as short as possible - choose the single word that matches the greatest proportion of contextual uses of the source language word. Then use the "clarification" entry to expand on this basic meaning. For instance, in Welsh, we would have:<br />
<br />
:Cymraeg - Welsh (language)<br />
:Cymreig - Welsh (non-language)<br />
<br />
where the former is used to talk only about the Welsh language, and the latter is used to refer to anything else (places, customs, etc). This approach will allow nuances of meaning to be captured when appropriate, without cluttering up the equivalence.<br />
<br />
The "part of speech" entry will allow you to combine various wordlists whenever necessary without losing information about the contents - you will be able to separate them again. Typical parts of speech in European languages might be: noun, proper noun, adjective, verb, adverb, preposition, pronoun, conjunction, interjection, interrogative, demonstrative, numeral.<br />
<br />
If you decide to note down idioms or longer phrases, you can give them some sort of POS tag such as "phrase", and let the grammarians argue over their exact structure later!<br />
<br />
The "source" entry is not essential, but may be useful if anyone ever queries whether your data infringes someone else's copyright. By definition, your data store will eventually contain all the words contained in, for example, small dictionaries &mdash; although the words themselves are not subject to copyright, the selection and arrangement of words in a dictionary is. By using a "source" entry, you will be able to demonstrate that your selection of words has been independently gathered.<br />
<br />
Once you have your lists of words, you will have the contours of your language's landscape in place. However, to fill in the details, your data will also need to contain information on what forms these words take in context. For instance, in English the past tense of "see" is "saw". In Latvian, "sirds" (heart) is in the nominative case, but it has other forms such as "sirdij" (to a heart, dative) or "sirdis" (hearts, accusative). So, instead of noting the plural for your nouns, for instance, you may have decided to note instead some information which will allow you to predict these variants. In Latin, for instance, the accepted method is to note the nominative and genitive singular of any word, which will then allow you to predict its other forms (eg "mensa, mensae" - table).<br />
<br />
If you have not done this as yet, the next stage is to go over your linguistic data adding information of this sort (these "sets" of variants are called "paradigms" in Apertium, and are an important component in how it works). In some cases, you may need to extend your spreadsheet or database to allow new entries. For instance, for English and German verbs the standard notation is to note, in addition to the infinitive, the third person singular present, the third person singular past, and the past participle:<br />
<br />
:bringen, bringt, brachte, gebracht<br />
:bring, brings, brought, brought <br />
<br />
so you might add additional columns for these. In the same way, additional columns could be added for noun cases, adjectival variants, and so on.<br />
<br />
In many European languages, there is a rich set of conjugational variants for verbs. It may be possible to capture these fairly easily, as in French or Spanish, by making the verb ending (eg -er, -ar) the main determiner for the variants, and noting any consequent spelling changes:<br />
:hablar (to speak), hablo (I speak)<br />
<br />
but<br />
<br />
:avergonzar (to shame), avergüenzo (I shame).<br />
<br />
In other languages (eg Greek), the situation may be more complex, and not so amenable to simple categorisation. Nevertheless, it is important to try to abstract some rules for verb form generation - at the very least, this may offer the possibility of another useful language tool, a verbform generator (see, for instance, [http://www3.sympatico.ca/sarrazip/dev/verbiste.html Verbiste] (French), [http://compjugador.sourceforge.net Compjugador] (Spanish), [http://www.rhedadur.org.uk Rhedadur] (Welsh)). Many other conjugators can be found on [http://www.verbix.com Verbix] or by doing a simple Google search.<br />
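A rule-based verb form generator of this kind can be sketched as a regular-ending rule plus an override table for spelling changes. Only the first person singular present is covered here; the override for "avergonzar" is the example from the text, and the tag string is invented.

```python
# Sketch of a verb form generator: regular -ar conjugation with
# an override table for irregular spelling changes.

OVERRIDES = {("avergonzar", "p1.sg.pres"): "avergüenzo"}

def conjugate(infinitive, tags):
    if (infinitive, tags) in OVERRIDES:
        return OVERRIDES[(infinitive, tags)]
    if infinitive.endswith("ar") and tags == "p1.sg.pres":
        return infinitive[:-2] + "o"   # hablar -> hablo
    raise ValueError("pattern not covered in this sketch")

print(conjugate("hablar", "p1.sg.pres"))      # hablo
print(conjugate("avergonzar", "p1.sg.pres"))  # avergüenzo
```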
<br />
After this work, you should have a set of internally consistent data that captures a lot of the key information about the most common words in your language, and you are now ready to start importing that data into Apertium. That merits a separate page [ref].<br />
<br />
==Some final notes==<br />
<br />
The first is that Apertium is a work in progress. It was originally developed for closely-related Romance languages, and is now expanding into a translation platform for a much wider range of languages. By definition, this means that future work will involve trying to accommodate linguistic constructs that are new to the system. For instance, the mutation system in Celtic languages has been handled by a small addition to the dictionary format. This is challenging and exciting, but by the same token you should not expect that the Apertium team will have an easy (or indeed any!) answer to a particular problem. Be prepared to collaborate on developing Apertium to deal with that problem.<br />
<br />
The second is that your carefully-collected data is ''not'' just an input into Apertium. You can use it to produce an online dictionary for your language (see, for instance, [http://www.eurfa.org.uk Eurfa] for Welsh), and it can also be converted easily into a print dictionary using something like LaTeX. The data can be used to build a spelling checker or a grammar checker using the tools available from the [http://borel.slu.edu/gramadoir/index.html Gramadóir] project.<br />
<br />
Without language data, it is impossible to build language tools. So by putting together your datastore, you have already taken an enormous step towards making the riches of your language available to others.</div>
<br />
==Ways of storing data==<br />
<br />
It is easiest to start storing the words in a spreadsheet or database. Gnumeric, KSpread or OOCalc are examples of the former. Once complete, your data can be exported into a format (e.g. CSV, comma-separated values) where it can be used by other software to build the Apertium dictionaries. Databases such as PostgreSQL, MySQL and SQLite are even more attractive, provided you are familiar with them, since the data can be manipulated in various ways before exporting. Further information on the software mentioned here is [at this other page].<br />
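As a minimal sketch of the export step, a CSV wordlist produced by any of these tools can be read back by other software with a few lines of standard-library Python. The column names used here (base, meaning, pos) are illustrative assumptions, not a fixed Apertium format:<br />

```python
import csv

def load_wordlist(path):
    # Read a CSV wordlist (one header row, then one entry per row)
    # into a list of dictionaries keyed by the column names.
    with open(path, newline="", encoding="utf-8") as f:
        return list(csv.DictReader(f))

def nouns_only(rows):
    # Example manipulation: pull the nouns out of a mixed wordlist,
    # relying on the "part of speech" column described above.
    return [r for r in rows if r.get("pos") == "noun"]
```

Because every row carries its part of speech, merged wordlists can always be separated again, as noted below.<br />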
<br />
You will then have to decide which basic information you should store for each word. For many European languages, for instance, you might consider using the following information for nouns:<br />
<br />
* base form (usually the singular)<br />
* English meaning (assuming English is the other language of the pair)<br />
* clarification (where any enhancement of the meaning is required)<br />
* plural form<br />
* gender (eg masculine, feminine, neuter)<br />
* number (eg singular, dual, plural)<br />
* part of speech (by definition, this will be "noun")<br />
* source (where you got the word).<br />
<br />
Each base form should have a one-to-one relationship with its meaning in the target language. So, for instance, in Welsh, rather than have:<br />
<br />
:pres - money, brass<br />
<br />
we would have:<br />
<br />
:pres - money<br />
:pres - brass<br />
<br />
This is to allow easier manipulation of the data (for example, with this format it is easier to turn your Welsh-English wordlist into an English-Welsh one).<br />
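To see why the one-to-one layout helps, here is a sketch of reversing a wordlist held as (source, target) pairs; the entry format is an assumption for illustration, not how Apertium stores data:<br />

```python
def invert(wordlist):
    # Turn a list of (source_word, target_word) pairs into a
    # target -> [source_words] mapping. This only works cleanly
    # because each entry carries exactly one meaning.
    inverted = {}
    for source, target in wordlist:
        inverted.setdefault(target, []).append(source)
    return inverted
```

With the "pres - money" / "pres - brass" entries above, inverting yields separate "money" and "brass" entries, each listing its Welsh equivalents.<br />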
<br />
The meaning in your target language should be kept as short as possible - choose the single word that matches the greatest proportion of contextual uses of the source language word. Then use the "clarification" entry to expand on this basic meaning. For instance, in Welsh, we would have:<br />
<br />
:Cymraeg - Welsh (language)<br />
:Cymreig - Welsh (non-language)<br />
<br />
where the former is used to talk only about the Welsh language, and the latter is used to refer to anything else (places, customs, etc). This approach will allow nuances of meaning to be captured when appropriate, without cluttering up the equivalence.<br />
<br />
The "part of speech" entry will allow you to combine various wordlists whenever necessary without losing information about the contents - you will be able to separate them again. Typical parts of speech in European languages might be: noun, proper noun, adjective, verb, adverb, preposition, pronoun, conjunction, interjection, interrogative, demonstrative, numeral.<br />
<br />
If you decide to note down idioms or longer phrases, you can give them some sort of POS tag such as "phrase", and let the grammarians argue over their exact structure later!<br />
<br />
The "source" entry is not essential, but may be useful if anyone ever queries whether your data infringes someone else's copyright. By definition, your data store will eventually contain all the words contained in, for example, small dictionaries &mdash; although the words themselves are not copyright, the selection and arrangement of words in a dictionary is. By using a "source" entry, you will be able to demonstrate that your selection of words has been independently gathered.<br />
<br />
Once you have your lists of words, you will have the contours of your language's landscape in place. However, to fill in the details, your data will also need to contain information on what forms these words take in context. For instance, in English the past tense of "see" is "saw". In Latvian, "sirds" (heart) is in the nominative case, but it has other forms such as "sirdij" (to a heart, dative) or "sirdis" (hearts, accusative). So, instead of noting the plural for your nouns, for instance, you may have decided to note instead some information which will allow you to predict these variants. In Latin, for instance, the accepted method is to note the nominative and genitive singular of any word, which will then allow you to predict its other forms (eg "mensa, mensae" - table).<br />
<br />
If you have not done this as yet, the next stage is to go over your linguistic data adding information of this sort (these "sets" of variants are called "paradigms" in Apertium, and are an important component in how it works). In some cases, you may need to extend your spreadsheet or database to allow new entries. For instance, for English and German verbs the standard notation is to note, in addition to the infinitive, the third person singular present, the past tense, and the past participle:<br />
<br />
:bringen, bringt, brachte, gebracht<br />
:bring, brings, brought, brought <br />
<br />
so you might add additional columns for these. In the same way, additional columns could be added for noun cases, adjectival variants, and so on.<br />
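The paradigm idea can be sketched in a few lines: store one pattern's endings once, then predict the variants of any word that follows it. The ending set below is a small, illustrative subset of the Latin first declension ("mensa, mensae" above), not a complete paradigm:<br />

```python
# Illustrative subset of the Latin first declension.
FIRST_DECLENSION = {
    "nom.sg": "a", "gen.sg": "ae", "acc.sg": "am",
    "nom.pl": "ae", "gen.pl": "arum",
}

def inflect(stem, paradigm):
    # Predict every variant of a word by attaching each of the
    # paradigm's endings to its stem.
    return {tag: stem + ending for tag, ending in paradigm.items()}
```

So storing only the stem "mens" plus a paradigm name is enough to generate "mensae", "mensam", "mensarum", and so on, which is essentially what Apertium's dictionaries do.<br />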
<br />
In many European languages, there is a rich set of conjugational variants for verbs. It may be possible to capture these fairly easily, as in French or Spanish, by making the verb ending (eg -er, -ar) the main determiner for the variants, and noting any consequent spelling changes:<br />
:hablar (to speak), hablo (I speak)<br />
<br />
but<br />
<br />
:avergonzar (to shame), avergüenzo (I shame).<br />
<br />
In other languages (eg Greek), the situation may be more complex, and not so amenable to simple categorisation. Nevertheless, it is important to try to abstract some rules for verb form generation - at the very least, this may offer the possibility of another useful language tool, a verb-form generator (see, for instance, [http://www3.sympatico.ca/sarrazip/dev/verbiste.html Verbiste] (French), [http://compjugador.sourceforge.net Compjugador] (Spanish), [http://www.rhedadur.org.uk Rhedadur] (Welsh)). Many other conjugators can be found on [http://www.verbix.com Verbix] or by doing a simple Google search.<br />
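A minimal verb-form generator of the kind described, covering only regular Spanish -ar verbs in the present indicative, might look like this sketch; irregular verbs such as "avergonzar" would need their own stem-change rules, which is exactly where simple categorisation breaks down:<br />

```python
# Present indicative endings for regular Spanish -ar verbs.
PRESENT_AR = ["o", "as", "a", "amos", "áis", "an"]

def conjugate_ar(infinitive):
    # Strip the -ar ending to get the stem, then attach each
    # personal ending: yo, tú, él/ella, nosotros, vosotros, ellos.
    assert infinitive.endswith("ar")
    stem = infinitive[:-2]
    return [stem + ending for ending in PRESENT_AR]
```

For "hablar" this produces "hablo, hablas, habla, hablamos, habláis, hablan"; the value of the exercise is that any exception the rule cannot handle is exactly the information your data store needs to record.<br />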
<br />
After this work, you should have a set of internally consistent data that captures a lot of the key information about the most common words in your language, and you are now ready to start importing that data into Apertium. That merits a separate page [ref].<br />
<br />
==Some final notes==<br />
<br />
The first is that Apertium is a work in progress. It was originally developed for closely-related Romance languages, and is now expanding into a translation platform for a much wider range of languages. By definition, this means that future work will involve trying to accommodate linguistic constructs that are new to the system. For instance, the mutation system in Celtic languages has been handled by a small addition to the dictionary format. This is challenging and exciting, but by the same token you should not expect that the Apertium team will have an easy (or indeed any!) answer to a particular problem. Be prepared to collaborate on developing Apertium to deal with that problem.<br />
<br />
The second is that your carefully-collected data is ''not'' just an input into Apertium. You can use it to produce an online dictionary for your language (see, for instance, [http://www.eurfa.org.uk Eurfa] for Welsh), and it can also be converted easily into a print dictionary using something like LaTeX. The data can be used to build a spelling checker or a grammar checker using the tools available from the [http://borel.slu.edu/gramadoir/index.html Gramadóir] project.<br />
<br />
Without language data, it is impossible to build language tools. So by putting together your datastore, you have already taken an enormous step towards making the riches of your language available to others.</div>83.104.99.209https://wiki.apertium.org/w/index.php?title=Using_linguistic_resources&diff=990Using linguistic resources2007-08-03T12:51:13Z<p>83.104.99.209: </p>
<hr />
<div>Each Apertium language pair requires 3 dictionary files. For instance, for the English-Afrikaans pair, these would be:<br />
<br />
* <code>apertium-en-af.af.dix.xml</code>: a list of Afrikaans words and their variants;<br />
* <code>apertium-en-af.en.dix.xml</code>: a list of English words and their variants;<br />
* <code>apertium-en-af.en-af.dix.xml</code>: a list which maps the Afrikaans words in af.dix to their equivalent English words in en.dix<br />
<br />
These dictionary files are not discussed further on this page - more information on their layout and structure is available at [to be written].<br />
<br />
Before these files can be produced, you need a collection of linguistic data which can be inserted into them. This data might consist of wordlists, word-corpora derived from web-crawlers such as Crubadán, grammar notes, existing translations of open-source software such as KDE or GNOME, etc. Some practical suggestions on how to build some starter wordlists can be found at [[Building dictionaries]], but if you feel that this is too technical, just ask one of the Apertium team to put together something like this for you.<br />
<br />
A crucial point here is that the data must either have been gathered from scratch, or must be available under a license which is compatible with Apertium's [http://www.fsf.org/licensing/licenses/gpl.html GPL]. In other words, you ''definitely'' cannot just start copying published dictionaries or other material wholesale into your data store.<br />
<br />
It is unlikely that this data will be appropriate "as is" for use in Apertium, and it will need a greater or lesser amount of revision first. You do not have to be a first-language speaker to collect and systematise the data, but you should have a reasonable knowledge of the language, and be working in consultation with first-language speakers.<br />
<br />
It is possible to collect a small amount of linguistic data, and start testing it with Apertium. However, this is not recommended - your views of how the data should be segmented may change, leading to wasted work. Once initial contact has been made with the Apertium team, it is better to aim at collecting a sizeable wordlist (1,000-2,000 words) and coming to some preliminary decisions on how the language's sentences are structured. In the meantime, maintain contact with the Apertium team, and discuss any issues that have arisen in regard to your data collection.<br />
<br />
You will have a good idea of how the language works from your own knowledge of it, and from reviewing published materials (eg dictionaries, grammars) about it. From this you can decide on the particular elements of information that need to be noted down for each word in order to capture its meaning and variants.<br />
<br />
In many widely-spoken languages (e.g. English, German, Spanish) there may be a range of material available, and there may also be a significant number of people willing and able to work with you on collecting and systematising the data. However, for lesser-used languages (e.g. Breton, Kashubian) the amount of material and the number of helpers may be small &mdash; many of the lesser-used languages in KDE, for instance, only have one or two people working on them. If you are in this position, it is important to remember that "the best is the enemy of the good". You will not be able to create your language's equivalent of the Oxford English Dictionary overnight, and there is no point in trying. It is better to aim at a good basic foundation which will allow you to develop it later and fill in the gaps as time and manpower present themselves. So, for instance, while it might be ideal to have citation sentences for each word giving its typical use in context, this may be a luxury you cannot as yet afford.<br />
<br />
It is easiest to start storing the words in a spreadsheet or database. Gnumeric, KSpread or OOCalc are examples of the former. Once complete, your data can be exported into a format (e.g. CSV, comma-separated values) where it can be used by other software to build the Apertium dictionaries. Databases such as PostgreSQL, MySQL and SQLite are even more attractive, provided you are familiar with them, since the data can be manipulated in various ways before exporting. Further information on the software mentioned here is [at this other page].<br />
<br />
You will then have to decide which basic information you should store for each word. For many European languages, for instance, you might consider using the following information for nouns:<br />
<br />
* base form (usually the singular)<br />
* English meaning (assuming English is the other language of the pair)<br />
* clarification (where any enhancement of the meaning is required)<br />
* plural form<br />
* gender (eg masculine, feminine, neuter)<br />
* number (eg singular, dual, plural)<br />
* part of speech (by definition, this will be "noun")<br />
* source (where you got the word).<br />
<br />
Each base form should have a one-to-one relationship with its meaning in the target language. So, for instance, in Welsh, rather than have:<br />
<br />
:pres - money, brass<br />
<br />
we would have:<br />
<br />
:pres - money<br />
:pres - brass<br />
<br />
This is to allow easier manipulation of the data (for example, with this format it is easier to turn your Welsh-English wordlist into an English-Welsh one).<br />
<br />
The meaning in your target language should be kept as short as possible - choose the single word that matches the greatest proportion of contextual uses of the source language word. Then use the "clarification" entry to expand on this basic meaning. For instance, in Welsh, we would have:<br />
<br />
:Cymraeg - Welsh (language)<br />
:Cymreig - Welsh (non-language)<br />
<br />
where the former is used to talk only about the Welsh language, and the latter is used to refer to anything else (places, customs, etc). This approach will allow nuances of meaning to be captured when appropriate, without cluttering up the equivalence.<br />
<br />
The "part of speech" entry will allow you to combine various wordlists whenever necessary without losing information about the contents - you will be able to separate them again. Typical parts of speech in European languages might be: noun, proper noun, adjective, verb, adverb, preposition, pronoun, conjunction, interjection, interrogative, demonstrative, numeral.<br />
<br />
If you decide to note down idioms or longer phrases, you can give them some sort of POS tag such as "phrase", and let the grammarians argue over their exact structure later!<br />
<br />
The "source" entry is not essential, but may be useful if anyone ever queries whether your data infringes someone else's copyright. By definition, your data store will eventually contain all the words contained in, for example, small dictionaries &mdash; although the words themselves are not copyright, the selection and arrangement of words in a dictionary is. By using a "source" entry, you will be able to demonstrate that your selection of words has been independently gathered.<br />
<br />
Once you have your lists of words, you will have the contours of your language's landscape in place. However, to fill in the details, your data will also need to contain information on what forms these words take in context. For instance, in English the past tense of "see" is "saw". In Latvian, "sirds" (heart) is in the nominative case, but it has other forms such as "sirdij" (to a heart, dative) or "sirdis" (hearts, accusative). So, instead of noting the plural for your nouns, for instance, you may have decided to note instead some information which will allow you to predict these variants. In Latin, for instance, the accepted method is to note the nominative and genitive singular of any word, which will then allow you to predict its other forms (eg "mensa, mensae" - table).<br />
<br />
If you have not done this as yet, the next stage is to go over your linguistic data adding information of this sort (these "sets" of variants are called "paradigms" in Apertium, and are an important component in how it works). In some cases, you may need to extend your spreadsheet or database to allow new entries. For instance, for English and German verbs the standard notation is to note, in addition to the infinitive, the third person singular present, the past tense, and the past participle:<br />
<br />
:bringen, bringt, brachte, gebracht<br />
:bring, brings, brought, brought <br />
<br />
so you might add additional columns for these. In the same way, additional columns could be added for noun cases, adjectival variants, and so on.<br />
<br />
In many European languages, there is a rich set of conjugational variants for verbs. It may be possible to capture these fairly easily, as in French or Spanish, by making the verb ending (eg -er, -ar) the main determiner for the variants, and noting any consequent spelling changes:<br />
:hablar (to speak), hablo (I speak)<br />
<br />
but<br />
<br />
:avergonzar (to shame), avergüenzo (I shame).<br />
<br />
In other languages (eg Greek), the situation may be more complex, and not so amenable to simple categorisation. Nevertheless, it is important to try to abstract some rules for verb form generation - at the very least, this may offer the possibility of another useful language tool, a verb-form generator (see, for instance, [http://www3.sympatico.ca/sarrazip/dev/verbiste.html Verbiste] (French), [http://compjugador.sourceforge.net Compjugador] (Spanish), [http://www.rhedadur.org.uk Rhedadur] (Welsh)). Many other conjugators can be found on [http://www.verbix.com Verbix] or by doing a simple Google search.<br />
<br />
After this work, you should have a set of internally consistent data that captures a lot of the key information about the most common words in your language, and you are now ready to start importing that data into Apertium. That merits a separate page [ref].<br />
<br />
There are a couple of final points that should be made.<br />
<br />
The first is that Apertium is a work in progress. It was originally developed for closely-related Romance languages, and is now expanding into a translation platform for a much wider range of languages. By definition, this means that future work will involve trying to accommodate linguistic constructs that are new to the system. For instance, the mutation system in Celtic languages has been handled by a small addition to the dictionary format. This is challenging and exciting, but by the same token you should not expect that the Apertium team will have an easy (or indeed any!) answer to a particular problem. Be prepared to collaborate on developing Apertium to deal with that problem.<br />
<br />
The second is that your carefully-collected data is ''not'' just an input into Apertium. You can use it to produce an online dictionary for your language (see, for instance, [http://www.eurfa.org.uk Eurfa] for Welsh), and it can also be converted easily into a print dictionary using something like LaTeX. The data can be used to build a spelling checker or a grammar checker using the tools available from the [http://borel.slu.edu/gramadoir/index.html Gramadóir] project.<br />
<br />
Without language data, it is impossible to build language tools. So by putting together your datastore, you have already taken an enormous step towards making the riches of your language available to others.</div>83.104.99.209https://wiki.apertium.org/w/index.php?title=Using_linguistic_resources&diff=989Using linguistic resources2007-08-03T12:50:22Z<p>83.104.99.209: minor formatting</p>
<hr />
<div>Each Apertium language pair requires 3 dictionary files. For instance, for the English-Afrikaans pair, these would be:<br />
<br />
* <code>apertium-en-af.af.dix.xml</code>: a list of Afrikaans words and their variants;<br />
* <code>apertium-en-af.en.dix.xml</code>: a list of English words and their variants;<br />
* <code>apertium-en-af.en-af.dix.xml</code>: a list which maps the Afrikaans words in af.dix to their equivalent English words in en.dix<br />
<br />
These dictionary files are not discussed further on this page - more information on their layout and structure is available at [to be written].<br />
<br />
Before these files can be produced, you need a collection of linguistic data which can be inserted into them. This data might consist of wordlists, word-corpora derived from web-crawlers such as Crubadán, grammar notes, existing translations of open-source software such as KDE or GNOME, etc. Some practical suggestions on how to build some starter wordlists can be found at [[Building dictionaries]], but if you feel that this is too technical, just ask one of the Apertium team to put together something like this for you.<br />
<br />
A crucial point here is that the data must either have been gathered from scratch, or must be available under a license which is compatible with Apertium's GPL<ref>Free Software Foundation. [http://www.fsf.org/licensing/licenses/gpl.html GNU General Public License]</ref>. In other words, you ''definitely'' cannot just start copying published dictionaries or other material wholesale into your data store.<br />
<br />
It is unlikely that this data will be appropriate "as is" for use in Apertium, and it will need a greater or lesser amount of revision first. You do not have to be a first-language speaker to collect and systematise the data, but you should have a reasonable knowledge of the language, and be working in consultation with first-language speakers.<br />
<br />
It is possible to collect a small amount of linguistic data, and start testing it with Apertium. However, this is not recommended - your views of how the data should be segmented may change, leading to wasted work. Once initial contact has been made with the Apertium team, it is better to aim at collecting a sizeable wordlist (1,000-2,000 words) and coming to some preliminary decisions on how the language's sentences are structured. In the meantime, maintain contact with the Apertium team, and discuss any issues that have arisen in regard to your data collection.<br />
<br />
You will have a good idea of how the language works from your own knowledge of it, and from reviewing published materials (eg dictionaries, grammars) about it. From this you can decide on the particular elements of information that need to be noted down for each word in order to capture its meaning and variants.<br />
<br />
In many widely-spoken languages (e.g. English, German, Spanish) there may be a range of material available, and there may also be a significant number of people willing and able to work with you on collecting and systematising the data. However, for lesser-used languages (e.g. Breton, Kashubian) the amount of material and the number of helpers may be small &mdash; many of the lesser-used languages in KDE, for instance, only have one or two people working on them. If you are in this position, it is important to remember that "the best is the enemy of the good". You will not be able to create your language's equivalent of the Oxford English Dictionary overnight, and there is no point in trying. It is better to aim at a good basic foundation which will allow you to develop it later and fill in the gaps as time and manpower present themselves. So, for instance, while it might be ideal to have citation sentences for each word giving its typical use in context, this may be a luxury you cannot as yet afford.<br />
<br />
It is easiest to start storing the words in a spreadsheet or database. Gnumeric, KSpread or OOCalc are examples of the former. Once complete, your data can be exported into a format (e.g. CSV, comma-separated values) where it can be used by other software to build the Apertium dictionaries. Databases such as PostgreSQL, MySQL and SQLite are even more attractive, provided you are familiar with them, since the data can be manipulated in various ways before exporting. Further information on the software mentioned here is [at this other page].<br />
<br />
You will then have to decide which basic information you should store for each word. For many European languages, for instance, you might consider using the following information for nouns:<br />
<br />
* base form (usually the singular)<br />
* English meaning (assuming English is the other language of the pair)<br />
* clarification (where any enhancement of the meaning is required)<br />
* plural form<br />
* gender (eg masculine, feminine, neuter)<br />
* number (eg singular, dual, plural)<br />
* part of speech (by definition, this will be "noun")<br />
* source (where you got the word).<br />
<br />
Each base form should have a one-to-one relationship with its meaning in the target language. So, for instance, in Welsh, rather than have:<br />
<br />
:pres - money, brass<br />
<br />
we would have:<br />
<br />
:pres - money<br />
:pres - brass<br />
<br />
This is to allow easier manipulation of the data (for example, with this format it is easier to turn your Welsh-English wordlist into an English-Welsh one).<br />
<br />
The meaning in your target language should be kept as short as possible - choose the single word that matches the greatest proportion of contextual uses of the source language word. Then use the "clarification" entry to expand on this basic meaning. For instance, in Welsh, we would have:<br />
<br />
:Cymraeg - Welsh (language)<br />
:Cymreig - Welsh (non-language)<br />
<br />
where the former is used to talk only about the Welsh language, and the latter is used to refer to anything else (places, customs, etc). This approach will allow nuances of meaning to be captured when appropriate, without cluttering up the equivalence.<br />
<br />
The "part of speech" entry will allow you to combine various wordlists whenever necessary without losing information about the contents - you will be able to separate them again. Typical parts of speech in European languages might be: noun, proper noun, adjective, verb, adverb, preposition, pronoun, conjunction, interjection, interrogative, demonstrative, numeral.<br />
<br />
If you decide to note down idioms or longer phrases, you can give them some sort of POS tag such as "phrase", and let the grammarians argue over their exact structure later!<br />
<br />
The "source" entry is not essential, but may be useful if anyone ever queries whether your data infringes someone else's copyright. By definition, your data store will eventually contain all the words contained in, for example, small dictionaries &mdash; although the words themselves are not copyright, the selection and arrangement of words in a dictionary is. By using a "source" entry, you will be able to demonstrate that your selection of words has been independently gathered.<br />
<br />
Once you have your lists of words, you will have the contours of your language's landscape in place. However, to fill in the details, your data will also need to contain information on what forms these words take in context. For instance, in English the past tense of "see" is "saw". In Latvian, "sirds" (heart) is in the nominative case, but it has other forms such as "sirdij" (to a heart, dative) or "sirdis" (hearts, accusative). So, instead of noting the plural for your nouns, for instance, you may have decided to note instead some information which will allow you to predict these variants. In Latin, for instance, the accepted method is to note the nominative and genitive singular of any word, which will then allow you to predict its other forms (eg "mensa, mensae" - table).<br />
<br />
If you have not done this as yet, the next stage is to go over your linguistic data adding information of this sort (these "sets" of variants are called "paradigms" in Apertium, and are an important component in how it works). In some cases, you may need to extend your spreadsheet or database to allow new entries. For instance, for English and German verbs the standard notation is to note, in addition to the infinitive, the third person singular present, the past tense, and the past participle:<br />
<br />
:bringen, bringt, brachte, gebracht<br />
:bring, brings, brought, brought <br />
<br />
so you might add additional columns for these. In the same way, additional columns could be added for noun cases, adjectival variants, and so on.<br />
<br />
In many European languages, there is a rich set of conjugational variants for verbs. It may be possible to capture these fairly easily, as in French or Spanish, by making the verb ending (eg -er, -ar) the main determiner for the variants, and noting any consequent spelling changes:<br />
:hablar (to speak), hablo (I speak)<br />
<br />
but<br />
<br />
:avergonzar (to shame), avergüenzo (I shame).<br />
<br />
In other languages (eg Greek), the situation may be more complex, and not so amenable to simple categorisation. Nevertheless, it is important to try to abstract some rules for verb form generation - at the very least, this may offer the possibility of another useful language tool, a verb-form generator (see, for instance, [http://www3.sympatico.ca/sarrazip/dev/verbiste.html Verbiste] (French), [http://compjugador.sourceforge.net Compjugador] (Spanish), [http://www.rhedadur.org.uk Rhedadur] (Welsh)). Many other conjugators can be found on [http://www.verbix.com Verbix] or by doing a simple Google search.<br />
<br />
After this work, you should have a set of internally consistent data that captures a lot of the key information about the most common words in your language, and you are now ready to start importing that data into Apertium. That merits a separate page [ref].<br />
<br />
There are a couple of final points that should be made.<br />
<br />
The first is that Apertium is a work in progress. It was originally developed for closely-related Romance languages, and is now expanding into a translation platform for a much wider range of languages. By definition, this means that future work will involve trying to accommodate linguistic constructs that are new to the system. For instance, the mutation system in Celtic languages has been handled by a small addition to the dictionary format. This is challenging and exciting, but by the same token you should not expect that the Apertium team will have an easy (or indeed any!) answer to a particular problem. Be prepared to collaborate on developing Apertium to deal with that problem.<br />
<br />
The second is that your carefully-collected data is ''not'' just an input into Apertium. You can use it to produce an online dictionary for your language (see, for instance, [http://www.eurfa.org.uk Eurfa] for Welsh), and it can also be converted easily into a print dictionary using something like LaTeX. The data can be used to build a spelling checker or a grammar checker using the tools available from the [http://borel.slu.edu/gramadoir/index.html Gramadóir] project.<br />
<br />
Without language data, it is impossible to build language tools. So by putting together your datastore, you have already taken an enormous step towards making the riches of your language available to others.</div>83.104.99.209