https://wiki.apertium.org/w/api.php?action=feedcontributions&user=Leftmostcat&feedformat=atomApertium - User contributions [en]2024-03-28T16:40:29ZUser contributionsMediaWiki 1.34.1https://wiki.apertium.org/w/index.php?title=PMC_proposals/Move_apertium_to_github&diff=43632PMC proposals/Move apertium to github2013-09-08T08:03:09Z<p>Leftmostcat: Grammar</p>
<hr />
<div>{{TOCD}}<br />
==Summary==<br />
<br />
git provides a large number of advantages over subversion, including a very good branching mechanism, offline commit history, a bisection tool for locating broken commits, and excellent merge/rebase capabilities. <br />
<br />
Making use of a service such as github would also allow each apertium module to live in a separate repository, with the possibility of creating central repositories (such as incubator) which link to all of the included modules. github also provides an issue tracker and a system for making commits in a personal fork of the upstream repository, then requesting that your changes be pulled into upstream. Note that apertium can retain its current method of allowing people to commit directly, while offering pull requests as an option for those who don't plan to contribute regularly. Sourceforge could be retained for mailing lists and similar services.<br />
<br />
Migration of the repositories from subversion to git should be relatively simple. Tools exist for creating git repositories from subversion while retaining all commit history. The migration should begin with smaller apertium modules, such as the contents of nursery and incubator. The more central modules, such as lttoolbox and apertium itself, can be moved last. Documentation will need to be updated, but a simple guide similar to https://wiki.gnome.org/TranslationProject/GitHowTo should be sufficient. Much of the information contained therein is probably not necessary for the apertium workflow, which makes for a simpler, easier-to-write document. For more complex requirements, the existing git documentation is excellent and there are many resources for a variety of git recipes. I will create a draft version of a document covering general apertium use prior to the beginning of the move.<br />
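One such tool is <code>git svn</code>, which is distributed with git. The following is a minimal sketch of converting a single module, assuming that each module is a directory directly under the project's SVN trunk. The <code>svn_to_git</code> helper and its dry-run default are illustrative only and are not part of any agreed migration procedure.<br />

```shell
#!/bin/sh
# Hypothetical helper: convert one module from the apertium SVN trunk into
# its own git repository, keeping that directory's commit history.
SVN_TRUNK="http://svn.code.sf.net/p/apertium/svn/trunk"

svn_to_git() {
    module=$1
    if [ "${DRY_RUN:-1}" = 1 ]; then
        # Default: only print the command so it can be reviewed first.
        echo "git svn clone $SVN_TRUNK/$module $module"
    else
        git svn clone "$SVN_TRUNK/$module" "$module"
    fi
}

svn_to_git lttoolbox
```

By default the script only prints the command; running it with <code>DRY_RUN=0</code> would perform the actual conversion, which can take a long time for modules with a long history.<br />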
<br />
Proposed by: [[User:Leftmostcat]]<br />
<br />
==Related reading==<br />
* Github Organizations: https://help.github.com/categories/2/articles<br />
* GH Org Teams: https://help.github.com/articles/how-do-i-set-up-a-team<br />
* Example of similar structure: https://github.com/metabrainz<br />
<br />
==In detail==<br />
<br />
==Caveats==<br />
* The svn repo contains several larger binaries and their history. The total sum of those would need to be cloned for every person who intends to seriously work with the subproject. A shallow clone (equivalent to svn checkout) can only be used for basic patchwork (''cannot clone, fetch, push into, or push from shallow clones''). See https://git.wiki.kernel.org/index.php/GitFaq#How_do_I_do_a_quick_clone_without_history_revisions.3F and following point. [[User:Tino Didriksen|Tino Didriksen]] 16:56, 6 September 2013 (UTC)<br />
*: Because of the ability to separate repositories, the impact of this would be minimized. To work on a language pair, it would only be necessary to clone the pair itself. —[[User:Leftmostcat|Leftmostcat]] 17:16, 6 September 2013 (UTC)<br />
*: I once (about 2 years ago?) tried to checkout all of apertium svn into one big git repo (ie. with full history). It took less space than the SVN checkout. --[[User:Unhammer|unhammer]] 07:59, 8 September 2013 (UTC)<br />
*: I would not recommend shallow clones, since 1) most apertiumers will be new to git, and it just adds more complexity 2) you typically don't save much drive space: http://blogs.gnome.org/simos/2009/04/18/git-clones-vs-shallow-git-clones/ 3) people will be checking out a repo at a time, not everything that was in SVN, and 4) maybe it's not such a bad thing that repos with many versions of big binaries stand out like a sore thumb ;-) --[[User:Unhammer|unhammer]] 07:59, 8 September 2013 (UTC)<br />
<br />
==Comments==<br />
The apertium github organisation is at https://github.com/apertium?tab=members<br />
<br />
==Voting==<br />
[[Category:Project Management Committee]]</div>Leftmostcathttps://wiki.apertium.org/w/index.php?title=PMC_proposals/Move_apertium_to_github&diff=43624PMC proposals/Move apertium to github2013-09-07T22:01:52Z<p>Leftmostcat: /* Summary */ Discuss documentation more fully</p>
<hr />
<div>{{TOCD}}<br />
==Summary==<br />
<br />
git provides a large number of advantages over subversion, including a very good branching mechanism, offline commit history, a bisection tool for locating broken commits, and excellent merge/rebase capabilities. Making use of a service such as github would also allow each apertium module to live in a separate repository, with the possibility of creating central repositories (such as incubator) which link to all of the included modules. github also provides an issue tracker and a system for making commits in a personal fork of the upstream repository, then requesting that your changes be pulled into upstream. Note that apertium can retain its current method of allowing people to commit directly, while offering pull requests as an option for those who don't plan to contribute regularly. Sourceforge could be retained for mailing lists and similar services.<br />
<br />
Migration from subversion to git should be relatively simple. Tools exist for creating git repositories from subversion while retaining all commit history. The migration should begin with smaller apertium modules, such as the contents of nursery and incubator. The more central modules, such as lttoolbox and apertium itself, can be moved last. Documentation will need to be updated, but a simple guide similar to https://wiki.gnome.org/TranslationProject/GitHowTo should be sufficient. Much of the information contained therein is probably not necessary for the apertium workflow, which makes for a simpler, easier-to-write document. For more complex requirements, the existing git documentation is excellent and there are many resources for a variety of git recipes. I will create a draft version of a document covering general apertium use prior to the beginning of the move.<br />
<br />
Proposed by: [[User:Leftmostcat]]<br />
<br />
==Related reading==<br />
* Github Organizations: https://help.github.com/categories/2/articles<br />
* GH Org Teams: https://help.github.com/articles/how-do-i-set-up-a-team<br />
* Example of similar structure: https://github.com/metabrainz<br />
<br />
==In detail==<br />
<br />
==Caveats==<br />
* The svn repo contains several larger binaries and their history. The total sum of those would need to be cloned for every person who intends to seriously work with the subproject. A shallow clone (equivalent to svn checkout) can only be used for basic patchwork (''cannot clone, fetch, push into, or push from shallow clones''). See https://git.wiki.kernel.org/index.php/GitFaq#How_do_I_do_a_quick_clone_without_history_revisions.3F and following point. [[User:Tino Didriksen|Tino Didriksen]] 16:56, 6 September 2013 (UTC)<br />
*: Because of the ability to separate repositories, the impact of this would be minimized. To work on a language pair, it would only be necessary to clone the pair itself. —[[User:Leftmostcat|Leftmostcat]] 17:16, 6 September 2013 (UTC)<br />
<br />
==Comments==<br />
<br />
<br />
==Voting==<br />
[[Category:Project Management Committee]]</div>Leftmostcathttps://wiki.apertium.org/w/index.php?title=PMC_proposals/Move_apertium_to_github&diff=43622PMC proposals/Move apertium to github2013-09-06T17:16:44Z<p>Leftmostcat: /* Caveats */ Followup</p>
<hr />
<div>{{TOCD}}<br />
==Summary==<br />
<br />
git provides a large number of advantages over subversion, including a very good branching mechanism, offline commit history, a bisection tool for locating broken commits, and excellent merge/rebase capabilities. Making use of a service such as github would also allow each apertium module to live in a separate repository, with the possibility of creating central repositories (such as incubator) which link to all of the included modules. github also provides an issue tracker and a system for making commits in a personal fork of the upstream repository, then requesting that your changes be pulled into upstream. Note that apertium can retain its current method of allowing people to commit directly, while offering pull requests as an option for those who don't plan to contribute regularly. Sourceforge could be retained for mailing lists and similar services.<br />
<br />
Migration from subversion to git should be relatively simple. Tools exist for creating git repositories from subversion while retaining all commit history. The migration should begin with smaller apertium modules, such as the contents of nursery and incubator. The more central modules, such as lttoolbox and apertium itself, can be moved last. Documentation can be updated, but a simple guide similar to https://wiki.gnome.org/TranslationProject/GitHowTo should be sufficient. Much of the information contained therein is probably not necessary for apertium workflow, making for a simpler, easier-to-write document. For more complex requirements, the existing git documentation is excellent and there are many resources for a variety of git recipes.<br />
<br />
Proposed by: [[User:Leftmostcat]]<br />
<br />
==Related reading==<br />
* Github Organizations: https://help.github.com/categories/2/articles<br />
* GH Org Teams: https://help.github.com/articles/how-do-i-set-up-a-team<br />
* Example of similar structure: https://github.com/metabrainz<br />
<br />
==In detail==<br />
<br />
==Caveats==<br />
* The svn repo contains several larger binaries and their history. The total sum of those would need to be cloned for every person who intends to seriously work with the subproject. A shallow clone (equivalent to svn checkout) can only be used for basic patchwork (''cannot clone, fetch, push into, or push from shallow clones''). See https://git.wiki.kernel.org/index.php/GitFaq#How_do_I_do_a_quick_clone_without_history_revisions.3F and following point. [[User:Tino Didriksen|Tino Didriksen]] 16:56, 6 September 2013 (UTC)<br />
*: Because of the ability to separate repositories, the impact of this would be minimized. To work on a language pair, it would only be necessary to clone the pair itself. —[[User:Leftmostcat|Leftmostcat]] 17:16, 6 September 2013 (UTC)<br />
<br />
==Comments==<br />
<br />
<br />
==Voting==</div>Leftmostcathttps://wiki.apertium.org/w/index.php?title=PMC_proposals/Move_apertium_to_github&diff=43618PMC proposals/Move apertium to github2013-09-06T16:33:46Z<p>Leftmostcat: /* Summary */ Clarify intentions regarding pull requests</p>
<hr />
<div>{{TOCD}}<br />
==Summary==<br />
<br />
git provides a large number of advantages over subversion, including a very good branching mechanism, offline commit history, a bisection tool for locating broken commits, and excellent merge/rebase capabilities. Making use of a service such as github would also allow each apertium module to live in a separate repository, with the possibility of creating central repositories (such as incubator) which link to all of the included modules. github also provides an issue tracker and a system for making commits in a personal fork of the upstream repository, then requesting that your changes be pulled into upstream. Note that apertium can retain its current method of allowing people to commit directly, while offering pull requests as an option for those who don't plan to contribute regularly. Sourceforge could be retained for mailing lists and similar services.<br />
<br />
Migration from subversion to git should be relatively simple. Tools exist for creating git repositories from subversion while retaining all commit history. The migration should begin with smaller apertium modules, such as the contents of nursery and incubator. The more central modules, such as lttoolbox and apertium itself, can be moved last. Documentation can be updated, but a simple guide similar to https://wiki.gnome.org/TranslationProject/GitHowTo should be sufficient. Much of the information contained therein is probably not necessary for apertium workflow, making for a simpler, easier-to-write document. For more complex requirements, the existing git documentation is excellent and there are many resources for a variety of git recipes.<br />
<br />
Proposed by: [[User:Leftmostcat]]<br />
<br />
==Related reading==<br />
* Github Organizations: https://help.github.com/categories/2/articles<br />
* GH Org Teams: https://help.github.com/articles/how-do-i-set-up-a-team<br />
<br />
==In detail==<br />
<br />
==Caveats==<br />
<br />
<br />
==Comments==<br />
<br />
<br />
==Voting==</div>Leftmostcathttps://wiki.apertium.org/w/index.php?title=PMC_proposals/Move_apertium_to_github&diff=43609PMC proposals/Move apertium to github2013-09-06T13:53:15Z<p>Leftmostcat: Add draft git proposal</p>
<hr />
<div>{{TOCD}}<br />
==Summary==<br />
<br />
git provides a large number of advantages over subversion, including a very good branching mechanism, offline commit history, a bisection tool for locating broken commits, and excellent merge/rebase capabilities. Making use of a service such as github would also allow for each apertium module to be in a separate repository, with the possibility for creating central repositories (such as incubator) which link to all of the included modules. github also provides an issue tracker and a system for making commits in a personal fork of the upstream repository, then requesting that your changes be pulled into upstream. Sourceforge could be retained for mailing lists and similar services.<br />
<br />
Migration from subversion to git should be relatively simple. Tools exist for creating git repositories from subversion while retaining all commit history. The migration should begin with smaller apertium modules, such as the contents of nursery and incubator. The more central modules, such as lttoolbox and apertium itself, can be moved last. Documentation can be updated, but a simple guide similar to https://wiki.gnome.org/TranslationProject/GitHowTo should be sufficient. Much of the information contained therein is probably not necessary for apertium workflow, making for a simpler, easier-to-write document. For more complex requirements, the existing git documentation is excellent and there are many resources for a variety of git recipes.<br />
<br />
Proposed by: [[User:Leftmostcat]]<br />
<br />
==In detail==<br />
<br />
<br />
==Caveats==<br />
<br />
<br />
==Comments==<br />
<br />
<br />
==Voting==</div>Leftmostcathttps://wiki.apertium.org/w/index.php?title=Apertium_on_Fedora&diff=43349Apertium on Fedora2013-08-19T19:06:40Z<p>Leftmostcat: /* Installing the newest version from SVN */ Update URL</p>
<hr />
<div>Installing Apertium on Fedora is similar to installing it on other distributions.<br />
<br />
==Installing the newest version from SVN ==<br />
Open a Terminal window, authenticate as root and go through the following steps.<br />
<br />
Step 0: '''Run a system update.''' (Optional)<br />
<pre>su -<br />
yum update</pre><br />
This is optional, but recommended.<br />
<br />
Step 1: '''Install the prerequisites.'''<br />
<pre>yum install subversion make gcc gcc-c++ pcre-devel libxml2-devel flex libtool automake autoconf</pre><br />
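Before moving on, it can help to confirm that the build tools are actually on the PATH. The sketch below assumes the usual command names provided by those packages (for example <code>svn</code> for subversion and <code>libtoolize</code> for libtool); the <code>missing_tools</code> helper is hypothetical and not part of Apertium.<br />

```shell
#!/bin/sh
# Print each named command that is not found on PATH.
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || echo "$tool"
    done
}

# Commands expected from the packages installed in step 1 (assumed names).
missing=$(missing_tools svn make gcc g++ flex libtoolize automake autoconf)
if [ -n "$missing" ]; then
    echo "Not found on PATH:" $missing
else
    echo "All build tools found"
fi
```

The development headers (pcre-devel, libxml2-devel) are not commands, so this check does not cover them.<br />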
<br />
Step 2: '''Download apertium, lttoolbox and language pairs from SVN.'''<br />
<pre><br />
svn co http://svn.code.sf.net/p/apertium/svn/trunk apertium<br />
</pre><br />
''Note'': The above checkout will download lots of files with all the released language pairs. If you have limited bandwidth or disk space (or time), please follow the [[Minimal installation from SVN]] instead.<br />
<br />
Step 3: '''Compile and install lttoolbox.'''<br />
<pre><br />
cd apertium<br />
cd lttoolbox/<br />
PKG_CONFIG_PATH=/usr/local/lib/pkgconfig ./autogen.sh<br />
make<br />
make install<br />
ldconfig<br />
</pre><br />
<br />
Step 4: '''Compile and install apertium.'''<br />
<pre><br />
cd ..<br />
cd apertium/<br />
PKG_CONFIG_PATH=/usr/local/lib/pkgconfig ./autogen.sh<br />
make<br />
make install<br />
ldconfig<br />
</pre><br />
<br />
Step 5: '''You can now compile the language pairs that you want to use.''' It's the same procedure for every pair. <br />
<br />
''Note: we give an example with apertium-fr-es''<br />
<pre><br />
cd ..<br />
cd apertium-fr-es/<br />
PKG_CONFIG_PATH=/usr/local/lib/pkgconfig ./autogen.sh<br />
make<br />
make install<br />
</pre><br />
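Since the procedure is the same for every pair, it can be wrapped in a small function. This is only a sketch: <code>build_steps</code> and the <code>RUN</code> variable are hypothetical names, and by default the function prints the commands rather than running them.<br />

```shell
#!/bin/sh
# Sketch: print the build steps for each module directory given, in order.
# Set RUN=1 to execute them instead of printing (make install needs root).
build_steps() {
    for dir in "$@"; do
        for step in "PKG_CONFIG_PATH=/usr/local/lib/pkgconfig ./autogen.sh" "make" "make install"; do
            if [ "${RUN:-0}" = 1 ]; then
                # Run each step inside the module directory; stop on failure.
                (cd "$dir" && eval "$step") || return 1
            else
                echo "(cd $dir && $step)"
            fi
        done
    done
}

build_steps apertium-fr-es
```

Several pairs can be passed at once, for example <code>build_steps apertium-fr-es apertium-es-ca</code>, where the second directory name is just an illustration.<br />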
<br />
Step 6: '''Try it out.'''<br />
<pre><br />
echo "J'ai deux frères" | apertium fr-es<br />
</pre><br />
<br />
==Troubleshooting==<br />
<br />
===PCRE missing===<br />
<br />
If you see the following message, or something similar, in the terminal...<br />
<br />
<pre>checking for pcreposix.h... no<br />
configure: error: *** unable to locate pcreposix.h include<br />
file ***</pre><br />
<br />
...then you will need to install '''pcre-devel''':<br />
<pre>yum install pcre-devel</pre><br />
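To check whether the header is present before re-running the configure step, something like the following can be used. The include directories searched here are an assumption about a typical layout, and <code>have_header</code> is a hypothetical helper.<br />

```shell
#!/bin/sh
# Succeed if the header exists in any of the given include directories.
have_header() {
    header=$1
    shift
    for dir in "$@"; do
        [ -f "$dir/$header" ] && return 0
    done
    return 1
}

# Assumed default include locations; adjust for your system.
if have_header pcreposix.h /usr/include /usr/local/include; then
    echo "pcreposix.h found"
else
    echo "pcreposix.h missing: install pcre-devel"
fi
```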
<br />
[[Category:Installation]]<br />
[[Category:Documentation in English]]</div>Leftmostcathttps://wiki.apertium.org/w/index.php?title=Calculating_coverage&diff=12848Calculating coverage2009-06-02T16:47:29Z<p>Leftmostcat: Better transwiki removal</p>
<hr />
<div>Notes on calculating coverage from wikipedia dumps (based on [[Asturian#Calculating coverage]]). <br />
<br />
(Mac OS X `sed' doesn't allow \n in replacements, so I just use an actual (escaped) newline...)<br />
<br />
wikicat.sh:<br />
<pre><br />
#!/bin/sh<br />
# Clean up wikitext for running through apertium-destxt<br />
<br />
# awk prints full lines, make sure each html element has one<br />
bzcat "$@" | sed 's/>/>\<br />
/g' | sed 's/</\<br />
</g' |\<br />
# want only stuff between <text...> and </text><br />
awk '<br />
/<text.*>/,/<\/text>/ { print $0 }<br />
' |\<br />
sed 's/\./ /g' |\<br />
# Drop all transwiki links<br />
sed 's/\[\[\([a-z]\{2,3\}\|bat-smg\|be-x-old\|cbk-zam\|fiu-vro\|map-bms\|nds-nl\|roa-rup\|roa-tara\|simple\|zh-classical\|zh-min-nan\|zh-yue\):[^]]\+\]\]//g' |\<br />
# wiki markup, retain bar and fie from [[foo|bar]] [[fie]]<br />
sed 's/\[\[[^]|]*|//g' | sed 's/\]\]//g' | sed 's/\[\[//g' |\<br />
# wiki markup, retain `bar fie' from [http://foo bar fie] and remove [http://foo]<br />
sed 's/\[http[^ ]*\([^]]*\)\]/\1/g' |\<br />
# remove entities<br />
sed 's/&[^;]*;/ /g' |\<br />
# and put space instead of punctuation<br />
sed 's/[;:?,]/ /g' |\<br />
# Keep only lines starting with a capital letter, removing tables with style info etc.<br />
grep '^[ ]*[A-ZÆØÅ]' # Your alphabet here<br />
</pre><br />
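The effect of the link-handling steps is easiest to see on a single line. The sketch below applies them to a made-up sample. The transwiki step is written with <code>sed -E</code> and a shortened list of language codes, as an assumed more portable variant of the expression in wikicat.sh; the other expressions are the ones from the script.<br />

```shell
#!/bin/sh
# Sketch: the link-handling steps of wikicat.sh applied to one sample line.
strip_links() {
    # Drop transwiki links such as [[en:Foo]]. The -E form and the shortened
    # code list are assumptions; the full list is in wikicat.sh.
    sed -E 's/\[\[([a-z]{2,3}|nds-nl|simple):[^]]+\]\]//g' |
    # Keep "bar" and "fie" from [[foo|bar]] and [[fie]].
    sed 's/\[\[[^]|]*|//g' | sed 's/\]\]//g' | sed 's/\[\[//g' |
    # Keep the label of [http://foo label]; drop a bare [http://foo].
    sed 's/\[http[^ ]*\([^]]*\)\]/\1/g'
}

printf '%s\n' 'See [[foo|bar]] and [[fie]][[en:Foo]] at [http://example.org the site].' | strip_links
```

Note that the external-link step keeps the space in front of the label, so the result contains two consecutive spaces before "the site".<br />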
<br />
count-tokenized.sh:<br />
<pre><br />
#!/bin/sh<br />
# http://wiki.apertium.org/wiki/Asturian#Calculating_coverage<br />
<br />
# Calculate the number of tokenised words in the corpus:<br />
apertium-destxt | lt-proc $1 |apertium-retxt |\<br />
# for some reason putting the newline in directly doesn't work, so two seds<br />
sed 's/\$[^^]*\^/$^/g' | sed 's/\$\^/$\<br />
^/g' <br />
</pre><br />
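As an illustration of what the two sed commands do, the sketch below runs them on a hand-written line in the style of lt-proc output. The analyses and the unknown word "zorp" are invented for the example, and <code>split_tokens</code> is a hypothetical name.<br />

```shell
#!/bin/sh
# Put each ^...$ lexical unit on its own line, as count-tokenized.sh does.
split_tokens() {
    # First drop whatever sits between one unit's $ and the next unit's ^,
    # then insert a newline between the two.
    sed 's/\$[^^]*\^/$^/g' | sed 's/\$\^/$\
^/g'
}

# Hand-written sample; "zorp" stands in for an unknown word (marked with *).
printf '%s\n' '^the/the<det><def><sp>$ ^cat/cat<n><sg>$ ^zorp/*zorp$' | split_tokens
```

Each lexical unit ends up on its own line, so <code>wc -l</code> on the result counts tokens.<br />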
<br />
To find all tokens from a wiki dump:<br />
<pre><br />
$ ./wikicat.sh nnwiki-20090119-pages-articles.xml.bz2 > nnwiki.cleaned.txt<br />
$ cat nnwiki.cleaned.txt | ./count-tokenized.sh nn-nb.automorf.bin | wc -l<br />
</pre><br />
To find all tokens with at least one analysis (naïve coverage):<br />
<pre><br />
$ cat nnwiki.cleaned.txt | ./count-tokenized.sh nn-nb.automorf.bin | grep -v '\/\*' | wc -l<br />
</pre><br />
To find the top unknown tokens:<br />
<pre><br />
$ cat nnwiki.cleaned.txt | ./count-tokenized.sh nn-nb.automorf.bin | sed 's/[ ]*//g' | # tab or space<br />
grep '\/\*' | sort -f | uniq -c | sort -gr | head <br />
<br />
</pre><br />
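The two counts above can also be taken in a single pass. The following sketch assumes its input already has one lexical unit per line, with unknown words marked by <code>/*</code>; the <code>coverage</code> function name and the sample tokens are made up for illustration.<br />

```shell
#!/bin/sh
# Count tokens and tokens with at least one analysis, then print coverage.
# Input: one ^surface/analysis$ unit per line; unknown words carry "/*".
coverage() {
    awk '
        /\/\*/ { unknown++ }
        END {
            known = NR - unknown
            # Prints: total tokens, known tokens, coverage in percent.
            if (NR > 0) printf "%d %d %.1f\n", NR, known, 100 * known / NR
        }
    '
}

printf '%s\n' '^the/the<det>$' '^cat/cat<n>$' '^sat/sit<vblex>$' '^zorp/*zorp$' | coverage
```

For the four sample tokens this prints the total, the number of known tokens and the naïve coverage in percent.<br />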
<br />
== Script ready to run ==<br />
<br />
corpus-stat.sh<br />
<pre><br />
#!/bin/sh<br />
# http://wiki.apertium.org/wiki/Asturian#Calculating_coverage<br />
<br />
<br />
# Example use:<br />
# zcat corpa/en.crp.txt.gz | sh corpus-stat.sh<br />
<br />
<br />
#CMD="cat corpa/en.crp.txt"<br />
CMD="cat"<br />
<br />
F=/tmp/corpus-stat-res.txt<br />
<br />
# Calculate the number of tokenised words in the corpus:<br />
# for some reason putting the newline in directly doesn't work, so two seds<br />
$CMD | apertium-destxt | lt-proc en-eo.automorf.bin |apertium-retxt | sed 's/\$[^^]*\^/$^/g' | sed 's/\$\^/$\<br />
^/g' > $F<br />
<br />
NUMWORDS=`cat $F | wc -l`<br />
echo "Number of tokenised words in the corpus: $NUMWORDS"<br />
<br />
<br />
<br />
# Calculate the number of words that are not unknown<br />
<br />
NUMKNOWNWORDS=`cat $F | grep -v '\*' | wc -l`<br />
echo "Number of known words in the corpus: $NUMKNOWNWORDS"<br />
<br />
<br />
# Calculate the coverage<br />
<br />
COVERAGE=`calc "round($NUMKNOWNWORDS/$NUMWORDS*1000)/10"`<br />
echo "Coverage: $COVERAGE %"<br />
<br />
<br />
# Show the top-ten unknown words.<br />
<br />
echo "Top unknown words in the corpus:"<br />
cat $F | grep '\*' | sort -f | uniq -c | sort -gr | head -10<br />
<br />
</pre><br />
Sample output:<br />
<pre><br />
$ zcat corpa/en.crp.txt.gz | sh corpus-stat.sh<br />
Number of tokenised words in the corpus: 478187<br />
Number of known words in the corpus: 450255<br />
Coverage: 94.2 %<br />
Top unknown words in the corpus:<br />
191 ^Apollo/*Apollo$<br />
104 ^Aramaic/*Aramaic$<br />
91 ^Alberta/*Alberta$<br />
81 ^de/*de$<br />
80 ^Abu/*Abu$<br />
63 ^Bakr/*Bakr$<br />
62 ^Agassi/*Agassi$<br />
59 ^Carnegie/*Carnegie$<br />
58 ^Agrippina/*Agrippina$<br />
58 ^Achilles/*Achilles$<br />
56 ^Adelaide/*Adelaide$<br />
</pre><br />
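The script relies on <code>calc</code>, which is not installed everywhere. As a hedged alternative, the same percentage can be computed with awk. The <code>percent</code> function below is a hypothetical replacement for the <code>calc</code> line, checked here against the two counts in the sample output.<br />

```shell
#!/bin/sh
# Coverage in percent with one decimal place, using awk instead of calc.
# Arguments: number of known words, total number of words.
percent() {
    awk -v known="$1" -v total="$2" 'BEGIN { printf "%.1f\n", 100 * known / total }'
}

percent 450255 478187
```

For the sample counts this gives the same 94.2 as in the output above; rounding of exact halves may differ slightly between awk's <code>printf</code> and <code>calc</code>'s <code>round</code>.<br />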
<br />
[[Category:Documentation]]</div>Leftmostcathttps://wiki.apertium.org/w/index.php?title=Calculating_coverage&diff=12631Calculating coverage2009-05-25T13:50:43Z<p>Leftmostcat: New wikicat.sh; fixes greediness, kills transwikis, puts comments where they'd typically be expected</p>
<hr />
<div>Notes on calculating coverage from wikipedia dumps (based on [[Asturian#Calculating coverage]]). <br />
<br />
(Mac OS X `sed' doesn't allow \n in replacements, so I just use an actual (escaped) newline...)<br />
<br />
wikicat.sh:<br />
<pre><br />
#!/bin/sh<br />
# Clean up wikitext for running through apertium-destxt<br />
<br />
# awk prints full lines, make sure each html element has one<br />
bzcat "$@" | sed 's/>/>\<br />
/g' | sed 's/</\<br />
</g' |\<br />
# want only stuff between <text...> and </text><br />
awk '<br />
/<text.*>/,/<\/text>/ { print $0 }<br />
' |\<br />
sed 's/\./ /g' |\<br />
# Drop all transwiki links<br />
sed 's/\[\[\([a-z]\{2,3\}\):[^]]\+\]\]//g' |\<br />
# wiki markup, retain bar and fie from [[foo|bar]] [[fie]]<br />
sed 's/\[\[[^]|]*|//g' | sed 's/\]\]//g' | sed 's/\[\[//g' |\<br />
# remove entities<br />
sed 's/&[^;]*;/ /g' |\<br />
# and put space instead of punctuation<br />
sed 's/[;:?,]/ /g' |\<br />
# Keep only lines starting with a capital letter, removing tables with style info etc.<br />
grep '^[ ]*[A-ZÆØÅ]' # Your alphabet here<br />
</pre><br />
<br />
count-tokenized.sh:<br />
<pre><br />
#!/bin/sh<br />
# http://wiki.apertium.org/wiki/Asturian#Calculating_coverage<br />
<br />
# Calculate the number of tokenised words in the corpus:<br />
apertium-destxt | lt-proc $1 |apertium-retxt |\<br />
# for some reason putting the newline in directly doesn't work, so two seds<br />
sed 's/\$[^^]*\^/$^/g' | sed 's/\$\^/$\<br />
^/g' <br />
</pre><br />
<br />
To find all tokens from a wiki dump:<br />
<pre><br />
$ ./wikicat.sh nnwiki-20090119-pages-articles.xml.bz2 > nnwiki.cleaned.txt<br />
$ cat nnwiki.cleaned.txt | ./count-tokenized.sh nn-nb.automorf.bin | wc -l<br />
</pre><br />
To find all tokens with at least one analysis (naïve coverage):<br />
<pre><br />
$ cat nnwiki.cleaned.txt | ./count-tokenized.sh nn-nb.automorf.bin | grep -v '\/\*' | wc -l<br />
</pre><br />
To find the top unknown tokens:<br />
<pre><br />
$ cat nnwiki.cleaned.txt | ./count-tokenized.sh nn-nb.automorf.bin | sed 's/[ ]*//g' | # tab or space<br />
grep '\/\*' | sort -f | uniq -c | sort -gr | head <br />
<br />
</pre><br />
<br />
== Script ready to run ==<br />
<br />
corpus-stat.sh<br />
<pre><br />
#!/bin/sh<br />
# http://wiki.apertium.org/wiki/Asturian#Calculating_coverage<br />
<br />
<br />
# Example use:<br />
# zcat corpa/en.crp.txt.gz | sh corpus-stat.sh<br />
<br />
<br />
#CMD="cat corpa/en.crp.txt"<br />
CMD="cat"<br />
<br />
F=/tmp/corpus-stat-res.txt<br />
<br />
# Calculate the number of tokenised words in the corpus:<br />
# for some reason putting the newline in directly doesn't work, so two seds<br />
$CMD | apertium-destxt | lt-proc en-eo.automorf.bin |apertium-retxt | sed 's/\$[^^]*\^/$^/g' | sed 's/\$\^/$\<br />
^/g' > $F<br />
<br />
NUMWORDS=`cat $F | wc -l`<br />
echo "Number of tokenised words in the corpus: $NUMWORDS"<br />
<br />
<br />
<br />
# Calculate the number of words that are not unknown<br />
<br />
NUMKNOWNWORDS=`cat $F | grep -v '\*' | wc -l`<br />
echo "Number of known words in the corpus: $NUMKNOWNWORDS"<br />
<br />
<br />
# Calculate the coverage<br />
<br />
COVERAGE=`calc "round($NUMKNOWNWORDS/$NUMWORDS*1000)/10"`<br />
echo "Coverage: $COVERAGE %"<br />
<br />
<br />
# Show the top-ten unknown words.<br />
<br />
echo "Top unknown words in the corpus:"<br />
cat $F | grep '\*' | sort -f | uniq -c | sort -gr | head -10<br />
<br />
</pre><br />
Sample output:<br />
<pre><br />
$ zcat corpa/en.crp.txt.gz | sh corpus-stat.sh<br />
Number of tokenised words in the corpus: 478187<br />
Number of known words in the corpus: 450255<br />
Coverage: 94.2 %<br />
Top unknown words in the corpus:<br />
191 ^Apollo/*Apollo$<br />
104 ^Aramaic/*Aramaic$<br />
91 ^Alberta/*Alberta$<br />
81 ^de/*de$<br />
80 ^Abu/*Abu$<br />
63 ^Bakr/*Bakr$<br />
62 ^Agassi/*Agassi$<br />
59 ^Carnegie/*Carnegie$<br />
58 ^Agrippina/*Agrippina$<br />
58 ^Achilles/*Achilles$<br />
56 ^Adelaide/*Adelaide$<br />
</pre><br />
<br />
[[Category:Documentation]]</div>Leftmostcathttps://wiki.apertium.org/w/index.php?title=User:Leftmostcat/Application&diff=10952User:Leftmostcat/Application2009-03-23T17:54:13Z<p>Leftmostcat: New page: I am applying to work on the ga-gd language pair for Google Summer of Code 2009. I have a shiny [http://www.leftmostcat.net/gsocapp2009.pdf application] w...</p>
<hr />
<div>I am applying to work on the [[Scottish Gaelic and Irish|ga-gd language pair]] for [[Google Summer of Code]] 2009. I have a shiny [http://www.leftmostcat.net/gsocapp2009.pdf application] written in LaTeX.</div>Leftmostcathttps://wiki.apertium.org/w/index.php?title=User:Leftmostcat&diff=10951User:Leftmostcat2009-03-23T17:52:16Z<p>Leftmostcat: Make an application page so spectie is happy</p>
<hr />
<div>My name is Sean Burke, though I also go by Seán de Búrca on some projects. I am working on an undergraduate degree in linguistics at the University of Montana in the United States. My primary interest is in the Celtic languages, especially Irish. I am currently finishing a year abroad studying Irish at University College Cork, Ireland. I am frequently in the Apertium IRC channel as Leftmost.<br />
<br />
I am applying for the 2009 [[Google Summer of Code]]. Feel free to have a gander at my [[/Application|application]].</div>Leftmostcathttps://wiki.apertium.org/w/index.php?title=User:Leftmostcat&diff=10950User:Leftmostcat2009-03-23T17:50:30Z<p>Leftmostcat: New page: My name is Sean Burke, though I also go by Seán de Búrca on some projects. I am working on an undergraduate degree in linguistics at the University of Montana in the United States. My pr...</p>
<hr />
<div>My name is Sean Burke, though I also go by Seán de Búrca on some projects. I am working on an undergraduate degree in linguistics at the University of Montana in the United States. My primary interest is in the Celtic languages, especially Irish. I am currently finishing a year abroad studying Irish at University College Cork, Ireland. I am frequently in the Apertium IRC channel as Leftmost.<br />
<br />
I am applying for the 2009 [[Google Summer of Code]] to work on the [[Scottish Gaelic and Irish|ga-gd language pair]]. My application is to be found here: [http://www.leftmostcat.net/gsocapp2009.pdf].</div>Leftmostcat