The Right Way to count dix stems

Revision as of 23:41, 7 January 2014

This page documents how to count stems in dix files. There's a ready-made script that does it available at http://svn.code.sf.net/p/apertium/svn/trunk/apertium-tools/dixcounter.py.

How-to

First, we want to load the XML data as a tree. Extracting information from an XML file with regular expressions is cumbersome and often impossible. XML was designed to make storing and retrieving information easy, and a proper parser is a much neater way to do it.

Here is a quick guide to processing XML in Python:

import xml.etree.ElementTree as xml
tree = xml.parse("dictionary.dix")  # placeholder path: point this at your own dix file
root = tree.getroot()               # the <dictionary> element

The <dictionary> tag becomes the root of the tree, and its child tags become child nodes of the root.
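
As a small illustration of that tree shape, here is a sketch that parses a tiny, made-up dictionary string (the entries are invented for this example and do not come from any real Apertium dix) and prints the root tag and the tags of its children:

```python
import xml.etree.ElementTree as ET

# A tiny, made-up dictionary used only for illustration.
SAMPLE = """<dictionary>
  <alphabet>abc</alphabet>
  <sdefs><sdef n="n"/></sdefs>
  <section id="main" type="standard">
    <e lm="cat"><i>cat</i><par n="house__n"/></e>
  </section>
</dictionary>"""

root = ET.fromstring(SAMPLE)          # parse from a string instead of a file
child_tags = [child.tag for child in root]

print(root.tag)     # tag of the root element
print(child_tags)   # tags of the root's direct children
```

Here ET.fromstring() is used only because the sample lives in a string; for a file on disk, ET.parse(path).getroot() gives the same kind of root element.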

For bilingual dictionaries we want to count the number of word pairs. We have <l></l> and <r></r> tags for a pair inside the <e></e> tag.

For monolingual dictionaries we want to count the number of lemmas. That means <e> tags with an lm attribute.

We use findall() (on the tree or on its root element) to get all matching elements, and then len() to get the size of the resulting list.

So the command for the bilingual dictionaries:

len(tree.findall("*[@id='main']/e//l"))

(we choose all <l> tags which are inside <e></e> tags inside section with 'main' id)
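
To see what this query selects, here is a minimal sketch on a made-up bidix fragment. The word pairs and the second section id ("extra") are assumptions invented for the example. It also shows a variant path, "section/e//l", that counts pairs in every section rather than only the one with id 'main':

```python
import xml.etree.ElementTree as ET

# Made-up bidix fragment; entries and section ids are illustrative only.
SAMPLE_BIDIX = """<dictionary>
  <section id="main" type="standard">
    <e><p><l>cat<s n="n"/></l><r>gato<s n="n"/></r></p></e>
    <e><p><l>dog<s n="n"/></l><r>perro<s n="n"/></r></p></e>
  </section>
  <section id="extra" type="standard">
    <e><p><l>bird<s n="n"/></l><r>ave<s n="n"/></r></p></e>
  </section>
</dictionary>"""

root = ET.fromstring(SAMPLE_BIDIX)

# <l> tags inside <e> tags inside the element whose id is 'main'.
main_pairs = len(root.findall("*[@id='main']/e//l"))

# Same idea, but across every <section>, whatever its id.
all_pairs = len(root.findall("section/e//l"))

print(main_pairs, all_pairs)
```

On this sample the first count covers only the two entries in the 'main' section, while the second also includes the entry in the other section.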

and for the monolingual dictionaries:

len(tree.findall("section/*[@lm]"))

(we choose all e tags with an 'lm' attribute inside any section, not just the section with id="main", because there can be other sections too)
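
The difference between counting in any section and counting only in 'main' can be sketched on a made-up monodix fragment. The lemmas, paradigm names and the second section id ("extra") are assumptions for illustration only:

```python
import xml.etree.ElementTree as ET

# Made-up monodix fragment; lemmas, paradigms and section ids are illustrative.
SAMPLE_MONODIX = """<dictionary>
  <pardefs>
    <pardef n="house__n"><e><p><l/><r><s n="n"/></r></p></e></pardef>
  </pardefs>
  <section id="main" type="standard">
    <e lm="house"><i>house</i><par n="house__n"/></e>
    <e lm="cat"><i>cat</i><par n="house__n"/></e>
    <e><re>[0-9]+</re><par n="num"/></e>
  </section>
  <section id="extra" type="standard">
    <e lm="tree"><i>tree</i><par n="house__n"/></e>
  </section>
</dictionary>"""

root = ET.fromstring(SAMPLE_MONODIX)

# Entries with an lm attribute in any <section>.
any_section = len(root.findall("section/*[@lm]"))

# Entries with an lm attribute only in the element with id 'main'.
main_only = len(root.findall("*[@id='main']/*[@lm]"))

print(any_section, main_only)
```

The entry without an lm attribute and the entry inside the pardef are not counted by either query; the entry in the second section is counted only by the first.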

A query restricted to id="main" only counts that section; some dix files have lots of content words in other sections.
Alternatively, you can just use xmllint on the command line: xmllint --xpath 'count(//section/e)' *dix (or 'count(//section[@id="main"]/*[@lm])' or 'count(//section[@id="main"]/e/p/l)').

Ready-made

All the above is implemented in the dixcounter.py script, which will detect if a dictionary is a monodix or bidix and count accordingly.
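
The sketch below is not the dixcounter.py code. It is a hypothetical heuristic written for this page to show one way such detection could work, under the assumption that a dictionary is a monodix if any entry in a section carries an lm attribute, and a bidix otherwise. The function name count_stems and both sample strings are invented for the example:

```python
import xml.etree.ElementTree as ET
from typing import Tuple


def count_stems(dix_xml: str) -> Tuple[str, int]:
    """Hypothetical heuristic, not the logic of dixcounter.py.

    Assumes: entries with an lm attribute mean monodix; otherwise bidix.
    """
    root = ET.fromstring(dix_xml)
    lemmas = root.findall("section/*[@lm]")
    if lemmas:
        # Monodix: count lemmas in all sections.
        return "monodix", len(lemmas)
    # Bidix: count <l> tags inside <e> tags in all sections.
    return "bidix", len(root.findall("section/e//l"))


# Made-up samples for illustration only.
MONO = '<dictionary><section id="main"><e lm="cat"><i>cat</i></e></section></dictionary>'
BIDI = '<dictionary><section id="main"><e><p><l>cat</l><r>gato</r></p></e></section></dictionary>'

mono_result = count_stems(MONO)
bidi_result = count_stems(BIDI)
print(mono_result, bidi_result)
```

For real counts, prefer the script itself, since its detection rules may differ from this assumption.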

See Also

The Right Way to count lexc stems