Difference between revisions of "Xml grep"


Revision as of 07:57, 30 May 2014

When working with XML, you'll often want to grep out an element that spans several lines. This can be hacked with awk or perl, but a more elegant solution is to use the parser in libxml2 (which is required for installing Apertium, so it should already be on your system). Its xmllint tool lets you use XPath 1.0 expressions to grep out full XML elements, without falling for the temptation to parse XML with regex.

Specifying the full path and the full pardef name:

$ xmllint --xpath '/dictionary/pardefs/pardef[@n="gen__apos"]' apertium-eng.eng.dix
<pardef n="gen__apos">
  <e>       <p><l/>          <r/></p></e>
  <e>       <p><l>'</l>         <r><j/>'<s n="gen"/></r></p></e>
</pardef>

For dix files, the result should be the same if you use the abbreviated // path, which matches the element at any depth:

$ xmllint --xpath '//pardef[@n="gen__apos"]' apertium-eng.eng.dix
<pardef n="gen__apos">
  <e>       <p><l/>          <r/></p></e>
  <e>       <p><l>'</l>         <r><j/>'<s n="gen"/></r></p></e>
</pardef>

You can also search for substrings by using the 'contains' function:

$ xmllint --xpath '//pardef[contains(@n,"_adj")]' apertium-eng.eng.dix
<pardef n="expensive__adj">
  <e>       <p><l/>          <r><s n="adj"/></r></p></e>
</pardef>
<pardef n="ca__adj">…
# etc; gives all the adj pardefs


To get all c attributes:

$ xmllint --xpath '//@c' apertium-eng.eng.dix

To get c attributes only from <e> elements:

$ xmllint --xpath '//e/@c' apertium-eng.eng.dix

To get all attributes of the e element whose lm attribute is "cake":

$ xmllint --xpath '//e[@lm="cake"]/@*' apertium-eng.eng.dix
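
As a minimal sketch of an attribute predicate, the following builds a tiny made-up dictionary and reads one attribute of the matching entry. The file name sample.dix and its entries are assumptions for illustration, not taken from apertium-eng. Since lm is an attribute of <e>, the predicate needs the @ prefix; string() returns the value of the first matching node as plain text.

```shell
#!/bin/sh
# Hypothetical sample file; the entries are made up for illustration.
tmpdir=$(mktemp -d)
cat > "$tmpdir/sample.dix" <<'EOF'
<dictionary>
  <section id="main" type="standard">
    <e lm="cake" c="food"><i>cake</i><par n="house__n"/></e>
    <e lm="tea"><i>tea</i><par n="house__n"/></e>
  </section>
</dictionary>
EOF

# @lm tests the lm attribute; a bare lm would look for an <lm> child element.
cake_comment=$(xmllint --xpath 'string(//e[@lm="cake"]/@c)' "$tmpdir/sample.dix")
echo "$cake_comment"

rm -r "$tmpdir"
```

Wrapping the path in string() also keeps the output free of markup, which is handy when the value is passed on to another command.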


To get the second dictionary section:

$ xmllint --xpath '/dictionary/section[2]' apertium-eng.eng.dix

(or section[position()=2])


To count how many lm attributes (should equal how many lemmas) you have:

$ xmllint --xpath 'count(//e/@lm)' apertium-eng.eng.dix
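
The count can differ from the number of entries, because entries without an lm attribute are not counted. A small sketch of that difference, assuming a made-up sample.dix (the file and its entries are hypothetical):

```shell
#!/bin/sh
# Hypothetical sample file: two entries with lm, one regex entry without.
tmpdir=$(mktemp -d)
cat > "$tmpdir/sample.dix" <<'EOF'
<dictionary>
  <section id="main" type="standard">
    <e lm="cake"><i>cake</i><par n="house__n"/></e>
    <e lm="tea"><i>tea</i><par n="house__n"/></e>
    <e><re>[0-9]+</re><par n="num"/></e>
  </section>
</dictionary>
EOF

# count() returns a number, which xmllint prints on a line of its own.
entries=$(xmllint --xpath 'count(//e)' "$tmpdir/sample.dix")
lemmas=$(xmllint --xpath 'count(//e/@lm)' "$tmpdir/sample.dix")
echo "entries: $entries, with lm: $lemmas"

rm -r "$tmpdir"
```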


Some corpora are formatted in XML and put the actual text inside a particular element. If the corpus puts all text inside <sentence> elements, you can grep it out with:

$ xmllint --xpath '*/sentence/text()' corpus.xml
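
Depending on the libxml2 version, the text nodes may be printed without a separator between them. If you need exactly one sentence per line, one possible workaround is to count the sentences and fetch each one with string(). This is only a sketch: the corpus file below is made up, and calling xmllint once per sentence will be slow on a large corpus.

```shell
#!/bin/sh
# Hypothetical two-sentence corpus, made up for illustration.
tmpdir=$(mktemp -d)
cat > "$tmpdir/corpus.xml" <<'EOF'
<corpus>
  <sentence>The cake is expensive.</sentence>
  <sentence>The tea is cold.</sentence>
</corpus>
EOF

# Number of <sentence> elements anywhere in the document.
n=$(xmllint --xpath 'count(//sentence)' "$tmpdir/corpus.xml")

# string() prints the text of the i-th sentence followed by a newline.
i=1
while [ "$i" -le "$n" ]; do
  xmllint --xpath "string((//sentence)[$i])" "$tmpdir/corpus.xml"
  i=$((i + 1))
done > "$tmpdir/sentences.txt"

sentences=$(cat "$tmpdir/sentences.txt")
echo "$sentences"

rm -r "$tmpdir"
```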

But I want XML awk/sed/diff/patch/join/etc.!

To do more complex XML munging, you might want to install XMLStarlet.


External links