Xml grep

From Apertium
Revision as of 07:57, 30 May 2014 by Unhammer (talk | contribs) (count(//@lm))

When working with XML, you'll often want to grep out an element that spans several lines. This can be hacked with awk or perl, but a more elegant solution is to use xmllint, the command-line front end to the parser in libxml2. libxml2 is a requirement when installing apertium, so it should be on your system already, though some distributions package xmllint separately (e.g. libxml2-utils on Debian and Ubuntu). This lets you use XPath 1.0 expressions to grep out full XML elements, without falling for the temptation to parse XML with regex.

Specifying the full path and the full pardef name:

$ xmllint --xpath '/dictionary/pardefs/pardef[@n="gen__apos"]' apertium-eng.eng.dix
<pardef n="gen__apos">
  <e>       <p><l/>          <r/></p></e>
  <e>       <p><l>'</l>         <r><j/>'<s n="gen"/></r></p></e>
</pardef>

For dix files you should get the same result with the shorter // form, which matches pardef elements at any depth in the document:

$ xmllint --xpath '//pardef[@n="gen__apos"]' apertium-eng.eng.dix
<pardef n="gen__apos">
  <e>       <p><l/>          <r/></p></e>
  <e>       <p><l>'</l>         <r><j/>'<s n="gen"/></r></p></e>
</pardef>
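
If you want to check that the two forms really agree, you can compare their output directly. The following is only a sketch: the sample file is made up for illustration and is not a real Apertium dictionary.

```shell
# Hypothetical sample file, made up for illustration; not a real Apertium dictionary.
sample=$(mktemp)
cat > "$sample" <<'EOF'
<dictionary>
  <pardefs>
    <pardef n="gen__apos">
      <e><p><l/><r/></p></e>
    </pardef>
    <pardef n="expensive__adj">
      <e><p><l/><r><s n="adj"/></r></p></e>
    </pardef>
  </pardefs>
</dictionary>
EOF

# Full path from the root element
full=$(xmllint --xpath '/dictionary/pardefs/pardef[@n="gen__apos"]' "$sample")
# Same element matched at any depth
short=$(xmllint --xpath '//pardef[@n="gen__apos"]' "$sample")

if [ "$full" = "$short" ]; then
  echo "same result"
else
  echo "different results"
fi
rm -f "$sample"
```

The two would only differ if a pardef with that name also appeared somewhere outside /dictionary/pardefs.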

You can also search for substrings by using the 'contains' function:

$ xmllint --xpath '//pardef[contains(@n,"_adj")]' apertium-eng.eng.dix
<pardef n="expensive__adj">
  <e>       <p><l/>          <r><s n="adj"/></r></p></e>
</pardef>
<pardef n="ca__adj">…
# etc; gives all the adj pardefs


To get all c attributes:

$ xmllint --xpath '//@c' apertium-eng.eng.dix

To get c attributes only from <e> elements:

$ xmllint --xpath '//e/@c' apertium-eng.eng.dix

To get all attributes of the e element whose lm attribute is "cake" (note the @, without it the predicate would look for a child element named lm):

$ xmllint --xpath '//e[@lm="cake"]/@*' apertium-eng.eng.dix


To get the second dictionary section:

$ xmllint --xpath '/dictionary/section[2]' apertium-eng.eng.dix

(or section[position()=2])


To count the lm attributes on <e> elements (which should equal the number of lemma entries you have):

$ xmllint --xpath 'count(//e/@lm)' apertium-eng.eng.dix
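
Since count() prints a plain number, the result is easy to capture in a shell variable and compare with other counts. A minimal sketch, assuming a made-up sample file (the entries below are hypothetical, not taken from apertium-eng):

```shell
# Hypothetical sample file, made up for illustration.
sample=$(mktemp)
cat > "$sample" <<'EOF'
<dictionary>
  <section id="main" type="standard">
    <e lm="cake"><i>cake</i><par n="house__n"/></e>
    <e lm="pie"><i>pie</i><par n="house__n"/></e>
    <e><re>[0-9]+</re><p><l/><r><s n="num"/></r></p></e>
  </section>
</dictionary>
EOF

# Number of <e> elements that carry an lm attribute
with_lm=$(xmllint --xpath 'count(//e/@lm)' "$sample")
# Number of <e> elements in total
all=$(xmllint --xpath 'count(//e)' "$sample")

echo "$with_lm of $all entries have an lm attribute"
rm -f "$sample"
```

Comparing the two numbers is a quick way to see how many entries (such as regex entries) have no lemma.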


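Some corpora are formatted in XML and put the real text contents inside a particular element. Say the corpus puts all text inside <sentence> elements directly under the root element. You can then grep the text out with the following (use //sentence/text() instead if the sentences are nested deeper):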

$ xmllint --xpath '*/sentence/text()' corpus.xml
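
To try this without a real corpus, here is a sketch with a made-up two-sentence file (the element names and contents are assumptions for illustration). How the matched text nodes are separated in the output can differ between libxml2 versions, so don't rely on one sentence per line.

```shell
# Hypothetical corpus file, made up for illustration.
corpus=$(mktemp)
cat > "$corpus" <<'EOF'
<corpus>
  <sentence id="1">First sentence.</sentence>
  <sentence id="2">Second sentence.</sentence>
</corpus>
EOF

# Text content of every <sentence> directly under the root element
texts=$(xmllint --xpath '*/sentence/text()' "$corpus")
printf '%s\n' "$texts"
rm -f "$corpus"
```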

But I want XML awk/sed/diff/patch/join/etc.!

To do more complex XML munging, you might want to install XMLStarlet.

