Difference between revisions of "Xml grep"

From Apertium
Revision as of 16:44, 29 August 2012

When working with XML, you'll often want to grep out an element that spans several lines. This can be hacked together with awk or perl, but a more elegant solution is to use xmllint, the command-line tool for the libxml2 parser (libxml2 is a requirement when installing Apertium, so it should already be installed on your system). Its --xpath option lets you use XPath expressions to grep out full XML elements.

Specifying the full path and the full pardef name:

$ xmllint --xpath '/dictionary/pardefs/pardef[@n="gen__apos"]' apertium-eo-en.en.dix
<pardef n="gen__apos">
  <e>       <p><l/>          <r/></p></e>
  <e>       <p><l>'</l>         <r><j/>'<s n="gen"/></r></p></e>
</pardef>

For dix files, the shorter //pardef form should give the same result, since // matches pardef elements at any depth in the document:

$ xmllint --xpath '//pardef[@n="gen__apos"]' apertium-eo-en.en.dix
<pardef n="gen__apos">
  <e>       <p><l/>          <r/></p></e>
  <e>       <p><l>'</l>         <r><j/>'<s n="gen"/></r></p></e>
</pardef>
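
If you want to do the same lookup from a script, the equivalence of the two paths can be sketched in Python with the standard-library ElementTree module, which supports a limited subset of XPath. The inline dictionary below is a made-up miniature modelled on the output above, not the real apertium-eo-en.en.dix, and the variable names are only illustrative.

```python
import xml.etree.ElementTree as ET

# Hypothetical miniature dix file, modelled on the xmllint output shown above.
SAMPLE_DIX = """<dictionary>
  <pardefs>
    <pardef n="gen__apos">
      <e><p><l/><r/></p></e>
      <e><p><l>'</l><r><j/>'<s n="gen"/></r></p></e>
    </pardef>
    <pardef n="expensive__adj">
      <e><p><l/><r><s n="adj"/></r></p></e>
    </pardef>
  </pardefs>
</dictionary>"""

root = ET.fromstring(SAMPLE_DIX)  # root is the <dictionary> element

# Path spelled out step by step from the root element.
by_full_path = root.findall("./pardefs/pardef[@n='gen__apos']")

# Descendant search, the ElementTree counterpart of //pardef[...].
by_descendant = root.findall(".//pardef[@n='gen__apos']")

# Both lists hold the very same element object.
same = by_full_path == by_descendant
print(same)  # prints True

# Serialise the match back to XML text, much as xmllint prints it.
print(ET.tostring(by_full_path[0], encoding="unicode").strip())
```

The two forms only diverge when pardef elements also occur somewhere other than directly under pardefs, which is not expected in a dix file.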

You can also search for substrings by using the 'contains' function:

$ xmllint --xpath '//pardef[contains(@n,"_adj")]' apertium-eo-en.en.dix
<pardef n="expensive__adj">
  <e>       <p><l/>          <r><s n="adj"/></r></p></e>
</pardef>
<pardef n="ca__adj">…
# etc; gives all the adj pardefs
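
Python's standard-library ElementTree does not implement the XPath contains function, so a script needs a different route to the same substring search. The following is a rough sketch that filters on the attribute value in plain Python. The sample dictionary and the helper name pardefs_containing are invented for illustration, as is the example__adj entry.

```python
import xml.etree.ElementTree as ET

# Hypothetical miniature dix file; "example__adj" is an invented entry.
SAMPLE_DIX = """<dictionary>
  <pardefs>
    <pardef n="gen__apos">
      <e><p><l/><r/></p></e>
    </pardef>
    <pardef n="expensive__adj">
      <e><p><l/><r><s n="adj"/></r></p></e>
    </pardef>
    <pardef n="example__adj">
      <e><p><l/><r><s n="adj"/></r></p></e>
    </pardef>
  </pardefs>
</dictionary>"""


def pardefs_containing(root, substring):
    """Return the pardef elements whose n attribute contains substring."""
    return [p for p in root.iter("pardef") if substring in p.get("n", "")]


root = ET.fromstring(SAMPLE_DIX)
adj_names = [p.get("n") for p in pardefs_containing(root, "_adj")]
print(adj_names)  # prints ['expensive__adj', 'example__adj']
```

Matches come back in document order, the same order in which xmllint prints them.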


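Some corpora are formatted in XML and put the actual text inside a particular element. If, say, the corpus puts all text inside <sentence> elements that sit directly under the root element, you can grep it out with the command below (use //sentence/text() instead if the <sentence> elements are nested more deeply):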

$ xmllint --xpath '*/sentence/text()' corpus.xml
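
For further processing in a script, the same extraction can be sketched with Python's standard-library ElementTree. The corpus below is an invented two-sentence sample, and it assumes, like the */sentence expression above, that the <sentence> elements are direct children of the root.

```python
import xml.etree.ElementTree as ET

# Hypothetical corpus: two sentences and one element we do not want.
SAMPLE_CORPUS = """<corpus>
  <sentence>First sentence.</sentence>
  <sentence>Second sentence.</sentence>
  <meta>not wanted</meta>
</corpus>"""

root = ET.fromstring(SAMPLE_CORPUS)

# "sentence" matches only direct children of the root, like */sentence.
# An empty <sentence/> has text None, so fall back to an empty string.
sentences = [(s.text or "") for s in root.findall("sentence")]

# One sentence per line.
print("\n".join(sentences))
```

Replacing "sentence" with ".//sentence" in findall would also pick up sentences nested at any depth, matching //sentence/text().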