Difference between revisions of "Xml grep"
Revision as of 16:44, 29 August 2012
When working with XML, you'll often want to grep out an element that spans several lines. This can be hacked with awk or Perl, but a more elegant solution is xmllint, the command-line front end to the libxml2 parser. libxml2 is a requirement when installing Apertium, so it should already be on your system, although some distributions package xmllint separately (libxml2-utils on Debian and Ubuntu). Its --xpath option lets you use XPath 1.0 expressions to grep out full XML elements.
Specifying the full path and the full pardef name:
<pre>
$ xmllint --xpath '/dictionary/pardefs/pardef[@n="gen__apos"]' apertium-eo-en.en.dix
<pardef n="gen__apos">
  <e> <p><l/> <r/></p></e>
  <e> <p><l>'</l> <r><j/>'<s n="gen"/></r></p></e>
</pardef>
</pre>
For dix files, the shorter expression //pardef, which matches pardef elements at any depth, returns the same element as the full path:
<pre>
$ xmllint --xpath '//pardef[@n="gen__apos"]' apertium-eo-en.en.dix
<pardef n="gen__apos">
  <e> <p><l/> <r/></p></e>
  <e> <p><l>'</l> <r><j/>'<s n="gen"/></r></p></e>
</pardef>
</pre>
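In a script you may only want to know whether a pardef exists at all. The sketch below assumes a reasonably recent xmllint, whose man page documents a non-zero exit code (10) when the XPath node set is empty. The file sample.dix and the has_pardef helper are made up for illustration.

```shell
# Build a tiny stand-in dictionary (hypothetical content, not a real dix file)
tmp=$(mktemp -d)
cat > "$tmp/sample.dix" <<'EOF'
<dictionary>
  <pardefs>
    <pardef n="gen__apos">
      <e> <p><l/> <r/></p></e>
    </pardef>
  </pardefs>
</dictionary>
EOF

# has_pardef NAME FILE: succeeds if a pardef with exactly that name exists.
# Assumption: xmllint returns a non-zero status for an empty node set.
has_pardef() {
    xmllint --xpath "//pardef[@n=\"$1\"]" "$2" >/dev/null 2>&1
}

if has_pardef gen__apos "$tmp/sample.dix"; then found_gen=yes; else found_gen=no; fi
if has_pardef no__such "$tmp/sample.dix"; then found_none=yes; else found_none=no; fi
echo "gen__apos: $found_gen, no__such: $found_none"
```

The helper discards xmllint's output and keeps only the exit status, so it can be used directly in an if statement.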
You can also search for substrings by using the 'contains' function:
<pre>
$ xmllint --xpath '//pardef[contains(@n,"_adj")]' apertium-eo-en.en.dix
<pardef n="expensive__adj">
  <e> <p><l/> <r><s n="adj"/></r></p></e>
</pardef>
<pardef n="ca__adj">…
# etc; gives all the adj pardefs
</pre>
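Note that contains() matches the substring anywhere in the name, not only at the end. XPath 1.0 has no ends-with() function, but a suffix test can be emulated with substring() and string-length(). The following is a minimal sketch on a made-up sample.dix, where the pardef name x_adj__n is a hypothetical example of a name with "_adj" in the middle. It uses count() so that the two expressions are easy to compare.

```shell
# Stand-in dictionary with three hypothetical pardefs
tmp=$(mktemp -d)
cat > "$tmp/sample.dix" <<'EOF'
<dictionary>
  <pardefs>
    <pardef n="expensive__adj"/>
    <pardef n="ca__adj"/>
    <pardef n="x_adj__n"/>
  </pardefs>
</dictionary>
EOF

# contains() finds "_adj" anywhere in @n, so it counts all three pardefs
n_contains=$(xmllint --xpath 'count(//pardef[contains(@n,"_adj")])' "$tmp/sample.dix")

# Suffix match: compare the last 5 characters of @n with "__adj"
n_suffix=$(xmllint --xpath \
  'count(//pardef[substring(@n, string-length(@n) - 4) = "__adj"])' "$tmp/sample.dix")

echo "contains: $n_contains, suffix: $n_suffix"
```

Dropping the count(...) wrapper from either expression prints the matching elements themselves, as in the example above.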
Some corpora are formatted in XML and keep the actual text inside a particular element. If the corpus puts all its text in <sentence> elements directly under the root element, you can extract it with:
<pre>
$ xmllint --xpath '*/sentence/text()' corpus.xml
</pre>
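If the <sentence> elements sit deeper in the tree, the */sentence path finds nothing, because it only looks one level below the root element. A sketch of the difference, using a made-up corpus.xml with an extra <doc> level (the element names corpus and doc are assumptions for illustration):

```shell
# Hypothetical corpus where sentences are nested two levels below the root
tmp=$(mktemp -d)
cat > "$tmp/corpus.xml" <<'EOF'
<corpus>
  <doc>
    <sentence>First sentence.</sentence>
    <sentence>Second sentence.</sentence>
  </doc>
</corpus>
EOF

# '*/sentence' only matches children of the root element: none here
direct=$(xmllint --xpath 'count(*/sentence)' "$tmp/corpus.xml")

# '//sentence' matches at any depth
nested=$(xmllint --xpath 'count(//sentence)' "$tmp/corpus.xml")

# Extract the text itself; how the text nodes are separated in the
# output differs between xmllint versions
text=$(xmllint --xpath '//sentence/text()' "$tmp/corpus.xml")

echo "direct: $direct, nested: $nested"
echo "$text"
```

Using //sentence is the safer choice when you do not know how deeply the corpus nests its sentences.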