Difference between revisions of "Xml grep"
Revision as of 12:43, 3 December 2014
When working with XML, you'll often want to grep out an element that spans several lines. This can be hacked with awk or perl, but a more elegant solution is to use the parser in libxml2 (which is required for installing apertium, so it should already be on your system). Its xmllint tool lets you use XPath 1.0 expressions to grep out full XML elements, without falling for the temptation to parse XML with regex.
Specifying the full path and the full pardef name:
$ xmllint --xpath '/dictionary/pardefs/pardef[@n="gen__apos"]' apertium-eng.eng.dix
<pardef n="gen__apos">
  <e> <p><l/> <r/></p></e>
  <e> <p><l>'</l> <r><j/>'<s n="gen"/></r></p></e>
</pardef>
But for dix files, it should be the same if you specify a relative path:
$ xmllint --xpath '//pardef[@n="gen__apos"]' apertium-eng.eng.dix
<pardef n="gen__apos">
  <e> <p><l/> <r/></p></e>
  <e> <p><l>'</l> <r><j/>'<s n="gen"/></r></p></e>
</pardef>
You can also search for substrings by using the 'contains' function:
$ xmllint --xpath '//pardef[contains(@n,"_adj")]' apertium-eng.eng.dix
<pardef n="expensive__adj">
  <e> <p><l/> <r><s n="adj"/></r></p></e>
</pardef>
<pardef n="ca__adj">… # etc; gives all the adj pardefs
To get all c attributes:
$ xmllint --xpath '//@c' apertium-eng.eng.dix
To get c attributes only from <e> elements:
$ xmllint --xpath '//e/@c' apertium-eng.eng.dix
To get all attributes of the e element that has the lm "cake":
$ xmllint --xpath '//e[@lm="cake"]/@*' apertium-eng.eng.dix
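The attribute queries above print whole attribute nodes (name="value"). When only the bare value is wanted, wrapping the expression in the XPath string() function is one option. The following is a minimal sketch; the function name entry_c_value is made up for illustration, and it assumes the lemma contains no double quotes because it is pasted directly into the XPath string:

```shell
# Hypothetical helper (not part of xmllint or apertium):
# print the bare value of the c attribute of the first <e> whose lm equals "$2"
# in the dix file "$1". Prints an empty line if there is no such entry.
# Assumes "$2" contains no double quotes, since it is interpolated into the XPath.
entry_c_value() {
    xmllint --xpath "string(//e[@lm=\"$2\"]/@c)" "$1"
}
```

It could be called as: entry_c_value apertium-eng.eng.dix cake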
To get the second dictionary section:
$ xmllint --xpath '/dictionary/section[2]' apertium-eng.eng.dix
(or section[position()=2])
To count the lm attributes (which should equal the number of lemmas):
$ xmllint --xpath 'count(//e/@lm)' apertium-eng.eng.dix
Some corpora are formatted in XML and put the actual text inside a particular element. If the corpus puts all text inside <sentence> elements directly below the root element, you can grep them out with:
$ xmllint --xpath '*/sentence/text()' corpus.xml
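The */sentence/text() path above only reaches <sentence> elements that are children of the root element. If a corpus nests them deeper, a descendant search with // is one way to reach them. A small sketch, with a made-up function name:

```shell
# Hypothetical helper: print the text of every <sentence> element in "$1",
# however deeply it is nested. Whether matches are separated by newlines
# depends on the xmllint version.
sentences() {
    xmllint --xpath '//sentence/text()' "$1"
}
```

It could be called as: sentences corpus.xml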
But I want XML awk/sed/diff/patch/join/etc.!
To do more complex XML munging, you might want to install XMLStarlet (http://xmlstar.sourceforge.net/). The syntax takes a bit of getting used to, but it's quite powerful. If nothing else, it's worth getting because it can append newlines after each selection. E.g.
$ xmllint --xpath '//e[@r="LR"]//text()' apertium-eng.eng.dix
would get the text content of entries marked LR, but there will be no separator between each match. XMLStarlet requires -a -few -more -options, but can output newlines after each match:
$ xmlstarlet sel -t -m '//e[@r="LR"]//text()' -c . -n apertium-eng.eng.dix
"sel" means select/query, -t starts an output template, -m means "match this XPath expression", "-c ." here means "output a copy of what was matched" and -n means add a newline.
Note that on some systems, the command for xmlstarlet is simply "xml".
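A script that has to run on both kinds of system could look the command up first. This is only a sketch: the function names are invented here, and it assumes that any command called "xml" on the PATH really is XMLStarlet.

```shell
# Hypothetical helper: print whichever XMLStarlet command name exists here.
# Returns non-zero if neither "xmlstarlet" nor "xml" is on the PATH.
find_xmlstarlet() {
    for name in xmlstarlet xml; do
        if command -v "$name" >/dev/null 2>&1; then
            echo "$name"
            return 0
        fi
    done
    return 1
}

# Hypothetical wrapper around the LR query shown earlier; "$1" is the dix file.
lr_text() {
    cmd=$(find_xmlstarlet) || { echo "XMLStarlet not found" >&2; return 1; }
    "$cmd" sel -t -m '//e[@r="LR"]//text()' -c . -n "$1"
}
```

It could be called as: lr_text apertium-eng.eng.dix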
I get "Unknown option --xpath"
You need to upgrade your libxml2 (--xpath was added in version 2.7.7, released on 15 March 2010).
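To check what is installed, xmllint --version prints the libxml2 version as a single number (major*10000 + minor*100 + micro, so 2.7.7 is 20707). The sketch below assumes the banner contains the text "using libxml version" followed by that number; the helper name has_xpath_option is made up for illustration.

```shell
# Hypothetical helper: succeeds if the numeric libxml2 version in "$1"
# (e.g. 20707 for 2.7.7) is new enough to have --xpath.
has_xpath_option() {
    [ "$1" -ge 20707 ]
}

if command -v xmllint >/dev/null 2>&1; then
    # The version banner goes to stderr, hence the 2>&1.
    # Assumption: it contains "using libxml version NNNNN".
    ver=$(xmllint --version 2>&1 | sed -n 's/.*using libxml version \([0-9][0-9]*\).*/\1/p')
    if [ -n "$ver" ] && has_xpath_option "$ver"; then
        echo "libxml2 $ver: --xpath is available"
    else
        echo "could not confirm --xpath support (version: ${ver:-unknown}); consider upgrading libxml2"
    fi
else
    echo "xmllint not found; install libxml2 and its command-line tools"
fi
```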