Xml grep

Revision as of 11:12, 20 June 2016 by Nikita Medyankin (talk | contribs) (→‎It's slow on big files: minor spelling correction)

When working with XML, you'll often want to grep out an element that spans several lines. This can be hacked with awk or perl, but a more elegant solution is to use the parser in libxml2 (which is required for installing Apertium, so it should be on your system already). Its xmllint tool lets you use simple XPath expressions to grep out full XML elements, without falling for the temptation to parse XML with regex.

Specifying the full path and the full pardef name:

$ xmllint --xpath '/dictionary/pardefs/pardef[@n="gen__apos"]' apertium-eng.eng.dix
<pardef n="gen__apos">
  <e>       <p><l/>          <r/></p></e>
  <e>       <p><l>'</l>         <r><j/>'<s n="gen"/></r></p></e>

But for dix files, you should get the same result with the abbreviated // form, which matches a pardef at any depth instead of only at the full path:

$ xmllint --xpath '//pardef[@n="gen__apos"]' apertium-eng.eng.dix
<pardef n="gen__apos">
  <e>       <p><l/>          <r/></p></e>
  <e>       <p><l>'</l>         <r><j/>'<s n="gen"/></r></p></e>
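
To convince yourself that the two forms agree, a minimal sketch along these lines can help. The file sample.dix and its contents are made up for illustration (it is not a real Apertium dictionary); only the two XPath expressions come from the examples above.

```shell
# Hypothetical toy dictionary, just enough structure to test the queries.
cat > sample.dix <<'EOF'
<dictionary>
  <pardefs>
    <pardef n="gen__apos">
      <e><p><l/><r/></p></e>
    </pardef>
    <pardef n="expensive__adj">
      <e><p><l/><r><s n="adj"/></r></p></e>
    </pardef>
  </pardefs>
</dictionary>
EOF

# Run the full-path query and the abbreviated // query on the same file.
full=$(xmllint --xpath '/dictionary/pardefs/pardef[@n="gen__apos"]' sample.dix)
short=$(xmllint --xpath '//pardef[@n="gen__apos"]' sample.dix)

# The two can only differ if a matching pardef exists outside /dictionary/pardefs.
if [ "$full" = "$short" ]; then
  echo "same result"
else
  echo "results differ"
fi
```

The // form is shorter to type, but it searches the whole tree, so the full path is the safer choice if your file could contain pardef elements elsewhere.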

You can also search for substrings by using the 'contains' function:

$ xmllint --xpath '//pardef[contains(@n,"_adj")]' apertium-eng.eng.dix
<pardef n="expensive__adj">
  <e>       <p><l/>          <r><s n="adj"/></r></p></e>
<pardef n="ca__adj">…
# etc; gives all the adj pardefs

To get all c attributes:

$ xmllint --xpath '//@c' apertium-eng.eng.dix

To get c attributes only from <e> elements:

$ xmllint --xpath '//e/@c' apertium-eng.eng.dix

To get all attributes of the e element that has the lm "cake":

$ xmllint --xpath '//e[@lm="cake"]/@*' apertium-eng.eng.dix

To get the second dictionary section:

$ xmllint --xpath '/dictionary/section[2]' apertium-eng.eng.dix

(or section[position()=2])

To count the lm attributes you have (which should equal the number of lemma entries):

$ xmllint --xpath 'count(//e/@lm)' apertium-eng.eng.dix
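
count() accepts any node set, so the same pattern should also work with a predicate. As a sketch on a made-up toy file (sample.dix and its entries are hypothetical), this counts all lm attributes and then only the entries that have r="LR":

```shell
# Hypothetical toy dictionary with three entries, two of them marked LR.
cat > sample.dix <<'EOF'
<dictionary>
  <section id="main" type="standard">
    <e lm="cake"><i>cake</i></e>
    <e lm="cheap" r="LR"><i>cheap</i></e>
    <e lm="dear" r="LR"><i>dear</i></e>
  </section>
</dictionary>
EOF

# Count every lm attribute on an <e> element.
all=$(xmllint --xpath 'count(//e/@lm)' sample.dix)
# Count only the <e> elements whose r attribute is LR.
lr=$(xmllint --xpath 'count(//e[@r="LR"])' sample.dix)

echo "lm attributes: $all, LR entries: $lr"
```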

Some corpora are formatted in XML and put the real text content inside a particular element. If, say, the corpus puts all text inside <sentence> elements directly under the root element, you can grep them out with:

$ xmllint --xpath '*/sentence/text()' corpus.xml

But I want XML awk/sed/diff/patch/join/etc.!

To do more complex XML munging, you might want to install XMLStarlet. The syntax takes a bit of getting used to, but it's quite powerful. If nothing else, it's worth getting because it can append a newline after each selection. E.g.

 $ xmllint --xpath '//e[@r="LR"]//text()' apertium-eng.eng.dix

would get the text content of entries marked LR, but there will be no separator between matches. XMLStarlet requires -a -few -more -options, but can output a newline after each match:

 $ xmlstarlet sel -t -m '//e[@r="LR"]//text()' -c . -n apertium-eng.eng.dix

"sel" means select/query, -t starts an output template, -m means "match this XPath expression", "-c ." here means "output a copy of what was matched" and -n means add a newline.
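
If you only need one attribute value per match, a variant of the same pattern may be handy. This sketch uses -v (value-of), an option not used elsewhere on this page, on a made-up toy file (sample.dix is hypothetical), and assumes the binary is called xmlstarlet on your system:

```shell
# Hypothetical toy dictionary with three entries, two of them marked LR.
cat > sample.dix <<'EOF'
<dictionary>
  <section id="main" type="standard">
    <e lm="cake"><i>cake</i></e>
    <e lm="cheap" r="LR"><i>cheap</i></e>
    <e lm="dear" r="LR"><i>dear</i></e>
  </section>
</dictionary>
EOF

# -m selects each LR entry, -v prints the value of its lm attribute,
# and -n ends the line, so you get one lemma per line.
lemmas=$(xmlstarlet sel -t -m '//e[@r="LR"]' -v '@lm' -n sample.dix)
printf '%s\n' "$lemmas"
```

Unlike -c, which copies the matched node as XML, -v prints its string value, which is usually what you want when feeding the result to sort, uniq or wc.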

To use xmlstarlet as a sed replacement, try e.g.

 $ xmlstarlet ed -P -u '//e[@lm="Norge"]/@r' -v "LR" apertium-nno.nno.dix

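to set r to LR on the entry with lemma Norge. This updates an existing r attribute and prints the edited document to stdout (-P preserves the original formatting); add -L to edit the file in place.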

You can have several -c's (and -n's and -o's) per -m, so if you wanted to output all the text from <l>'s and <r>'s under the same <p> in a file, tab-separated, you could do:

 $ xmlstarlet sel -t -o $'LEFT\tRIGHT' -n -m '//e/p' -c 'l/text()' -o $'\t' -c 'r/text()' -n apertium-nno-nob.nno-nob.dix 

Note that on some systems, the command for xmlstarlet is simply "xml".
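
If you write scripts that should run on both kinds of systems, one possible approach is to look up the name once and reuse it. This is only a sketch; the variable name XMLSTARLET is made up for the example:

```shell
# Pick whichever name the XMLStarlet binary has on this system.
if command -v xmlstarlet >/dev/null 2>&1; then
  XMLSTARLET=xmlstarlet
elif command -v xml >/dev/null 2>&1; then
  XMLSTARLET=xml
else
  # Neither name was found; leave the variable empty and warn.
  XMLSTARLET=
  echo "XMLStarlet not found" >&2
fi

# Later commands can then be written as: "$XMLSTARLET" sel -t ...
printf '%s\n' "using: ${XMLSTARLET:-none}"
```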

It's slow on big files

Both xmllint and xmlstarlet seem to sponge the whole file into RAM before doing anything.

What if you just want to run on the first 1000 lines? If you try "head -1000 file | xmlstarlet" it'll complain since your tags aren't closed. Here's a trick to avoid that:

xzcat corpus.xml.xz | head -1000 | xmllint --recover - | xmllint --xpath '//foo/bar' -

The --recover option makes xmllint close any unclosed tags for you :-) and print the errors it recovered from to stderr.
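
To see the trick work on something small, here is a sketch with a made-up five-line corpus (the file name toy-corpus.xml and its contents are hypothetical). Cutting it after three lines leaves the root element unclosed, which is the situation described above:

```shell
# Hypothetical five-line corpus, one element per line.
printf '%s\n' '<corpus>' \
  '<sentence>one</sentence>' \
  '<sentence>two</sentence>' \
  '<sentence>three</sentence>' \
  '</corpus>' > toy-corpus.xml

# Keep only the first three lines, so <corpus> is never closed.
# --recover repairs that; its complaints on stderr are discarded here.
n=$(head -3 toy-corpus.xml \
  | xmllint --recover - 2>/dev/null \
  | xmllint --xpath 'count(//sentence)' -)

echo "sentences in truncated input: $n"
```

Only the sentences that made it into the first three lines are counted, so remember that results on a truncated file describe the sample and not the whole corpus.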

E.g. to get some articles out of a wikipedia dump:

 bzcat dawiki-20160111-pages-articles.xml.bz2 \
  | head -1000 \
  | xmllint --recover - \
  | xmlstarlet sel -N w='http://www.mediawiki.org/xml/export-0.10/' -t -c '//w:text/text()'

I get "Unknown option --xpath"

You need to upgrade your libxml2 (--xpath was added in 2.7.7, released on 15 March 2010). Or use xmlstarlet.
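
If you are unsure what a given machine has, a quick probe like the following may help. It is only a sketch: the variable name xpath_ok is made up, and the probe also reports "no" when xmllint is not installed at all.

```shell
# Feed a one-element document on stdin; the query only succeeds
# if this xmllint exists and understands --xpath.
if echo '<a/>' | xmllint --xpath 'count(/a)' - >/dev/null 2>&1; then
  xpath_ok=yes
else
  xpath_ok=no
fi

printf '%s\n' "xmllint --xpath supported: $xpath_ok"
```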
