Difference between revisions of "Talk:Calculating coverage"

Latest revision as of 09:34, 11 March 2010

(Mac OS X `sed' doesn't allow \n in replacements, so I just use an actual (escaped) newline...)

I've come across this before, the best thing to do here is install GNU sed ;) - Francis Tyers 22:48, 25 March 2009 (UTC)
sudo port install ssed for Super-Sed (long lost cousin of Super-Ted). Felt it important to point out the fact to potential other Mac users though... or maybe I should've written "Mac? Use ssed" instead. (Or sudo port install gsed for GNU sed, which seems faster than ssed.) Unhammer
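
For anyone who would rather not install another sed, here is a minimal sketch of the escaped-newline workaround mentioned above. The function name split_tags and the sample input are made up for illustration. A backslash followed by a real newline in the replacement is the POSIX form, so this should behave the same under Mac OS X sed and GNU sed.

```shell
#!/bin/sh
# Put each tag on its own line without using \n in the replacement.
# split_tags is a hypothetical helper name, not part of any Apertium script.
split_tags() {
    # A backslash followed by an actual newline inserts a newline portably.
    sed 's/>/>\
/g' | sed 's/</\
</g'
}

# Example: prints <text>, Hei and </text> on separate lines
# (with some empty lines in between).
printf '%s\n' '<text>Hei</text>' | split_tags
```

The empty lines it leaves behind are harmless for the wikicat pipeline, since the final grep only keeps lines that start with a capital letter.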

wikicat2.sh

I like to keep the punctuation, so here is an alternative wikicat script. Usage: $ ./wikicat2.sh blah-pages-articles.xml.bz2

#!/bin/sh
# Clean up a wiki dump for running through apertium-destxt.
# Needs GNU sed (\n in replacements, \+ in patterns).

# awk prints full lines, so make sure each html element has one
bzcat "$@" | sed 's/>/>\n/g' | sed 's/</\n</g' |
# want only stuff between <text...> and </text>
awk '/<text.*>/,/<\/text>/ { print $0 }' |
# drop all transwiki links
sed 's/\[\[\([a-z]\{2,3\}\):[^]]\+\]\]//g' |
# wiki markup, retain bar and fie from [[foo|bar]] [[fie]]
sed 's/\[\[[^]|]*|//g' | sed 's/\]\]//g' | sed 's/\[\[//g' |
# wiki markup, retain `bar fie' from [http://foo bar fie]
sed 's/\[http[^ ]*\([^]]*\)\]/\1/g' |
# remove entities greedily, so as to get rid of hidden html too
sed 's/&.*;/ /g' |
# keep only lines starting with a capital letter, removing tables with style info etc.
# ([[:space:]] rather than [ \t], since grep reads \t in brackets as a backslash and a t)
grep '^[[:space:]]*[A-ZÆØÅ]' # Your alphabet here
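
To see what the link-stripping steps do, here is a small sketch that runs them on one made-up line. The function name strip_markup and the sample sentence are assumptions for illustration. The transwiki pattern is written with \{1,\} instead of \+ so that it also runs under non-GNU sed; the other expressions are copied from the script.

```shell
#!/bin/sh
# strip_markup is a hypothetical wrapper around the sed steps of wikicat2.sh.
strip_markup() {
    # drop transwiki links such as [[nn:Foo]] (portable \{1,\} instead of \+)
    sed 's/\[\[[a-z]\{2,3\}:[^]]\{1,\}\]\]//g' |
    # retain bar and fie from [[foo|bar]] [[fie]]
    sed 's/\[\[[^]|]*|//g' | sed 's/\]\]//g' | sed 's/\[\[//g' |
    # retain ` bar fie' from [http://foo bar fie]
    sed 's/\[http[^ ]*\([^]]*\)\]/\1/g'
}

printf '%s\n' 'See [[foo|bar]] and [[fie]], or [http://example.org bar fie].[[nn:Foo]]' |
    strip_markup
# prints: See bar and fie, or  bar fie.
```

Note the doubled space before "bar fie": the external-link step keeps the space that separated the URL from its label.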