Difference between revisions of "Talk:Calculating coverage"
m (User talk:Unhammer/Coverage moved to Talk:Calculating coverage)
(→wikicat2.sh: new section)
(3 intermediate revisions by the same user not shown)
:I've come across this before, the best thing to do here is install GNU sed ;) - [[User:Francis Tyers|Francis Tyers]] 22:48, 25 March 2009 (UTC)

:: <code>sudo port install ssed</code> for Super-Sed (long lost cousin of [http://www.youtube.com/results?search_query=super-ted&search_type=&aq=f Super-Ted]). Felt it important to point out the fact to potential other Mac users though... or maybe I should've written "Mac? Use ssed" instead. (Or <code>sudo port install gsed</code> for GNU sed, which seems faster than ssed.) [[User:Unhammer|Unhammer]]
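
For anyone who has to stay on the stock sed, the escaped-newline workaround mentioned in this thread can be sketched roughly as below. This is only an illustration: the function name <code>split_tags</code> and the sample input are made up here and are not part of the wikicat scripts.

```shell
#!/bin/sh
# Sketch: put every tag on its own line without using \n in the
# replacement. A backslash followed by a literal newline is the POSIX
# way to insert a newline, so this should not need GNU sed.
# split_tags is a hypothetical helper name used only for this example.
split_tags() {
    sed 's/>/>\
/g' | sed 's/</\
</g'
}

# Made-up sample input; blank lines left over from the split are dropped.
printf '%s\n' '<page><text>Hei.</text></page>' | split_tags | grep -v '^$'
```

The two substitutions are the same ones the wikicat scripts use to break up tags, just written with a real line break in place of <code>\n</code>.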

== wikicat2.sh ==

I like to keep the punctuation, so this is an alternative wikicat script. Use <code>$ ./wikicat2.sh blah-pages-articles.xml.bz2</code>:

<pre>
#!/bin/sh
# Clean up wiki for running through apertium-destxt.
# Needs GNU sed: \n in replacements and \+ are not portable (e.g. Mac OS X sed).

# awk prints full lines, so make sure each html element has one
bzcat "$@" | sed 's/>/>\n/g' | sed 's/</\n</g' |
# want only stuff between <text...> and </text>
awk '/<text.*>/,/<\/text>/ { print $0 }' |
# drop all transwiki links
sed 's/\[\[\([a-z]\{2,3\}\):[^]]\+\]\]//g' |
# wiki markup, retain bar and fie from [[foo|bar]] [[fie]]
sed 's/\[\[[^]|]*|//g' | sed 's/\]\]//g' | sed 's/\[\[//g' |
# wiki markup, retain `bar fie' from [http://foo bar fie]
sed 's/\[http[^ ]*\([^]]*\)\]/\1/g' |
# remove entities greedily, so as to get rid of hidden html too
sed 's/&.*;/ /g' |
# keep only lines starting with a capital letter, removing tables with style info etc.
# ([[:space:]] rather than [ \t], which in grep matches a backslash and a "t", not a tab)
grep '^[[:space:]]*[A-ZÆØÅ]' # Your alphabet here
</pre>
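
To see what the link-stripping steps do on their own, here is a small sketch that runs just those substitutions on sample lines. The wrapper name <code>strip_links</code> and the sample inputs are assumptions made for this example; the sed expressions are copied from the script above.

```shell
#!/bin/sh
# Sketch: only the wiki-link substitutions from wikicat2.sh, wrapped in a
# hypothetical helper (strip_links) so they can be tried on single lines.
strip_links() {
    # [[foo|bar]] -> bar, [[fie]] -> fie
    sed 's/\[\[[^]|]*|//g' | sed 's/\]\]//g' | sed 's/\[\[//g' |
    # [http://foo bar fie] -> " bar fie" (the space after the URL is kept)
    sed 's/\[http[^ ]*\([^]]*\)\]/\1/g'
}

# Made-up sample lines for illustration.
printf '%s\n' '[[foo|bar]] [[fie]]' | strip_links
printf '%s\n' 'See [http://foo bar fie] for more.' | strip_links
```

Note that the external-link step keeps the space that followed the URL, so the output can contain a doubled space; apertium-destxt should not mind, but it is worth knowing when comparing output by eye.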
Latest revision as of 09:34, 11 March 2010
(Mac OS X `sed' doesn't allow \n in replacements, so I just use an actual (escaped) newline...)
- I've come across this before, the best thing to do here is install GNU sed ;) - Francis Tyers 22:48, 25 March 2009 (UTC)