<?xml version="1.0"?>
<feed xmlns="http://www.w3.org/2005/Atom" xml:lang="en">
	<id>https://wiki.apertium.org/w/index.php?action=history&amp;feed=atom&amp;title=WikiExtractor</id>
	<title>WikiExtractor - Revision history</title>
	<link rel="self" type="application/atom+xml" href="https://wiki.apertium.org/w/index.php?action=history&amp;feed=atom&amp;title=WikiExtractor"/>
	<link rel="alternate" type="text/html" href="https://wiki.apertium.org/w/index.php?title=WikiExtractor&amp;action=history"/>
	<updated>2026-05-29T23:43:54Z</updated>
	<subtitle>Revision history for this page on the wiki</subtitle>
	<generator>MediaWiki 1.34.1</generator>
	<entry>
		<id>https://wiki.apertium.org/w/index.php?title=WikiExtractor&amp;diff=66202&amp;oldid=prev</id>
		<title>Shardulc at 02:47, 10 March 2018</title>
		<link rel="alternate" type="text/html" href="https://wiki.apertium.org/w/index.php?title=WikiExtractor&amp;diff=66202&amp;oldid=prev"/>
		<updated>2018-03-10T02:47:57Z</updated>

		<summary type="html">&lt;p&gt;&lt;/p&gt;
&lt;table class=&quot;diff diff-contentalign-left&quot; data-mw=&quot;interface&quot;&gt;
				&lt;col class=&quot;diff-marker&quot; /&gt;
				&lt;col class=&quot;diff-content&quot; /&gt;
				&lt;col class=&quot;diff-marker&quot; /&gt;
				&lt;col class=&quot;diff-content&quot; /&gt;
				&lt;tr class=&quot;diff-title&quot; lang=&quot;en&quot;&gt;
				&lt;td colspan=&quot;2&quot; style=&quot;background-color: #fff; color: #222; text-align: center;&quot;&gt;← Older revision&lt;/td&gt;
				&lt;td colspan=&quot;2&quot; style=&quot;background-color: #fff; color: #222; text-align: center;&quot;&gt;Revision as of 02:47, 10 March 2018&lt;/td&gt;
				&lt;/tr&gt;&lt;tr&gt;
  &lt;td colspan=&quot;2&quot; class=&quot;diff-lineno&quot;&gt;Line 1:&lt;/td&gt;
  &lt;td colspan=&quot;2&quot; class=&quot;diff-lineno&quot;&gt;Line 1:&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
  &lt;td colspan=&quot;2&quot; class=&quot;diff-empty diff-side-deleted&quot;&gt;&lt;/td&gt;
  &lt;td class=&quot;diff-marker&quot; data-marker=&quot;+&quot;&gt;&lt;/td&gt;
  &lt;td class=&quot;diff-addedline diff-side-added&quot;&gt;&lt;div&gt;{{Github-unmigrated-tool}}&lt;/div&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
  &lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;
  &lt;td class=&quot;diff-context diff-side-deleted&quot;&gt;&lt;div&gt;&#039;&#039;&#039;WikiExtractor&#039;&#039;&#039; is a script for extracting a text corpus from Wikipedia dumps.&lt;/div&gt;&lt;/td&gt;
  &lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;
  &lt;td class=&quot;diff-context diff-side-added&quot;&gt;&lt;div&gt;&#039;&#039;&#039;WikiExtractor&#039;&#039;&#039; is a script for extracting a text corpus from Wikipedia dumps.&lt;/div&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
  &lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;
  &lt;td class=&quot;diff-context diff-side-deleted&quot;&gt;&lt;br /&gt;&lt;/td&gt;
  &lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;
  &lt;td class=&quot;diff-context diff-side-added&quot;&gt;&lt;br /&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/table&gt;</summary>
		<author><name>Shardulc</name></author>
		
	</entry>
	<entry>
		<id>https://wiki.apertium.org/w/index.php?title=WikiExtractor&amp;diff=59578&amp;oldid=prev</id>
		<title>Francis Tyers at 11:41, 18 August 2016</title>
		<link rel="alternate" type="text/html" href="https://wiki.apertium.org/w/index.php?title=WikiExtractor&amp;diff=59578&amp;oldid=prev"/>
		<updated>2016-08-18T11:41:13Z</updated>

		<summary type="html">&lt;p&gt;&lt;/p&gt;
&lt;table class=&quot;diff diff-contentalign-left&quot; data-mw=&quot;interface&quot;&gt;
				&lt;col class=&quot;diff-marker&quot; /&gt;
				&lt;col class=&quot;diff-content&quot; /&gt;
				&lt;col class=&quot;diff-marker&quot; /&gt;
				&lt;col class=&quot;diff-content&quot; /&gt;
				&lt;tr class=&quot;diff-title&quot; lang=&quot;en&quot;&gt;
				&lt;td colspan=&quot;2&quot; style=&quot;background-color: #fff; color: #222; text-align: center;&quot;&gt;← Older revision&lt;/td&gt;
				&lt;td colspan=&quot;2&quot; style=&quot;background-color: #fff; color: #222; text-align: center;&quot;&gt;Revision as of 11:41, 18 August 2016&lt;/td&gt;
				&lt;/tr&gt;&lt;tr&gt;
  &lt;td colspan=&quot;2&quot; class=&quot;diff-lineno&quot;&gt;Line 12:&lt;/td&gt;
  &lt;td colspan=&quot;2&quot; class=&quot;diff-lineno&quot;&gt;Line 12:&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
  &lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;
  &lt;td class=&quot;diff-context diff-side-deleted&quot;&gt;&lt;br /&gt;&lt;/td&gt;
  &lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;
  &lt;td class=&quot;diff-context diff-side-added&quot;&gt;&lt;br /&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
  &lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;
  &lt;td class=&quot;diff-context diff-side-deleted&quot;&gt;&lt;div&gt;You want the file that ends in: &amp;lt;code&amp;gt;-pages-articles.xml.bz2&amp;lt;/code&amp;gt;, e.g. &amp;lt;code&amp;gt;euwiki-20160701-pages-articles.xml.bz2&amp;lt;/code&amp;gt;&lt;/div&gt;&lt;/td&gt;
  &lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;
  &lt;td class=&quot;diff-context diff-side-added&quot;&gt;&lt;div&gt;You want the file that ends in: &amp;lt;code&amp;gt;-pages-articles.xml.bz2&amp;lt;/code&amp;gt;, e.g. &amp;lt;code&amp;gt;euwiki-20160701-pages-articles.xml.bz2&amp;lt;/code&amp;gt;&lt;/div&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
  &lt;td colspan=&quot;2&quot; class=&quot;diff-empty diff-side-deleted&quot;&gt;&lt;/td&gt;
  &lt;td class=&quot;diff-marker&quot; data-marker=&quot;+&quot;&gt;&lt;/td&gt;
  &lt;td class=&quot;diff-addedline diff-side-added&quot;&gt;&lt;br /&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
  &lt;td colspan=&quot;2&quot; class=&quot;diff-empty diff-side-deleted&quot;&gt;&lt;/td&gt;
  &lt;td class=&quot;diff-marker&quot; data-marker=&quot;+&quot;&gt;&lt;/td&gt;
  &lt;td class=&quot;diff-addedline diff-side-added&quot;&gt;&lt;div&gt;You&#039;ll need to download the file, you can use &amp;lt;code&amp;gt;wget&amp;lt;/code&amp;gt; or &amp;lt;code&amp;gt;curl&amp;lt;/code&amp;gt; or something...&lt;/div&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
  &lt;td colspan=&quot;2&quot; class=&quot;diff-empty diff-side-deleted&quot;&gt;&lt;/td&gt;
  &lt;td class=&quot;diff-marker&quot; data-marker=&quot;+&quot;&gt;&lt;/td&gt;
  &lt;td class=&quot;diff-addedline diff-side-added&quot;&gt;&lt;br /&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
  &lt;td colspan=&quot;2&quot; class=&quot;diff-empty diff-side-deleted&quot;&gt;&lt;/td&gt;
  &lt;td class=&quot;diff-marker&quot; data-marker=&quot;+&quot;&gt;&lt;/td&gt;
  &lt;td class=&quot;diff-addedline diff-side-added&quot;&gt;&lt;div&gt;&amp;lt;pre&amp;gt;&lt;/div&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
  &lt;td colspan=&quot;2&quot; class=&quot;diff-empty diff-side-deleted&quot;&gt;&lt;/td&gt;
  &lt;td class=&quot;diff-marker&quot; data-marker=&quot;+&quot;&gt;&lt;/td&gt;
  &lt;td class=&quot;diff-addedline diff-side-added&quot;&gt;&lt;div&gt;$ wget &amp;lt;url&amp;gt;&lt;/div&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
  &lt;td colspan=&quot;2&quot; class=&quot;diff-empty diff-side-deleted&quot;&gt;&lt;/td&gt;
  &lt;td class=&quot;diff-marker&quot; data-marker=&quot;+&quot;&gt;&lt;/td&gt;
  &lt;td class=&quot;diff-addedline diff-side-added&quot;&gt;&lt;div&gt;&amp;lt;/pre&amp;gt;&lt;/div&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
  &lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;
  &lt;td class=&quot;diff-context diff-side-deleted&quot;&gt;&lt;br /&gt;&lt;/td&gt;
  &lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;
  &lt;td class=&quot;diff-context diff-side-added&quot;&gt;&lt;br /&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
  &lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;
  &lt;td class=&quot;diff-context diff-side-deleted&quot;&gt;&lt;div&gt;And you run it like this:&lt;/div&gt;&lt;/td&gt;
  &lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;
  &lt;td class=&quot;diff-context diff-side-added&quot;&gt;&lt;div&gt;And you run it like this:&lt;/div&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/table&gt;</summary>
		<author><name>Francis Tyers</name></author>
		
	</entry>
	<entry>
		<id>https://wiki.apertium.org/w/index.php?title=WikiExtractor&amp;diff=59131&amp;oldid=prev</id>
		<title>Francis Tyers: Francis Tyers moved page Wiki Extractor to WikiExtractor</title>
		<link rel="alternate" type="text/html" href="https://wiki.apertium.org/w/index.php?title=WikiExtractor&amp;diff=59131&amp;oldid=prev"/>
		<updated>2016-07-09T13:37:12Z</updated>

		<summary type="html">&lt;p&gt;Francis Tyers moved page &lt;a href=&quot;/wiki/Wiki_Extractor&quot; class=&quot;mw-redirect&quot; title=&quot;Wiki Extractor&quot;&gt;Wiki Extractor&lt;/a&gt; to &lt;a href=&quot;/wiki/WikiExtractor&quot; title=&quot;WikiExtractor&quot;&gt;WikiExtractor&lt;/a&gt;&lt;/p&gt;
&lt;table class=&quot;diff diff-contentalign-left&quot; data-mw=&quot;interface&quot;&gt;
				&lt;tr class=&quot;diff-title&quot; lang=&quot;en&quot;&gt;
				&lt;td colspan=&quot;1&quot; style=&quot;background-color: #fff; color: #222; text-align: center;&quot;&gt;← Older revision&lt;/td&gt;
				&lt;td colspan=&quot;1&quot; style=&quot;background-color: #fff; color: #222; text-align: center;&quot;&gt;Revision as of 13:37, 9 July 2016&lt;/td&gt;
				&lt;/tr&gt;&lt;tr&gt;&lt;td colspan=&quot;2&quot; class=&quot;diff-notice&quot; lang=&quot;en&quot;&gt;&lt;div class=&quot;mw-diff-empty&quot;&gt;(No difference)&lt;/div&gt;
&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;</summary>
		<author><name>Francis Tyers</name></author>
		
	</entry>
	<entry>
		<id>https://wiki.apertium.org/w/index.php?title=WikiExtractor&amp;diff=59130&amp;oldid=prev</id>
		<title>Francis Tyers: Created page with &quot;&#039;&#039;&#039;WikiExtractor&#039;&#039;&#039; is a script for extracting a text corpus from Wikipedia dumps.  You can find the script here:  https://svn.code.sf.net/p/apertium/svn/trunk/apertium-tools/...&quot;</title>
		<link rel="alternate" type="text/html" href="https://wiki.apertium.org/w/index.php?title=WikiExtractor&amp;diff=59130&amp;oldid=prev"/>
		<updated>2016-07-09T13:36:21Z</updated>

		<summary type="html">&lt;p&gt;Created page with &amp;quot;&amp;#039;&amp;#039;&amp;#039;WikiExtractor&amp;#039;&amp;#039;&amp;#039; is a script for extracting a text corpus from Wikipedia dumps.  You can find the script here:  https://svn.code.sf.net/p/apertium/svn/trunk/apertium-tools/...&amp;quot;&lt;/p&gt;
&lt;p&gt;&lt;b&gt;New page&lt;/b&gt;&lt;/p&gt;&lt;div&gt;&amp;#039;&amp;#039;&amp;#039;WikiExtractor&amp;#039;&amp;#039;&amp;#039; is a script for extracting a text corpus from Wikipedia dumps.&lt;br /&gt;
&lt;br /&gt;
You can find the script here:&lt;br /&gt;
&lt;br /&gt;
https://svn.code.sf.net/p/apertium/svn/trunk/apertium-tools/WikiExtractor.py&lt;br /&gt;
&lt;br /&gt;
You find a Wikipedia dump here:&lt;br /&gt;
&lt;br /&gt;
https://dumps.wikimedia.org/backup-index.html&lt;br /&gt;
&lt;br /&gt;
Navigate to the page of the language you are interested in, there will be a link called &amp;lt;code&amp;gt;&amp;amp;lt;language code&amp;amp;gt;&amp;lt;/code&amp;gt;wiki.&lt;br /&gt;
&lt;br /&gt;
You want the file that ends in: &amp;lt;code&amp;gt;-pages-articles.xml.bz2&amp;lt;/code&amp;gt;, e.g. &amp;lt;code&amp;gt;euwiki-20160701-pages-articles.xml.bz2&amp;lt;/code&amp;gt;&lt;br /&gt;
&lt;br /&gt;
And you run it like this:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
$ python3 WikiExtractor.py --infn euwiki-20160701-pages-articles.xml.bz2&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
It will spit a lot of output (the article titles) and output a file called &amp;lt;code&amp;gt;wiki.txt&amp;lt;/code&amp;gt;. This is your corpus.&lt;br /&gt;
&lt;br /&gt;
[[Category:Tools]]&lt;/div&gt;</summary>
		<author><name>Francis Tyers</name></author>
		
	</entry>
</feed>