Turkish and Kyrgyz/Making a corpus from azattyk

From Apertium

Background information

Fetch the archive listing pages from:
- http://www.azattyk.org/archive/[1]/[date]/[2]/[2].html
- [1] and [2] identify the section (see table below)
- [date] is in the format yyyymmdd
On each listing page:
- find each «li class="date archive_listrow_date"»[3]«/li»; the date is [3]
- every «li»[4]«/li» between one «li class="date archive_listrow_date"»«/li» and the next is an article
- each «li»[4]«/li» contains an «a href="[5]"»[6]«/a»; [5] is the relative URL of the article, [6] is its title
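
The listing structure described above can be parsed with the Python standard library alone. The following is a minimal sketch, not the code used by the scrapers mentioned on this page. The names ArchiveListingParser and parse_listing are hypothetical, and it assumes the article «li» elements are not nested inside one another.

```python
from html.parser import HTMLParser


class ArchiveListingParser(HTMLParser):
    """Collect (date, relative_url, title) triples from one listing page."""

    def __init__(self):
        super().__init__()
        self.articles = []      # results: (date [3], href [5], title [6])
        self._date = None       # text of the most recent date row
        self._in_date = False   # currently inside a date <li>
        self._in_item = False   # currently inside an ordinary <li>
        self._href = None       # href of the <a> currently being read
        self._text = []         # text chunks of the element being read

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "li":
            classes = (attrs.get("class") or "").split()
            # Match on the class suffix so the date row is found even if
            # the prefix differs slightly from what is documented here.
            if any(c.endswith("listrow_date") for c in classes):
                self._in_date = True
                self._text = []
            else:
                self._in_item = True
        elif (tag == "a" and self._in_item
              and self._date is not None and attrs.get("href")):
            self._href = attrs["href"]
            self._text = []

    def handle_data(self, data):
        if self._in_date or self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "li":
            if self._in_date:
                self._date = "".join(self._text).strip()
                self._in_date = False
            else:
                self._in_item = False
        elif tag == "a" and self._href is not None:
            title = "".join(self._text).strip()
            self.articles.append((self._date, self._href, title))
            self._href = None


def parse_listing(html):
    """Return the list of (date, relative_url, title) found in html."""
    parser = ArchiveListingParser()
    parser.feed(html)
    parser.close()
    return parser.articles
```

Links that appear before the first date row are skipped, since they have no date to attach to.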

Possible values for [1] and [2]
[1]	[2]
ky-kyrgyzstan	392
ky-central_asia	393
ky-world	394
ky-politics	395
ky-human_rights	396
ky-economy	397
ky-culture	398
ky-voice_of_people	399
ky-sport	400
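
As an illustration, the table can be kept as a dictionary and used to fill in the URL template given above. This is only a sketch: the names SECTIONS and listing_url are hypothetical, and the date in the usage note is an arbitrary example.

```python
from datetime import date

BASE = "http://www.azattyk.org/archive"

# Values of [1] mapped to the matching values of [2], from the table above.
SECTIONS = {
    "ky-kyrgyzstan": 392,
    "ky-central_asia": 393,
    "ky-world": 394,
    "ky-politics": 395,
    "ky-human_rights": 396,
    "ky-economy": 397,
    "ky-culture": 398,
    "ky-voice_of_people": 399,
    "ky-sport": 400,
}


def listing_url(section, day):
    """Build the archive listing URL for one section on one day."""
    section_id = SECTIONS[section]
    # [date] is formatted as yyyymmdd; [2] appears twice in the path.
    return f"{BASE}/{section}/{day:%Y%m%d}/{section_id}/{section_id}.html"
```

For example, listing_url("ky-sport", date(2011, 3, 12)) gives http://www.azattyk.org/archive/ky-sport/20110312/400/400.html.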

Algorithm

generate every combination of a row from the table above with a date in a given range
fetch the archive page for each combination
collect all article links by finding the «li» elements described above
use scrapers.ScraperAzattyk to get the contents of each article page
use something like scraper_classes.Source.add_to_archive() to build the XML archive
(possibly use xml2txt.py to dump the archive as plain text for use with other tools?)
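
The first steps of the algorithm might be sketched as follows. The helper names date_range, listing_urls and absolute_article_urls are hypothetical. The fetching step and the calls into scrapers.ScraperAzattyk and scraper_classes.Source.add_to_archive() are deliberately left out, because their signatures are not documented on this page.

```python
from datetime import date, timedelta
from itertools import product
from urllib.parse import urljoin


def date_range(start, end):
    """Yield every date from start to end, inclusive."""
    for offset in range((end - start).days + 1):
        yield start + timedelta(days=offset)


def listing_urls(sections, start, end):
    """Yield one archive listing URL per (section, date) pair.

    sections maps each [1] value to its [2] value, as in the table above.
    """
    pairs = product(sections.items(), date_range(start, end))
    for (name, section_id), day in pairs:
        yield (
            f"http://www.azattyk.org/archive/"
            f"{name}/{day:%Y%m%d}/{section_id}/{section_id}.html"
        )


def absolute_article_urls(listing_url, relative_urls):
    """Resolve the relative article links [5] against their listing page."""
    return [urljoin(listing_url, rel) for rel in relative_urls]
```

With two sections and a two-day range, listing_urls yields four URLs, grouped by section and ordered by date within each section. The resolved article URLs would then be handed to the scraper.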

Retrieved from "https://wiki.apertium.org/w/index.php?title=Turkish_and_Kyrgyz/Making_a_corpus_from_azattyk&oldid=28724"