Turkish and Kyrgyz/Making a corpus from azattyk
Revision as of 08:51, 3 October 2011 by Firespeaker (talk | contribs)
Background information
- Get the archive listing pages from http://www.azattyk.org/archive/[1]/[date]/[2]/[2].html
- [1] and [2] come in pairs (see table below)
- [date] is in the format yyyymmdd
- find each «li class="date archive_listrow_date"»[3]«/li»; the date is [3]
- every «li»[4]«/li» between one «li class="date archive_listrow_date"»«/li» and the next «li class="date archive_listrow_date"»«/li» is an article
- each «li»[4]«/li» contains an «a href="[5]"»[6]«/a»; [5] is the relative URL of the article, [6] is the title of the article
[1] | [2] |
---|---|
ky-kyrgyzstan | 392 |
ky-central_asia | 393 |
ky-world | 394 |
ky-politics | 395 |
ky-human_rights | 396 |
ky-economy | 397 |
ky-culture | 398 |
ky-voice_of_people | 399 |
ky-sport | 400 |
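As a rough illustration of how the URL pattern and the table fit together, the following Python sketch builds one listing URL per table row per day. The names SECTIONS, URL_TEMPLATE, date_range and archive_urls are made up for this example and are not part of the existing scraper code; the sketch only assumes the URL pattern and the yyyymmdd date format described above.

```python
from datetime import date, timedelta
from itertools import product

# ([1], [2]) pairs copied from the table above.
SECTIONS = {
    "ky-kyrgyzstan": 392,
    "ky-central_asia": 393,
    "ky-world": 394,
    "ky-politics": 395,
    "ky-human_rights": 396,
    "ky-economy": 397,
    "ky-culture": 398,
    "ky-voice_of_people": 399,
    "ky-sport": 400,
}

# URL pattern from the notes above; [2] appears twice in the path.
URL_TEMPLATE = "http://www.azattyk.org/archive/{name}/{day}/{num}/{num}.html"


def date_range(start, end):
    """Yield every date from start to end, both inclusive."""
    for offset in range((end - start).days + 1):
        yield start + timedelta(days=offset)


def archive_urls(start, end):
    """Yield one listing URL for each (table row, date) combination."""
    for (name, num), day in product(SECTIONS.items(), date_range(start, end)):
        yield URL_TEMPLATE.format(name=name, day=day.strftime("%Y%m%d"), num=num)
```

For example, list(archive_urls(date(2011, 10, 1), date(2011, 10, 3))) gives 9 rows times 3 days, so 27 URLs, starting with the ky-kyrgyzstan pages.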
Algorithm
- generate every combination of a row of the above table ([1], [2]) with a date in a given range
- grab all pages at the URLs built from those combinations
- get all links by finding above «li» elements
- use scrapers.ScraperAzattyk to get contents of pages
- use something like scraper_classes.Source.add_to_archive() to make xml archive
- (use xml2txt.py to dump archive for use with stuff?)
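The link-finding step could be sketched roughly as follows, using only the Python standard library. This is an assumption-laden sketch, not the existing scraper code: ArchiveListParser and parse_archive_listing are hypothetical names, the class name archive_listrow_date is taken from the notes above, and the real page markup may differ in ways the notes do not record. It does not call scrapers.ScraperAzattyk or scraper_classes.Source.add_to_archive(), since their signatures are not described here.

```python
from html.parser import HTMLParser

# Class that marks a date row, as given in the notes above.
DATE_CLASS = "archive_listrow_date"


class ArchiveListParser(HTMLParser):
    """Collect (date, relative_url, title) for each article in a listing page."""

    def __init__(self):
        super().__init__()
        self.articles = []
        self._current_date = None   # text of the most recent date row ([3])
        self._in_date_li = False
        self._in_article_li = False
        self._href = None           # href of the link being read ([5])
        self._text = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "li":
            classes = (attrs.get("class") or "").split()
            self._in_date_li = DATE_CLASS in classes
            # Only rows that follow a date row count as articles.
            self._in_article_li = (not self._in_date_li
                                   and self._current_date is not None)
            self._text = []
        elif tag == "a" and self._in_article_li:
            self._href = attrs.get("href")
            self._text = []

    def handle_data(self, data):
        if self._in_date_li or self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "li":
            if self._in_date_li:
                self._current_date = "".join(self._text).strip()
            self._in_date_li = self._in_article_li = False
            self._href = None
        elif tag == "a" and self._href is not None:
            title = "".join(self._text).strip()   # link text ([6])
            self.articles.append((self._current_date, self._href, title))
            self._href = None


def parse_archive_listing(html):
    """Return a list of (date, relative_url, title) tuples from listing HTML."""
    parser = ArchiveListParser()
    parser.feed(html)
    parser.close()
    return parser.articles
```

Each returned relative URL would still need to be joined to the site root (for instance with urllib.parse.urljoin) before the article page is fetched and handed to the scraper.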