Difference between revisions of "Turkish and Kyrgyz/Making a corpus from azattyk"
Firespeaker (talk | contribs) |
Latest revision as of 08:51, 3 October 2011
Background information
- Get the archive listing pages from
  - http://www.azattyk.org/archive/[1]/[date]/[2]/[2].html
  - [1] and [2] are a section name and its number (see table below)
  - [date] is in the format yyyymmdd
- find each «li class="date archive_listrow_date"»[3]«/li»; [3] is the date
- each «li»[4]«/li» between one «li class="date archive_listrow_date"»«/li» and the next is an article
- each «li»[4]«/li» contains an «a href="[5]"»[6]«/a»; [5] is the relative URL of the article and [6] is its title
[1] | [2] |
---|---|
ky-kyrgyzstan | 392 |
ky-central_asia | 393 |
ky-world | 394 |
ky-politics | 395 |
ky-human_rights | 396 |
ky-economy | 397 |
ky-culture | 398 |
ky-voice_of_people | 399 |
ky-sport | 400 |
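The listing structure described above can be parsed roughly as follows. This is only a minimal sketch using Python's standard html.parser; it assumes the date class is spelled archive_listrow_date and that date and article «li» elements are flat siblings, neither of which has been checked against the live site. The names ArchiveListingParser and parse_listing are made up for illustration.

```python
from html.parser import HTMLParser


class ArchiveListingParser(HTMLParser):
    """Collect (date, relative_url, title) triples from one listing page."""

    # Assumption: class name as spelled in the notes above.
    DATE_CLASS = "archive_listrow_date"

    def __init__(self):
        super().__init__()
        self.articles = []
        self._current_date = None   # text of the most recent date <li>
        self._in_date_li = False
        self._in_item_li = False
        self._href = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "li":
            classes = (attrs.get("class") or "").split()
            self._in_date_li = self.DATE_CLASS in classes
            self._in_item_li = not self._in_date_li
            self._text = []
        elif tag == "a" and self._in_item_li and self._current_date is not None:
            # Only links inside an article <li> that follows a date <li> count.
            self._href = attrs.get("href")
            self._text = []

    def handle_data(self, data):
        if self._in_date_li or self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            title = "".join(self._text).strip()
            self.articles.append((self._current_date, self._href, title))
            self._href = None
        elif tag == "li":
            if self._in_date_li:
                self._current_date = "".join(self._text).strip()
            self._in_date_li = False
            self._in_item_li = False


def parse_listing(html):
    """Return a list of (date, relative_url, title) for one listing page."""
    parser = ArchiveListingParser()
    parser.feed(html)
    return parser.articles
```

Each returned triple corresponds to [3], [5] and [6] in the notes; the relative URL still has to be joined with the site root before fetching.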
Algorithm
- generate every combination of a row from the table above ([1] with its [2]) and a date in a given range
- fetch the page at each URL generated from those combinations
- collect all article links by finding the «li» elements described above
- use scrapers.ScraperAzattyk to get the contents of the article pages
- use something like scraper_classes.Source.add_to_archive() to build an XML archive
- (possibly use xml2txt.py to dump the archive for further use?)
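The first step (building the listing URLs) might look something like the sketch below. It pairs every table row with every date in a range and fills in the URL pattern from the background notes. The names SECTIONS, date_range and listing_urls are hypothetical, and the later steps (scrapers.ScraperAzattyk, scraper_classes.Source.add_to_archive(), xml2txt.py) are not shown because their interfaces are not described here.

```python
from datetime import date, timedelta
from itertools import product

# ([1], [2]) pairs copied from the table above.
SECTIONS = [
    ("ky-kyrgyzstan", 392),
    ("ky-central_asia", 393),
    ("ky-world", 394),
    ("ky-politics", 395),
    ("ky-human_rights", 396),
    ("ky-economy", 397),
    ("ky-culture", 398),
    ("ky-voice_of_people", 399),
    ("ky-sport", 400),
]

# URL pattern from the notes, with [1], [date], [2] as named fields.
URL_TEMPLATE = "http://www.azattyk.org/archive/{name}/{day}/{num}/{num}.html"


def date_range(start, end):
    """Yield every date from start to end, inclusive."""
    day = start
    while day <= end:
        yield day
        day += timedelta(days=1)


def listing_urls(start, end, sections=SECTIONS):
    """Return one listing URL per (section, date) combination."""
    return [
        URL_TEMPLATE.format(name=name, day=day.strftime("%Y%m%d"), num=num)
        for (name, num), day in product(sections, date_range(start, end))
    ]


if __name__ == "__main__":
    for url in listing_urls(date(2011, 10, 1), date(2011, 10, 2)):
        print(url)
```

Since [1] and [2] always occur together as a table row, the sketch combines rows with dates rather than permuting the two columns independently.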