Turkish and Kyrgyz/Making a corpus from azattyk


Background information

  1. Get from
  2. Find each «li class="date archive_listrow_date"»[3]«/li»; [3] is the date.
  3. Every «li»[4]«/li» between one «li class="date arhive_listrow_date"»«/li» and the next «li class="date arhive_listrow_date"»«/li» is an article.
  4. Each «li»[4]«/li» contains an «a href="[5]"»[6]«/a»; [5] is the relative URL of the article and [6] is its title.
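
Steps 2 to 4 above can be sketched with Python's standard html.parser module. This is only an illustration, not the code used by the project: the names ListingParser and parse_listing are hypothetical, the sample markup (dates, URLs, titles) is made up, and the date row is matched on the substring "listrow_date" on the assumption that this tolerates either spelling of the class name.

```python
from html.parser import HTMLParser


class ListingParser(HTMLParser):
    """Collect (date, relative_url, title) triples from a listing page."""

    def __init__(self):
        super().__init__()
        self.articles = []        # (date, href, title) triples found so far
        self.current_date = None  # text of the most recent date row
        self.in_date = False      # currently inside a date «li»
        self.in_item = False      # currently inside an article «li»
        self.href = None          # href of the link being read, if any
        self.text = []            # text chunks of the current date or title

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "li":
            # Assumption: a substring match covers both class spellings.
            if "listrow_date" in (attrs.get("class") or ""):
                self.in_date, self.text = True, []
            elif self.current_date is not None:
                # Only items that follow a date row count as articles.
                self.in_item = True
        elif tag == "a" and self.in_item and self.href is None and attrs.get("href"):
            self.href, self.text = attrs["href"], []

    def handle_data(self, data):
        if self.in_date or self.href is not None:
            self.text.append(data)

    def handle_endtag(self, tag):
        if tag == "li":
            if self.in_date:
                self.current_date = "".join(self.text).strip()
            self.in_date = self.in_item = False
            self.href = None
        elif tag == "a" and self.href is not None:
            title = "".join(self.text).strip()
            self.articles.append((self.current_date, self.href, title))
            self.href = None


def parse_listing(html):
    """Return every article on a listing page as (date, href, title)."""
    parser = ListingParser()
    parser.feed(html)
    parser.close()
    return parser.articles


# Made-up sample markup; the real date format and URLs may differ.
SAMPLE = """
<ul>
  <li class="date archive_listrow_date">01.02.2012</li>
  <li><a href="/content/article/1.html">First title</a></li>
  <li><a href="/content/article/2.html">Second title</a></li>
  <li class="date archive_listrow_date">02.02.2012</li>
  <li><a href="/content/article/3.html">Third title</a></li>
</ul>
"""

for day, href, title in parse_listing(SAMPLE):
    print(day, href, title)
```

Each article inherits the date of the closest preceding date row, which is how step 3 groups the items.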


Possible values for [1] and [2]:

[1]                   [2]
ky-kyrgyzstan         392
ky-central_asia       393
ky-world              394
ky-politics           395
ky-human_rights       396
ky-economy            397
ky-culture            398
ky-voice_of_people    399
ky-sport              400


Algorithm

  1. Generate every combination of a row from the table above and a date in the given range.
  2. Fetch the page corresponding to each combination.
  3. Collect all article links by finding the «li» elements described above.
  4. Use scrapers.ScraperAzattyk to get the contents of each article page.
  5. Use something like scraper_classes.Source.add_to_archive() to build the XML archive.
  6. (Use xml2txt.py to dump the archive to plain text for later use?)
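
Step 1 amounts to a Cartesian product of the table rows and the dates in the range. A minimal sketch follows, assuming an inclusive date range; the names SECTIONS, date_range and listing_params are hypothetical. How the three values are combined into a page address is not shown here, and neither are the calls into scrapers.ScraperAzattyk or scraper_classes, since their signatures are not given on this page.

```python
from datetime import date, timedelta
from itertools import product

# Values of [1] and [2], copied from the table above.
SECTIONS = {
    "ky-kyrgyzstan": 392,
    "ky-central_asia": 393,
    "ky-world": 394,
    "ky-politics": 395,
    "ky-human_rights": 396,
    "ky-economy": 397,
    "ky-culture": 398,
    "ky-voice_of_people": 399,
    "ky-sport": 400,
}


def date_range(start, end):
    """Yield every date from start to end, both inclusive."""
    for offset in range((end - start).days + 1):
        yield start + timedelta(days=offset)


def listing_params(start, end):
    """Yield one (section_name, section_id, day) triple per listing page."""
    for (name, section_id), day in product(SECTIONS.items(), date_range(start, end)):
        yield name, section_id, day


# Example: three days times nine sections gives 27 listing pages to fetch.
for name, section_id, day in listing_params(date(2012, 1, 30), date(2012, 2, 1)):
    print(name, section_id, day.isoformat())
```

Keeping the parameters as plain tuples leaves the choice of URL format and download method to whatever code performs step 2.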