Turkish and Kyrgyz/Making a corpus from azattyk

From Apertium
Latest revision as of 08:51, 3 October 2011

== Background information ==

# Get pages from http://www.azattyk.org/archive/[1]/[date]/[2]/[2].html
#* [1] and [2] take the values in the table below
#* [date] is in the format yyyymmdd
# Find each <li class="date archive_listrow_date">[3]</li>; the date is [3]
# Every <li>[4]</li> between one <li class="date archive_listrow_date"></li> and the next is an article
# Each <li>[4]</li> contains an <a href="[5]">[6]</a>; [5] is the relative URL of the article and [6] is its title


{| class="wikitable"
|+ possible values for [1] and [2]
|-
! [1] !! [2]
|-
| ky-kyrgyzstan || 392
|-
| ky-central_asia || 393
|-
| ky-world || 394
|-
| ky-politics || 395
|-
| ky-human_rights || 396
|-
| ky-economy || 397
|-
| ky-culture || 398
|-
| ky-voice_of_people || 399
|-
| ky-sport || 400
|}
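As a rough illustration, the URL template and the table values above expand like this (a minimal Python sketch; the function name <code>archive_urls</code> is made up here, not part of the project's scrapers):

<pre>
from datetime import date, timedelta

# [1] -> [2] pairs from the table above.
SECTIONS = {
    "ky-kyrgyzstan": 392,
    "ky-central_asia": 393,
    "ky-world": 394,
    "ky-politics": 395,
    "ky-human_rights": 396,
    "ky-economy": 397,
    "ky-culture": 398,
    "ky-voice_of_people": 399,
    "ky-sport": 400,
}

def archive_urls(start, end):
    """Yield one archive URL per (section, date) pair in the closed range."""
    day = start
    while day <= end:
        stamp = day.strftime("%Y%m%d")  # [date] in yyyymmdd format
        for name, num in SECTIONS.items():
            yield f"http://www.azattyk.org/archive/{name}/{stamp}/{num}/{num}.html"
        day += timedelta(days=1)
</pre>

For example, <code>archive_urls(date(2010, 1, 1), date(2010, 1, 1))</code> starts with the ky-kyrgyzstan URL for 20100101.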


== Algorithm ==

# Generate all permutations of the values in the table above plus dates in a given range.
# Fetch all pages generated by those permutations.
# Get all links by finding the <li> elements described above.
# Use scrapers.ScraperAzattyk to get the contents of the pages.
# Use something like scraper_classes.Source.add_to_archive() to build an XML archive.
# (Use xml2txt.py to dump the archive to plain text for further use?)
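The link-extraction step can be sketched as below. This is not the project's actual scraper: only the <code>archive_listrow_date</code> class name comes from the page structure described above, the sample HTML in the usage example is invented, and real azattyk.org pages may differ in detail.

<pre>
import re

# One alternation over the two kinds of <li> rows described in the
# background section: a date row ([3]) or an article row ([5], [6]).
LI_RE = re.compile(
    r'<li class="date archive_listrow_date">([^<]*)</li>'  # date row: [3]
    r'|<li>\s*<a href="([^"]*)">([^<]*)</a>\s*</li>'       # article row: [5], [6]
)

def extract_articles(html):
    """Return (date, relative_url, title) triples in document order."""
    articles = []
    current_date = None
    for m in LI_RE.finditer(html):
        if m.group(1) is not None:
            current_date = m.group(1).strip()  # a new date heading [3]
        elif current_date is not None:
            articles.append((current_date, m.group(2), m.group(3)))
    return articles
</pre>

Given a listing with one date row followed by two article rows, <code>extract_articles</code> returns both articles tagged with that date; articles after the next date row get the new date.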