Turkish and Kyrgyz/Making a corpus from azattyk

From Apertium
Latest revision as of 08:51, 3 October 2011

== Background information ==

# Get pages from http://www.azattyk.org/archive/[1]/[date]/[2]/[2].html
#* [1] and [2] take the values in the table below
#* [date] is in the format yyyymmdd
# Find each <li class="date archive_listrow_date">[3]</li>; the date is [3]
# Every <li>[4]</li> between one <li class="date archive_listrow_date"></li> and the next is an article
# Each <li>[4]</li> contains an <a href="[5]">[6]</a>; [5] is the relative URL of the article and [6] is its title


{| class="wikitable"
|+ possible values for [1] and [2]
|-
! [1] !! [2]
|-
| ky-kyrgyzstan || 392
|-
| ky-central_asia || 393
|-
| ky-world || 394
|-
| ky-politics || 395
|-
| ky-human_rights || 396
|-
| ky-economy || 397
|-
| ky-culture || 398
|-
| ky-voice_of_people || 399
|-
| ky-sport || 400
|}
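As a rough illustration, the URL template and the table values above expand like this (a minimal Python sketch; the function name <code>archive_urls</code> is made up here, not part of the project's scrapers):

<pre>
from datetime import date, timedelta

# [1] -> [2] pairs from the table above.
SECTIONS = {
    "ky-kyrgyzstan": 392,
    "ky-central_asia": 393,
    "ky-world": 394,
    "ky-politics": 395,
    "ky-human_rights": 396,
    "ky-economy": 397,
    "ky-culture": 398,
    "ky-voice_of_people": 399,
    "ky-sport": 400,
}

def archive_urls(start, end):
    """Yield one archive URL per (section, date) pair in the closed range."""
    day = start
    while day <= end:
        stamp = day.strftime("%Y%m%d")  # [date] in yyyymmdd format
        for name, num in SECTIONS.items():
            yield f"http://www.azattyk.org/archive/{name}/{stamp}/{num}/{num}.html"
        day += timedelta(days=1)
</pre>

For example, <code>archive_urls(date(2010, 1, 1), date(2010, 1, 1))</code> starts with the ky-kyrgyzstan URL for 20100101.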


== Algorithm ==

# Generate all permutations of the values in the table above plus dates in a given range.
# Fetch all pages generated by those permutations.
# Get all links by finding the <li> elements described above.
# Use scrapers.ScraperAzattyk to get the contents of the pages.
# Use something like scraper_classes.Source.add_to_archive() to build an XML archive.
# (Use xml2txt.py to dump the archive to plain text for further use?)
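The link-extraction step can be sketched as below. This is not the project's actual scraper: only the <code>archive_listrow_date</code> class name comes from the page structure described above, the sample HTML in the usage example is invented, and real azattyk.org pages may differ in detail.

<pre>
import re

# One alternation over the two kinds of <li> rows described in the
# background section: a date row ([3]) or an article row ([5], [6]).
LI_RE = re.compile(
    r'<li class="date archive_listrow_date">([^<]*)</li>'  # date row: [3]
    r'|<li>\s*<a href="([^"]*)">([^<]*)</a>\s*</li>'       # article row: [5], [6]
)

def extract_articles(html):
    """Return (date, relative_url, title) triples in document order."""
    articles = []
    current_date = None
    for m in LI_RE.finditer(html):
        if m.group(1) is not None:
            current_date = m.group(1).strip()  # a new date heading [3]
        elif current_date is not None:
            articles.append((current_date, m.group(2), m.group(3)))
    return articles
</pre>

Given a listing with one date row followed by two article rows, <code>extract_articles</code> returns both articles tagged with that date; articles after the next date row get the new date.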