Difference between revisions of "Aravot.am"
Revision as of 05:47, 13 December 2012
Extracting from http://www.aravot.am
How to get to the index of each day and where the main archive page is.
1998 - 2000
Armenian
Main Archive page : http://archive.aravot.am/index_arc.html
Grabbing Articles by date
Template : http://archive.aravot.am/[year]/[month]/[day]/index_[month][day].html
ie : http://archive.aravot.am/1999/dectember/22/index_dec22.html
2000
Armenian
Main archive page : http://archive.aravot.am/arc_00_01.htm
Articles by date Template : http://archive.aravot.am/2000new/aravot_arm/[month]/[day]/aravot_index.htm
ie : http://archive.aravot.am/2000new/aravot_arm/oktember/20/aravot_index.htm
Note: Beware of spelling of some months: October = oktember
2001
Armenian
Main archive page : http://archive.aravot.am/arc_00_01.htm
Articles by date
Template : http://archive.aravot.am/2001/aravot_arm/[month]/[day]/aravot_index.htm
ie : http://archive.aravot.am/2001/aravot_arm/january/13/aravot_index.htm
2002 - 2003
Armenian
Main archive page : http://archive.aravot.am/aravot_arc_2002.htm
Articles by date
Template : http://archive.aravot.am/[year]/aravot_arm/[month]/[day]/aravot_index.htm
ie : http://archive.aravot.am/2002/aravot_arm/February/7/aravot_index.htm
ie : http://archive.aravot.am/2003/aravot_arm/January/15/aravot_index.htm
Note for 1998 - 2003
Most likely English and Russian articles do not exist.
The parent directories cannot be accessed.
ie : http://archive.aravot.am/2003
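The 1998 - 2003 templates above can be filled in programmatically. The following is a minimal sketch, not a tested scraper: the function names are hypothetical, and the month strings are passed in by the caller because the site spells and capitalises months inconsistently (dectember, oktember, January). The 1998 - 2000 example also uses a shortened month (dec) in the file name, so that abbreviation is a separate parameter here.

```python
ARCHIVE_HOST = "http://archive.aravot.am"


def index_url_1998_2000(year, month_name, month_abbrev, day):
    """Daily index URL for the 1998 - 2000 layout.

    month_name is the directory name (e.g. "dectember") and month_abbrev
    is the short form used in the file name (e.g. "dec"). Both are
    supplied by the caller; this sketch does not guess the site's spelling.
    """
    return f"{ARCHIVE_HOST}/{year}/{month_name}/{day}/index_{month_abbrev}{day}.html"


def index_url_2000_2003(year, month_name, day):
    """Daily index URL for the 2000, 2001 and 2002 - 2003 layouts.

    The 2000 archive lives under "2000new"; later years use the plain year.
    """
    year_dir = "2000new" if year == 2000 else str(year)
    return f"{ARCHIVE_HOST}/{year_dir}/aravot_arm/{month_name}/{day}/aravot_index.htm"


if __name__ == "__main__":
    print(index_url_1998_2000(1999, "dectember", "dec", 22))
    print(index_url_2000_2003(2000, "oktember", 20))
    print(index_url_2000_2003(2003, "January", 15))
```

Since month spelling varies between years, a lookup table of the spellings actually seen on the main archive pages is probably safer than generating month names.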
2004 - 2011 October
- 2004
Main archive page : http://www.aravot.am/aravot_arc_2004.htm
- 2005 - 2011
Main archive page : http://www.aravot.am/aravot_arc.htm
Articles by date
Template : http://www.aravot.am/[year]/aravot_[language]/[month]/[day]/aravot_index.htm
[language] is one of the following: arm, eng, rus
Examples
http://www.aravot.am/2004/aravot_arm/December/4/aravot_index.htm
http://www.aravot.am/2004/aravot_eng/December/4/aravot_index.htm
http://www.aravot.am/2004/aravot_rus/December/4/aravot_index.htm
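As a rough illustration of the 2004 - 2011 template, the sketch below builds the daily index URL for each of the three language codes. The function name is hypothetical, and the month name is passed through unchanged because the capitalisation seen in the examples (December) may not hold for every year.

```python
LANGUAGES = ("arm", "eng", "rus")


def index_urls_2004_2011(year, month_name, day):
    """Return {language: daily index URL} for the 2004 - 2011 layout.

    A URL being generated does not mean the page exists: coverage for
    eng and rus is incomplete, so each URL still has to be checked.
    """
    return {
        lang: f"http://www.aravot.am/{year}/aravot_{lang}/{month_name}/{day}/aravot_index.htm"
        for lang in LANGUAGES
    }


if __name__ == "__main__":
    for lang, url in index_urls_2004_2011(2004, "December", 4).items():
        print(lang, url)
```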
2011 October 21 - Present
Articles by date
Template : http://www.aravot.am/[lang]/[year]/[month]/[day]/[articleNumber]/
ie : http://www.aravot.am/en/2011/12/26/21854/
Examples
English
http://www.aravot.am/en/2011/12/26/21854/
Russian
http://www.aravot.am/ru/2011/12/26/21854/
Armenian
http://www.aravot.am/2011/12/26/21854/
Note : For Armenian, the [lang] segment is left out of the URL.
Note : Not all articles exist in all three languages (as shown in the examples).
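One way to handle the missing language segment for Armenian is to parse article URLs with an optional language group. This is a sketch under assumptions: the function name is hypothetical, "hy" is simply the label chosen here for Armenian (the site itself uses no code for it), and the pattern is based only on the example URLs above.

```python
import re
from datetime import date

# Optional "en/" or "ru/" segment, then year/month/day/articleNumber.
ARTICLE_URL = re.compile(
    r"^https?://www\.aravot\.am/(?:(en|ru)/)?(\d{4})/(\d{1,2})/(\d{1,2})/(\d+)/?$"
)


def parse_article_url(url):
    """Return (language, date, article_number), or None if url is not an article URL."""
    match = ARTICLE_URL.match(url)
    if match is None:
        return None
    lang, year, month, day, number = match.groups()
    # No language segment means the Armenian edition.
    return (lang or "hy", date(int(year), int(month), int(day)), int(number))


if __name__ == "__main__":
    for url in (
        "http://www.aravot.am/en/2011/12/26/21854/",
        "http://www.aravot.am/ru/2011/12/26/21854/",
        "http://www.aravot.am/2011/12/26/21854/",
    ):
        print(parse_article_url(url))
```

A daily archive URL such as http://www.aravot.am/ru/2011/12/26/ has no article number, so the function returns None for it, which makes it usable as a filter when collecting links.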
Scraping
1998 - 2003
Difficulty: Medium
The only language found is Armenian.
Start at the index.htm of each date.
Parse for any links to articles.
Sometimes these links lead only to summary pages covering a few articles, so each summary page has to be searched for further article links.
Example
index.htm
http://archive.aravot.am/2001/aravot_arm/january/13/aravot_index.htm
First link on the left goes directly to an article
http://archive.aravot.am/2001/aravot_arm/january/13/aravot_index.htm
Second link goes to a summary page
http://archive.aravot.am/2001/aravot_arm/january/13/aravot_politik.htm
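The two-level crawl described above (index page, then summary pages) could be sketched as follows using only the Python standard library. The names `LinkCollector`, `extract_links` and `collect_day_pages` are hypothetical, and `fetch` is any caller-supplied function that maps a URL to its HTML text; downloading and character decoding are deliberately left out. The sketch assumes that every page belonging to a day lives in that day's directory, which matches the examples but is not confirmed for the whole archive.

```python
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin


class LinkCollector(HTMLParser):
    """Collect the href value of every <a> tag."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def extract_links(html, base_url):
    """Return absolute, de-duplicated link targets found in html."""
    parser = LinkCollector()
    parser.feed(html)
    links = []
    for href in parser.hrefs:
        absolute, _fragment = urldefrag(urljoin(base_url, href))
        if absolute != base_url and absolute not in links:
            links.append(absolute)
    return links


def collect_day_pages(index_url, fetch, max_depth=2):
    """Walk from a day's index page, following links inside that day's directory.

    Depth 1 covers links on the index page; depth 2 also covers links
    found on summary pages such as aravot_politik.htm.
    """
    day_dir = index_url.rsplit("/", 1)[0] + "/"
    visited = [index_url]
    frontier = [index_url]
    for _ in range(max_depth):
        next_frontier = []
        for page in frontier:
            for link in extract_links(fetch(page), page):
                if link.startswith(day_dir) and link not in visited:
                    visited.append(link)
                    next_frontier.append(link)
        frontier = next_frontier
    return visited[1:]  # everything except the index page itself
```

The result mixes article pages and summary pages; telling them apart would still need a check on page content, which this sketch does not attempt.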
2004 - 2011
Difficulty: Hard
Articles can be in Armenian, Russian or English, but not every article is in every language.
There are also some huge holes in the archive for languages other than Armenian.
For 2004, English has only 2 months.
http://www.aravot.am/2004/aravot_eng/
Easy to download all the articles from one day
Difficult to match article to article from different languages
File names for the same articles are not always the same
Some articles exist in one language but not the other
Index pages are not always the same either
http://www.aravot.am/2004/aravot_arm/December/4/aravot_index.htm
http://www.aravot.am/2004/aravot_eng/December/4/aravot_index.htm
2011 October 21 - Present
Difficulty: Easy
Given a range of dates, go to the archive page of each day in all three languages.
Example http://www.aravot.am/ru/2011/12/26/
Scrape all the article links, then find the article full text
Save html with date, language, article number
Try to match articles with same date and article number.
Matching on article number alone should be safe enough, but including the date makes the key more specific.
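The saving and matching steps above might look like the sketch below. It assumes each scraped article has already been reduced to a (language, date, article number) record; the file name pattern and both function names are illustrative choices, not something the site prescribes. "hy" is used here as the label for Armenian.

```python
from collections import defaultdict
from datetime import date


def html_filename(day, lang, article_number):
    """File name carrying date, language and article number, e.g. 2011-12-26_en_21854.html."""
    return f"{day.isoformat()}_{lang}_{article_number}.html"


def group_translations(records):
    """Group (lang, day, article_number) records by (day, article_number).

    Returns {(day, article_number): {lang: file name}}, so every value
    lists the languages in which that article was found.
    """
    groups = defaultdict(dict)
    for lang, day, number in records:
        groups[(day, number)][lang] = html_filename(day, lang, number)
    return dict(groups)


if __name__ == "__main__":
    day = date(2011, 12, 26)
    records = [("en", day, 21854), ("ru", day, 21854), ("hy", day, 21854)]
    for key, files in group_translations(records).items():
        print(key, files)
```

Groups with fewer than three languages are the articles that were not published in every language, so they can be filtered out or kept depending on what the matched set is needed for.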