Difference between revisions of "Aravot.am"
Revision as of 05:52, 13 December 2012
Extracting from http://www.aravot.am
How to get to the index of each day and where the main archive page is.
1998 - 2000
Armenian
Main Archive page : http://archive.aravot.am/index_arc.html
Grabbing Articles by date
Template : http://archive.aravot.am/[year]/[month]/[day]/index_[month][day].html
ie : http://archive.aravot.am/1999/dectember/22/index_dec22.html
2000
Armenian
Main archive page : http://archive.aravot.am/arc_00_01.htm
Articles by date Template : http://archive.aravot.am/2000new/aravot_arm/[month]/[day]/aravot_index.htm
ie : http://archive.aravot.am/2000new/aravot_arm/oktember/20/aravot_index.htm
Note: Beware of spelling of some months: October = oktember
2001
Armenian
Main archive page : http://archive.aravot.am/arc_00_01.htm
Articles by date
Template : http://archive.aravot.am/2001/aravot_arm/[month]/[day]/aravot_index.htm
ie : http://archive.aravot.am/2001/aravot_arm/january/13/aravot_index.htm
2002 - 2003
Armenian
Main archive page : http://archive.aravot.am/aravot_arc_2002.htm
Articles by date
Template : http://archive.aravot.am/[year]/aravot_arm/[month]/[day]/aravot_index.htm
ie : http://archive.aravot.am/2002/aravot_arm/February/7/aravot_index.htm
ie : http://archive.aravot.am/2003/aravot_arm/January/15/aravot_index.htm
Note for 1998 - 2003
English and Russian articles most likely do not exist for these years.
The parent directories cannot be accessed directly.
ie : http://archive.aravot.am/2003
2004 - 2011 October
- 2004
Main archive page : http://www.aravot.am/aravot_arc_2004.htm
- 2005 - 2011
Main archive page : http://www.aravot.am/aravot_arc.htm
Articles by date
Template : http://www.aravot.am/[year]/aravot_[language]/[month]/[day]/aravot_index.htm
[language] is one of the following: arm, eng, rus
Examples
Armenian
http://www.aravot.am/2004/aravot_arm/December/4/aravot_index.htm
English
http://www.aravot.am/2004/aravot_eng/December/4/aravot_index.htm
Russian
http://www.aravot.am/2004/aravot_rus/December/4/aravot_index.htm
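The template above can be filled in with a small helper. This is only a sketch: the function name `legacy_index_url` is hypothetical, and the month is passed as a string exactly as it appears in the archive, because the spelling and capitalization of month names vary between years (for example "oktember" in 2000).

```python
def legacy_index_url(year, month, day, language="arm"):
    """Build the daily index URL for the 2004 - 2011 layout.

    `month` is the month name as spelled in the archive (e.g. "December").
    `language` must be one of the three codes listed above.
    """
    if language not in ("arm", "eng", "rus"):
        raise ValueError(f"unknown language code: {language!r}")
    return (
        f"http://www.aravot.am/{year}/aravot_{language}"
        f"/{month}/{day}/aravot_index.htm"
    )


# Reproduces the three example URLs for 4 December 2004.
for code in ("arm", "eng", "rus"):
    print(legacy_index_url(2004, "December", 4, code))
```

Because the English and Russian archives have gaps, a generated URL is not guaranteed to exist; a scraper should expect some requests to fail.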
2011 October 21 - Present
Articles by date
Template : http://www.aravot.am/[lang]/[year]/[month]/[day]/[articleNumber]/
ie : http://www.aravot.am/en/2011/12/26/21854/
Examples
English
http://www.aravot.am/en/2011/12/26/21854/
Russian
http://www.aravot.am/ru/2011/12/26/21854/
Armenian
http://www.aravot.am/2011/12/26/21854/
Note : For Armenian, [lang] is taken out
Note : Not all articles exist in all three languages (as shown in the examples)
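A minimal sketch of the template for the current layout, including the rule that Armenian URLs omit the language segment. The function name `article_url` is hypothetical, and zero-padding the month and day to two digits is an assumption (the examples only show two-digit values).

```python
from datetime import date


def article_url(day, article_number, lang=None):
    """Build an article URL for the layout used since 21 October 2011.

    `lang` is "en" or "ru"; pass None for Armenian, where the
    language segment is left out of the path.
    """
    if lang not in (None, "en", "ru"):
        raise ValueError(f"unknown language code: {lang!r}")
    prefix = f"{lang}/" if lang else ""
    # Assumption: month and day are zero-padded to two digits.
    return (
        f"http://www.aravot.am/{prefix}"
        f"{day.year}/{day.month:02d}/{day.day:02d}/{article_number}/"
    )


print(article_url(date(2011, 12, 26), 21854, "en"))
print(article_url(date(2011, 12, 26), 21854))  # Armenian, no [lang]
```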
Scraping
1998 - 2003
Difficulty: Medium
- The only language found is Armenian.
- Start at the index.htm of each date.
- Parse for any links to articles.
- Sometimes a link leads only to a summary page covering a few articles, so you have to follow the links within each of these pages as well.
Example
index.htm
http://archive.aravot.am/2001/aravot_arm/january/13/aravot_index.htm
First link on the left goes directly to an article
http://archive.aravot.am/2001/aravot_arm/january/13/aravot_index.htm
Second link goes to a summary page
http://archive.aravot.am/2001/aravot_arm/january/13/aravot_politik.htm
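The link-parsing step could be sketched with the Python standard library as follows. The names `LinkCollector` and `extract_links` are hypothetical, and the sketch assumes the page HTML has already been downloaded and decoded to a string (the encoding of the archive pages is not stated here). Relative links are resolved against the index URL so that summary pages can be fetched and parsed the same way.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collect the absolute URLs of all <a href="..."> links in a page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag names, so <A HREF=...> is handled too.
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if href:
            self.links.append(urljoin(self.base_url, href))


def extract_links(html, base_url):
    """Return every link found in `html`, resolved against `base_url`."""
    collector = LinkCollector(base_url)
    collector.feed(html)
    return collector.links


index_url = "http://archive.aravot.am/2001/aravot_arm/january/13/aravot_index.htm"
sample = '<a href="aravot_politik.htm">summary</a>'  # made-up fragment
print(extract_links(sample, index_url))
```

To handle summary pages, the same function can be applied again to each page it returns, keeping a set of already visited URLs to avoid loops.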
2004 - 2011
Difficulty: Hard
- Articles can be in Armenian, Russian or English, but not every article is in every language.
- There are also some huge holes in the archive for languages other than Armenian.
- For 2004, the English archive covers only 2 months.
http://www.aravot.am/2004/aravot_eng/
- Easy to download all the articles from one day.
- Difficult to match an article to its counterparts in other languages:
- File names for the same article are not always the same.
- Some articles exist in one language but not the other.
- Index pages are not always the same either.
http://www.aravot.am/2004/aravot_arm/December/4/aravot_index.htm
http://www.aravot.am/2004/aravot_eng/December/4/aravot_index.htm
2011 October 21 - Present
Difficulty: Easy
Given a range of dates, go to the archive page of each day in all three languages.
Example
http://www.aravot.am/ru/2011/12/26/
- Scrape all the article links, then fetch the full text of each article.
- Save the HTML with date, language, and article number.
- Try to match articles that share the same date and article number.
- Matching on article number alone should be safe enough, but including the date makes the key more specific.
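The matching step can be sketched by parsing each article URL into language, date and article number, then grouping on (date, article number). The names `parse_article_url` and `group_by_article` are hypothetical, and the label "hy" for Armenian URLs (which carry no language segment) is an assumption made only for this example.

```python
import re
from collections import defaultdict

# Matches the current layout; the language segment is optional (Armenian).
ARTICLE_URL = re.compile(
    r"^http://www\.aravot\.am/(?:(?P<lang>en|ru)/)?"
    r"(?P<year>\d{4})/(?P<month>\d{1,2})/(?P<day>\d{1,2})/(?P<number>\d+)/?$"
)


def parse_article_url(url):
    """Return (lang, 'YYYY-MM-DD', article_number), or None if no match."""
    m = ARTICLE_URL.match(url)
    if m is None:
        return None
    lang = m.group("lang") or "hy"  # assumed label for Armenian
    day = f"{m.group('year')}-{int(m.group('month')):02d}-{int(m.group('day')):02d}"
    return lang, day, int(m.group("number"))


def group_by_article(urls):
    """Map (date, article_number) to a {lang: url} dictionary."""
    groups = defaultdict(dict)
    for url in urls:
        parsed = parse_article_url(url)
        if parsed is None:
            continue  # not an article URL, e.g. a daily archive page
        lang, day, number = parsed
        groups[(day, number)][lang] = url
    return dict(groups)


urls = [
    "http://www.aravot.am/en/2011/12/26/21854/",
    "http://www.aravot.am/ru/2011/12/26/21854/",
    "http://www.aravot.am/2011/12/26/21854/",
]
for key, versions in group_by_article(urls).items():
    print(key, sorted(versions))
```

Groups that contain fewer than three languages correspond to articles that were not published in every language.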