Aravot.am
Index
How to reach the index page for each day, and where the main archive pages are.
1998 - 2000
Armenian
- Main archive page : http://archive.aravot.am/index_arc.html
Articles by date
- Template : http://archive.aravot.am/[year]/[month]/[day]/index_[month][day].html
ie : http://archive.aravot.am/1999/dectember/22/index_dec22.html
2000
Armenian
- Main archive page : http://archive.aravot.am/arc_00_01.htm
Articles by date
- Template : http://archive.aravot.am/2000new/aravot_arm/[month]/[day]/aravot_index.htm
ie : http://archive.aravot.am/2000new/aravot_arm/oktember/20/aravot_index.htm
- Note: Beware of the spelling of some months in the URLs, e.g. October = oktember
2001
Armenian
- Main archive page : http://archive.aravot.am/arc_00_01.htm
Articles by date
- Template : http://archive.aravot.am/2001/aravot_arm/[month]/[day]/aravot_index.htm
ie : http://archive.aravot.am/2001/aravot_arm/january/13/aravot_index.htm
2002 - 2003
Armenian
- Main archive page : http://archive.aravot.am/aravot_arc_2002.htm
Articles by date
- Template : http://archive.aravot.am/[year]/aravot_arm/[month]/[day]/aravot_index.htm
ie : http://archive.aravot.am/2002/aravot_arm/February/7/aravot_index.htm
ie : http://archive.aravot.am/2003/aravot_arm/January/15/aravot_index.htm
- Note for 1998 - 2003
- English and Russian articles most likely do not exist.
- The parent directories cannot be accessed, so this cannot be confirmed.
ie : http://archive.aravot.am/2003
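The 2000 - 2003 templates above can be filled in with a small helper. This is only a sketch: `archive_index_url` is a hypothetical name, it covers the `2000new` layout for 2000 and the `[year]` layout for 2001 - 2003, and the month is passed in verbatim because the spelling and capitalization on the site vary (`oktember`, `january`, `February`).

```python
ARCHIVE_HOST = "http://archive.aravot.am"


def archive_index_url(year, month_name, day):
    """Build the Armenian daily index URL for 2000 - 2003.

    month_name is used exactly as given; the caller has to supply the
    spelling the site actually uses for that year.
    """
    if year == 2000:
        root = "2000new"  # the 2000 pages live under /2000new/
    elif 2001 <= year <= 2003:
        root = str(year)
    else:
        raise ValueError("this template only covers 2000 - 2003")
    return f"{ARCHIVE_HOST}/{root}/aravot_arm/{month_name}/{day}/aravot_index.htm"
```

Calling `archive_index_url(2000, "oktember", 20)` reproduces the 2000 example URL listed above.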
2004 - 2011 October
- 2004
- Main archive page : http://www.aravot.am/aravot_arc_2004.htm
- 2005 - 2011
- Main archive page : http://www.aravot.am/aravot_arc.htm
Articles by date
- Template : http://www.aravot.am/[year]/aravot_[language]/[month]/[day]/aravot_index.htm
[language] is one of the following: arm, eng, rus
Examples
Armenian
http://www.aravot.am/2004/aravot_arm/December/4/aravot_index.htm
English
http://www.aravot.am/2004/aravot_eng/December/4/aravot_index.htm
Russian
http://www.aravot.am/2004/aravot_rus/December/4/aravot_index.htm
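A minimal sketch of filling in the 2004 - 2011 template for a given date and language. The function name `index_url` is hypothetical. It assumes that every month uses its capitalized English name and that the day is not zero-padded; the examples above only confirm this for `December/4`, so other months should be checked against the site.

```python
from datetime import date

MONTHS = ("January", "February", "March", "April", "May", "June", "July",
          "August", "September", "October", "November", "December")
LANGUAGES = ("arm", "eng", "rus")
NEW_LAYOUT_START = date(2011, 10, 21)  # from this day on the URL scheme changes


def index_url(day, language):
    """Build the daily index URL for 1 January 2004 to 20 October 2011."""
    if language not in LANGUAGES:
        raise ValueError(f"language must be one of {LANGUAGES}")
    if day.year < 2004 or day >= NEW_LAYOUT_START:
        raise ValueError("this template only covers 2004 to 20 October 2011")
    month = MONTHS[day.month - 1]  # assumption: English month names throughout
    return (f"http://www.aravot.am/{day.year}/aravot_{language}/"
            f"{month}/{day.day}/aravot_index.htm")
```

Because the archive has gaps outside Armenian, a URL built this way may not exist, so a missing page needs to be handled by the caller.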
2011 October 21 - Present
Articles by date
- Template : http://www.aravot.am/[lang]/[year]/[month]/[day]/[articleNumber]/
ie : http://www.aravot.am/en/2011/12/26/21854/
Examples
English
http://www.aravot.am/en/2011/12/26/21854/
Russian
http://www.aravot.am/ru/2011/12/26/21854/
Armenian
http://www.aravot.am/2011/12/26/21854/
- Note: For Armenian, [lang] is taken out
- Note: Not all articles exist in all three languages (as shown in the examples)
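The template and the Armenian special case can be sketched as below. The function names and the `"hy"` key for Armenian are labels invented for this example; the site itself uses no prefix for Armenian. Zero-padded month and day are an assumption, since the examples only show two-digit values.

```python
from datetime import date

BASE = "http://www.aravot.am"
# "hy" is only a label used in this sketch; Armenian URLs carry no prefix.
LANG_PREFIX = {"hy": "", "en": "en/", "ru": "ru/"}


def daily_archive_url(lang, day):
    """URL of the page listing one day's articles in one language."""
    # assumption: month and day are zero-padded to two digits
    return f"{BASE}/{LANG_PREFIX[lang]}{day:%Y/%m/%d}/"


def article_url(lang, day, article_number):
    """URL of a single article: the daily archive URL plus the number."""
    return f"{daily_archive_url(lang, day)}{article_number}/"
```

For instance, `article_url("en", date(2011, 12, 26), 21854)` gives the English example URL listed above.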
Scraping
1998 - 2003
Difficulty: Medium
- The only language found is Armenian.
- Start at the index.htm of each date.
- Parse for any links to articles.
- Sometimes a linked page is only a summary of a few articles, so you have to search each of these pages for further links.
Example
- index.htm
http://archive.aravot.am/2001/aravot_arm/january/13/aravot_index.htm
- First link on the left goes directly to an article
http://archive.aravot.am/2001/aravot_arm/january/13/aravot_index.htm
- Second link goes to a summary page
http://archive.aravot.am/2001/aravot_arm/january/13/aravot_politik.htm
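One way to handle the mix of direct article links and summary pages is to follow links for a limited number of hops and keep every page found in the same day directory. The sketch below does that with the standard library's `html.parser` rather than Beautiful Soup, so it runs without extra packages. All names are hypothetical, `fetch` is a caller-supplied function that returns decoded HTML for a URL, and telling articles apart from summaries is left to a later step because the notes above give no rule for it.

```python
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin


class LinkCollector(HTMLParser):
    """Collect the href value of every <a> tag."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.hrefs.append(href)


def same_day_links(page_url, html):
    """Return absolute links that stay inside the page's own directory."""
    day_dir = page_url.rsplit("/", 1)[0] + "/"
    parser = LinkCollector()
    parser.feed(html)
    links = []
    for href in parser.hrefs:
        absolute = urldefrag(urljoin(page_url, href)).url
        if (absolute.startswith(day_dir) and absolute != page_url
                and absolute not in links):
            links.append(absolute)
    return links


def collect_pages(index_url, fetch, max_depth=2):
    """Walk breadth-first from the index page; return {url: html}."""
    pages = {}
    frontier = [index_url]
    for _ in range(max_depth + 1):  # depth 0 is the index page itself
        next_frontier = []
        for url in frontier:
            if url in pages:
                continue  # already downloaded, e.g. a link back to the index
            pages[url] = fetch(url)
            next_frontier.extend(same_day_links(url, pages[url]))
        frontier = next_frontier
    return pages
```

With `max_depth=2` this reaches articles linked from a summary page such as `aravot_politik.htm` without wandering outside that day's directory.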
2004 - 2011
Difficulty: Hard
- Articles can be in Armenian, Russian or English, but not every article is in every language.
- There are also some huge holes in the archive for languages other than Armenian
- For 2004, English only has 2 months
http://www.aravot.am/2004/aravot_eng/
- Easy to download all the articles from one day
- Difficult to match article to article from different languages
- File names for the same articles are not always the same
- Some articles exist in one language but not the other
- Index pages are not always the same either
http://www.aravot.am/2004/aravot_arm/December/4/aravot_index.htm
http://www.aravot.am/2004/aravot_eng/December/4/aravot_index.htm
2011 October 21 - Present
Difficulty: Easy
- Given a range of dates, go to the archive page of each day in all three languages.
Example
http://www.aravot.am/ru/2011/12/26/
- Scrape all the article links, then find the article full text
- Save the HTML with date, language, article number
- Try to match articles with the same date and article number.
- Using only the article number should be safe enough, but adding the date makes the match key less likely to collide
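The date loop and the matching step can be sketched as follows. The function names and the tuple layout `(language, date, article_number, html)` are assumptions made for this example, not a format the site or any tool prescribes.

```python
from collections import defaultdict
from datetime import date, timedelta


def dates_between(start, end):
    """Yield every date from start to end, inclusive."""
    for offset in range((end - start).days + 1):
        yield start + timedelta(days=offset)


def align_articles(saved):
    """Group saved articles by (date, article number).

    saved: iterable of (language, date, article_number, html) tuples.
    Returns only the groups that exist in at least two languages.
    """
    groups = defaultdict(dict)
    for language, day, number, html in saved:
        groups[(day, number)][language] = html
    return {key: by_lang for key, by_lang in groups.items()
            if len(by_lang) > 1}
```

Articles that exist in a single language are dropped from the result, which matches the goal of building aligned pairs.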
Scraping and aligning
How to scrape aravot.am:
Scraping
Write the scraper in Python with Beautiful Soup, a library that helps with screen scraping.
English, Russian, and Armenian will be scraped from aravot.am. Loop through the archive pages for each day of the given range and for each language to get all the article URLs.
Loop through all the article URLs for each day and language and get the HTML tags that store content (e.g. the paragraph tag and the preformatted text tag) by using Beautiful Soup. Append the contents to a string.
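The two loops above could look roughly like this with Beautiful Soup (the `bs4` package). This is a sketch, not the script that was actually used: the function names are hypothetical, downloading is left out, and the link filter assumes that article URLs are the day's archive URL followed by a numeric article number, as in the 2011 - present template.

```python
import re
from urllib.parse import urljoin

from bs4 import BeautifulSoup


def article_links(archive_html, archive_url):
    """Return the article URLs found on one daily archive page."""
    soup = BeautifulSoup(archive_html, "html.parser")
    links = []
    for anchor in soup.find_all("a", href=True):
        url = urljoin(archive_url, anchor["href"])
        # assumption: an article URL is <archive_url><articleNumber>/
        if not url.startswith(archive_url):
            continue
        if re.fullmatch(r"\d+/", url[len(archive_url):]) and url not in links:
            links.append(url)
    return links


def article_text(article_html):
    """Concatenate the text of all <p> and <pre> tags in one article."""
    soup = BeautifulSoup(article_html, "html.parser")
    parts = [tag.get_text(" ", strip=True)
             for tag in soup.find_all(["p", "pre"])]
    return "\n".join(part for part in parts if part)
```

Taking every `<p>` and `<pre>` tag will also pick up navigation or footer paragraphs if the page has any, so the output may need filtering before segmentation.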
Sentence Segmentation
Some things need to be done with the text before tokenising it. For English, smart quotes should be replaced:
output_en = output_en.replace('“', '"')
output_en = output_en.replace('”', '"')
output_en = output_en.replace("‘", "'")
output_en = output_en.replace("’", "'")
Please refer to the sentence segmentation page for sentence tokenisation. I used NLTK Punkt.