Difference between revisions of "Aravot.am"

From Apertium

Revision as of 06:02, 13 December 2012

Extracting from http://www.aravot.am

How to get to the index of each day and where the main archive page is.

1998 - 2000

Armenian

  • Main archive page : http://archive.aravot.am/index_arc.html

Articles by date

  • Template : http://archive.aravot.am/[year]/[month]/[day]/index_[month][day].html

e.g.: http://archive.aravot.am/1999/dectember/22/index_dec22.html


2000

Armenian

  • Main archive page : http://archive.aravot.am/arc_00_01.htm

Articles by date

  • Template : http://archive.aravot.am/2000new/aravot_arm/[month]/[day]/aravot_index.htm

e.g.: http://archive.aravot.am/2000new/aravot_arm/oktember/20/aravot_index.htm

  • Note: Beware of the spelling of some months, e.g. October = oktember.


2001

Armenian

  • Main archive page : http://archive.aravot.am/arc_00_01.htm

Articles by date

  • Template : http://archive.aravot.am/2001/aravot_arm/[month]/[day]/aravot_index.htm

e.g.: http://archive.aravot.am/2001/aravot_arm/january/13/aravot_index.htm


2002 - 2003

Armenian

  • Main archive page : http://archive.aravot.am/aravot_arc_2002.htm

Articles by date

  • Template : http://archive.aravot.am/[year]/aravot_arm/[month]/[day]/aravot_index.htm

e.g.: http://archive.aravot.am/2002/aravot_arm/February/7/aravot_index.htm

e.g.: http://archive.aravot.am/2003/aravot_arm/January/15/aravot_index.htm


  • Notes for 1998 - 2003:
  • Most likely, English and Russian articles do not exist.
  • The parent directories cannot be accessed.

e.g.: http://archive.aravot.am/2003
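
The three index templates for 1998 - 2003 can be collected in one small helper. This is only a sketch: the era labels and the name index_url are made up here, and it assumes that the short month in the 1998 - 2000 file name is the first three letters of the directory name (as in dectember and dec). The month must be passed exactly as the site spells it.

```python
# Daily index templates for archive.aravot.am, copied from the sections above.
# The dictionary keys are hypothetical labels, not names used by the site.
TEMPLATES = {
    "1998-2000": "http://archive.aravot.am/{year}/{month}/{day}/index_{abbr}{day}.html",
    "2000new": "http://archive.aravot.am/2000new/aravot_arm/{month}/{day}/aravot_index.htm",
    "2001-2003": "http://archive.aravot.am/{year}/aravot_arm/{month}/{day}/aravot_index.htm",
}


def index_url(era: str, year: int, month: str, day: int) -> str:
    """Return the daily index URL; `month` must use the site's own spelling."""
    # Assumption: the abbreviated month is the first three letters, lowercased.
    abbr = month[:3].lower()
    return TEMPLATES[era].format(year=year, month=month, day=day, abbr=abbr)


print(index_url("1998-2000", 1999, "dectember", 22))
print(index_url("2000new", 2000, "oktember", 20))
print(index_url("2001-2003", 2002, "February", 7))
```

The three calls rebuild the example URLs listed for each period. Which spelling a given month uses still has to be checked against the archive pages.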



2004 - 2011 October

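  • 2004
  • Main archive page : http://www.aravot.am/aravot_arc_2004.htm
  • 2005 - 2011
  • Main archive page : http://www.aravot.am/aravot_arc.htm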

Articles by date

  • Template : http://www.aravot.am/[year]/aravot_[language]/[month]/[date]/aravot_index.htm

[language] is one of the following: arm, eng, rus

Examples

Armenian

http://www.aravot.am/2004/aravot_arm/December/4/aravot_index.htm

English

http://www.aravot.am/2004/aravot_eng/December/4/aravot_index.htm

Russian

http://www.aravot.am/2004/aravot_rus/December/4/aravot_index.htm
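
As a sketch, the daily index URLs for all three languages can be built from one date. The function name is hypothetical, and the code assumes English, capitalized month names as in the December examples; other months may be spelled differently on the site.

```python
LANGUAGES = ("arm", "eng", "rus")

# Assumption: English month names, capitalized as in the "December" examples.
MONTHS = (
    "January", "February", "March", "April", "May", "June", "July",
    "August", "September", "October", "November", "December",
)


def index_urls_2004_2011(year: int, month: int, day: int) -> dict:
    """Map each language code to its daily index URL."""
    name = MONTHS[month - 1]
    return {
        lang: f"http://www.aravot.am/{year}/aravot_{lang}/{name}/{day}/aravot_index.htm"
        for lang in LANGUAGES
    }


for lang, url in index_urls_2004_2011(2004, 12, 4).items():
    print(lang, url)
```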



2011 October 21 - Present

Articles by date

  • Template : http://www.aravot.am/[lang]/[year]/[month]/[day]/[articleNumber]/

e.g.: http://www.aravot.am/en/2011/12/26/21854/

Examples

English

http://www.aravot.am/en/2011/12/26/21854/

Russian

http://www.aravot.am/ru/2011/12/26/21854/

Armenian

http://www.aravot.am/2011/12/26/21854/

  • Note: For Armenian, the [lang] segment is omitted.
  • Note: Not all articles exist in all three languages (as shown in the examples)
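
The template and the Armenian special case can be sketched as follows. The name article_url and the key "hy" for Armenian are hypothetical, and the zero-padded month and day are an assumption, since the only example (12/26) has two digits in both places.

```python
from datetime import date


def article_url(lang: str, day: date, article_number: int) -> str:
    """Build an article URL for the layout used since 21 October 2011."""
    # "hy" is a hypothetical key for Armenian; the site uses no prefix for it.
    prefix = "" if lang == "hy" else f"{lang}/"
    # Assumption: month and day are zero-padded to two digits.
    return f"http://www.aravot.am/{prefix}{day:%Y/%m/%d}/{article_number}/"


for lang in ("en", "ru", "hy"):
    print(article_url(lang, date(2011, 12, 26), 21854))
```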

Scraping

1998 - 2003

Difficulty: Medium

  • The only language found is Armenian.
  • Start at the index page of each date.
  • Parse it for any links to articles.
  • Sometimes these links lead to pages that only summarize a few articles, so you have to search each of those pages for further links.

Example

  • index.htm

http://archive.aravot.am/2001/aravot_arm/january/13/aravot_index.htm

  • First link on the left goes directly to an article

http://archive.aravot.am/2001/aravot_arm/january/13/aravot_index.htm

  • Second link goes to a summary page

http://archive.aravot.am/2001/aravot_arm/january/13/aravot_politik.htm
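
The steps above can be sketched with the Python standard library. This is a rough outline under assumptions, not a tested scraper: the names LinkCollector, extract_links and crawl_day are made up, only links inside the same day directory are followed, and downloading is left to a fetch function supplied by the caller.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collect the href of every <a> tag, resolved against the page URL."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(urljoin(self.base_url, href))


def extract_links(html, base_url):
    """Return the absolute links on a page, without duplicates, in page order."""
    parser = LinkCollector(base_url)
    parser.feed(html)
    return list(dict.fromkeys(parser.links))


def crawl_day(index_url, fetch, max_depth=2):
    """Fetch the index, the pages it links to, and the pages linked from those.

    Assumption: all pages of one day live in the same directory as the index.
    """
    day_dir = index_url.rsplit("/", 1)[0] + "/"
    seen = {index_url}
    frontier = [index_url]
    pages = {}
    for _ in range(max_depth + 1):
        next_frontier = []
        for url in frontier:
            html = fetch(url)
            pages[url] = html
            for link in extract_links(html, url):
                if link.startswith(day_dir) and link not in seen:
                    seen.add(link)
                    next_frontier.append(link)
        frontier = next_frontier
    return pages
```

Here fetch is any callable that maps a URL to its HTML text, for example a thin wrapper around urllib.request.urlopen. The character encoding of the pages is not covered by this sketch.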


2004 - 2011

Difficulty: Hard

  • Articles can be in Armenian, Russian or English, but not every article is in every language.
  • There are also some huge holes in the archive for languages other than Armenian.
  • For 2004, English has only 2 months of articles:

http://www.aravot.am/2004/aravot_eng/

  • It is easy to download all the articles from one day.
  • It is difficult to match an article to its counterpart in a different language:
  • File names for the same article are not always the same.
  • Some articles exist in one language but not the other.
  • Index pages are not always the same either:

http://www.aravot.am/2004/aravot_arm/December/4/aravot_index.htm

http://www.aravot.am/2004/aravot_eng/December/4/aravot_index.htm
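
One possible first pass at matching, given the article links of one day in two languages, is to pair the links whose file names happen to coincide and list the rest for closer inspection. This is only a sketch: the function name is hypothetical, and since file names for the same article are not always the same, a shared name gives a candidate pair and the leftovers are not proof that a translation is missing.

```python
from posixpath import basename
from urllib.parse import urlparse


def pair_by_filename(arm_links, eng_links):
    """Pair links that share a file name; return (pairs, only_arm, only_eng)."""
    arm = {basename(urlparse(url).path): url for url in arm_links}
    eng = {basename(urlparse(url).path): url for url in eng_links}
    shared = sorted(arm.keys() & eng.keys())
    pairs = [(arm[name], eng[name]) for name in shared]
    only_arm = sorted(arm[name] for name in arm.keys() - eng.keys())
    only_eng = sorted(eng[name] for name in eng.keys() - arm.keys())
    return pairs, only_arm, only_eng
```

The same function works for any two languages. The unmatched lists still need another signal to be paired up, which the text above does not specify.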


2011 October 21 - Present

Difficulty: Easy

  • Given a range of dates, go to the archive page of each day in all three languages.

Example

http://www.aravot.am/ru/2011/12/26/

  • Scrape all the article links, then fetch the full text of each article.
  • Save the HTML with the date, language and article number.
  • Try to match articles that have the same date and article number.
  • Matching on the article number alone should be safe enough, but adding the date makes the key more specific.
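
The procedure above could look roughly like this. It is a sketch under assumptions: all function names and the file naming scheme are made up, "hy" is a hypothetical label for Armenian, and the Armenian daily archive URL is assumed to follow the same rule as article URLs (no language prefix). Downloading and link scraping are left out.

```python
import re
from collections import defaultdict
from datetime import date, timedelta

# Language prefixes from the examples above; "" stands for Armenian (no prefix).
PREFIXES = ("", "en/", "ru/")

ARTICLE_RE = re.compile(
    r"^http://www\.aravot\.am/(?:(en|ru)/)?(\d{4})/(\d{2})/(\d{2})/(\d+)/?$"
)


def day_archive_urls(start, end):
    """Yield the daily archive URL of every language for each day in [start, end]."""
    day = start
    while day <= end:
        for prefix in PREFIXES:
            yield f"http://www.aravot.am/{prefix}{day:%Y/%m/%d}/"
        day += timedelta(days=1)


def group_by_article(article_urls):
    """Group article URLs by (date, article number); values map language to URL."""
    groups = defaultdict(dict)
    for url in article_urls:
        match = ARTICLE_RE.match(url)
        if match is None:
            continue  # not an article URL of the expected shape
        lang, year, month, day, number = match.groups()
        # "hy" is a hypothetical label for Armenian, which has no prefix.
        groups[(f"{year}-{month}-{day}", number)][lang or "hy"] = url
    return dict(groups)


def file_name(iso_date, lang, number):
    """Hypothetical naming scheme: date, language and article number."""
    return f"{iso_date}_{lang}_{number}.html"
```

Grouping on the pair (date, article number) follows the last two points above; dropping the date from the key would give matching on the article number alone.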