Difference between revisions of "Aravot.am"

Revision as of 22:42, 7 January 2013

Index

How to get to the index of each day and where the main archive page is.

1998 - 2000

Armenian

Main archive page : http://archive.aravot.am/index_arc.html

Articles by date

Template : http://archive.aravot.am/[year]/[month]/[day]/index_[month][day].html

Example : http://archive.aravot.am/1999/dectember/22/index_dec22.html

2000

Armenian

Main archive page : http://archive.aravot.am/arc_00_01.htm

Articles by date

Template : http://archive.aravot.am/2000new/aravot_arm/[month]/[day]/aravot_index.htm

Example : http://archive.aravot.am/2000new/aravot_arm/oktember/20/aravot_index.htm

Note: Beware of spelling of some months: October = oktember

2001

Armenian

Main archive page : http://archive.aravot.am/arc_00_01.htm

Articles by date

Template : http://archive.aravot.am/2001/aravot_arm/[month]/[day]/aravot_index.htm

Example : http://archive.aravot.am/2001/aravot_arm/january/13/aravot_index.htm

2002 - 2003

Armenian

Main archive page : http://archive.aravot.am/aravot_arc_2002.htm

Articles by date

Template : http://archive.aravot.am/[year]/aravot_arm/[month]/[day]/aravot_index.htm

Example : http://archive.aravot.am/2002/aravot_arm/February/7/aravot_index.htm

Example : http://archive.aravot.am/2003/aravot_arm/January/15/aravot_index.htm

Note for 1998 - 2003

English and Russian articles most likely do not exist.

The parent directories cannot be accessed, so this cannot be confirmed.

Example : http://archive.aravot.am/2003
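
The three 1998 - 2003 templates above can be sketched as one helper. This is only an illustration: the function name and its parameters are made up here, and the month folder must be passed exactly as it is spelled on the server (e.g. "oktember", "January"), since the spelling and capitalisation are not predictable. Because the year 2000 appears under two templates, the helper assumes that passing a month abbreviation selects the older index_[month][day].html layout.

```python
ARCHIVE_HOST = "http://archive.aravot.am"


def legacy_index_url(year, month_dir, day, month_abbr=None):
    """Return the daily index URL for the 1998 - 2003 Armenian archive.

    month_dir is the month folder exactly as spelled on the server;
    nothing about its spelling or casing is guessed here.
    month_abbr (e.g. "dec") selects the older 1998 - 2000 layout
    (an assumption made for this sketch, not something the site documents).
    """
    if month_abbr is not None:
        # 1998 - 2000 layout: [year]/[month]/[day]/index_[month][day].html
        return f"{ARCHIVE_HOST}/{year}/{month_dir}/{day}/index_{month_abbr}{day}.html"
    if year == 2000:
        # 2000 layout lives under the "2000new" folder
        return f"{ARCHIVE_HOST}/2000new/aravot_arm/{month_dir}/{day}/aravot_index.htm"
    # 2001 - 2003 layout
    return f"{ARCHIVE_HOST}/{year}/aravot_arm/{month_dir}/{day}/aravot_index.htm"


print(legacy_index_url(2002, "February", 7))
```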

2004 - 2011 October

2004

Main archive page : http://www.aravot.am/aravot_arc_2004.htm

2005 - 2011

Main archive page : http://www.aravot.am/aravot_arc.htm

Articles by date

Template : http://www.aravot.am/[year]/aravot_[language]/[month]/[day]/aravot_index.htm

[language] is one of the following: arm, eng, rus

Examples

Armenian

http://www.aravot.am/2004/aravot_arm/December/4/aravot_index.htm

English

http://www.aravot.am/2004/aravot_eng/December/4/aravot_index.htm

Russian

http://www.aravot.am/2004/aravot_rus/December/4/aravot_index.htm
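
A minimal sketch of the 2004 - 2011 template, producing the index URL for each of the three language codes. The function name is hypothetical, and the month folder is again taken verbatim from the caller rather than derived from a date.

```python
LANGUAGES = ("arm", "eng", "rus")


def index_urls_by_language(year, month_dir, day):
    """Map each language code to its daily index URL (2004 - 2011 layout)."""
    return {
        lang: f"http://www.aravot.am/{year}/aravot_{lang}/{month_dir}/{day}/aravot_index.htm"
        for lang in LANGUAGES
    }


for lang, url in index_urls_by_language(2004, "December", 4).items():
    print(lang, url)
```

A URL built this way is only a candidate; as noted under Scraping, many days do not exist in English or Russian.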

2011 October 21 - Present

Articles by date

Template : http://www.aravot.am/[lang]/[year]/[month]/[day]/[articleNumber]/

Example : http://www.aravot.am/en/2011/12/26/21854/

Examples

English

http://www.aravot.am/en/2011/12/26/21854/

Russian

http://www.aravot.am/ru/2011/12/26/21854/

Armenian

http://www.aravot.am/2011/12/26/21854/

Note: For Armenian, the [lang] segment is omitted

Note: Not all articles exist in all three languages (as shown in the examples)
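
The post-October-2011 template could be built as follows. This is a sketch under assumptions: the names are invented here, "hy" is just a label chosen for Armenian (which has no [lang] segment), month and day are assumed to be zero-padded to two digits, and the daily archive URL is assumed to be the article URL without the article number.

```python
from datetime import date, timedelta

BASE = "http://www.aravot.am"
# "hy" is a label used only in this sketch; Armenian URLs carry no prefix.
LANG_PREFIX = {"hy": "", "en": "en/", "ru": "ru/"}


def day_archive_url(lang, day):
    """URL of the daily archive page (assumed: article URL minus the number)."""
    return f"{BASE}/{LANG_PREFIX[lang]}{day:%Y/%m/%d}/"


def article_url(lang, day, article_number):
    """URL of one article in one language."""
    return f"{day_archive_url(lang, day)}{article_number}/"


def daily_archive_urls(start, end):
    """Yield (date, lang, url) for every day in [start, end] and every language."""
    current = start
    while current <= end:
        for lang in LANG_PREFIX:
            yield current, lang, day_archive_url(lang, current)
        current += timedelta(days=1)


print(article_url("en", date(2011, 12, 26), 21854))
```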

Scraping

1998 - 2003

Difficulty: Medium

The only language found is Armenian.

Start at the index.htm of each date.

Parse for any links to articles.

Sometimes these pages are only summaries of several articles, so you have to search each of them for further article links.

Example

index.htm

http://archive.aravot.am/2001/aravot_arm/january/13/aravot_index.htm

First link on the left goes directly to an article

http://archive.aravot.am/2001/aravot_arm/january/13/aravot_index.htm

Second link goes to a summary page

http://archive.aravot.am/2001/aravot_arm/january/13/aravot_politik.htm
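
One way to handle both direct links and summary pages is to follow every link that stays inside the day's folder until no new pages turn up. The sketch below uses only the standard library's html.parser so it runs without extra packages; Beautiful Soup would serve equally well. The names are hypothetical, fetch is any callable you supply that maps a URL to its HTML text, and the rule "article pages live in the same folder as the index" is an assumption.

```python
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin


class LinkCollector(HTMLParser):
    """Collect the href of every <a> tag, resolved against the page URL."""

    def __init__(self, page_url):
        super().__init__()
        self.page_url = page_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                absolute = urljoin(self.page_url, href)
                self.links.append(urldefrag(absolute).url)  # drop "#fragment"


def page_links(page_url, html):
    collector = LinkCollector(page_url)
    collector.feed(html)
    return collector.links


def crawl_day(index_url, fetch):
    """Return every page reachable from the index within the same day folder."""
    day_dir = index_url.rsplit("/", 1)[0] + "/"  # assumption: one folder per day
    queue, visited = [index_url], set()
    while queue:
        url = queue.pop(0)
        if url in visited:
            continue
        visited.add(url)
        for link in page_links(url, fetch(url)):
            if link.startswith(day_dir) and link not in visited and link not in queue:
                queue.append(link)
    visited.discard(index_url)
    return sorted(visited)
```

The result still mixes summary pages with real articles; telling them apart needs a further check on the page content.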

2004 - 2011

Difficulty: Hard

Articles can be in Armenian, Russian or English, but not every article is in every language.

There are also some huge holes in the archive for languages other than Armenian.

For example, in 2004 English has only 2 months:

http://www.aravot.am/2004/aravot_eng/

Easy to download all the articles from one day

Difficult to match article to article from different languages

File names for the same articles are not always the same

Some articles exist in one language but not the other

Index pages are not always the same either

http://www.aravot.am/2004/aravot_arm/December/4/aravot_index.htm

http://www.aravot.am/2004/aravot_eng/December/4/aravot_index.htm

2011 October 21 - Present

Difficulty: Easy

Given a range of dates, go to the archive page of each day in all three languages.

Example

http://www.aravot.am/ru/2011/12/26/

Scrape all the article links, then find the article full text

Save html with date, language, article number

Try to match articles with same date and article number.

Using only the article number should be safe enough, but adding the date makes the match more reliable.
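
The matching step can be sketched by parsing each article URL into language, date and article number, then grouping on (date, number). The regular expression and the names below are illustrative assumptions based on the URL template for this period; "hy" is a label chosen here for the prefix-less Armenian URLs.

```python
import re
from collections import defaultdict

# Assumed shape: http://www.aravot.am/[lang/]YYYY/MM/DD/number/
ARTICLE_URL = re.compile(
    r"^https?://www\.aravot\.am/(?:(?P<lang>en|ru)/)?"
    r"(?P<date>\d{4}/\d{1,2}/\d{1,2})/(?P<number>\d+)/?$"
)


def group_translations(urls):
    """Group article URLs by (date, article number); values map lang -> URL."""
    groups = defaultdict(dict)
    for url in urls:
        match = ARTICLE_URL.match(url)
        if match is None:
            continue  # not an article URL (e.g. a daily archive page)
        lang = match.group("lang") or "hy"
        groups[(match.group("date"), match.group("number"))][lang] = url
    return dict(groups)


groups = group_translations([
    "http://www.aravot.am/en/2011/12/26/21854/",
    "http://www.aravot.am/ru/2011/12/26/21854/",
    "http://www.aravot.am/2011/12/26/21854/",
])
print(sorted(groups[("2011/12/26", "21854")]))
```

Groups with fewer than three languages are the articles that were not published in every language.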

Scraping and aligning

How to scrape aravot.am:

Scraping

Write a Python script using Beautiful Soup, a library that helps with screen scraping.

English, Russian, and Armenian will be scraped from aravot.am. Loop through the archive pages for each day of the given range and for each language to get all the article URLs.

Loop through all the article URLs for each day and language and get the HTML tags that store content (e.g. the paragraph tag and the preformatted text tag) by using Beautiful Soup. Append the contents to a string.
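
The content-extraction step might look like the sketch below. It is not the script used for this page: it swaps Beautiful Soup for the standard library's html.parser so that it runs without extra packages, the names are hypothetical, and it assumes that every paragraph and preformatted tag is properly closed (Beautiful Soup is more forgiving with broken markup).

```python
from html.parser import HTMLParser


class ContentExtractor(HTMLParser):
    """Keep only the text found inside <p> and <pre> tags."""

    CONTENT_TAGS = {"p", "pre"}

    def __init__(self):
        super().__init__()
        self.depth = 0    # how many content tags are currently open
        self.chunks = []  # pieces of text appended in document order

    def handle_starttag(self, tag, attrs):
        if tag in self.CONTENT_TAGS:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag in self.CONTENT_TAGS and self.depth > 0:
            self.depth -= 1
            if self.depth == 0:
                self.chunks.append("\n")  # separate blocks with a newline

    def handle_data(self, data):
        if self.depth > 0:
            self.chunks.append(data)


def extract_text(html):
    """Return the article text as one string."""
    extractor = ContentExtractor()
    extractor.feed(html)
    extractor.close()
    return "".join(extractor.chunks).strip()


sample = "<div>menu</div><p>First <b>bold</b> line.</p><pre>Second\nline</pre>"
print(extract_text(sample))
```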

Sentence Segmentation

Some things need to be done with the text before tokenising it. For English, smart quotes should be replaced:

output_en = output_en.replace('“', '"')
output_en = output_en.replace('”', '"')
output_en = output_en.replace("‘", "'")
output_en = output_en.replace("’", "'")

Please refer to the sentence segmentation page (http://wiki.apertium.org/wiki/Sentence_segmenting) for sentence tokenisation. I used NLTK Punkt.
