Difference between revisions of "Aravot.am"

Latest revision as of 00:17, 9 January 2013

Index

How to get to the index of each day and where the main archive page is.

1998 - 2000

Armenian

Main archive page : http://archive.aravot.am/index_arc.html

Articles by date

Template : http://archive.aravot.am/[year]/[month]/[day]/index_[month][day].html

e.g.: http://archive.aravot.am/1999/dectember/22/index_dec22.html

2000

Armenian

Main archive page : http://archive.aravot.am/arc_00_01.htm

Articles by date

Template : http://archive.aravot.am/2000new/aravot_arm/[month]/[day]/aravot_index.htm

e.g. : http://archive.aravot.am/2000new/aravot_arm/oktember/20/aravot_index.htm

Note: Beware of spelling of some months: October = oktember

2001

Armenian

Main archive page : http://archive.aravot.am/arc_00_01.htm

Articles by date

Template : http://archive.aravot.am/2001/aravot_arm/[month]/[day]/aravot_index.htm

e.g. : http://archive.aravot.am/2001/aravot_arm/january/13/aravot_index.htm

2002 - 2003

Armenian

Main archive page : http://archive.aravot.am/aravot_arc_2002.htm

Articles by date

Template : http://archive.aravot.am/[year]/aravot_arm/[month]/[day]/aravot_index.htm

e.g. : http://archive.aravot.am/2002/aravot_arm/February/7/aravot_index.htm

e.g. : http://archive.aravot.am/2003/aravot_arm/January/15/aravot_index.htm

Note for 1998 - 2003

English and Russian articles most likely do not exist.

The parent directories cannot be accessed, so this cannot be confirmed.

e.g. : http://archive.aravot.am/2003

2004 - 2011 October

2004

Main archive page : http://www.aravot.am/aravot_arc_2004.htm

2005 - 2011

Main archive page : http://www.aravot.am/aravot_arc.htm

Articles by date

Template : http://www.aravot.am/[year]/aravot_[language]/[month]/[date]/aravot_index.htm

[language] is one of the following: arm, eng, rus

Examples

Armenian

http://www.aravot.am/2004/aravot_arm/December/4/aravot_index.htm

English

http://www.aravot.am/2004/aravot_eng/December/4/aravot_index.htm

Russian

http://www.aravot.am/2004/aravot_rus/December/4/aravot_index.htm
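The 2004 - 2011 template can be filled in programmatically. The following is a minimal sketch, not a tested scraper: the function name index_url and the MONTH_NAMES list are illustrative, and it assumes that every month in this period uses the capitalised English spelling seen in the examples ("December") and that days are written without zero padding ("4").

```python
from datetime import date

# English month names spelled as in the examples above ("December").
# Assumption: every month in this era follows the same capitalised spelling.
MONTH_NAMES = [
    "January", "February", "March", "April", "May", "June",
    "July", "August", "September", "October", "November", "December",
]

# Language codes listed for the [language] placeholder.
LANGUAGES = ("arm", "eng", "rus")


def index_url(day: date, language: str) -> str:
    """Build the daily index URL for the 2004 - 2011 archive layout."""
    if language not in LANGUAGES:
        raise ValueError(f"unknown language code: {language!r}")
    month = MONTH_NAMES[day.month - 1]
    # The day is written without zero padding ("4", not "04").
    return (
        f"http://www.aravot.am/{day.year}/aravot_{language}/"
        f"{month}/{day.day}/aravot_index.htm"
    )
```

Since earlier years use non-English month spellings such as oktember, a hard-coded list like this should be checked against the main archive page before it is relied on.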

2011 October 21 - Present

Articles by date

Template : http://www.aravot.am/[lang]/[year]/[month]/[day]/[articleNumber]/

e.g. : http://www.aravot.am/en/2011/12/26/21854/

Examples

English

http://www.aravot.am/en/2011/12/26/21854/

Russian

http://www.aravot.am/ru/2011/12/26/21854/

Armenian

http://www.aravot.am/2011/12/26/21854/

Note: For Armenian, the [lang] segment is omitted from the URL

Note: Not all articles exist in all three languages (as shown in the examples)
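An article URL in the post-October-2011 layout can be split back into its parts with a regular expression. This is a sketch under assumptions: the names ArticleRef and parse_article_url are illustrative, and "hy" is just a label chosen here for Armenian, since the site itself uses no prefix for that language.

```python
import re
from typing import NamedTuple, Optional


class ArticleRef(NamedTuple):
    lang: str      # "hy", "en" or "ru"
    year: int
    month: int
    day: int
    number: int    # the [articleNumber] part of the URL


# The language segment is optional: Armenian URLs have none.
ARTICLE_URL = re.compile(
    r"^http://www\.aravot\.am/(?:(en|ru)/)?(\d{4})/(\d{1,2})/(\d{1,2})/(\d+)/?$"
)


def parse_article_url(url: str) -> Optional[ArticleRef]:
    """Return the parts of an article URL, or None if it does not match."""
    match = ARTICLE_URL.match(url)
    if match is None:
        return None
    lang, year, month, day, number = match.groups()
    # "hy" is a label chosen for this sketch; the site uses no prefix.
    return ArticleRef(lang or "hy", int(year), int(month), int(day), int(number))
```

Daily archive URLs such as http://www.aravot.am/ru/2011/12/26/ have no article number and therefore return None.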

Scraping

1998 - 2003

Difficulty: Medium

The only language found is Armenian.

Start at the index.htm of each date.

Parse for any links to articles.

Sometimes these links lead only to summary pages covering a few articles, so you have to search each of those pages for further article links.

Example

index.htm

http://archive.aravot.am/2001/aravot_arm/january/13/aravot_index.htm

First link on the left goes directly to an article

http://archive.aravot.am/2001/aravot_arm/january/13/aravot_index.htm

Second link goes to a summary page

http://archive.aravot.am/2001/aravot_arm/january/13/aravot_politik.htm
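The link-parsing step can be sketched as follows. This example uses the standard library's html.parser instead of Beautiful Soup so that it runs without extra dependencies. The names LinkCollector and extract_links are illustrative, and the rule of keeping only .htm/.html links is an assumption about what counts as an article or summary page.

```python
from html.parser import HTMLParser
from typing import List
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collect absolute URLs of .htm/.html links found in <a href=...> tags."""

    def __init__(self, base_url: str) -> None:
        super().__init__()
        self.base_url = base_url
        self.links: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        # Assumption: article and summary pages end in .htm or .html.
        if href and href.lower().endswith((".htm", ".html")):
            absolute = urljoin(self.base_url, href)
            if absolute not in self.links:
                self.links.append(absolute)


def extract_links(html: str, base_url: str) -> List[str]:
    """Return the unique page links of one index or summary page, in order."""
    collector = LinkCollector(base_url)
    collector.feed(html)
    return collector.links
```

Calling extract_links again on each returned page is one way to pick up the articles that are only reachable through a summary page.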

2004 - 2011

Difficulty: Hard

Articles can be in Armenian, Russian or English, but not every article is in every language.

There are also some huge holes in the archive for languages other than Armenian.

For example, in 2004 English only has 2 months of articles:

http://www.aravot.am/2004/aravot_eng/

Easy to download all the articles from one day

Difficult to match an article to its counterparts in other languages

File names for the same articles are not always the same

Some articles exist in one language but not the other

Index pages are not always the same either

http://www.aravot.am/2004/aravot_arm/December/4/aravot_index.htm

http://www.aravot.am/2004/aravot_eng/December/4/aravot_index.htm

2011 October 21 - Present

Difficulty: Easy

Given a range of dates, go to the archive page of each day in all three languages.

Example

http://www.aravot.am/ru/2011/12/26/

Scrape all the article links, then find the article full text

Save html with date, language, article number

Try to match articles with same date and article number.

Matching on article number alone should be safe enough, but also matching on date makes a false match less likely
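The matching step can be sketched as grouping the saved articles by date and article number. The record layout (language, date string, article number) and the function names below are assumptions made for this example, not part of any existing scraper.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

# A record is (language, date string, article number); the layout is illustrative.
Record = Tuple[str, str, int]


def group_by_article(records: Iterable[Record]) -> Dict[Tuple[str, int], List[str]]:
    """Map (date, article number) to the sorted list of languages available."""
    groups = defaultdict(set)
    for lang, day, number in records:
        groups[(day, number)].add(lang)
    return {key: sorted(langs) for key, langs in groups.items()}


def aligned_pairs(records: Iterable[Record], source: str, target: str) -> List[Tuple[str, int]]:
    """Return the (date, number) keys that exist in both languages."""
    return sorted(
        key for key, langs in group_by_article(records).items()
        if source in langs and target in langs
    )
```

Keying on both values keeps the match strict. Dropping the date from the key gives the looser match on article number alone.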

Scraping and aligning

How to scrape, segment, and align text from aravot.am:

Scraping

English, Russian, and Armenian articles' content will be scraped from aravot.am.

Write a Python script with Beautiful Soup, which aids in screen scraping. Loop through the archive pages for each day (and month and year) of the given range and for each language to get all the article URLs.
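The loop over days and languages could look like the sketch below. It assumes that the Armenian daily archive, like Armenian article URLs, has no language prefix, and that month and day are zero-padded to two digits; neither is confirmed by the examples on this page. The name archive_urls and the "hy" label are illustrative.

```python
from datetime import date, timedelta
from typing import Iterator, Tuple

# Assumption: Armenian archive pages, like Armenian article URLs, have no
# language prefix; "hy" is only a label used by this sketch.
LANG_PREFIXES = {"hy": "", "en": "en/", "ru": "ru/"}


def archive_urls(start: date, end: date) -> Iterator[Tuple[str, date, str]]:
    """Yield (language, day, URL) for every day from start to end inclusive."""
    day = start
    while day <= end:
        for lang, prefix in LANG_PREFIXES.items():
            # Assumption: month and day are zero-padded ("2011/12/26").
            url = f"http://www.aravot.am/{prefix}{day:%Y/%m/%d}/"
            yield lang, day, url
        day += timedelta(days=1)
```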

Loop through all the article URLs for each day and language and get the HTML tags that store article contents (e.g. <pre>, <p>, etc.) by using Beautiful Soup. Append the contents to a string.
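Collecting the text of the content tags can be sketched as below. The example uses the standard library's html.parser rather than Beautiful Soup so that it is self-contained. Treating only <p> and <pre> as content tags is an assumption taken from the examples above, and ContentExtractor and extract_content are illustrative names.

```python
from html.parser import HTMLParser
from typing import List

# Assumption: article text lives in <p> and <pre> elements only.
CONTENT_TAGS = {"p", "pre"}


class ContentExtractor(HTMLParser):
    """Collect the text found inside <p> and <pre> elements."""

    def __init__(self) -> None:
        super().__init__(convert_charrefs=True)
        self.depth = 0                 # number of content tags currently open
        self.chunks: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag in CONTENT_TAGS:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag in CONTENT_TAGS and self.depth > 0:
            self.depth -= 1
            self.chunks.append("\n")   # keep paragraphs on separate lines

    def handle_data(self, data):
        if self.depth > 0:
            self.chunks.append(data)


def extract_content(html: str) -> str:
    """Return the concatenated text of all content tags in the page."""
    parser = ContentExtractor()
    parser.feed(html)
    parser.close()
    return "".join(parser.chunks).strip()
```

This sketch relies on the content tags being closed properly. Pages with unclosed <p> tags would need a more forgiving parser, which is one reason to prefer Beautiful Soup in practice.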

The Python 3 scraper for Aravot.am can be found at http://danielhonline.com/2013/01/scrape-armenian-english-and-russian-articles-from-aravot-am/

Sentence Segmentation

Some things need to be done with the content string before tokenising it. For English, smart quotes should be replaced. This is an example code snippet:

output_en=output_en.replace('“','"')
output_en=output_en.replace('”','"')
output_en=output_en.replace("‘","'")
output_en=output_en.replace("’","'")

Please refer to the sentence segmentation page (http://wiki.apertium.org/wiki/Sentence_segmenting) for sentence tokenisation. NLTK Punkt can be used.

After tokenisation, the article contents can be saved to a file. The file structure can be year/month, where the month folder contains the articles, in all three languages, for that month. Repeat this process for every article that needs sentence segmentation.
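Saving the tokenised sentences into a year/month layout might look like this sketch. The file name pattern follows the en_2012_06_23_84517.txt example used later on this page. The two-digit month folder, the one-sentence-per-line layout, and the function names are assumptions for illustration, and the tokeniser itself is left out: the function takes an already segmented list of sentences.

```python
from pathlib import Path
from typing import Iterable


def article_path(root: Path, lang: str, year: int, month: int, day: int, number: int) -> Path:
    """Return e.g. root/2012/06/en_2012_06_23_84517.txt."""
    name = f"{lang}_{year:04d}_{month:02d}_{day:02d}_{number}.txt"
    return root / f"{year:04d}" / f"{month:02d}" / name


def save_sentences(path: Path, sentences: Iterable[str]) -> None:
    """Write one sentence per line, collapsing internal whitespace."""
    path.parent.mkdir(parents=True, exist_ok=True)
    # A newline inside a sentence would split it into two lines, so normalise it.
    lines = (" ".join(sentence.split()) for sentence in sentences)
    text = "".join(f"{line}\n" for line in lines if line)
    path.write_text(text, encoding="utf-8")
```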

Sentence Aligning

Hunalign will be used to align sentences from two languages. For example, hy-en and hy-ru article pairs can be run through it.

Hunalign can be downloaded from http://mokk.bme.hu/resources/hunalign/ and built on Linux/Unix/Mac OS X by running these commands in the directory where it was downloaded:

tar zxvf hunalign-1.1.tgz
cd hunalign-1.1/src/hunalign
make

Hunalign needs a bilingual dictionary:

The dictionary consists of newline-separated dictionary items. An item consists of a target language phrase and a source language phrase, separated by the " @ " sequence. Multiword phrases are allowed. The words of a phrase are space-separated as usual. IMPORTANT NOTE: In the current version, for historical reasons, the target language phrases come first. Therefore the ordering is the opposite of the ordering of the command-line arguments or the results.

(The above quote is from http://mokk.bme.hu/resources/hunalign/)
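The format described in the quote can be produced with a few lines of Python. This is a minimal sketch and not the converter script mentioned on this page; dictionary_lines is an illustrative name, and the input is assumed to be a list of (source phrase, target phrase) pairs.

```python
from typing import Iterable, List, Tuple


def dictionary_lines(pairs: Iterable[Tuple[str, str]]) -> List[str]:
    """Turn (source phrase, target phrase) pairs into Hunalign dictionary items."""
    lines = []
    for source, target in pairs:
        # Words inside a phrase are separated by single spaces.
        source = " ".join(source.split())
        target = " ".join(target.split())
        if source and target:
            # Target phrase first, then " @ ", then the source phrase.
            lines.append(f"{target} @ {source}")
    return lines
```

For an English source text and an Armenian target text, each line therefore starts with the Armenian phrase. Writing the result with "\n".join(...) to a UTF-8 file gives the newline-separated dictionary.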

A sample Armenian to English dictionary based on Apertium's hye-eng bidix can be found at http://sebsauvage.net/paste/?ea47c257cf471482#7RcEqbvdH1NwNHAlPrwUDPAVkgfxrkwbzpQkdPoI0Go= . It was converted to a Hunalign-compatible dictionary format with a simple Python script: http://danielhonline.com/2013/01/apertium-bidix-to-hunalign-dictionary-converter-script/

Information about all Hunalign commands can be found at http://mokk.bme.hu/resources/hunalign/

If you do not have a bilingual dictionary, you can use the -realign argument.

Hunalign will then build its own dictionary based on the two text files (in two languages) that you input. You can pass a 0-byte file as a placeholder for the dictionary argument.

Hunalign is used like this:

hunalign [ common_arguments ] [ -hand=hand_align_file ] dictionary_file source_text target_text

Use the -utf argument to input and output in UTF-8.

Use the -bisent argument to only print bisentences (one-to-one alignment segments).

An example command would be:

src/hunalign/hunalign -utf -bisent -realign -text dict.txt en_2012_06_23_84517.txt hy_2012_06_23_84517.txt

In this command:

The command is executed in Hunalign's root directory.
The -utf argument (explained above) is used.
The -bisent argument (explained above) is used.
The -realign argument (explained above) is used.
The -text argument selects the text-style output described below.
The bilingual dictionary used is called dict.txt
en_2012_06_23_84517.txt is the file in the source language.
hy_2012_06_23_84517.txt is the file in the target language.

You need to loop through all the files that you generated after the sentence segmentation step. You can create a simple Python script to do so:

import os
os.popen('~/hunalign-1.1/src/hunalign/hunalign -utf -bisent -text dict.txt '+en+' '+hy+' > ~/hunalign-1.1/align/'+year+'/'+month+'/hy-'+lang+'/'+num).read()

Hunalign is under this file path: ~/hunalign-1.1/src/hunalign/hunalign.
en is the variable with the file path to the source language article
hy is the variable with the file path to the target language article
year is the variable with the year (e.g. "2012")
month is the variable with the month (e.g. "10")
lang is the variable with the source language ISO code (e.g. "en")
num is the variable with the article number (e.g. "1234")
The output from Hunalign (for each article) is stored inside a directory named after the target language and the source language (separated by a hyphen), which is inside a month directory inside a year directory. The output files are named after their article numbers (e.g. align/2012/10/hy-en/12345). The output directory must already exist, because the shell redirection does not create it.
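As an alternative to building a shell string, the same call can be sketched with subprocess and an argument list, which avoids quoting problems in file names and creates the output directory first. The function names and the root parameter are assumptions for illustration. The flags are the ones used in the one-liner above, and the Hunalign location is the path given in the list.

```python
import subprocess
from pathlib import Path
from typing import List

# Location of the binary as given above; adjust if Hunalign lives elsewhere.
HUNALIGN = Path.home() / "hunalign-1.1" / "src" / "hunalign" / "hunalign"


def hunalign_command(dictionary: Path, source: Path, target: Path) -> List[str]:
    """Build the argument list with the flags used in the one-liner above."""
    return [str(HUNALIGN), "-utf", "-bisent", "-text",
            str(dictionary), str(source), str(target)]


def output_path(root: Path, year: str, month: str, lang: str, num: str) -> Path:
    """Return e.g. root/2012/10/hy-en/12345."""
    return root / year / month / f"hy-{lang}" / num


def align(dictionary: Path, source: Path, target: Path, out: Path) -> None:
    """Run Hunalign and store its standard output in `out`."""
    out.parent.mkdir(parents=True, exist_ok=True)
    with out.open("wb") as handle:
        subprocess.run(hunalign_command(dictionary, source, target),
                       stdout=handle, check=True)
```

Calling align once per matched article pair reproduces the loop described above. With check=True, a failing Hunalign run raises an exception instead of silently leaving an empty file.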

The format of the text-style output for Hunalign is:

Each line contains three columns, separated by tabs.
The first column in a line contains a sentence from the source language text.
The second column contains the (supposedly) corresponding sentence from the target language text.
The third column is a confidence score for that aligned pair. Hunalign averages these scores to determine the overall quality of the alignment.
At the bottom is the overall quality score. It seems to increase with the amount of content in the input articles.
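The three-column output can be read back with a short parser. This is a sketch under assumptions: Bisentence and parse_alignment are illustrative names, the min_score filter is a hypothetical convenience rather than something Hunalign prescribes, and any line that does not have exactly three tab-separated columns with a numeric score (such as a summary line) is simply skipped.

```python
from typing import Iterable, List, NamedTuple


class Bisentence(NamedTuple):
    source: str
    target: str
    score: float


def parse_alignment(lines: Iterable[str],
                    min_score: float = float("-inf")) -> List[Bisentence]:
    """Parse tab-separated text output, skipping malformed lines."""
    result = []
    for line in lines:
        columns = line.rstrip("\n").split("\t")
        if len(columns) != 3:
            continue   # e.g. a blank line or a trailing summary line
        try:
            score = float(columns[2])
        except ValueError:
            continue   # third column is not a number
        if score >= min_score:
            result.append(Bisentence(columns[0], columns[1], score))
    return result
```

Any threshold passed as min_score should be chosen by inspecting real output, since a score that separates good from bad pairs is not given on this page.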


@@ Line 1: / Line 1: @@
+{{TOCD}}
-== '''Extracting from http://www.aravot.am''' ==
+== Index ==
-== How to get to the index of each day and where the main archive page is. ==
+How to get to the index of each day and where the main archive page is.
 '''1998 - 2000'''
@@ Line 8: / Line 8: @@
 '''Armenian'''
-Main Archive page : http://archive.aravot.am/index_arc.html
+*Main archive page : http://archive.aravot.am/index_arc.html
+'''Articles by date'''
+*Template : ''<nowiki>http://archive.aravot.am/[year]/[month]/[day]/index_[month][day].html</nowiki>''
-'''Grabbing Articles by date'''
-Template : ''http://archive.aravot.am/[year]/[month]/[day]/index_[month][day].html''
+''ie: http://archive.aravot.am/1999/dectember/22/index_dec22.html'
-''ie: http://archive.aravot.am/1999/dectember/22/index_dec22.html''
-Within each main page are links to all the articles.
 '''2000'''
@@ Line 22: / Line 22: @@
 '''Armenian'''
-Main archive page : http://archive.aravot.am/arc_00_01.htm
+*Main archive page : http://archive.aravot.am/arc_00_01.htm
 '''Articles by date'''
-Template : ''http://archive.aravot.am/2000new/aravot_arm/[month]/[day]/aravot_index.htm''
+*Template : ''<nowiki>http://archive.aravot.am/2000new/aravot_arm/[month]/[day]/aravot_index.htm</nowiki>''
 ''ie : http://archive.aravot.am/2000new/aravot_arm/oktember/20/aravot_index.htm''
-Note: Beware of spelling of some months: October = oktember
+*Note: Beware of spelling of some months: October = oktember
-Within each main page are links to all the articles.
 '''2001'''
@@ Line 37: / Line 38: @@
 '''Armenian'''
-Main archive page : http://archive.aravot.am/arc_00_01.htm
+*Main archive page : http://archive.aravot.am/arc_00_01.htm
 '''Articles by date'''
-Template : http://archive.aravot.am/2001/aravot_arm/[month]/[day]/aravot_index.htm
+*Template : <nowiki>http://archive.aravot.am/2001/aravot_arm/[month]/[day]/aravot_index.htm</nowiki>''Italic text''
-ie : http://archive.aravot.am/2001/aravot_arm/january/13/aravot_index.htm
+ie : ''http://archive.aravot.am/2001/aravot_arm/january/13/aravot_index.htm''
-Within each main page are links to all the articles.
 '''2002 - 2003'''
@@ Line 49: / Line 52: @@
 '''Armenian'''
-Main archive page : http://archive.aravot.am/aravot_arc_2002.htm
+*Main archive page : http://archive.aravot.am/aravot_arc_2002.htm
 '''Articles by date'''
-Template : ''http://archive.aravot.am/[year]/aravot_arm/[month]/[day]/aravot_index.htm''
+*Template : ''<nowiki>http://archive.aravot.am/[year]/aravot_arm/[month]/[day]/aravot_index.htm</nowiki>''
 ''ie : http://archive.aravot.am/2002/aravot_arm/February/7/aravot_index.htm''
@@ Line 59: / Line 62: @@
 ''ie : http://archive.aravot.am/2003/aravot_arm/January/15/aravot_index.htm''
-Within each main page are links to all the articles.
-Note for 1998 - 2003
+*Note for 1998 - 2003
-Most likely English and Russian articles do not exist.
-Cannot access the parent directories
+*English and Russian articles most likely, do not exist.
+*The parent directories cannot be accessed, so it is not for certain.
 ie : http://archive.aravot.am/2003
 '''2004 - 2011 October'''
-'''2004'''
+*'''2004'''
-Main archive page : http://www.aravot.am/aravot_arc_2004.htm
+*Main archive page : http://www.aravot.am/aravot_arc_2004.htm
-'''2005 - 2011'''
+*'''2005 - 2011'''
-Main archive page : http://www.aravot.am/aravot_arc.htm
+*Main archive page : http://www.aravot.am/aravot_arc.htm
 '''Articles by date'''
-Template ''http://www.aravot.am/[year]/aravot_[language]/[month]/[date]/aravot_index.htm''
+*Template ''<nowiki>http://www.aravot.am/[year]/aravot_[language]/[month]/[date]/aravot_index.htm</nowiki>''
 [language] is one of the following: arm, eng, rus
@@ Line 84: / Line 93: @@
 '''Examples'''
+Armenian
-''http://www.aravot.am/2004/aravot_arm/December/4/aravot_index.htm
+''http://www.aravot.am/2004/aravot_arm/December/4/aravot_index.htm''
+English
+''http://www.aravot.am/2004/aravot_eng/December/4/aravot_index.htm
+Russian
+''http://www.aravot.am/2004/aravot_rus/December/4/aravot_index.htm''
-http://www.aravot.am/2004/aravot_eng/December/4/aravot_index.htm
-http://www.aravot.am/2004/aravot_rus/December/4/aravot_index.htm''
 '''2011 October 21 - Present'''
@@ Line 94: / Line 113: @@
 '''Articles by date'''
-Template :'' http://www.aravot.am/[lang]/[year]/[month]/[day]/[articleNumber]/''
+*Template :''<nowiki>http://www.aravot.am/[lang]/[year]/[month]/[day]/[articleNumber]/</nowiki>''
 ''ie : http://www.aravot.am/en/2011/12/26/21854/''
@@ Line 112: / Line 131: @@
 ''http://www.aravot.am/2011/12/26/21854/''
-Note : For Armenian the language is taken out
+*Note: For Armenian, [lang] is taken out
-Note :  not all articles exist in all three languages (as shown in the examples)
+*Note: Not all articles exist in all three languages (as shown in the examples)
+'''
 == Scraping ==
+'''
 '''1998 - 2003'''
@@ Line 122: / Line 144: @@
 Difficulty: Medium
-The only language found is Armenian.
+*The only language found is Armenian.
-Start at the index.htm of each date.
+*Start at the index.htm of each date.
-Parse for any links to articles.
+*Parse for any links to articles.
-Sometime these articles are only summaries of a few. So you would have to search for possible links within each of these.
+*Sometime these articles are only summaries of a few. So you would have to search for possible links within each of these.
 '''Example'''
-index.htm
+*index.htm
 ''http://archive.aravot.am/2001/aravot_arm/january/13/aravot_index.htm''
-First link on the left goes directly to an article
+*First link on the left goes directly to an article
 ''http://archive.aravot.am/2001/aravot_arm/january/13/aravot_index.htm''
-Second link goes to a summary page
+*Second link goes to a summary page
 ''http://archive.aravot.am/2001/aravot_arm/january/13/aravot_politik.htm''
 '''2004 - 2011'''
@@ Line 148: / Line 172: @@
 Difficulty: Hard
-Articles can be in Armenian, Russian or English, but not every article is in every language.
+*Articles can be in Armenian, Russian or English, but not every article is in every language.
-There are also some huge holes in the archive for languages other than Armenian
+*There are also some huge holes in the archive for languages other than Armenian
-, English only has 2 months
+*2004, English only has 2 months
 ''http://www.aravot.am/2004/aravot_eng/''
-Easy to download all the articles from one day
+*Easy to download all the articles from one day
-Difficult to match article to article form different languages
+*Difficult to match article to article form different languages
-File names for the same articles are not always the same
+*File names for the same articles are not always the same
-Some articles exist in one language but not the other
+*Some articles exist in one language but not the other
-Index page are not always the same either
+*Index page are not always the same either
+''http://www.aravot.am/2004/aravot_arm/December/4/aravot_index.htm''
+''http://www.aravot.am/2004/aravot_eng/December/4/aravot_index.htm''
-''http://www.aravot.am/2004/aravot_arm/December/4/aravot_index.htm
-http://www.aravot.am/2004/aravot_eng/December/4/aravot_index.htm''
 '''2011 October 21 - Present'''
@@ Line 174: / Line 200: @@
 Difficulty: Easy
-Giving a range of dates, go to an archive of each day in all three languages.
+*Giving a range of dates, go to an archive of each day in all three languages.
 '''Example'''
 ''http://www.aravot.am/ru/2011/12/26/''
-Scrape all the article links, then find the article full text
+*Scrape all the article links, then find the article full text
+*Save html with date, language, article number
+*Try to match articles with same date and article number.
+*Using only article number should be safe enough, but date would make them more unique
+==Scraping and aligning==
+How to scrape, segment, and align text from aravot.am:
+===Scraping===
+English, Russian, and Armenian articles' content will be scraped from aravot.am.
+Write the Python script with Beautiful Soup, which aids in screen scraping.
+Loop through the archive pages for each day (and month and year) of the given range and for each language to get all the article URLs.
+Loop through all the article URLs for each day and language and get the HTML tags that store article contents (e.g. <nowiki><pre>, <p>, etc.</nowiki>) by using Beautiful Soup. Append the contents to a string.
+The Python 3 scraper for Aravot.am can be found [http://danielhonline.com/2013/01/scrape-armenian-english-and-russian-articles-from-aravot-am/  here].
+===Sentence Segmentation===
+Some things need to be done with the content string before tokenising it. For English, smart quotes should be replaced. This is an example code snippet:
+<pre>
+output_en=output_en.replace('“','"')
+output_en=output_en.replace('”','"')
+output_en=output_en.replace("‘","'")
+output_en=output_en.replace("’","'")
+</pre>
+Please refer to the [http://wiki.apertium.org/wiki/Sentence_segmenting sentence segmentation] page for sentence tokenisation. NLTK Punkt can be used.
+After tokenisation, Article contents can be saved to a file. The file structure can be: year/month (the month folder contains the articles, in all three languages, for that month).
+You will have to loop through this process for all files for sentence segmentation.
+===Sentence Aligning===
+Hunalign will be used to align sentences from two languages. For example, hy-en and hy-ru pairs articles can be ran through.
+Hunalign can be downloaded from [http://mokk.bme.hu/resources/hunalign/ here].
+To build Hunalign on Linux/Unix/Mac OS X, run these commands in the directory where Hunalign was downloaded:
+<pre>
+tar zxvf hunalign-1.1.tgz
+cd hunalign-1.1/src/hunalign
+make
+</pre>
+Hunalign needs a bilingual dictionary:
+<blockquote>
+The dictionary consists of newline-separated dictionary items. An item consists of a target languge phrase and a source language phrase, separated by the ” @ ” sequence. Multiword phrases are allowed. The words of a phrase are space-separated as usual. IMPORTANT NOTE: In the current version, for historical reasons, the target language phrases come first. Therefore the ordering is the opposite of the ordering of the command-line arguments or the results.
+</blockquote>
+(The above quote is from [http://mokk.bme.hu/resources/hunalign/ here].)
+A sample Armenian to English dictionary  based on Apertium's hye-eng bidix (converted to a Hunalign-compatible dictionary format via [http://danielhonline.com/2013/01/apertium-bidix-to-hunalign-dictionary-converter-script/ a simple Python script]) can be found [http://sebsauvage.net/paste/?ea47c257cf471482#7RcEqbvdH1NwNHAlPrwUDPAVkgfxrkwbzpQkdPoI0Go= here].
+Information about all Hunalign commands can be found [http://mokk.bme.hu/resources/hunalign/ here].
+If you do not have a bilingual dictionary, you can use the <code>-realign</code> argument.
+Hunalign will build its own dictionary based on the two text files (in two languages) that you input. You can input a 0 byte file as a placeholder for the dictionary argument.
+Hunalign is used like this:
+<pre>
+hunalign [ common_arguments ] [ -hand=hand_align_file ] dictionary_file source_text target_text
+</pre>
+Use the <code>-utf</code> argument to input and output in <code>UTF-8</code>.
+Use the <code>-bisent</code> argument to only print bisentences (one-to-one alignment segments).
+An example command would be:
+<pre>
+src/hunalign/hunalign  -utf -bisent -realign -text dict.txt en_2012_06_23_84517.txt hy_2012_06_23_84517.txt
+</pre>
+In this command:
+* The command is executed in Hunspell's root directory.
+* The <code>-utf</code> argument (explained above) is used.
+* The <code>-bisent</code> command (explained above) is used.
+* The bilingual dictionary used is called <code>dict.txt</code>
+*<code>en_2012_06_23_84517.txt</code> is the file in the source language.
+*<code>hy_2012_06_23_84517.txt</code> is the file in the target language.
+You need to loop through all the files that you generated after the sentence segmentation step. You can create a simple Python script to do so:
+<pre>
+os.popen('~/hunalign-1.1/src/hunalign/hunalign -utf -bisent -text dict.txt '+en+' '+hy+' > ~/hunalign-1.1/align/'+year+'/'+month+'/hy-'+lang+'/'+num).read()
+</pre>
+* Hunalign is under this file path: <code>~/hunalign-1.1/src/hunalign/hunalign</code>.
-Save html with date, language, article number
+* <code>en</code> is the variable with the file path to the source language article
+* <code>hy</code> is the variable with the file path to the target language article
+* <code>year</code> is the variable with the year (i.e. "2012")
+* <code>lang</code> is the variable with the source language ISO code (i.e. "en")
+* <code>num</code> is the variable with the article number (i.e. "1234")
+* The output from Hunalign (for each article) is stored inside a directory with the target language and the source language (separated by a hyphen), which is inside a month directory inside a year directory. Their names are their article numbers. (i.e. align/2012/10/hy-en/12345)
+The format of the text-style output for Hunalign is:
-Try to match articles with same date and article number.
+* Each line contains three columns, separated by tabs.
-Using only article number should be safe enough, but date would make them more unique
+* The first column in a line contains a sentence from the source language text.
+* The second column contains the (supposedly) corresponding sentence from the target language text.
+* The third column is a confidence score for the alignment. It averages all the scores to determine the quality of the alignment.
+* At the bottom, there will be the quality score. It seems like the quality score increases with the amount of content in the input articles.