Difference between revisions of "Aravot.am"

Revision as of 23:14, 7 January 2013

Index

How to get to the index of each day and where the main archive page is.

1998 - 2000

Armenian

Articles by date

  • Template : http://archive.aravot.am/[year]/[month]/[day]/index_[month][day].html

ie: http://archive.aravot.am/1999/dectember/22/index_dec22.html


2000

Armenian

Articles by date

  • Template : http://archive.aravot.am/2000new/aravot_arm/[month]/[day]/aravot_index.htm

ie : http://archive.aravot.am/2000new/aravot_arm/oktember/20/aravot_index.htm

  • Note: Beware of spelling of some months: October = oktember


2001

Armenian

Articles by date

  • Template : http://archive.aravot.am/2001/aravot_arm/[month]/[day]/aravot_index.htm

ie : http://archive.aravot.am/2001/aravot_arm/january/13/aravot_index.htm


2002 - 2003

Armenian

Articles by date

  • Template : http://archive.aravot.am/[year]/aravot_arm/[month]/[day]/aravot_index.htm

ie : http://archive.aravot.am/2002/aravot_arm/February/7/aravot_index.htm

ie : http://archive.aravot.am/2003/aravot_arm/January/15/aravot_index.htm


  • Note for 1998 - 2003
  • English and Russian articles most likely do not exist.
  • The parent directories cannot be browsed, so this cannot be confirmed.

ie : http://archive.aravot.am/2003



2004 - 2011 October

  • 2004
  • 2005 - 2011

Articles by date

  • Template : http://www.aravot.am/[year]/aravot_[language]/[month]/[day]/aravot_index.htm

[language] is one of the following: arm, eng, rus

Examples

Armenian

http://www.aravot.am/2004/aravot_arm/December/4/aravot_index.htm

English

http://www.aravot.am/2004/aravot_eng/December/4/aravot_index.htm

Russian

http://www.aravot.am/2004/aravot_rus/December/4/aravot_index.htm
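
The template above can be filled in programmatically. The following is a minimal sketch, assuming that months are always spelled as capitalised English names and that days are not zero-padded, as in the December 4 examples. The function name index_url is hypothetical, and other dates in the archive may deviate from this pattern.

```python
from datetime import date

# English month names, capitalised as in the example URLs above.
MONTHS = ["January", "February", "March", "April", "May", "June",
          "July", "August", "September", "October", "November", "December"]


def index_url(day, language):
    """Build the daily index URL for the 2004 - 2011 archive layout."""
    if language not in ("arm", "eng", "rus"):
        raise ValueError("language must be one of: arm, eng, rus")
    return "http://www.aravot.am/%d/aravot_%s/%s/%d/aravot_index.htm" % (
        day.year, language, MONTHS[day.month - 1], day.day)


print(index_url(date(2004, 12, 4), "eng"))
```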



2011 October 21 - Present

Articles by date

  • Template : http://www.aravot.am/[lang]/[year]/[month]/[day]/[articleNumber]/

ie : http://www.aravot.am/en/2011/12/26/21854/

Examples

English

http://www.aravot.am/en/2011/12/26/21854/

Russian

http://www.aravot.am/ru/2011/12/26/21854/

Armenian

http://www.aravot.am/2011/12/26/21854/

  • Note: For Armenian, the [lang] segment is omitted from the URL
  • Note: Not all articles exist in all three languages (as shown in the examples)
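
As a sketch of how the template and the Armenian exception fit together, the hypothetical helper below builds an article URL. The code "hy" for Armenian is an assumption borrowed from the file names used later on this page, and zero-padding of month and day is assumed as well, since the examples only show two-digit values.

```python
def article_url(lang, year, month, day, article_number):
    """Build an article URL for the layout used since 21 October 2011.

    lang is "en", "ru" or "hy" (assumed code for Armenian).
    Armenian URLs carry no language segment.
    """
    prefix = "" if lang == "hy" else lang + "/"
    return "http://www.aravot.am/%s%04d/%02d/%02d/%d/" % (
        prefix, year, month, day, article_number)


print(article_url("en", 2011, 12, 26, 21854))
print(article_url("hy", 2011, 12, 26, 21854))
```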

Scraping

1998 - 2003

Difficulty: Medium

  • The only language found is Armenian.
  • Start at the index.htm of each date.
  • Parse for any links to articles.
  • Sometimes a link leads to a page that only summarises a few articles, so you have to search each of these pages for further article links.

Example

  • index.htm

http://archive.aravot.am/2001/aravot_arm/january/13/aravot_index.htm

  • First link on the left goes directly to an article

http://archive.aravot.am/2001/aravot_arm/january/13/aravot_index.htm

  • Second link goes to a summary page

http://archive.aravot.am/2001/aravot_arm/january/13/aravot_politik.htm
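
The link-parsing step can be sketched as follows. This version uses Python's standard html.parser instead of Beautiful Soup so that it stays self-contained. The sample markup is invented for illustration, and the real index pages may be structured differently.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collect the absolute target of every <a href=...> on a page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                # Resolve relative links against the page they came from.
                self.links.append(urljoin(self.base_url, href))


def extract_links(html, base_url):
    collector = LinkCollector(base_url)
    collector.feed(html)
    return collector.links


# Invented markup for illustration only.
sample = '<a href="aravot_politik.htm">Politics</a> <a name="top">Top</a>'
base = "http://archive.aravot.am/2001/aravot_arm/january/13/aravot_index.htm"
print(extract_links(sample, base))
```

The same function can be applied again to each summary page to find the articles linked from it.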


2004 - 2011

Difficulty: Hard

  • Articles can be in Armenian, Russian or English, but not every article is in every language.
  • There are also large gaps in the archive for languages other than Armenian.
  • For 2004, English only covers 2 months.

http://www.aravot.am/2004/aravot_eng/

  • It is easy to download all the articles from one day.
  • It is difficult to match an article to its counterparts in other languages.
  • File names for the same article are not always the same.
  • Some articles exist in one language but not in the other.
  • The index pages are not always the same either.

http://www.aravot.am/2004/aravot_arm/December/4/aravot_index.htm

http://www.aravot.am/2004/aravot_eng/December/4/aravot_index.htm


2011 October 21 - Present

Difficulty: Easy

  • Given a range of dates, go to the archive page of each day in each of the three languages.

Example

http://www.aravot.am/ru/2011/12/26/

  • Scrape all the article links, then find the article full text
  • Save html with date, language, article number
  • Try to match articles with same date and article number.
  • Matching on the article number alone should be safe enough, but also comparing the date makes a false match less likely.
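
The matching step above can be sketched like this: each article URL is parsed into a (year, month, day, number) key, and URLs sharing a key are grouped by language. The names ARTICLE_RE and group_by_article are hypothetical, and "hy" is an assumed label for URLs without a language segment.

```python
import re
from collections import defaultdict

# Matches http://www.aravot.am/[lang/]year/month/day/articleNumber/
ARTICLE_RE = re.compile(
    r"^http://www\.aravot\.am/(?:(en|ru)/)?(\d{4})/(\d{1,2})/(\d{1,2})/(\d+)/?$")


def group_by_article(urls):
    """Map (year, month, day, number) to a {language: url} dictionary."""
    groups = defaultdict(dict)
    for url in urls:
        match = ARTICLE_RE.match(url)
        if match is None:
            continue  # not an article URL in the post-2011 layout
        lang, year, month, day, number = match.groups()
        key = (int(year), int(month), int(day), int(number))
        groups[key][lang or "hy"] = url  # no language segment: Armenian
    return dict(groups)
```

An article is usable for hy-en alignment only if its group contains both "hy" and "en".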

Scraping and aligning

How to scrape aravot.am:

Scraping

Write a Python script using Beautiful Soup, a library that helps with screen scraping.

English, Russian, and Armenian will be scraped from aravot.am. Loop through the archive pages for each day of the given range and for each language to get all the article URLs.
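
A minimal sketch of the outer loop over days and languages is shown below. It assumes the daily archive URL follows the http://www.aravot.am/ru/2011/12/26/ pattern given earlier and that the Armenian archive simply omits the language segment. The function name archive_urls and the "hy" label are hypothetical.

```python
from datetime import date, timedelta


def archive_urls(start, end, languages=("hy", "en", "ru")):
    """Yield (day, language, url) for every day from start to end inclusive."""
    day = start
    while day <= end:
        for lang in languages:
            # Assumption: Armenian pages have no language segment.
            prefix = "" if lang == "hy" else lang + "/"
            url = "http://www.aravot.am/%s%s/" % (prefix, day.strftime("%Y/%m/%d"))
            yield day, lang, url
        day += timedelta(days=1)


for day, lang, url in archive_urls(date(2011, 12, 26), date(2011, 12, 27)):
    print(lang, url)
```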

Loop through all the article URLs for each day and language and get the HTML tags that store content (e.g. the paragraph tag and the preformatted text tag) by using Beautiful Soup. Append the contents to a string.
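
The content extraction can be sketched as follows. To keep the example self-contained it uses the standard html.parser rather than Beautiful Soup, so it illustrates the idea and is not the script that was actually used. It assumes that paragraph tags are properly closed, which may not hold for older pages.

```python
from html.parser import HTMLParser


class ContentExtractor(HTMLParser):
    """Gather the text found inside <p> and <pre> elements."""

    CONTENT_TAGS = ("p", "pre")

    def __init__(self):
        super().__init__()
        self.depth = 0      # greater than 0 while inside a content tag
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.CONTENT_TAGS:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag in self.CONTENT_TAGS and self.depth > 0:
            self.depth -= 1
            if self.depth == 0:
                self.chunks.append("\n")  # one line break per block

    def handle_data(self, data):
        if self.depth > 0:
            self.chunks.append(data)


def extract_text(html):
    extractor = ContentExtractor()
    extractor.feed(html)
    extractor.close()
    return "".join(extractor.chunks)
```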

Sentence Segmentation

Some things need to be done with the text before tokenising it. For English, smart quotes should be replaced:

output_en=output_en.replace('“','"')
output_en=output_en.replace('”','"')
output_en=output_en.replace("‘","'")
output_en=output_en.replace("’","'")


Please refer to the sentence segmentation page for sentence tokenisation. I used NLTK Punkt.

After tokenisation, you can save the article text to a file. The file structure should be: year/month. You will have to loop through this process for all files.
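
Saving the tokenised text can be sketched as below. The file name pattern is an assumption modelled on en_2012_06_23_84517.txt from the Hunalign example on this page, and the function name save_sentences is hypothetical. One sentence is written per line, which is the input layout Hunalign expects.

```python
import os


def save_sentences(root, lang, year, month, day, number, sentences):
    """Write one sentence per line under root/year/month and return the path."""
    directory = os.path.join(root, "%04d" % year, "%02d" % month)
    os.makedirs(directory, exist_ok=True)
    name = "%s_%04d_%02d_%02d_%d.txt" % (lang, year, month, day, number)
    path = os.path.join(directory, name)
    with open(path, "w", encoding="utf-8") as handle:
        for sentence in sentences:
            handle.write(sentence.strip() + "\n")
    return path
```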

Sentence Aligning

Run Hunalign to align sentences from two languages. In my case, I ran it on hy-en and hy-ru articles.

To install Hunalign on Linux/Unix/Mac OS X, run these commands in the directory where you downloaded Hunalign:

tar zxvf hunalign-1.1.tgz
cd hunalign-1.1/src/hunalign
make

Hunalign needs a bilingual dictionary:

The dictionary consists of newline-separated dictionary items. An item consists of a target language phrase and a source language phrase, separated by the " @ " sequence. Multiword phrases are allowed. The words of a phrase are space-separated as usual. IMPORTANT NOTE: In the current version, for historical reasons, the target language phrases come first. Therefore the ordering is the opposite of the ordering of the command-line arguments or the results.
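
To make the ordering concrete, here is a small sketch that formats (source, target) phrase pairs as dictionary lines. The function name hunalign_dictionary and the sample entries are illustrative only and are not taken from the Apertium bidix.

```python
def hunalign_dictionary(pairs):
    """Format (source, target) phrase pairs as Hunalign dictionary lines."""
    lines = []
    for source, target in pairs:
        # Target-language phrase first, then " @ ", then the source phrase.
        lines.append("%s @ %s" % (target.strip(), source.strip()))
    return "".join(line + "\n" for line in lines)


# English as source language, Armenian as target language.
print(hunalign_dictionary([("house", "տուն"), ("good morning", "բարի լույս")]))
```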

A sample Armenian to English dictionary based on Apertium's hye-eng bidix (converted to a Hunalign-compatible dictionary format via a simple Python script) can be found here: http://sebsauvage.net/paste/?41f56bfc5aabbe20#ipA5ElUiPzGEigRR4Aiq9dJ9O5ZH86O3Xa642WuJ8HA= . If you do not have a bilingual dictionary, you can use the -realign argument. Hunalign will then build its own dictionary based on the two text files (one per language) that you input. In that case you can pass an empty (0-byte) file as a placeholder for the dictionary argument.

Hunalign is used like this:

hunalign [ common_arguments ] [ -hand=hand_align_file ] dictionary_file source_text target_text

Use the -utf argument to input and output in UTF-8.

Use the -bisent argument to only print bisentences (one-to-one alignment segments).

An example command would be:

src/hunalign/hunalign -utf -bisent -realign -text dict.txt en_2012_06_23_84517.txt hy_2012_06_23_84517.txt

In this command:

  • The command is executed in Hunalign's root directory.
  • The -utf argument (explained above) is used.
  • The -bisent argument (explained above) is used.
  • The -realign argument (explained above) is used.
  • The -text argument makes Hunalign output the aligned sentences as text instead of the default numeric format.
  • The bilingual dictionary used is called dict.txt
  • en_2012_06_23_84517.txt is the file in the source language.
  • hy_2012_06_23_84517.txt is the file in the target language.

You need to loop through all the files that you generated after the sentence segmentation step. You can create a simple Python script to do so:

import os
os.popen('~/hun3/hunalign-1.1/src/hunalign/hunalign -utf -bisent -text dict.txt '+en+' '+hy+' > ~/hun3/hunalign-1.1/align/'+year+'/'+month+'/hy-'+lang+'/'+num).read()

  • Hunalign is under this filepath: ~/hun3/hunalign-1.1/src/hunalign/hunalign.
  • en is the variable with the filepath to the source language article
  • hy is the variable with the filepath to the target language article
  • year is the variable with the year (e.g. "2012")
  • month is the variable with the month
  • lang is the variable with the source language ISO code (e.g. "en")
  • num is the variable with the article number (e.g. "1234")
  • All the output from Hunalign is stored in a directory called "align" with different file names (i.e.
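
An alternative way to run the same command is sketched below with the subprocess module, which avoids building a shell string by hand and creates the output directory if it is missing. The function names hunalign_command and align_pair are hypothetical, and the arguments are the ones used in the command above.

```python
import os
import subprocess


def hunalign_command(hunalign_bin, dictionary, source, target):
    """Return the argument list for one Hunalign run."""
    return [os.path.expanduser(hunalign_bin), "-utf", "-bisent", "-text",
            dictionary, source, target]


def align_pair(hunalign_bin, dictionary, source, target, output_path):
    """Run Hunalign on one article pair and write its output to output_path."""
    os.makedirs(os.path.dirname(output_path) or ".", exist_ok=True)
    with open(output_path, "wb") as out:
        # check=True raises CalledProcessError if Hunalign exits with an error.
        subprocess.run(
            hunalign_command(hunalign_bin, dictionary, source, target),
            stdout=out, check=True)
```

Calling align_pair once per article pair inside the loop over files replaces the os.popen call.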