Difference between revisions of "User:Maybeitworksnow/proposal"

Latest revision as of 15:21, 3 April 2017

Personal details

Name: Anastasia Buianova (Анастасия Буянова)

E-mail address: anastasia.d.buianova@gmail.com

Other information that may be useful to contact you: cell-number +7 926 47 27 444

Location: Moscow (UTC+03:00)


Language knowledge / working languages: Russian (native), English (C1), German (B2-C1).

Programming: Python.

Education: skills and experience

I’m currently in my fourth year studying Computational (Applied) and Fundamental Linguistics at the Higher School of Economics (HSE) in Moscow, Russia. [Full time, BA: September 2013 - June 2017]

Fundamental and Computational Linguistics at the Eberhard Karls University of Tuebingen, Germany. [Exchange: spring semester, March - July 2016]


Selected university courses taken and passed:

Computational linguistics: natural language processing, machine learning, Python 2, Python 3.

Mathematics: Logic, Discrete Mathematics (Combinatorics), Linear Algebra and Mathematical Analysis, Probability Theory and Mathematical Statistics.

Fundamental linguistics: morphology, phonetics, syntax, semantics, typology (Tuebingen and HSE).

Applied linguistics: phylogenetics (Tuebingen), sociolinguistics, psycholinguistics, neurolinguistics.

Linguistic interests

Lexical typology, morphology, syntax;

Slavic languages (morphology and syntax in typological perspective, also lexical typology);

old languages and scripts;

German and its dialects (e.g. Schwaebisch, Badisch, Duetsch; phonetics, morphology, syntax);

Russian (morphology, syntax; non-normative lexicon as a lexical-semantic phenomenon).

Research (past and current)

Term papers - past

Buianova, A. (2015) "Classification of Sardinian Dialects Based on the Swadesh List".

Buianova, A. (2016) “Discrepancies between the Scientific and Naive Taxonomy in the Names of Plants / Animals (Based on Slavic Languages)”.


Bachelor thesis - current

“Constructions with Body Parts in Typological Perspective” for Russian, Czech, German, and English, with the aim of building an electronic dictionary for linguistic and medical use.


Why is it you are interested in machine translation?

Unfortunately, I have never worked on machine translation in all my university years. That’s why I’m strongly interested in getting to know it and working with it. I would like to study the MT mechanism more closely, to develop the skills I already have, and to put into practice the knowledge I have accumulated over many years in university classes and through my translation work. I think that, as a computational linguist, I should have this kind of experience, which might open new horizons for my future career and research. This internship could give me the practical skills I’m looking for and make me more professional in my field of study and future job.

Why is it that you are interested in Apertium?

Apertium has MT projects for small, dialectal, and local languages, which is quite rare (I would say Apertium is the only one). From my experience with Ket and Sardinian, I believe that native speakers really need this and appreciate it. In some cases, Apertium helps to popularize ‘little’ languages among people who preferred to switch to the formal literary language or to a ‘big dad’ language (as in the bel-rus case), and perhaps thereby sparks interest in their culture. This is easy to see in sociolinguistic research and experiments, e.g. those where native speakers of local languages of Russia were shy to use their language because 'nobody could understand them' and 'that's why it was better to learn Russian'. It can also raise interest in these languages among people who might not know they exist. Apertium also provides rule-based MT methods, which I would like to study more closely, as I wrote above.

As I’ve already noticed, the Apertium team is ready to help at any minute and to answer any question. It’s a pleasure to work with such people.


Bel-ukr language pair

I would like to propose to Apertium a project to create a new language pair for Belarusian - Ukrainian.

Why Google and Apertium should sponsor it, and how and whom it will benefit in society.

I plan to build a free and open ‘language resource’, which could be the first good one in the field for Belarusian - Ukrainian (I would even say the best one: ‘It’s easy to be the best in the field when you’re the only one’).

Despite the fact that many people in Belarus and Ukraine are bilingual (speaking both Russian and Belarusian / Ukrainian) and that Russian is more popular in some regions of these neighbouring countries, Belarusian and Ukrainian are widely spoken, have official status, and, as I’ve noticed, not everyone in these countries speaks Russian. Young people in particular prefer Belarusian or Ukrainian to Russian. However, the existing dictionaries and MT platforms for these languages are well developed only for pairs with Russian or, at best, English. Then what about people who do not speak any Russian or English and want to understand their neighbours? Belarusian and Ukrainian are the official languages of their nations, but they have no good language resources that let people who speak neither Russian nor English communicate. (Can you imagine that for Spanish - Portuguese, for example?)

For instance, Apertium has projects for the rus-ukr and rus-bel language pairs. You can easily find a Russian - Belarusian dictionary online, and a very good Russian - Ukrainian one (e.g. AbbyyLingvo, though it’s still being developed), but no Belarusian - Ukrainian or Ukrainian - Belarusian one. When I needed such a dictionary (for my research and simply to translate something properly), I found one, but it was almost empty: I couldn’t look up even basic words. That’s why I tend to say that none exists. Still, the fact that someone has tried to create such a resource supports my argument that people need it and will use it.

Google Translate can translate from Belarusian to Ukrainian and back, but it works badly; it seems clear that the platform has no direct translation engine for this pair (it may go through Russian or English first).

I think the result of my project can be used both by ordinary native speakers and by professional linguists working on Belarusian - Ukrainian language diversity or on Slavic languages.


Timeline

Post-application period

Get to know Apertium better.

Learn more about MT.

Finish the coding challenge.


Community bonding period

Get the full picture of Belarusian and Ukrainian morphology.

Take a closer look at the morphological analysers for Belarusian and Ukrainian.

Work plan

Week 1: Improve the morphological analysers, if necessary. Work with the existing dictionaries. Adverbs. Testvoc.

Week 2: Function words (prepositions, conjunctions), numerals. Testvoc. Complete the dictionaries.

Week 3: Even up verbs. Testvoc.

Week 4: Nouns, adjectives. Testvoc.

Deliverable #1: 30% done. Dictionaries cover more words.

Week 5: Transfer verbs. Bilingual dictionary. Morphological disambiguation.

Week 6: Nouns, adjectives. Bilingual dictionary. Morphological disambiguation.

Week 7: Numerals. Bilingual dictionary. Morphological disambiguation.

Week 8: Function words. Bilingual dictionary. Morphological disambiguation.

Deliverable #2: 60% done. Morphological disambiguation is done; the bilingual dictionary is complete.

Week 9: Continue working on transfer rules and Constraint Grammar (CG).

Week 10: Test and check possible errors in rules and dictionaries.

Week 11: Corpus testing.

Week 12: Evaluation.

Project completed: 80% done: Language pair is ready to use.
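One way the corpus test and evaluation weeks could be made concrete is to track naive coverage, i.e. the share of corpus tokens that the morphological analyser recognises. The sketch below is only an illustration under stated assumptions: it assumes the analysed text is in the Apertium stream format, where an unknown word is echoed back with a leading asterisk (`^word/*word$`), it ignores escaped characters, and the function name `naive_coverage` and the sample analyses are made up for the example rather than taken from any existing analyser.

```python
import re

# One lexical unit in the Apertium stream format: ^surface/analysis1/analysis2$
# Simplification (assumption): escaped characters such as \/ are not handled.
LEXICAL_UNIT = re.compile(r"\^([^/$]*)((?:/[^/$]*)*)\$")


def naive_coverage(analysed_text: str) -> float:
    """Return the share of lexical units that received at least one analysis."""
    known = 0
    total = 0
    for match in LEXICAL_UNIT.finditer(analysed_text):
        # group(2) looks like "/analysis1/analysis2"; drop the empty first piece.
        analyses = match.group(2).split("/")[1:]
        total += 1
        # Unknown words come back as "*surface", so they do not count as known.
        if analyses and not analyses[0].startswith("*"):
            known += 1
    return known / total if total else 0.0


# Hypothetical analyser output: two known words and one unknown word.
sample = "^кот/кот<n><m><sg><nom>$ ^абвгд/*абвгд$ ^і/і<cnjcoo>$"
coverage = naive_coverage(sample)
print(f"Naive coverage: {coverage:.1%}")
```

Running the same function on the analysed corpus after each deliverable would give a simple number to compare across weeks, though it says nothing about whether the analyses themselves are correct.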

Non-Summer-of-Code plans

At the beginning of June I have my bachelor thesis defence (the date is not fixed yet). For this reason, I will not be able to put in the full 30-40 hours on the project during that one week. After my defence, though, I will be done with my studies and ready to work more to make up for what I missed.

At the beginning of August, I have a three-day regatta with the yacht club in the north of Russia. I still plan to keep to 30-40 hours that week by working on the other days.

As I wrote above, I will be done with my university studies after June, and I have no classes from May onwards. I should also mention that I do sports (rowing and yachting) for about 2 hours every day. I’m sure this will not interfere with my work.


Second Idea

The second idea was to work on Russian - Swearing Russian. I don’t know of any resource that provides translation like that, although a lot of researchers work on Russian swearing. Even though it is a part of Russian, it has its own syntactic, semantic and morphological world, with huge ambiguities! Swearing is becoming more normative, and I tend to think that in future it won’t have such a bad status. In that case, we could be one step ahead. Recently, bad Russian has become very popular on the Internet because of online gaming. This resource could be useful for linguists and for non-native speakers of Russian who are trying to understand Russian fully. And it would be a lot of fun for Russians. It’s a real natural language. As a linguist I’m strongly interested in this field of research.