User:Medobob/GSoC2024Proposal

Contact Information

Name: Mohamed Adel
Email: mohamed.adel.alsayed1@gmail.com
Github: https://github.com/mohamed-adel-alsayed
IRC: Medobob
LinkedIn: https://www.linkedin.com/in/mohamed-adel-alsayed
Timezone: UTC+2


Background

I am a third-year computer science student at the Egyptian E-Learning University in Egypt, with proficiency in C++, XML, and Python.

Native Language: Arabic
Second Language: English

Why am I interested in Apertium?

  • Apertium’s community is active and engaging, which is one of the best parts of the project.
  • Its open-source nature makes the source code and the dictionaries freely available for anyone to use and modify.
  • Apertium’s translation system is rule-based, which makes it well suited to low-resource languages, where it works better than neural-network approaches.

Proposal

Robust Tokenization in Lttoolbox

Deliverables:

  • Updated lttoolbox with full Unicode support.
  • Enhanced tokenization algorithm addressing compound word recognition, accent handling, and script diversity.
  • Mechanisms for dictionary-driven control over tokenization rules (a rough sketch follows this list).
  • Comprehensive documentation and usage guidelines for developers and users.
  • Test suites ensuring the correctness, efficiency, and compatibility of the enhanced tokenization module.
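
As a rough illustration of the dictionary-driven control deliverable, the snippet below shows one way such a mechanism could look in a monolingual dictionary. The tokenization element and its attributes are hypothetical, invented here for illustration only; the actual syntax would be designed together with mentors during the project.

  <dictionary>
    <!-- Hypothetical syntax: a per-dictionary tokenization policy.
         Element and attribute names are invented for this sketch. -->
    <tokenization>
      <!-- Keep a hyphenated form as one token when the whole
           compound is a known dictionary entry. -->
      <rule char="-" behaviour="join-if-known"/>
      <!-- Otherwise split around the hyphen. -->
      <rule char="-" behaviour="split" fallback="yes"/>
    </tokenization>
  </dictionary>

The point of the mechanism is that dictionary writers, not hard-coded C++ logic, decide how characters such as the hyphen behave.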


Reasons why Google and Apertium should sponsor it

  • Improved Language Support: Enhancing tokenization in Apertium will lead to better support for a wide range of languages and scripts, aligning with Google's mission to make information accessible and useful globally.
  • Enhanced Translation Quality: By improving tokenization accuracy and handling of compound words, the quality of machine translation provided by Apertium will be significantly enhanced, benefiting users worldwide.
  • Future-Proof Development: A modern, Unicode-compliant tokenization framework will make Apertium easier to maintain and develop in the future, ensuring its continued relevance in the NLP landscape.
  • Better Support for Complex Scripts: Robust handling of no-space scripts and of special characters like hyphens will enable Apertium to work effectively with languages that use different writing systems or rely less on whitespace for word separation (a brief illustration follows this list).
  • Enhanced User Experience: User control over tokenization allows developers to tailor Apertium's behavior to specific languages, improving the overall user experience and making Apertium more adaptable to diverse language needs.
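
To make the no-space-script point concrete, here is a minimal, self-contained sketch (not part of the proposal's code) of how ICU's BreakIterator segments text that has no whitespace between words, using a short Japanese sentence as input:

  // Minimal sketch: word segmentation for a no-space script with ICU4C.
  // Build (one option): g++ breaks.cpp -licuuc -licui18n
  #include <unicode/brkiter.h>
  #include <unicode/locid.h>
  #include <unicode/unistr.h>
  #include <iostream>
  #include <memory>
  #include <string>

  int main() {
      UErrorCode status = U_ZERO_ERROR;
      // "I am a student" in Japanese, written without spaces.
      icu::UnicodeString text = icu::UnicodeString::fromUTF8("私は学生です");

      std::unique_ptr<icu::BreakIterator> bi(
          icu::BreakIterator::createWordInstance(icu::Locale::getJapanese(), status));
      if (U_FAILURE(status)) return 1;

      bi->setText(text);
      int32_t start = bi->first();
      for (int32_t end = bi->next(); end != icu::BreakIterator::DONE;
           start = end, end = bi->next()) {
          std::string token;
          text.tempSubStringBetween(start, end).toUTF8String(token);
          std::cout << token << "\n";  // one candidate word per line
      }
      return 0;
  }

ICU's dictionary-based word-break rules give reasonable default segmentation for languages such as Japanese, Thai, and Khmer; part of the research phase would be deciding how far to lean on this versus the transducer-driven matching lttoolbox already performs.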


Coding Challenge

I've completed the coding challenge associated with this project, and you can find the code at https://github.com/mohamed-adel-alsayed/Alphabet-Classifier

  • This program classifies characters as alphabetic or non-alphabetic, supporting a wide range of languages through Unicode.
  • It uses ICU to determine each character’s general category and, from that Unicode classification, decides whether the character belongs to the alphabet (a minimal sketch of the idea follows).
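
The classification itself is small; below is a minimal sketch of the same idea (not the repository's code), using ICU's Alphabetic binary property instead of inspecting the general category directly:

  // Minimal sketch: classify code points as alphabetic or not via ICU.
  #include <unicode/uchar.h>
  #include <cstdio>

  int main() {
      // A Latin letter, an Arabic letter (meem), a digit,
      // and an ideographic full stop.
      const UChar32 samples[] = {0x0041, 0x0645, 0x0037, 0x3002};
      for (UChar32 c : samples) {
          // UCHAR_ALPHABETIC covers letters across all scripts,
          // not just ASCII [a-zA-Z].
          bool alpha = u_hasBinaryProperty(c, UCHAR_ALPHABETIC);
          std::printf("U+%04X is %s\n", (unsigned)c,
                      alpha ? "alphabetic" : "non-alphabetic");
      }
      return 0;
  }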


Work Plan

Community bonding period (May 1 - May 26)

  • Becoming acquainted with the Apertium organization and community.
  • Reviewing the codebase to understand how tokenization is currently implemented.
  • Identifying limitations and issues with the current approach (e.g., non-Unicode compliance, lack of dictionary control).
  • Documenting findings and creating a clear picture of the current state.

Work Period (May 27 - Aug 26):

Week 1 (May 27 - Jun 2):

  • Research and evaluate candidate tokenization libraries/algorithms, focusing on options that integrate well with existing Apertium testing frameworks.
  • Select the most promising tokenization approach based on research and evaluation.

Week 2 (Jun 3 - Jun 9):

  • Design the new tokenization logic based on the chosen approach and Apertium's requirements.
  • Identify existing code relying on current tokenization behavior.

Week 3 (Jun 10 - Jun 16):

  • Begin implementing the core functionality of the new tokenization algorithm (see the sketch below).
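
As a starting point, the sketch below shows simplified left-to-right longest-match tokenization, the matching strategy lttoolbox's processor follows; the real implementation would match against the compiled transducer over Unicode code points, whereas the set of byte strings here only keeps the example self-contained:

  #include <algorithm>
  #include <string>
  #include <unordered_set>
  #include <vector>

  // Simplified left-to-right longest-match (LRLM) tokenization.
  // The unordered_set stands in for lttoolbox's transducer lookup.
  std::vector<std::string> tokenizeLRLM(const std::string& input,
                                        const std::unordered_set<std::string>& lexicon,
                                        std::size_t maxWordLen) {
      std::vector<std::string> tokens;
      std::size_t i = 0;
      while (i < input.size()) {
          std::size_t best = 0;
          std::size_t limit = std::min(maxWordLen, input.size() - i);
          // Try the longest candidate first and stop at the first match.
          for (std::size_t len = limit; len > 0; --len) {
              if (lexicon.count(input.substr(i, len))) { best = len; break; }
          }
          if (best == 0) {
              tokens.push_back(input.substr(i, 1));  // unknown: emit one unit
              i += 1;
          } else {
              tokens.push_back(input.substr(i, best));
              i += best;
          }
      }
      return tokens;
  }

With lexicon = {"ice", "cream", "icecream"} the input "icecream" comes out as the single token "icecream" rather than "ice" plus "cream", which is the longest-match behaviour that compound recognition depends on.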

Week 4 (Jun 17 - Jun 23):

  • Continue implementing core functionalities and start unit testing.

Week 5 (Jun 24 - Jun 30):

  • Begin integration of the new tokenization algorithm with Apertium's infrastructure.

Week 6 (Jul 1 - Jul 7):

  • Ensure compatibility with existing functionalities and add more unit tests.

Week 7 (Jul 8 - Jul 14):

  • Design functionalities for dictionary control over tokenization rules, focusing on core control mechanisms for common scenarios (e.g., hyphen handling; see the sketch below).
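
To make the hyphen scenario concrete, here is a toy sketch of the intended decision logic; the function name and the lexicon-lookup interface are hypothetical stand-ins for whatever the design settles on:

  #include <string>
  #include <unordered_set>
  #include <vector>

  // Toy sketch: keep "word-word" as one token when the whole compound
  // is a known dictionary entry, otherwise split around the hyphen.
  std::vector<std::string> handleHyphen(const std::string& form,
                                        const std::unordered_set<std::string>& lexicon) {
      std::size_t pos = form.find('-');
      if (pos == std::string::npos || lexicon.count(form)) {
          return {form};  // no hyphen, or a known hyphenated compound
      }
      return {form.substr(0, pos), "-", form.substr(pos + 1)};
  }

Which of the two behaviours applies would be declared by the dictionary writer (as in the earlier dictionary snippet) rather than hard-coded.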

Week 8 (Jul 15 - Jul 21):

  • Implement the dictionary-control design and add more unit tests to confirm existing functionality keeps working with the new algorithm.

Week 9 (Jul 22 - Jul 28):

  • Design and implement sample dictionaries with diverse tokenization scenarios.

Week 10 (Jul 29 - Aug 4):

  • Begin writing comprehensive documentation for the new tokenization system.

Weeks 11-13 (Aug 5 - Aug 26):

  • Finalize writing documentation for the new tokenization system, including user guides for dictionary writers on utilizing the dictionary control mechanisms.
  • Discuss the new functionalities with mentors, including a demonstration of dictionary control mechanisms. Address any feedback and incorporate suggestions for improvement.
  • If necessary, based on discussions or testing results, dedicate time to write additional unit tests to further solidify the new tokenization system.