Task ideas for Google Code-in/Tokenisation for spaceless orthographies

From Apertium

Latest revision as of 14:52, 7 May 2021

Objective

The objective of this task is to investigate how best to tokenise sentences in South and East Asian languages into words. Sentences in these languages are usually written without spaces to mark word boundaries.

Example

Imagineforamomentthatenglishwerewrittenwithoutspaces.

Given a fairly complete dictionary of English words, it should be possible to generate all the possible ways of splitting the sentence into words found in the dictionary:

Imagine·for·a·moment·that·english·were·writ·ten·with·out·spaces
Imagine·fora·moment·that·english·were·writ·ten·with·out·spaces
Imagine·for·a·moment·that·english·were·written·with·out·spaces
Imagine·fora·moment·that·english·were·written·with·out·spaces
Imagine·for·a·moment·that·english·we·rewritten·with·out·spaces
Imagine·fora·moment·that·english·we·rewritten·with·out·spaces
Imagine·for·a·moment·that·english·were·writ·ten·without·spaces
Imagine·fora·moment·that·english·were·writ·ten·without·spaces
Imagine·for·a·moment·that·english·were·written·without·spaces
Imagine·fora·moment·that·english·were·written·without·spaces
Imagine·for·a·moment·that·english·we·rewritten·without·spaces
Imagine·fora·moment·that·english·we·rewritten·without·spaces
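The exhaustive enumeration described above can be sketched in Python. The word list here contains only what the example sentence needs; a real system would load a full dictionary.

```python
# Minimal sketch: recursively enumerate every way of splitting the input
# into words found in the dictionary. The dictionary below is only large
# enough for the running example.
DICTIONARY = {"imagine", "for", "a", "fora", "moment", "that", "english",
              "were", "we", "rewritten", "writ", "ten", "written",
              "with", "out", "without", "spaces"}

def segmentations(text):
    """Yield every split of `text` into dictionary words."""
    if not text:
        yield []          # empty remainder: one (empty) segmentation
        return
    for i in range(1, len(text) + 1):
        prefix = text[:i]
        if prefix.lower() in DICTIONARY:
            for rest in segmentations(text[i:]):
                yield [prefix] + rest
```

Running `segmentations("Imagineforamomentthatenglishwerewrittenwithoutspaces")` and joining each result with `·` reproduces exactly the twelve segmentations listed above.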

Tasks

Literature review

Search on Google for papers and programs about word segmentation/tokenisation for the language in question. Write a report on what you find.

Input/output code

The input should be a sentence, and the output should be a lattice, for example:

^Imagine/Imagine$ ^fora/fora/for+a$ ^moment/moment$ ^that/that$ ^english/english$ \
^werewritten/were+writ+ten/were+written/we+rewritten$ ^without/with+out/without$ ^spaces/spaces$
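One way to produce such a lattice (a sketch, not Apertium's actual implementation) is to cut the sentence at the character positions where every segmentation agrees on a word boundary, then list each span's alternative analyses joined with `+`. The ordering of alternatives within a token may differ from the example above.

```python
# Sketch: collapse all segmentations into lattice tokens by cutting at
# the boundaries shared by every segmentation.
DICTIONARY = {"imagine", "for", "a", "fora", "moment", "that", "english",
              "were", "we", "rewritten", "writ", "ten", "written",
              "with", "out", "without", "spaces"}

def segmentations(text):
    """Yield every split of `text` into dictionary words."""
    if not text:
        yield []
        return
    for i in range(1, len(text) + 1):
        if text[:i].lower() in DICTIONARY:
            for rest in segmentations(text[i:]):
                yield [text[:i]] + rest

def to_lattice(text):
    """Format the lattice in ^surface/analysis1/analysis2$ style."""
    segs = list(segmentations(text))

    def bounds(seg):
        pos, b = 0, set()
        for w in seg:
            pos += len(w)
            b.add(pos)
        return b

    # Boundary positions present in every segmentation.
    cuts = sorted(set.intersection(*map(bounds, segs)))
    out, start = [], 0
    for end in cuts:
        alts = []
        for seg in segs:
            pos, sub = 0, []
            for w in seg:
                if start <= pos < end:
                    sub.append(w)
                pos += len(w)
            analysis = "+".join(sub)
            if analysis not in alts:
                alts.append(analysis)
        out.append("^%s/%s$" % (text[start:end], "/".join(alts)))
        start = end
    return " ".join(out)
```

On the example sentence this yields one token per unambiguous span (`^moment/moment$`) and one multi-analysis token per ambiguous span (`^werewritten/…$` with its three readings).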

Algorithms

Longest-match left-to-right (LRLM)
Maximal matching
N-gram models
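The first of these can be sketched in a few lines. Note that longest-match left-to-right commits greedily, so on the running example it picks `fora` rather than `for·a`, illustrating the method's characteristic failure mode.

```python
# Sketch of greedy longest-match left-to-right (LRLM): at each position,
# take the longest dictionary word starting there; emit a lone character
# when nothing matches.
DICTIONARY = {"imagine", "for", "a", "fora", "moment", "that", "english",
              "were", "we", "rewritten", "writ", "ten", "written",
              "with", "out", "without", "spaces"}

def lrlm(text):
    tokens, i = [], 0
    while i < len(text):
        for j in range(len(text), i, -1):   # try longest matches first
            if text[i:j].lower() in DICTIONARY:
                tokens.append(text[i:j])
                i = j
                break
        else:
            tokens.append(text[i])          # unknown character as a token
            i += 1
    return tokens
```

On the example sentence this returns `Imagine·fora·moment·that·english·were·written·without·spaces`: one pass, one answer, but `fora` is wrong where the intended reading was `for·a`.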

The dictionary should also list the possible parts of speech of each word. Given this, it should be possible to calculate n-gram co-occurrence probabilities and use them to rank the possible segmentations.
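As a toy illustration of such ranking, the sketch below scores segmentations with word unigram probabilities rather than the part-of-speech n-grams suggested above; the frequency counts are invented for the example.

```python
import math

# Invented toy frequency counts; a real system would estimate these (and
# ideally part-of-speech n-gram probabilities) from a corpus.
COUNTS = {"imagine": 50, "for": 5000, "a": 9000, "fora": 2, "moment": 300,
          "that": 8000, "english": 200, "were": 4000, "we": 6000,
          "rewritten": 5, "writ": 3, "ten": 400, "written": 700,
          "with": 6000, "out": 2000, "without": 1500, "spaces": 40}
TOTAL = sum(COUNTS.values())

def score(seg):
    """Log-probability of a segmentation under a unigram model.
    Unseen words get count 1; each extra token multiplies in another
    probability below 1, penalising over-segmentation."""
    return sum(math.log(COUNTS.get(w.lower(), 1) / TOTAL) for w in seg)

CANDIDATES = [
    "Imagine·for·a·moment·that·english·were·written·without·spaces",
    "Imagine·fora·moment·that·english·were·written·without·spaces",
    "Imagine·for·a·moment·that·english·we·rewritten·with·out·spaces",
    "Imagine·for·a·moment·that·english·were·writ·ten·without·spaces",
]

best = max(CANDIDATES, key=lambda s: score(s.split("·")))
```

With these counts the highest-scoring candidate is the intended reading, `Imagine·for·a·moment·that·english·were·written·without·spaces`, because `fora`, `writ` and `rewritten` are rare relative to `for`, `a` and `written`.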

Evaluation

Take about 3,000 words of text in the language (about six pages) and split it into sentences. Then manually split the sentences into tokens. Compare the output of the algorithm(s) you have implemented against the manually tokenised sentences.
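One common way to score the comparison (an assumption here; the task does not prescribe a metric) is precision, recall and F1 over the character spans of the predicted words: a predicted token counts as correct only when its span exactly matches a gold token's span.

```python
# Sketch of span-based segmentation evaluation.
def spans(tokens):
    """Character spans implied by a token sequence."""
    out, pos = set(), 0
    for t in tokens:
        out.add((pos, pos + len(t)))
        pos += len(t)
    return out

def prf(gold, pred):
    """Precision, recall and F1 of predicted word spans against gold."""
    g, p = spans(gold), spans(pred)
    tp = len(g & p)                         # exactly matching spans
    prec = tp / len(p) if p else 0.0
    rec = tp / len(g) if g else 0.0
    f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
    return prec, rec, f1
```

For instance, scoring the prediction `were·writ·ten·without·spaces` against the gold `were·written·without·spaces` gives precision 0.6 and recall 0.75, since three of the five predicted spans match three of the four gold spans.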