Task ideas for Google Code-in/Tokenisation for spaceless orthographies

Objective

The objective of this task is to investigate how best to tokenise sentences in South and East Asian languages into words. These languages are usually written without spaces between words, so word boundaries have to be inferred.

Example

Imagineforamomentthatenglishwerewrittenwithoutspaces.

Given a fairly complete dictionary of English words, it should be possible to generate every way of splitting the sentence into words that are found in the dictionary:

Imagine·for·a·moment·that·english·were·writ·ten·with·out·spaces
Imagine·fora·moment·that·english·were·writ·ten·with·out·spaces
Imagine·for·a·moment·that·english·were·written·with·out·spaces
Imagine·fora·moment·that·english·were·written·with·out·spaces
Imagine·for·a·moment·that·english·were·writ·ten·without·spaces
Imagine·fora·moment·that·english·were·writ·ten·without·spaces
Imagine·for·a·moment·that·english·were·written·without·spaces
Imagine·fora·moment·that·english·were·written·without·spaces
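
The enumeration above can be sketched in a few lines of Python. This is only a minimal illustration: the `DICTIONARY` set is a toy word list made up for this example, and the function name `segmentations` is hypothetical. A real dictionary would be far larger and would produce many more candidate splits.

```python
from functools import lru_cache

# Toy dictionary (an assumption for illustration; not a real word list).
DICTIONARY = {"imagine", "for", "a", "fora", "moment", "that", "english",
              "were", "writ", "ten", "written", "with", "out", "without",
              "spaces"}

def segmentations(text, dictionary):
    """Return every way of splitting `text` into words found in `dictionary`."""
    text = text.lower()
    max_len = max(map(len, dictionary))

    @lru_cache(maxsize=None)
    def split_from(start):
        # All segmentations of text[start:], each as a list of words.
        if start == len(text):
            return [[]]
        results = []
        for end in range(start + 1, min(len(text), start + max_len) + 1):
            word = text[start:end]
            if word in dictionary:
                for rest in split_from(end):
                    results.append([word] + rest)
        return results

    return split_from(0)

sentence = "Imagineforamomentthatenglishwerewrittenwithoutspaces"
for words in segmentations(sentence, DICTIONARY):
    print("·".join(words))
```

With this toy dictionary the sentence has eight segmentations, one for each combination of for·a/fora, writ·ten/written and with·out/without.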

Tasks

Literature review

Input/output code

The input should be a sentence, and the output should be a lattice, for example:

^Imagine/Imagine$ ^fora/fora/for+a$ ^moment/moment$ ^that/that$ ^english/english$ \
^were/were$ ^written/writ+ten/written$ ^without/with+out/without$ ^spaces/spaces$
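
One possible way to build such a lattice is sketched below. It assumes that a unit ends wherever every dictionary segmentation agrees on a boundary, and that the alternatives inside a unit are joined with `+`. The toy `DICTIONARY` and the function names are hypothetical, and the order of the alternatives within a unit (fewest parts first) is an arbitrary choice.

```python
# Toy dictionary (an assumption for illustration only).
DICTIONARY = {"imagine", "for", "a", "fora", "moment", "that", "english",
              "were", "writ", "ten", "written", "with", "out", "without",
              "spaces"}

def segmentations(text, dictionary):
    """All splits of `text` into dictionary words (case-insensitive lookup)."""
    if not text:
        return [[]]
    return [[text[:i]] + rest
            for i in range(1, len(text) + 1)
            if text[:i].lower() in dictionary
            for rest in segmentations(text[i:], dictionary)]

def boundaries(words):
    """Character offsets at which each word in `words` ends."""
    offsets, position = set(), 0
    for word in words:
        position += len(word)
        offsets.add(position)
    return offsets

def lattice(text, dictionary):
    """Format the segmentations of `text` as ^surface/alternative/...$ units."""
    paths = segmentations(text, dictionary)
    if not paths:
        return None  # the dictionary cannot cover the input
    # A unit ends wherever all segmentations agree on a boundary.
    shared = sorted(set.intersection(*(boundaries(p) for p in paths)))
    units, start = [], 0
    for end in shared:
        surface = text[start:end]
        # Alternatives for this unit, fewest parts first.
        analyses = sorted(segmentations(surface, dictionary), key=len)
        alternatives = ["+".join(parts) for parts in analyses]
        units.append("^" + "/".join([surface] + alternatives) + "$")
        start = end
    return " ".join(units)

print(lattice("Imagineforamomentthatenglishwerewrittenwithoutspaces", DICTIONARY))
```

For the example sentence this yields the same nine units as the lattice above, although the alternatives inside a unit may appear in a different order.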

Algorithms

Left-to-right longest match (LRLM)
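
As a starting point, a greedy left-to-right longest match tokeniser might look like the sketch below. The toy `DICTIONARY`, the function name `lrlm_tokenise` and the fallback of emitting a single character when no dictionary word matches are all assumptions made for this illustration.

```python
# Toy dictionary (an assumption for illustration only).
DICTIONARY = {"imagine", "for", "a", "fora", "moment", "that", "english",
              "were", "writ", "ten", "written", "with", "out", "without",
              "spaces"}

def lrlm_tokenise(text, dictionary):
    """Scan left to right, always taking the longest dictionary word."""
    max_len = max(map(len, dictionary))
    tokens, start = [], 0
    while start < len(text):
        # Try the longest candidate first and shrink until a word matches.
        for end in range(min(len(text), start + max_len), start, -1):
            if text[start:end].lower() in dictionary:
                break
        else:
            # No dictionary word starts here: emit one character and move on.
            end = start + 1
        tokens.append(text[start:end])
        start = end
    return tokens

print(lrlm_tokenise("Imagineforamomentthatenglishwerewrittenwithoutspaces",
                    DICTIONARY))
```

On the example sentence this greedy strategy picks "fora" rather than "for" followed by "a", which shows why the longest match is not always the intended reading.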

Maximal matching
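
The sketch below assumes that "maximal matching" means choosing, among all dictionary segmentations, one with the fewest words, which is how the term is commonly used in the word segmentation literature. It uses dynamic programming instead of enumerating every segmentation. The toy word list and the function name `maximal_matching` are hypothetical.

```python
def maximal_matching(text, dictionary):
    """Return a segmentation of `text` with the fewest dictionary words.

    Returns None when the dictionary cannot cover the whole input.
    """
    n = len(text)
    max_len = max(map(len, dictionary))
    infinity = float("inf")
    best = [infinity] * (n + 1)  # best[i]: fewest words covering text[:i]
    back = [0] * (n + 1)         # back[i]: start offset of the last word
    best[0] = 0
    for end in range(1, n + 1):
        for start in range(max(0, end - max_len), end):
            if (best[start] + 1 < best[end]
                    and text[start:end].lower() in dictionary):
                best[end] = best[start] + 1
                back[end] = start
    if best[n] == infinity:
        return None
    # Follow the back-pointers from the end of the text to recover the words.
    tokens, end = [], n
    while end > 0:
        tokens.append(text[back[end]:end])
        end = back[end]
    return tokens[::-1]

# Toy word list (an assumption for illustration only).
WORDS = {"the", "they", "youth", "out", "he", "vent", "event"}
print(maximal_matching("theyouthevent", WORDS))
```

With this word list a greedy longest-match scan would produce they·out·he·vent (four words), whereas the fewest-words criterion selects the·youth·event (three words).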

N-gram models
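
The simplest n-gram model is a unigram model, which scores a segmentation by the product of its word probabilities and keeps the highest-scoring one. The sketch below is only an illustration of that idea: the counts in `COUNTS` are invented, the function name `unigram_segment` is hypothetical, and a usable model would need counts from a real corpus, smoothing for unseen words, and probably bigram or higher-order context.

```python
import math

# Invented toy counts, purely for illustration; not taken from any corpus.
COUNTS = {"for": 50, "a": 80, "fora": 1, "moment": 10}

def unigram_segment(text, counts):
    """Return the segmentation with the highest unigram log-probability."""
    total = sum(counts.values())
    max_len = max(map(len, counts))
    n = len(text)
    score = [-math.inf] * (n + 1)  # score[i]: best log-probability of text[:i]
    back = [0] * (n + 1)           # back[i]: start offset of the last word
    score[0] = 0.0
    for end in range(1, n + 1):
        for start in range(max(0, end - max_len), end):
            word = text[start:end].lower()
            if word in counts and score[start] > -math.inf:
                candidate = score[start] + math.log(counts[word] / total)
                if candidate > score[end]:
                    score[end] = candidate
                    back[end] = start
    if score[n] == -math.inf:
        return None  # the vocabulary cannot cover the whole input
    tokens, end = [], n
    while end > 0:
        tokens.append(text[back[end]:end])
        end = back[end]
    return tokens[::-1]

print(unigram_segment("foramoment", COUNTS))
```

With these invented counts the model prefers for·a·moment over fora·moment, because "fora" is much rarer than "for" and "a" together.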

Evaluation

Take about 500 words of text in the language and split it into sentences. Then manually split the sentences into tokens. Compare the performance of the algorithm(s) you have implemented against the manually tokenised sentences.
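
One common way to make this comparison concrete is to compute precision, recall and F1 over token spans. The task description does not prescribe a metric, so the choice of span-level scores and the function names below are assumptions.

```python
def spans(tokens):
    """Convert a list of tokens into a set of (start, end) character spans."""
    result, start = set(), 0
    for token in tokens:
        result.add((start, start + len(token)))
        start += len(token)
    return result

def evaluate(predicted, gold):
    """Return (precision, recall, F1) of predicted tokens against gold tokens."""
    predicted_spans, gold_spans = spans(predicted), spans(gold)
    correct = len(predicted_spans & gold_spans)
    precision = correct / len(predicted_spans) if predicted_spans else 0.0
    recall = correct / len(gold_spans) if gold_spans else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)

# Manually tokenised reference and a system output for the same sentence.
gold = ["Imagine", "for", "a", "moment", "that", "english", "were",
        "written", "without", "spaces"]
predicted = ["Imagine", "fora", "moment", "that", "english", "were",
             "written", "without", "spaces"]
precision, recall, f1 = evaluate(predicted, gold)
print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
```

A token counts as correct only if both of its boundaries match the manual tokenisation, so the single "fora" error costs one predicted token and two reference tokens.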
