Task ideas for Google Code-in/Tokenisation for spaceless orthographies
Revision as of 23:52, 14 November 2013
Objective
Example
Imagineforamomentthatenglishwerewrittenwithoutspaces.
Given a fairly complete dictionary of English words, it should be possible to generate every way of splitting the sentence into words that are found in the dictionary:
Imagine·for·a·moment·that·english·were·writ·ten·with·out·spaces
Imagine·fora·moment·that·english·were·writ·ten·with·out·spaces
Imagine·for·a·moment·that·english·were·written·with·out·spaces
Imagine·fora·moment·that·english·were·written·with·out·spaces
Imagine·for·a·moment·that·english·were·writ·ten·without·spaces
Imagine·fora·moment·that·english·were·writ·ten·without·spaces
Imagine·for·a·moment·that·english·were·written·without·spaces
Imagine·fora·moment·that·english·were·written·without·spaces
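The enumeration above can be sketched in a few lines of Python. This is only an illustration: the function name `segmentations` and the toy word set `WORDS` are assumptions made for the example, not part of the task, and a real dictionary would be much larger. The input is lower-cased and the final full stop is dropped to keep the sketch short.

```python
from functools import lru_cache


def segmentations(text, dictionary):
    """Return every way of splitting `text` into words from `dictionary`."""
    max_len = max(map(len, dictionary), default=0)

    @lru_cache(maxsize=None)
    def split_from(start):
        # The empty remainder has exactly one segmentation: the empty one.
        if start == len(text):
            return ((),)
        results = []
        # Try every dictionary word that begins at `start`.
        for end in range(start + 1, min(len(text), start + max_len) + 1):
            word = text[start:end]
            if word in dictionary:
                for rest in split_from(end):
                    results.append((word,) + rest)
        return tuple(results)

    return [list(seg) for seg in split_from(0)]


# Toy dictionary (an assumption for illustration, not a real word list).
WORDS = {"imagine", "for", "a", "fora", "moment", "that", "english", "were",
         "writ", "ten", "written", "with", "out", "without", "spaces"}

SENTENCE = "imagineforamomentthatenglishwerewrittenwithoutspaces"

for seg in segmentations(SENTENCE, WORDS):
    print("·".join(seg))
```

With this toy dictionary the loop prints eight lines, one per segmentation listed above. Memoising on the start offset avoids recomputing the splits of a shared suffix, although the number of complete segmentations can still grow exponentially with sentence length.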
Tasks
Literature review
Input/output code
The input should be a sentence, and the output should be a lattice, for example:
^Imagine/Imagine$ ^fora/fora/for+a$ ^moment/moment$ ^that/that$ ^english/english$ \
^were/were$ ^written/writ+ten/written$ ^without/with+out/without$ ^spaces/spaces$
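One possible way to produce such a lattice is to cut the sentence at the offsets where every segmentation agrees on a word boundary, and then list the alternative splits of each resulting chunk. The sketch below assumes that approach; the helper names (`segmentations`, `boundaries`, `lattice`) and the toy dictionary are hypothetical, and the order of the alternatives inside a unit (fewest words first) is an arbitrary choice, so it differs slightly from the example above.

```python
def segmentations(text, dictionary):
    """All splits of `text` into dictionary words (plain recursion)."""
    if not text:
        return [[]]
    results = []
    for end in range(1, len(text) + 1):
        if text[:end] in dictionary:
            results += [[text[:end]] + rest
                        for rest in segmentations(text[end:], dictionary)]
    return results


def boundaries(seg):
    """Character offsets at which a segmentation ends a word."""
    offsets, pos = set(), 0
    for word in seg:
        pos += len(word)
        offsets.add(pos)
    return offsets


def lattice(text, dictionary):
    """Format the segmentations of `text` as ^surface/analysis/...$ units."""
    segs = segmentations(text, dictionary)
    if not segs:
        return ""
    # Offsets shared by every segmentation delimit the lattice units.
    common = sorted(set.intersection(*(boundaries(s) for s in segs)))
    units, start = [], 0
    for end in common:
        chunk = text[start:end]
        # Alternatives for this chunk, fewest words first (arbitrary order).
        analyses = sorted(("+".join(a) for a in segmentations(chunk, dictionary)),
                          key=lambda a: (a.count("+"), a))
        units.append("^" + "/".join([chunk] + analyses) + "$")
        start = end
    return " ".join(units)


# Toy dictionary (an assumption for illustration only).
WORDS = {"imagine", "for", "a", "fora", "moment", "that", "english", "were",
         "writ", "ten", "written", "with", "out", "without", "spaces"}

print(lattice("imagineforamomentthatenglishwerewrittenwithoutspaces", WORDS))
```

Because the chunks are separated by boundaries common to all segmentations, any combination of per-chunk alternatives is itself a valid segmentation of the whole sentence, so the lattice encodes the same set as the full enumeration in a much more compact form.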
Algorithms
Evaluation
Take about 500 words of text in the target language and split it into sentences. Then manually split each sentence into tokens to create a reference tokenisation. Compare the output of the algorithm(s) you have implemented against these manually tokenised sentences.
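The task does not say which measure to use for the comparison. A minimal sketch, assuming token-level precision, recall and F1 over character spans (a common choice for segmentation tasks, but an assumption here), could look like this. The function names `spans` and `evaluate` are hypothetical.

```python
def spans(tokens):
    """Turn a token list into a set of (start, end) character spans."""
    result, pos = set(), 0
    for tok in tokens:
        result.add((pos, pos + len(tok)))
        pos += len(tok)
    return result


def evaluate(gold_sentences, predicted_sentences):
    """Return (precision, recall, F1) over all sentences.

    Both arguments are lists of sentences, each sentence a list of tokens.
    A predicted token counts as correct only if a gold token covers exactly
    the same character span in the same sentence.
    """
    correct = n_gold = n_pred = 0
    for gold, pred in zip(gold_sentences, predicted_sentences):
        g, p = spans(gold), spans(pred)
        correct += len(g & p)
        n_gold += len(g)
        n_pred += len(p)
    precision = correct / n_pred if n_pred else 0.0
    recall = correct / n_gold if n_gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# One manually tokenised sentence and one system output (toy data).
gold = [["were", "written", "without", "spaces"]]
pred = [["were", "writ", "ten", "without", "spaces"]]

precision, recall, f1 = evaluate(gold, pred)
print(f"P={precision:.2f} R={recall:.2f} F1={f1:.2f}")
```

In the toy data three of the five predicted tokens match a gold token exactly, so precision is 3/5 and recall is 3/4. Comparing spans rather than token strings means a correct word in the wrong position is not counted as a match.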