Difference between revisions of "Ideas for Google Summer of Code/Shallow-function labeller"


Revision as of 19:58, 1 March 2017

<spectre> deltamachine_, yes, sorry that is my fault, i had the idea when falling asleep and didn't write much more 
<spectre> so 
<spectre> a dependency parser builds a whole tree and assigns labels to the tree
<spectre> a shallow-function labeller basically just assigns labels to words, without the tree
<spectre> e.g. a function labelled sentence might look something like:
<spectre>  
<spectre> I/@nsubj saw/@fmv the/@mod cat/@obj
<spectre>  
<spectre> so you get the function of the word, but not the exact tree structure
<spectre> it's an easier task
<spectre> in some ways
<spectre> because you don't have to resolve e.g. coordination ambiguity
<spectre> http://www.aclweb.org/anthology/E95-1029
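The labelled-sentence format in the example above can be read back into (word, label) pairs with a few lines of Python. This is only a minimal sketch: the function name `parse_labelled` is made up for illustration, and it assumes tokens are separated by whitespace and that the last `/` in each token separates the word from its label.

```python
def parse_labelled(line):
    """Split 'word/@label' tokens into (word, label) pairs."""
    pairs = []
    for token in line.split():
        # rpartition splits on the LAST '/', so a word containing '/' stays intact
        form, _, label = token.rpartition("/")
        pairs.append((form, label))
    return pairs


pairs = parse_labelled("I/@nsubj saw/@fmv the/@mod cat/@obj")
```

Each pair carries the function of the word but no head index, which is exactly the information a shallow-function labeller is expected to produce.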

Coding challenge

  • Write a script that takes a dependency treebank in UD format and "flattens" it, that is, applies the following transformations:
    • Words with the @conj relation take the label of their head
    • Words with the @parataxis relation take the label of their head
    • ...
  • Write a script that takes a sentence in Apertium stream format and, for each surface form, applies the most frequent label from the labelled corpus.
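The two tasks above could be approached roughly as in the following Python sketch. It is not a reference solution, and several things in it are assumptions: all function names are hypothetical, only the two listed transformations are implemented (the "..." items are left open), relation subtypes are not handled, the Apertium stream parsing ignores escaped characters and ambiguity, and the `@x` fallback label for unseen words is invented for illustration.

```python
import re
from collections import Counter, defaultdict

# Relations whose words take the label of their head (from the list above)
INHERIT = {"conj", "parataxis"}


def read_conllu(text):
    """Yield sentences as lists of (id, form, head, deprel) from CoNLL-U text."""
    sentence = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            if sentence:
                yield sentence
                sentence = []
            continue
        if line.startswith("#"):
            continue
        cols = line.split("\t")
        if not cols[0].isdigit():  # skip multiword tokens (1-2) and empty nodes (1.1)
            continue
        sentence.append((int(cols[0]), cols[1], int(cols[6]), cols[7]))
    if sentence:
        yield sentence


def flatten(sentence):
    """Return [(form, '@label')], dropping the tree but keeping the functions."""
    by_id = {tok[0]: tok for tok in sentence}

    def resolve(tok_id):
        seen = set()
        while True:
            _, _, head, deprel = by_id[tok_id]
            seen.add(tok_id)
            # Stop at a normal relation, at the root, or if the data has a cycle
            if deprel not in INHERIT or head not in by_id or head in seen:
                return deprel
            tok_id = head

    return [(form, "@" + resolve(tok_id)) for tok_id, form, _, _ in sentence]


def train(labelled_sentences):
    """Map each lowercased surface form to its most frequent label."""
    counts = defaultdict(Counter)
    for sent in labelled_sentences:
        for form, label in sent:
            counts[form.lower()][label] += 1
    return {form: c.most_common(1)[0][0] for form, c in counts.items()}


# Simplified lexical unit: ^surface/analysis...$ (escaped characters not handled)
LEXICAL_UNIT = re.compile(r"\^([^/$]*)(?:/[^$]*)?\$")


def label_stream(stream, model, default="@x"):
    """Label every surface form in an Apertium stream with its most frequent label."""
    out = []
    for match in LEXICAL_UNIT.finditer(stream):
        surface = match.group(1)
        out.append(surface + "/" + model.get(surface.lower(), default))
    return " ".join(out)
```

A possible way to chain the pieces: `model = train(flatten(s) for s in read_conllu(treebank_text))`, then `label_stream(line, model)` for each line of Apertium output. Chains of `@conj` are followed upwards until a word with a different relation is reached, so the third conjunct in a list gets the same label as the first.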