Difference between revisions of "Conllu Parsing and Searching"

From Apertium

Revision as of 18:04, 10 December 2017

Parse and search through a CoNLL-U file

Searching is as follows:

Word-based search: the '<' character

If you want to find a specific word (e.g., the word "bread") in your CoNLL-U file, create a search term consisting of the < symbol followed by the word you want to search for.

For example, the search term '<ести' might return:

'Token: 6, Form: ести, Lemma: есті, UPOSTAG: VERB, HEAD: 0, DEPREL: root, # sent_id = story.tagged.txt:44:776, Sentence:  Ол енді ол дыбысты анығырақ ести бастады .' 

Each result shows the Token (the position in the sentence where the match appeared), the form, the lemma, the upostag (part of speech), the HEAD, the DEPREL, the sentence_id, and the sentence itself.
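To make the word-based search concrete, here is a minimal Python sketch of how a '<word' query could be matched against the FORM column of a CoNLL-U file. It is not the code of conlluparse.py; the names parse_conllu and search_form and the sample sentence are made up for illustration, and the sketch assumes the standard ten tab-separated CoNLL-U columns.

```python
# Illustrative sketch only; NOT the actual conlluparse.py implementation.
# Assumes the standard ten tab-separated CoNLL-U columns.
CONLLU_FIELDS = ["id", "form", "lemma", "upostag", "xpostag",
                 "feats", "head", "deprel", "deps", "misc"]


def parse_conllu(text):
    """Yield (comment_lines, tokens) for each sentence in CoNLL-U text."""
    comments, tokens = [], []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            # A blank line ends the current sentence.
            if tokens:
                yield comments, tokens
            comments, tokens = [], []
        elif line.startswith("#"):
            comments.append(line)
        else:
            cols = line.split("\t")
            if len(cols) == len(CONLLU_FIELDS):
                tokens.append(dict(zip(CONLLU_FIELDS, cols)))
    if tokens:
        yield comments, tokens


def search_form(text, query):
    """Return every token whose form equals the word in a '<word' query."""
    word = query[1:]  # drop the leading '<'
    hits = []
    for _comments, tokens in parse_conllu(text):
        hits.extend(tok for tok in tokens if tok["form"] == word)
    return hits


# Made-up sample sentence, used only to exercise the sketch.
SAMPLE = (
    "# sent_id = 2\n"
    "# text = I have no clue .\n"
    "1\tI\tI\tPRON\t_\t_\t2\tnsubj\t_\t_\n"
    "2\thave\thave\tVERB\t_\t_\t0\troot\t_\t_\n"
    "3\tno\tno\tDET\t_\t_\t4\tdet\t_\t_\n"
    "4\tclue\tclue\tNOUN\t_\t_\t2\tobj\t_\t_\n"
    "5\t.\t.\tPUNCT\t_\t_\t2\tpunct\t_\t_\n"
)

hits = search_form(SAMPLE, "<clue")
```

Here hits holds one dictionary per matching token, from which the fields listed above (Token, Form, Lemma and so on) can be read.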

The '{' character

If you would like to search the tree (e.g., you want to search for a word by its HEAD value or head word):

Start your search with a '{'.

Then add a '>' between the two words whose relation you are searching for.

For instance, to see where 'have' acts on 'clue' (as in 'I have no clue'), you would use this character.

An example entry would be '{have>clue'

If you want, you can also make the search more specific or more ambiguous.

When searching with attributes (e.g., UPOSTAG), you could write:

'{upostag=verb, form=have>form=clue'

Please note that when you specify extra attributes, you have to include the 'form=' argument for the word.

If you wanted to specify nothing for the first word and look for words that act on 'clue', you would use:

'{none=none>form=clue}'

Please note that you have to write 'none=none' wherever nothing is specified.

Instead of 'form=clue', you can also specify other attributes, such as 'upostag=noun'.

Example Output:

Token: 2, Form: have, Lemma: have, UPOSTAG: VERB, HEAD: 0, DEPREL: root, # sent_id = 2, Sentence: I have no clue .
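The query forms shown in this section can be illustrated with a short Python sketch that splits a '{' query into one attribute set for the head word and one for the dependent word, then reports each head that has a matching dependent. This is only a guess at the behaviour described above, not the code of conlluparse.py: the function names, the case-insensitive comparison, and the tolerance for a missing closing '}' are all assumptions, and the token list is made-up sample data.

```python
# Illustrative sketch only; NOT the actual conlluparse.py implementation.

def _parse_side(side):
    """Turn 'upostag=verb, form=have' (or a bare word) into a dict."""
    attrs = {}
    for item in side.split(","):
        item = item.strip()
        if "=" in item:
            key, value = item.split("=", 1)
            key, value = key.strip().lower(), value.strip()
            if key != "none":  # 'none=none' means "no constraint"
                attrs[key] = value
        elif item:
            attrs["form"] = item  # bare word such as 'have'
    return attrs


def parse_tree_query(query):
    """Split a '{head>dependent' query into two attribute dicts."""
    # Assumption: the closing '}' is optional, as in the examples above.
    body = query.lstrip("{").rstrip("}")
    head_part, dep_part = body.split(">", 1)
    return _parse_side(head_part), _parse_side(dep_part)


def matches(token, attrs):
    """Assumption: attribute values are compared case-insensitively."""
    return all(token.get(k, "").lower() == v.lower() for k, v in attrs.items())


def search_tree(tokens, query):
    """Return each head token that has a dependent matching the query."""
    head_attrs, dep_attrs = parse_tree_query(query)
    by_id = {tok["id"]: tok for tok in tokens}
    hits = []
    for tok in tokens:
        head = by_id.get(tok["head"])  # None for the root (HEAD 0)
        if head is None:
            continue
        if matches(tok, dep_attrs) and matches(head, head_attrs):
            if head not in hits:
                hits.append(head)
    return hits


# Made-up tokens for the sentence "I have no clue ."
TOKENS = [
    {"id": "1", "form": "I", "lemma": "I", "upostag": "PRON", "head": "2", "deprel": "nsubj"},
    {"id": "2", "form": "have", "lemma": "have", "upostag": "VERB", "head": "0", "deprel": "root"},
    {"id": "3", "form": "no", "lemma": "no", "upostag": "DET", "head": "4", "deprel": "det"},
    {"id": "4", "form": "clue", "lemma": "clue", "upostag": "NOUN", "head": "2", "deprel": "obj"},
    {"id": "5", "form": ".", "lemma": ".", "upostag": "PUNCT", "head": "2", "deprel": "punct"},
]

found = search_tree(TOKENS, "{upostag=verb, form=have>form=clue")
```

With this sample data, found contains the token for 'have', which corresponds to the example output above.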

The ':' character

If you would like to search for a deprel or upostag together with a feature of a word:

Start your search with a ':' and enclose your search terms in '[]'.

For instance, if you wanted to search for a copula with a past feature, you would use:

':[cop, past]'

This would find a copula with a past feature and produce output like:

'Token: 3, Form: болғаныма, Lemma: бол, UPOSTAG: AUX, HEAD: 2, DEPREL: cop, # sent_id = akorda-random.tagged.txt:158:2829, Sentence: Мен осында болғаныма қуаныштымын қуанышты мын .'
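As a rough illustration of this kind of search, the Python sketch below treats the first term inside '[]' as a deprel or upostag and every further term as a value that must appear in the token's FEATS column (for example 'past' matching Tense=Past). That reading of the query, the function names, and the sample tokens with their feature values are assumptions made for the example; conlluparse.py may work differently.

```python
# Illustrative sketch only; NOT the actual conlluparse.py implementation.

def parse_feature_query(query):
    """Turn ':[cop, past]' into ['cop', 'past']."""
    body = query[1:].strip().strip("[]")  # drop ':' and the brackets
    return [term.strip().lower() for term in body.split(",") if term.strip()]


def feature_values(token):
    """Collect the lower-cased values from a FEATS string like 'Tense=Past|Number=Sing'."""
    feats = token.get("feats", "_")
    if feats == "_":  # '_' marks an empty FEATS column
        return set()
    values = set()
    for pair in feats.split("|"):
        _name, _sep, value = pair.partition("=")
        values.update(v.lower() for v in value.split(","))
    return values


def search_features(tokens, query):
    """Return tokens whose deprel or upostag matches the first term
    and whose features include all remaining terms."""
    label, *wanted = parse_feature_query(query)
    hits = []
    for tok in tokens:
        label_ok = label in (tok["deprel"].lower(), tok["upostag"].lower())
        if label_ok and set(wanted) <= feature_values(tok):
            hits.append(tok)
    return hits


# Made-up tokens with made-up feature values, for illustration only.
TOKENS = [
    {"form": "was", "upostag": "AUX", "deprel": "cop", "feats": "Mood=Ind|Tense=Past|VerbForm=Fin"},
    {"form": "is", "upostag": "AUX", "deprel": "cop", "feats": "Mood=Ind|Tense=Pres|VerbForm=Fin"},
    {"form": "ran", "upostag": "VERB", "deprel": "root", "feats": "Tense=Past|VerbForm=Fin"},
    {"form": "dog", "upostag": "NOUN", "deprel": "nsubj", "feats": "_"},
]

copulas = search_features(TOKENS, ":[cop, past]")
```

With this sample data, copulas contains only the token 'was': 'is' is a copula but not past, and 'ran' is past but not a copula.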

The ';' character

If you would like to search by a relationship (e.g., an nsubj relation to another node that has a noun POS):

Start your search with a ';'.

Then type a deprel tag, followed by a colon, followed by a part of speech.

The second term (the one after the ':') can also be the lemma or the word id_name.

To search for a word with an nsubj relationship to a noun, you would use:

';nsubj:noun'

Possible output:

'Token: 8, Form: жүзімдік, Lemma: жүзімдік, UPOSTAG: NOUN, HEAD: 6, DEPREL: conj, # sent_id = Шымкент.tagged.txt:8:216, Sentence: Тау етегінде өзен бойындағы алқаптарда егіншілік пен жүзімдік ал көгалды таулы жайылымдарда - мал шаруашылығы дамыған .'
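One possible reading of this search, sketched in Python: look for a token with the given deprel whose head matches the second term by part of speech, lemma, or ID. The sketch returns (dependent, head) pairs because the text above does not say which of the two conlluparse.py reports. The function names, the treatment of "id_name" as the ID column, the case-insensitive comparison, and the sample tokens are all assumptions made for illustration.

```python
# Illustrative sketch only; NOT the actual conlluparse.py implementation.

def parse_relation_query(query):
    """Turn ';nsubj:noun' into ('nsubj', 'noun')."""
    deprel, _sep, target = query[1:].partition(":")  # drop the leading ';'
    return deprel.strip().lower(), target.strip().lower()


def search_relation(tokens, query):
    """Return (dependent, head) pairs where the dependent has the given
    deprel and its head matches the target by upostag, lemma, or id."""
    deprel, target = parse_relation_query(query)
    by_id = {tok["id"]: tok for tok in tokens}
    pairs = []
    for tok in tokens:
        head = by_id.get(tok["head"])  # None for the root (HEAD 0)
        if head is None or tok["deprel"].lower() != deprel:
            continue
        # Assumption: "id_name" refers to the token's ID column.
        if target in (head["upostag"].lower(), head["lemma"].lower(), head["id"]):
            pairs.append((tok, head))
    return pairs


# Made-up tokens for the sentence "The dog is a pet ."
TOKENS = [
    {"id": "1", "form": "The", "lemma": "the", "upostag": "DET", "head": "2", "deprel": "det"},
    {"id": "2", "form": "dog", "lemma": "dog", "upostag": "NOUN", "head": "5", "deprel": "nsubj"},
    {"id": "3", "form": "is", "lemma": "be", "upostag": "AUX", "head": "5", "deprel": "cop"},
    {"id": "4", "form": "a", "lemma": "a", "upostag": "DET", "head": "5", "deprel": "det"},
    {"id": "5", "form": "pet", "lemma": "pet", "upostag": "NOUN", "head": "0", "deprel": "root"},
    {"id": "6", "form": ".", "lemma": ".", "upostag": "PUNCT", "head": "5", "deprel": "punct"},
]

pairs = search_relation(TOKENS, ";nsubj:noun")
```

With this sample data, pairs holds a single pair: the dependent 'dog' and its noun head 'pet'.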

Example Of How To Use This Program

python conlluparse.py "text.conllu" ':[cop, past]'

python conlluparse.py "text.conllu" ';nsubj:noun'

python conlluparse.py "text.conllu" '{none=none>form=bread}'

python conlluparse.py "text.conllu" '<bread'
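In all four commands the first character of the search term selects the kind of search. A minimal Python sketch of that dispatch step is given below; the names SEARCH_KINDS and classify_query and the labels for each kind are invented for this example and are not taken from conlluparse.py.

```python
# Illustrative sketch only; NOT the actual conlluparse.py implementation.
# Maps the leading character of a search term to a (hypothetical) search kind.
SEARCH_KINDS = {
    "<": "word",      # word-based search, e.g. '<bread'
    "{": "tree",      # head > dependent search, e.g. '{have>clue'
    ":": "feature",   # deprel/upostag plus feature, e.g. ':[cop, past]'
    ";": "relation",  # deprel to a matching node, e.g. ';nsubj:noun'
}


def classify_query(query):
    """Return (kind, rest_of_query) based on the first character."""
    if not query or query[0] not in SEARCH_KINDS:
        raise ValueError(f"unknown search type: {query!r}")
    return SEARCH_KINDS[query[0]], query[1:]


kind, rest = classify_query(";nsubj:noun")
```

Note that the single quotes around each search term in the commands above keep the shell from interpreting characters such as ';', '<' and '{' itself.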