Concordancer

From Apertium
Revision as of 00:43, 28 November 2013 by Francis Tyers (talk | contribs)

A concordancer is a tool that shows each occurrence of a word together with its surrounding context.

Web Concordancer

A concordancer web interface is available in the SVN at /trunk/apertium-tools/webconcordancer. Run python3 server.py to serve it at localhost:8080/concordancer.html. The web interface requires four primary inputs to set up:

  1. Corpus path: an absolute path to the corpus (e.g. /home/apertium/Desktop/corpus.txt)
  2. Language module: an absolute path to the language module to use for analysis (e.g. /home/apertium/Desktop/apertium-en-es)
  3. Language Pair: the language pair to pass to Apertium for morphological analysis (e.g. en-es-anmor)
  4. Search Window: the size of the context shown around each located token, in number of characters or number of tokens (e.g. 15)
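
As a rough illustration of what the search window controls, the following minimal sketch cuts a fixed number of characters of context on either side of every match. The function name kwic and the sample corpus are hypothetical and are not taken from the webconcordancer code; the window is measured in characters here.

```python
import re

def kwic(text, pattern, window=15):
    """Return (left, match, right) for every regex match in text.

    `window` is the number of characters of context kept on each side.
    Newlines in the context are shown as spaces so each hit stays on one row.
    """
    rows = []
    for m in re.finditer(pattern, text, flags=re.MULTILINE):
        left = text[max(0, m.start() - window):m.start()]
        right = text[m.end():m.end() + window]
        rows.append((left.replace("\n", " "), m.group(0), right.replace("\n", " ")))
    return rows

# Hypothetical two-line corpus used only for this example.
corpus = "The cat sat on the mat.\nThe dog saw the cat."

for left, hit, right in kwic(corpus, r"cat", window=8):
    # Right-align the left context so the hits line up in a column.
    print(f"{left:>8} [{hit}] {right}")
```

Right-aligning the left context keeps the matched word in a fixed column, which is the usual way concordance lines are displayed.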

The interface also supports four distinct search modes, each of which has support for regular expressions (enabled via a checkbox):

  1. Tag Search: this mode will output all tokens in the corpus with all of the specified search tags and supports regex inside individual tags
    • e.g. <n> will show all noun tokens
    • e.g. <n><sg> will show all singular noun tokens
    • e.g. <p[1-2]> will show all first person and second person tokens
  2. Lemma Search: this mode will output all tokens in the corpus which contain lemmas that match the given search string and supports regex within the search string (omit the regex '$' token)
    • e.g. be will show all the tokens that are forms of the word 'be'
    • e.g. t.* will show all the tokens that have lemmas beginning with 't'
  3. Surface Form Search: this mode will output all tokens in the corpus which have a surface form that matches the search string and supports regex within the search string (omit the regex '$' token)
    • e.g. the will show all the instances of 'the' in the corpus
    • e.g. [0-9]+ will show all the tokens composed entirely of Arabic numerals
  4. Raw Corpus Search: this mode will find all matches to the search string in the corpus and supports regex within the search string
    • e.g. previous will find all the instances of the letter sequence 'previous' in the corpus
    • e.g. \.$ will find all the instances of a period character ending a line in the corpus
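
The tag, lemma and surface form modes can be sketched over text in the Apertium stream format (^surface/lemma<tag><tag>$). This is only an illustration under stated assumptions, not the webconcordancer implementation: the function names are hypothetical, the sample analysis is hand-written rather than real analyser output, tags are given without angle brackets, patterns are anchored with re.fullmatch, and escaped characters in the stream are not handled.

```python
import re

UNIT = re.compile(r"\^(.*?)\$")   # one lexical unit: ^surface/reading/reading$
TAG = re.compile(r"<([^>]+)>")    # one tag inside a reading: <n>, <pl>, ...

def parse(analysed):
    """Return a list of (surface, [(lemma, [tags]), ...]) for each lexical unit."""
    tokens = []
    for unit in UNIT.findall(analysed):
        surface, *readings = unit.split("/")
        parsed = [(r.split("<", 1)[0], TAG.findall(r)) for r in readings]
        tokens.append((surface, parsed))
    return tokens

def tag_search(tokens, tag_patterns):
    """Surface forms with a reading in which every pattern matches some tag."""
    return [s for s, readings in tokens
            if any(all(any(re.fullmatch(p, t) for t in tags) for p in tag_patterns)
                   for _, tags in readings)]

def lemma_search(tokens, pattern):
    """Surface forms with at least one reading whose lemma matches the pattern."""
    return [s for s, readings in tokens
            if any(re.fullmatch(pattern, lemma) for lemma, _ in readings)]

def surface_search(tokens, pattern):
    """Surface forms that match the pattern as a whole."""
    return [s for s, _ in tokens if re.fullmatch(pattern, s)]

# Hand-written sample in stream format (assumed analyses, for illustration only).
sample = ("^The/the<det><def><sp>$ ^cats/cat<n><pl>$ ^are/be<vbser><pres>$ "
          "^in/in<pr>$ ^2/2<num>$ ^towns/town<n><pl>$")
tokens = parse(sample)

print(tag_search(tokens, ["n", "pl"]))    # plural nouns
print(lemma_search(tokens, "be"))         # forms of 'be'
print(lemma_search(tokens, "t.*"))        # lemmas beginning with 't'
print(surface_search(tokens, "[0-9]+"))   # tokens made only of digits
```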

Current Interface

Concordancer.png

Proposed Changes

  1. Apply an unlimited number of sets of filters simultaneously.
  2. To appear in the output, a line of the corpus must contain tokens that fulfill all the sets of filters specified.
  3. Filters can be added and deleted one at a time.
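
One possible reading of the proposal is sketched below: a line is kept when, for every filter set, at least one token on the line passes all filters in that set. The token layout, the helper names (tok, has_tag, has_lemma, line_matches) and the sample lines are assumptions made for this example, not part of the proposed interface.

```python
def tok(surface, lemma, *tags):
    """Build a token record (hypothetical layout for this sketch)."""
    return {"surface": surface, "lemma": lemma, "tags": set(tags)}

def has_tag(tag):
    """Filter: the token carries the given tag."""
    return lambda t: tag in t["tags"]

def has_lemma(lemma):
    """Filter: the token has the given lemma."""
    return lambda t: t["lemma"] == lemma

def line_matches(tokens, filter_sets):
    """True if every filter set is satisfied by at least one token on the line."""
    return all(any(all(f(t) for f in fs) for t in tokens) for fs in filter_sets)

def matching_lines(lines, filter_sets):
    """Indices of the lines that satisfy all filter sets."""
    return [i for i, line in enumerate(lines) if line_matches(line, filter_sets)]

# Two hand-made corpus lines, already analysed into tokens.
lines = [
    [tok("the", "the", "det"), tok("cats", "cat", "n", "pl"), tok("sleep", "sleep", "vblex", "pres")],
    [tok("a", "a", "det"), tok("cat", "cat", "n", "sg")],
]

filter_sets = [[has_tag("n")], [has_lemma("sleep")]]
print(matching_lines(lines, filter_sets))   # only the first line has a noun and a form of 'sleep'

del filter_sets[1]                          # delete one filter set
print(matching_lines(lines, filter_sets))   # both lines contain a noun

filter_sets.append([has_tag("pl")])         # add a new filter set
print(matching_lines(lines, filter_sets))   # only the first line has a plural token
```

Keeping the filter sets in a plain list makes adding and deleting them one at a time a matter of append and del.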

Proposed Interface (draft)

ConcordancerProposed.png

External links