Difference between revisions of "Concordancer"

From Apertium
(Created page with " A '''concordancer''' is a tool which shows you a word, in context. ==External links== * [https://svn.code.sf.net/p/apertium/svn/trunk/apertium-tools/webconcordancer webcon...")
 
(add web concordancer section)

Revision as of 20:34, 27 November 2013

A concordancer is a tool which shows you a word, in context.

Web Concordancer

A concordancer web interface is available in SVN at /trunk/apertium-tools/webconcordancer (https://svn.code.sf.net/p/apertium/svn/trunk/apertium-tools/webconcordancer). Run python3 server.py to serve the interface at localhost:8080/concordancer.html. The web interface requires four primary inputs to set up:

  1. Corpus path: an absolute path to the corpus (e.g. /home/apertium/Desktop/corpus.txt)
  2. Language module: an absolute path to the language module to use for analysis (e.g. /home/apertium/Desktop/apertium-en-es)
  3. Language pair: the language pair to pass to Apertium for morphological analysis (e.g. en-es-anmor)
  4. Search window: the size of the context shown around each located token, in number of characters or number of tokens (e.g. 15)
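As an illustration of what the search window controls, here is a minimal sketch of a character-based window around each regex match in a raw text. It is not the code from server.py: the function name concordance, the sample corpus, and the use of re.MULTILINE (so that '$' matches at line ends) are all assumptions made for this example.

```python
import re

def concordance(text, pattern, window=15):
    """Return (left, match, right) triples with up to `window`
    characters of context on each side of every match."""
    hits = []
    # re.MULTILINE is an assumption: it lets '$' match at the end of each line.
    for m in re.finditer(pattern, text, flags=re.MULTILINE):
        left = text[max(0, m.start() - window):m.start()]
        right = text[m.end():m.end() + window]
        hits.append((left, m.group(0), right))
    return hits

# Hypothetical two-line corpus, used only for demonstration.
corpus = "The previous owner sold the house.\nThe new owner moved in."

for left, hit, right in concordance(corpus, r"owner", window=10):
    # repr() keeps any newline inside the context visible on one line.
    print(repr(left), f"[{hit}]", repr(right))
```

A token-based window would work the same way, except that the slices would be taken over a list of tokens instead of over the character string.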

The interface also supports four distinct search modes, each of which supports regular expressions (enabled via a checkbox):

  1. Tag Search: outputs all tokens in the corpus that carry all of the specified search tags; regex is supported inside individual tags
    • e.g. <n> will show all noun tokens
    • e.g. <n><sg> will show all singular noun tokens
    • e.g. <p[1-2]> will show all first-person and second-person tokens
  2. Lemma Search: outputs all tokens in the corpus whose lemmas match the given search string; regex is supported within the search string (omit the regex '$' token)
    • e.g. be will show all the tokens that are forms of the verb 'be'
    • e.g. t.* will show all the tokens that have lemmas beginning with 't'
  3. Surface Form Search: outputs all tokens in the corpus whose surface form matches the search string; regex is supported within the search string (omit the regex '$' token)
    • e.g. the will show all the instances of 'the' in the corpus
    • e.g. [0-9]+ will show all the tokens composed entirely of Arabic numerals
  4. Raw Corpus Search: finds all matches to the search string in the corpus; regex is supported within the search string
    • e.g. previous will find all the instances of the letter sequence 'previous' in the corpus
    • e.g. \.$ will find all the instances of a period character ending a line in the corpus
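The tag, lemma and surface form modes can be sketched over analyser output in the Apertium stream format (^surface/lemma<tag><tag>$). This is only a rough approximation and not the web concordancer's own implementation: the sample analyses, the function names, and the use of re.fullmatch (one possible reason for omitting '$' from the search string) are assumptions made for this example.

```python
import re

# Hypothetical analyser output in Apertium stream format.
ANALYSED = (
    "^we/prpers<prn><subj><p1><mf><pl>$ ^were/be<vbser><past>$ "
    "^there/there<adv>$ ^with/with<pr>$ "
    "^the/the<det><def><sp>$ ^dogs/dog<n><pl>$"
)

UNIT = re.compile(r"\^([^/$]*)((?:/[^/$]*)*)\$")  # one ^...$ lexical unit
TAG = re.compile(r"<([^>]*)>")                    # one <tag>

def parse(stream):
    """Return a list of (surface, [(lemma, [tags]), ...]) per lexical unit."""
    units = []
    for m in UNIT.finditer(stream):
        readings = []
        for reading in m.group(2).split("/")[1:]:
            lemma = reading.split("<", 1)[0]
            readings.append((lemma, TAG.findall(reading)))
        units.append((m.group(1), readings))
    return units

def tag_search(units, patterns):
    """Surfaces with a reading in which every pattern fully matches some tag."""
    return [
        surface for surface, readings in units
        if any(all(any(re.fullmatch(p, t) for t in tags) for p in patterns)
               for _, tags in readings)
    ]

def lemma_search(units, pattern):
    """Surfaces with at least one reading whose lemma fully matches pattern."""
    return [s for s, rs in units if any(re.fullmatch(pattern, l) for l, _ in rs)]

def surface_search(units, pattern):
    """Surfaces that fully match pattern."""
    return [s for s, _ in units if re.fullmatch(pattern, s)]

units = parse(ANALYSED)
print(tag_search(units, ["n", "pl"]))   # tags given without angle brackets
print(tag_search(units, ["p[1-2]"]))
print(lemma_search(units, "t.*"))
print(surface_search(units, "the"))
```

Matching each tag in full keeps a search for n from also hitting prn, which a substring search over the whole analysis would do.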

External links