Concordancer

Latest revision as of 02:11, 10 March 2018

Note: After Apertium's migration to GitHub, this tool is read-only on the SourceForge repository and does not exist on GitHub. If you are interested in migrating this tool to GitHub, see Migrating tools to GitHub.

A concordancer is a tool which shows you a word, in context.

Web Concordancer

A concordancer web interface is available in the SVN at /trunk/apertium-tools/webconcordancer. Run python3 server.py to serve it at localhost:8080/concordancer.html. The web interface requires four primary inputs to set up:

  1. Corpus path: an absolute path to the corpus (e.g. /home/apertium/Desktop/corpus.txt)
  2. Language module: an absolute path to the language module to use for analysis (e.g. /home/apertium/Desktop/apertium-en-es)
  3. Language Pair: the language pair to pass to Apertium for morphological analysis (e.g. en-es-anmor)
  4. Search Window: the size of the context shown around each located token, measured in characters or in tokens (e.g. 15)
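As a rough illustration of how these four inputs fit together, here is a minimal Python sketch. It is not taken from server.py: the names ConcordancerSettings, build_analysis_command and keyword_in_context are hypothetical, and it assumes the analysis is run through the standard apertium command with its -d (data directory) option and that the search window is counted in characters.

```python
import re
from dataclasses import dataclass


@dataclass
class ConcordancerSettings:
    """Hypothetical container for the four inputs of the web form."""
    corpus_path: str   # absolute path to the corpus file
    module_path: str   # absolute path to the language module directory
    mode: str          # value of the "Language Pair" field, e.g. "en-es-anmor"
    window: int        # context size, counted in characters in this sketch


def build_analysis_command(settings):
    """Return an argv list that analyses text read from standard input.

    `apertium -d DIR MODE` runs the mode MODE found in the directory DIR.
    """
    return ["apertium", "-d", settings.module_path, settings.mode]


def keyword_in_context(text, pattern, window):
    """Yield (left, match, right) for every regex match in text."""
    for m in re.finditer(pattern, text):
        left = text[max(0, m.start() - window):m.start()]
        right = text[m.end():m.end() + window]
        yield left, m.group(0), right
```

The command list could be handed to subprocess.run together with the corpus contents as input; the sketch stops short of that so it can be read without an Apertium installation.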

The interface also supports four distinct search modes, each of which supports regular expressions (enabled via a checkbox):

  1. Tag Search: outputs all tokens in the corpus that carry all of the specified search tags; regex is supported inside individual tags
    • e.g. <n> will show all noun tokens
    • e.g. <n><sg> will show all singular noun tokens
    • e.g. <p[1-2]> will show all first person and second person tokens
  2. Lemma Search: outputs all tokens in the corpus whose lemma matches the given search string; regex is supported within the search string (omit the regex '$' token)
    • e.g. be will show all the tokens that are forms of the word 'be'
    • e.g. t.* will show all the tokens that have lemmas beginning with 't'
  3. Surface Form Search: outputs all tokens in the corpus whose surface form matches the search string; regex is supported within the search string (omit the regex '$' token)
    • e.g. the will show all the instances of 'the' in the corpus
    • e.g. [0-9]+ will show all the tokens composed entirely of Arabic numerals
  4. Raw Corpus Search: finds all matches of the search string in the raw corpus text; regex is supported within the search string
    • e.g. previous will find all the instances of the letter sequence 'previous' in the corpus
    • e.g. \.$ will find all the instances of a period character ending a line in the corpus
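The three token-based modes can be pictured as filters over analysed text in the Apertium stream format, where each lexical unit looks like ^surface/lemma<tag><tag>$. The following Python sketch is only an illustration under stated assumptions, not the code in server.py: the function and type names are hypothetical, the parser ignores escaped characters and superblanks, and it assumes that patterns must match the whole tag, lemma or surface form (which would be consistent with the advice to omit '$').

```python
import re
from typing import List, NamedTuple


class Analysis(NamedTuple):
    lemma: str
    tags: List[str]


class Token(NamedTuple):
    surface: str
    analyses: List[Analysis]


LEXICAL_UNIT = re.compile(r"\^([^$]*)\$")   # ^...$ with no escape handling
TAG = re.compile(r"<([^>]+)>")


def parse_stream(text):
    """Parse analysed text into tokens (simplified stream format)."""
    tokens = []
    for unit in LEXICAL_UNIT.findall(text):
        surface, *readings = unit.split("/")
        analyses = [Analysis(r.split("<", 1)[0], TAG.findall(r)) for r in readings]
        tokens.append(Token(surface, analyses))
    return tokens


def tag_search(tokens, tag_patterns):
    """Tokens with one analysis in which every pattern matches some tag."""
    def has_all(analysis):
        return all(any(re.fullmatch(p, t) for t in analysis.tags)
                   for p in tag_patterns)
    return [tok for tok in tokens if any(has_all(a) for a in tok.analyses)]


def lemma_search(tokens, pattern):
    """Tokens with at least one analysis whose lemma matches the pattern."""
    return [tok for tok in tokens
            if any(re.fullmatch(pattern, a.lemma) for a in tok.analyses)]


def surface_search(tokens, pattern):
    """Tokens whose surface form matches the pattern."""
    return [tok for tok in tokens if re.fullmatch(pattern, tok.surface)]
```

For example, with the analysed text ^dogs/dog<n><pl>$ ^are/be<vbser><pres>$, a tag search for n and pl would return the token 'dogs' and a lemma search for be would return 'are'. Raw Corpus Search needs no analysis at all and can be done with re.finditer over the corpus text.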

Current Interface

Concordancer.png

Proposed Changes

  1. Apply an unlimited number of sets of filters simultaneously.
  2. To appear in the output, a line of the corpus must contain tokens that fulfil every set of filters specified.
  3. Filters can be added and deleted one at a time.
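One possible reading of the proposal is that each set of filters must be satisfied by at least one token on the line. The Python sketch below illustrates that reading only; the proposal itself does not specify the semantics, and the dictionary layout and the names token_matches and line_matches are hypothetical.

```python
import re


def token_matches(token, filter_set):
    """True if the token satisfies every condition in one filter set.

    A filter set is a dict with optional keys "surface", "lemma" (regex
    strings) and "tags" (a list of regex strings). Hypothetical layout.
    """
    if "surface" in filter_set and not re.fullmatch(filter_set["surface"], token["surface"]):
        return False
    if "lemma" in filter_set and not re.fullmatch(filter_set["lemma"], token["lemma"]):
        return False
    for pattern in filter_set.get("tags", []):
        if not any(re.fullmatch(pattern, tag) for tag in token["tags"]):
            return False
    return True


def line_matches(line_tokens, filter_sets):
    """True if every filter set is satisfied by some token on the line."""
    return all(any(token_matches(tok, fs) for tok in line_tokens)
               for fs in filter_sets)
```

Under this reading, a line containing a plural noun and a form of 'be' passes the two filter sets {"tags": ["n", "pl"]} and {"lemma": "be"}, while adding a third set that no token satisfies removes the line from the output.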

Proposed Interface (draft)

ConcordancerProposed.png

TODO

  • Efficiency: Make it scale up to corpora of millions of words. This might involve (a) pre-analysing the corpus, so that the program reads from a pre-analysed corpus instead of reading and analysing the raw text on each run; and (b) indexing with SQLite or something similar.
  • Pagination: Limit results to n per page, perhaps 10, 50, 100, 500 or all
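Both TODO items could plausibly be addressed with one indexed table of pre-analysed tokens. The sketch below uses Python's standard sqlite3 module with an in-memory database; the table layout and the names build_index and lemma_page are assumptions made for illustration, not an existing part of the tool.

```python
import sqlite3


def build_index(rows):
    """Store pre-analysed tokens as (line_no, surface, lemma) rows."""
    conn = sqlite3.connect(":memory:")  # a file path would persist the index
    conn.execute(
        "CREATE TABLE tokens ("
        "id INTEGER PRIMARY KEY, line_no INTEGER, surface TEXT, lemma TEXT)"
    )
    # The index lets lemma lookups avoid scanning the whole table.
    conn.execute("CREATE INDEX idx_tokens_lemma ON tokens (lemma)")
    conn.executemany(
        "INSERT INTO tokens (line_no, surface, lemma) VALUES (?, ?, ?)", rows
    )
    conn.commit()
    return conn


def lemma_page(conn, lemma, page=0, per_page=None):
    """Return one page of (line_no, surface) hits; per_page=None means all."""
    sql = "SELECT line_no, surface FROM tokens WHERE lemma = ? ORDER BY id"
    params = [lemma]
    if per_page is not None:
        sql += " LIMIT ? OFFSET ?"
        params += [per_page, page * per_page]
    return conn.execute(sql, params).fetchall()
```

LIMIT and OFFSET keep each response small; for very deep pages, paging on the last seen id would avoid re-reading skipped rows, at the cost of a slightly more complex interface.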

External links