Difference between revisions of "Concordancer"

Latest revision as of 02:11, 10 March 2018

Note: After Apertium's migration to GitHub, this tool is read-only on the SourceForge repository and does not exist on GitHub. If you are interested in migrating this tool to GitHub, see Migrating tools to GitHub.

Web Concordancer

A concordancer web interface is available in the SVN at /trunk/apertium-tools/webconcordancer. Use python3 server.py to run the code on localhost:8080/concordancer.html. The web interface requires four primary inputs to set up:

Corpus path: an absolute path to the corpus (e.g. /home/apertium/Desktop/corpus.txt)
Language module: an absolute path to the language module to use for analysis (e.g. /home/apertium/Desktop/apertium-en-es)
Language Pair: the language pair to pass to Apertium for morphological analysis (e.g. en-es-anmor)
Search Window: the size of the context shown around each located token, given as a number of characters or a number of tokens (e.g. 15)
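As a rough illustration of the Search Window input, the sketch below cuts a context window around a hit, once counted in tokens and once in characters. The function names are hypothetical and are not taken from server.py; how the tool itself decides between the two units is not documented here.

```python
def context_by_tokens(tokens, index, window):
    """Return (left, hit, right) with up to `window` tokens on each side."""
    start = max(0, index - window)
    left = tokens[start:index]
    right = tokens[index + 1:index + 1 + window]
    return left, tokens[index], right


def context_by_chars(text, start, end, window):
    """Return (left, hit, right) with up to `window` characters on each side."""
    left = text[max(0, start - window):start]
    right = text[end:end + window]
    return left, text[start:end], right
```

Both helpers clip at the edges of the input, so a hit near the start or end of a line simply gets a shorter context.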

The interface also supports four distinct search modes, each of which has support for regular expressions (enabled via a checkbox):

Tag Search: this mode will output all tokens in the corpus with all of the specified search tags and supports regex inside individual tags
- e.g. <n> will show all noun tokens
- e.g. <n><sg> will show all singular noun tokens
- e.g. <p[1-2]> will show all first person and second person tokens
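The tag search described above can be sketched as follows, assuming the corpus has already been analysed into the Apertium stream format (^surface/lemma<tag>...$). This is an illustration, not the code in server.py: the function names are hypothetical, and the parser ignores escaped characters for brevity.

```python
import re

# One lexical unit in Apertium stream format: ^surface/analysis1/analysis2$
LEXICAL_UNIT = re.compile(r"\^([^$]*)\$")
TAG = re.compile(r"<([^>]+)>")


def parse_units(analysed):
    """Split analysed text into (surface, [analysis, ...]) pairs."""
    units = []
    for body in LEXICAL_UNIT.findall(analysed):
        surface, *analyses = body.split("/")
        units.append((surface, analyses))
    return units


def tag_search(analysed, query):
    """Return surface forms with an analysis carrying every queried tag.

    Each tag in the query (e.g. "<n><sg>") is treated as a regex that
    must match one whole tag of the analysis.
    """
    wanted = [re.compile(p) for p in TAG.findall(query)]
    hits = []
    for surface, analyses in parse_units(analysed):
        for analysis in analyses:
            tags = TAG.findall(analysis)
            if all(any(p.fullmatch(t) for t in tags) for p in wanted):
                hits.append(surface)
                break  # one matching analysis is enough for this token
    return hits
```

Matching whole tags rather than substrings keeps <n> from also matching <prn>.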

Lemma Search: this mode will output all tokens in the corpus which contain lemmas that match the given search string and supports regex within the search string (omit the regex '$' token)
- e.g. be will show all the tokens that are forms of the word 'be'
- e.g. t.* will show all the tokens that have lemmas beginning with 't'
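A minimal sketch of lemma search over Apertium stream-format output is given below. It assumes the search string is anchored to the whole lemma, which is one reading of the advice to omit '$'; the function name is hypothetical and not taken from server.py.

```python
import re

# One lexical unit in Apertium stream format: ^surface/analysis1/analysis2$
LEXICAL_UNIT = re.compile(r"\^([^$]*)\$")


def lemma_search(analysed, pattern):
    """Return surface forms that have at least one lemma matching `pattern`."""
    regex = re.compile(pattern)
    hits = []
    for body in LEXICAL_UNIT.findall(analysed):
        surface, *analyses = body.split("/")
        # The lemma is the part of an analysis before its first tag.
        lemmas = [analysis.split("<", 1)[0] for analysis in analyses]
        if any(regex.fullmatch(lemma) for lemma in lemmas):
            hits.append(surface)
    return hits
```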

Surface Form Search: this mode will output all tokens in the corpus which have a surface form that matches the search string and supports regex within the search string (omit the regex '$' token)
- e.g. the will show all the instances of 'the' in the corpus
- e.g. [0-9]+ will show all the tokens composed entirely of Arabic numerals
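Surface form search can be sketched the same way, this time matching the part of each lexical unit before the first slash. As with the other sketches, the whole-token anchoring and the function name are assumptions rather than details confirmed by server.py.

```python
import re

# One lexical unit in Apertium stream format: ^surface/analysis1/analysis2$
LEXICAL_UNIT = re.compile(r"\^([^$]*)\$")


def surface_search(analysed, pattern):
    """Return surface forms that match `pattern` as a whole token."""
    regex = re.compile(pattern)
    hits = []
    for body in LEXICAL_UNIT.findall(analysed):
        surface = body.split("/", 1)[0]
        if regex.fullmatch(surface):
            hits.append(surface)
    return hits
```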

Raw Corpus Search: this mode will find all matches to the search string in the corpus and supports regex within the search string
- e.g. previous will find all the instances of the letter sequence 'previous' in the corpus
- e.g. \.$ will find all the instances of a period character ending a line in the corpus
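Raw corpus search needs no morphological analysis. The sketch below is one way to do it in Python; it assumes multi-line matching so that '$' can match at the end of every line, as the second example implies. The function name is hypothetical.

```python
import re


def raw_search(corpus, pattern):
    """Return (line_number, matched_text) for every match in the raw corpus."""
    # re.MULTILINE lets ^ and $ match at the start and end of each line.
    regex = re.compile(pattern, re.MULTILINE)
    results = []
    for match in regex.finditer(corpus):
        line_number = corpus.count("\n", 0, match.start()) + 1
        results.append((line_number, match.group(0)))
    return results
```

Counting newlines for every match is simple but slow on large corpora; a real implementation would more likely iterate line by line.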

Current Interface

Proposed Changes

Apply an unlimited number of sets of filters simultaneously.
To appear in the output, a line of the corpus must have tokens which fulfill all the sets of filters specified.
Filters can be added one by one and deleted individually.
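The proposed behaviour could look roughly like the sketch below: each filter set is a dictionary with optional "surface", "lemma" and "tags" entries, and a line is kept only if every filter set is satisfied by some token on that line. The dictionary layout and function names are invented for illustration and do not describe an existing implementation.

```python
import re

# One lexical unit in Apertium stream format: ^surface/analysis1/analysis2$
LEXICAL_UNIT = re.compile(r"\^([^$]*)\$")
TAG = re.compile(r"<([^>]+)>")


def token_matches(body, flt):
    """Check one lexical unit against one filter set."""
    surface, *analyses = body.split("/")
    if "surface" in flt and not re.fullmatch(flt["surface"], surface):
        return False
    # A token without analyses is treated as having one empty analysis.
    for analysis in analyses or [""]:
        lemma = analysis.split("<", 1)[0]
        tags = TAG.findall(analysis)
        if "lemma" in flt and not re.fullmatch(flt["lemma"], lemma):
            continue
        if all(any(re.fullmatch(p, t) for t in tags) for p in flt.get("tags", [])):
            return True
    return False


def filter_lines(analysed_lines, filter_sets):
    """Keep lines in which every filter set is fulfilled by some token."""
    kept = []
    for line in analysed_lines:
        bodies = LEXICAL_UNIT.findall(line)
        if all(any(token_matches(b, f) for b in bodies) for f in filter_sets):
            kept.append(line)
    return kept
```

Adding or deleting a filter then amounts to appending to or removing from the filter_sets list before re-running the search.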

Proposed Interface (draft)

TODO

Efficiency: Make it scale up to corpora of millions of words. This might involve (a) pre-analysing the corpus, so that the program reads from a pre-analysed corpus instead of reading and analysing the raw text on every search; and (b) indexing using SQLite or something similar.
Pagination: Limit results to n per page, perhaps 10, 50, 100, 500, or all.
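One possible shape for both TODO items is sketched below with Python's standard sqlite3 module: pre-analysed tokens are stored in an indexed table, and results are fetched one page at a time. The table layout, column names and function names are assumptions made for this example, not an existing design.

```python
import sqlite3


def build_index(rows, path=":memory:"):
    """Store pre-analysed tokens and index them by lemma.

    `rows` holds (line_no, position, surface, lemma, tags) tuples.
    """
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS tokens ("
        "line_no INTEGER, position INTEGER, "
        "surface TEXT, lemma TEXT, tags TEXT)"
    )
    conn.execute("CREATE INDEX IF NOT EXISTS idx_tokens_lemma ON tokens (lemma)")
    conn.executemany("INSERT INTO tokens VALUES (?, ?, ?, ?, ?)", rows)
    conn.commit()
    return conn


def lemma_page(conn, lemma, page=0, per_page=10):
    """Return one page of hits for a lemma; per_page=None returns all hits."""
    # In SQLite a negative LIMIT means "no upper bound".
    limit = -1 if per_page is None else per_page
    offset = 0 if per_page is None else page * per_page
    cursor = conn.execute(
        "SELECT line_no, position, surface FROM tokens "
        "WHERE lemma = ? ORDER BY line_no, position LIMIT ? OFFSET ?",
        (lemma, limit, offset),
    )
    return cursor.fetchall()
```

This only covers exact lemma lookups; regex searches would still need a scan or a custom SQL function on top of the index.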

External links

@@ Line 1: / Line 1: @@
+{{Github-unmigrated-tool}}
+{{TOCD}}
 A '''concordancer''' is a tool which shows you a word, in context.
@@ Line 5: / Line 6: @@
 A concordancer web interface is available in the SVN at [https://svn.code.sf.net/p/apertium/svn/trunk/apertium-tools/webconcordancer /trunk/apertium-tools/webconcordancer]. Use <code>python3 server.py</code> to run the code on <code>localhost:8080/concordancer.html</code>. The web interface requires three primary inputs to set up:
-# Corpus path: an absolute path to the the corpus (e.g. /home/apertium/Desktop/corpus.txt)
+# '''Corpus path''': an absolute path to the the corpus (e.g. /home/apertium/Desktop/corpus.txt)
-# Language module: an absolute path to the language module to use for analysis (e.g. /home/apertium/Desktop/apertium-en-es)
+# '''Language module''': an absolute path to the language module to use for analysis (e.g. /home/apertium/Desktop/apertium-en-es)
-# Language Pair: the language pair to pass to Apertium for morphological analysis (e.g. en-es-anmor)
+# '''Language Pair''': the language pair to pass to Apertium for morphological analysis (e.g. en-es-anmor)
-# Search Window: the size of the context around the token located in number of characters/number of tokens (e.g. 15)
+# '''Search Window''': the size of the context around the token located in number of characters/number of tokens (e.g. 15)
 The interface also supports four distinct search modes, each of which has support for regular expressions (enabled via a checkbox):
@@ Line 18: / Line 19: @@
 # '''Lemma Search''': this mode will output all tokens in the corpus which contain lemmas that match the given search string and supports regex within the search string (omit the regex '$' token)
-#* e.g. <code>be</code> will show all the tokens that are forms of the verb 'be'
+#* e.g. <code>be</code> will show all the tokens that are forms of the word 'be'
 #* e.g. <code>t.*</code> will show all the tokens that have lemmas beginning with 't'
@@ Line 28: / Line 29: @@
 #* e.g. <code>previous</code> will find all the instances of the letter sequence 'previous' in the corpus
 #* e.g. <code>\.$</code> will find all the instances of a period character ending a line in the corpus
+=== Current Interface ===
+[[File:concordancer.png|600px]]
+=== Proposed Changes ===
+# Apply an unlimited number of sets of filters simultaneously.
+# To appear in the output, a line of the corpus must have tokens which fulfill all the sets of filters specified.
+# Filters can be added one by one and deleted
+=== Proposed Interface (draft) ===
+[[File:concordancerProposed.png|600px]]
+==TODO==
+* Efficiency: Make it scale up to corpora of millions of words. This might involve doing (a) pre-analysis of the corpus -- e.g. the program doesn't read + analyse, but rather just read from a pre-analysed corpus; and (b) indexing using SQLite or something similar.
+* Pagination: Limit results to ''n'' per page, perhaps 10,50,100,500,all
 ==External links==
