Language identification
Latest revision as of 20:57, 1 March 2016
Language identification, or language recognition, is the process of determining which language a text (document/paragraph/sentence/word/…) is written in.
Apertium-apy uses the CLD2 library for language identification. Optionally, it can instead compare how much of the text each language's analyser covers, but this is much slower.
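
The coverage idea can be sketched as follows. This is only an illustration, not the Apertium-apy code: the small word sets in `TOY_LEXICONS` are hypothetical stand-ins for real morphological analysers, and the function names are made up for this example. Each candidate language is scored by the fraction of tokens it recognises, and the language with the highest coverage wins.

```python
import re

# Hypothetical stand-ins for morphological analysers. A real setup would ask
# each language's analyser whether it can analyse a token, which is what makes
# the approach slow.
TOY_LEXICONS = {
    "eng": {"the", "is", "a", "this", "house", "small", "and"},
    "nob": {"det", "er", "et", "dette", "hus", "lite", "og"},
}


def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"\w+", text.lower())


def coverage(tokens, lexicon):
    """Fraction of tokens that the (toy) analyser recognises."""
    if not tokens:
        return 0.0
    known = sum(1 for token in tokens if token in lexicon)
    return known / len(tokens)


def identify_by_coverage(text, lexicons=TOY_LEXICONS):
    """Return the best-covered language code and the score for every language."""
    tokens = tokenize(text)
    scores = {lang: coverage(tokens, lex) for lang, lex in lexicons.items()}
    best = max(scores, key=scores.get)
    return best, scores


if __name__ == "__main__":
    print(identify_by_coverage("This is a small house"))
    print(identify_by_coverage("Dette er et lite hus"))
```

Every candidate language has to process the whole input, so the cost grows with the number of installed analysers, which is one plausible reason the coverage mode is slow.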
See also
- Apertium-apy/Language identification for some accuracy experiments with CLD2
- http://odur.let.rug.nl/~vannoord/TextCat/ the original TextCat library (Perl; there is also a C port)
- https://github.com/unhammer/gt-CorpusTools/blob/master/corpustools/text_cat.py Python 2 reimplementation of TextCat
- http://www.ling.upenn.edu/Events/PLC/plc37/abstracts/Szymanski_PLC37_CT.pdf for identifying parallel text in another language within the same document.
- https://github.com/saffsd/langid.py a Naïve Bayes method implemented in Python; comes with many pre-trained models
- https://www.mediawiki.org/wiki/User:TJones_%28WMF%29/Notes/Balanced_Language_Identification_Evaluation_Set_for_Queries / https://www.mediawiki.org/wiki/User:TJones_%28WMF%29/Notes/Language_Detection_with_TextCat#Query_Data_Collection – modifies the TextCat method to work on very short strings
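
For readers unfamiliar with the TextCat family of tools listed above, here is a minimal sketch of character n-gram rank-order classification, the general technique TextCat is known for. It is not the code of any of the linked projects. The training sentences, the language codes, the n-gram range (1 to 3) and the function names are all assumptions chosen to keep the example small; real profiles are built from far more text.

```python
from collections import Counter

# Tiny, made-up training samples; real language profiles need much more text.
TOY_SAMPLES = {
    "eng": "the quick brown fox jumps over the lazy dog "
           "and then the dog sleeps in the house",
    "nob": "den raske brune reven hopper over den late hunden "
           "og så sover hunden i huset",
}


def ngram_profile(text, max_n=3, top_k=300):
    """Map each of the top_k most frequent character n-grams to its rank."""
    padded = " " + " ".join(text.lower().split()) + " "
    counts = Counter()
    for n in range(1, max_n + 1):
        for i in range(len(padded) - n + 1):
            counts[padded[i:i + n]] += 1
    ranked = [gram for gram, _ in counts.most_common(top_k)]
    return {gram: rank for rank, gram in enumerate(ranked)}


def out_of_place(doc_profile, lang_profile, max_penalty=300):
    """Sum of rank differences; n-grams missing from the language profile
    cost a fixed maximum penalty."""
    total = 0
    for gram, rank in doc_profile.items():
        if gram in lang_profile:
            total += abs(rank - lang_profile[gram])
        else:
            total += max_penalty
    return total


def identify(text, lang_profiles):
    """Return the language whose profile is closest to the text's profile."""
    doc_profile = ngram_profile(text)
    return min(
        lang_profiles,
        key=lambda lang: out_of_place(doc_profile, lang_profiles[lang]),
    )


PROFILES = {lang: ngram_profile(sample) for lang, sample in TOY_SAMPLES.items()}

if __name__ == "__main__":
    print(identify("the dog is in the house", PROFILES))
    print(identify("hunden sover i huset", PROFILES))
```

Very short inputs produce only a handful of n-grams, so the distance is dominated by a few matches or misses. That is the weakness the short-string adaptation linked above tries to address.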