Difference between revisions of "Begiak/awik"

From Apertium
(add first issue and wishlist)
 
(add solution and further information)

Revision as of 22:14, 16 December 2013

Apertium Wiki Begiak Command

Issues

Bad message chopping (fixed in r_______)

Problem: Commands such as .awik Руководство по созданию новой языковой пары produce output that omits the URL of the wiki article: the excerpt from the article is too long, so the message is cut off before the URL is reached. The same problem can be observed with the phenny Wikipedia plugin. For example,

(3:14:31 PM) sushain: .awik Руководство по созданию новой языковой пары
(3:14:34 PM) begiak: "В этом руководстве описывается порядок создания новой языковой пары для системы машинного перевода Apertium.От вас не требуются какие-либо лингвистические знания или знания по машинному переводу, кроме как способности различать части речи (отлич

The expected output is:

(3:14:31 PM) sushain: .awik Руководство по созданию новой языковой пары
(3:14:35 PM) sushain_begiak: "В этом руководстве описывается порядок создания новой языковой пары для системы машинного перевода Apertium.От вас не требуются какие-либо лингвистические знания или [...] - http://wiki.apertium.org/wiki/Руководство_по_созданию_новой_языковой_пары

Source: The excerpt is truncated incorrectly because the code does not handle Unicode strings properly. IRC messages are limited by their length in bytes, not in characters. The faulty code decides whether to truncate by checking if the UTF-8 encoded excerpt is longer than the IRC message length minus the link's length, but it then truncates by character count rather than by UTF-8 byte count, so an excerpt containing multi-byte characters can still exceed the limit. A similar mistake occurs when calculating the space left for the excerpt: the length of the URL is also measured in characters instead of bytes.
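The gap between character length and byte length can be checked directly. The snippet below is only an illustration using the article title from the example command; the variable names are made up for this sketch and do not come from the begiak source.

```python
# Illustration only: title taken from the example command above.
title = "Руководство по созданию новой языковой пары"

char_len = len(title)                  # number of Unicode code points
byte_len = len(title.encode("utf-8"))  # number of bytes actually sent over IRC

# Each Cyrillic letter takes two bytes in UTF-8, each space takes one,
# so the byte length is well above the character length.
print(char_len, byte_len)
```

A limit applied to char_len therefore lets through roughly twice as many bytes as intended for Cyrillic text.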

Solution: Use the length of the UTF-8 encoded string for all calculations, and truncate the UTF-8 bytes rather than the characters. If the cut falls in the middle of a multi-byte character, the incomplete byte sequence left at the end of the string is dropped when decoding. The changes made:

maxlength = 430 - len(' - ' + wikiuri % (format_term_display(term))) becomes maxlength = 430 - len((' - ' + wikiuri % (format_term_display(term))).encode('utf-8'))

sentence = sentence[:maxlength] becomes sentence = sentence.encode('utf-8')[:maxlength].decode('utf-8', 'ignore')
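The truncation step can be sketched as a standalone function. This is a minimal sketch assuming Python 3 strings; truncate_utf8 and its parameters are hypothetical names for illustration and are not part of the begiak module.

```python
def truncate_utf8(text: str, max_bytes: int) -> str:
    """Cut text so that its UTF-8 encoding is at most max_bytes long.

    Hypothetical helper for illustration; not part of the begiak source.
    """
    max_bytes = max(max_bytes, 0)  # treat a negative budget as zero
    encoded = text.encode("utf-8")
    if len(encoded) <= max_bytes:
        return text
    # Slice the bytes, then drop any partial multi-byte sequence at the end.
    return encoded[:max_bytes].decode("utf-8", "ignore")


excerpt = "В этом руководстве описывается порядок создания"
short = truncate_utf8(excerpt, 10)
print(short)
```

With a budget of 10 bytes the cut lands inside the two-byte letter "м", so that letter is dropped entirely and the result stays within the limit.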

Test:

Wishlist

  • Search the entire Apertium wiki using the provided search functionality of MediaWiki.