Corpus-based lexicalised definiteness

From Apertium

Languages do not always agree on definiteness, even when both express definiteness and indefiniteness by means of articles.

  • (ca) Al principi del 2009, el Grup de Treball de Nacions Unides sobre la Detenció Arbitrària va considerar que l'arrest domiciliari d'Aung San Suu Kyi és il·legal segons les lleis internacionals i les mateixes lleis birmanes.
  • (ca-en) At first of -the 2009, the Group of Work of United Nations on -the Arbitrary Arrest considered that the domiciliary arrest of Aung San Suu Kyi is illegal as -the international laws and -the same Burmese laws.
  • (en) At the beginning of 2009, the Working Group of +the United Nations on Arbitrary Detention considered that the house arrest of Aung San Suu Kyi is illegal according to international law and also Burmese laws.

Several methods could be used to solve these translation problems:

  • Years do not take an article in English, but they do in Catalan. If the Catalan has an article before the year and no preposition before it, put "in" in the English.
  • Other cases could be decided by corpus frequency, or by frequency combined with syntactic information.
    • international law:
      • In the English Wikipedia, we find "international law(s)" 563 times and "the international law(s)" 9 times. In all nine of those cases, "the international law" is either (a) used as a modifier, as in "the international law programme" or "the international law enforcement community", or (b) an error in the English, probably from a non-native writer: "In violation of the international law, they were tortured and executed".
      • In comparison, in the Catalan Wikipedia, we find "llei(s) internacional(s)" 28 times, and only twice without an article (or other determiner): "com a llei internacional" and "estan protegides per lleis internacionals".
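
Counts like the ones above could be gathered with a simple pattern search over a plain-text corpus. The following is a minimal sketch, not the method used to obtain the Wikipedia figures: the function name `count_article_usage`, the determiner lists and the sample sentences are all made up for illustration, and it only sees a determiner that stands immediately before the phrase.

```python
import re


def count_article_usage(text, phrase_pattern, determiners):
    """Count how often a phrase occurs with and without a determiner
    directly in front of it. Returns (with_det, without_det)."""
    det = "|".join(re.escape(d) for d in determiners)
    # Group 1 captures the determiner when one immediately precedes the phrase.
    pattern = re.compile(
        r"(?:\b(" + det + r")\s+)?\b(?:" + phrase_pattern + r")\b",
        re.IGNORECASE,
    )
    with_det = 0
    without_det = 0
    for match in pattern.finditer(text):
        if match.group(1):
            with_det += 1
        else:
            without_det += 1
    return with_det, without_det


# Hypothetical sample text, not taken from any corpus.
english = (
    "Under international law, this is illegal. "
    "In violation of the international law, they were tortured. "
    "International laws apply."
)
catalan = (
    "segons la llei internacional i les lleis internacionals, "
    "com a llei internacional"
)

en_counts = count_article_usage(english, r"international laws?", ["the"])
ca_counts = count_article_usage(
    catalan, r"lleis? internacionals?", ["el", "la", "els", "les"]
)
```

A real count would also need to handle other determiners and separate out modifier uses such as "the international law programme", which this sketch counts as ordinary occurrences with an article.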

So "international law" appears without an article 98.4% of the time in the English Wikipedia (563 of 572 occurrences), and 100% of the time if we exclude modifier uses and errors. In the Catalan Wikipedia, "llei internacional" appears with an article or other determiner about 93% of the time (26 of 28 occurrences). Surely we can make use of this information somehow.
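
One possible way to use these figures is to turn the counts into article rates and drop the article in translation when the source language nearly always has one and the target language nearly never does. The sketch below is only an illustration under assumptions: the function names and the 0.9 threshold are hypothetical, and the Catalan counts assume that the 28 occurrences include the two without an article.

```python
def article_rate(with_article, without_article):
    """Fraction of occurrences of a phrase that carry an article."""
    total = with_article + without_article
    if total == 0:
        raise ValueError("phrase not found in corpus")
    return with_article / total


def should_drop_article(source_rate, target_rate, threshold=0.9):
    """True when the source side usually has an article and the target
    side usually does not. The threshold is an arbitrary assumption."""
    return source_rate >= threshold and target_rate <= 1 - threshold


# English Wikipedia: 9 occurrences with "the", 563 without.
en_rate = article_rate(9, 563)
# Catalan Wikipedia: 28 occurrences, 2 of them without a determiner (assumed reading).
ca_rate = article_rate(26, 2)

# Translating Catalan to English: should "les lleis internacionals" lose its article?
drop = should_drop_article(ca_rate, en_rate)
```

With the counts above, `1 - en_rate` comes out at about 0.984 and `ca_rate` at about 0.929, so this rule would drop the article when translating from Catalan into English.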

After all, Google gets it right:

  • (G) In early 2009, the UN Working Group on Arbitrary Detention found that the house arrest of Aung San Suu Kyi is illegal under international law and the same laws Burmese.

Language pairs that could use this: Icelandic and English, English and Catalan, English and Spanish, Norwegian and English, Welsh and English, Basque and Spanish