User:ScoopGracie
Revision as of 00:55, 26 May 2025 by ScoopGracie (talk | contribs)
Bible translation discussion
[20:10:01] <kj7rrv> popcorndude_m[m] hello! This is Samuel Sloniker. I created the pair directory with `apertium-init apertium-grc-eng --a1=hfst --a2=lttoolbox`. Is that correct?
[20:25:09] *** Quits: Fiji (~Fiji@0002bc5a.user.oftc.net) (Ping timeout: 480 seconds)
[21:12:51] <dangswan> looks right (assuming you want t*x translation rather than rtx or something more experimental)
[21:21:03] *** Joins: Fiji (~Fiji@0002bc5a.user.oftc.net)
[21:31:34] <kj7rrv> popcorndude_m[m]: thank you! rtx would be recursive transfer, right?
[21:31:47] <kj7rrv> Would that be better for handling the free word order of Greek?
[21:34:37] <dangswan> honestly they both sound terrible for handling free word order
[21:36:01] <dangswan> I'm working on transfer using dependency syntax in constraint grammar, which I think is more reasonable
[21:36:19] <dangswan> but also, what is it you want to translate?
[21:37:34] <dangswan> if you want to do the NT, you can start from the UD treebank rather than the raw text, and then you don't have to write a parser
[21:37:45] <kj7rrv> My eventual goal is to translate the New Testament into languages that don't have an existing translation, but I'm targeting English first to learn the process with a language that I know.
[21:38:29] <kj7rrv> Is "UD" "Universal Dependencies"? I've heard of that before, but I don't really understand what it is.
[21:38:35] <dangswan> replace "New Testament" with "Old Testament (Hebrew)" and you have my dissertation
[21:39:27] <dangswan> Universal Dependencies is a syntactic annotation framework where each word has a relation to some other word in the sentence
[21:41:40] <dangswan> so in εν αρχη ην ο λογος, we say that εν is @case of αρχη, αρχη is @obl (oblique) of λογος, ην is @cop of λογος, ο is @det of λογος, and λογος is the @root of the sentence
[21:42:08] <kj7rrv> Long-term, I would be interested in working with the OT as well, but for now I'm focusing on the NT.
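The relations dangswan lists above for εν αρχη ην ο λογος can be written out as a small table of (word, relation, head) entries. The Python below is only an illustrative sketch: the `Token` class and the helper names are made up for this example, and the heads are exactly the ones stated in the chat rather than values read from the treebank.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Token:
    form: str            # surface word
    deprel: str          # relation to the head, e.g. "case"
    head: Optional[str]  # form of the head word; None marks the root


# Relations as stated in the chat for "εν αρχη ην ο λογος".
# Heads are referenced by form, which only works because every
# form in this short sentence is unique.
SENTENCE = [
    Token("εν", "case", "αρχη"),
    Token("αρχη", "obl", "λογος"),
    Token("ην", "cop", "λογος"),
    Token("ο", "det", "λογος"),
    Token("λογος", "root", None),
]


def root(tokens: List[Token]) -> Token:
    """Return the single token that has no head."""
    roots = [t for t in tokens if t.head is None]
    if len(roots) != 1:
        raise ValueError("expected exactly one root")
    return roots[0]


def dependents(tokens: List[Token], head_form: str) -> List[str]:
    """Return the forms attached directly to head_form, in sentence order."""
    return [t.form for t in tokens if t.head == head_form]


if __name__ == "__main__":
    print(root(SENTENCE).form)
    print(dependents(SENTENCE, "λογος"))
```

Because every word points at its head, word order carries no structural information here, which is why this representation suits a free-word-order language.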
[21:42:08] <kj7rrv> Is there an NT edition that is pre-annotated with UD?
[21:42:28] <kj7rrv> Oh, that sounds very helpful in translation!
[21:42:30] <dangswan> yep
[21:42:43] <dangswan> https://github.com/UniversalDependencies/UD_Ancient_Greek-PROIEL
[21:42:44] <begiak> [ GitHub - UniversalDependencies/UD_Ancient_Greek-PROIEL: Ancient Greek data from the PROIEL project. ]
[21:43:37] <dangswan> if your goal is translation rather than ML training, then the sentences are a bit scattered (though not too horrendously)
[21:43:54] <kj7rrv> So that would avoid the need to apply Colwell's Rule in lexical transfer in order to correctly translate θεος ην ο λογος as "the Word was God"?
[21:44:55] <kj7rrv> s/lexical transfer/structural transfer
[21:45:12] <dangswan> remind me what Colwell's rule states?
[21:46:22] <dangswan> (also I'm emailing you my dissertation proposal, which I think you'll find interesting in this regard)
[21:47:43] <kj7rrv> It's the rule that when there are two nouns in the nominative case with a copula, and one is articular and the other is anarthrous, the articular one is the subject and the anarthrous one is the predicate nominative, thus θεος ην ο λογος is "the Word was God" and not "God was the Word."
[21:47:44] <kj7rrv> Thank you!
[21:48:12] <dangswan> yes, this avoids that entirely
[21:50:13] <dangswan> essentially, take the apertium pipeline and split structural transfer into 2 pieces: parsing and transfer proper, with parsing moved to before lexical transfer
[21:50:32] <dangswan> we then start the pipeline at lexical transfer rather than at morphological analysis
[21:50:51] <kj7rrv> That definitely sounds useful! So all of the part-of-speech tagging, chunking, etc. is already done, and a translation system for the NT just needs to do lexical transfer, structural construction (not sure if there's an established term for that), and morphological generation?
[21:51:05] <dangswan> precisely
[21:51:37] <kj7rrv> Thank you! Are Apertium's structural transfer tools suitable for this?
[21:52:09] <dangswan> Constraint Grammar is far more powerful than what we use it for and can handle this with only minor adjustments
[21:53:09] <kj7rrv> So CG would be used for structural transfer, an Apertium bidix for lexical transfer, and an Apertium monodix for morphological generation?
[21:53:17] <dangswan> yes
[21:53:41] <dangswan> the proposal I sent you has updated pipeline diagrams (Question 2)
[21:56:27] *** Parts: eevvoor (230f52442d@jabberfr.org) (Error from remote client)
[21:56:43] <kj7rrv> Thank you! That definitely helps to clarify it. The only step I don't understand is linearization. Is that conversion from some kind of TL parse tree into a sequence of words suitable for piping into `apertium ???-gener`?
[21:58:16] <dangswan> oh, I forgot that part of the split
[21:58:51] <dangswan> so actually I split transfer into 3 pieces, where structural transfer changes what words are present and how they are related and then linearization puts them in the target-language order
[21:59:41] <dangswan> in part this makes it easier for me to reason about how the rules interact and in part it makes more of the system monolingual and thus reusable
[22:00:30] <kj7rrv> Does that mean it puts more on the source or target language side?
[22:03:09] <dangswan> so instead of having grc-eng.rtx which does all syntax, we have grc.cg3 that parses, grc-eng.cg3 that edits the tree, and eng.llx that puts the words in English order, and two of those files are in the language repos so we can use them again for grc-spa and hbo-eng
[22:04:13] <dangswan> (departing for supper now - will return eventually)
[22:05:38] <kj7rrv> Thank you for the help! I'll post any more questions I have; there's no need to hurry to reply.
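The three-way split dangswan describes above (parse, edit the tree, linearize) and the reuse it buys can be made concrete with a few lines of Python. This is a sketch of the file-naming scheme only, using the file names given in the chat; the function names are invented for illustration and nothing here runs Apertium or CG.

```python
from typing import Dict, Set, Tuple


def pipeline_files(src: str, tgt: str) -> Dict[str, str]:
    """Map each of the three stages to the rule file named in the chat."""
    return {
        "parsing": f"{src}.cg3",                    # monolingual, source side
        "structural transfer": f"{src}-{tgt}.cg3",  # specific to this pair
        "linearization": f"{tgt}.llx",              # monolingual, target side
    }


def shared_files(pair_a: Tuple[str, str], pair_b: Tuple[str, str]) -> Set[str]:
    """Return the rule files two language pairs have in common."""
    files_a = set(pipeline_files(*pair_a).values())
    files_b = set(pipeline_files(*pair_b).values())
    return files_a & files_b


if __name__ == "__main__":
    print(pipeline_files("grc", "eng"))
    print(shared_files(("grc", "eng"), ("grc", "spa")))  # source parser is reused
    print(shared_files(("grc", "eng"), ("hbo", "eng")))  # English linearizer is reused
```

Only the middle file depends on both languages, which is the sense in which the split makes more of the system monolingual.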
[22:26:40] <kj7rrv> "The tree is then passed to the main focus of this section, the structural transfer module, which selects translations for words that have more than one and makes adjustments to the tree structure as needed." Would it be feasible to do word-sense disambiguation on the UD parse tree, and then write rules such as "Greek παρακαλεω sense 1 becomes English 'encourage'; sense 2 becomes 'urge'..."?
[22:27:24] <kj7rrv> That would require substantial work once, but would be applicable to all languages.
[22:31:42] <dangswan> yes, that would be a reasonable undertaking
[22:32:05] <dangswan> and in fact, my Hebrew trees do have such data in them (though I may have forgotten to mention it in the proposal)
[22:36:56] <kj7rrv> Thank you! I'm only about halfway through; I might have just not gotten to that point yet.
[22:38:58] <dangswan> if it comes up, it's a single sentence in Question 1
[22:39:21] <dangswan> one of my data sources just happens to have such annotations
[22:40:36] <kj7rrv> Oh, okay, thank you! Would using that in translation require a change in the bidix format?
[22:41:45] <dangswan> no, it would just be an extra tag
[22:43:20] <kj7rrv> Something like this (although probably less verbose)?
[22:43:21] <kj7rrv> `<e><p><l>παρακαλεω<s n="vblex"/><s n="sense-comfort"/></l><r>comfort<s n="vblex"/></r></p></e>`
[22:45:03] <kj7rrv> (sorry about all the edits; I forgot that IRC doesn't handle them well)
[22:46:48] <dangswan> the Hebrew ones are IDs into a dictionary, so something like `<s n="SDBH:006653001001000"/>` except that that would bloat the symbol list horrendously, so what I might actually do is use the tags in lexical selection instead
[22:47:05] <dangswan> (that also places fewer restrictions on where they appear)
[22:49:06] <kj7rrv> That makes sense! So the bidix would just list "comfort" and "encourage" as translations of παρακαλεω, and lexical selection would use the sense tags to disambiguate?
[22:49:28] <dangswan> yep
[22:52:00] <kj7rrv> Thank you! Would that use `apertium-lex-tools` for lexical selection?
[22:53:03] <dangswan> it could, or it could do selection in CG where it would have access to the tree structure in addition to the linear order
[22:55:40] <kj7rrv> Oh, okay! Does CG use the Apertium pipeline format, or does it have its own format that requires conversion?
[22:56:49] <dangswan> .rlx files for morphological disambiguation are CG
[22:58:00] <kj7rrv> Oh, I was referring to the stream format that is piped between Apertium tools in the pipeline. Does CG support that?
[22:59:05] <dangswan> yes, CG supports Apertium format
[23:00:25] <dangswan> though CG's native format is often more readable
[23:03:21] <kj7rrv> Since there would still be some Apertium tools used, wouldn't at least part of the pipeline need to use Apertium format?
[23:06:03] <dangswan> yeah, the whole pipeline would probably be in Apertium format, the note about readability is largely tangential
[23:07:13] <dangswan> but if you run a -disam mode, it will often output in CG, because the CG format is more expressive than Apertium and can display things like deleted readings
[23:10:05] <kj7rrv> Oh, okay. If adding a new language reveals an insufficiency in the sense disambiguations in the trees (e.g. suppose "urge" is one of the senses used of παρακαλεω, but someone tries to create a translation pair to translate the NT to a language that has different words for urging someone to do or not to do something), would there be a feasible way to fix that? I would think that another entry could be added to the lexical selection rules in every pair, and each word with that sense tag in the trees could be reviewed; that sounds doable but rather time-consuming. Could at least the former be automated?
[23:14:49] <dangswan> for the moment, I'm taking sense data as about what distinctions are made by the source language, and finer distinctions can be made either based on context or it can be left for the category of things that the Apertium system doesn't handle and the human translator corrects (and there will always be some of those)
[23:16:45] <dangswan> long-term, there could be a second layer of disambiguation in a second tag which would then be available when needed but not interfere with existing systems that only used the first tag
[23:19:18] <kj7rrv> How do you determine what distinctions are made by the source language among senses of one word? Does that focus only on senses that are so divergent that they are essentially different words (e.g. "lead" as a verb meaning to cause to follow or as a noun referring to a heavy metal)?
[23:20:21] <dangswan> the Hebrew data is copied from an existing source, and I haven't looked at it closely
[23:20:59] <dangswan> but my guess would be that you could draw a line if there's a decently clean separation of what other words it appears with
[23:21:33] <dangswan> or if there's a different pattern of what arguments the verb takes
[23:22:43] <dangswan> turns out the dictionary that my data comes from also has Greek, so you can see what they do: https://marble.bible/dictionary?s=002512000000000&db=Greek
[23:22:44] <begiak> [ Marble ]
[23:24:57] <kj7rrv> That makes sense! So in your system, there's nothing analogous to listing several different ways a word could be translated in a target language, like what Strong's Concordance gives (e.g. "παρακαλέω... to call near, i.e. invite, invoke (by imploration, exhortation, or consolation):— beseech, call for, (be of good) comfort, desire, exhort, (give) exhortation, entreat, pray.")?
[23:25:59] <kj7rrv> Oh, I see how it works! That definitely looks like it would work for most use cases.
[23:26:19] <dangswan> so for https://marble.bible/dictionary?s=000742000000000&db=Greek we have αρχη and we distinguish between the aspect-ish use, the time use, the relation use, etc
[23:27:12] <kj7rrv> Thank you! Some TLs might make finer distinctions than that dataset, but it seems that that should cover most use cases.
[23:28:18] <kj7rrv> Is the semantic dictionary libre?
[23:29:00] <dangswan> https://github.com/ubsicap/ubs-open-license/tree/main/dictionaries/greek CC-BY-SA
[23:29:34] <kj7rrv> Thank you!
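The sense-tag idea discussed above (the bidix lists every candidate translation, and a later selection step picks one using the sense tag carried by the tree) can be sketched in plain Python. This is not `apertium-lex-tools` or CG rule syntax. The `sense-comfort` tag comes from the XML example in the chat, while `sense-encourage`, the rule table, and the first-candidate fallback are assumptions made for illustration.

```python
from typing import Dict, List, Optional, Tuple

# Candidate translations the bidix would list (from the chat).
BIDIX: Dict[str, List[str]] = {
    "παρακαλεω": ["comfort", "encourage"],
}

# Hypothetical selection rules keyed on (lemma, sense tag).
SELECTION_RULES: Dict[Tuple[str, str], str] = {
    ("παρακαλεω", "sense-comfort"): "comfort",
    ("παρακαλεω", "sense-encourage"): "encourage",
}


def select(lemma: str, sense_tag: Optional[str] = None) -> str:
    """Pick one bidix candidate for lemma, guided by the sense tag if a rule matches."""
    candidates = BIDIX[lemma]
    choice = SELECTION_RULES.get((lemma, sense_tag))
    if choice in candidates:
        return choice
    # Assumed fallback: with no matching rule, keep the first bidix entry.
    return candidates[0]


if __name__ == "__main__":
    print(select("παρακαλεω", "sense-encourage"))
    print(select("παρακαλεω"))
```

A target language that needs a finer split than the source-side senses provide would add rules keyed on further context, which matches dangswan's suggestion of a second disambiguation layer.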
[23:30:33] <dangswan> and this repo has disambiguated text: https://github.com/clear-Bible/macula-greek?tab=readme-ov-file
[23:32:02] <kj7rrv> "Participant referents: Who is “he,” “she,” or “it” in this sentence?" That would avoid the need for anaphora resolution IIUC?
[23:33:12] <dangswan> yes, if we figured out how to incorporate that it would replace -anaphora
[23:33:39] *** Quits: Fiji (~Fiji@0002bc5a.user.oftc.net) (Ping timeout: 480 seconds)
[23:35:18] <kj7rrv> So it sounds like the main problem would be getting the parse trees from the macula-greek XML format to Apertium stream format, and the rest would be a fairly ordinary Apertium pair but with more use of CG than usual?
[23:36:47] <dangswan> what I would probably actually do is line up macula-greek with UD_Ancient_Greek-PROIEL and copy the data onto the UD trees and then pass that through the Apertium pipeline
[23:38:03] <kj7rrv> Oh, okay! Are the UD trees based on the SBL GNT as well?
[23:39:08] <dangswan> probably, I'll check
[23:40:47] <kj7rrv> I can check; I was just asking in case you already knew.
[23:41:25] <dangswan> https://github.com/proiel/proiel-treebank/ says Tischendorf 1869
[23:41:38] <kj7rrv> Oh, okay.
[23:43:20] <dangswan> so the lining up would be a matter of figuring out whether Nestle or SBL is closer to that and then adjusting the mappings as needed
[23:43:43] <kj7rrv> Since Tischendorf's text was a critical text, it shouldn't be too hard to adapt the UD trees to Nestle-Aland or SBL?
[23:44:53] <dangswan> I also haven't looked at the UD trees in detail and don't know how trustworthy they are
[23:45:25] <kj7rrv> It looks like MACULA has parse trees as well; could they be used instead?
[23:46:36] <dangswan> those would be much harder to fit into Apertium directly, but my Hebrew treebank is largely based on MACULA's Hebrew repo, so I know roughly what it takes to convert them
[23:48:42] <kj7rrv> Oh, okay. I'll look at the UD and MACULA trees as I have time and compare them; are there programs available to view them in a parse-tree format rather than text?
[23:50:00] <dangswan> UD has several - take whichever display format you like best on https://universaldependencies.org/tools.html
[23:50:01] <begiak> [ UD tools ]
[23:50:06] <dangswan> MACULA, I'm not aware of any
[23:53:21] <kj7rrv> Thank you! I need to go now, but I'll look at the parse trees and do some more research on how the pipeline would work. Thanks again for the help!
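For inspecting the UD trees as text before settling on a viewer, it may help to know that UD treebanks are distributed as CoNLL-U: one token per line, ten tab-separated columns (ID, FORM, LEMMA, UPOS, XPOS, FEATS, HEAD, DEPREL, DEPS, MISC), `#` comment lines, and a blank line between sentences. The reader below is a minimal sketch using only the standard library. The sample sentence is hand-built from the relations given earlier in the chat, with unknown columns left as `_`; it is not copied from the PROIEL files.

```python
from typing import Dict, List, Tuple


def read_conllu(text: str) -> List[List[Dict]]:
    """Parse CoNLL-U text into sentences, each a list of token dicts."""
    sentences, current = [], []
    for line in text.splitlines():
        if not line.strip():          # blank line ends a sentence
            if current:
                sentences.append(current)
                current = []
            continue
        if line.startswith("#"):      # sentence-level comment
            continue
        cols = line.split("\t")
        if len(cols) != 10:
            raise ValueError(f"expected 10 columns, got {len(cols)}")
        if not cols[0].isdigit():     # skip multiword ranges (1-2) and empty nodes (1.1)
            continue
        current.append({
            "id": int(cols[0]),
            "form": cols[1],
            "head": int(cols[6]),     # 0 means this token is the root
            "deprel": cols[7],
        })
    if current:
        sentences.append(current)
    return sentences


def edges(sentence: List[Dict]) -> List[Tuple[str, str, str]]:
    """Return (form, deprel, head form) triples, with "ROOT" for head 0."""
    by_id = {t["id"]: t["form"] for t in sentence}
    return [(t["form"], t["deprel"], by_id.get(t["head"], "ROOT")) for t in sentence]


# Hand-built sample; LEMMA, UPOS, XPOS, FEATS, DEPS and MISC are left unspecified.
ROWS = [(1, "εν", 2, "case"), (2, "αρχη", 5, "obl"), (3, "ην", 5, "cop"),
        (4, "ο", 5, "det"), (5, "λογος", 0, "root")]
SAMPLE = "# text = εν αρχη ην ο λογος\n" + "\n".join(
    "\t".join([str(i), form, "_", "_", "_", "_", str(head), rel, "_", "_"])
    for i, form, head, rel in ROWS
) + "\n"

if __name__ == "__main__":
    for form, deprel, head in edges(read_conllu(SAMPLE)[0]):
        print(f"{form} --{deprel}--> {head}")
```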