Latest revision as of 18:06, 23 December 2022
Hi! I'm Daniel. IRC is generally the fastest way to contact me. I'm usually in US Central or Eastern time (UTC-4 to UTC-6), but I read the logs, so leave messages whenever.
The rest of this page is my project list. Feel free to steal ideas from it, especially if you want to collaborate.
Also, because prioritizing is hard, User:Popcorndude/What_should_Daniel_work_on exists so that other people can vote on what order I do this stuff in.
Things I (think I) know how to do
Modernizing old pairs
Some of the old pairs are a mess. There's some monolingualizing to be done, random files are missing, and there are workarounds for features that were missing but now exist. Also, the READMEs are terrible. Basically the plan is to make most things look more like what apertium-init generates.
Nicer UI for contributing
See Ideas_for_Google_Summer_of_Code/Bidix_lookup_and_maintenance. I'm much less certain about similar contributions to monodix. For transfer, though, you could probably make something that shows the tree that gets built and then have a drag-and-drop interface for fixing errors.
Automate transition to -separable
Some monodixes have slightly horrifying multiwords in them, such as this one:
<e lm="you can lead a horse to water but you can't make it drink"><i>you<b/>can<b/>lead<b/>a<b/>horse<b/>to<b/>water<b/>but<b/>you<b/>can't<b/>make<b/>it<b/>drink</i><par n="hello__ij"/></e>
It shouldn't be too hard to extract the multiwords from a monodix and convert them to -separable entries. The fact that they're in the monodix means they're contiguous so there's no information lost.
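As a rough sketch of the extraction step (assuming entries look like the <e>/<i>/<b/> shape above; anything fancier is skipped, and real dictionaries have many more entry variants):

```python
# Sketch: pull space-containing multiwords out of a monodix so they can
# be re-emitted as -separable entries.  Only handles the plain
# <e><i>tok<b/>tok...</i></e> shape shown in the example above.
import xml.etree.ElementTree as ET

def multiword_tokens(entry):
    """Return the token list of a monodix <e> if it is a contiguous
    multiword (has <b/> blanks inside <i>), else None."""
    i = entry.find('i')
    if i is None:
        return None
    tokens = []
    if i.text:
        tokens.append(i.text)
    for child in i:
        if child.tag != 'b':
            return None  # give up on anything fancier than plain blanks
        if child.tail:
            tokens.append(child.tail)
    return tokens if len(tokens) > 1 else None

xml = ('<dictionary><section><e lm="lead a horse to water">'
       '<i>lead<b/>a<b/>horse<b/>to<b/>water</i>'
       '<par n="hello__ij"/></e></section></dictionary>')
for e in ET.fromstring(xml).iter('e'):
    toks = multiword_tokens(e)
    if toks:
        print(toks)  # ['lead', 'a', 'horse', 'to', 'water']
```

From the token list, writing out the corresponding lsx pattern is mostly string templating.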
Updating documentation
This should be reasonably self-explanatory.
UD parsing
See below for notes on doing this with rtx, but really it would probably be easier to do it separately.
UD parsing in -recursive

DET @det:3 ADJ @amod:3 NOUN

^the<det>$ ^green<adj>$ ^dragon<n>$

-> ^the<det>$ ^dragon<n><@@amod>{^green<adj><@amod>$ ^dragon<n>$}$

-> ^dragon<n><@@amod><@@det>{^the<det><@det>$ ^green<adj><@amod>$ ^dragon<n>$}$
Tricky things: non-projective stuff? transfer?
(The double-at tags keep track of, e.g., whether you need a rule that applies to a noun with case marking.)
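The derivation above, as a toy rewrite function (the relation triples and chunk format just mirror the example; this is not real apertium-recursive syntax):

```python
# Toy version of the rewrite step illustrated above: attach the
# non-head LUs to the head with @rel tags, mark the head with a
# double-at tag per incoming relation, and wrap the span in {...}.

def apply_rule(lus, rels):
    """lus: list of LU strings like '^the<det>$' (the matched span).
    rels: list of (dep_index, relation, head_index), 0-based, in
    application order.  Returns one chunk string shaped like the final
    line of the derivation above."""
    head_idx = rels[0][2]
    children = list(lus)
    head_tags = ''
    for dep, rel, _head in rels:
        children[dep] = children[dep][:-1] + '<@' + rel + '>$'
        head_tags += '<@@' + rel + '>'
    head_core = lus[head_idx][:-1]  # strip trailing '$'
    return head_core + head_tags + '{' + ' '.join(children) + '}$'

chunk = apply_rule(
    ['^the<det>$', '^green<adj>$', '^dragon<n>$'],
    [(1, 'amod', 2), (0, 'det', 2)])
print(chunk)
# ^dragon<n><@@amod><@@det>{^the<det><@det>$ ^green<adj><@amod>$ ^dragon<n>$}$
```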
Transfer system based on UD
Basically a system for writing tree/graph rewrite operations for the handful of pair-specific things that remain after implementing a UD parser and un-parser.
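A toy illustration of what one such rewrite rule might do (the node format and the rule itself are invented for the example; the real system would need a proper rule language):

```python
# Toy tree-rewrite rule: move amod dependents from before the head to
# after it, e.g. for an English->Romance pair.  A node is a dict with a
# surface form and a list of (relation, child, 'pre'|'post') deps.

def linearize(node):
    pre = [linearize(c) for r, c, pos in node['deps'] if pos == 'pre']
    post = [linearize(c) for r, c, pos in node['deps'] if pos == 'post']
    return ' '.join(pre + [node['form']] + post)

def move_amod_after_head(node):
    """Recursively flip every amod dependent to post-head position."""
    node['deps'] = [(r, c, 'post' if r == 'amod' else pos)
                    for r, c, pos in node['deps']]
    for _r, c, _pos in node['deps']:
        move_amod_after_head(c)
    return node

tree = {'form': 'dragón', 'deps': [
    ('det', {'form': 'el', 'deps': []}, 'pre'),
    ('amod', {'form': 'verde', 'deps': []}, 'pre')]}
print(linearize(move_amod_after_head(tree)))  # el dragón verde
```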
UD un-parsing
Some sort of language model that takes a list of LUs and dependency relations and determines the appropriate order (and agreement info?) for them, possibly taking source order into account. In conjunction with a UD parser, this should produce at least somewhat functional transfer between any pair of languages with a bidix.
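A toy version of the ordering step (the bigram table is invented, and exhaustive permutation search obviously doesn't scale; a real system would need an actual model, beam search, and agreement generation):

```python
# Toy linearizer: score every permutation of the LUs with POS-bigram
# counts from a (made-up) table and keep the best-scoring order.
from itertools import permutations

BIGRAMS = {('det', 'adj'): 5, ('adj', 'n'): 7, ('det', 'n'): 6,
           ('n', 'adj'): 1}

def best_order(lus):
    """lus: list of (lemma, pos) pairs.  Return the highest-scoring
    ordering as a tuple."""
    def score(seq):
        return sum(BIGRAMS.get((a[1], b[1]), 0)
                   for a, b in zip(seq, seq[1:]))
    return max(permutations(lus), key=score)

print(best_order([('dragon', 'n'), ('the', 'det'), ('green', 'adj')]))
# (('the', 'det'), ('green', 'adj'), ('dragon', 'n'))
```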
Rule-Explainer
Using https://github.com/tree-sitter/tree-sitter/tree/master/lib/binding_web it should be possible to make a web page that parses a rule file and explains what each piece is doing (bonus points for live updating and hovering over code or explanation to highlight the relevant section of the other one). Would make a nice teaching and debugging tool.
Unordered Generation
Something that operates like lt-proc -g but tries all tags in the input at all possible points so that it doesn't matter where they're supposed to be (start, end, middle, reverse order, ...) as long as the right set is present. Probably helpful for the prefix-tags debate.
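A toy of the idea (the "transducer" here is just a dict keyed on a tag multiset; lt-proc works nothing like this internally):

```python
# Toy order-insensitive generator: key lookups on the lemma plus the
# *multiset* of tags, so '^dragon<n><pl>$' and '^dragon<pl><n>$'
# generate the same surface form.
from collections import Counter

def analysis_key(analysis):
    """'^dragon<pl><n>$' -> ('dragon', frozenset of (tag, count))."""
    body = analysis.strip('^$')
    lemma, _, tags = body.partition('<')
    tag_list = (('<' + tags).replace('><', ' ').strip('<>').split()
                if tags else [])
    return (lemma, frozenset(Counter(tag_list).items()))

FORMS = {analysis_key('^dragon<n><pl>$'): 'dragons'}

def generate(analysis):
    # fall back to Apertium-style '#' marking on generation failure
    return FORMS.get(analysis_key(analysis), '#' + analysis)

print(generate('^dragon<pl><n>$'))  # dragons, despite reversed tag order
```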
Things I don't know how to do
Better tokenization
Multiwords and space-less orthographies.
- https://github.com/apertium/organisation/issues/24
- https://github.com/chanlon1/tokenisation
- User:Francis_Tyers/Apertium_4
- Rethinking_tokenisation_in_the_pipeline
Learning transfer rules from small corpora
Given a syntactic parser for one language and a fairly small parallel corpus it seems like it should be possible to learn decent transfer rules. (This turns out to be a lot harder than I thought.)
Import data from FieldWorks
SIL FieldWorks processes things like lexical and morphological data. It might be possible to take data from it and build a transducer.
Learn morphology from small corpora
Most of the things on this page are components of the translation memory idea, which will need to be able to learn some amount of morphology as it goes, though I currently have very little idea how to do that.
Translation Memory ++
A translation memory remembers phrases as you translate so you don't have to translate them again. This idea would be like that but would build an Apertium pair rather than just storing phrases. It could give you a draft of one page, and then improve the draft of the next page based on your postedits.
Rule-based semantics
If we used -recursive's tree output as the input to some sort of semantics system, could we do anything interesting with information extraction?
See rule effect
A tool that takes a rule in any module and finds every line in a corpus where it would apply.
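A sketch of the corpus-scanning half, assuming the rule's pattern has already been reduced to a per-LU tag-prefix test (turning an arbitrary rule into such a test is the actually hard part, elided here):

```python
# Sketch: find every line of an analysed corpus containing a window of
# consecutive LUs whose leading tags match the rule's pattern.
import re

def lines_where_rule_applies(pattern_tags, analysed_lines):
    """pattern_tags: one tag-prefix list per LU the rule matches,
    e.g. [['det'], ['n']].  Yields each line with a matching window."""
    lu_re = re.compile(r'\^[^$]*\$')
    for line in analysed_lines:
        lus = lu_re.findall(line)
        tags = [re.findall(r'<([^>]+)>', lu) for lu in lus]
        n = len(pattern_tags)
        for i in range(len(lus) - n + 1):
            if all(tags[i + j][:len(p)] == p
                   for j, p in enumerate(pattern_tags)):
                yield line
                break

corpus = ['^the<det>$ ^dragon<n>$', '^dragon<n>$ ^sleeps<vblex><pri>$']
print(list(lines_where_rule_applies([['det'], ['n']], corpus)))
# ['^the<det>$ ^dragon<n>$']
```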
Things I've already done
Apertium-recursive
I really don't like XML or finite-state chunking, hence the new transfer module.
Unicode everywhere
Pretty much all the main modules should handle non-ASCII characters correctly now.
Using -separable for Postgen
Have lsx-proc -p read things like ^lemma<tags>/surface$ and output the same, with LUs in the rules matching either surface or lemma<tags>/surface (with a literal /; this shouldn't require any updates to the compiler). Might have to restrict it to not adding or deleting anything, but not sure.
See Postgenerator
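A sketch of the proposed matching behaviour (illustrative only; not how lsx-proc actually tokenizes or matches):

```python
# Sketch: in postgen mode an LU carries both analysis and surface,
# '^lemma<tags>/surface$', and a rule pattern may name either the
# surface alone or the whole body with a literal '/'.

def lu_matches(pattern, lu):
    """lu: '^lemma<tags>/surface$'.  True if pattern equals the surface
    form or the full 'lemma<tags>/surface' body."""
    body = lu.strip('^$')
    surface = body.rpartition('/')[2]
    return pattern == surface or pattern == body

print(lu_matches('cats', '^cat<n><pl>/cats$'))             # True
print(lu_matches('cat<n><pl>/cats', '^cat<n><pl>/cats$'))  # True
print(lu_matches('cat<n><pl>', '^cat<n><pl>/cats$'))       # False
```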
Capitalization Restoration
Move capitalization handling out of transfer and into a separate post-processor. See Capitalization restoration.
Relevant Pages
Automated_extraction_of_lexical_resources