User:Popcorndude

Hi! I'm Daniel. IRC is generally the fastest way to contact me. I'm usually in US central or eastern time (UTC-6 to UTC-4), but I read the logs, so leave messages whenever.

The rest of this page is my project list. Feel free to steal ideas from it, especially if you want to collaborate.

Things I (think I) know how to do

Modernizing old pairs

Some of the old pairs are a mess. There's some monolingualizing to be done, random files are missing, and there are workarounds to missing features that now exist. Also the READMEs are terrible. Basically the plan is to make most things look more like what Apertium-init generates.

Nicer UI for contributing

See Ideas_for_Google_Summer_of_Code/Bidix_lookup_and_maintenance. I'm much less certain about similar contributions to monodix. For transfer, though, you could probably make something that shows the tree that gets built and then offers a drag-and-drop interface to fix errors.

Automate transition to -separable

Some monodixes have slightly horrifying multiwords in them, such as this one:

<e lm="you can lead a horse to water but you can't make it drink"><i>you<b/>can<b/>lead<b/>a<b/>horse<b/>to<b/>water<b/>but<b/>you<b/>can't<b/>make<b/>it<b/>drink</i><par n="hello__ij"/></e>

It shouldn't be too hard to extract the multiwords from a monodix and convert them to -separable entries. Because they're in the monodix, they must be contiguous, so no information is lost.
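
A rough sketch of the extraction half, in Python. It only finds entries whose left side contains <b/> and splits them into words; it does not write -separable entries. The function names are made up for illustration, and it assumes the left side is either a plain <i> or the <l> of a <p>, with any other inline elements ignored.

```python
import xml.etree.ElementTree as ET

# Tiny stand-in for a monodix: one multiword entry and one ordinary entry.
SAMPLE_DIX = """<dictionary><section id="main" type="standard">
<e lm="you can lead a horse to water but you can't make it drink"><i>you<b/>can<b/>lead<b/>a<b/>horse<b/>to<b/>water<b/>but<b/>you<b/>can't<b/>make<b/>it<b/>drink</i><par n="hello__ij"/></e>
<e lm="cat"><i>cat</i><par n="house__n"/></e>
</section></dictionary>"""


def multiword_tokens(side):
    """Split the text of an <i> or <l> element into words at each <b/>."""
    words = []
    current = side.text or ""
    for child in side:
        if child.tag == "b":
            # <b/> marks a word boundary: close the current word.
            words.append(current)
            current = child.tail or ""
        else:
            # Assumption: other inline elements are skipped, keeping only
            # the text that follows them.
            current += child.tail or ""
    words.append(current)
    return words


def extract_multiwords(dix_xml):
    """Return (lemma, words, paradigm) for every entry containing <b/>."""
    root = ET.fromstring(dix_xml)
    found = []
    for entry in root.iter("e"):
        side = entry.find("i")
        if side is None:
            side = entry.find("p/l")
        if side is None or side.find("b") is None:
            continue  # not a multiword (or a shape this sketch ignores)
        par = entry.find("par")
        par_name = par.get("n") if par is not None else None
        found.append((entry.get("lm"), multiword_tokens(side), par_name))
    return found


if __name__ == "__main__":
    for lemma, words, par_name in extract_multiwords(SAMPLE_DIX):
        print(par_name, words)
```

Each tuple keeps the paradigm name alongside the word list, since a converter would presumably need both to build the replacement entry.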

Updating documentation

This should be reasonably self-explanatory.

UD parsing in -recursive

DET @det:3 ADJ @amod:3 NOUN
^the<det>$ ^green<adj>$ ^dragon<n>$
-> ^the<det>$ ^dragon<n><@@amod>{^green<adj><@amod>$ ^dragon<n>$}$
-> ^dragon<n><@@amod><@@det>{^the<det><@det>$ ^green<adj><@amod>$ ^dragon<n>$}$

Tricky things: non-projective stuff? transfer?

(The doubled-@ tags on the head keep track of what has already been attached, e.g. if you need a rule that applies to a noun with case marking.)
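
To make the notation above concrete, here is a small Python sketch that reproduces the two reduction steps as string manipulation. It is not how -recursive works internally; the Node class and attach function are hypothetical, and it assumes dependents are single words that precede their head.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    lemma: str
    tags: list
    # Rendered child words; empty for a plain word that has absorbed nothing.
    children: list = field(default_factory=list)

    def render(self):
        head = "^" + self.lemma + "".join(f"<{t}>" for t in self.tags)
        if not self.children:
            return head + "$"
        return head + "{" + " ".join(self.children) + "}$"


def attach(dep, head, rel):
    """Attach a preceding single-word dependent to head with relation rel.

    The dependent gets <@rel>; the head gets <@@rel> so later rules can see
    what it has already absorbed. Children stay flat, as in the example.
    """
    dep_word = Node(dep.lemma, dep.tags + ["@" + rel]).render()
    # A bare head becomes the last child of its own chunk.
    head_children = head.children or [Node(head.lemma, head.tags).render()]
    return Node(head.lemma, head.tags + ["@@" + rel], [dep_word] + head_children)


the = Node("the", ["det"])
green = Node("green", ["adj"])
dragon = Node("dragon", ["n"])

step1 = attach(green, dragon, "amod")
step2 = attach(the, step1, "det")

if __name__ == "__main__":
    print(the.render(), step1.render())
    print(step2.render())
```

The two printed lines match the second and third lines of the example above. Non-projective attachments would need something other than a flat, ordered child list, which is part of why they are tricky.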

Unicode everywhere

Things I don't know how to do

Learning transfer rules from small corpora

Given a syntactic parser for one language and a fairly small parallel corpus it seems like it should be possible to learn decent transfer rules. (This turns out to be a lot harder than I thought.)

Import data from FieldWorks

SIL FieldWorks processes things like lexical and morphological data. It might be possible to take data from it and build a transducer.

Learn morphology from small corpora

Most of the things on this page are components of the translation memory idea, which will need to be able to learn some amount of morphology as it goes, though I currently have very little idea how to do that.

Translation Memory ++

A translation memory remembers phrases as you translate so you don't have to translate them again. This idea would be like that but would build an Apertium pair rather than just storing phrases. It could give you a draft of one page, and then improve the draft of the next page based on your postedits.

Rule-based semantics

If we used -recursive's tree output as the input to some sort of semantics system, could we do anything interesting with information extraction?

Things I've already done

Apertium-recursive

I really don't like XML or finite-state chunking, hence the new transfer module.

Relevant Pages

Automated_extraction_of_lexical_resources