As part of a bigger project I want to implement a machine translator from language A to language B. Since no available tool does machine translation for this language pair automatically, and the available corpus of language B is quite small, I am trying to do the following:

1. Given a sentence in language A, use a tool to get its set of language A PoS (Part-of-speech) tags.

2. The tool I am using for PoS tagging (FreeLing) does not return a parse tree, so I thought of building my own parse tree from the set of tags.

3. After the parse tree is completed, traverse it by levels (starting at the root) and reorder its elements according to the grammar rules of language B.

[Figure: graphical explanation]

After doing some research I found out about Earley parsing, whose ability to parse any context-free grammar caught my attention, because the grammar of language B might change over time and I cannot guarantee that it will always meet any specific criterion. However, given that my ultimate goal is structure transfer, I am not sure whether using a bottom-up parser and reordering the elements as I match them against the rules would give me better performance, or whether I am on the wrong path and my solution is wrong altogether.

ml-moron
Yukypack

2 Answers

Depending on the source language you are dealing with, FreeLing does provide a parse tree (e.g. for Spanish, English, Catalan, Portuguese...).

If parsing in your language is not supported by FreeLing, you can add it just by writing a grammar. FreeLing includes a CKY parser which will apply your grammar and give you the parse tree.

In this way, you could achieve step 2 "building my own parse tree from the set of tags".
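To make the idea concrete, here is a minimal sketch of CKY parsing over a sequence of PoS tags. It is not FreeLing's parser or its grammar format: the tag names, the toy rules in `BINARY_RULES`, and the function `cky_parse` are all hypothetical, and the grammar is assumed to be in Chomsky normal form with at most one parent per pair of children.

```python
from collections import defaultdict

# Hypothetical toy grammar in Chomsky normal form:
# (left child label, right child label) -> parent label.
BINARY_RULES = {
    ("DET", "NOUN"): "NP",
    ("VERB", "NP"): "VP",
    ("NP", "VP"): "S",
}


def cky_parse(tags, rules=BINARY_RULES, start="S"):
    """Return one parse tree covering all tags, or None if there is none.

    A tree is either a bare tag string (leaf) or a tuple
    (label, left_subtree, right_subtree).
    """
    n = len(tags)
    # chart[(i, j)] maps a label to one tree spanning tags[i:j].
    chart = defaultdict(dict)
    for i, tag in enumerate(tags):
        chart[(i, i + 1)][tag] = tag
    # Fill the chart bottom-up, from short spans to long ones.
    for width in range(2, n + 1):
        for i in range(n - width + 1):
            j = i + width
            for k in range(i + 1, j):  # split point
                for left, ltree in chart[(i, k)].items():
                    for right, rtree in chart[(k, j)].items():
                        parent = rules.get((left, right))
                        if parent and parent not in chart[(i, j)]:
                            chart[(i, j)][parent] = (parent, ltree, rtree)
    return chart[(0, n)].get(start)


tree = cky_parse(["DET", "NOUN", "VERB", "DET", "NOUN"])
print(tree)
```

A real grammar would need unary rules and a way to handle ambiguity (several parents or several trees per span), which this sketch leaves out on purpose.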

Regarding the transfer, I am not sure the best strategy is reordering on the fly. It is probably better to have the whole tree and perform the transfer afterwards.

If your goal is rule-based translation, you can have a look at the open-source translation platform https://www.apertium.org/
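As an illustration of doing the transfer on a complete tree, the following sketch walks the tree from the root and reorders the children of each node according to a rule table. The tree representation, the labels, and the rules in `REORDER_RULES` are assumptions made up for this example; they do not come from FreeLing or from any real language pair.

```python
# A node is (label, children); a leaf is (pos_tag, word).
# Hypothetical transfer rules: (parent label, source child labels)
# maps to the target order, given as indices into the source children.
REORDER_RULES = {
    ("NP", ("ADJ", "NOUN")): (1, 0),   # adjective moves after the noun
    ("VP", ("VERB", "NP")): (1, 0),    # object moves before the verb
}


def transfer(node, rules=REORDER_RULES):
    """Return a copy of the tree with children reordered top-down."""
    label, children = node
    if isinstance(children, str):      # leaf: nothing to reorder
        return node
    key = (label, tuple(child[0] for child in children))
    order = rules.get(key, range(len(children)))  # default: keep order
    return (label, [transfer(children[i], rules) for i in order])


def leaves(node):
    """Collect the words of the tree from left to right."""
    label, children = node
    if isinstance(children, str):
        return [children]
    return [word for child in children for word in leaves(child)]


source = ("S", [
    ("NP", [("ADJ", "red"), ("NOUN", "car")]),
    ("VP", [("VERB", "hits"),
            ("NP", [("ADJ", "big"), ("NOUN", "wall")])]),
])
print(leaves(transfer(source)))
```

Because the rules are applied only after parsing has finished, the rule table can be changed without touching the parser, which fits a target grammar that may change over time.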

Lluís Padró
  • I didn't notice that FreeLing includes a CKY parser that would apply any grammar I give to it, that's very nice! However, I am sticking with the web demo right now because I have been having trouble integrating the library with my C# project. I would be DEEPLY thankful if you know of any guide that could help me with that. – Yukypack Jul 28 '16 at 21:03
  • FreeLing can be called from several languages thanks to SWIG (http://swig.org). Adding C# to the list may be relatively easy (depending on how well C# is supported by SWIG). You can try to have a look at the "APIs" folder in the FreeLing tarball, and then adapt some of the Makefiles to C#. – Lluís Padró Sep 13 '16 at 13:37

If you are looking for the "best" algorithm for deriving a parse tree, then you should look at Parsey McParseface, an open-source solution recently released by Google. It is arguably state-of-the-art, and it has a really good literature overview in the README.

The issue with rule-based parsers or generic lexicon-based methods is that the accuracy you're going to see is really quite low. Generally, trying to use an unsupervised technique here is a shortcut that will cause your algorithm to fail in most cases with even slightly irregular grammar. In particular, if the grammar of your target language is likely to change over time, it probably has some general ambiguity, which you won't be able to account for using a rule-based system.

As for the generic bottom-up approach for restructuring your parse trees, it's hard to say whether that's the right solution or not. It's certainly a pretty typical approach for building parse trees, but the quality for transfer depends deeply on the domain you're working in, the size of your dataset, and the grammar structure of both languages. At the end of the day, one of the big drawbacks of machine learning is that nobody can tell you with any kind of certainty whether a new approach will work.

You have to give it a shot, assess the performance according to an appropriate metric, and then make changes to see if you improve your performance. Sadly, if the corpus you've got is very small, you're unlikely to get any kind of high-quality translation in an automated fashion, because there is just not enough signal. If you use the UN transcripts as a training set, though, you can at least validate your fundamental approach against the literature.
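As one example of such a metric, here is a small sketch of word error rate (WER): the word-level edit distance between a reference translation and the system output, divided by the reference length. WER is only one possible choice and is swapped in here for illustration; BLEU is more common in machine translation. The function name and the example sentences are made up.

```python
def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by the reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # prev[j] holds the distance between the reference words seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    # max(..., 1) avoids dividing by zero for an empty reference.
    return prev[-1] / max(len(ref), 1)


print(word_error_rate("the red car", "the car red"))
```

Lower is better, and 0.0 means the output matches the reference word for word. Tracking a number like this before and after each change to the rules is what makes the "make changes and see if you improve" loop measurable.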

Slater Victoroff
  • Thanks for the detailed answer, I will make sure to take a look at Parsey McParseface, which looks quite interesting. And yes, I am aware that rule-based methods are not the best but, sadly, the available data is minimal and doesn't allow for anything better at the moment. – Yukypack Jul 28 '16 at 21:01
  • @Yukypack Using something off-the-shelf will probably give you something better than a rule-based method unless you plan on annotating everything. – Slater Victoroff Jul 28 '16 at 21:11