Which existing software would you use for a manageable attempt at building something like Google Translate using statistical linguistics and machine learning?
If your only goal is to build software that translates, then I would just use the Google Language API: it's free so why reinvent the wheel? If your goal is to build a translator similar to Google's for the sake of getting familiar with machine learning, then you're on the wrong path... try a simpler problem.
Which database(s)?
Update:
Depends on the size of your corpus: if it's ginormous, then I would go with Hadoop (since you mentioned Mahout)... otherwise go with a standard database (SQL Server, MySQL, etc.).
Original:
I'm not sure what databases you can use for this, but if all else fails you can use Google Translate to build your own database... however, that will introduce a bias towards Google's translator, and any errors Google makes will cause your software to (at the very least) have the same errors.
Which programming languages besides C++?
Whatever you're most comfortable with... certainly C++ is an option, but you might have an easier time with Java or C#. Developing in Java or C# is much faster since there is A LOT of functionality built into their standard libraries right from the start.
Apache Mahout?
If you have an enormous data set... you could.
Update:
In general, if the size of your corpus is really big, then I would definitely use a robust combination like Mahout/Hadoop. Both are built exactly for that purpose, and you would have a really hard time "duplicating" all of their work unless you have a huge team behind you.
And, how would those software components work together to power the effort as a whole?
It seems that you are in fact trying to familiarize yourself with machine learning... I would try something MUCH simpler: build a language detector instead of a translator. I recently built one, and I found that the most useful thing you can do is build character n-grams (bigrams and trigrams combined worked the best). You would then use the n-grams as input to a standard machine learning algorithm (like C4.5, GP, GA, a Bayesian model, etc.) and perform 10-fold cross-validation to minimize overfitting.
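Just to illustrate the feature-extraction step, here is a minimal Java sketch that pulls character bigrams and trigrams out of a string and counts them. The class and method names are placeholders of my own, not part of any library, and the normalization is deliberately naive.

```java
import java.util.HashMap;
import java.util.Map;

// Minimal sketch: extract character bigram/trigram counts from a document.
// These counts would then be turned into feature vectors for a classifier.
public class CharNGrams {

    // Returns a map from n-gram (e.g. "th", "the") to its frequency in the text.
    public static Map<String, Integer> extract(String text, int minN, int maxN) {
        Map<String, Integer> counts = new HashMap<>();
        String normalized = text.toLowerCase();
        for (int n = minN; n <= maxN; n++) {
            for (int i = 0; i + n <= normalized.length(); i++) {
                String gram = normalized.substring(i, i + n);
                counts.merge(gram, 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        Map<String, Integer> features = extract("The quick brown fox", 2, 3);
        features.forEach((gram, count) -> System.out.println(gram + " -> " + count));
    }
}
```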
Update:
"...what software components do you use to make your example running?"
My example was pretty simple: I have an SQL Server database with documents that are already labeled with a language, I load all the data into memory (several hundred documents) and I give the algorithm (C4.5) each document. The algorithm uses a custom function to extract the document's features (bigram and trigram letters), then it runs its standard learning process and spits out a model. I then test the model against a test data set to verify its accuracy.
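To make that workflow concrete, here is a hedged sketch of the surrounding train/test loop. It assumes the labeled documents have already been pulled out of the database into memory, reuses the CharNGrams helper from the earlier sketch, and substitutes a trivial "nearest n-gram profile" classifier for C4.5 purely so the example is self-contained; the class names and sample data are invented for illustration.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Sketch of the overall flow: labeled documents in, a model out, accuracy on a held-out set.
// A trivial nearest-profile classifier stands in for C4.5 so the example is self-contained.
public class LanguageDetectorSketch {

    record Labeled(String text, String language) {}

    public static void main(String[] args) {
        // Assume these were loaded from the database (sample data is hypothetical).
        List<Labeled> training = List.of(
                new Labeled("the quick brown fox jumps over the lazy dog", "en"),
                new Labeled("der schnelle braune fuchs springt über den faulen hund", "de"));
        List<Labeled> testing = List.of(
                new Labeled("a lazy dog sleeps in the sun", "en"),
                new Labeled("ein fauler hund schläft in der sonne", "de"));

        // "Training": build one aggregate n-gram profile per language.
        Map<String, Map<String, Integer>> profiles = new HashMap<>();
        for (Labeled doc : training) {
            Map<String, Integer> profile = profiles.computeIfAbsent(doc.language(), k -> new HashMap<>());
            CharNGrams.extract(doc.text(), 2, 3).forEach((g, c) -> profile.merge(g, c, Integer::sum));
        }

        // "Testing": classify each held-out document and measure accuracy.
        int correct = 0;
        for (Labeled doc : testing) {
            if (classify(doc.text(), profiles).equals(doc.language())) correct++;
        }
        System.out.println("Accuracy: " + (double) correct / testing.size());
    }

    // Pick the language whose profile shares the most n-grams with the document.
    static String classify(String text, Map<String, Map<String, Integer>> profiles) {
        Map<String, Integer> docGrams = CharNGrams.extract(text, 2, 3);
        String best = null;
        long bestScore = -1;
        for (Map.Entry<String, Map<String, Integer>> e : profiles.entrySet()) {
            long score = docGrams.keySet().stream().filter(e.getValue()::containsKey).count();
            if (score > bestScore) { bestScore = score; best = e.getKey(); }
        }
        return best;
    }
}
```

A real run would of course use many documents per language and 10-fold cross-validation rather than a single train/test split.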
In your case, with terabytes of data, it seems that you should use Mahout with Hadoop. Additionally, the components you're going to be using are well defined in the Mahout/Hadoop architecture, so it should be pretty self-explanatory from there on.
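If you do go the Hadoop route, the same n-gram counting scales out naturally as a MapReduce job. The sketch below uses the standard Hadoop MapReduce API and assumes, purely for illustration, that the input is plain text with one document per line and that input/output paths come from the command line; it is not Mahout-specific.

```java
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// Sketch: count character bigrams/trigrams across a huge corpus with MapReduce.
// Assumes each input line is one document.
public class NGramCountJob {

    public static class NGramMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text gram = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            String doc = value.toString().toLowerCase();
            for (int n = 2; n <= 3; n++) {
                for (int i = 0; i + n <= doc.length(); i++) {
                    gram.set(doc.substring(i, i + n));
                    context.write(gram, ONE);
                }
            }
        }
    }

    public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) sum += v.get();
            context.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "ngram-count");
        job.setJarByClass(NGramCountJob.class);
        job.setMapperClass(NGramMapper.class);
        job.setCombinerClass(SumReducer.class);
        job.setReducerClass(SumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```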