Which existing software would you use for a manageable attempt at building something like Google Translate using statistical linguistics and machine learning?
If your only goal is to build software that translates, then I would just use the Google Language API: it's free so why reinvent the wheel? If your goal is to build a translator similar to Google's for the sake of getting familiar with machine learning, then you're on the wrong path... try a simpler problem.
Which database(s)?
Update:
Depends on the size of your corpus: if it's ginormous, then I would go with Hadoop (since you mentioned Mahout)... otherwise go with a standard database (SQL Server, MySQL, etc.).
Original:
I'm not sure what databases you can use for this, but if all else fails you can use Google Translate to build your own database... however, that will introduce a bias towards Google's translator, and any errors Google makes will cause your software to (at the very least) have the same errors.
Which programming languages besides C++?
Whatever you're most comfortable with... certainly C++ is an option, but you might have an easier time with Java or C#. Developing in Java or C# is much faster since there is A LOT of functionality built into their standard libraries right from the start.
Apache Mahout?
If you have an enormous data set... you could.
Update:
In general, if the size of your corpus is really big, then I would definitely use a robust combination like Mahout/Hadoop. Both are built exactly for that purpose, and you would have a really hard time "duplicating" all of their work unless you have a huge team behind you.
And, how would those software components work together to power the effort as a whole?
It seems that you are in fact trying to familiarize yourself with machine learning... I would try something MUCH simpler: build a language detector instead of a translator. I recently built one, and I found that the most useful thing you can do is build character n-grams (bigrams and trigrams combined worked the best). You would then use the n-grams as input to a standard machine learning algorithm (like C4.5, GP, GA, a Bayesian model, etc.) and perform 10-fold cross-validation to minimize overfitting.
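Just to illustrate the feature-extraction step, here is a minimal Java sketch that pulls character bigrams and trigrams out of a string and counts them. The class and method names are placeholders of my own, not part of any library, and the normalization is deliberately naive.

```java
import java.util.HashMap;
import java.util.Map;

// Minimal sketch: extract character bigram/trigram counts from a document.
// These counts would then be turned into feature vectors for a classifier.
public class CharNGrams {

    // Returns a map from n-gram (e.g. "th", "the") to its frequency in the text.
    public static Map<String, Integer> extract(String text, int minN, int maxN) {
        Map<String, Integer> counts = new HashMap<>();
        String normalized = text.toLowerCase();
        for (int n = minN; n <= maxN; n++) {
            for (int i = 0; i + n <= normalized.length(); i++) {
                String gram = normalized.substring(i, i + n);
                counts.merge(gram, 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        Map<String, Integer> features = extract("The quick brown fox", 2, 3);
        features.forEach((gram, count) -> System.out.println(gram + " -> " + count));
    }
}
```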
Update:
"...what software components do you use to make your example running?"
My example was pretty simple: I have an SQL Server database with documents that are already labeled with a language, I load all the data into memory (several hundred documents) and I give the algorithm (C4.5) each document. The algorithm uses a custom function to extract the document's features (bigram and trigram letters), then it runs its standard learning process and spits out a model. I then test the model against a test data set to verify its accuracy.
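To make that workflow concrete, here is a hedged sketch of the surrounding train/test loop. It assumes the labeled documents have already been pulled out of the database into memory, reuses the CharNGrams helper from the earlier sketch, and substitutes a trivial "nearest n-gram profile" classifier for C4.5 purely so the example is self-contained; the class names and sample data are invented for illustration.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Sketch of the overall flow: labeled documents in, a model out, accuracy on a held-out set.
// A trivial nearest-profile classifier stands in for C4.5 so the example is self-contained.
public class LanguageDetectorSketch {

    record Labeled(String text, String language) {}

    public static void main(String[] args) {
        // Assume these were loaded from the database (sample data is hypothetical).
        List<Labeled> training = List.of(
                new Labeled("the quick brown fox jumps over the lazy dog", "en"),
                new Labeled("der schnelle braune fuchs springt über den faulen hund", "de"));
        List<Labeled> testing = List.of(
                new Labeled("a lazy dog sleeps in the sun", "en"),
                new Labeled("ein fauler hund schläft in der sonne", "de"));

        // "Training": build one aggregate n-gram profile per language.
        Map<String, Map<String, Integer>> profiles = new HashMap<>();
        for (Labeled doc : training) {
            Map<String, Integer> profile = profiles.computeIfAbsent(doc.language(), k -> new HashMap<>());
            CharNGrams.extract(doc.text(), 2, 3).forEach((g, c) -> profile.merge(g, c, Integer::sum));
        }

        // "Testing": classify each held-out document and measure accuracy.
        int correct = 0;
        for (Labeled doc : testing) {
            if (classify(doc.text(), profiles).equals(doc.language())) correct++;
        }
        System.out.println("Accuracy: " + (double) correct / testing.size());
    }

    // Pick the language whose profile shares the most n-grams with the document.
    static String classify(String text, Map<String, Map<String, Integer>> profiles) {
        Map<String, Integer> docGrams = CharNGrams.extract(text, 2, 3);
        String best = null;
        long bestScore = -1;
        for (Map.Entry<String, Map<String, Integer>> e : profiles.entrySet()) {
            long score = docGrams.keySet().stream().filter(e.getValue()::containsKey).count();
            if (score > bestScore) { bestScore = score; best = e.getKey(); }
        }
        return best;
    }
}
```

A real run would of course use many documents per language and 10-fold cross-validation rather than a single train/test split.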
In your case, with terabytes of data, it seems that you should use Mahout with Hadoop. Additionally, the components you're going to be using are well defined in the Mahout/Hadoop architecture, so it should be pretty self-explanatory from there on.
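If you do go the Hadoop route, the same n-gram counting scales out naturally as a MapReduce job. The sketch below uses the standard Hadoop MapReduce API and assumes, purely for illustration, that the input is plain text with one document per line and that input/output paths come from the command line; it is not Mahout-specific.

```java
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// Sketch: count character bigrams/trigrams across a huge corpus with MapReduce.
// Assumes each input line is one document.
public class NGramCountJob {

    public static class NGramMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text gram = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            String doc = value.toString().toLowerCase();
            for (int n = 2; n <= 3; n++) {
                for (int i = 0; i + n <= doc.length(); i++) {
                    gram.set(doc.substring(i, i + n));
                    context.write(gram, ONE);
                }
            }
        }
    }

    public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) sum += v.get();
            context.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "ngram-count");
        job.setJarByClass(NGramCountJob.class);
        job.setMapperClass(NGramMapper.class);
        job.setCombinerClass(SumReducer.class);
        job.setReducerClass(SumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```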