Using BERT in order to detect language of a given word

Question

I have words in the Hebrew language. Part of them are originally in English, and part of them are 'Hebrew English', meaning that those are words that are originally from English but are written with Hebrew words. For example: 'insulin' in Hebrew is "אינסולין" (Same phonetic sound).

I have a simple binary dataset. X: words (Written with Hebrew characters) y: label 1 if the word is originally in English and is written with Hebrew characters, else 0

I've tried using the classifier, but the input for it is full text, and my input is just words.

I don't want any MASKING to happen, I just want simple classification.

Is it possible to use BERT for this mission? Thanks

David Dale · Accepted Answer · 2019-06-23T17:09:43.017

BERT is intended to work with words in context. Without context, a BERT-like model is equivalent to simple word2vec lookup (there is fancy tokenization, but I don't know how it works with Hebrew - probably, not very efficiently). So if you really really want to use distributional features in your classifier, you can take a pretrained word2vec model instead - it's simpler than BERT, and no less powerful.

But I'm not sure it will work anyway. Word2vec and its equivalents (like BERT without context) don't know much about inner structure of a word - only about contexts it is used in. In your problem, however, word structure is more important than possible contexts. For example, words בלוטת (gland) or דם (blood) or סוכר (sugar) often occur in the same context as insulin, but בלוטת and דם are Hebrew, whereas סוכר is English (okay, originally Arabic, but we are probably not interested in too ancient origins). You just cannot predict it from context only.

So why not start with some simple model (e.g. logistic regression or even naive bayes) over simple features (e.g. character n-grams)? Distributional features (I mean w2v) may be added as well, because they tell about topic, and topics may be informative (e.g. in medicine, and technology in general, there are probably relatively more English words than in other domains).

Thanks, I know that character n-gram is a classic solution for this problem, but I was wondering whether I can use more advanced methods and use the power of transfer learning (Such as BERT, ELMo, etc) for my problem. Thanks for your informative comment. — jonb, Jun 24 '19 at 12:51

Using BERT in order to detect language of a given word

1 Answers1