
Many techniques can be used to detect spam in a specific language, and with the right approach a system can detect spam across several languages, but this usually assumes that each individual text is written in a single language.

So my question is: how do you handle a text that is composed of multiple languages? This is not only about language detection; I'd like to know some best practices for multilingual text spam detection.

Yu QIAN

2 Answers


If you're trying to do multilingual text spam detection, a possible approach is to use NLTK's PorterStemmer().

Using WordNetLemmatizer() will probably give you an error (because the words have to be in English); on the other hand, leaving the text as it is will hurt your model's performance.

Here's an example:

from nltk.stem import PorterStemmer
from nltk.tokenize import RegexpTokenizer

stemmer = PorterStemmer()
tokenizer = RegexpTokenizer(r"\w+")  # keep word characters only, drop punctuation

test = "Go until jurong point crazy Available only in bugis n great world la e buffet Cine there got amore wat"
print(test)

def word_stemmer(text):
    # Tokenize, stem each token with the Porter stemmer, and rejoin into one string.
    words = tokenizer.tokenize(text)
    stem_words = [stemmer.stem(word) for word in words]
    return " ".join(stem_words)

print(word_stemmer(test))

Here's roughly what the output looks like (with NLTK's default Porter stemmer settings):
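Go until jurong point crazy Available only in bugis n great world la e buffet Cine there got amore wat
go until jurong point crazi avail onli in bugi n great world la e buffet cine there got amor wat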

arfan md

A naive solution is to still use the translation API to segment the text into fragments by language, and then classify each fragment with the classifier for its language.
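For illustration, here is a minimal sketch of that idea. It uses the offline langdetect package in place of a remote translation API (my assumption, not part of the original setup), and a hypothetical classifiers dict that maps a language code to an already-trained per-language spam classifier exposing a predict_spam(text) -> bool method:

from langdetect import detect
from langdetect.lang_detect_exception import LangDetectException

def split_by_language(text):
    # Group consecutive sentences that share the same detected language code.
    fragments = []
    for sentence in text.split("."):
        sentence = sentence.strip()
        if not sentence:
            continue
        try:
            lang = detect(sentence)
        except LangDetectException:
            lang = "unknown"  # detection can fail on very short fragments
        if fragments and fragments[-1][0] == lang:
            fragments[-1] = (lang, fragments[-1][1] + " " + sentence)
        else:
            fragments.append((lang, sentence))
    return fragments

def is_spam(text, classifiers):
    # classifiers: hypothetical dict like {"en": en_model, "zh-cn": zh_model, ...}
    # Flag the whole message as spam if any language fragment is classified as spam.
    for lang, fragment in split_by_language(text):
        model = classifiers.get(lang)
        if model is not None and model.predict_spam(fragment):
            return True
    return False

Sentence-level detection is noisy on very short fragments, so coarser segmentation may work better in practice.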

This is a straightforward solution, but I am worried about its performance, since the translation API would be called frequently.

I was wondering: how do big companies or mature open-source projects handle this problem?

Yu QIAN