Let me introduce the context briefly: I'm fine-tuning a generic BERT model for the food and beverage domain. The final goal is a classification task.
To train this model, I'm using a corpus of text gathered from blog posts, articles, magazines, etc. that cover the topic.
I am, however, facing a problem that I don't know how to handle: some words appear with a typo or with different accents, yet they are semantically the same.
Let me give you an example to briefly illustrate what I mean:
The wine Gewürztraminer is correctly written with the ü, however sometimes you also find it written with just a normal u, or some other times even just Gewurtz. There are several situations like this one.
Now, a human being would obviously know that we're talking exactly about the same thing, but I have absolutely no idea about how BERT would handle these situations. Would it understand that they're the same thing? Would it consider them instead to be completely different words?
I am currently cleaning my training data, fixing the typos and trying to even out all these inconsistencies, but at this point I'm not even sure I should do that at all, considering that the text to be classified can itself contain typos and variants like the ones described above.
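To make the cleaning step concrete, here is a minimal sketch of the kind of normalization I have in mind. It is only an illustration under my own assumptions: the function names and the `ALIASES` table are hypothetical, and the single alias entry is taken from the example above. The idea is that the same function could be applied to both the training corpus and the text to be classified, so both sides see the same spelling.

```python
import unicodedata

# Hypothetical alias table (an assumption for illustration): maps known
# shortened or misspelled variants, already lowercased and accent-stripped,
# to one canonical spelling.
ALIASES = {
    "gewurtz": "gewurztraminer",
}


def strip_accents(text: str) -> str:
    """Remove combining accent marks, e.g. 'ü' becomes 'u'."""
    # NFD splits an accented character into base letter + combining mark.
    decomposed = unicodedata.normalize("NFD", text)
    # Category "Mn" (nonspacing mark) covers the combining accents.
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")


def normalize(text: str) -> str:
    """Lowercase, strip accents, then map known variants to a canonical form."""
    folded = strip_accents(text.lower())
    # Naive whitespace split: a variant with punctuation attached
    # (e.g. "gewurtz,") would not be matched by this sketch.
    return " ".join(ALIASES.get(token, token) for token in folded.split())


if __name__ == "__main__":
    for sample in ["Gewürztraminer", "Gewurztraminer", "a glass of Gewurtz"]:
        print(sample, "->", normalize(sample))
```

This only covers variants that are listed explicitly, so unseen typos would still pass through unchanged.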
What would you guys suggest?