I'm new to NER and spaCy, and I'm trying to figure out what text cleaning, if any, needs to be done. Some examples I've found trim the leading and trailing whitespace and then adjust the start/end indexes. I saw one example where the author did a lot of cleaning and his accuracy was really bad because all the indexes were out of sync.
Just to clarify, the dataset was annotated with DataTurks, so you get JSON like this:
"Content": <original text>
"label": [
"Skills"
],
"points": [
{
"start": 1295,
"end": 1621,
"text": "\n• Programming language...
So by "mucking with the indexes" I mean: if you strip off the leading \n, you need to update the start index so the annotation is still aligned properly.
So that's really the question: if I start removing characters from the beginning, end, or middle, I need to apply the same removal to the content attribute and adjust the start/end indexes to match, right? I'm guessing the answer is an obvious "yes" :), so I'm wondering how much cleaning actually needs to be done.
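One alternative I've seen to editing the text at all: leave the content untouched and just tighten each span's own indexes past the junk, so no other annotation has to move. A sketch of my own helper (not a spaCy API), assuming the bullet is U+2022:

```python
def trim_span(text, start, end):
    """Shrink a (start, end) span so it excludes leading/trailing
    whitespace and bullets, without modifying the text itself."""
    junk = " \t\n\u2022"  # whitespace plus the bullet character
    while start < end and text[start] in junk:
        start += 1
    while end > start and text[end - 1] in junk:
        end -= 1
    return start, end
```

Running trim_span(content, start, end) on the "Skills" span above would skip past the leading "\n• " and return tightened offsets, with nothing else needing re-alignment.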
So you would remove the \n characters, bullets, and leading/trailing whitespace, but leave standard punctuation like commas and periods?
What about things like lowercasing, stop-word removal, lemmatization, and so on?
One concern I'm seeing in the few samples I've looked at is that the start/end indexes do get thrown off by the cleaning, because you need to update EVERY annotation as you remove characters to keep them in sync.
For example, with two annotations

A: 0 -> 100
B: 101 -> 150

if I remove a character at position 50, then I need to adjust B to 100 -> 149 (and A's end shrinks to 99, since the removed character was inside A).
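So if I did go the removal route, I'd need a pass like this over every annotation for every character removed. A rough sketch, assuming (start, end, label) character-offset annotations:

```python
def remove_char(text, entities, pos):
    """Delete the character at `pos` and shift every (start, end, label)
    annotation so it stays aligned with the new text."""
    new_text = text[:pos] + text[pos + 1:]
    adjusted = []
    for start, end, label in entities:
        # Offsets past the removal point shift left by one; a span that
        # contains the removed character also shrinks by one.
        new_start = start - 1 if start > pos else start
        new_end = end - 1 if end > pos else end
        adjusted.append((new_start, new_end, label))
    return new_text, adjusted
```

On the example above, remove_char(text, [(0, 100, "A"), (101, 150, "B")], 50) leaves A at (0, 99) and moves B to (100, 149), which is exactly the bookkeeping I'd rather avoid if the cleaning isn't necessary.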