6

I am new to NER and spaCy, and I'm trying to figure out what text cleaning, if any, needs to be done. Some examples I've found trim the leading and trailing whitespace and then muck with the start/stop indexes. I saw one example where the author did a bunch of cleaning and his accuracy was really bad because all the indexes were messed up.

Just to clarify, the dataset was annotated with DataTurks, so you get JSON like this:

        "Content": <original text>
        "label": [
            "Skills"
        ],
        "points": [
            {
                "start": 1295,
                "end": 1621,
                "text": "\n• Programming language...

So by "mucking with the indexes", I mean, if you strip off the leading \n, you need to update the start index, so it's still aligned properly.

So that's really the question: if I start removing characters from the beginning, end, or middle, I need to apply the same change to the Content attribute and adjust the start/end indexes to match, no? I'm guessing the answer is an obvious "yes" :), so I was wondering how much cleaning needs to be done.

So you would remove the \n characters, bullets, and leading/trailing whitespace, but leave standard punctuation like commas, periods, etc.?

What about stuff like lowercasing, stop words, lemmatizing, etc?

One concern I'm seeing with a few samples I've looked at is that the start/stop indexes do get thrown off by the cleaning they do, because you need to update EVERY annotation as you remove characters to keep them in sync.

I.e.

A 0 -> 100
B 101 -> 150

if I remove a char at position 50, then A shrinks to 0 -> 99 and I need to adjust B to 100 -> 149.
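
To make the bookkeeping concrete, here's a rough sketch of what I mean (the helper and the annotation shape are just my own illustration, not anything from spaCy or DataTurks):

    # Remove one character and shift any start/end offsets that come after it.
    def remove_char(text, pos, annotations):
        """annotations: list of dicts with 'start' and 'end' character offsets."""
        new_text = text[:pos] + text[pos + 1:]
        adjusted = []
        for ann in annotations:
            start, end = ann["start"], ann["end"]
            if start > pos:
                start -= 1
            if end > pos:
                end -= 1
            adjusted.append({**ann, "start": start, "end": end})
        return new_text, adjusted

    # A 0 -> 100, B 101 -> 150; deleting the char at position 50
    # leaves A at 0 -> 99 and shifts B to 100 -> 149.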

SledgeHammer
  • The "indexes" you're referring to seem to be NER labels. It is true that if your NER labels are wrong they are not usable, so you have to keep them correct if you do preprocessing. It may be helpful to pretend you are giving instructions to a person and asking them to identify things - "please identify skills" "what's a skill?" *points to empty air* - you can see how that would be unhelpful. – polm23 Dec 28 '21 at 05:15

1 Answer

5

First, spaCy does no transformation of the input - it takes it literally as-is and preserves the format. So you don't lose any information when you provide text to spaCy.
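
For example, the original text is always recoverable from the Doc, whitespace and all (a quick check, assuming en_core_web_sm is installed):

    import spacy

    nlp = spacy.load("en_core_web_sm")
    raw = "  Skills:\n\t• Python, C#  "
    doc = nlp(raw)
    assert doc.text == raw  # the Doc keeps the input exactly as given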

That said, input to spaCy with the pretrained pipelines will work best if it is in natural sentences with no weird punctuation, like a newspaper article, because that's what spaCy's training data looks like.

To that end, you should remove meaningless whitespace (like newlines and leading and trailing spaces) and formatting characters (maybe a line of ----?), but that's about all the cleanup you have to do. The spaCy training data won't have bullets, so they might cause some weird results, but I would leave them in to start. (Also, bullets are obviously printable characters - maybe you mean non-ASCII?)
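
If you'd rather not rewrite the text at all, another option is to leave the content alone and just pull the label boundaries inward so spans don't start or end on whitespace or bullets. A rough sketch (none of this is spaCy or DataTurks API, and it assumes exclusive end offsets):

    # Move a (start, end) character span inward so it doesn't begin or end
    # on whitespace or bullet characters; the text itself is untouched.
    JUNK = " \t\n•"

    def tighten_span(text, start, end):
        while start < end and text[start] in JUNK:
            start += 1
        while end > start and text[end - 1] in JUNK:
            end -= 1
        return start, end

    text = "Skills:\n• Programming languages: C#, Python\n"
    start, end = 7, len(text)          # span includes the newline and bullet
    print(repr(text[start:end]))       # '\n• Programming languages: C#, Python\n'
    start, end = tighten_span(text, start, end)
    print(repr(text[start:end]))       # 'Programming languages: C#, Python'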

I have no idea what you mean by "muck with the indexes", but for some older NLP methods it was common to do more extensive preprocessing, like removing stop words and lowercasing everything. Doing that will make things worse with spaCy because it uses the information you are removing for clues, just like a human reader would.

Note that you can train your own models, in which case they'll learn about the kind of text you show them. In that case you can get rid of preprocessing entirely, though for actually meaningless things like newlines and leading/trailing spaces you might as well remove them anyway.
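
To make that concrete, here's a rough sketch of turning character-offset annotations into spaCy 3 training data with no preprocessing at all (the example annotation is made up, in roughly the shape your export has, and end offsets are assumed to be exclusive):

    import spacy
    from spacy.tokens import DocBin

    nlp = spacy.blank("en")   # tokenizer only, no pretrained components needed
    db = DocBin()

    # Made-up example in roughly the (text, [(start, end, label)]) shape.
    examples = [
        ("Skills: Python, C#, SQL", [(8, 23, "Skills")]),
    ]

    for text, annotations in examples:
        doc = nlp.make_doc(text)
        spans = []
        for start, end, label in annotations:
            # char_span returns None if start/end don't land on token boundaries,
            # which is a cheap way to catch offsets broken by cleaning.
            span = doc.char_span(start, end, label=label, alignment_mode="contract")
            if span is None:
                print(f"Misaligned span {start}-{end} in {text!r}")
            else:
                spans.append(span)
        doc.ents = spans
        db.add(doc)

    db.to_disk("./train.spacy")   # then train with `python -m spacy train`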


To address your new info briefly...

Yes, character indexes for NER labels must be updated if you do preprocessing. If they aren't updated they aren't usable.

It looks like you're trying to extract "skills" from resumes, which are full of bullet-point lists. The spaCy training data is newspaper articles, which don't contain any lists like that, so it's hard to say what the right thing to do is. I don't think the bullets matter much, but you can try removing them or leaving them in.
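
If you do want to try stripping the bullets while keeping the labels aligned, the offset bookkeeping can be done in one pass. A rough sketch, again assuming exclusive end offsets:

    import re

    BULLET = re.compile(r"•\s?")   # a bullet, optionally followed by a space

    def clean_and_remap(text, spans):
        """Remove bullets and remap (start, end, label) character offsets."""
        keep = [True] * len(text)
        for m in BULLET.finditer(text):
            for i in range(m.start(), m.end()):
                keep[i] = False

        # new_pos[i] = position of original character i in the cleaned text
        new_pos, count = [], 0
        for k in keep:
            new_pos.append(count)
            if k:
                count += 1
        new_pos.append(count)      # so end == len(text) still maps cleanly

        cleaned = "".join(c for c, k in zip(text, keep) if k)
        remapped = [(new_pos[s], new_pos[e], label) for s, e, label in spans]
        return cleaned, remapped

    text = "Skills:\n• Python\n• SQL\n"
    spans = [(10, 16, "Skills"), (19, 22, "Skills")]
    cleaned, remapped = clean_and_remap(text, spans)
    print([cleaned[s:e] for s, e, _ in remapped])   # ['Python', 'SQL']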

What about stuff like lowercasing, stop words, lemmatizing, etc?

I already addressed this, but do not do this. This was historically common practice for NLP models, but for modern neural models, including spaCy, it is actively unhelpful.

polm23
  • Hi, I updated the question with some clarifications you asked about. – SledgeHammer Dec 28 '21 at 05:04
  • If you change the question that significantly you should just open a new question instead... – polm23 Dec 28 '21 at 05:12
  • Yup :), that's the learning project. Pulling entities out of a resume. More than just skills, but I wanted to keep it brief. – SledgeHammer Dec 28 '21 at 05:48
  • Hi, what you say sounds logical and valid. I have already found some reliable examples on the internet where spaCy NER models were trained with text data in the format you describe (like newspaper text). However, I just can't find a clear statement on this topic in spaCy's documentation, and it's driving me crazy and paranoid. Can you possibly give us your source? I have to label a lot of data for my own NER model, which is a huge effort and not the most fun work. An official statement from spaCy/Explosion on this issue would reassure me greatly. – EustassX Apr 20 '22 at 08:10
  • Source is clearly stated on the models page. https://spacy.io/models/en/ – polm23 Apr 20 '22 at 08:32