
I'm using sklearn to do some NLP vectorizing with a TfidfVectorizer object. This object can be constructed with a keyword argument, "token_pattern".
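
Roughly, the setup looks like this (simplified; docs is just a stand-in for my real data, and the pattern shown is sklearn's default, not the one I'm after):

    from sklearn.feature_extraction.text import TfidfVectorizer

    docs = ["RT this is a retweet",
            "Deleted tweet",
            "http://example.com/some-path 10mg #foobar"]  # stand-in data

    # sklearn's default token_pattern -- the question is what to put here instead
    vectorizer = TfidfVectorizer(token_pattern=r"(?u)\b\w\w+\b")
    X = vectorizer.fit_transform(docs)
    print(sorted(vectorizer.vocabulary_))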

I want to avoid hashtags (#foobar), numerics and strings that begin with a number (e.g. 10mg), any line that begins with 'RT' (retweet), or the line "Deleted tweet".

In addition, I want to ignore unicode.

I want to keep the URLs (minus the 'http://') and have them tokenized into any words ([A-Za-z]+ only) that may exist in them.

I have some experience with Regex, but have not needed more complex patterns until now.

Below is my stab at covering everything... it's obviously not the best approach, but it does sum up how I currently think about the regex rules.

NOTE: the sklearn doc here shows the default "token_pattern" using the unicode flag on the string and I don't understand why... separate question perhaps.

pat2 = r"(?im)([A-Z]+)(?<!^@)([A-Z]+)(?<!^#)([A-Z]+)(?<!^(RT))([A-Z]+)(?<!^Deleted)(?<=^(http://))([A-Z]+)"

My break down:

(?im)  #Flags for 'case insensitive' and 'multi-line'

([A-Z]+)(?<!^@) #A negative lookbehind: match [A-Z]+ only if not preceded by 'starts with @'.

(?<=^(http://))([A-Z]+) #A positive lookbehind: match [A-Z]+ only if preceded by 'starts with "http://"'.

I get the feeling this is not an elegant solution even if it is tweaked into working...

TIA

UPDATE: RAW DATA EXAMPLE:

If it helps to know, I'm using a pandas data frame to load the data. I'm new to pandas and maybe missing some pandas based solution.

From this raw data, I'd want only words taken from the text and the URLs. This example sucks... please comment further to help me get it better defined... thx!

raw:

http://foxsportswisconsin.ning.com/profiles/blogs/simvastatin-20-mg-pas-cher-sur-internet-acheter-du-simvastatin-20

tokenized:

[foxsportswisconsin, ning, com, profiles, blogs, simvastatin, mg, pas, cher, sur, internet, acheter, du, simvastatin]
  • Can you show us the parsing you want from a tweet? Example tweet and example tokens. This is 100% not the way to go. – Slater Victoroff Jan 24 '15 at 19:34
  • @SlaterTyranus, I work on that now...my input is a mix from various sources, blogs, tweets, etc. Originally, I made a separate method to loop over the lines, break them into words, and regex them...but that was also messy. – wbg Jan 24 '15 at 19:44

1 Answer


tl;dr: if you ever write a regex over 20 characters you're doing something wrong, but it might be an acceptable hack. If you write a regex over 50 characters you need to stop immediately.

Let me just start off by saying that this should, in no way, shape, or form be solved by a regex.

Most of the steps that you describe should be handled in pre-processing or post-processing. You shouldn't try to come up with a regex that filters out lines starting with Deleted tweet or RT; you should drop those lines in pre-processing.

Ignore unicode? Then it's probably worth getting off the internet, since literally everything on the internet, and everything outside of Notepad, is unicode. If you want to remove all unicode characters that can't be represented in ASCII (which is what I assume you meant?), then the encoding step is the place to fix this:

<string>.encode('ascii', 'ignore')
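
A rough sketch of that pre-processing (the helper name and the line-level filtering are just illustrative; adapt them to however your data actually arrives):

    def preprocess(lines):
        """Drop retweets and 'Deleted tweet' lines, then strip non-ASCII characters."""
        kept = []
        for line in lines:
            if line.startswith('RT') or line.strip() == 'Deleted tweet':
                continue
            # the encode/decode round-trip throws away anything ascii can't represent
            kept.append(line.encode('ascii', 'ignore').decode('ascii'))
        return kept

    clean_lines = preprocess(["RT ignore me", "Deleted tweet", "keep this caf\u00e9"])
    # -> ['keep this caf']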

As far as ignoring http goes, you should just set http as a stopword. This can be passed in as another argument to the vectorizer you're using.
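
For example (a sketch; extend the list with whatever else you want filtered, or merge it with sklearn's built-in English stop word list if you want those removed too):

    from sklearn.feature_extraction.text import TfidfVectorizer

    # 'http' never makes it into the vocabulary; add more terms as needed
    vectorizer = TfidfVectorizer(stop_words=['http', 'https'])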

Once that's done, the token regex you use (probably still not a case for regex, but that is the interface that sklearn offers), is actually very simple:

r'\b[a-zA-Z]\w+\b'

The only change from the default here is requiring tokens to start with a letter, which takes care of ignoring numerics like 10mg mentioned above.
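
Putting that together on the URL from your example (a sketch; sklearn lowercases by default, and vocabulary_ is only used here to show the resulting tokens):

    from sklearn.feature_extraction.text import TfidfVectorizer

    doc = ("http://foxsportswisconsin.ning.com/profiles/blogs/"
           "simvastatin-20-mg-pas-cher-sur-internet-acheter-du-simvastatin-20")

    vectorizer = TfidfVectorizer(token_pattern=r'\b[a-zA-Z]\w+\b',
                                 stop_words=['http'])
    vectorizer.fit([doc])
    print(sorted(vectorizer.vocabulary_))
    # ['acheter', 'blogs', 'cher', 'com', 'du', 'foxsportswisconsin',
    #  'internet', 'mg', 'ning', 'pas', 'profiles', 'simvastatin', 'sur']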

It's worth noting that this heavy level of token removal is going to negatively affect pretty much any analysis you're trying to do. If you have a decent-sized corpus, you shouldn't remove any tokens; if it's small, removing stop words and using a stemmer or a lemmatizer is a good way to go. But this kind of token removal is poor practice and will lead to overfitting.
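
And if you do go the stemmer route, one common pattern (a sketch using NLTK's PorterStemmer; any stemmer or lemmatizer would slot in the same way) is to wrap the vectorizer's analyzer:

    from sklearn.feature_extraction.text import TfidfVectorizer
    from nltk.stem import PorterStemmer

    stemmer = PorterStemmer()

    class StemmedTfidfVectorizer(TfidfVectorizer):
        def build_analyzer(self):
            # run sklearn's usual preprocessing/tokenization, then stem each token
            analyzer = super(StemmedTfidfVectorizer, self).build_analyzer()
            return lambda doc: [stemmer.stem(token) for token in analyzer(doc)]

    vectorizer = StemmedTfidfVectorizer(stop_words='english')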

  • This is a great answer, because it treats the real latent question I had, which is, "What is the best approach to tokenizing in NLP?". I have used preprocessing and stop word additions and all that... I just thought maybe I should use regex instead... now I know: "No", you don't use regex, and I should preserve the corpus and make better use of the transformer. Or just get better data. Cheers! – wbg Jan 25 '15 at 00:30