I'm using sklearn to do some NLP vectorizing with a TfidfVectorizer object. This object can be constructed with a keyword argument, "token_pattern".
I want to avoid hashtags (#foobar), numerics and strings that begin with a number (e.g. 10mg), any line that begins with 'RT' (retweet), and the line "Deleted tweet".
In addition, I want to ignore unicode characters.
I want to keep the URLs (minus the 'http://') and have them tokenized into any words ([A-Za-z]+ only) that may exist in them.
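For context, here's roughly how I'm constructing the vectorizer right now, with the default pattern and a made-up toy corpus standing in for my real tweet data:

from sklearn.feature_extraction.text import TfidfVectorizer

# Toy stand-in for my real tweet data (made-up examples):
corpus = [
    "RT @someone this whole line should be dropped",
    "Deleted tweet",
    "#foobar take 10mg daily",
    "http://foxsportswisconsin.ning.com/profiles/blogs/simvastatin-20-mg",
]

# Default pattern from the docs; this is the keyword I want to customize.
vec = TfidfVectorizer(token_pattern=r"(?u)\b\w\w+\b")
X = vec.fit_transform(corpus)
print(vec.get_feature_names_out())  # sklearn >= 1.0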
I have some experience with Regex, but have not needed more complex patterns until now.
Below is my stab at everything... it's obviously not the best approach, but it does sum up how I currently think about the regex rules.
NOTE: the sklearn doc here shows the default "token_pattern" (r"(?u)\b\w\w+\b") using the unicode flag on the string, and I don't understand why... separate question, perhaps.
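As far as I can tell, (?u) just turns on the unicode flag, which is already the default for str patterns in Python 3 (it mattered on Python 2), so the two forms below behave identically:

import re

# (?u) is the inline unicode flag; redundant for Python 3 str patterns.
print(re.findall(r"(?u)\b\w\w+\b", "simvastatin 20 mg"))
print(re.findall(r"\b\w\w+\b", "simvastatin 20 mg"))
# both print: ['simvastatin', '20', 'mg']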
pat2 = r"(?im)([A-Z]+)(?<!^@)([A-Z]+)(?<!^#)([A-Z]+)(?<!^(RT))([A-Z]+)(?<!^Deleted)(?<=^(http://))([A-Z]+)"
My breakdown:
(?im) # Inline flags for 'case insensitive' (i) and 'multi-line' (m)
([A-Z]+)(?<!^@) # A negative lookbehind: match [A-Z]+ only if not preceded by a '@' at the start of the line.
(?<=^(http://))([A-Z]+) # A positive lookbehind (not a look-forward): match [A-Z]+ only if preceded by 'http://' at the start of the line.
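To sanity-check the lookbehind mechanics in isolation (sample string made up): a lookbehind inspects the position *before* the match, so it has to come before the token it guards, not after it:

import re

# The lookbehind goes first: refuse to start a word right after '@' or '#'.
print(re.findall(r"\b(?<![@#])[A-Za-z]+\b", "@user said #foobar hello, take 10mg"))
# ['said', 'hello', 'take']  -- '@user', '#foobar' and '10mg' are all dropped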
I get the feeling this is not an elegant solution even if it could be tweaked into working...
TIA
UPDATE: RAW DATA EXAMPLE:
If it helps to know, I'm using a pandas DataFrame to load the data. I'm new to pandas and may be missing a pandas-based solution.
From this raw data, I'd want only the words taken from the text and the URLs. This example isn't great... please comment further to help me get it better defined... thx!
raw:
http://foxsportswisconsin.ning.com/profiles/blogs/simvastatin-20-mg-pas-cher-sur-internet-acheter-du-simvastatin-20
tokenized:
[foxsportswisconsin, ning, com, profiles, blogs, simvastatin, mg, pas, cher, sur, internet, acheter, du, simvastatin]
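To pin down the target output, here's a minimal sketch of the tokenization I'm after for this one record, using plain re outside the vectorizer and simply filtering out the scheme token:

import re

raw = "http://foxsportswisconsin.ning.com/profiles/blogs/simvastatin-20-mg-pas-cher-sur-internet-acheter-du-simvastatin-20"

# Alphabetic runs only; the 'http' scheme token is dropped afterwards.
tokens = [t for t in re.findall(r"[A-Za-z]+", raw) if t != "http"]
print(tokens)
# ['foxsportswisconsin', 'ning', 'com', 'profiles', 'blogs', 'simvastatin',
#  'mg', 'pas', 'cher', 'sur', 'internet', 'acheter', 'du', 'simvastatin']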