
I have a large set of real-world text that I need to pull words out of to input into a spell checker. I'd like to extract as many meaningful words as possible without too much noise. I know there's plenty of regex ninjas around here, so hopefully someone can help me out.

Currently I'm extracting all alphabetical sequences with '[a-z]+'. This is an okay approximation, but it drags a lot of rubbish out with it.

Ideally I would like some regex (doesn't have to be pretty or efficient) that extracts all alphabetical sequences delimited by natural word separators (such as [/-_,.: ] etc.), and ignores any alphabetical sequences with illegal bounds.

However I'd also be happy to just be able to get all alphabetical sequences that ARE NOT adjacent to a number. So for instance 'pie21' would NOT extract 'pie', but 'http://foo.com' would extract ['http', 'foo', 'com'].

I tried lookahead and lookbehind assertions, but they were applied per-character (so for example re.findall('(?<!\d)[a-z]+(?!\d)', 'pie21') would return 'pi' when I want it to return nothing). I tried wrapping the alpha part as a term ((?:[a-z]+)) but it didn't help.

More detail: The data is an email database, so it's mostly plain English with normal numbers, but occasionally there are rubbish strings like GIHQ4NWL0S5SCGBDD40ZXE5IDP13TYNEA and AC7A21C0 that I'd like to ignore completely. I'm assuming any alphabetical sequence with a number in it is rubbish.

Templar
orlade
  • Better use raw strings with regexes. `\d` happens to work, but other escape sequences will fail, and this can be hard to debug. – Tim Pietzcker Apr 19 '11 at 14:30

4 Answers


If you restrict yourself to ASCII letters, then use (with the re.I option set)

\b[a-z]+\b

\b is a word boundary anchor, matching only at the start and end of alphanumeric "words". So \b[a-z]+\b matches pie, but not pie21 or 21pie.
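A minimal sketch of how this pattern might be used, assuming Python 3; the `WORD` and `extract_words` names are only for illustration. It also shows why the pattern has to be a raw string: without the `r` prefix, `"\b"` is a backspace character rather than a word boundary.

```python
import re

# Without the r prefix, "\b" is a backspace character (0x08), not a regex anchor.
assert "\b" == "\x08"
assert r"\b" == "\\b"

# Hypothetical helper names, not from the original answer.
WORD = re.compile(r"\b[a-z]+\b", re.I)

def extract_words(text):
    """Return runs of ASCII letters that have a word boundary on both sides."""
    return WORD.findall(text)

print(extract_words("pie21 21pie http://foo.com Pie"))
# ['http', 'foo', 'com', 'Pie']
```

Note that `\b` treats the underscore as a word character, so `foo_bar` yields nothing with this pattern.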

To also allow other non-ASCII letters, you can use something like this:

\b[^\W\d_]+\b

which also allows accented characters etc. You may need to set the re.UNICODE option, especially when using Python 2, in order to allow the \w shorthand to match non-ASCII letters.

As a negated character class, [^\W\d_] matches any word character except digits and the underscore, which leaves only letters.
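As a rough illustration, assuming Python 3 (where `str` patterns are Unicode-aware by default, so no extra flag is needed), the letters-only class could be used like this. The `LETTERS` and `letter_words` names are made up for the example.

```python
import re

# [^\W\d_]: word characters minus digits and underscore, i.e. letters only.
LETTERS = re.compile(r"\b[^\W\d_]+\b")

def letter_words(text):
    """Return runs of letters (including accented ones) bounded by \\b."""
    return LETTERS.findall(text)

print(letter_words("wenn bei Beförderungen Schäden 2x pie21"))
# ['wenn', 'bei', 'Beförderungen', 'Schäden']
```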

Tim Pietzcker
  • This sounds exactly like what I want, but I can't get the bally `\b`s to work. With `text` set as some normal sentence, `re.findall('\b[a-z]+\b', text, re.I)` returns nothing. No matter what I put in the square brackets (or using `search` or `match`) it doesn't seem to help either. Using `\B` gets me some results, but strips off the first and last character of each word. As lazy as it sounds I'm far too tired to pick up a new concept right now; any chance you know why it's not working? Or that you can post a literal example of how you'd use it in this case? – orlade Apr 19 '11 at 15:09
  • That's *exactly* why I wrote my comment to your question. If you don't use raw strings (`r"\b[a-z]+\b"`), the `\b` will be interpreted as a backspace character. – Tim Pietzcker Apr 19 '11 at 17:35
  • Ooooooooooooh, that's what you meant :). Sorry, it's now 5:30am here and I was never going to make that connection. Simply add the r and it works a treat! Thank you, sir. – orlade Apr 19 '11 at 19:30
  • In general this works, but it will fail on words with special characters (e.g. `wenn bei Beförderungen Schäden`) – yekta Jun 25 '17 at 09:41
  • @yekta: Not if you compile the regex using the `re.UNICODE` or `re.LOCALE` option. I should add that to my answer. – Tim Pietzcker Jun 25 '17 at 11:44

Are you familiar with word boundaries (\b)? You can extract words by putting \b around the sequence and matching letters in between:

\b([a-zA-Z]+)\b

For instance, this will grab whole words but stop at tokens such as hyphens, periods, semi-colons, etc.

You can read about the \b sequence, and others, in the Python manual.

EDIT: Also, if you're looking to avoid a number following or preceding the match, you can use a negative look-ahead/look-behind:

(?!\d)   # negative look-ahead for numbers
(?<!\d)  # negative look-behind for numbers
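One caveat with look-arounds that only exclude digits: the engine can backtrack to a shorter run of letters, which is the 'pi' result described in the question. A possible workaround, sketched here as my own variation rather than the pattern from this answer, is to exclude letters as well as digits in both look-arounds so that only whole runs can match. Unlike `\b`, this also treats the underscore as a separator.

```python
import re

text = "pie21 21pie foo_bar http://foo.com"

# Digit-only guards: the match can shrink until the guard is satisfied.
digit_guard = re.compile(r"(?<!\d)[a-z]+(?!\d)", re.I)
print(digit_guard.findall(text))
# ['pi', 'ie', 'foo', 'bar', 'http', 'foo', 'com']

# Letter-or-digit guards: a run must be complete and not touch a digit.
whole_run = re.compile(r"(?<![a-z\d])[a-z]+(?![a-z\d])", re.I)
print(whole_run.findall(text))
# ['foo', 'bar', 'http', 'foo', 'com']
```

Both look-arounds test a single character, so the look-behind stays fixed-width as Python's `re` module requires.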
Brad Christie
  • As per Tim's answer, `\b` sounds like what I want but it's not playing nice. Any ideas? I tried the lookahead and lookbehinds before but they seem to match all the characters right up until the character that is adjacent to a number, and so don't completely ignore words with numbers in them. Also it complains about lookaheads needing fixed-width patterns with those +s in there. – orlade Apr 19 '11 at 15:13
  • @Pie21: Then just use a single-digit match. We don't care how many numbers post/precede it, just that there's a digit. [example](http://re.dabase.com/webre.py?input=pie21+21pie+21pie21+pie&regex=\b%28%3F%3C!\d%29%28[a-zA-Z]%2B%29%28%3F!\d%29\b) – Brad Christie Apr 19 '11 at 15:21
  • I got this working [ re.findall(r"\b([a-zA-Z]+)\b",content, re.I) ] but it doesn't seem to weed out forward and back-slashes. Here are some words that came out: '[endif]', '$', '8', '/small', '/li' – Bill Jul 09 '15 at 05:46

What about:

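import re
yourString = "pie 42 http://foo.com GIHQ4NWL0S5SCGBDD40ZXE5IDP13TYNEA  pie42"
# Split on separators, de-duplicate with set(), keep purely alphabetic tokens.
# list(...) is needed on Python 3, where filter returns an iterator.
words = list(filter(lambda x: re.match(r"^[a-zA-Z]+$", x), set(re.split(r"[\s:/,.]", yourString))))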

Note that:

  • split breaks your string apart at the separators and returns a list of "potential words"
  • set removes duplicates: it turns the list into a set, so entries that appear more than once are kept only once. This step is optional.
  • filter reduces the number of candidates: it takes a sequence, applies a test function to each element, and keeps only the elements that pass the test. In our case, the test function is anonymous.
  • lambda: an anonymous function that takes an item and checks whether it is a word (upper- or lower-case letters only)

EDIT : added some explanations
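If the start and end offsets of each word are needed, the same split-then-test idea can be sketched with `re.finditer`, assuming Python 3.4+ for `fullmatch`. The `TOKEN`, `ALPHA` and `words_with_spans` names are hypothetical, and this version does not de-duplicate.

```python
import re

TOKEN = re.compile(r"[^\s:/,.]+")  # runs between the separators used above
ALPHA = re.compile(r"[a-zA-Z]+")   # a token must consist of letters only

def words_with_spans(text):
    """Yield (word, start, end) for every token made only of ASCII letters."""
    for m in TOKEN.finditer(text):
        if ALPHA.fullmatch(m.group()):
            yield m.group(), m.start(), m.end()

s = "pie 42 http://foo.com GIHQ4NWL0S5SCGBDD40ZXE5IDP13TYNEA  pie42"
print(list(words_with_spans(s)))
# [('pie', 0, 3), ('http', 7, 11), ('foo', 14, 17), ('com', 18, 21)]
```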

Bruce
  • Ugly as it is, it does work! Cheers! However can I ask one more favour: since I don't speak lambda OR filter, is there a way to do that kind of thing with `re.finditer()`? I need to keep track of the start and end indexes of each match in the text as well. – orlade Apr 19 '11 at 15:04

Sample code

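import re
print(re.search(r'(?u)ривет\b', 'Привет'))    # matches 'ривет': no boundary is required before it
print(re.search(r'(?u)\bривет\b', 'Привет'))  # None: 'П' precedes 'р', so there is no word boundary

or

import re
s = "abcd ААБВ"
rx1 = re.compile(r"(?u)АБВ")
rx2 = re.compile(r"(?u)АБВ\b")
rx3 = re.compile(r"(?u)\bАБВ\b")
print(rx1.findall(s))  # ['АБВ']
print(rx2.findall(s))  # ['АБВ']
print(rx3.findall(s))  # []: the first 'А' precedes the match, so \b fails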
Alexander Lubyagin