
I am looking for a regex to extract the words that contain ONLY alphanumeric characters:

string = 'This is a $dollar sign !!'
matches = re.findall(regex, string)
matches = ['This', 'is', 'sign']

This can be done by tokenizing the string and evaluating each token individually with the following regex:

^[a-zA-Z0-9]+$
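
A minimal sketch of that tokenize-then-test baseline, assuming tokens are separated by whitespace. The names `TOKEN_RE` and `alnum_tokens` are illustrative and do not come from the original post.

```python
import re

# Hypothetical names for illustration; the pattern is the one quoted above.
TOKEN_RE = re.compile(r'^[a-zA-Z0-9]+$')

def alnum_tokens(text):
    """Split on whitespace, then keep tokens made only of ASCII letters and digits."""
    return [token for token in text.split() if TOKEN_RE.match(token)]

string = 'This is a $dollar sign !!'
print(alnum_tokens(string))  # ['This', 'is', 'a', 'sign']
```

Note that the single-letter token 'a' passes the test as well.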

Due to performance issues, I want to be able to extract the alphanumeric tokens without tokenizing the whole string. The closest I got was

regex = r'\b[a-zA-Z0-9]+\b'

, but it still extracts alphanumeric substrings of tokens that contain other characters:

string = 'This is a $dollar sign !!'
matches = re.findall(regex, string)
matches = ['This', 'is', 'dollar', 'sign']
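
The extra match follows from how `\b` is defined: it is a zero-width position between a word character and a non-word character, so there is a boundary between `$` and `d`. A small sketch that prints the character preceding each match makes this visible (the variable names are illustrative):

```python
import re

string = 'This is a $dollar sign !!'
pattern = re.compile(r'\b[a-zA-Z0-9]+\b')

# Show each match together with the character just before it.
for m in pattern.finditer(string):
    preceding = string[m.start() - 1] if m.start() > 0 else '<start>'
    print(repr(m.group()), 'preceded by', repr(preceding))

matches = pattern.findall(string)
```

The line for 'dollar' reports '$' as the preceding character, which is exactly the case a word boundary cannot rule out.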

Is there a regex able to pull this off? I've tried different things but can't come up with a solution.

GRoutar

3 Answers


Instead of word boundaries, lookbehind and lookahead for spaces (or the beginning/end of the string):

(?:^|(?<= ))[a-zA-Z0-9]+(?= |$)

https://regex101.com/r/TZ7q1c/1

Note that "a" is a standalone alphanumeric word, so it's included too.

['This', 'is', 'a', 'sign']
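
For reference, a runnable sketch of the pattern above with Python's built-in `re`. The second pattern is an assumed variant that is not part of this answer: it uses negative lookarounds on `\S`, so tabs and newlines also count as delimiters. The variable names are illustrative.

```python
import re

string = 'This is a $dollar sign !!'

# Pattern from the answer: a run of alphanumerics delimited by a space
# or by the start/end of the string.
space_delimited = re.compile(r'(?:^|(?<= ))[a-zA-Z0-9]+(?= |$)')

# Assumed variant (not from the answer): the run must be neither preceded nor
# followed by a non-whitespace character, so any whitespace acts as a delimiter.
whitespace_delimited = re.compile(r'(?<!\S)[a-zA-Z0-9]+(?!\S)')

matches_space = space_delimited.findall(string)
matches_any_ws = whitespace_delimited.findall(string)
print(matches_space)   # ['This', 'is', 'a', 'sign']
print(matches_any_ws)  # ['This', 'is', 'a', 'sign']
```

Which of the two fits better depends on whether the input can contain whitespace other than single spaces.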
CertainPerformance
  • To avoid slow character-by-character forward-tracking, I considered using a possessive match and then `(*SKIP)(*FAIL)`ing when it is followed by non-space characters: `\s*\K[a-zA-Z0-9]*+(?:\S+(*SKIP)(*FAIL))?`. That requires the regex module, though, which is slower than the native `re`, so it doesn't provide any speed improvement despite taking fewer steps. – CertainPerformance Jan 05 '19 at 23:21
  • Thank you for your answer. It is what I am looking for. I suppose using the native re is still worth it in this case? Also, do you think this is still faster than the solution provided by @hegash? – GRoutar Jan 06 '19 at 16:18

There is no need to use regexes for this; Python has a built-in isalnum string method. See below:

string = 'This is a $dollar sign !!'

matches = [word for word in string.split(' ') if word.isalnum()]
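
One caveat worth checking: `str.isalnum()` is Unicode-aware, so it accepts letters and digits outside `[a-zA-Z0-9]`. The sketch below assumes ASCII-only matching is wanted and that Python 3.7+ is available for `str.isascii()`; the helper name `ascii_alnum_words` is hypothetical.

```python
string = 'This is a $dollar sign !!'

# str.split() with no argument splits on any run of whitespace and drops empty strings.
matches = [word for word in string.split() if word.isalnum()]
print(matches)  # ['This', 'is', 'a', 'sign']

# str.isalnum() also accepts non-ASCII letters and digits.
print('café'.isalnum())  # True

def ascii_alnum_words(text):
    """Keep only whitespace-separated words made of ASCII letters and digits."""
    return [w for w in text.split() if w.isascii() and w.isalnum()]

print(ascii_alnum_words('café au lait 42'))  # ['au', 'lait', '42']
```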
hegash
  • Thanks for the heads up, I didn't know. On the other hand, I am looking for the solution with the best performance. I will have it tested, but I'm pretty sure regexes are faster than splitting a string and testing each piece against a condition. – GRoutar Jan 06 '19 at 16:11
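
The performance question raised in this thread can be measured rather than guessed. Below is a rough timing harness using `timeit`, offered as a sketch only: the input size, the repeat count, and the function names are assumptions, and no result is claimed here because timings depend on the input and the Python version.

```python
import re
import timeit

# Illustrative input: the sample sentence repeated, with a separating space.
text = 'This is a $dollar sign !! ' * 1000

LOOKAROUND_RE = re.compile(r'(?:^|(?<= ))[a-zA-Z0-9]+(?= |$)')

def with_regex(s):
    """Lookaround-based regex from the other answer."""
    return LOOKAROUND_RE.findall(s)

def with_split(s):
    """split + isalnum approach from this answer."""
    return [word for word in s.split(' ') if word.isalnum()]

# Time each approach on the same input and print the elapsed seconds.
for func in (with_regex, with_split):
    seconds = timeit.timeit(lambda: func(text), number=20)
    print(f'{func.__name__}: {seconds:.4f} s for 20 runs')
```

Both functions return the same list for this ASCII input, so the comparison is like for like.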

[Edited thanks to Khabz's comment. I misunderstood the question]

Depending on your intention, you could also "split" instead of "match".

>>> matches = re.split(r'(?:\s*\S*[\$\!]+\S*\s*|\s+)', string)

['This', 'is', 'a', 'sign', '']

And in case you need to remove leading or trailing empty string:

>>> matches = [x for x in re.split(r'(?:\s*\S*[\$\!]+\S*\s*|\s+)', string) if x]
['This', 'is', 'a', 'sign']

CertainPerformance's response using lookbehind and lookahead is the most compact. Using split is sometimes advantageous when the exclusion is specified, i.e., when the regex describes what needs to be excluded, as the one above does. In this case, however, it is the inclusion of alphanumeric characters that is specified, so using split() is not a good idea.

user2468968
  • I think "findall" is equivalent. Despite that, the solution you provided doesn't match the requirements. "dollar" shouldn't be a match since the word contains a non-alphanumeric character ("$dollar") – GRoutar Jan 06 '19 at 16:16