Regex to Match Horizontal White Spaces

Question

I need a regex in Python2 to match only horizontal white spaces not newlines.

\s matches all whitespaces including newlines.

>>> re.sub(r"\s", "", "line 1.\nline 2\n")
'line1.line2'

\h does not work at all.

>>> re.sub(r"\h", "", "line 1.\nline 2\n")
'line 1.\nline 2\n'

[\t ] works but I am not sure if I am missing other possible white space characters especially in Unicode. Such as \u00A0 (non breaking space) or \u200A (hair space). There are much more white space characters at the following link: https://www.cs.tut.fi/~jkorpela/chars/spaces.html (dead link)

>>> re.sub(r"[\t ]", "", u"line 1.\nline 2\n\u00A0\u200A\n", flags=re.UNICODE)
u'line1.\nline2\n\xa0\u200a\n'

Do you have any suggestions?

score 14 · Accepted Answer · edited Dec 08 '21 at 00:45

14

I ended up using [^\S\n] instead of specifying all Unicode white spaces.

>>> re.sub(r"[^\S\n]", "", u"line 1.\nline 2\n\u00A0\u200A\n", flags=re.UNICODE)
u'line1.\nline2\n\n'

>>> re.sub(r"[\t ]", "", u"line 1.\nline 2\n\u00A0\u200A\n", flags=re.UNICODE)
u'line1.\nline2\n\xa0\u200a\n'

It works as expected.

edited Dec 08 '21 at 00:45

martineau

119,623
25
170
301

answered Sep 07 '17 at 13:26

Memduh

836
8
18

`flags=re.UNICODE` is significant. – Carson Ip Jun 10 '19 at 03:33
1

@CarsonIp for python2, yes. For Py3, not so much – Ciprian Tomoiagă Jan 15 '21 at 14:40

score 1 · Answer 2 · answered Sep 07 '17 at 12:56

If you only want to match actual spaces, try a plain ( )+ (brackets for readability only*). If you want to match spaces and tabs, try [ \t]+ (+ so that you also match a sequence of e.g. 3 space characters.

Now there are in fact other whitespace characters in unicode, that's true. You are, however, highly unlikely to encounter any of those in written code, and also pretty unlikely to encounter any of the less common whitespace chars in other texts.

If you want to, you can include \u00A0 (non-breaking space, fairly common in scientific papers and on some websites. This is the HTML  ), en-space \u2002 (&ensp;), em-space \u2003 (&emsp;) or thin space \u2009 ( ).

You can find a variety of other unicode whitespace characters on Wikipedia, but I highly doubt it's necessary to include them. I'd just stick to space, tab and maybe non-breaking space (i.e. [ \t\u00A0]+).

What do you intend to match with \h, anyway? It's not a valid "symbol" in regex, as far as I know.

*Stackoverflow doesn't display spaces on the edge of inline code

@MemduhÇağrıDemir That's actually a pretty clever solution, it should work as desired. You'll still wanna add the plus sign at the end, though - otherwise a sequence of spaces will be counted as separate matches instead of one single match (`[^\S\n]+`) — PixelMaster, Sep 07 '17 at 13:42

score 0 · Answer 3 · answered Mar 16 '19 at 21:49

As there are fewer vertical white space characters (line terminators) than horizontal ones, it will be shorter to blacklist the first category than to white list the second category. But you still need to list a few more than just \n:

[^\S\n\v\f\r\u2028\u2029]

Regex to Match Horizontal White Spaces

3 Answers3

Linked

Related