0

I need to create a regular expression, that would match only the characters NOT in the windows-1251 encoding character set, to detect if there are any characters in a given piece of text that would violate the encoding. I tried to do it through the [^\u0000-\u044F]+ expression, however it is also matching some characters that are actually in line with the encoding.

Appreciate any help on the issue

  • https://www.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/WindowsBestFit/bestfit1251.txt – Hans Passant May 19 '22 at 12:16
  • There is no way to fit 1102 code points into an 8-bit encoding; it accommodates a maximum of 256 code points (but actually a few less, as some of the range is undefined). – tripleee May 19 '22 at 12:46
  • Why don't you switch to `utf-8` - then you will never have that problem. – Poul Bak May 19 '22 at 13:26
  • @PoulBak The encoding is not up to me, unfortunately. I just need to check that an XML document is in line with it, and when it's not, I need to find the violating characters. The only way I can do it is through regex, it's a program's limitation – alexherranz May 21 '22 at 10:46
  • What tool/programming language do you use? There are many flavors of Regex. May be we could even provide an easier solution. Also note that using Windows-1251 is not supported in for instance .net core. – Poul Bak May 21 '22 at 11:06

1 Answers1

1

No language specified, but in Python no need for a regex with sets. Create a set of all Unicode code points that are members of Windows-1251 and subtract it from the set of the text. Note that only byte 98h is not used in Windows-1251 encoding:

>>> # Create the set of characters in code page 1251
>>> cp1251 = set(bytes(range(256)).decode('cp1251',errors='ignore'))
>>> set('This is a test \x98 马') - cp1251
{'\x98', '马'}

As a regular expression:

>>> import re
>>> text = ''.join(cp1251) # string of all Windows-1251 codepoints from previous set
>>> text
'\x00\x01\x02\x03\x04\x05\x06\x07\x08\t\n\x0b\x0c\r\x0e\x0f\x10\x11\x12\x13\x14\x15\x16\x17\x18\x19\x1a\x1b\x1c\x1d\x1e\x1f !"#$%&\'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\\]^_`abcdefghijklmnopqrstuvwxyz{|}~\x7f\xa0¤¦§©«¬\xad®°±µ¶·»ЁЂЃЄЅІЇЈЉЊЋЌЎЏАБВГДЕЖЗИЙКЛМНОПРСТУФХЦЧШЩЪЫЬЭЮЯабвгдежзийклмнопрстуфхцчшщъыьэюяёђѓєѕіїјљњћќўџҐґ–—‘’‚“”„†‡•…‰‹›€№™'
>>> not_cp1251 = re.compile(r'[^\x00-\x7f\xa0\xa4\xa6\xa7\xa9\xab-\xae\xb0\xb1\xb5-\xb7\xbb\u0401-\u040c\u040e-\u044f\u0451-\u045c\u045e\u045f\u0490\u0491\u2013\u2014\u2018-\u201a\u201c-\u201e\u2020-\u2022\u2026\u2030\u2039\u203a\u20ac\u2116\u2122]')
>>> not_cp1251.findall(text) # all cp1251 text finds no outliers
[]
>>> not_cp1251.findall(text+'\x98') # adding known outlier
['\x98']
>>> not_cp1251.findall('马克'+text+'\x98') # adding other outliers
['马', '克', '\x98']
Mark Tolonen
  • 166,664
  • 26
  • 169
  • 251