
This is a simple piece of code that uses regex to find a pattern and termcolor to replace each match with a highlighted version of itself; basically, it highlights a required text. The code works fine with almost all patterns. But when I try to match dots (.), the code runs indefinitely and crashes the Jupyter kernel. It would be really helpful if someone could help me with this. Thank you in advance.

import re
from termcolor import colored,cprint
text = "This is a sample text......"
pattern = re.compile(r"\.")
patternlist = pattern.findall(text)
# print(patternlist)

replacelist = [colored(i,"black", "on_yellow", attrs=["bold"]) for i in patternlist]
print(replacelist)
patterns = [i for i in zip(patternlist,replacelist)]
print(patterns)

for pattern, replacement in patterns:
    text = re.sub(pattern, replacement, text)
print(text)

The pattern I used was: pattern = re.compile(r"\."). The findall function seems to work fine, as I get the expected result: ['.', '.', '.', '.', '.', '.']. I expect to get the text with the dots highlighted: This is a sample text......, but I get no result at all; the Jupyter notebook runs indefinitely and then crashes. I verified the pattern with an online regex engine (https://regex101.com/) and it works fine there.

Nithin Mohan

1 Answer


You are creating a replacement string that grows far too large. Modifying your code a bit:

import re
from termcolor import colored
text = "This is a sample text......"
pattern = re.compile(r"\.")
patternlist = pattern.findall(text)
replacement = colored(".", "black", "on_yellow", attrs=["bold"])
patterns = re.sub(pattern, replacement, text)

print(patterns)

This highlights every "." in the sample text in yellow. To elaborate, your patterns list is:

[('.', '\x1b[1m\x1b[43m\x1b[30m.\x1b[0m'), ('.','\x1b[1m\x1b[43m\x1b[30m.\x1b[0m'), ('.','\x1b[1m\x1b[43m\x1b[30m.\x1b[0m'), ('.','\x1b[1m\x1b[43m\x1b[30m.\x1b[0m'), ('.','\x1b[1m\x1b[43m\x1b[30m.\x1b[0m'), ('.','\x1b[1m\x1b[43m\x1b[30m.\x1b[0m')]

Keep in mind that inside your loop the pattern passed to re.sub is the plain string ".", which as a regex matches every character, not just dots. Each pass therefore replaces every character of text, including the characters of the colour escape codes added by the previous pass, with the 19-character highlighted dot. The string grows 19-fold on every pass: after the first iteration you have 27 dots (513 characters), after the second 513 dots (9,747 characters), and after all six passes the string would be over a billion characters long.
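The growth described above can be checked with a small sketch. To keep it independent of termcolor (whose output can vary by version and environment), it assumes a hard-coded escape sequence copied from the patterns list shown earlier, and it deliberately runs only two of the six passes.

```python
import re

# Assumption: this hard-coded ANSI sequence stands in for
# colored(".", "black", "on_yellow", attrs=["bold"]); it is copied from
# the patterns list shown above and is 19 characters long.
HIGHLIGHTED_DOT = "\x1b[1m\x1b[43m\x1b[30m.\x1b[0m"

text = "This is a sample text......"
lengths = [len(text)]

# Run only the first two of the six iterations; the full loop would
# build a string of 27 * 19**6 characters.
for _ in range(2):
    # The unescaped "." matches every character, escape codes included.
    text = re.sub(".", HIGHLIGHTED_DOT, text)
    lengths.append(len(text))

print(lengths)          # [27, 513, 9747]
print(text.count("."))  # 513
print(27 * 19 ** 6)     # 1270238787
```

The last line is the projected length after all six passes, which is why the kernel runs out of memory instead of printing anything.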

Edit for the new information:

import re
from termcolor import colored
text = "This is a sample text......"
pattern = re.compile(r"is|sample|\.")
patternlist = set(pattern.findall(text))

for pattern in patternlist:
    replacement = colored(pattern, "black", "on_yellow", attrs=["bold"])
    text = re.sub(re.escape(pattern), replacement, text)

print(text)

Two things to note. First, set removes the duplicate matches, so each distinct string is replaced only once; this should not change the visible output, only the runtime. Second, re.escape makes sure that any regex metacharacters in the strings returned by re.findall (such as ".") are matched literally. Keep in mind that "is" is matched as a substring, so the "is" in "This" also gets highlighted.
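As an alternative to looping, here is a minimal sketch of a single-pass variant: re.sub accepts a function as the replacement, so every match is highlighted exactly once and text that has already been coloured is never scanned again. The names highlight and words are made up for illustration, and the hard-coded escape codes are an assumption standing in for termcolor's colored so the sketch runs without that package.

```python
import re

def highlight(match):
    """Wrap one match in bold, black-on-yellow ANSI escape codes.

    Assumption: these hard-coded codes stand in for
    termcolor.colored(match.group(0), "black", "on_yellow", attrs=["bold"]).
    """
    return "\x1b[1m\x1b[43m\x1b[30m" + match.group(0) + "\x1b[0m"

words = ["is", "sample", "."]  # literal strings to highlight
# re.escape turns "." into r"\." so it only matches a literal dot.
pattern = re.compile("|".join(re.escape(w) for w in words))

text = "This is a sample text......"
# One pass over the original text; the callback's return value is
# inserted as-is and is not searched again.
highlighted = pattern.sub(highlight, text)
print(highlighted)
```

If one of the literals is a prefix of another, listing the longer one first keeps the alternation from stopping at the shorter match. To skip the "is" inside "This", the alternation could use r"\bis\b" in place of the escaped "is".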

Shorn
  • Thank you for the reply. I understand your point. But I need the for loop when I want to replace multiple items, e.g., in the above case, if I want 'is', 'sample' and '.', I need to use the pattern r"is|sample|\." in a for loop. That is where the problem arises: it works with any other item except the '.'. How can I bypass the problem? – Nithin Mohan Mar 08 '23 at 05:19
  • @NithinMohan I edited the answer to handle multiple items in the regex. – Shorn Mar 08 '23 at 06:01