
I have a function that takes in a list of data and then removes any data that matches any of the regexes as defined below:

def clean_data(data):
    # Regex for email, punctuation, common words
    regex_list = ['[\w\.-]+@[\w\.-]+', '[^\P{P}-]+', '\band\b|\bor\b|\bnot\b|\ba\b|\ban\b|\bis\b|\bthe\b|\bof\b|\blike\b']

    for i in data:
        for r in regex_list:
            i = re.sub(r, '', i)
    return data

I defined data as the following:

data = ['this is like my name: Bob.', 'my email is bob@gmail.com']

When I run it in console, this is the output I get:

clean_data(data)

Out[74]: ['this is like my name: Bob.', 'my email is bob@gmail.com']

What am I doing wrong?

tushariyer
  • Your `\b` inside `'\b'` is a backspace char, not a word boundary, use `r'pattern_here'`. – Wiktor Stribiżew Feb 26 '18 at 21:15
  • I thought `\b` was a word boundary? So would I format it like this: `''r'and'|r'or'|r'not'|r'a'|r'an'|r'is'|r'the'|r'of'|r'like''`? – tushariyer Feb 26 '18 at 21:17
  • `r'\band\b|\bor\b|\bnot\b|\ba\b|\ban\b|\bis\b|\bthe\b|\bof\b|\blike\b'` – GalAbra Feb 26 '18 at 21:18
  • You're right, `\b` **is** a word boundary but ``\`` needs to be escaped, so use `r''` instead of having to escape every backslash. See [What exactly do “u” and “r” string flags do, and what are raw string literals?](https://stackoverflow.com/questions/2081640/what-exactly-do-u-and-r-string-flags-do-and-what-are-raw-string-literals) for more info – ctwheels Feb 26 '18 at 21:18
  • @GalAbra My output is no different. It won't even catch the email or punctuation either – tushariyer Feb 26 '18 at 21:19
  • @tushariyer Actually, what is your question here? :) Note that `\P{P}` is not supported by Python `re`. If you plan to match Unicode property classes, use PyPi regex module. – Wiktor Stribiżew Feb 26 '18 at 21:21
  • @tushariyer going off what @WiktorStribizew just mentioned, to gain support for `\P{P}` you can use the [regex](https://pypi.python.org/pypi/regex/) module. – ctwheels Feb 26 '18 at 21:23
  • @WiktorStribiżew my end goal as it were is to take in the list of strings and prune it so that any text that matches any of the regexes is removed. I was not aware that \P{P} was not supported, but the function is not even catching the email – tushariyer Feb 26 '18 at 21:23
  • Ok, 1) install PyPi regex (`pip install regex`), 2) `import regex`, 3) `regex.sub(r'[\w.-]+@[\w.-]+|[^\P{P}-]+|\b(?:and|or|not|an?|is|the|of|like)\b', '', x)` – Wiktor Stribiżew Feb 26 '18 at 21:25
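
The raw-string point raised in the comments can be checked directly. What follows is a minimal sketch that uses only the standard `re` module rather than the third-party `regex` module suggested above. The helper name `strip_noise` is hypothetical, and `[^\w\s-]+` is only a rough stand-in for `[^\P{P}-]+`: it also removes symbols such as `$` and keeps `_`, because `re` has no Unicode property classes.

```python
import re

# In a normal string literal, '\b' is a single backspace character (0x08).
assert '\b' == '\x08'
# In a raw string, r'\b' is two characters (backslash + b), which the
# regex engine then interprets as a word boundary.
assert len(r'\b') == 2

# Hypothetical combined pattern: email, punctuation (approximated), stop words.
NOISE = re.compile(
    r'[\w.-]+@[\w.-]+|[^\w\s-]+|\b(?:and|or|not|an?|is|the|of|like)\b'
)

def strip_noise(text):
    """Return text with every match of NOISE removed."""
    return NOISE.sub('', text)

print(strip_noise('my email is bob@gmail.com'))
print(strip_noise('this is like my name: Bob.'))
```

The whitespace around removed words is left in place, so a follow-up such as `' '.join(result.split())` may be needed if single spaces are wanted.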

1 Answer


When you call re.sub you're actually creating a new string, not modifying the existing one. So i ends up referring to a completely new object, and the string stored in the list is left as it was. You either need to insert the result back into the list or build a new data list.
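
As a minimal sketch of the second option (building a new list), assuming a shortened pattern list so that it runs with the standard `re` module. The name `clean_data_copy` and the simplified patterns are illustrative stand-ins, not the asker's originals:

```python
import re

# Simplified raw-string patterns (assumption: stand-ins for the originals).
PATTERNS = [r'[\w.-]+@[\w.-]+', r'\b(?:is|like)\b']

def clean_data_copy(data):
    """Return a new list of cleaned strings; the input list is left untouched."""
    cleaned = []
    for text in data:
        for pattern in PATTERNS:
            # re.sub returns a new str; rebinding `text` does not touch `data`.
            text = re.sub(pattern, '', text)
        cleaned.append(text)
    return cleaned

data = ['this is like my name: Bob.', 'my email is bob@gmail.com']
result = clean_data_copy(data)
print(result)  # cleaned copies
print(data)    # the original list, unchanged
```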

Here's how you insert it back into the list (I'll stick to the awful convention of calling the string i for demonstration purposes):

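import re

def clean_data(data):
    # Regex for email, punctuation, common words (patterns unchanged from the question)
    regex_list = ['[\w\.-]+@[\w\.-]+', '[^\P{P}-]+', '\band\b|\bor\b|\bnot\b|\ba\b|\ban\b|\bis\b|\bthe\b|\bof\b|\blike\b']

    for k, i in enumerate(data):
        for r in regex_list:
            # re.sub returns a new string, so rebind i to the result
            i = re.sub(r, '', i)
        # Write the fully cleaned string back into the list
        data[k] = i
    return data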
Fred
  • Right, but I'm assigning it back to the same variable, so doesn't that take care of the insertion? Ah, just saw your edit. How would I re-insert it? – tushariyer Feb 26 '18 at 21:20
  • You should imagine that a variable is just holding a reference to the actual object. When you assign a new object to it, the reference to the old object is lost. Also, check the edit again – Fred Feb 26 '18 at 21:25
  • Another thing to keep in mind, the reference to `data` is passed to this function. When you alter `data` in-place like I'm doing in my answer, you don't have to return `data` back, because I altered the object passed to it – Fred Feb 26 '18 at 21:29
  • when I run this, it removes _everything_. I ran it on the following list: `data = ["this is like an my email bob@gmail.com", "an not like totally or fantastic", "poop"]`. My output was `Out[85]: ['', '', '']`. – tushariyer Feb 26 '18 at 21:29
  • Now that's your regexes fault. I think that's out of this question's context – Fred Feb 26 '18 at 21:31
  • Actually I just read the question again and yeah it's still a valid question. But I'm no good in regex so you'll have to get help elsewhere sorry ;) – Fred Feb 26 '18 at 21:33
  • Right. I'm going to accept your answer because it brought me closer, but if you think of something please let me know! – tushariyer Feb 26 '18 at 21:35
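
One likely explanation for the `['', '', '']` output reported above, offered as an assumption about the interpreter rather than something confirmed in the thread: older versions of `re` read the unknown escape `\P` as a literal `P` (Python 3.6 and later reject such a pattern with an error instead). Under that reading, `[^\P{P}-]+` means "one or more characters other than `P`, `{`, `}` or `-`". The sketch below spells that class out explicitly:

```python
import re

# Explicit spelling of '[^\P{P}-]+' when \P is read as a literal 'P'
# (assumed behaviour of older interpreters; newer ones reject the pattern).
literal_class = r'[^P{}-]+'

data = ["this is like an my email bob@gmail.com",
        "an not like totally or fantastic",
        "poop"]

# None of these strings contains 'P', '{', '}' or '-',
# so every character matches and is removed.
result = [re.sub(literal_class, '', s) for s in data]
print(result)  # ['', '', '']
```

If this is what happened, the in-place assignment from the answer is working as intended, and only the punctuation pattern needs replacing, for example with the `regex` module mentioned in the comments on the question.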