
I have a large dataframe in pandas in which I'm iterating through a single column (containing cells of strings) for data cleaning. The data is super noisy and contains tons of HTML characters and C++-style Unicode escapes (e.g. 'some text here\u00a0 maybe some other text' or '\u2013').

I've filtered out the HTML, but the Unicode remains, and I really want to get rid of it to leave the most readable text possible. My current idea is to turn the variable holding the string into a unicode literal (e.g. u'\u00a0') and then convert it back into a str for reassignment to the cell, hoping that somehow eliminates all of these codes. However, I've been looking all day for something to do the conversion, and I can't find anything that works for me. What's an easy way to eliminate these substrings?
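To make the goal concrete, here's a minimal sketch of the kind of transformation I'm after (using `unicodedata` is just one idea I've seen suggested, not something I've gotten working on the dataframe yet):

```python
import unicodedata

# A sample of the noisy input described above.
noisy = 'some text here\u00a0 maybe some other text \u2013 more'

# Normalize to compatibility form (\u00a0 becomes a plain space),
# then drop anything that has no ASCII representation (\u2013 is removed).
cleaned = unicodedata.normalize('NFKD', noisy).encode('ascii', 'ignore').decode('ascii')
print(cleaned)
```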

I've tried:

`u'some string'` --> doesn't work because I'm using a variable and not a literal

`string.encode('utf-8')`

`string.decode('utf-8')`
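Concretely, the encode/decode round-trips I tried look like this (they run without error, but the characters survive the trip unchanged, which is why they don't help):

```python
s = 'text\u00a0with\u2013junk'

b = s.encode('utf-8')     # bytes; the characters are still encoded in there
back = b.decode('utf-8')  # round-trips to the identical string

assert back == s          # nothing was removed
```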

Here's the current code I'm operating on:

''' #Importing things
import re
import pandas as pd

file_name = 'myfile.json'
df = pd.read_json(file_name)

for x in range(len(df['col'])):
    note = df.iloc[x]['col']
    # BEGIN FILTERING OUT HTML
    # Strip every <...> tag. The while condition already guarantees both
    # brackets were found, so no inner check is needed (my earlier inner
    # `if` could loop forever because find() returning 0 is falsy).
    pos1 = note.find('<')
    pos2 = note.find('>', pos1)
    while pos1 != -1 and pos2 != -1:
        note = note.replace(note[pos1:pos2 + 1], '')
        pos1 = note.find('<')
        pos2 = note.find('>', pos1)
    note = ' '.join(re.findall(r"[\w%-.']+", note))

    # SOMETHING TO REMOVE UNICODE HERE

    df.at[x, 'col'] = note

# Continues on to save the file
df.to_json('newfile.json', orient='records')
'''
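For the `SOMETHING TO REMOVE UNICODE HERE` step, the best partial idea I have so far is substituting a known blacklist of characters (a sketch only; `bad_chars` here is just the two examples from above, not my full set):

```python
import re

bad_chars = '\u00a0\u2013'  # just the two example characters; the real set is larger
note = 'some text here\u00a0 maybe some other text \u2013'

# Replace each unwanted character with a space, then collapse whitespace.
note = re.sub('[%s]' % re.escape(bad_chars), ' ', note)
note = re.sub(r'\s+', ' ', note).strip()
print(note)  # → some text here maybe some other text
```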

Bonnie
  • Let me know if that doesn't actually answer your question. – piRSquared Jun 03 '19 at 21:16
  • Unfortunately, I don't think it does. I have punctuation that I need to keep inside my text, and I've already found ways to filter out most of the annoying characters. The issue is that the text in these columns converted direct HTML code into a string. The HTML is gone now, but I have things like '\u00a0', '\u2013' etc. that are spread all across the various strings without clear delimiters for replacement/string split if need be. I've tried to remove them in a few ways unsuccessfully. – Bonnie Jun 04 '19 at 13:31
    You can easily filter those out with the same techniques shown in that dup target. I'm still convinced this is what you need. `chars = '\u2013\u00a0'; transtab = str.maketrans(dict.fromkeys(chars, '')); text = 'My \u2013strange\u00a0 text'; print(text); print(text.translate(transtab))` It isn't necessarily about punctuation but about how to filter out stuff. – piRSquared Jun 04 '19 at 13:53
    If you need to retain certain punctuation, try the first solution in that dupe: `my_desired_punctuation = r'.!?'; df['text'].str.replace(r'[^\w\s{}]+'.format(my_desired_punctuation), '')` – cs95 Jun 04 '19 at 13:56

0 Answers