I have a large dataframe in pandas in which I'm iterating through a single column (containing cells of strings) for data cleaning. The data is super noisy and contains tons of HTML characters and C-style Unicode escapes (ex. 'some text here\u00a0 maybe some other text' or '\u2013').
I've filtered out the HTML, but the escape sequences remain, and I'd like to remove them to leave the most readable text possible. My current idea is to convert the string entirely into its unicode form (e.g. u'\u00a0') and then convert it back into a plain string before reassigning it to the cell, hoping the round trip eliminates all of these codes. However, I've been looking all day for something to do the conversion with, and I can't find anything that works for me. What's an easy way to eliminate these substrings?
I've tried:
u'some string' --> doesn't work because I'm using a variable and not a literal
string.encode('utf-8')
string.decode('utf-8')
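To make those attempts concrete, here's a minimal repro using the sample string from above (the variable names are just for illustration):

```python
# Sample string with the same noise described above
s = 'some text here\u00a0 maybe some other text \u2013'

# Encoding just gives me bytes -- the characters are still there, as UTF-8 sequences
encoded = s.encode('utf-8')
print(encoded)  # b'some text here\xc2\xa0 maybe some other text \xe2\x80\x93'

# Decoding those bytes is a no-op round trip: the original string comes back unchanged
decoded = encoded.decode('utf-8')
print(decoded == s)         # True
print('\u00a0' in decoded)  # True -- the non-breaking space survives
```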
Here's the current code I'm operating on:
'''
#Importing things
import pandas as pd
import re

file_name = 'myfile.json'
df = pd.read_json(file_name)

for x in range(len(df['col'])):
    note = df.iloc[x]['col']

    #BEGIN FILTERING OUT HTML
    pos1 = note.find('<')
    pos2 = note.find('>', pos1)
    while pos1 != -1 and pos2 != -1:
        note = note[:pos1] + note[pos2 + 1:]  #cut out the <...> tag
        pos1 = note.find('<')
        pos2 = note.find('>', pos1)

    #'-' placed last in the class so it's a literal, not a range
    note = ' '.join(re.findall(r"[\w%.'-]+", note))

    #SOMETHING TO REMOVE UNICODE HERE
    df.at[x, 'col'] = note

#Continues on to save file
df.to_json('newfile.json', orient='records')
'''
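For clarity, the kind of transformation I'm after at the #SOMETHING TO REMOVE UNICODE HERE step would look like this -- one candidate I've come across is unicodedata.normalize plus an ASCII encode with errors='ignore', though I don't know if it's the right tool here (a sketch only, not verified against my full dataset):

```python
import unicodedata

s = 'some text here\u00a0 maybe some other text \u2013'

# NFKD decomposition maps compatibility characters (e.g. the non-breaking
# space U+00A0) to plain equivalents; anything left that ASCII can't
# represent (e.g. the en dash U+2013) is dropped by errors='ignore'
cleaned = unicodedata.normalize('NFKD', s).encode('ascii', 'ignore').decode('ascii')
print(repr(cleaned))  # 'some text here  maybe some other text '
```

The worry is whether this throws away characters I'd rather map to something readable instead of just deleting.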