How to correctly parse HTML to Unicode strings with pandas?

Question

I'm running a Python program which fetches a UTF-8-encoded web page, and I extract some text from HTML table using pandas(read_html) and write result to csv file

However, when I write this text to a file,all spaces in it gets written in an unexpected encoding (example \xd0\xb9\xd1\x82\xd0\xb8). to solve the problem I added a line i = i.split(" ") after, all spaces in csv file substitutes for characters, the example below:

['0', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '1', '', '', '', '', '', '', '', '', '', '', '', '', '', '2', '', '', '3\n0', '', '', '', '', '', '', '', 'number', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', 'last name', '', 'number', 'plan', 'NaN\n1', '', '', '', '', '', '', '', '', '', 'NaN', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', 'NaN', '', '', 'not', 'NaN\n2', '', '', '', '', '53494580', '', '', '', '', '', '', '', '', '', '+', '(53)494580', '', '', '', '', '', '', '', '', 'NP_551', 'NaN\n3', '', '', '', '', '53494581', '', '', '', '', '', '', '', '', '', '+', '(53)494581', '', '', '', '', '', '', '', '', 'NP_551', 'NaN\n4', '', '', '', '']

I would like to get rid of character ( '', ) Is there a way to fix this? Any pointers would be much appreciated.

code python:

import pandas as pd
import html5lib

filename="1.csv" 
file=open(filename,"w",encoding='UTF-8', newline='\n'); 
output=csv.writer(file, dialect='excel',delimiter =' ') 

r = requests.get('http://10.45.87.12/og?sh=1&CallerName=&Sys=.79.83.86.51&')
pd.set_option('max_rows',10000) 
df = pd.read_html(r.content)

for i in df:
    i = str(i)
    i = i.strip()
    i = i.encode('UTF-8').decode('UTF-8')
    i = i.split(" ")
    output.writerow(i)
 file.close()

can you try changing `i.encode('UTF-8').decode('UTF-8')` to `i.decode('UTF-8')` or just removing that line all together and posting your results? — Mohammad Ali, Jan 28 '18 at 05:06
could you paste the results of `requests.get('http://10.45.87.12/og?sh=1&CallerName=&Sys=.79.83.86.51&')`? I think `pd.read_html` will only work for html tables. I would recommend BeautifulSoup for most HTML parsing tasks @johnred — briancaffey, Jan 28 '18 at 05:27

score 0 · Accepted Answer · answered Jan 28 '18 at 06:59

You can use the filter method to remove of empty values. you can add the below snippet after 'i = i.split(" ")'

A = ['0', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '1', '', '', '', '', '', '', '', '', '', '', '', '', '', '2', '', '', '3\n0', '', '', '', '', '', '', '', 'number', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', 'last name', '', 'number', 'plan', 'NaN\n1', '', '', '', '', '', '', '', '', '', 'NaN', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', 'NaN', '', '', 'not', 'NaN\n2', '', '', '', '', '53494580', '', '', '', '', '', '', '', '', '', '+', '(53)494580', '', '', '', '', '', '', '', '', 'NP_551', 'NaN\n3', '', '', '', '', '53494581', '', '', '', '', '', '', '', '', '', '+', '(53)494581', '', '', '', '', '', '', '', '', 'NP_551', 'NaN\n4', '', '', '', '']
print filter(None, A)

Output:

['0', '1', '2', '3\n0', 'number', 'last name', 'number', 'plan', 'NaN\n1', 'NaN', 'NaN', 'not', 'NaN\n2', '53494580', '+', '(53)494580', 'NP_551', 'NaN\n3', '53494581', '+', '(53)494581', 'NP_551', 'NaN\n4']

How to correctly parse HTML to Unicode strings with pandas?

1 Answers1