Removing python pandas data parse error while reading .docx file

Question

Here is a sample of my data:

YYYYMM q1 q2 q3 q4 q5 q6 q7 q8 q9 q0 d1 d2 d3 d4 d5
197501  2 11 12 26 25 10 29 21 30 22  8  7 14  4 13
197502 27 22  8 20  6 26 21  4 19  9 10  1 11 12 23
197503  8  7 21 22 25  9  4 30  2 19 10 11 28 12 27
197504 29 28 27 17 19  2 30 16 18  3  9 10 11  8 13
197505 11 15 12 31 28 24  1 30 13 18  5  6 16  7 20
197506 24 10 27  8 23 28 25 26  9 22  2 12 29 30  1

When I try to read it with pandas:

df1=pd.read_csv("Qdays_Ddays.docx",low_memory=False) #error_bad_lines=False)

I get this error:

ParserError: Error tokenizing data. C error: Expected 1 fields in line 3, saw 2

How can I fix it?

https://stackoverflow.com/questions/53256091/how-can-i-fix-error-tokenizing-data-on-pandas-csv-reader — ddejohn, Mar 11 '22 at 05:32
```UnicodeDecodeError: 'utf-8' codec can't decode byte 0xb5 in position 1: invalid start byte``` — Prater, Mar 11 '22 at 05:53
Microsoft Word files are not plain text files. Save your data as a plain text file. — tdy, Mar 11 '22 at 05:57

score 0 · Answer 1 · answered Mar 11 '22 at 13:32

You can't read a .docx file with pandas directly, because it is not a plain text file. However, you can read it with python-docx:

import docx  # provided by the python-docx package
import pandas as pd

# Open the Word document
doc = docx.Document("test.docx")

# Read the text of each paragraph in the file
result = [p.text for p in doc.paragraphs]
print(result)

# Then convert the list to a DataFrame (one row per paragraph, in a single column)
df = pd.DataFrame(result)

# If you need a dictionary instead, to_dict() accepts an orientation:
df.to_dict('series')
# or
df.to_dict('split')
# or
df.to_dict('records')
# or
df.to_dict('index')
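
The DataFrame above holds each paragraph as one string, so the columns from the question (YYYYMM, q1, ...) are not split out yet. Here is a minimal sketch of one way to do that, assuming each row of the data is its own paragraph and the values are separated by whitespace. The helper name paragraphs_to_frame is made up for illustration, and the sample lines are copied from the question so the snippet runs without a Word file:

```python
import io

import pandas as pd


def paragraphs_to_frame(lines):
    """Parse whitespace-separated text lines (header first) into a DataFrame.

    Hypothetical helper: assumes the first non-empty line holds the column names.
    """
    # Drop empty paragraphs and join the rest into one block of text
    text = "\n".join(line.strip() for line in lines if line.strip())
    # sep=r"\s+" treats any run of spaces or tabs as one delimiter
    return pd.read_csv(io.StringIO(text), sep=r"\s+")


# With python-docx, these lines would come from something like:
#   lines = [p.text for p in docx.Document("Qdays_Ddays.docx").paragraphs]
lines = [
    "YYYYMM q1 q2 q3 q4 q5 q6 q7 q8 q9 q0 d1 d2 d3 d4 d5",
    "197501  2 11 12 26 25 10 29 21 30 22  8  7 14  4 13",
    "197502 27 22  8 20  6 26 21  4 19  9 10  1 11 12 23",
]

df = paragraphs_to_frame(lines)
print(df)
```

If the data is stored in a Word table instead of plain paragraphs, doc.paragraphs will probably not contain it. In that case doc.tables is the place to look.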
