How to read txt file in Pandas: Error tokenizing data

Question

I used pandas.read_csv to read a txt file, but I get errors. The process is shown below:

import pandas as pd
Path of the txt file: './Data/fold2_l25431/test.txt'
Example content of test.txt (the first three lines). When read in, each line should be split into three columns: the number ('1', '2', '3') in one column, 'persona' in one column, and the sentence after the colon in one column.

First line: 1 persona: i am adorkable.
Second line: 2 persona: i am book dumb.
Third line: 3 persona: i am token evil teammate.

code: pd.read_csv('./Data/fold2_l25431/test.txt') or pd.read_csv('./Data/fold2_l25431/test.txt', sep=" ")
ParserError: Error tokenizing data. C error: Expected 8 fields in line 6, saw 9

It also shows 'Error tokenizing data. C error: Expected 1 fields in line 9, saw 16'; '\n' just to indicate that a sentence is followed by a line break, there is no such '\n' in the txt. — zeizeiv9, Apr 26 '22 at 12:00
If you use `sep=" "`, some lines have more elements/columns than other lines, but in a `CSV` every line should have the same number of elements/columns. — furas, Apr 26 '22 at 12:08
What result do you expect? Which columns do you want in the dataframe? Maybe you should use `sep=":"` to create two columns: one with `1 persona` (and similar) and one with the rest. — furas, Apr 26 '22 at 12:11
You will have to write your own function to read it, because read_csv can't use different separators and can't count spaces to split only on the first two spaces. — furas, Apr 26 '22 at 12:14
You are right, I used on_bad_lines='skip' due to the difference in the content of each line, but there was a difference in the result, I tried to separate it first with sep="" before merging it and writing it into a function — zeizeiv9, Apr 26 '22 at 12:17

score 0 · Answer 1 · answered Apr 26 '22 at 12:06

Try this:

 import pandas as pd
 pd.read_csv( 'test.txt',header=None ,on_bad_lines='skip')
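A small sketch of what this call does, in case it helps to check the output. It assumes pandas 1.3 or later (where `on_bad_lines` exists) and uses an in-memory sample; the second sample line is made up here to contain a comma. With the default comma separator each line is read as a single column, so this does not produce the three columns asked for, and any line containing a comma is dropped without an error.

```python
import io

import pandas as pd

# Hypothetical sample: the second line contains a comma, unlike the first.
text = (
    "1 persona: i am adorkable.\n"
    "2 persona: i am book dumb, really.\n"
    "3 persona: i am token evil teammate.\n"
)

# The first line fixes the expected field count at 1 (no comma in it).
# Line 2 has 2 comma-separated fields, so it counts as a "bad line" and is skipped.
df = pd.read_csv(io.StringIO(text), header=None, on_bad_lines="skip")

print(df)
print(len(text.splitlines()) - len(df), "line(s) were skipped")
```

Comparing the number of rows with the number of lines in the file, as in the last line, is one way to see how much data was lost.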

maryam_k

score 0 · Answer 2 · answered Apr 26 '22 at 12:08

I cannot reproduce your error.

General advice:

Only use read_csv for csv files
If you have to use a txt file, specify the separator sep=":" and the line terminator lineterminator="\n"
If you think some of the data might be invalid, use on_bad_lines="skip" and check your output
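As a rough illustration of the `sep=":"` advice, the sketch below reads the sample lines into two columns and then splits the first one on its first space. The column names (`head`, `id`, `role`, `sentence`) are made up for this example, and it assumes no sentence contains a colon; a line with a second colon would have too many fields and would raise the same tokenizing error.

```python
import io

import pandas as pd

text = """1 persona: i am adorkable.
2 persona: i am book dumb.
3 persona: i am token evil teammate."""

# Split each line at the colon: "1 persona" goes left, the sentence goes right.
df = pd.read_csv(
    io.StringIO(text),
    sep=":",
    header=None,
    names=["head", "sentence"],  # hypothetical column names
)

# Split "1 persona" on the first space only.
parts = df["head"].str.split(" ", n=1, expand=True)
df["id"] = parts[0]
df["role"] = parts[1]

# Remove the space that followed the colon.
df["sentence"] = df["sentence"].str.strip()

df = df[["id", "role", "sentence"]]
print(df)
```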

Lukas Schmid

Since some contents of the txt files don't match the rules, I'm processing them now after on_bad_lines="skip". – zeizeiv9 Apr 26 '22 at 12:19

score 0 · Answer 3 · answered Apr 26 '22 at 12:11

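The errors come from using a space as the separator (sep=" "): the sentences contain spaces too, so different lines split into different numbers of fields. Use another character (such as , or |) to separate the fields. With commas, the file would look like this: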

1, persona:, i am adorkable.
2, persona:, i am book dumb.
3, persona:, i am token evil teammate.
4, persona:, i am never my fault.
5, persona:, i am honor before reason.
6, persona:, i am jerk with a heart of gold.
7, persona:, i am no social skills.
8, persona:, i am bad liar

It can then be read with pd.read_csv('test1.txt', sep=",", header=None)

The output would be

0   1   2
1   persona:    i am adorkable.
2   persona:    i am book dumb.
3   persona:    i am token evil teammate.
4   persona:    i am never my fault.
5   persona:    i am honor before reason.
6   persona:    i am jerk with a heart of gold.
7   persona:    i am no social skills.
8   persona:    i am bad liar
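One detail worth checking with this layout: each comma is followed by a space, so the parsed values would keep a leading space unless it is stripped. The sketch below uses `skipinitialspace=True`, an existing `read_csv` option, on an in-memory copy of the first three lines. It still assumes no sentence contains a comma.

```python
import io

import pandas as pd

text = """1, persona:, i am adorkable.
2, persona:, i am book dumb.
3, persona:, i am token evil teammate."""

# skipinitialspace=True drops the space that follows each comma,
# so the values are "persona:" and "i am adorkable." without leading spaces.
df = pd.read_csv(io.StringIO(text), sep=",", header=None, skipinitialspace=True)

print(df)
```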

score 0 · Answer 4 · answered Apr 26 '22 at 12:22

Your file is not a CSV, so you may have to write your own function to read it and split it into columns.

I used io only to simulate a file in memory, so everyone can copy and test the example; with a real file you should use open().

text  = '''1 persona: i am adorkable.
2 persona: i am book dumb.
3 persona: i am token evil teammate.
4 persona: i am never my fault.
5 persona: i am honor before reason.
6 persona: i am jerk with a heart of gold.
7 persona: i am no social skills.
8 persona: i am bad liar'''

import io

#f = open('./Data/fold2_l25431/test.txt')
f = io.StringIO(text)

rows = []

for line in f:
    line = line.strip()                  # remove the trailing '\n'
    if not line:                         # skip empty lines
        continue

    first, rest = line.split(' ', 1)     # split only on the first space
    second, third = rest.split(': ', 1)  # split only on the first ": "

    rows.append([first, second, third])

print(rows)

Result:

[
  ['1', 'persona', 'i am adorkable.'], 
  ['2', 'persona', 'i am book dumb.'], 
  ['3', 'persona', 'i am token evil teammate.'], 
  ['4', 'persona', 'i am never my fault.'], 
  ['5', 'persona', 'i am honor before reason.'], 
  ['6', 'persona', 'i am jerk with a heart of gold.'], 
  ['7', 'persona', 'i am no social skills.'], 
  ['8', 'persona', 'i am bad liar']
]

Later you can convert this list to a DataFrame:

import pandas as pd

df = pd.DataFrame(rows, columns=['1', '2', '3'])

print(df)

Result:

   1        2                                3
0  1  persona                  i am adorkable.
1  2  persona                  i am book dumb.
2  3  persona        i am token evil teammate.
3  4  persona             i am never my fault.
4  5  persona        i am honor before reason.
5  6  persona  i am jerk with a heart of gold.
6  7  persona           i am no social skills.
7  8  persona                    i am bad liar
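The same split can also be sketched with pandas string methods instead of a manual loop. This is an alternative technique, not the code from the answer above: it puts the raw lines into a Series and uses `Series.str.extract` with a regular expression. The group names `number`, `label` and `sentence` are chosen here for illustration. Lines that do not match the pattern become rows of NaN, which makes them easy to find.

```python
import io

import pandas as pd

text = """1 persona: i am adorkable.
2 persona: i am book dumb.
3 persona: i am token evil teammate."""

# With a real file: f = open('./Data/fold2_l25431/test.txt')
f = io.StringIO(text)
lines = pd.Series(f.read().splitlines())

# number, then whitespace, then a label up to the first colon, then the rest.
pattern = r"^(?P<number>\d+)\s+(?P<label>[^:]+):\s*(?P<sentence>.*)$"
df = lines.str.extract(pattern)

# Lines that did not match the pattern show up as NaN in every column.
unmatched = lines[df["number"].isna()]

print(df)
print(len(unmatched), "line(s) did not match")
```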
