149

I've been reading a tab-delimited data file on Windows with Pandas/Python without any problems. The data file contains notes in the first three lines, followed by a header row.

df = pd.read_csv(myfile,sep='\t',skiprows=(0,1,2),header=(0))

I'm now trying to read this file with my Mac. (My first time using Python on Mac.) I get the following error.

pandas.parser.CParserError: Error tokenizing data. C error: Expected 1
fields in line 8, saw 39

If I set the error_bad_lines argument of read_csv to False, I get the following output, which continues through the last row.

Skipping line 8: expected 1 fields, saw 39
Skipping line 9: expected 1 fields, saw 125
Skipping line 10: expected 1 fields, saw 125
Skipping line 11: expected 1 fields, saw 125
Skipping line 12: expected 1 fields, saw 125
Skipping line 13: expected 1 fields, saw 125
Skipping line 14: expected 1 fields, saw 125
Skipping line 15: expected 1 fields, saw 125
Skipping line 16: expected 1 fields, saw 125
Skipping line 17: expected 1 fields, saw 125
...

Do I need to specify a value for the encoding argument? It seems as though I shouldn't have to because reading the file works fine on Windows.

user3062149
  • Are you using the exact same version of pandas on both OSes? Can you provide some sample data that illustrates the problem on Mac? – joris Jan 12 '15 at 08:51
  • unrelated: do you understand the difference between: `(0)` and `(0,)` in Python? Note: `(0)` is `0` and `(0,)` is `0,` -- comma creates a tuple (except an empty one), not parentheses. – jfs Jan 12 '15 at 09:30
  • Have you tried `df = pd.read_table(myfile, skiprows=[0,1,2], header=0)`? – pbreach Jan 12 '15 at 23:12
  • Hi all. Thanks for the suggestions. I produced a temporary solution but may need to revisit this issue and look for a better one in the future. If and when I do, I will look further into your suggestions. My temporary solution was to take the csv file I had (which I had previously converted to the problematic tab-delimited file using Excel) and save it as a .tsv with Google Docs. I used Google Docs only because it was the most convenient application available to me at the time. This conversion worked. Pandas was able to read the file correctly, I believe, and move on to the rest of my code. – user3062149 Jan 20 '15 at 18:32
  • I suspect the issue you are seeing here with your mac is line terminators. Spreadsheets made on a mac can cause all sorts of fun behaviors with various libraries, including the csv_reader lib in python – brad sanders Jun 20 '16 at 21:02
  • @bradsanders I'm not sure what the source was of the original encoding of the file. It could have been on a mac or windows. I think what would be helpful as an answer would be suggestions on quick diagnostics to help determine what about the characters or overall file encoding was causing the problem. – user3062149 Jun 22 '16 at 14:16

3 Answers

227

The biggest clue is the rows are all being returned on one line. This indicates line terminators are being ignored or are not present.

You can specify the line terminator for read_csv. Files created on a Mac may end lines with \r (the classic Mac OS convention) rather than the Linux standard \n, or better still the belt-and-suspenders approach of Windows with \r\n.

pandas.read_csv(filename, sep='\t', lineterminator='\r')
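Before picking a lineterminator value, it can help to check which terminators the file actually contains. The following is a minimal sketch of such a check, not part of the original answer; the helper name sniff_file is hypothetical, and it assumes you pass it the first few kilobytes of the file read in binary mode.

```python
import codecs


def sniff_file(raw: bytes) -> dict:
    """Report a leading byte-order mark and line-terminator counts.

    Hypothetical helper for illustration. The terminator counts are only
    reliable for ASCII-compatible encodings; in UTF-16 each character is
    padded with extra bytes, so a BOM result of utf-16-* is itself the clue.
    """
    boms = {
        codecs.BOM_UTF8: "utf-8-sig",
        codecs.BOM_UTF16_LE: "utf-16-le",
        codecs.BOM_UTF16_BE: "utf-16-be",
    }
    # First BOM that matches the start of the data, or None if there is none.
    bom = next((name for mark, name in boms.items() if raw.startswith(mark)), None)
    crlf = raw.count(b"\r\n")
    return {
        "bom": bom,
        "crlf": crlf,                          # Windows-style \r\n
        "cr_only": raw.count(b"\r") - crlf,    # bare \r
        "lf_only": raw.count(b"\n") - crlf,    # bare \n
    }


# Usage on a real file (the path is a placeholder):
#     with open("my_data.txt", "rb") as f:
#         print(sniff_file(f.read(4096)))
print(sniff_file(b"a\tb\rc\td\r"))
```

If cr_only dominates, lineterminator='\r' is worth trying; if a UTF-16 byte-order mark is reported, the encoding is the more likely culprit.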

You could also open all your data using the codecs package. This may increase robustness at the expense of document loading speed.

import codecs
import pandas

# Open for reading and decode as UTF-16 (the 'U' mode flag was removed in Python 3.11)
doc = codecs.open('document', 'r', 'UTF-16')

df = pandas.read_csv(doc, sep='\t')
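As a comment on this answer points out, read_csv also accepts an encoding argument, which avoids the separate codecs step. Here is a small sketch of that variant, assuming the file really is UTF-16; the sample contents, column names, and file name are made up for illustration and mirror the question's layout of three note lines followed by a header.

```python
import os
import tempfile

import pandas as pd

# Hypothetical sample data: three note lines, a header row, two data rows.
text = "note 1\nnote 2\nnote 3\ncol_a\tcol_b\n1\t2\n3\t4\n"

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "sample_utf16.txt")
    # Write the sample as UTF-16 so there is something to read back.
    with open(path, "w", encoding="utf-16", newline="") as f:
        f.write(text)

    # Let pandas do the decoding; skip the three note lines before the header.
    df = pd.read_csv(path, sep="\t", skiprows=3, header=0, encoding="utf-16")

print(df)
```

If the encoding guess is wrong, read_csv will typically raise a decoding error or return garbled column names, so this is cheap to try.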
brad sanders
  • Adding the codecs piece of code helped me. Then I realized there is a parameter in read_csv that does the same thing. I added encoding='utf-16' and it fixed the issue for me. – Mikhail Venkov Nov 15 '17 at 21:57
16

Another option would be to add engine='python' to the command:

pandas.read_csv(filename, sep='\t', engine='python')

user3479780
0

You can use the delimiter parameter:

input_data=pd.read_csv("my_data.txt",header=0,delimiter="\t")
Ali karimi
  • He already did, since the `delimiter` parameter is actually the same as the `sep` one. See [pd documentation](https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html) – Asriel Jun 28 '23 at 06:21