
I have a CSV file (tab-separated) written in German. I did not create the file. I tried to read it with Python's pandas package as follows:

import pandas as pd
trn_file = "data/train.csv"
pd_train = pd.read_csv(trn_file, delimiter='\t', encoding='utf-8', header=None)
# pd_train is [1153 rows x 12 columns]
# the first couple of rows of pd_train are shown below:
>>> pd_train
        0                                                  1                                     2    3           4   5   6                                                7                                                8                      9     10    11
0       35  Auch in Großbritannien, wo 19 Atomreaktoren in...                              Ausstieg -1.0  2011-03-13  10  10                                     Sunday Times                                     Sunday Times           Sunday Times   NaN     1
1      117  Deswegen sollte Deutschland nicht für [...] we...                              Ausstieg  1.0  2011-04-11  60  62                                 Dietram Hoffmann                                 Dietram Hoffmann                    NaN   NaN   121

When I inspected the dataframe, I realized that the file was not parsed properly: some lines appear merged even though there is a newline character between them. For example, the cell below looks like a single entry but actually contains 4 entries, which should have been in separate rows of the dataframe:

>>> pd_train[1][483]
'Wer keine Brücke will, kann auch keine Brückenmaut verlangen. Eine Klage gegen die Kernbrennstoffsteuer schließe ich nicht aus.\tKonsens/Einigkeit\t-1.0\t2011-05-03\t90\t91\tEon\tJohannes Teyssen\tEon\t\t558\n3\tEin solches schicksalhaftes Langzeitprojekt ist für einen kurzsichtigen Profilierungswettstreit der Parteien ungeeignet. Deshalb müssen wir einen Konsens finden, der von einer breiten Mehrheit auf Dauer getragen wird.\tKonsens/Einigkeit\t1.0\t2011-05-10\t50\t55\tAlois Glück\tAlois Glück\tZentralkomitee der Katholiken\t31.0\t576\n1459\tWir brauchen jetzt keine Kommissionen, sondern einen neuen, breiten Konsens, der dann wirklich hält.\tKonsens/Einigkeit\t1.0\t2011-04-12\t30\t30\tClaudia Roth\tClaudia Roth\tGrüne\t34.0\t671\n1745\tDie Parteispitze zeigt sich offen für einen Konsens. Das würde die Richtigkeit des Atomausstiegs und des grünen Kurses besiegeln", sagt Steffi Lemke, politische Geschäftsführerin der Grünen.'

How can I fix this problem?

Please let me know if I need to provide further information.

EDIT: I tried @Abby's suggestions. Giving the full path changed nothing. When I removed the delimiter and encoding parameters, I got the following error:

pd.read_csv(trn_file,header=None)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "anaconda3/lib/python3.6/site-packages/pandas/io/parsers.py", line 678, in parser_f
    return _read(filepath_or_buffer, kwds)
  File "anaconda3/lib/python3.6/site-packages/pandas/io/parsers.py", line 446, in _read
    data = parser.read(nrows)
  File "anaconda3/lib/python3.6/site-packages/pandas/io/parsers.py", line 1036, in read
    ret = self._engine.read(nrows)
  File "anaconda3/lib/python3.6/site-packages/pandas/io/parsers.py", line 1848, in read
    data = self._reader.read(nrows)
  File "pandas/_libs/parsers.pyx", line 876, in pandas._libs.parsers.TextReader.read
  File "pandas/_libs/parsers.pyx", line 891, in pandas._libs.parsers.TextReader._read_low_memory
  File "pandas/_libs/parsers.pyx", line 945, in pandas._libs.parsers.TextReader._read_rows
  File "pandas/_libs/parsers.pyx", line 932, in pandas._libs.parsers.TextReader._tokenize_rows
  File "pandas/_libs/parsers.pyx", line 2112, in pandas._libs.parsers.raise_parser_error
pandas.errors.ParserError: Error tokenizing data. C error: Expected 11 fields in line 14, saw 12
zwlayer
  • 2 things, 1) try giving the full path of the file, 2) try removing the delimiter and encoding argument – Abby Aug 15 '18 at 13:21
  • @Abby I am going to try both now. However, I wonder what your intuition is for the first option? – zwlayer Aug 15 '18 at 13:22
  • @Abby I edited my question after I tried your suggestions. Unfortunately, they did not work – zwlayer Aug 15 '18 at 13:28
  • What version of python are you using? You can pass `error_bad_lines=False` to skip those lines rather than error on them. – sundance Aug 15 '18 at 13:31
  • My Python version is 3.6.5. Thank you for your suggestion; however, when I use `delimiter='\t'` I don't get any error. Besides, I have very limited data, so I would prefer to fix these lines (somehow) rather than ignore them – zwlayer Aug 15 '18 at 13:34
  • What puzzles me is why the script expects 11 fields when the first lines already have 12 – nicksheen Aug 15 '18 at 13:35
  • @nicksheen It expects 11 fields if I do not set delimiter='\t'. Do you think it shouldn't do that even if I do not set the delimiter? – zwlayer Aug 15 '18 at 13:39
  • Can you try reading the csv with the `quoting = csv.QUOTE_NONE` parameter? – Stef Aug 15 '18 at 13:41
  • @Stef Wow, it seems you solved my problem. If you don't mind, could you explain it now? :) – zwlayer Aug 15 '18 at 13:44

2 Answers

The problem is that some text entries contain quote characters, which mask the delimiters and line feeds that follow them. By specifying quoting = csv.QUOTE_NONE you can switch off this special treatment of quote characters. So use

import csv  # needed for the csv.QUOTE_NONE constant
pd_train = pd.read_csv(trn_file, delimiter='\t', encoding='utf-8', header=None, quoting=csv.QUOTE_NONE)

to read files with occasional quoting characters. See https://docs.python.org/3/library/csv.html:

csv.QUOTE_NONE

Instructs writer objects to never quote fields. When the current delimiter occurs in output data it is preceded by the current escapechar character. If escapechar is not set, the writer will raise Error if any characters that require escaping are encountered.

Instructs reader to perform no special processing of quote characters.
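
The behaviour can be reproduced on a small in-memory sample. The following is a minimal sketch, not the asker's actual data: the three rows in `sample` are made up, with a field that opens with a double quote in row 1 and the matching quote only in row 3, which is an assumption about what the real file looks like around the merged rows.

```python
import csv
import io

import pandas as pd

# Hypothetical tab-separated sample (not the real train.csv).
# Row 1 has a field that starts with a double quote; the next
# double quote only appears in row 3.
sample = (
    '1\t"Ein Zitat ohne Ende.\tAusstieg\n'
    '2\tEin zweiter Satz.\tKonsens\n'
    '3\tDas Ende", sagt sie.\tKonsens\n'
)

# Default quoting: everything between the two quotes is read as one
# field, so the tabs and newlines inside it no longer split anything.
merged = pd.read_csv(io.StringIO(sample), delimiter='\t', header=None)

# QUOTE_NONE: quote characters are ordinary text, so every tab and
# newline is honoured again.
fixed = pd.read_csv(
    io.StringIO(sample),
    delimiter='\t',
    header=None,
    quoting=csv.QUOTE_NONE,
)

print(len(merged), "row(s) with default quoting")
print(len(fixed), "row(s) with quoting=csv.QUOTE_NONE")
```

Note that with QUOTE_NONE the quote characters stay in the data as literal text, so they may need to be stripped afterwards if they are unwanted.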

Stef
pd.read_csv("train.csv", quotechar='"', skipinitialspace=True)
quotechar='"' --  Any commas between these characters shouldn't be treated as new columns.

skipinitialspace=True --  Skip spaces after delimiter.
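
As a rough illustration of what these two parameters do, here is a small sketch on a made-up comma-separated sample (the column names and values are hypothetical). It assumes well-balanced quotes; `quotechar='"'` is already the pandas default, so this call on its own would not be expected to change how a file with unbalanced quotes is parsed.

```python
import io

import pandas as pd

# Hypothetical comma-separated sample with a space after each comma
# and a quoted field that itself contains a comma.
sample = 'id, text, label\n1, "Hello, world", greeting\n'

df = pd.read_csv(
    io.StringIO(sample),
    quotechar='"',          # commas inside "..." stay in one field
    skipinitialspace=True,  # drop the space that follows each comma
)

print(df)
```

Here skipinitialspace matters because the quote character is only treated specially when it is the first character of a field; without it, the leading space would come first.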
Bob Dalgleish
snowy13