
I have been trying to use pandas to analyze some genomics data. When reading a CSV, I get CParserError: Error tokenizing data. C error: out of memory, and I have narrowed it down to the particular line that causes it: line 43452. As shown below, the error doesn't happen until the parser goes beyond that line.

I have also pasted the relevant lines from less output, with the long sequences truncated; the second column (seq_len) shows the length of each sequence. As you can see, some of the sequences are fairly long, a few million characters each (i.e. bases in genomics). I wonder if the error is the result of too big a value in the CSV. Does pandas impose a limit on the length of a value in a cell? If so, how big is it?

BTW, data.csv.gz is about 9 GB when decompressed and has fewer than 2 million lines. My system has over 100 GB of memory, so I think physical memory is unlikely to be the cause.
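For what it's worth, this is roughly how the raw field lengths around the offending line can be sanity-checked with the standard library alone (a minimal sketch; the csv module caps fields at 128 KB by default, so the limit has to be raised first):

import csv
import gzip
import sys

# The csv module rejects fields over ~128 KB by default; raise the cap so
# multi-megabase sequences don't trigger "field larger than field limit".
csv.field_size_limit(sys.maxsize)

with gzip.open('data.csv.gz') as f:
    for i, row in enumerate(csv.reader(f), start=1):
        if 43450 <= i <= 43455:
            # Compare the declared seq_len against the actual field length.
            print("%d %s %s %d" % (i, row[0], row[1], len(row[3])))
        elif i > 43455:
            break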

Successful read at Line 43451

In [1]: import pandas as pd
In [2]: df = pd.read_csv('data.csv.gz',
                         compression='gzip', header=None,
                         names=['accession', 'seq_len', 'tax_id', 'seq'],
                         nrows=43451)

Failed read at Line 43452

In [1]: import pandas as pd
In [2]: df = pd.read_csv('data.csv.gz',
                         compression='gzip', header=None,
                         names=['accession', 'seq_len', 'tax_id', 'seq'],
                         nrows=43452)
---------------------------------------------------------------------------
CParserError                              Traceback (most recent call last)
<ipython-input-1-036af96287f7> in <module>()
----> 1 import pandas as pd; df = pd.read_csv('filtered_gb_concatenated.csv.gz', compression='gzip', header=None, names=['accession', 'seq_len', 'tax_id', 'seq'], nrows=43452)

/path/to/venv/lib/python2.7/site-packages/pandas/io/parsers.pyc in parser_f(filepath_or_buffer, sep, dialect, compression, doublequote, escapechar, quotechar, quoting, skipinitialspace, lineterminator, header, index_col, names, prefix, skiprows, skipfooter, skip_footer, na_values, na_fvalues, true_values, false_values, delimiter, converters, dtype, usecols, engine, delim_whitespace, as_recarray, na_filter, compact_ints, use_unsigned, low_memory, buffer_lines, warn_bad_lines, error_bad_lines, keep_default_na, thousands, comment, decimal, parse_dates, keep_date_col, dayfirst, date_parser, memory_map, float_precision, nrows, iterator, chunksize, verbose, encoding, squeeze, mangle_dupe_cols, tupleize_cols, infer_datetime_format, skip_blank_lines)
    472                     skip_blank_lines=skip_blank_lines)
    473
--> 474         return _read(filepath_or_buffer, kwds)
    475
    476     parser_f.__name__ = name

/path/to/venv/lib/python2.7/site-packages/pandas/io/parsers.pyc in _read(filepath_or_buffer, kwds)
    254                                   " together yet.")
    255     elif nrows is not None:
--> 256         return parser.read(nrows)
    257     elif chunksize or iterator:
    258         return parser

/path/to/venv/lib/python2.7/site-packages/pandas/io/parsers.pyc in read(self, nrows)
    719                 raise ValueError('skip_footer not supported for iteration')
    720
--> 721         ret = self._engine.read(nrows)
    722
    723         if self.options.get('as_recarray'):

/path/to/venv/lib/python2.7/site-packages/pandas/io/parsers.pyc in read(self, nrows)
   1168
   1169         try:
-> 1170             data = self._reader.read(nrows)
   1171         except StopIteration:
   1172             if nrows is None:

pandas/parser.pyx in pandas.parser.TextReader.read (pandas/parser.c:7544)()

pandas/parser.pyx in pandas.parser.TextReader._read_low_memory (pandas/parser.c:7952)()

pandas/parser.pyx in pandas.parser.TextReader._read_rows (pandas/parser.c:8401)()

pandas/parser.pyx in pandas.parser.TextReader._tokenize_rows (pandas/parser.c:8275)()

pandas/parser.pyx in pandas.parser.raise_parser_error (pandas/parser.c:20691)()

CParserError: Error tokenizing data. C error: out of memory

Lines 43450-43455 of less -N -S output, with the long seq values truncated. The first column is the line number; the rest is the CSV content, separated by commas. The column names are ['accession', 'seq_len', 'tax_id', 'seq'].

43450 FP929055.1,3341681,657313,AAAGAACCTTGATAACTGAACAATAGACAACAACAACCCTTGAAAATTTCTTTAAGAGAA....
43451 FP929058.1,3096657,657310,TTCGCGTGGCGACGTCCTACTCTCACAAAGGGAAACCCTTCACTACAATCGGCGCTAAGA....
43452 FP929059.1,2836123,717961,GTTCCTCATCGTTTTTTAAGCTCTTCTCCGTACCCTCGACTGCCTTCTTTCTCACTGTTC....
43453 FP929060.1,3108859,245012,GGGGTATTCATACATACCCTCAAAACCACACATTGAAACTTCCGTTCTTCCTTCTTCCTC....
43454 FP929061.1,3114788,649756,TAACAACAACAGCAACGGTGTAGCTGATGAAGGAGACATATTTGGATGATGAATACTTAA....
43455 FP929063.1,34221,29290,CCTGTCTATGGGATTTGGCAGCGCAATGCAGGAAAACTACGTCCTAAGTGTGGAGATCGATGC....
zyxue

1 Answer


Well, the last line says it all: the parser doesn't have enough memory to tokenize a chunk of data. I'm not sure how reading the compressed file in blocks works or how much data it loads into memory at once, but it's clear that you will have to control the size of the chunks somehow. I found a solution here:

pandas-read-csv-out-of-memory

and here:

out-of-memory-error-when-reading-csv-file-in-chunk

Please try reading the plain (uncompressed) file line by line and see if it works, or control the chunk size as sketched below.
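If controlling the chunk size is what it takes, something along these lines may work (a sketch only: the chunksize of 100 is a guess you would need to tune, and engine='python' is a slower fallback that bypasses the C tokenizer entirely):

import pandas as pd

# Read the gzipped csv a few rows at a time so the C parser never has to
# tokenize a huge buffer, then stitch the pieces into one DataFrame.
chunks = pd.read_csv('data.csv.gz',
                     compression='gzip', header=None,
                     names=['accession', 'seq_len', 'tax_id', 'seq'],
                     chunksize=100)  # rows here are megabytes wide, so keep chunks small
df = pd.concat(chunks, ignore_index=True)

# If the C tokenizer still runs out of memory, try engine='python' with the
# same arguments; it is much slower but avoids the C parser's buffers.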

asalic
  • The two are essentially the same answer. Reading line by line finishes fine, but that's not what I want; I do want to read it into a DataFrame. – zyxue Sep 25 '15 at 16:13
  • @zyxue I understand, but I suggested line by line to see if it works that way. The error seems to be thrown when it tries to split a line to obtain the fields. My advice is to juggle the following params: **engine**, **nrows**, **chunksize**. Here's the doc: [http://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_csv.html](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_csv.html) – asalic Sep 28 '15 at 05:54