I have been trying to use pandas to analyze some genomics data. When reading a CSV file, I get the error "CParserError: Error tokenizing data. C error: out of memory", and I have narrowed it down to the particular line that triggers it, which is line 43452. As shown below, the read succeeds with nrows=43451 and fails only once the parser reaches line 43452.
I have also pasted the relevant lines from the less output with the long sequences truncated; the second column (seq_len) shows the length of each sequence. As you can see, some of the sequences are fairly long, at a few million characters (i.e. bases, in genomics terms). I wonder if the error is a result of a single value in the CSV being too big. Does pandas impose a limit on the length of a value in a cell? If so, how big is it?
BTW, data.csv.gz is about 9 GB when decompressed and has fewer than 2 million lines. My system has over 100 GB of memory, so I think physical memory is unlikely to be the cause.
Successful read at Line 43451
In [1]: import pandas as pd
In [2]: df = pd.read_csv('data.csv.gz',
compression='gzip', header=None,
names=['accession', 'seq_len', 'tax_id', 'seq'],
nrows=43451)
Failed read at Line 43452
In [1]: import pandas as pd
In [2]: df = pd.read_csv('data.csv.gz',
compression='gzip', header=None,
names=['accession', 'seq_len', 'tax_id', 'seq'],
nrows=43452)
---------------------------------------------------------------------------
CParserError Traceback (most recent call last)
<ipython-input-1-036af96287f7> in <module>()
----> 1 import pandas as pd; df = pd.read_csv('filtered_gb_concatenated.csv.gz', compression='gzip', header=None, names=['accession', 'seq_len', 'tax_id', 'seq'], nrows=43452)
/path/to/venv/lib/python2.7/site-packages/pandas/io/parsers.pyc in parser_f(filepath_or_buffer, sep, dialect, compression, doublequote, escapechar, quotechar, quoting, skipinitialspace, lineterminator, header, index_col, names, prefix, skiprows, skipfooter, skip_footer, na_values, na_fvalues, true_values, false_values, delimiter, converters, dtype, usecols, engine, delim_whitespace, as_recarray, na_filter, compact_ints, use_unsigned, low_memory, buffer_lines, warn_bad_lines, error_bad_lines, keep_default_na, thousands, comment, decimal, parse_dates, keep_date_col, dayfirst, date_parser, memory_map, float_precision, nrows, iterator, chunksize, verbose, encoding, squeeze, mangle_dupe_cols, tupleize_cols, infer_datetime_format, skip_blank_lines)
472 skip_blank_lines=skip_blank_lines)
473
--> 474 return _read(filepath_or_buffer, kwds)
475
476 parser_f.__name__ = name
/path/to/venv/lib/python2.7/site-packages/pandas/io/parsers.pyc in _read(filepath_or_buffer, kwds)
254 " together yet.")
255 elif nrows is not None:
--> 256 return parser.read(nrows)
257 elif chunksize or iterator:
258 return parser
/path/to/venv/lib/python2.7/site-packages/pandas/io/parsers.pyc in read(self, nrows)
719 raise ValueError('skip_footer not supported for iteration')
720
--> 721 ret = self._engine.read(nrows)
722
723 if self.options.get('as_recarray'):
/path/to/venv/lib/python2.7/site-packages/pandas/io/parsers.pyc in read(self, nrows)
1168
1169 try:
-> 1170 data = self._reader.read(nrows)
1171 except StopIteration:
1172 if nrows is None:
pandas/parser.pyx in pandas.parser.TextReader.read (pandas/parser.c:7544)()
pandas/parser.pyx in pandas.parser.TextReader._read_low_memory (pandas/parser.c:7952)()
pandas/parser.pyx in pandas.parser.TextReader._read_rows (pandas/parser.c:8401)()
pandas/parser.pyx in pandas.parser.TextReader._tokenize_rows (pandas/parser.c:8275)()
pandas/parser.pyx in pandas.parser.raise_parser_error (pandas/parser.c:20691)()
CParserError: Error tokenizing data. C error: out of memory
Lines 43450-43455 of the less -N -S output, with the long seq values truncated. The first column is the line number, followed by the comma-separated CSV content. The column names are ['accession', 'seq_len', 'tax_id', 'seq']
43450 FP929055.1,3341681,657313,AAAGAACCTTGATAACTGAACAATAGACAACAACAACCCTTGAAAATTTCTTTAAGAGAA....
43451 FP929058.1,3096657,657310,TTCGCGTGGCGACGTCCTACTCTCACAAAGGGAAACCCTTCACTACAATCGGCGCTAAGA....
43452 FP929059.1,2836123,717961,GTTCCTCATCGTTTTTTAAGCTCTTCTCCGTACCCTCGACTGCCTTCTTTCTCACTGTTC....
43453 FP929060.1,3108859,245012,GGGGTATTCATACATACCCTCAAAACCACACATTGAAACTTCCGTTCTTCCTTCTTCCTC....
43454 FP929061.1,3114788,649756,TAACAACAACAGCAACGGTGTAGCTGATGAAGGAGACATATTTGGATGATGAATACTTAA....
43455 FP929063.1,34221,29290,CCTGTCTATGGGATTTGGCAGCGCAATGCAGGAAAACTACGTCCTAAGTGTGGAGATCGATGC....
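
To rule out a malformed row independently of pandas, one possible check is to read the same lines with Python's standard csv module and compare the declared seq_len with the actual length of seq. This is only a sketch under assumptions: it targets Python 3 (the traceback above is from Python 2.7), check_row_lengths is a hypothetical helper rather than anything from pandas, and it assumes each row has exactly the four columns listed above.

```python
import csv
import gzip


def check_row_lengths(path, first_line, last_line):
    """Hypothetical helper: compare declared and actual sequence lengths.

    Returns a list of (line_no, n_fields, declared_len, actual_len) for
    CSV records first_line..last_line (1-based, inclusive). declared_len
    and actual_len are None when a row does not have exactly 4 fields.
    """
    # The csv module rejects fields longer than 131072 characters by
    # default, so raise the limit to allow multi-megabase sequences.
    csv.field_size_limit(2**31 - 1)

    results = []
    with gzip.open(path, "rt", newline="") as handle:
        # Records are counted as parsed by csv.reader, which matches the
        # physical line number as long as no field contains a newline.
        for line_no, row in enumerate(csv.reader(handle), start=1):
            if line_no < first_line:
                continue
            if line_no > last_line:
                break
            if len(row) == 4:
                declared_len = int(row[1])  # seq_len column
                actual_len = len(row[3])    # seq column
            else:
                declared_len = actual_len = None
            results.append((line_no, len(row), declared_len, actual_len))
    return results


# Example call for the lines shown above (file name taken from the question):
# for record in check_row_lengths("data.csv.gz", 43450, 43455):
#     print(record)
```

If the declared and actual lengths agree for line 43452 and the field count is 4, the row itself is probably well formed, which would point back at the parser rather than at the data.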