
I am trying to read some data into a Pandas DataFrame and am running into problems because of the volume of the data.

Specs of PC:

RAM 32 GB

Intel Core i7 @ 4 GHz

Setup:

The data is in a MySQL DB: 9 columns (7 int, 1 date, 1 datetime), 22 million rows. The DB is on the local machine, so there are no network bandwidth issues.

I tried to read directly from the MySQL server - it never finishes.

import pandas as pd
import sqlalchemy

engine = sqlalchemy.create_engine('mysql+pymysql://root:@localhost:3306/database')

search_df = pd.read_sql_table('search', engine)

I checked SO and got the impression that, instead of using the connector, it would be better to parse a CSV. So I exported the table to CSV.

CSV file size - 1.5GB

My code:

dtype = {
    'search_id' : int,
    'job_count_total' : int,
    'job_count_done' : int,
    'city_id_start' : int,
    'city_id_end' : int,
    'date_start' : str,
    'datetime_create' : str,
    'agent_id' : int,
    'ride_segment_found_cnt' : int
}

search_df = pd.read_csv('search.csv', sep=',', dtype=dtype)

I tried both engines, c and python, different chunk sizes, low_memory as both True and False, with and without specifying dtypes, but I still get a MemoryError.

I tried everything mentioned in the linked question (the one mine was marked as a duplicate of), but nothing changed.

I spotted only two differences:

If I parse without chunks, I get the MemoryError during parsing.

When I parse in chunks, it happens during concatenation into one DataFrame.

Also, a chunk size of 5,000,000 rows gives the error during parsing; anything smaller, during concatenation.
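For reference, the chunked variant looked roughly like this (a minimal sketch; the chunksize value is just an example of the sizes I tried):

chunks = pd.read_csv('search.csv', sep=',', dtype=dtype, chunksize=1_000_000)
search_df = pd.concat(chunks)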

Here is the error message I get on concatenation:

pandas.errors.ParserError: Error tokenizing data. C error: out of memory
  • @Vaishali, I would like to disagree with you on the closure of the question. I tried everything that is mentioned in the post you added (different engines, dtypes, low memory and chunks, and nothing helped me). Could you reopen the question? Thank you in advance – Simon Osipov Dec 29 '18 at 14:36
  • I have reopened the question, this is the [dupe](https://stackoverflow.com/questions/24251219/pandas-read-csv-low-memory-and-dtype-options) link. – Vaishali Dec 29 '18 at 15:23

1 Answer


Basically, the problem was with memory. I played a bit with the chunk size and applied the filtering that I had later in the code to each chunk as it was read. That allowed the DataFrame to fit into memory.
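A minimal sketch of what that looks like, reusing the CSV and dtype mapping from the question (the chunksize and the filter condition are placeholders; substitute whatever filtering you actually need):

filtered_chunks = []
for chunk in pd.read_csv('search.csv', sep=',', dtype=dtype, chunksize=1_000_000):
    # hypothetical filter: keep only the rows needed later, so the concatenated result is smaller
    chunk = chunk[chunk['job_count_done'] > 0]
    filtered_chunks.append(chunk)

search_df = pd.concat(filtered_chunks, ignore_index=True)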

Simon Osipov