I am trying to read some information into a Pandas DataFrame and am facing a problem due to the volume of the data.
Specs of PC:
RAM 32 GB
Intel Core i7, 4 GHz
Setup:
Data is in a MySQL DB: 9 columns (7 int, 1 date, 1 datetime), 22 million rows. The DB is on the local machine, so there are no network bandwidth issues.
I tried to read directly from the MySQL server - it never finishes.
import pandas as pd
import sqlalchemy

engine = sqlalchemy.create_engine('mysql+pymysql://root:@localhost:3306/database')
search_df = pd.read_sql_table('search', engine)
I checked SO and got the impression that it is better to parse a CSV than to use the connector, so I exported the table to CSV.
CSV file size - 1.5GB
My code:

dtype = {
    'search_id': int,
    'job_count_total': int,
    'job_count_done': int,
    'city_id_start': int,
    'city_id_end': int,
    'date_start': str,
    'datetime_create': str,
    'agent_id': int,
    'ride_segment_found_cnt': int
}

search_df = pd.read_csv('search.csv', sep=',', dtype=dtype)
I tried both engines, c and python, different chunk sizes, low_memory as True and False, specified dtypes and not, but I still get a MemoryError.
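The chunked variant is essentially the following (a minimal sketch: the chunksize shown is one of the values I tried, and ignore_index in the concat call is an assumption):

# dtype is the mapping defined above
chunks = pd.read_csv('search.csv', sep=',', dtype=dtype, chunksize=1_000_000)

# concatenating into a single DataFrame - this is where the error appears with smaller chunks
search_df = pd.concat(chunks, ignore_index=True)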
I tried everything mentioned in the question linked above (the one mine was marked as a duplicate of), but nothing changes.
I spotted only two differences:
If I parse without chunks, I get the MemoryError during parsing.
When I parse in chunks, it happens on concatenation into one DataFrame.
Also, chunking by 5_000_000 rows gives the error on parsing; smaller chunks give it on concatenation.
Here is the error message on concatenation:
pandas.errors.ParserError: Error tokenizing data. C error: out of memory