
Performing exploratory data analysis is the first step in any machine learning project. I mostly use pandas to explore datasets that fit in memory... but I would like to know how to do data cleaning, handle missing data and outliers, make single-variable plots and density plots of how a feature impacts the label, compute correlations, etc.
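For context, here is a minimal sketch of the kind of in-memory workflow I mean (the file name and column names are just examples):

```python
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("data.csv")

# Missing data: inspect, then impute or drop.
print(df.isna().sum())
df["age"] = df["age"].fillna(df["age"].median())
df = df.dropna(subset=["label"])

# Outliers: e.g. clip a numeric feature to its 1st/99th percentiles.
low, high = df["income"].quantile([0.01, 0.99])
df["income"] = df["income"].clip(low, high)

# Single-variable distribution plot.
df["income"].hist(bins=50)
plt.show()

# Density of a feature split by label, and pairwise correlations.
df.groupby("label")["income"].plot(kind="kde", legend=True)
plt.show()
print(df.select_dtypes("number").corr())
```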

Pandas is easy and intuitive for doing data analysis in Python, but I find it difficult to handle multiple bigger dataframes in pandas due to limited system memory.

I mean datasets that are greater than the size of RAM... hundreds of gigabytes.

I have seen tutorials where they use Spark to filter out rows based on rules and generate a dataframe that fits in memory... so eventually there is always data that resides entirely in memory, but I want to know how to work with a big dataset itself and perform exploratory data analysis on it.
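What I have seen in those tutorials looks roughly like this (the path, columns and filter condition are hypothetical):

```python
from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.appName("eda").getOrCreate()

# The full dataset stays distributed on the cluster...
events = spark.read.parquet("s3://bucket/events/")

# ...rules are applied to cut it down...
subset = events.filter(F.col("country") == "DE").select("user_id", "amount", "label")

# ...and only the filtered slice is collected into pandas for the usual EDA.
pdf = subset.toPandas()
print(pdf.describe())
```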

Another challenge would be visualizing big data for exploratory data analysis. It's easy to do with packages like seaborn or matplotlib if the data fits in memory, but how can it be done for big data?

AVR
  • Yes, dask can do these things, and so can spark. So what exactly do you want to do, what have you tried, where is the problem? Have you tried tutorials? – mdurant Aug 03 '18 at 01:59
  • Most of my projects at work and the tutorials I tried during training didn't have this massive scale of data... once I used 35 GB of data, but when I filtered it using Spark it fit in memory, so I just used the tutorial way of doing things... Now I need to reskill myself, but I can't find helpful information and things are overwhelming. I would be grateful if you can point to some reference code or tutorial that I can adapt for my needs. Thanks... – AVR Aug 03 '18 at 12:20
  • https://github.com/dask/dask-tutorial/blob/master/04_dataframe.ipynb – mdurant Aug 03 '18 at 13:33
  • Thanks, it's really helpful – AVR Aug 03 '18 at 15:28
  • @mdurant can you also point to some tutorial on visualizing large datasets that don't fit in memory and were wrangled using Dask? – AVR Aug 07 '18 at 05:04
  • Generally you'll want to aggregate to something small enough to fit in memory and plot normally, but there are examples such as http://pyviz.org/tutorial/10_Working_with_Large_Datasets.html – mdurant Aug 07 '18 at 12:43

1 Answer


To put up something concrete:

  • normally you will want to reduce your data, by aggregation, sampling, etc., to something small enough that a direct visualisation makes sense (see the sketch after this list)

  • some tools exist for directly dealing with bigger-than-memory (Dask) data to create visuals. One good link was this: http://pyviz.org/tutorial/10_Working_with_Large_Datasets.html
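For the first point, a minimal sketch of the aggregate-with-Dask-then-plot-with-matplotlib approach (the file pattern and column names are placeholders):

```python
import dask.dataframe as dd
import matplotlib.pyplot as plt

# Lazily points at many parquet files; nothing is loaded yet.
ddf = dd.read_parquet("data/*.parquet")

# The aggregation runs out-of-core, partition by partition; .compute()
# returns a small, ordinary pandas object that fits in memory.
summary = ddf.groupby("category")["amount"].agg(["mean", "count"]).compute()

# From here on it is a normal in-memory plot.
summary["mean"].plot(kind="bar")
plt.ylabel("mean amount")
plt.show()
```

For plotting the raw, un-aggregated data directly, the datashader/holoviews stack covered in the pyviz link above rasterises the points into a fixed-size image, so only that image ever has to fit in memory.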

mdurant