Performing exploratory data analysis is the first step in any machine learning project. I mostly use pandas for data exploration on datasets that fit in memory, but I would like to know how to do the same things at scale: data cleaning, handling missing data and outliers, single-variable plots, density plots of how a feature impacts the label, correlations, and so on.
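For context, this is roughly the in-memory workflow I mean (a minimal sketch; the file name, column names, and quantile thresholds are made up):

```python
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Hypothetical dataset: "feature" is numeric, "label" is a binary target.
df = pd.read_csv("data.csv")

# Data cleaning: drop duplicates, fill missing values with the median.
df = df.drop_duplicates()
df["feature"] = df["feature"].fillna(df["feature"].median())

# Outlier handling: clip to the 1st and 99th percentiles.
low, high = df["feature"].quantile([0.01, 0.99])
df["feature"] = df["feature"].clip(low, high)

# Single-variable plot, plus density of the feature split by label.
sns.histplot(df["feature"])
sns.kdeplot(data=df, x="feature", hue="label")
plt.show()

# Correlation matrix over numeric columns.
print(df.corr(numeric_only=True))
```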
Pandas is easy and intuitive for data analysis in Python, but I have difficulty handling multiple large dataframes at once because of limited system memory.
My datasets are larger than the available RAM, on the order of hundreds of gigabytes.
I have seen tutorials where Spark is used to filter rows based on rules and produce a dataframe that fits in memory, so eventually there is always data that resides entirely in memory. But I want to know how to work with the full big dataset and perform exploratory data analysis on it.
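The pattern in those tutorials looks roughly like this sketch (the path, column names, and filter rule are hypothetical):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("eda-filter").getOrCreate()

# Read the full dataset lazily, filter it down with a rule,
# then collect the small result into a pandas dataframe.
big = spark.read.csv("s3://bucket/events/*.csv", header=True, inferSchema=True)
subset = big.filter(big.amount > 100).select("amount", "label")

# Only the filtered subset is materialized in driver memory.
pdf = subset.toPandas()
print(pdf.describe())
```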
Another challenge is visualizing big data for exploratory data analysis. It is easy with packages like seaborn or matplotlib when the data fits in memory, but how can it be done for big data?
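The closest workaround I can think of is pre-aggregating in chunks and plotting only the aggregate, roughly like the sketch below (the file name, column name, and bin range are made up), but I do not know whether this approach generalizes:

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Fix bin edges up front so per-chunk histograms can be summed
# without ever loading the full column into memory.
bins = np.linspace(0, 1000, 101)
counts = np.zeros(len(bins) - 1)

for chunk in pd.read_csv("big.csv", usecols=["amount"], chunksize=1_000_000):
    c, _ = np.histogram(chunk["amount"].dropna(), bins=bins)
    counts += c

# Plot the aggregated histogram; only the bin counts live in memory.
plt.bar(bins[:-1], counts, width=np.diff(bins), align="edge")
plt.xlabel("amount")
plt.show()
```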