I'm a beginner and need some guidance on what is probably a very basic problem, but one I can't solve on my own:
I'm working on a Kaggle dataset with over 10M rows and would like to sample it before doing proper EDA. I've seen a couple of people simply pass an nrows argument to the .read_csv method, but wouldn't cutting the file at an arbitrary point be a poor way to sample, and therefore bias any results?
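To make the concern concrete, here is a minimal sketch comparing nrows with a random row filter applied while reading. The in-memory CSV, the column names, and the 10% keep rate are made up for illustration and are not the actual Kaggle data; the file is deliberately sorted by category to show the worst case.

```python
import io
import random

import pandas as pd

# Hypothetical stand-in for the real file: 1000 rows sorted by category,
# so the first rows are not representative of the whole file.
csv_text = "id,category\n" + "\n".join(
    f"{i},{'A' if i < 800 else 'B'}" for i in range(1000)
)

# Option 1: nrows keeps only the first 100 data rows.
head = pd.read_csv(io.StringIO(csv_text), nrows=100)

# Option 2: keep each data row with probability ~0.1 while reading.
# skiprows accepts a callable that receives the row index (0 is the header)
# and returns True for rows that should be skipped.
rng = random.Random(0)
sampled = pd.read_csv(
    io.StringIO(csv_text),
    skiprows=lambda i: i > 0 and rng.random() > 0.1,
)

print(head["category"].value_counts().to_dict())
print(sampled["category"].value_counts(normalize=True).to_dict())
```

With this toy file, the nrows version only ever sees category A, which is the kind of bias I'm worried about if the real file happens to be ordered in some way.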
The .sample method draws a simple random sample, and I feel like it wouldn't necessarily preserve the proportions of the different categories. What would be a better sampling option?
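For reference, this is roughly what I mean by keeping the category proportions: a sketch that samples within each group instead of over the whole frame. The DataFrame and the "category" column are hypothetical, and it assumes a pandas version that provides groupby(...).sample (1.1 or later, as far as I can tell). I don't know if this is the recommended approach.

```python
import pandas as pd

# Hypothetical imbalanced data; the column names are assumptions.
df = pd.DataFrame({
    "category": ["A"] * 700 + ["B"] * 250 + ["C"] * 50,
    "value": range(1000),
})

# Plain random sample: category proportions match the original only on average.
simple = df.sample(frac=0.1, random_state=0)

# Stratified sample: draw 10% of the rows within each category.
stratified = df.groupby("category").sample(frac=0.1, random_state=0)

print(simple["category"].value_counts().to_dict())
print(stratified["category"].value_counts().to_dict())
```

In the stratified version each category contributes 10% of its own rows (70, 25 and 5 here), whereas the counts from the plain .sample call can drift, especially for the rare category.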