I'm a beginner and need some guidance on what is probably a very basic problem, yet one I can't solve:

I'm working on a Kaggle dataset with over 10M rows and would like to sample it before doing proper EDA. I've seen a couple of people simply passing an nrows argument to the .read_csv method, but wouldn't it be poor sampling to cut the data off at an arbitrary point, and therefore bias any results?

The .sample method uses a simple uniform randomizer, and I feel it wouldn't capture the proportions of the different categories. What would be a better sampling option?
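To make the comparison concrete, here is a minimal sketch of the two approaches I've seen (the filename train.csv and the sample size are placeholders):

import pandas as pd

# Option 1: nrows just truncates the file, so the "sample" is whatever
# the first 100k rows happen to be (e.g. sorted by date or category)
head_df = pd.read_csv("train.csv", nrows=100_000)

# Option 2: .sample draws rows uniformly at random, but the whole
# file still has to be loaded into memory first
full_df = pd.read_csv("train.csv")
random_df = full_df.sample(n=100_000, random_state=42)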

  • https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.StratifiedKFold.html – ALollz Aug 02 '19 at 14:36

1 Answer


If this is supervised learning (i.e., you have labeled data), you can use

from sklearn.model_selection import train_test_split

train_X, test_X, train_Y, test_Y = train_test_split(data, label, test_size=0.2, random_state=138, shuffle=True, stratify=label)


The stratify argument keeps the same proportion of each class in the resulting train and test sets.
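As a quick sanity check, a sketch like this (reusing label and train_Y from above, assumed here to be pandas Series) should print nearly identical class proportions:

# Class proportions in the full labels vs. the stratified training split
print(label.value_counts(normalize=True))
print(train_Y.value_counts(normalize=True))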

akhetos
  • I thought about *train_test_split*, but I'm building a recommendation engine where I need to recommend which products would be the best fit for each customer, so I don't have a label per se... But if there's a way around it, do let me know :) Thank you! – Javi St Aug 02 '19 at 15:30
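For the unlabeled case described in that comment, one common workaround is to stratify on any categorical column directly with pandas, taking the same fraction from every group. A minimal, self-contained sketch (the product_category column and the synthetic data are hypothetical stand-ins for the real dataset):

import numpy as np
import pandas as pd

# Hypothetical stand-in for the full dataset
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "customer_id": np.arange(1_000_000),
    "product_category": rng.choice(["books", "toys", "games"], size=1_000_000, p=[0.6, 0.3, 0.1]),
})

# Proportional (stratified) sample: draw 1% from each category group
sample_df = df.groupby("product_category", group_keys=False).apply(
    lambda g: g.sample(frac=0.01, random_state=42)
)

# Category proportions in the sample should closely match the full data
print(df["product_category"].value_counts(normalize=True))
print(sample_df["product_category"].value_counts(normalize=True))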