I'm a beginner and need some guidance on what is probably a very basic problem, yet one I can't solve:

I'm working on a Kaggle dataset with over 10M rows and would like to sample it before doing proper EDA. I've seen a couple of people simply passing an nrows argument to the .read_csv method, but wouldn't it be poor sampling to cut the data off at an arbitrary point, and therefore bias any results?

The .sample method uses a simple uniform randomizer, and I feel it wouldn't capture the proportions of the different categories. What would be a better sampling option?
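To make the comparison concrete, here is a minimal sketch of the two approaches I've seen (the filename train.csv and the sample size are placeholders):

import pandas as pd

# Option 1: nrows just truncates the file, so the "sample" is whatever
# the first 100k rows happen to be (e.g. sorted by date or category)
head_df = pd.read_csv("train.csv", nrows=100_000)

# Option 2: .sample draws rows uniformly at random, but the whole
# file still has to be loaded into memory first
full_df = pd.read_csv("train.csv")
random_df = full_df.sample(n=100_000, random_state=42)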

  • https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.StratifiedKFold.html – ALollz Aug 02 '19 at 14:36

1 Answer


If this is supervised learning (i.e., you have labeled data), you can use

from sklearn.model_selection import train_test_split

train_X, test_X, train_Y, test_Y = train_test_split(data, label, test_size=0.2, random_state=138, shuffle=True, stratify=label)


The stratify argument keeps the same proportion of each class in the resulting train and test sets.
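As a quick sanity check, a sketch like this (reusing label and train_Y from above, assumed here to be pandas Series) should print nearly identical class proportions:

# Class proportions in the full labels vs. the stratified training split
print(label.value_counts(normalize=True))
print(train_Y.value_counts(normalize=True))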

akhetos
  • I thought about *train_test_split*, but I'm building a recommendation engine where I need to recommend which products would be the best fit for each customer, so I don't have a label per se... But if there's a way around it, do let me know :) Thank you! – Javi St Aug 02 '19 at 15:30
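For the unlabeled case described in that comment, one common workaround is to stratify on any categorical column directly with pandas, taking the same fraction from every group. A minimal, self-contained sketch (the product_category column and the synthetic data are hypothetical stand-ins for the real dataset):

import numpy as np
import pandas as pd

# Hypothetical stand-in for the full dataset
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "customer_id": np.arange(1_000_000),
    "product_category": rng.choice(["books", "toys", "games"], size=1_000_000, p=[0.6, 0.3, 0.1]),
})

# Proportional (stratified) sample: draw 1% from each category group
sample_df = df.groupby("product_category", group_keys=False).apply(
    lambda g: g.sample(frac=0.01, random_state=42)
)

# Category proportions in the sample should closely match the full data
print(df["product_category"].value_counts(normalize=True))
print(sample_df["product_category"].value_counts(normalize=True))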