
I have a large CSV file (6 GB) and I want to sample 20% of it.

The sample should have the same distribution as the original file.

For example, take Kaggle's data: https://www.kaggle.com/c/avazu-ctr-prediction/data

I thought about reading in chunks, but how can I keep the distribution the same?

I tried read_csv and fread, but without luck.

How can I do this? My laptop can't handle a 6 GB CSV file.

Pasha
SteveS

3 Answers


It's not clear what you mean by "tried fread, but without luck". Was there a specific error? How much RAM does your laptop have?

On my laptop (with 16GB memory) the file can be read without problems and will take only 3.7GB in RAM when loaded:

import numpy as np
import datatable as dt

train = dt.fread("~/datasets/avazu/train.csv")
print(train.shape)
# (40428967, 24)

# Keep each row with probability 0.2; a Bernoulli row sample preserves the
# distribution of every column in expectation.
sample = train[np.random.binomial(1, 0.2, size=train.nrows).astype(bool), :]
sample.to_csv("train20.csv")  # produces a roughly 1.25 GB file

However, if for some reason your computer really can't load the original file, then I'd recommend loading it in pieces, by columns; applying the same boolean row mask to each piece; and finally cbind-ing the results:

# Read the first 8 columns and build the boolean row mask once;
# free each piece after slicing it.
train1 = dt.fread("~/datasets/avazu/train.csv", columns=slice(0, 8))
smp = dt.Frame(np.random.binomial(1, 0.2, size=train1.nrows).astype(bool))
sample1 = train1[smp, :]
del train1

# Repeat for the remaining column ranges, reusing the same mask `smp`.
train2 = dt.fread("~/datasets/avazu/train.csv", columns=slice(8, 16))
sample2 = train2[smp, :]
del train2

train3 = dt.fread("~/datasets/avazu/train.csv", columns=slice(16, 24))
sample3 = train3[smp, :]
del train3

# Glue the column pieces back together and write out the 20% sample.
sample = dt.cbind(sample1, sample2, sample3)
sample.to_csv("train20.csv")
Pasha

With the RevoScaleR library you have many options to analyze data that does not fit in RAM.

If you don't like that option, you can define a set of cuts (say, 100 or 200 percentile bins) on your sample, then read the file in batches, counting how many records fall into each cut. Once you've accumulated the counts, you can compare the frequency distribution of the complete file with that of the sample: run a KS test, compare weighted means, or inspect the differences graphically.
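The batch-counting idea above can be sketched as follows. This is a minimal illustration, not the answer's own code: pandas' chunked reader and the small synthetic column stand in for the real 6 GB file, and a simple per-bin frequency comparison stands in for the formal KS test.

```python
import io
import numpy as np
import pandas as pd

# Small in-memory CSV standing in for the large on-disk file.
rng = np.random.default_rng(0)
full = pd.DataFrame({"x": rng.normal(size=10_000)})
csv_buf = io.StringIO(full.to_csv(index=False))

# The "cuts": edges of 100 percentile bins of the column.
edges = np.quantile(full["x"], np.linspace(0.0, 1.0, 101))

full_counts = np.zeros(100, dtype=int)
sample_parts = []
for chunk in pd.read_csv(csv_buf, chunksize=1_000):
    # Count how many records of this batch fall into each cut.
    full_counts += np.histogram(chunk["x"], bins=edges)[0]
    # A 20% random sample of every batch matches the distribution in expectation.
    sample_parts.append(chunk.sample(frac=0.2, random_state=1))

sample = pd.concat(sample_parts)
sample_counts = np.histogram(sample["x"], bins=edges)[0]

# Compare relative frequencies per cut; a KS test could replace this check.
max_gap = np.abs(full_counts / len(full) - sample_counts / len(sample)).max()
print(len(sample), round(max_gap, 3))
```

With these seeds, each 1,000-row batch contributes a 200-row sample, and the per-bin frequency gap between file and sample stays small.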


One of the ways that solved my issue was the ff package in R. Using ff::read.csv.ffdf() I accessed the file on disk through a pointer, and afterwards worked with it like a regular data.table / data.frame / tibble.

It helped me; hope it helps you too.
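The ff answer is R-specific; a rough Python analogue (my assumption, not part of the answer) is pandas' chunked reader: the file stays on disk, only one chunk lives in memory at a time, and only the 20% sample is materialized.

```python
import io
import pandas as pd

# In-memory stand-in for the large on-disk CSV.
csv_buf = io.StringIO("id,click\n" + "\n".join(f"{i},{i % 2}" for i in range(1_000)))

parts = []
for chunk in pd.read_csv(csv_buf, chunksize=100):
    # Sample each chunk as it streams past instead of loading the whole file.
    parts.append(chunk.sample(frac=0.2, random_state=42))

sample = pd.concat(parts, ignore_index=True)
print(sample.shape)  # 20% of the rows, both columns
```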

SteveS