
I am trying to sample 10000 random rows from a large dataset with ~3 billion rows (with a header line). I've considered using shuf -n 10000 input.file > output.file, but this seems quite slow (>2 hours of run time with my currently available resources).

I've also used awk 'BEGIN{srand();} {a[NR]=$0} END{for(i=1; i<=10; i++){x=int(rand()*NR) + 1; print a[x];}}' input.file > output.file (from this answer) to sample a percentage of lines from smaller files, but I am new to awk and don't know how to include the header.

I would like to know if there is a more efficient way to sample a subset (e.g. 10000 rows) from the 200 GB dataset.

Geode
  • `shuf` is going to be the fastest. – KamilCuk Jul 02 '20 at 16:15
  • there are multiple ways to do this in R on a dataframe, but I don't know if this is going to be faster – pyr0 Jul 02 '20 at 16:21
  • If you're OK with stratified sampling, you can split the large data file into, say, 1000 partitions of 3M records each, take 10 samples from each, and merge the samples back. – karakfa Jul 02 '20 at 16:25
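
For reference, the stratified-sampling idea from the last comment could look roughly like the sketch below (GNU split and shuf assumed; file names are hypothetical, and the partition files temporarily occupy the full ~200 GB of disk):

head -n 1 input.file > output.file                     # keep the header row
tail -n +2 input.file | split -l 3000000 -a 4 - part_  # ~1000 chunks of 3M lines each
for p in part_*; do
    shuf -n 10 "$p"                                    # 10 random rows from each chunk
done >> output.file
rm -f part_*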

2 Answers


Something in awk. Supply it with a random seed ($RANDOM in Bash) and the number n of wanted records. It counts the lines with wc -l and uses that count to pick n distinct random line numbers between 1 and lines[1], then prints the matching lines. Can't really say anything about speed; I don't even have 200 GB of disk. (:

$ awk -v seed=$RANDOM -v n=10000 '
BEGIN {
    cmd="wc -l " ARGV[1]                          # use wc  for line counting
    if(ARGV[1]==""||n==""||(cmd | getline t)<=0)  # require all parameters
        exit 1                                    # else exit
    split(t,lines)                                # wc -l returns "lines filename"
    srand(seed)                                   # use the seed
    while(c<n) {                                  # keep looping n times
        v=int((lines[1]) * rand())+1              # get a random line number
        if(!(v in a)){                            # if it's not used yet
            a[v]                                  # mark it as used
            ++c
        }
    }
}
(NR in a)' file                                   # print if NR in selected

Testing with a dataset generated by seq 1 100000000: shuf -n 10000 file took about 6 seconds, whereas the awk above took about 18 s.
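
If the first line of the file is a header that should always be kept (as the question mentions), one possible tweak is to print line 1 unconditionally and draw the random line numbers from the range 2..lines[1], so the header never uses up one of the n samples. An untested sketch along those lines:

$ awk -v seed=$RANDOM -v n=10000 '
BEGIN {
    cmd="wc -l " ARGV[1]
    if(ARGV[1]==""||n==""||(cmd | getline t)<=0)
        exit 1
    split(t,lines)
    srand(seed)
    while(c<n) {
        v=int((lines[1]-1) * rand())+2            # random line number in 2..lines[1]
        if(!(v in a)){
            a[v]
            ++c
        }
    }
}
NR==1 || (NR in a)' file                          # header plus the selected lines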

James Brown
  • This worked pretty well with small files, and it's worth a shot to run this with the large file to at least compare with the runtime of shuf. I'll try to report back in a couple of days. – Geode Jul 02 '20 at 19:09
  • Tested the awk version for the heck of it: it took ~2 hours, as opposed to ~1.7 hours with shuf. – Geode Jul 07 '20 at 18:50
  • @Geode Surprisingly not so bad. Thanks for checking in. – James Brown Jul 07 '20 at 19:20

I don't think any program written in a scripting language can beat shuf in the context of this question. Anyway, here is my attempt in bash. Run it with ./scriptname input.file > output.file

#!/bin/bash

samplecount=10000
datafile=$1
[[ -f $datafile && -r $datafile ]] || {
    echo "Data file does not exists or is not readable" >&2
    exit 1
}

linecount=$(wc -l "$datafile")
linecount=${linecount%% *}
pickedlinnum=(-1)   # sentinel at index 0 so the skip calculation works for the first pick
mapfile -t -O1 pickedlinnum < <(
    for ((i = 0; i < samplecount;)); do
        rand60=$((RANDOM + 32768*(RANDOM + 32768*(RANDOM + 32768*RANDOM)))) # ~60-bit random value
        linenum=$((rand60 % linecount))    # 0-based line number
        if [[ -z ${slot[linenum]} ]]; then # no collision
            slot[linenum]=1
            echo ${linenum}
            ((++i))
        fi
    done | sort -n)

for ((i = 1; i <= samplecount; ++i)); do
    # skip the gap since the previously picked line, then read one line into MAPFILE
    mapfile -n1 -s$((pickedlinnum[i] - pickedlinnum[i-1] - 1))
    echo -n "${MAPFILE[0]}"
done < "$datafile"
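
A quick way to sanity-check the script on a smaller file before pointing it at the 200 GB dataset (the test file name is arbitrary, and the test data matches the other answer's benchmark):

seq 1 100000000 > test.file        # same synthetic data as in the other answer
./scriptname test.file | wc -l     # should print 10000
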
M. Nejat Aydin