I am trying to sample 10000 random rows from a large dataset with ~3 billion rows (plus a header line). I've considered using

    shuf -n 10000 input.file > output.file

but this seems quite slow (over 2 hours of run time with my currently available resources).
I've also used this awk one-liner from this answer to take a percentage of lines from smaller files:

    awk 'BEGIN{srand();} {a[NR]=$0} END{for(i=1; i<=10; i++){x=int(rand()*NR) + 1; print a[x];}}' input.file > output.file

though I am new to awk and don't know how to include the header line.
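My best guess for adapting that approach to keep the header would be something along these lines (print line 1 unconditionally and only sample from the remaining lines), but I'm not certain it's correct, and it still buffers the whole file in memory, which won't work at 200GB:

    awk 'BEGIN{srand();}
         NR==1 {print; next}                                  # pass the header straight through
         {a[NR]=$0}                                           # buffer data lines (only feasible if the file fits in memory)
         END{for(i=1; i<=10000; i++){x=int(rand()*(NR-1)) + 2; print a[x];}}' input.file > output.file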
I wanted to know whether there is a more efficient way to sample a subset (e.g. 10000 rows) from this 200GB dataset.
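One idea I had, assuming the total line count N is known in advance (the value below is just my rough estimate), is to pre-generate 10000 random line numbers with shuf -i and then pull those lines (plus the header) out in a single sequential awk pass, so only the 10000 chosen line numbers need to be held in memory; I don't know whether shuf copes well with a range this large, though:

    N=3000000000                                   # total number of lines, assumed known (e.g. from an earlier wc -l)
    shuf -i 2-"$N" -n 10000 > lines.txt            # pick 10000 random data-line numbers, skipping the header at line 1
    awk 'NR==FNR {want[$1]; next}                  # first file: load the chosen line numbers into a lookup set
         FNR==1 || (FNR in want)' lines.txt input.file > output.file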