
I frequently have large text files (10-100GB decompressed) to demultiplex based on barcodes in each line, where in practice the number of resulting individual files (unique barcodes) is between 1K and 20K. I've been using awk for this and it accomplishes the task. However, I've noticed that demuxing larger files (which in practice means more unique barcodes) is significantly (10-20X) slower. Checking ulimit -n shows 4096 as the limit on open files per process, so I suspect the slowdown is due to the overhead of awk being forced to constantly close and reopen files whenever the total number of demuxed files exceeds 4096.

Lacking root access (i.e., the limit is fixed), what kinds of workarounds could be used to circumvent this bottleneck?

I do have a list of all barcodes present in each file, so I've considered forking multiple awk processes where each is assigned a mutually exclusive subset (< 4096) of barcodes to search for. However, I'm concerned the overhead of having to check each line's barcode for set membership might defeat the gains of not closing files.
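
For concreteness, a rough sketch of what I have in mind (the per-subset barcode lists barcodes.partN.txt, the input big.txt, and the out/ directory are placeholders; this assumes the barcode is the first field on each line, as in my real data):

# Hypothetical sketch: one awk process per barcode subset; each pass only
# writes lines whose barcode is in its assigned subset, so each process
# keeps fewer than 4096 files open.
for part in barcodes.part*.txt; do
    awk '
        NR == FNR { keep[$1]; next }        # load this subset of barcodes
        $1 in keep { print >> "out/"$1 }    # membership test on every line
    ' "$part" big.txt &
done
wait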

Is there a better strategy?

I'm not married to awk, so approaches in other scripting or compiled languages are welcome.


Specific Example

Data Generation (FASTQ with barcodes)

The following generates data similar to what I'm specifically working with. Each entry consists of 4 lines, where the barcode is an 18-character word over the unambiguous DNA alphabet (ACGT).

1024 unique barcodes | 1 million reads

cat /dev/urandom | tr -dc "ACGT" | fold -w 5 | \
awk '{ print "@batch."NR"_"$0"AAAAAAAAAAAAA_ACGTAC length=1\nA\n+\nI" }' | \
head -n 4000000 > cells.1K.fastq

16384 unique barcodes | 1 million reads

cat /dev/urandom | tr -dc "ACGT" | fold -w 7 | \
awk '{ print "@batch."NR"_"$0"AAAAAAAAAAA_ACGTAC length=1\nA\n+\nI" }' | \
head -n 4000000 > cells.16K.fastq
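
For reference, each generated entry looks like the following (shown for the 1K case; the random 5-mer and the read number vary). The 18-character word after the first underscore is the barcode, and the 6-character word after the second underscore (ACGTAC) is the tag that the demux script below writes to the .umi files:

@batch.1_TCGAGAAAAAAAAAAAAA_ACGTAC length=1
A
+
I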

awk script for demultiplexing

Note that in this case I'm writing to 2 files for each unique barcode.

demux.awk

#!/usr/bin/awk -f
BEGIN {
    if (length(outdir) == 0 || length(prefix) == 0) {
        print "Variables 'outdir' and 'prefix' must be defined!" > "/dev/stderr";
        exit 1;
    }
    print "[INFO] Initiating demuxing..." > "/dev/stderr";
}
{
    # Header line of each 4-line record: capture the 18-character barcode (bx[1])
    # and the 6-character tag (bx[2]), and write the tag to that barcode's .umi file.
    if (NR%4 == 1) {
        match($1, /.*_([ACGT]{18})_([ACGTN]{6}).*/, bx);
        print bx[2] >> outdir"/"prefix"."bx[1]".umi";
    }
    # Every line of the record goes to the current barcode's .fastq file.
    print >> outdir"/"prefix"."bx[1]".fastq";

    if (NR%40000 == 0) {
        printf("[INFO] %d reads processed\n", NR/4) > "/dev/stderr";
    }
}
END {
    printf("[INFO] %d total reads processed\n", NR/4) > "/dev/stderr";
}

Usage

awk -v outdir="/tmp/demux1K" -v prefix="batch" -f demux.awk cells.1K.fastq

or similarly for the cells.16K.fastq.
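
To compare the two cases I simply time both runs; a minimal example (the output directories are arbitrary and must exist beforehand, since awk won't create them):

mkdir -p /tmp/demux1K /tmp/demux16K
time awk -v outdir="/tmp/demux1K" -v prefix="batch" -f demux.awk cells.1K.fastq
time awk -v outdir="/tmp/demux16K" -v prefix="batch" -f demux.awk cells.16K.fastq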

Assuming you're the only one running awk, you can verify the approximate number of open files using

lsof | grep "awk" | wc -l

Observed Behavior

Despite the files being the same size, the one with 16K unique barcodes runs 10X-20X slower than the one with only 1K unique barcodes.

merv
  • sorting the file first will eliminate the problem (only one file will be open at a time). Not sure the gain will be more than the cost of sorting. Another alternative is some kind of partitioning (cheaper than sort) so that partitions are mutually exclusive (e.g. `C++` partition) – karakfa Aug 23 '18 at 19:29
  • Post a [mcve] with concise, testable sample input and expected output and you'll probably get some answers. See [ask] if that's not clear. – Ed Morton Aug 23 '18 at 19:51
  • It's applicable. We don't know what your code is actually doing so we're guessing at what that might be while you're guessing at what's causing the problem and we're guessing at solutions without knowing what your input looks like, what output you need, etc. All it'd take is a [mcve] to remove most of the guesswork and give us a starting point. – Ed Morton Aug 23 '18 at 20:09
  • Install a proper dbms. Load the data into tables. Do the work in SQL. – Ben Sep 23 '18 at 20:30

1 Answer


Without seeing any sample input/output or the script you're currently executing, it's very much guesswork, but if you currently have the barcode in field 1 and are doing this (assuming GNU awk, so you don't have your own code managing the open files):

awk '{print > $1}' file

then if managing open files really is your problem you'll get a significant improvement if you change it to:

sort file | awk '$1!=f { close(f); f=$1 } { print > f }'

The above is, of course, making assumptions about what these barcode values are, which field holds them, what separates fields, whether or not the output order has to match the original, what else your code might be doing that gets slower as the input grows, etc., since you haven't shown us any of that yet.

If that's not all you need then edit your question to include the missing MCVE.


Given your updated question with your script and the info that the input is in 4-line blocks, I'd approach the problem by adding the key "bx" values at the front of each record, using NUL to separate the 4-line blocks, and then using NUL as the record separator for sort and for the subsequent awk:

$ cat tst.sh
infile="$1"
outdir="${infile}_out"
prefix="foo"

mkdir -p "$outdir" || exit 1

awk -F'[_[:space:]]' -v OFS='\t' -v ORS= '
    NR%4 == 1 { print $2 OFS $3 OFS }   # prepend barcode and UMI as tab-separated sort keys
    { print $0 (NR%4 ? RS : "\0") }     # end each 4-line block with NUL instead of newline
' "$infile" |
sort -z |
awk -v RS='\0' -F'\t' -v outdir="$outdir" -v prefix="$prefix" '
BEGIN {
    if ( (outdir == "") || (prefix == "") ) {
        print "Variables \047outdir\047 and \047prefix\047 must be defined!" | "cat>&2"
        exit 1
    }
    print "[INFO] Initiating demuxing..." | "cat>&2"
    outBase = outdir "/" prefix "."
}
{
    bx1   = $1
    bx2   = $2
    fastq = $3

    if ( bx1 != prevBx1 ) {
        close(umiOut)
        close(fastqOut)
        umiOut   = outBase bx1 ".umi"
        fastqOut = outBase bx1 ".fastq"
        prevBx1  = bx1
    }

    print bx2   > umiOut
    print fastq > fastqOut

    if (NR%10000 == 0) {
        printf "[INFO] %d reads processed\n", NR | "cat>&2"
    }
}
END {
    printf "[INFO] %d total reads processed\n", NR | "cat>&2"
}
'

When run against input files generated as you describe in your question:

$ wc -l cells.*.fastq
4000000 cells.16K.fastq
4000000 cells.1K.fastq

the results are:

$ time ./tst.sh cells.1K.fastq 2>/dev/null

real    0m55.333s
user    0m56.750s
sys     0m1.277s

$ ls cells.1K.fastq_out | wc -l
2048

$ wc -l cells.1K.fastq_out/*.umi | tail -1
1000000 total

$ wc -l cells.1K.fastq_out/*.fastq | tail -1
4000000 total


$ time ./tst.sh cells.16K.fastq 2>/dev/null

real    1m6.815s
user    0m59.058s
sys     0m5.833s

$ ls cells.16K.fastq_out | wc -l
32768

$ wc -l cells.16K.fastq_out/*.umi | tail -1
1000000 total

$ wc -l cells.16K.fastq_out/*.fastq | tail -1
4000000 total
Ed Morton
  • Very nice! Testing on input with 1K versus 16K output files, this definitely makes the execution times comparable between the two. It might be worth noting for others that sorting alone and letting the OS (CentOS) manage closing the previous files still results in significant slow down - i.e., one must explicitly close files as you've indicated. Unfortunately, much of the data I have involves multiline blocks (one barcode per block), so it's not such a straightforward sorting operation, but still doable. – merv Aug 24 '18 at 00:48
  • Of course. Assuming you're right about the number of open files being the issue, sorting doesn't help the execution time; it just creates an input order such that we can, in the awk script, only have 1 output file open at a time, and THAT is what helps the execution time. wrt multi-line input blocks - I added a shell script to my answer to show how to handle that. – Ed Morton Aug 24 '18 at 13:04