
I am trying to write a script that takes a directory containing text files (384 of them) and modifies duplicate lines that have a specific format so that they are no longer duplicates.

In particular, I have files in which some lines begin with the '@' character and contain the substring 0:0. A subset of these lines are duplicated one or more times. For those that are duplicated, I'd like to replace 0:0 with i:0 where i starts at 1 and is incremented.

So far I've written a bash script that finds the duplicated lines beginning with '@', writes them to a file, then reads them back and uses sed in a while loop to search for and replace the first occurrence of each duplicated line. Here it is:

#!/bin/bash

fdir=$1"*"

# for each fastq file
for f in $fdir
do
    (

    # find duplicated read names and write them to file $f.txt
    sort $f | uniq -d | grep ^@ > "$f".txt

    # loop over each duplicated read name
    while read in; do
        rname=$in
        i=1

        # while this read name still exists in the file, increment and replace
        while grep -q "$rname" $f; do
            replace=${rname/0:0/$i:0}
            sed -i.bu "0,/$rname/s/$rname/$replace/" "$f"
            let "i+=1"
        done

    done < "$f".txt

    rm "$f".txt
    rm "$f".bu

    echo "done" >> progress.txt

    ) &

    background=( $(jobs -p) )
    if (( ${#background[@]} == 40 )); then
        wait -n
    fi

done

The problem with it is that it's impractically slow. I ran it on a 48-core computer for over 3 days and it hardly got through 30 files. It also seemed to have removed about 10 files, and I'm not sure why.

My question is: where are the bugs coming from, and how can I do this more efficiently? I'm open to using other programming languages or changing my approach.

EDIT

Strangely, the loop works fine on a single file. Basically I ran:

sort $f | uniq -d | grep ^@  > "$f".txt


while read in; do
    rname=$in
    i=1

    while grep -q "$rname" $f; do
        replace=${rname/0:0/$i:0}
        sed -i.bu "0,/$rname/s/$rname/$replace/" "$f"
        let "i+=1"

    done

done < "$f".txt

To give you an idea of what the files look like, below are a few lines from one of them. The thing is, even though it works for the one file, it's slow: multiple hours for one file of 7.5 M. I'm wondering if there's a more practical approach.

With regard to the file deletions and other bugs, I have no idea what was happening. Maybe it was running into memory collisions or something when the jobs were run in parallel?

Sample input:

@D00269:138:HJG2TADXX:2:1101:0:0 1:N:0:CCTAGAAT+ATTCCTCT
GATAAGGACGGCTGGTCCCTGTGGTACTCAGAGTATCGCTTCCCTGAAGA
+
CCCFFFFFHHFHHIIJJJJIIIJJIJIJIJJIIBFHIHIIJJJJJJIJIG
@D00269:138:HJG2TADXX:2:1101:0:0 1:N:0:CCTAGAAT+ATTCCTCT
CAAGTCGAACGGTAACAGGAAGAAGCTTGCTTCTTTGCTGACGAGTGGCG

Sample output:

@D00269:138:HJG2TADXX:2:1101:1:0 1:N:0:CCTAGAAT+ATTCCTCT
GATAAGGACGGCTGGTCCCTGTGGTACTCAGAGTATCGCTTCCCTGAAGA
+
CCCFFFFFHHFHHIIJJJJIIIJJIJIJIJJIIBFHIHIIJJJJJJIJIG
@D00269:138:HJG2TADXX:2:1101:2:0 1:N:0:CCTAGAAT+ATTCCTCT
CAAGTCGAACGGTAACAGGAAGAAGCTTGCTTCTTTGCTGACGAGTGGCG
Alon Gelber
  • 3 days, gulp. Roughly how big are the files? I would skip the `jobs` stuff, and think about why 1 file is taking (3*24)/384 hrs to process. Good luck. – shellter Apr 20 '17 at 21:12
  • On average about 40 MB. I just put the code in how I wrote it originally, but you're right, I'll try it on one of the smaller files and see if it works. Any tips on the best way to do this task? I feel like the way I'm doing it has a lot of unnecessary steps. – Alon Gelber Apr 20 '17 at 22:31
  • I'm not clear about what the loop is accomplishing. Can you edit your Q to show 1-2 lines of input and then the expected output from that. If you're re-reading the file many times to create the final output, there is almost certainly a way to process it only once, and rely on logic tests to make the modifications, but as is, it's too hard to tell. Also, don't make the data a mile wide; try to make your sample data 50-80 chars wide (for 1 record). Will look at this later or tomorrow. Good luck. – shellter Apr 21 '17 at 03:12
  • Also make sure that your disk drives are working per spec. I would try `time gzip 46MBfile ; time gunzip 46MBfile.gz`. If that takes more than 1 min (should be less), then you have some basic hardware problems that program optimization can't help. If you have good relations with your sysadmin, you could also have them check logs and `iostat` for the drives you are using to be sure there are no problems being flagged. Good luck. – shellter Apr 21 '17 at 04:07
  • I think I'm only reading the file once. Just to find the duplicated lines. Then I store those in a file and read them in one at a time, do the while loop to sequentially change all of them, then move on to the next duplicated line. – Alon Gelber Apr 21 '17 at 15:09
  • There's a lot of obvious tactical stuff wrong (much of which http://shellcheck.net/ will flag), but those aren't things that'll explain the performance issue. Frankly, a good place to start is **measuring** the problem. For instance, run `PS4=':$LINENO:$SECONDS+' bash -x yourscript`, and it'll log next to each command it runs both the line of source it's from, and how many seconds from the script's execution where it's being executed -- that log would at least allow identifying individual commands that are slow. – Charles Duffy Apr 21 '17 at 15:28
  • (Example of a little tactical fix, by the way: Right now, your script is waiting for the `sort | uniq | grep` pipeline to completely finish before it starts the `while read in` loop at all -- if you used a process substitution, you could let that pipeline stream content into the loop as it becomes available; a sketch follows these comments). – Charles Duffy Apr 21 '17 at 15:28
  • Thanks for the sample inputs. We need to see what the sample output of that input should be. Can you update your Q please? ALSO, Is a "record" separated by the lone `+` char? If that is true, then is the data really folded across multiple lines, or did you reformat per my mention of 50-80 chars wide? It will be easier for you if a record is on a single line. AND ++++ to Charles Duffy's recommendations (of course!). But I still recommend getting 1 file to work correctly first. – shellter Apr 21 '17 at 15:58
  • AND, right now that sample data is still confusing. Based on your problem description, I think all you need is 30-40 chars per record/line to illustrate your problem, i.e. `@D00269:138:HJG2TADXX:2:1101:0:0 1:N:0:CCTAGAAT` . `@D00269:138:HJG2TADXX:2:1101:0:0 1:N:0:CCTAGAAT+`, yes? Good luck. – shellter Apr 21 '17 at 16:04
  • Doah, and now I'm guessing that records begin with `@`? – shellter Apr 21 '17 at 16:05
  • So the format of the input files is many 4 line "reads", which consists of a read name (the line we need to manipulate), the read itself, the plus, and then a line of quality scores. The read names are the lines that start with @ and we'd like to change. I edited my question to hopefully make it a bit more clear. – Alon Gelber Apr 21 '17 at 16:15
  • That's better. I think there is a much simpler way to process the data, but unfortunately, I have to take off for a while and I have several things to do this afternoon. Maybe `awk '/^@/&&prev==$1{ ;prev=$1;split($1,tmpArr,":");tmpArr[6]++;#magic to rebuild $1 here}1' file` will give other readers an idea on how to solve this. (This assumes that `file` is sorted on the first space delimited field). Good luck. – shellter Apr 21 '17 at 16:29
  • Thanks for the help and steering me in the right direction! If anyone could elaborate on using awk to do this that would be greatly appreciated. – Alon Gelber Apr 21 '17 at 16:46
  • This must be the most content free question title ever. Why not be more specific? It basically says, "I have a problem. Help." http://www.catb.org/esr/faqs/smart-questions.html – Jens Apr 23 '17 at 14:19
  • @Jens, fair enough. Edited – Alon Gelber Apr 23 '17 at 15:33
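
A minimal sketch of the process substitution mentioned in the comments above, applied to the loop from the question (unverified; the loop body is unchanged, the only difference is that the duplicate read names are streamed straight into the loop instead of going through the intermediate "$f".txt file):

# stream the duplicated read names directly into the while loop
while read in; do
    rname=$in
    i=1
    while grep -q "$rname" $f; do
        replace=${rname/0:0/$i:0}
        sed -i.bu "0,/$rname/s/$rname/$replace/" "$f"
        let "i+=1"
    done
done < <(sort $f | uniq -d | grep ^@)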

2 Answers


Here's some code that produces the required output from your sample input.

Again, it is assumed that your input file is sorted by the first value (up to the first space character).

time awk '{
        #dbg if (dbg) print "#dbg:prev=" prev
        if (/^@/ && prev!=$1) {fixNum=0 ;if (dbg) print "prev!=$1=" prev "!=" $1}
        if (/^@/ && (prev==$1 || NR==1) ) {
                prev=$1
                n=split($1,tmpArr,":") ; n++
                #dbg if (dbg) print "tmpArr[6]="tmpArr[6] "\tfixNum="fixNum
                fixNum++;tmpArr[6]=fixNum;

                # magic to rebuild $1 here
                for (i=1;i<n;i++) {
                        tmpFix ? tmpFix=tmpFix":"tmpArr[i]"" : tmpFix=tmpArr[i]
                }
                $1=tmpFix ; $0=$0  
                print  $0
        }
        else { tmpFix=""; print $0 } 
        }' file > fixedFile

output

@D00269:138:HJG2TADXX:2:1101:1:0 1:N:0:CCTAGAAT+ATTCCTCT
GATAAGGACGGCTGGTCCCTGTGGTACTCAGAGTATCGCTTCCCTGAAGA
+
CCCFFFFFHHFHHIIJJJJIIIJJIJIJIJJIIBFHIHIIJJJJJJIJIG
@D00269:138:HJG2TADXX:2:1101:2:0 1:N:0:CCTAGAAT+ATTCCTCT
CAAGTCGAACGGTAACAGGAAGAAGCTTGCTTCTTTGCTGACGAGTGGCG

I've left a few of the #dbg:... statements in place (but they are now commented out) to show how you can run a small set of data as you have provided, and watch the values of variables change.

Assuming a non-csh shell, you should be able to copy/paste the code block into a terminal command line and replace file > fixedFile at the end with your real file name and a new name for the fixed file. Recall that awk 'program' file > file (actually, any ...file>file) will truncate the existing file before awk even starts reading, so you can lose all the data in a file by trying to use the same name.
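
For example (a trivial awk program and a placeholder file name, just to show the pattern):

# DANGEROUS: the shell truncates reads.fastq before awk ever reads it
awk '{print}' reads.fastq > reads.fastq

# safer: write to a new name, then rename only if the command succeeded
awk '{print}' reads.fastq > reads.fastq.fixed && mv reads.fastq.fixed reads.fastq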

There are probably some syntax improvements that will reduce the size of this code, and there might be 1 or 2 things that could be done to make the code faster, but this should run very quickly. If not, please post the result of the time command that should appear at the end of the run, i.e.

real    0m0.18s
user    0m0.03s
sys     0m0.06s

IHTH

shellter
  • So this code didn't exactly work on Mac. I haven't tried it on Linux yet, so it might be that. It basically fixes only the first occurrence. I came up with another solution that is relatively fast. I'll post it as well. I haven't tested it on the really large files yet, though. – Alon Gelber Apr 23 '17 at 14:14
  • Edit your Q to show the output/errors on Mac please. Good luck. – shellter Apr 23 '17 at 15:04
#!/bin/bash

i=4

# collect the duplicated read-name lines
sort "$1" | uniq -d | grep ^@ > dups.txt

while read in; do

    # every 4th line is a read name; only rewrite it if it is one of the duplicates
    if [ $((i%4)) -eq 0 ] && grep -q "$in" dups.txt; then
        x="$in"
        x=${x/"0:0 "/$i":0 "}
        echo "$x" >> "$1"fixed.txt

    else
        echo "$in" >> "$1"fixed.txt

    fi

    let "i+=1"
done < "$1"
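
Invocation is roughly as follows (assuming the script above is saved as fixdups.sh -- the file names here are just for illustration); the corrected reads end up in a new file named after the input with "fixed.txt" appended:

# hypothetical invocation; writes the result to sample.fastqfixed.txt
bash fixdups.sh sample.fastq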
Alon Gelber
  • Glad you found a solution. Please post your new timings on your files. I hope it's not more than a minute ;-). Also, I would still like to see what the trouble is on the Mac. Can you post what the trouble is with the awk code (it works for me! ;-) )? – shellter Apr 25 '17 at 02:22