
I have a series of text files in a folder sub.yr_by_yr, which I pass to a for loop to subset a Beagle file by its header (the subsetting itself is done by my subbeagle.awk script). I want to parallelize this loop. Each subset is written to a new file named after the text file: I strip the suffix with bash parameter expansion (file11=${file1%.subbeagle.txt}) to get the desired output name (MM.beagle.${file11}.gz).

for file1 in $(ls sub.yr_by_yr)
do 
echo -e  "Doing sub-samples \n $file1"
file11=${file1%.subbeagle.txt}
awk -f subbeagle.awk \
       ./sub.yr_by_yr/$file1 <(zcat ../MajorMinor.beagle.gz) | gzip > sub.yr_by_yr_beagle.files/MM.beagle.${file11}.gz
done

The for loop works, but takes forever, hence the need for parallelization. The folder sub.yr_by_yr contains >10 files named like this: sp.yrseries.site1.1.subbeagle.txt, sp.yrseries.site1.2.subbeagle.txt, sp.yrseries.site1.3.subbeagle.txt...
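
To make the naming step concrete, here is a minimal sketch of the suffix stripping applied to one of the example names. The helper name out_name is hypothetical and only there for illustration; the output directory and the MM.beagle. prefix are the ones used in the loop above.

```shell
# Hypothetical helper: map a header file name to the output path used in the loop.
out_name() {
    file1=${1##*/}                    # drop any leading directory
    file11=${file1%.subbeagle.txt}    # strip the fixed .subbeagle.txt suffix
    printf '%s\n' "sub.yr_by_yr_beagle.files/MM.beagle.${file11}.gz"
}

out_name sp.yrseries.site1.1.subbeagle.txt
# prints sub.yr_by_yr_beagle.files/MM.beagle.sp.yrseries.site1.1.gz
```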

I've tried

parallel "file11=${{}%.subbeagle.txt}; awk -f $SUBBEAGLEAWKSCRIPT ./sub.yr_by_yr/{} <(zcat ../MajorMinor.beagle.gz) | gzip > sub.yr_by_yr_beagle.files/MM.beagle.${file11}.gz" ::: sub.yr_by_yr/*.subbeagle.txt

But it fails with a 'bad substitution' error.

How could I use the awk script in parallel and rename the files accordingly?

Content of subbeagle.awk:

# Source: https://stackoverflow.com/questions/74451358/select-columns-based-on-their-names-from-a-file-using-awk

BEGIN  { FS=OFS="\t" }                             # input/output fields are tab delimited; comment out if they are not
FNR==NR { headers[$1]; next }
        { sep=""
          for (i=1; i<=NF; i++) {
              if (FNR==1 && ($i in headers)) {
                 fldids[i]
              }
              if (i in fldids) {
                 printf "%s%s",sep,$i
                 sep=OFS                            # if not set elsewhere (eg, in a BEGIN{}block) then default OFS == <space>
              }
          }
          print ""
        }

Content of MajorMinor.beagle.gz

marker      allele1  allele2  FINCH_WB_ID1_splitMerged  FINCH_WB_ID1_splitMerged  FINCH_WB_ID1_splitMerged  FINCH_WB_ID2_splitMerged  FINCH_WB_ID2_splitMerged
chr1_34273  G        C        0.79924                   0.20076                   3.18183e-09               0.940649                      0.0593509
chr1_34285  G        A        0.79924                   0.20076                   3.18183e-09               0.969347                      0.0306534
chr1_34291  G        C        0.666111                  0.333847                  4.20288e-05               0.969347                      0.0306534
chr1_34299  C        G        0.000251063               0.999498                  0.000251063               0.996035                      0.00396529

UPDATE:

I was able to get this from this source:

parallel "awk -f subbeagle.awk {} <(zcat ../MajorMinor.beagle.gz) | gzip > 'sub.yr_by_yr_beagle.files/MM.beagle.{/.}_test.gz'" ::: sub.yr_by_yr/*.subbeagle.txt

The only fancy thing left to remove is the .subbeagle part of the input file name...

M. Beausoleil
  • Consider making it easy for folk to help you... Just show your input as `<(zcat XXX.gz)` and your output as `gzip > YYY${file11}.gz` Then show the first three filenames in your subdirectory and the first 3 commands you want **GNU Parallel** to run with the parameters. – Mark Setchell Dec 15 '22 at 19:52
  • Please read https://mywiki.wooledge.org/BashFAQ/001, https://mywiki.wooledge.org/DontReadLinesWithFor and https://mywiki.wooledge.org/ParsingLs – Paul Hodges Dec 15 '22 at 20:01
  • OK, `for file1 in sub.yr_by_yr/*.txt` would be better in this case. But it doesn't answer the parallel part. – M. Beausoleil Dec 15 '22 at 20:11
  • `fldids[i]` ?? don't you want something like `fldids[i]=$i`? Good luck. – shellter Dec 15 '22 at 20:14
  • and what is the output of `ls sub.yr_by_yr| wc -l` ? (How many files to process?) ... Maybe you can just background the `awk` script with `&` at the end of the line, if you only have a "small" number of files. Can't recall how to query system for maxNum background processes right now. Good luck. – shellter Dec 15 '22 at 20:18
  • The problem is not the number of files (which is about 22). It's the fact that it runs sequentially. It'd be much more efficient to have the 22 run in parallel. Each process takes time, but if they run at the same time, I have to wait a lot less... Also, `parallel "echo '{= s:\.[^.]+$::;s:\.[^.]+$::; =}'" ::: sub.yr_by_yr/*.subbeagle.txt` is a way to rename the file, but I'm not able to reuse that in the parallel output. – M. Beausoleil Dec 15 '22 at 20:28
  • See you solved your problem with the best tool for the job. And yes, I appreciated that you wanted your scripts to run in parallel. If your system allows 22 jobs in the background (many now do, in the 32bit days it was often 12-15), changing that one line of code to `awk -f subbeagle.awk \ ./sub.yr_by_yr/$file1 <(zcat ../MajorMinor.beagle.gz) | gzip > sub.yr_by_yr_beagle.files/MM.beagle.${file11}.gz &` (note the ending char, `&`, which means run in the background), would have executed all of your script "instances" with different values for `$file1` all at "once" (I think!). Good luck – shellter Dec 16 '22 at 01:51

1 Answer


So the parallel tutorial helped me here:

parallel --rpl '{mymy} s:.*/::; s:\.[^.]+$::;s:\.[^.]+$::;' "awk -f subbeagle.awk {} <(zcat ../MajorMinor.beagle.gz) | gzip > 'sub.yr_by_yr_beagle.files/MM.beagle.{mymy}.gz'" ::: sub.yr_by_yr/*.subbeagle.txt

Let's break this down:

--rpl '{mymy} s:.*/::; s:\.[^.]+$::;s:\.[^.]+$::;'
  • --rpl will "define a shorthand replacement string" (see parallel tutorial and another example here)

  • {mymy} is my 'new' replacement string, which will execute what is after it.

  • s:.*/::; is the definition of {/}, i.e. it strips the directory part (see parallel tutorial, search for "Perl expression replacement string", the last part of that section shows the definition of 7 'default' replacement strings)

  • s:\.[^.]+$::;s:\.[^.]+$::; removes two extensions (here .subbeagle.txt: .txt is removed first, then .subbeagle)

"awk -f subbeagle.awk {} <(zcat ../MajorMinor.beagle.gz) | gzip > 'sub.yr_by_yr_beagle.files/MM.beagle.{mymy}.gz'"

  • This is the subsetting and compressing part of the script. {mymy} marks where the derived name is substituted, and {} is replaced by the input file path. The rest is unchanged!

  • ::: sub.yr_by_yr/*.subbeagle.txt will pass all the files to parallel as input.
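
To check what {mymy} expands to without running parallel, the same three substitutions can be replayed with sed. This is only a sketch: mymy_name is a hypothetical helper, and it assumes a sed that supports -E (extended regular expressions), which the Perl-style [^.]+ needs.

```shell
# Hypothetical helper: apply the same three substitutions as the {mymy} definition.
mymy_name() {
    # 1) strip the directory, 2) strip the last extension, 3) strip the next one
    printf '%s\n' "$1" | sed -E 's:.*/::; s:\.[^.]+$::; s:\.[^.]+$::'
}

mymy_name sub.yr_by_yr/sp.yrseries.site1.1.subbeagle.txt
# prints sp.yrseries.site1.1
```

Because the two extension rules strip whatever the last two extensions happen to be, they would also shorten a name that does not end in .subbeagle.txt; the glob after ::: is what keeps the inputs consistent.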

Sequentially, it took ~2 hours to do at least ~5 files, but using 22 cores, I could do all the files in a fraction of that time (~20 minutes)!
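
For comparison, the backgrounding idea from the comments can be sketched without GNU Parallel. This is a rough sketch under assumptions, not what I actually ran: subset_all and its four arguments are hypothetical, gzip -dc stands in for zcat, and the Beagle file is fed to awk on standard input (-) instead of through process substitution so the function also works in a plain POSIX shell.

```shell
# Hypothetical helper: start one background job per header file, then wait for all.
# Arguments (all assumptions): header dir, gzipped Beagle file, output dir, awk script.
subset_all() {
    in_dir=$1; beagle=$2; out_dir=$3; script=$4
    mkdir -p "$out_dir"
    for f in "$in_dir"/*.subbeagle.txt; do
        [ -e "$f" ] || continue          # skip the literal pattern if nothing matched
        name=${f##*/}                    # strip the directory
        name=${name%.subbeagle.txt}      # strip the suffix
        # awk reads the header list first, then the Beagle data from stdin (-)
        gzip -dc "$beagle" | awk -f "$script" "$f" - |
            gzip > "$out_dir/MM.beagle.$name.gz" &
    done
    wait                                 # block until every background job has finished
}
```

With the paths from the question, it would be called as subset_all sub.yr_by_yr ../MajorMinor.beagle.gz sub.yr_by_yr_beagle.files subbeagle.awk. Unlike parallel, which by default runs one job per CPU core, this starts every job at once, so it is only reasonable when the number of files does not exceed the available cores by much.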

M. Beausoleil