I have a series of text files in a folder sub.yr_by_yr
which I pass to a for loop to subset a Beagle
file from the header. I want to parallelize this script to subset the Beagle file from the header values (which is done using my subbeagle.awk
script). I use the title of the text files to export the subset to a new file name using the base pattern matching in bash (file11=${file1%.subbeagle.txt}
) to get the desired output (MM.beagle.${file11}.gz
)
for file1 in $(ls sub.yr_by_yr)
do
echo -e "Doing sub-samples \n $file1"
file11=${file1%.subbeagle.txt}
awk -f subbeagle.awk \
./sub.yr_by_yr/$file1 <(zcat ../MajorMinor.beagle.gz) | gzip > sub.yr_by_yr_beagle.files/MM.beagle.${file11}.gz
done
The for loop works, but takes for ever... hence the need for parallelization. the folder sub.yr_by_yr
contains >10 files named
something like similar to this: sp.yrseries.site1.1.subbeagle.txt
, sp.yrseries.site1.2.subbeagle.txt
, sp.yrseries.site1.3.subbeagle.txt
...
I've tried
parallel "file11=${{}%.subbeagle.txt}; awk -f $SUBBEAGLEAWKSCRIPT ./sub.yr_by_yr/{} <(zcat ../MajorMinor.beagle.gz) | gzip > sub.yr_by_yr_beagle.files/MM.beagle.${file11}.gz" ::: sub.yr_by_yr/*.subbeagle.txt
But it gives me 'bad substitution
'
How could I use the awk script in parallel and rename the files accordingly?
Content of subbeagle.awk
:
# Source: https://stackoverflow.com/questions/74451358/select-columns-based-on-their-names-from-a-file-using-awk
BEGIN { FS=OFS="\t" } # uncomment if input/output fields are tab delimited
FNR==NR { headers[$1]; next }
{ sep=""
for (i=1; i<=NF; i++) {
if (FNR==1 && ($i in headers)) {
fldids[i]
}
if (i in fldids) {
printf "%s%s",sep,$i
sep=OFS # if not set elsewhere (eg, in a BEGIN{}block) then default OFS == <space>
}
}
print ""
}
Content of MajorMinor.beagle.gz
marker allele1 allele2 FINCH_WB_ID1_splitMerged FINCH_WB_ID1_splitMerged FINCH_WB_ID1_splitMerged FINCH_WB_ID2_splitMerged FINCH_WB_ID2_splitMerged
chr1_34273 G C 0.79924 0.20076 3.18183e-09 0.940649 0.0593509
chr1_34285 G A 0.79924 0.20076 3.18183e-09 0.969347 0.0306534
chr1_34291 G C 0.666111 0.333847 4.20288e-05 0.969347 0.0306534
chr1_34299 C G 0.000251063 0.999498 0.000251063 0.996035 0.00396529
UPDATE:
I was able to get this from this source:
parallel "awk -f subbeagle.awk {} <(zcat ../MajorMinor.beagle.gz) | gzip > 'sub.yr_by_yr_beagle.files/MM.beagle.{/.}_test.gz'" ::: sub.yr_by_yr/*.subbeagle.txt
The only fancy thing that needs to be removed is the .subbeagle
par of the input file name...