Match string and print output specified field side by side for multiple files

Question

I'm new to programming so I might need explanation for each step and I have an issue:

Say I have these (tab delimited) files:

genelist.txt contains:

start_position  end_position    description
1   840 putative replication protein
1839    2030    hypothetical protein
2095    2328    hypothetical protein
3076    4020    transposase
4209    4322    hypothetical protein

a.txt contains:

NA1.fa
NA1:0-840   scaffold40|size16362    100.000
NA1:1838-2030   scaffold40|size16362    100.000
NA1:3075-4020   scaffold40|size16362    100.000
NA1:4208-4322   scaffold40|size16362    92.105`

b.txt contains:

NA4.fa
NA4:1838-2030   scaffold11|size142511   84.707
NA4:2094-2328   scaffold11|size142511   84.599
NA4:3075-4020   scaffold11|size142511   84.707`

And my desired output is:

start_position  end_position    description NA1 NA4
1   840 putative replication protein    100 -
1839    2030    hypothetical protein    100 84.707
2095    2328    hypothetical protein    -   84.599
3076    4020    transposase 100 84.707
4209    4322    hypothetical protein    92.105  -

Basically, I want to match the genes based on the end position and print out the percentage matches (of the 3rd field) side by side according to the respective IDs so I can get a comparison table of their percentage identity. And if there's no match, print - or 0 so I know which exactly has a match and which doesn't.

I'm open to bash/regex/perl/python or any sort of scripting that will do the job. Apologies if this has been asked before but I couldn't find any solutions so far.

Once you define your problem in a systematic way, your program is almost done. Syntax varies from one language to the other, but your algorithm would stay the same. Let me know if this is ok... foreach line in genelist.txt, get the end_position. From a.txt, check in what range end_position is included, extract the 3rd field if found. '-' or '0' otherwise. From b.txt, check in what range end_position is included, extract the 3rd field if found. '-' or '0' otherwise. Print each line from genelist.txt and the values from a.txt and b.txt. Right? — Nic3500, Nov 07 '17 at 12:44
What have you tried? What language do you know? This can be done in any bash, perl, php, python, c, java, ... Users here will help you fix a problem with your code, but will not write the solution for you. — Nic3500, Nov 07 '17 at 12:45
@Nic3500 Sorry, I didn't know this forum was solely for fixing codes. Like I mentioned, I am new to programming so although I know what I want to do, I'm clueless as to what commands to use to achieve it. I did try something like while read a b; do grep -w ..... (I prefer using bash) And yes, I think you got the idea there? Keep in mind that I have lots of files like a.txt and b.txt so I need to find a way to loop it eventually. — Chrystine Yan, Nov 08 '17 at 04:38
Here is my proposed workflow: 1. Extract field 2 (end position) in genelist.txt 2. Match that to file a.txt 3. If matched, print 3rd field from a.txt to genelist.txt next to the corresponding match (or can output to a new file with my desired output) 4. If no match, print '-' or '0' 5. Repeat for b.txt Also, need to print file header/file name so I know which column's values belongs to which file. — Chrystine Yan, Nov 08 '17 at 04:55

score 0 · Answer 1 · answered Nov 08 '17 at 15:33

Well that was a challenge. So here is the code:

#!/bin/bash
#
# Process genelist file
#
################################################################################

usage()
{
    echo "process.bash <GENELIST> <DATAFILE1> [<DATAFILE n>]"
    echo "Requires at least the genelist and 1 data file."
    exit 1
}

# Process arguments
if [ $# -lt 2 ]
then
    usage
else
    genelistfile=$1
    # Remove the fist argument from $*
    shift
    datafiles=$*
fi

# Setup the output file ########################################################
processdate=$(date +%Y%M%d-%H%m%S)
outputfile="process_$processdate.out"

# Build the header:
#   the first line of the genelist.txt
#   and the first line of each datafile (processed)
header="start_position\tend_position\tdescription"
for datafile in $datafiles
do
    datafileheader=$(grep -v ":" $datafile | cut -d'.' -f1)
    header="$header\t$datafileheader"
done
echo -e $header >$outputfile

# Process the genelistfile #####################################################

# Read each line from the genelistfile
while read -r line
do
    # Do nothing with the header line
    if [ $(echo $line | grep -c start_position) -gt 0 ]
    then
        continue
    fi

    # Setup the output line, which is the line from genelistfile
    # The program will add values from the datafiles as they are processed
    outputline=$line

    # Extract the second field in the line, endposition
    endposition=$(echo $line | awk '{print $2}')

    # loop on each file in argument
    for datafile in $datafiles
    do
        foundsomething='false'

        # for each line in the datafile...
        while read -r line2
        do
            # If the line is a range line, process it
            if [ $(echo $line2 | grep -c ":") -gt 0 ]
            then
                # Extract the range
                startrange=$(echo $line2 | awk '{print $1}' | cut -d':' -f2 | cut -d'-' -f1)
                endrange=$(echo $line2 | awk '{print $1}' | cut -d':' -f2 | cut -d'-' -f2)
                #echo "range= $startrange --> $endrange"

                # Verify if endposition fits within the range...
                if [ $endposition -ge $startrange -a $endposition -le $endrange ]
                then
                    percentage=$(echo $line2 | awk '{print $3}')
                    outputline="$outputline\t$percentage"
                    foundsomething='true'
                fi
            fi
        done < $datafile

        # When done processing the file, we must check if something was found
        if [ $foundsomething == 'false' ]
        then
            outputline="$outputline\t-"
        fi
    done

    # When done processing that line from genelist, output it
    echo -e $outputline >>$outputfile

done < $genelistfile

I have put lots of comments to explain what is going on, but here some assumptions I took to simplify the code:

all data files have a first line with SOMETHING1.SOMETHING2. I keep SOMETHING1 as the column header.
there will not be NA1 and NAx mixed data in the same file.
the range data is always specified like NAx:start-end.
the value to extract form the range data is always the 3rd element in a line.

It worked for me with your sample data. Have fun!

Match string and print output specified field side by side for multiple files

1 Answers1