I'm new to programming, so I may need an explanation for each step. Here is my problem:
Say I have these (tab-delimited) files:
- genelist.txt contains:
start_position end_position description
1 840 putative replication protein
1839 2030 hypothetical protein
2095 2328 hypothetical protein
3076 4020 transposase
4209 4322 hypothetical protein
- a.txt contains:
NA1.fa
NA1:0-840 scaffold40|size16362 100.000
NA1:1838-2030 scaffold40|size16362 100.000
NA1:3075-4020 scaffold40|size16362 100.000
NA1:4208-4322 scaffold40|size16362 92.105
- b.txt contains:
NA4.fa
NA4:1838-2030 scaffold11|size142511 84.707
NA4:2094-2328 scaffold11|size142511 84.599
NA4:3075-4020 scaffold11|size142511 84.707
And my desired output is:
start_position end_position description NA1 NA4
1 840 putative replication protein 100 -
1839 2030 hypothetical protein 100 84.707
2095 2328 hypothetical protein - 84.599
3076 4020 transposase 100 84.707
4209 4322 hypothetical protein 92.105 -
Basically, I want to match the genes by end position and print the percentage identity (the 3rd field of a.txt and b.txt) side by side under the respective IDs, so that I get a comparison table of percentage identity. If there is no match, print - or 0 so I know exactly which genes have a match and which don't.
I'm open to bash, regex, perl, python, or any other kind of scripting that will do the job. Apologies if this has been asked before, but I couldn't find any solutions so far.
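
The matching described above could be sketched in Python roughly as follows. This is only a sketch under some assumptions that the question does not confirm: the first field of each hit line always has the form `ID:start-end`, the line holding only `NA1.fa` or `NA4.fa` is a header to skip, and each end position occurs at most once per file. The function names `read_hits` and `build_table` are made up for illustration. Percentages are reformatted with `:g`, so `100.000` prints as `100` as in the desired output; note that `:g` also rounds to six significant digits.

```python
import csv
import re
import sys


def read_hits(path):
    """Return (sample_id, {end_position: percent_identity}) for one hits file."""
    sample, hits = None, {}
    with open(path) as fh:
        for line in fh:
            fields = line.split()
            # A hit line has at least 3 fields and starts with e.g. "NA1:1838-2030".
            # Header lines such as "NA1.fa" and blank lines fail this check.
            m = re.fullmatch(r"([^:]+):(\d+)-(\d+)", fields[0]) if len(fields) >= 3 else None
            if m is None:
                continue
            sample = m.group(1)                      # ID before the colon, e.g. "NA1"
            end = int(m.group(3))                    # end position is the join key
            hits[end] = f"{float(fields[2]):g}"      # 3rd field: percent identity
    # Assumption: fall back to the file name if the file contains no hit lines.
    return sample or str(path), hits


def build_table(genelist_path, hit_paths, missing="-"):
    """Return the comparison table as a list of rows (lists of strings)."""
    samples = [read_hits(p) for p in hit_paths]
    with open(genelist_path, newline="") as fh:
        rows = [r for r in csv.reader(fh, delimiter="\t") if len(r) >= 3]
    header, genes = rows[0], rows[1:]
    table = [header + [name for name, _ in samples]]
    for gene in genes:
        end = int(gene[1])                           # end_position column
        table.append(gene + [hits.get(end, missing) for _, hits in samples])
    return table


if __name__ == "__main__" and len(sys.argv) > 2:
    # Usage: python compare.py genelist.txt a.txt b.txt > table.txt
    for row in build_table(sys.argv[1], sys.argv[2:]):
        print("\t".join(row))
```

Each hits file becomes a dictionary keyed by end position, so looking up a gene is a single `dict.get` call that returns `-` when the gene has no hit. Passing `missing="0"` to `build_table` would print 0 instead, and any number of hit files can be listed after genelist.txt.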