0

I have two files as follows:

File1:

ABC_1   123 E   +   7.595927    Anotation1
ABC_2   456 E   -   7.369319    Anotation2
EFG_1   261 E   -   7.259135    Anotation3
EFG_2   947 E   -   5.021707    Anotation4
EFG_3   1   E   +   7.398731    Anotation5
HIJ_1   12  E   +   7.398731    Anotation6
HIJ_2   14  E   -   5.496079    Anotation7

File2:

ABC_1   123 E   +   inclusion   0.165805338
ABC_2   456 E   -   inclusion   0.165805338
EFG_1   261 E   -   inclusion   0.165805338
EFG_2   947 E   -   inclusion   0.165805338
EFG_3   1   E   +   inclusion   0.165805338
LMN_1   21  I   -   exclusion   0.5
LMN_2   43  I   -   inclusion   0.3

Is there a way to create a join on both the files, such that the first 5 columns are retained as follows, assuming obviously that there will be some rows that won't have joins in each file, in that case first 4 columns will be retained irrespective of the file with other fields blank following a specific format as follows:

ABC_1   123 E   +   7.595927    Anotation1  inclusion   0.165805338
ABC_2   456 E   -   7.369319    Anotation2  inclusion   0.165805338
EFG_1   261 E   -   7.259135    Anotation3  inclusion   0.165805338
EFG_2   947 E   -   5.021707    Anotation4  inclusion   0.165805338
EFG_3   1   E   +   7.398731    Anotation5  inclusion   0.165805338
HIJ_1   12  E   +   7.398731    Anotation6  NULL    NULL
HIJ_2   14  E   -   5.496079    Anotation7  NULL    NULL
LMN_1   21  I   -   NULL    NULL    exclusion   0.5
LMN_2   43  I   -   NULL    NULL    inclusion   0.3

I did try join -a 1 -a 2 -e NULL -o command but it doesn't give me the desired order and format

Allan
  • 12,117
  • 3
  • 27
  • 51
AishwaryaKulkarni
  • 774
  • 1
  • 8
  • 19
  • (1) Have you looked at `man join`, and/or tried using it? Can you show what you tried, and describe how it didn't work? (2) Please try to make your example comply with the [mcve] definition -- that means, specifically, building the *simplest possible* reproducer for a problem. Simplifying your data to something a human can glance at and visually distinguish the values would do a lot of good here. – Charles Duffy Jan 23 '18 at 22:15
  • @CharlesDuffy you are absolutely right on that I did make it simpler, thanks for your input. – AishwaryaKulkarni Jan 23 '18 at 22:33
  • In the real file, are columns separated with spaces or tabs? – Charles Duffy Jan 23 '18 at 22:34
  • @CharlesDuffy they are separated with tabs – AishwaryaKulkarni Jan 23 '18 at 22:35
  • I'd suggest that `join -a 1 -a 2 -e NULL -o '0.1 1.2 1.3 1.4 1.5 2.4 2.5' file1 file2` is about as close as you're going to get without printing both `1.2 1.3 1.4` and `2.2 2.3 2.4` separately in your output streams and using `awk` to combine them (emitting the first non-NULL empty of each pair). – Charles Duffy Jan 23 '18 at 22:42

1 Answers1

1

You can use the following chain of commands to do your join operation:

$ (join -a1 -a2 -e NULL -o '1.1 1.2 1.3 1.4 1.5 1.6 2.5 2.6' file1 file2; join -a1 -a2 -e NULL -o '2.1 2.2 2.3 2.4 1.5 1.6 2.5 2.6' file1 file2 )| grep -v '^NULL' | sort -k 1,2 | uniq

On your input files, it produces:

$ (join -a1 -a2 -e NULL -o '1.1 1.2 1.3 1.4 1.5 1.6 2.5 2.6' file1 file2; join -a1 -a2 -e NULL -o '2.1 2.2 2.3 2.4 1.5 1.6 2.5 2.6' file1 file2 )| grep -v '^NULL' | sort -k 1,2 | uniq
ABC_1 123 E + 7.595927 Anotation1 inclusion 0.165805338
ABC_2 456 E - 7.369319 Anotation2 inclusion 0.165805338
EFG_1 261 E - 7.259135 Anotation3 inclusion 0.165805338
EFG_2 947 E - 5.021707 Anotation4 inclusion 0.165805338
EFG_3 1 E + 7.398731 Anotation5 inclusion 0.165805338
HIJ_1 12 E + 7.398731 Anotation6 NULL NULL
HIJ_2 14 E - 5.496079 Anotation7 NULL NULL
LMN_1 21 I - NULL NULL exclusion 0.5
LMN_2 43 I - NULL NULL inclusion 0.3

Explanations:

  • (join -a1 -a2 -e NULL -o '1.1 1.2 1.3 1.4 1.5 1.6 2.5 2.6' file1 file2; join -a1 -a2 -e NULL -o '2.1 2.2 2.3 2.4 1.5 1.6 2.5 2.6' file1 file2 ) will produce the joining operation with your desired output including lines not present in file1 and lines not present in file2! Those lines will start with NULL
  • grep -v '^NULL' | sort -k 1,2 | uniq will remove those lines starting with NULL and the sort -k 1,2 | uniq will take care of the duplicates.

Last but not least for formatting purpose you can add |column -t at the end of your command to produce the following beautiful output:

enter image description here

Allan
  • 12,117
  • 3
  • 27
  • 51