I have a tab-separated file like this:

V    I      280     6   -   VRSSAI
N    V      2739    7   -   SAVNATA
A    R      203     5   -   AEERR
Q    A      2517    7   -   AQSTPSP
S    S      1012    5   -   GGGSS
L    A      281    11   -   AAEPALSAGSL

I would like to check the last column against the letters in the 1st and 2nd columns. If the first and last letters of the last column match the 1st and 2nd columns respectively, the line should remain unchanged. Otherwise, I would like to locate the reversed pattern (the 2nd-column letter followed by the 1st-column letter) in the last column, print the string from the 1st-column letter to the end, and then append the part from the first letter up to the 2nd-column letter. The desired output would be:

V    I      280     6   -   VRSSAI
N    V      2739    7   -   NATASAV
A    R      203     5   -   AEERR
Q    A      2517    7   -   QSTPSPA
S    S      1012    5   -   SGGGS
L    A      281    11   -   LSAGSLAAEPA

I have tried different scripts along these lines, but they do not work correctly and I don't know exactly why.

awk 'BEGIN {FS=OFS="\t"}{gsub(/$2$1/,"\t",$6); print $1$7$6$2}' "input" > "output";

Another attempt:

awk 'BEGIN {FS=OFS="\t"} {len=split($11,arrseq,"$7$6"); for(i=0;i<len;i++){printf "%s ",arrseq[i],arrseq[i+1]}' "input" > "output";

I also tried using the substr function, but none of these attempts works correctly. Is it possible to do this in bash? Thanks in advance.

Here is an example to make the question clearer.

$1                 $2                 $6
L                  A                  AAEPALSAGSL (reverse pattern 'AL' $2$1)

The desired output in $6 runs from the $1 letter within the reversed pattern to the end, followed by the part from the first letter up to the $2 letter within the reversed pattern:

$1                 $2                 $6
L                  A                  LSAGSLAAEPA
  • I do not understand what you mean with: "I would like to locate the reverse order pattern in last column and then print the string from the letter in sixth column to the end and then take the first letter and print to the letter in seventh column". Can you rephrase this? I also understand Spanish. – Pierre François Dec 27 '17 at 17:30
  • Hi Pierre, I want to check whether the first and last letters of the last column match the letters in the sixth and seventh columns respectively. If so, I leave it unchanged. If they do not match, it is because the reversed pattern (that is, ---$7$6---) is present in the last column. In that case I want to print from the letter corresponding to $6 to the end, concatenated with the part from the first letter up to the letter corresponding to $7. The output shows how some of the final columns change according to what I want to modify. Thanks in advance, Pierre – Perceval Vellosillo Gonzalez Dec 27 '17 at 17:47
  • Often, just by showing a very small example of the problem (small sample input data, and the required output from that same data), the question is self-defining. Try reducing the size of your data set, so we can see the problem in 1 or 2 records that are only 40 chars wide (or so). Do we really need to know what is in fields `$1,$2,$3,$4,$5`? (I did not downvote your Q.) Good luck! – shellter Dec 27 '17 at 18:00
  • They are not necessary, in fact. You are right, @shellter. I have now modified the question to explain it better and reduce its size. I have also written a small example below of what I would like to get; maybe it makes the question easier to understand. I included the scripts because people sometimes ask what I have tried. – Perceval Vellosillo Gonzalez Dec 27 '17 at 18:11
  • Can you provide a better example or better explanation of the example? I can't figure out any transformation that would produce `LSAGSLAAEPA` from `AAEPALSAGSL` nor can I figure out what either of your scripts are intended to do. Also, your text talks about 6th and 7th columns but there only ARE 6 columns in the preceding example so that adds to the confusion. – Ed Morton Dec 27 '17 at 20:26
  • It'd make your example much clearer if the same letters didn't appear multiple times btw! – Ed Morton Dec 27 '17 at 20:35
  • You are right, Ed Morton. When I reduced both the input and the output and corrected the question to make it easier to understand, I missed replacing "sixth and seventh" with "1st and 2nd" in the second sentence. Thanks! – Perceval Vellosillo Gonzalez Dec 28 '17 at 09:15

3 Answers

If I understood the question correctly, this awk should do it:

awk '( substr($6, 1, 1) != $1 || substr($6, length($6), 1) != $2 ) && ( i = index($6, $2$1) ) { $6 = substr($6, i+1) substr($6, 1, i) }1' OFS=$'\t' data

You basically want to rotate the string so that the beginning of the string matches the char in $1 and the end of the string matches the char in $2. Strings that cannot be rotated to match that condition are left unchanged, for example:

A    B    3    3    -    BCAAB
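
The rotation described above can also be written out step by step. The following is only a sketch of the same idea, not a replacement for the one-liner: the wrapper name `rotate_col6` and the variable names are hypothetical, and it assumes whitespace- or tab-separated input with the sequence in the sixth field.

```shell
# rotate_col6 is a hypothetical helper name, not part of the original answer.
# It reads records from stdin (or from files given as arguments).
rotate_col6() {
    awk '
    BEGIN { OFS = "\t" }
    {
        seq   = $6
        first = substr(seq, 1, 1)
        last  = substr(seq, length(seq), 1)
        # Position of the reversed pair ($2 immediately followed by $1); 0 if absent.
        i = index(seq, $2 $1)
        if ((first != $1 || last != $2) && i > 0) {
            # Tail starting at the $1 letter, then the head ending at the $2 letter.
            $6 = substr(seq, i + 1) substr(seq, 1, i)
        }
        print
    }' "$@"
}

# Usage: prints the record with $6 rotated to LSAGSLAAEPA (tab-separated).
printf 'L\tA\t281\t11\t-\tAAEPALSAGSL\n' | rotate_col6
```

Records that already start with $1 and end with $2, or that do not contain the reversed pair at all, are printed as they came in.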
PesaThe

You can try this awk; it's not perfect, but it gives you a starting point.

awk '{i=(match($6,$1));if(i==1)print;else{a=$6;b=substr(a,i);c=substr(a,1,(i-1));$6=b c;print}}' OFS='\t' infile
ctac_
gawk '
BEGIN{
    OFS="\t"
}
$6 !~ "^"$1".*"$2"$" {
    $6 = gensub("(.*"$2")("$1".*)", "\\2\\1", 1, $6)
}
{print}
' input.txt

Output

V   I   280     6   -   VRSSAI
N   V   2739    7   -   NATASAV
A   R   203     5   -   AEERR
Q   A   2517    7   -   QSTPSPA
S   S   1012    5   -   SGGGS
L   A   281     11  -   LSAGSLAAEPA
MiniMax
  • What about `VRSIVSAI` for example, or regex characters? :) Escaping such regex will be rather annoying. – PesaThe Dec 27 '17 at 23:51
  • @PesaThe Fixed and improved. – MiniMax Dec 28 '17 at 00:08
  • @PesaThe Which regex characters? `$1` and `$2` can only be capital letters under the conditions of the task. – MiniMax Dec 28 '17 at 00:17
  • Why not create a solution that is consistent for any character? `[ ] 4 4 - aaa][bbb` will break your regex approach. Using regex in this case is also considerably slow for big files. However, if you want to consider just letters and reasonably big files, it works just fine now :) – PesaThe Dec 28 '17 at 00:51
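
The point raised in the comments about regex characters can be illustrated with a small check. This is only a sketch; the function name `literal_vs_regex` and the sample strings other than `aaa][bbb` are made up for illustration. In awk, `index()` searches for a literal substring, while `match()` interprets its second argument as a regular expression.

```shell
# literal_vs_regex is a hypothetical name used only for this illustration.
literal_vs_regex() {
    awk 'BEGIN {
        s = "ab.cd"
        # index() looks for the literal dot; match() treats "." as "any character".
        print index(s, "."), match(s, ".")
        # The pair "][" from the comment is found literally, with no escaping needed.
        print index("aaa][bbb", "][")
    }'
}

literal_vs_regex
# prints:
# 3 1
# 4
```

With single capital letters in $1 and $2, as in the question's data, both approaches give the same result; the difference only shows up once the fields may contain regex metacharacters.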