
I am trying to find records (rows) in a text file whose first two columns appear in reverse order, and then sum the duplicate records. The first two columns contain identifiers that can appear in either order, (A to B) or (B to A). What I would like to do is sum columns 3, 4 and 5 whenever columns A,B of one row equal columns B,A of another row. I found an answer here for removing such records with awk:

remove redundancy in a file based on two fields, using awk

However, I would like to add the duplicates together rather than remove them.

Example data:

g10001.t1   g6838.t1    3   8   4
g6838.t1    g10001.t1   3   5   3
g10001.t1   g964.t1     2   4   2
g964.t1     g10001.t1   1   2   1
g10001.t1   g10604.t1   1   2   1

Desired data:

g10001.t1   g6838.t1    6   13  7
g10001.t1   g964.t1     3   6   3
g10001.t1   g10604.t1   1   2   1

2 Answers


Here's a solution in base R:

# Create new grouping variable 
df$group <- apply(df[,1:2],1,function(x) paste(sort(x),collapse=" "))

# Aggregate data
aggregate(cbind(V3,V4,V5)~group,data=df,sum)
#                group V3 V4 V5
#1 g10001.t1 g10604.t1  1  2  1
#2  g10001.t1 g6838.t1  6 13  7
#3   g10001.t1 g964.t1  3  6  3

Data

df <- structure(list(V1 = c("g10001.t1", "g6838.t1", "g10001.t1", "g964.t1", 
"g10001.t1"), V2 = c("g6838.t1", "g10001.t1", "g964.t1", "g10001.t1", 
"g10604.t1"), V3 = c(3L, 3L, 2L, 1L, 1L), V4 = c(8L, 5L, 4L, 
2L, 2L), V5 = c(4L, 3L, 2L, 1L, 1L), group = c("g10001.t1 g6838.t1", 
"g10001.t1 g6838.t1", "g10001.t1 g964.t1", "g10001.t1 g964.t1", 
"g10001.t1 g10604.t1")), .Names = c("V1", "V2", "V3", "V4", "V5", 
"group"), row.names = c(NA, -5L), class = "data.frame")
$ awk '
    {
        key=($1<$2 ? $1 FS $2 : $2 FS $1)
        keys[key]
        for (i=3;i<=NF;i++) {
            sum[key,i]+=$i
        }
    }
    END{
        for (key in keys) {
            printf "%s%s", key, OFS
            for (i=3;i<=NF;i++) {
                printf "%s%s", sum[key,i], (i<NF?OFS:ORS)
            }
        }
    }
' file
g10001.t1 g6838.t1 6 13 7
g10001.t1 g10604.t1 1 2 1
g10001.t1 g964.t1 3 6 3

The above will work with any awk on any UNIX box. Pipe the output to column -t to align the columns:

$ awk '{key=($1<$2 ? $1 FS $2 : $2 FS $1); keys[key]; for (i=3;i<=NF;i++) sum[key,i]+=$i} END{for (key in keys) { printf "%s%s", key, OFS; for (i=3;i<=NF;i++) printf "%s%s", sum[key,i], (i<NF?OFS:ORS)} }' file |
column -t
g10001.t1  g6838.t1   6  13  7
g10001.t1  g10604.t1  1  2   1
g10001.t1  g964.t1    3  6   3
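
One thing to be aware of: for (key in keys) visits the keys in an unspecified order, so the summed lines may not come out in the order the pairs first appear in the input. If that ordering matters, a possible variation (a sketch along the same lines as the answer above, not part of the original answer) is to remember the order in which each key is first seen and loop over that in the END block:

$ awk '
    {
        key=($1<$2 ? $1 FS $2 : $2 FS $1)
        if (!(key in seen)) {       # first time this pair is seen
            seen[key]
            order[++n]=key          # remember first-seen order
        }
        for (i=3;i<=NF;i++) {
            sum[key,i]+=$i
        }
    }
    END{
        for (j=1;j<=n;j++) {
            key=order[j]
            printf "%s%s", key, OFS
            for (i=3;i<=NF;i++) {
                printf "%s%s", sum[key,i], (i<NF?OFS:ORS)
            }
        }
    }
' file

With the sample input this prints the three summed lines in the same order as the desired output in the question (g6838.t1 first, then g964.t1, then g10604.t1), and it can still be piped to column -t in the same way.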