
I have a data frame in R which has one individual per line. Sometimes, individuals appear on two lines, and I would like to combine these lines based on the duplicated ID.

The problem is, each individual has multiple IDs, and when an ID appears twice, it does not necessarily appear in the same column.

Here is an example data frame:

dat <- data.frame(a = c('cat', 'canine', 'feline', 'dog'),
                  b = c('feline', 'puppy', 'meower', 'wolf'),
                  c = c('kitten', 'barker', 'kitty', 'canine'),
                  d = c('shorthair', 'collie', '', ''),
                  e = c(1, 5, 3, 8))

> dat
       a      b      c         d e
1    cat feline kitten shorthair 1
2 canine  puppy barker    collie 5
3 feline meower  kitty           3
4    dog   wolf canine           8

So rows 1 and 3 should be combined, because ID b of row 1 equals ID a of row 3. Similarly, ID a of row 2 equals ID c of row 4, so those rows should be combined as well.

Ideally, the output should look like this.

     a.1    b.1    c.1       d.1 e.1    a.2    b.2    c.2 d.2 e.2
1    cat feline kitten shorthair   1 feline meower  kitty       3
2 canine  puppy barker    collie   5    dog   wolf canine       8

(Note that the rows were not combined based on sharing IDs that are empty strings.)

My thoughts on how this could be done are below, but I'm pretty sure that I've been headed down the wrong path, so they're probably not helpful in solving the problem.

I thought that I could assign a row ID to each row, then melt the data. After that, I could go through it row by row. When I found a row where one of the IDs matched an earlier row (e.g. when one of the row 3 IDs matches one of the row 1 IDs), I would change every instance of the current row's row ID to match the earlier row ID (e.g. all row IDs of 3 would be changed to 1).

Here's the code I've been using:

dat$row.id <- 1:nrow(dat)
library(reshape2)
dat.melt <- melt(dat, id.vars = c('e', 'row.id'))
for (i in 2:nrow(dat.melt)) {
  # This next step is just to ignore the empty values
  if (grepl('^[[:space:]]*$', dat.melt$value[i])) {
    next
  }
  earlier.instance <- dat.melt$row.id[which(dat.melt$value[1:(i-1)] == dat.melt$value[i])]
  if (length(earlier.instance) > 0) {
    earlier.row.id <- earlier.instance[1]
    dat.melt$row.id[dat.melt$row.id == dat.melt$row.id[i]] <- earlier.row.id
  }
}

There are two problems with this approach.

  1. It could be that an ID in row 3 matches row 1, and a different ID in row 5 matches row 3. In this case, the row IDs for both row 3 and row 5 should be changed to 1. This means that it's important to go through the rows sequentially, which has been leading me to use a for loop, not an apply function. I know that this is not very R-like, and with the large data frame I am working with it is very slow.
  2. This code produces the output below. There are now multiple rows with identical row.id and variable, so I don't know how to cast it in order to get the kind of output I showed above. With these duplicates, dcast would be forced to use an aggregation function.

Output:

   e row.id variable     value
1  1      3        a       cat
2  5      2        a    canine
3  3      3        a    feline
4  8      2        a       dog
5  1      3        b    feline
6  5      2        b     puppy
7  3      3        b    meower
8  8      2        b      wolf
9  1      3        c    kitten
10 5      2        c    barker
11 3      3        c     kitty
12 8      2        c    canine
13 1      3        d shorthair
14 5      2        d    collie
15 3      3        d          
16 8      2        d          
njc

2 Answers


New answer. Had some fun (/frustration) working through this. I'm sure it is not the fastest solution but it should get you past where my other answer left off. Let me explain:

library(data.table)

dat <- data.table(a = c('cat', 'canine', 'feline', 'dog', 'cat','fido'),
                  b = c('feline', 'puppy', 'meower', 'wolf', 'kitten', 'dog'),
                  c = c('kit', 'barker', 'kitty', 'canine', 'feline','wolf'),
                  d = c('shorthair', 'collie', '', '','',''),
                  e = c(1, 2, 3, 4, 5, 6))

# All holds the a, b and c IDs of each row in one space-separated string
dat[, All := paste(a, b, c)]

Two changes: dat$e is now an index column, so it is just the numeric position of whichever row it is. If e is otherwise important, you can make a new column to replace it.

Below is the first loop. It creates three new columns: FirstMatchingID, SecondMatchingID and ThirdMatchingID. These work like before: for a, b and c respectively, they give the index of the earliest (lowest row #) row whose dat$All matches.

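for(i in 2:nrow(dat)) {
  # dat[i-(1:i)] is rows i-1, i-2, ..., 1 (the 0 index is dropped),
  # so the largest matching position y is the earliest matching row, i-y
  x <- grepl(dat[i]$a, dat[i-(1:i)]$All)
  y <- max(which(x %in% TRUE))
  dat[i, FirstMatchingID := dat[i-y]$e]

  # same search for column b
  x2 <- grepl(dat[i]$b, dat[i-(1:i)]$All)
  y2 <- max(which(x2 %in% TRUE))
  dat[i, SecondMatchingID := dat[i-y2]$e]

  # same search for column c
  x3 <- grepl(dat[i]$c, dat[i-(1:i)]$All)
  y3 <- max(which(x3 %in% TRUE))
  dat[i, ThirdMatchingID := dat[i-y3]$e]
}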

Next, we use pmin to find the earliest matching row across the MatchingID columns and store it in its own column. This covers the case where you have a match for a in row 25 and a match for b in row 12; it will give you 12 (I assume this is what you'd want based on your question).

dat$MinID <- pmin(dat$FirstMatchingID, dat$SecondMatchingID, dat$ThirdMatchingID, na.rm = TRUE)
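To see how pmin treats the NA entries, here is a small sketch with made-up vectors. The names first, second and third are stand-ins for the three MatchingID columns, not values taken from the data above. With na.rm = TRUE, an NA is ignored as long as at least one value in that position is present, and the result stays NA only where every input is NA.

```r
# Made-up stand-ins for the three MatchingID columns (assumption)
first  <- c(NA, NA, 1, 4)
second <- c(NA, NA, NA, 2)
third  <- rep(NA_real_, 4)

# Element-wise minimum across the three vectors, skipping NAs
min_id <- pmin(first, second, third, na.rm = TRUE)
print(min_id)
```

Positions 1 and 2 have no match in any column, so they remain NA, which is what the later is.na() checks rely on.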

Last, this loop will do 3 things, creating a FinalID column with all the matching ID numbers from e:

  1. Where MinID is NA (no matches) set FinalID to e
  2. If MinID is a number, find that row (the earliest match) and check if its MinID is a number; if it is not, there are no earlier matches and it sets FinalID to MinID
  3. The rows that don't fit the above conditions are your special cases, where the row's earliest match has an earlier match itself. This will find that match and set it as FinalID.

for (i in 1:nrow(dat)) {
  x <- dat[i]$MinID
  if (is.na(dat[i]$MinID)) {
    dat[i, FinalID := e]
  } else if (is.na(dat[x]$MinID)) {
    dat[i, FinalID := MinID]
  } else {
    dat[i, FinalID := dat[x]$MinID]
  }
}

I think this should do it; let me know how it goes. I make no claims about its efficiency or speed.
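The last loop looks one step further back than MinID. If a chain is longer than that (say row 7 matches row 5, which matches row 3, which matches row 1), the lookup would need to be repeated until it reaches a row with no earlier match. A minimal sketch of that idea in base R, assuming a plain vector min_id that holds each row's earliest matching row number or NA; resolve_root is a hypothetical helper name, and the example values are only meant to resemble the six-row data above.

```r
# Hypothetical helper: follow each row's "earliest match" pointer until
# reaching a row that has no earlier match (the root of the chain).
resolve_root <- function(min_id) {
  n <- length(min_id)
  root <- numeric(n)
  for (i in seq_len(n)) {
    j <- i
    # Assumes every pointer refers to an earlier row, so this terminates
    while (!is.na(min_id[j])) {
      j <- min_id[j]
    }
    root[i] <- j
  }
  root
}

# Assumed MinID-style input: row 6 points to row 4, which points to row 2
min_id <- c(NA, NA, 1, 2, 1, 4)
final_id <- resolve_root(min_id)
print(final_id)
```

This only follows "earlier row" pointers. It would not merge two groups that are first connected by a later row, which needs a proper connected-components approach.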

moman822
  • This seems like a good approach. I had been seeking a vectorized solution, believing that this would be faster, but after reading part of _The R Inferno_, I learned that this is not necessarily the case. I have a couple questions. 1) Why did you use `grepl` and then `max` to get the maximum index instead of just using `max(grep(...))`? 2) When you write `dat[i]$MinID`, do you mean to write `dat$MinID[i]`? Those work differently for me. I haven't tried your exact code, because I'm just adapting the ideas for my real data, so I can't speak to the accuracy of the syntax, but the ideas seem solid. – njc Sep 14 '16 at 15:03
  • Let's see. 1) `x<- grepl(...` returns a vector of T/F for match/no match, then `which(x %in% TRUE)` gives the position of the T/F elements that are true (meaning there was a match), then `max` gets the match closest to the top of the data.table (or furthest from current row i). 2) I meant to write it that way. Try my full code out, it works for me as is and not when I change it, but you can check yourself. – moman822 Sep 14 '16 at 15:41
  • Thanks! I know how `grepl` then `which` works; I was just wondering if there was a specific reason you were creating a T/F vector and then pulling indices from it instead of getting the indices directly with `grep`. I guess either way is good. – njc Sep 14 '16 at 15:50
  • Oh, you're absolutely right. The specific reason was that I did not know that, it would seem. – moman822 Sep 14 '16 at 18:20
  • Well, I guess we both learned something! Mostly me though; I didn't even know about data.table! – njc Sep 14 '16 at 22:01
  • I'm a big fan of data.table over dplyr, personally (if you know about dplyr). The syntax is much more compact and I haven't found that I need to resort to dplyr for any functionality. – moman822 Sep 15 '16 at 01:57
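On the `grepl` versus `grep` point from the comments, a quick sketch suggests the two routes give the same index. The vector earlier is written out by hand to mimic what dat[i-(1:i)]$All would look like for row 5 (nearest row first), so treat it as an assumption. The second half illustrates that both functions match patterns inside strings, so a short ID can hit a longer one; comparing whole tokens is one possible way around that.

```r
# Rows above the current one, nearest first (hand-written assumption)
earlier <- c("dog wolf canine", "feline meower kitty",
             "canine puppy barker", "cat feline kit")

# Logical vector, then indices, then the largest index
via_grepl <- max(which(grepl("feline", earlier)))
# Indices directly
via_grep <- max(grep("feline", earlier))
same <- identical(via_grepl, via_grep)
print(same)

# "kit" is found inside "kitty", even with fixed = TRUE (literal matching)
substring_hit <- grepl("kit", "feline meower kitty", fixed = TRUE)

# Comparing whole space-separated tokens avoids the substring hit
tokens <- strsplit("feline meower kitty", " ", fixed = TRUE)[[1]]
token_hit <- "kit" %in% tokens
print(c(substring_hit, token_hit))
```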

Here is an amateur attempt. I think it does some of what you want. I have expanded the data.frame (now a data.table) by two rows to give a better example.

This loop creates a new column, dat$FirstMatchingID, that contains the ID from dat$e for the earliest match. I've only done it to match the first column, dat$a, but I think it could be expanded to b and c easily enough.

library(data.table)

dat <- data.table(a = c('cat', 'canine', 'feline', 'dog', 'feline','puppy'),
                  b = c('feline', 'puppy', 'meower', 'wolf', 'kitten', 'dog'),
                  c = c('kitten', 'barker', 'kitty', 'canine', 'cat','wolf'),
                  d = c('shorthair', 'collie', '', '','',''),
                  e = c(1, 5, 3, 8, 4, 6))

dat[, All := paste(a, b,c),]

for(i in 2:nrow(dat)) {
  print(dat[i])
  x <- grepl(dat[i]$a, dat[i-(1:i)]$All)
  y <- max(which(x %in% TRUE))
  dat[i, FirstMatchingID := dat[i-y]$e]
}

The result:

        a      b      c         d e                 All FirstMatchingID
1:    cat feline kitten shorthair 1   cat feline kitten              NA
2: canine  puppy barker    collie 5 canine puppy barker              NA
3: feline meower  kitty           3 feline meower kitty               1
4:    dog   wolf canine           8     dog wolf canine              NA
5: feline kitten    cat           4   feline kitten cat               1
6:  puppy    dog   wolf           6      puppy dog wolf               5

You then have to find out how you want to combine the rows to get your desired result, but hopefully this helps!
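Once each row carries a group label, one possible way to lay the groups out side by side is base R's reshape. This is only a sketch: the grp vector is written out by hand as an assumption (rows 1 and 3 together, rows 2 and 4 together), and pos and grp are helper column names made up for the example. It uses the four-row data frame from the question.

```r
dat <- data.frame(a = c('cat', 'canine', 'feline', 'dog'),
                  b = c('feline', 'puppy', 'meower', 'wolf'),
                  c = c('kitten', 'barker', 'kitty', 'canine'),
                  d = c('shorthair', 'collie', '', ''),
                  e = c(1, 5, 3, 8),
                  stringsAsFactors = FALSE)

# Assumed group labels: rows 1 and 3 belong together, as do rows 2 and 4
grp <- c(1, 2, 1, 2)

# Position of each row within its group (1 for the first row, 2 for the next)
dat$pos <- ave(seq_len(nrow(dat)), grp, FUN = seq_along)
dat$grp <- grp

# One output row per group; columns become a.1, b.1, ..., a.2, b.2, ...
wide <- reshape(dat, idvar = "grp", timevar = "pos", direction = "wide")
print(wide)
```

Groups with fewer rows than the largest group get NA in the unused columns, so no aggregation function is needed.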

moman822
  • This is an extremely helpful suggestion, and I've already started playing around with it. Right now I'm looking for ways to 1) replace the loop, since I have a huge dataset and looping is slow, and 2) make all instances point to the first line in their "network" of names. For example, say that row six was `fido dog wolf`. Then FirstMatchingID for row six would be 4. But because row six would be connected to row two _via row four_, I would want FirstMatchingID to be 2. If you can figure out a way around these problems, it would turn an already very helpful answer into an amazing one. Thanks! – njc Sep 13 '16 at 18:42
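The "network" idea in this comment is a connected-components problem, and one common way to handle it is union-find. Below is a rough base R sketch under that assumption; it is not taken from either answer. group_rows is a hypothetical helper that makes a single pass over the rows, remembers the first row in which each ID was seen, and links rows that share an ID, always keeping the lowest row number as the label. Empty or whitespace-only IDs are skipped, as in the question.

```r
# Hypothetical helper: label each row with the lowest row number in its
# network of shared IDs (connected components via union-find).
group_rows <- function(df, id_cols) {
  n <- nrow(df)
  parent <- seq_len(n)              # each row starts as its own root
  find <- function(i) {             # walk up to the root of row i
    while (parent[i] != i) i <- parent[i]
    i
  }
  first_seen <- new.env(hash = TRUE)  # ID value -> first row containing it
  for (i in seq_len(n)) {
    for (col in id_cols) {
      id <- as.character(df[[col]][i])
      if (is.na(id) || !nzchar(trimws(id))) next   # ignore empty IDs
      j <- first_seen[[id]]
      if (is.null(j)) {
        first_seen[[id]] <- i
      } else {
        ri <- find(i)
        rj <- find(j)
        parent[max(ri, rj)] <- min(ri, rj)  # smaller row number stays root
      }
    }
  }
  vapply(seq_len(n), find, integer(1))
}

# The six rows from this answer, with row six changed to fido/dog/wolf
dat <- data.frame(a = c('cat', 'canine', 'feline', 'dog', 'feline', 'fido'),
                  b = c('feline', 'puppy', 'meower', 'wolf', 'kitten', 'dog'),
                  c = c('kitten', 'barker', 'kitty', 'canine', 'cat', 'wolf'),
                  d = c('shorthair', 'collie', '', '', '', ''),
                  stringsAsFactors = FALSE)
groups <- group_rows(dat, c('a', 'b', 'c', 'd'))
print(groups)
```

Here row six lands in the same group as row two via row four. The sketch compares whole ID strings rather than using grepl, and it has not been tuned or timed on large data.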