R How to remove duplicates from a list of lists

Question

I have a list of lists that contain the following 2 variables:

> dist_sub[[1]]$zip
 [1] 901 902 906 907 908 909 910 911 912 913 914 915 916 917 918 919 920 921 922 923 924 925 926 927 928
[26] 929 930 931 933 934 935 936 937 938 939 940 955 961 962 963 965 966 968 969 970 975 981

> dist_sub[[1]]$hu
 [1]  4990    NA   168 13224    NA  3805    NA  6096  3884  4065    NA 16538    NA 12348 10850    NA
[17]  9322 17728    NA 13969 24971  5413 47317  7893    NA    NA    NA    NA    NA   140    NA     4
[33]    NA    NA    NA    NA    NA 13394  8939    NA  3848  7894  2228 17775    NA    NA    NA



> dist_sub[[2]]$zip
 [1] 921 934 952 956 957 958 959 960 961 962 965 966 968 969 970 971

> dist_sub[[2]]$hu
 [1] 17728   140  4169 32550 18275    NA 22445     0 13394  8939  3848  7894  2228 17775    NA 12895

Is there a way to remove duplicates so that a zipcode that appears in more than one list is kept in only one of them, according to a specific criterion?

Example: zipcode 00921 is present in both lists above. I'd like to keep it only in the list with the lowest sum of hu (housing units). In this case I would like to keep zipcode 00921 in the 2nd list only, since the sum of hu is 162,280 in list 2 versus 256,803 in list 1.

Any help is very much appreciated.

Do you have only two sublists or are you after a more generic solution? — asb, Jul 16 '13 at 22:18
I have hundreds of subsets. So I am looking for a function to implement a deduping process according to the criteria described above. — Marc Moroccoholic, Jul 17 '13 at 10:36

asb · Accepted Answer · 2013-07-17T12:04:34.957

Here is a simulated dataset for your problem so that others can use it too.

dist_sub <- list(list("zip"=1:10,
                      "hu"=rnorm(10)),
                list("zip"=8:12,
                      "hu"=rnorm(5)),
                list("zip"=c(1, 3, 11, 7),
                      "hu"=rnorm(4))
                )

Here's a solution that I was able to come up with. I realized that loops were really the cleaner way to do this:

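do.this <- function (x) {
  # With fewer than two sets there is nothing to compare.
  if (length(x) < 2) return(x)
  for (k in 1:(length(x) - 1)) {
    for (l in (k + 1):length(x)) {
      # TRUE for zips in set k that do not appear in the later set l.
      # A logical mask is used because x[-which(...)] returns an empty
      # vector when nothing matches.
      keep <- !(x[[k]][["zip"]] %in% x[[l]][["zip"]])
      x[[k]][["zip"]] <- x[[k]][["zip"]][keep]
      x[[k]][["hu"]] <- x[[k]][["hu"]][keep]
    }
  }
  return(x)
}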

The idea is really simple: for each set of zips we keep removing the elements that are repeated in any set after it. We do this until the penultimate set because the last set will be left with no repeats in anything before it.
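One detail to check in any version of this loop: if two sets share no zips, which() returns integer(0), and negative indexing with an empty vector drops every element rather than none. A minimal sketch of the effect (the vectors z and other are made up for illustration):

```r
z     <- c(901, 902, 906)   # hypothetical zips in one set
other <- c(950, 951)        # hypothetical later set with no overlap

to.remove <- which(z %in% other)   # integer(0): nothing to remove
length(z[-to.remove])              # 0: every element was dropped

keep <- !(z %in% other)            # logical mask, all TRUE here
z[keep]                            # all three zips are retained
```

Indexing with the logical mask behaves the same whether or not there is an overlap, so it is probably the safer pattern here.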

To apply your criterion, i.e. the lowest sum of hu, you can reuse the function above. All you need to do is reorder the list dist_sub by sum of hu, like so:

sum_hu <- sapply(dist_sub, function (k) sum(k[["hu"]], na.rm=TRUE))
dist_sub <- dist_sub[order(sum_hu, decreasing=TRUE)]

Now dist_sub is sorted by sum_hu in decreasing order, which means that every set is preceded only by sets with a larger (or equal) sum_hu. Therefore, if the sets at positions i and j (i < j) have a zip a in common, then a should be removed from the ith set, because the jth set has the smaller sum. That is exactly what do.this does. Do you think that makes sense?
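As a rough check of the ordering argument, here is a minimal sketch on a small made-up dataset. The helper name dedup_by_lowest_sum and the toy values are assumptions for illustration, not part of the original data. The helper combines the reordering step with a removal pass that uses a logical mask.

```r
# Hypothetical helper: keep each zip only in the set with the lowest sum of hu.
dedup_by_lowest_sum <- function(x) {
  if (length(x) < 2) return(x)
  # Sort sets so that larger sums come first.
  sum_hu <- sapply(x, function(s) sum(s[["hu"]], na.rm = TRUE))
  x <- x[order(sum_hu, decreasing = TRUE)]
  for (k in seq_len(length(x) - 1)) {
    # All zips that occur in any later set (i.e. a set with a smaller sum).
    later <- unlist(lapply(x[(k + 1):length(x)], function(s) s[["zip"]]))
    keep <- !(x[[k]][["zip"]] %in% later)
    x[[k]][["zip"]] <- x[[k]][["zip"]][keep]
    x[[k]][["hu"]]  <- x[[k]][["hu"]][keep]
  }
  x
}

# Made-up data: 921 is shared by sets 1 and 2, 934 by sets 1 and 3.
toy <- list(
  list(zip = c(901, 921, 934), hu = c(100, 50, NA)),  # sum = 150
  list(zip = c(921, 952),      hu = c(50, 10)),       # sum = 60
  list(zip = c(934, 960),      hu = c(5, 500))        # sum = 505
)

res <- dedup_by_lowest_sum(toy)
lapply(res, function(s) s[["zip"]])
# zips per set, largest sum first: 960 / 901 934 / 921 952
```

Zip 921 ends up only in the set with sum 60, and 934 only in the set with sum 150. Note that the result comes back in the sorted order (largest sum first), not in the original order of the list.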

PS: I've called the function do.this because I usually like writing generic solutions while this was a very specific question, albeit, an interesting one.

edited Jul 17 '13 at 12:04

answered Jul 16 '13 at 22:33

asb


Since you have the data, would you mind pasting the `dput` output here for others to try out? The OP hasn't provided it. – Arun Jul 16 '13 at 22:34
I am using this on a simulated dataset. :D Just trying to get the idea. – asb Jul 16 '13 at 22:35
@Arun: I've added a simulated dataset. – asb Jul 16 '13 at 22:51
@asb: I like your idea, but it does not take into account the criterion I specified above. Instead of removing the elements that are repeated in subsequent sets, I want to keep the repeated zips only in the set where the sum of housing units (within that set) is the smallest. – Marc Moroccoholic Jul 17 '13 at 10:46
@MarcMoroccoholic: Drat! Let me get back to you. – asb Jul 17 '13 at 11:40
@asb: Thanks much. Your function should work. However, I am getting the following error: Error in `[[<-.data.frame`(`*tmp*`, "zip", value = c(901L, 902L, 906L, : replacement has 35 rows, data has 43 – Marc Moroccoholic Jul 17 '13 at 15:35
Can you show a `traceback` or some other means of showing where this problem is occurring? – asb Jul 17 '13 at 21:36
I think I know the issue. Your code works on a list of lists. My data consists of a list of data frames. Is there an easy way to convert my list of DFs to a list of lists? Again your help is very much appreciated. – Marc Moroccoholic Jul 17 '13 at 21:50
> traceback() 4: stop(sprintf(ngettext(N, "replacement has %d row, data has %d", "replacement has %d rows, data has %d"), N, nrows), domain = NA) 3: `[[<-.data.frame`(`*tmp*`, "zip", value = c(901L, 906L, 961L, 962L, 963L)) at zip_dedup.r#5 2: `[[<-`(`*tmp*`, "zip", value = c(901L, 906L, 961L, 962L, 963L )) at zip_dedup.r#5 1: zip_dedup(dist_sub_am) – Marc Moroccoholic Jul 17 '13 at 21:55
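
The error in the comments above arises because assigning a shorter vector to a column of a data frame with `[[<-` fails when the row counts differ. One possible way to get from a list of data frames to a list of plain lists is to apply as.list() to each element. A minimal sketch, where dist_sub_df is a hypothetical stand-in for the asker's data:

```r
# Hypothetical stand-in for a list of data frames with zip and hu columns
dist_sub_df <- list(
  data.frame(zip = c(901L, 921L), hu = c(4990, 17728)),
  data.frame(zip = c(921L, 934L), hu = c(17728, 140))
)

# as.list() on a data frame returns its columns as a plain named list
dist_sub_list <- lapply(dist_sub_df, as.list)

str(dist_sub_list[[1]])
```

After the conversion, each element is an ordinary list whose zip and hu entries can be shortened independently, which is what the loop-based function expects.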
