16

When I merge two data frames, the result has more rows than the original data.

In this instance, the all dataframe has 104956 rows, koppen has 3968 rows and alltest dataframe has 130335 rows. Ordinarily, alltest should have had rows equal to or less than all.

Why is this inflation happening? I am not sure if giving the reproducible example would help as it does work in the previous instances I have used it.

alltest <- merge(all, koppen, by = "fips", sort = F)
Henrik
  • 65,555
  • 14
  • 143
  • 159
Geekuna Matata
  • 1,349
  • 5
  • 19
  • 38
  • 4
    Probably because one of the two dataframes had duplicate values of `fips`. – James King Jun 10 '14 at 21:14
  • 1
    Oh wait, there are duplicate FIPS in the ALL ofcourse. It is not supposed to be unique. Any idea how can I can merge without increasing these rows? – Geekuna Matata Jun 10 '14 at 21:20
  • try `merge(all,koppen,by="fips",sort=F, all.x = T)` – talat Jun 10 '14 at 21:29
  • 2
    Does `nrow(koppen)==length(unique(koppen$fips))` return `TRUE`? If not, then there are duplicated values of `fips` – jlhoward Jun 10 '14 at 21:29
  • 3
    How would you merge them unambiguously without increasing the number of rows? – stanekam Jun 10 '14 at 21:33
  • @jlhoward Thanks. That returns a FALSE. Also, this returns a FALSE as well. nrow(all)==length(unique(all$fips)) – Geekuna Matata Jun 10 '14 at 21:36
  • @beginneR Tried that before. Doesn't work. – Geekuna Matata Jun 10 '14 at 21:38
  • The latter result is not surprising - as you say `all` should have duplicated `fips`. But `koppen` should not, so there is bug somewhere. You are getting all these rows because of the duplicated `koppen$fips`. – jlhoward Jun 10 '14 at 21:39
  • True. I am using this data: http://koeppen-geiger.vu-wien.ac.at/data/KoeppenGeiger.UScounty.txt Any piece of code to detect the non-unique values? Also, can I remove all but 1 (first one) instance of the overlapping rows. Thanks – Geekuna Matata Jun 10 '14 at 21:42
  • Or is there a way to do this without removing duplicate rows? I am sure this is not a one-off problem and can imagine a lot of datasets where this may be an issue. I am not limited to using merge. – Geekuna Matata Jun 10 '14 at 21:51
  • 1
    "I am not sure if giving the reproducible example would help as it does work in the previous instances I have used it." A reproducible example should reproduce the issue you're having. – jbaums Jun 11 '14 at 02:34

1 Answers1

19

First, from ?merge:

The rows in the two data frames that match on the specified columns are extracted, and joined together. If there is more than one match, all possible matches contribute one row each.

Using your link in the comments:

url    <- "http://koeppen-geiger.vu-wien.ac.at/data/KoeppenGeiger.UScounty.txt"
koppen <- read.table(url, header=T, sep="\t")
nrow(koppen)
# [1] 3594
length(unique(koppen$FIPS))
# [1] 2789

So clearly koppen has duplicated FIPS codes. Examining the dataset and the website, it appears that many of the counties are in more than one climate class, so for example, the county of Ankorage, Alaska has three climate classes:

koppen[koppen$FIPS==2020,]
#     STATE    COUNTY FIPS CLS  PROP
# 73 Alaska Anchorage 2020 Dsc 0.010
# 74 Alaska Anchorage 2020 Dfc 0.961
# 75 Alaska Anchorage 2020  ET 0.029

The solution depends on what you are trying to accomplish. If you want to extract all rows in all with any FIPS that appear in koppen, either of these should work:

merge(all,unique(koppen$FIPS))

all[all$FIPS %in% unique(koppen$FIPS),]

If you need to append the county and state name to all, use this:

merge(all,unique(koppen[c("STATE","COUNTY","FIPS")]),by="FIPS")

EDIT Based on the exchange below in the comments.

So, since there are sometimes multiple rows in koppen with the same FIPS, but different CLS, we need a way to decide which of the rows (e.g., which CLS) to pick. Here are two ways:

# this extracts the row with the largest value of PROP, for that FIPS
url        <- "http://koeppen-geiger.vu-wien.ac.at/data/KoeppenGeiger.UScounty.txt"
koppen     <- read.csv(url, header=T, sep="\t")
koppen     <- with(koppen,koppen[order(FIPS,-PROP),])
sub.koppen <- aggregate(koppen,by=list(koppen$FIPS),head,n=1)
result     <- merge(all, sub.koppen, by="FIPS")

# this extracts a row at random
sub.koppen <- aggregate(koppen,by=list(koppen$FIPS), 
                        function(x)x[sample(1:length(x),1)])
result     <- merge(all, sub.koppen, by="FIPS")
Henrik
  • 65,555
  • 14
  • 143
  • 159
jlhoward
  • 58,004
  • 7
  • 97
  • 140
  • Ok, one problem..I definitely need the class as well. When I am usingmerge(all,unique(koppen[c("STATE","COUNTY","CLASS","FIPS")]),by="FIPS") , it is again more than required rows. But the code works fine when I only use state, county, fips. Why is that? – Geekuna Matata Jun 11 '14 at 18:43
  • There can be more than one class for a given FIPS. Which class do you want? If you want all of them, then there will be duplicated FIPS. – jlhoward Jun 11 '14 at 18:51
  • Whichever row in *koppen* from which *all* reads a particular FIPS (one of the many occurrences of it), I want the corresponding class. – Geekuna Matata Jun 11 '14 at 18:55
  • Any clue on this sir? – Geekuna Matata Jun 11 '14 at 19:34
  • So `all` has a `CLS` column as well? Do you want rows in `all` where both `FIPS` and `CLS` match in `koppen`? – jlhoward Jun 11 '14 at 20:23
  • ALL has no CLS column. Now that I have FIPS in ALL, all I need are the corresponding CLS *FROM* Koppen as a column in All. I am guessing the way it works is, MERGE matches FIPS in ALL with FIPS in Koppen. So whichever FIPS it read in Koppen, I want the corresponding CLS. Hope I am not rambling too much. – Geekuna Matata Jun 11 '14 at 20:34
  • 1
    The problem is that, for a given FIPS, there is *more than one CLS* in `koppen`. So if you want all of them, you will get more than one row per FIPS in the merge. Take Ankorage, from my example. If `all` has one row with FIPS=2020, and you want all three CLS from `koppen`, then you will have three rows in the merge for FIPS=2020. If `all` has multiple rows with FIPS=2020, then you will have three rows in the result *for every row in `all` with FIPS=2020*. – jlhoward Jun 11 '14 at 20:39
  • Yes, I realized that. We faced the same problem with multiple FIPS, didn't we? How does your piece of code decide which one to pick? For example, koppen[koppen$FIPS==2020,]. How did it decide between the three? – Geekuna Matata Jun 11 '14 at 20:42
  • I want to follow the same strategy as with multiple FIPS. Just need one class (even random would do. Or maybe the first match. Or the CLS that occurs maximum times). For example, in your example,Dsc occurs twice and ET occurs only once, so I would choose Dsc. Please help if you can. I appreciate it so much. :) – Geekuna Matata Jun 11 '14 at 20:47