0

I have 3 dataframes called respectively: barometre2013, barometre2016, barometre2018.

I've already merge barometre2018 and barometre2016 like this:

baro1618 <- merge(barometre2016, barometre2018, all = TRUE)

All was good, I have all rows of the two dataframes and the columns names that are the same are merged in one with all rows of the tow dataframes. Exactly what I wanted.

The merged table looks like this:

names(baro1618)
    [1] "q0qc"           "regio"          "sexe"           "age"            "langu"          "q1a_1"          "q1a_2"          "q1a_3"          "q1a_4"          "q1a_5"         
    [11] "q1a_6"          "q1a_7"          "q1a_8"          "q1a_9"          "q1a_10"         "q1b_1"          "q1b_2"          "q1b_3"          "q1b_4"          "q1b_5"         
    [21] "q1b_6"          "q1b_7"          "q1b_8"          "q1b_9"          "q1b_10"

NOW, my problem start here.

I want to merge baro1618 with barometre2013, but before doing that I have to lower case all the columns names because when I tried to merge without doing this, the columns in uppercase of barometre2013 that have the same name in lower case baro1618 weren't merged.

The df barometre2013 looks like this:

names(barometre2013)
    [229] "POND"        "Q1A_1"       "Q1A_2"       "Q1A_3"       "Q1A_4"       "Q1A_5"       "Q1A_6"       "Q1A_7"       "Q1A_8"       "Q1A_9"       "Q1A_10"      "Q1B_1"      
    [241] "Q1B_2"       "Q1B_3"       "Q1B_4"       "Q1B_5"       "Q1B_6"       "Q1B_7"       "Q1B_8"       "Q1B_9"       "Q1B_10"      "Q5A_1"       "Q5A_2"       "Q5A_3"  

So I've tried this two solutions to lower case (both works):

barometre2013 <- setnames(barometre2013, tolower(names(barometre2013)))

colnames(barometre2013) <- tolower(colnames(barometre2013))

The result:

[229] "pond"        "q1a_1"       "q1a_2"       "q1a_3"       "q1a_4"       "q1a_5"       "q1a_6"       "q1a_7"       "q1a_8"       "q1a_9"       "q1a_10"      "q1b_1"      
[241] "q1b_2"       "q1b_3"       "q1b_4"       "q1b_5"       "q1b_6"       "q1b_7"       "q1b_8"       "q1b_9"       "q1b_10"      "q5a_1"       "q5a_2"       "q5a_3"  

BUT, when I've tried to merge like this :

baro1118 <- merge(baro1618, barometre2013, all = TRUE)

It give me this error :

Error in fix.by(by.x, x) : 'by' must specify a uniquely valid column

I don't understand why it was working in the first example and not in this second one. I can't specify any columns because I have TOO much name columns that match and a lot that do not match.

It should be possible not to specify right ?

Also, I want to keep all the columns names that match and the ones that don't match of both df.

Sorry for this long explanation, but I really need answer and I've read a lot of Q/A on SO and didn't find my answer.

Vadim Kotov
  • 8,084
  • 8
  • 48
  • 62
Marine Leroi
  • 23
  • 1
  • 6
  • I forgot to specify that when I don't do the tolower function before, it actually merge without giving me the error !! – Marine Leroi May 04 '18 at 16:25
  • I don't think we have enough information to give you clear instructions. Would need to see `str()` done on both dataframes before the merge to know what the column names looked like and the structure of each of them. Be advised that executing `setnames(barometre2013, tolower(names(barometre2013)))` really does nothing. You would need to assign the value back to `barometre2013` if you wanted the result to be "durable". – IRTFM May 04 '18 at 18:22
  • I've edited my question and I've corrected the setnames. Thank you so much for your time ! – Marine Leroi May 04 '18 at 18:43
  • Appears to me that you are trying to merge with matching too many columns at once. You clearly are not giving me the full "picture" from str() since those column numbers are in the 200's!!!! Don't those dataframes have some single (or maybe two) column(s) that are going to match up? Or is it possible that your eally want to do an rbind operation? These have names suggesting they are weather readings in different years. – IRTFM May 04 '18 at 21:50
  • Exactly, it's not the full picture of my dfs because they have more than 400 columns and more than 1000 rows each, so it's kind difficult to show you all. I choose to show you this part because you can see that there is some columns that have the same name but one in lower case and one in upper case. When I merged both of them, these columns didn't match and it's not good for me because it's the same column i.e. the same question. I'm going to try the rbind function. I did some research on it and find out that there is a rbind.fill function for dfs that don't have the same amount of columns. – Marine Leroi May 06 '18 at 16:39
  • @42- rbindlist() worked very well !! Thanks a lot for all the help – Marine Leroi May 08 '18 at 16:29
  • You can write an answer to your own question. It would be more useful if you started by including the output of str() on both dataframes from before the effort started and also posted str() or the output of head() on the final results. Remember to include library calls for all packages used, since `rbindlist` on my machine is from data.table. It would have the beneficial result of boosting your rep to a point where you can make comments on other posts (and maybe other things that I forget.) It's technically an "append" operation rather than what's know in computerese as a "merge". – IRTFM May 08 '18 at 17:00
  • Ok, thank you for the information, I'll do it from now on. – Marine Leroi May 08 '18 at 17:41

1 Answers1

0

Maybe worth a try:

baro1118 <- merge(baro1618, barometre2013, all = TRUE, by=intersect(names(baro1618), names(barometre2013))

This merges only by common columns.

That being said, your hunch of using rbind for this is probably more correct. If this is data from differentt time periods, and they don't overlap, rbind will simply stack one on top of the other. This doesn't always go smoothly, but here's a crude hack:

# maybe barometre2013 has missing column names
missing.column.names <- setdiff(names(baro1618), names(barometre2013))
barometre2013[, missing.column.names] <- NA

# maybe baro1618 has missing column names
missing.column.names <- setdiff(names(barometre2013), names(baro1618))
baro1618[, missing.column.names] <- NA
Liudvikas Akelis
  • 1,164
  • 8
  • 15