I have two large datasets: the original data, and a copy that has had its grouping column and any duplicate rows removed. After a bunch of data wrangling / machine learning, I need to reattach the grouping column from the original data to the reduced dataset (the one with the duplicates and grouping column removed). I have tried to replicate this in an example:

#Using the iris dataset, adding duplicates as my real dataset involves duplicates
#repeat row 1, 20 times
iris1 <- rbind(iris, iris[rep(1, 20), ])

#new dataset with dupes removed and species column removed
iris_rm <- subset(iris1, select = -c(Species) )
iris_rm <- iris_rm[!duplicated(iris_rm), ]

# Data analysis occurring here...
#
#
#
#

#I then want to semi_join the two datasets, without the Species column, 
# i.e. returning all the rows that are from iris1, that match iris_rm (ignoring the Species column)
library(dplyr)
new <- semi_join(iris_rm, iris1[,-5], by = NULL)

#How do I then reattach the Species column to the new dataframe? 
#I have tried this, however as there are differing row lengths, it won't work
new['Species2']= iris1['Species']

Ideally, in the semi_join, it would ignore the Species column, without actually removing it from the new dataframe. Note that the duplicated rows in the dataset are not true duplicates (i.e. the species column could be different despite the rest of the columns being the same). Hope this makes sense!

MM1
  • This leaves me with the entire dataset (190 variables), and removes the species column which I want – MM1 Apr 20 '23 at 06:07

1 Answer

Maybe you can use inner_join instead of semi_join. It can create duplicate rows so you need to use distinct().

new <- inner_join(iris_rm, iris1) %>% distinct()
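
A slightly fuller sketch of the same idea, in case it helps to see the row counts. It names the join columns explicitly so the join never depends on Species; `features` and `joined` are just names made up for this example. Note that `distinct()` compares whole rows, so it only drops rows where Species matches as well.

```r
library(dplyr)

# Columns to join on (everything except Species); the name is an assumption.
features <- c("Sepal.Length", "Sepal.Width", "Petal.Length", "Petal.Width")

# Same setup as in the question: row 1 repeated 20 times.
iris1 <- rbind(iris, iris[rep(1, 20), ])

# Reduced dataset: Species dropped, duplicate feature rows removed.
iris_rm <- distinct(iris1[features])

# Join on the feature columns only; Species comes along from iris1.
# distinct() then removes rows that are identical in every column,
# Species included.
joined <- inner_join(iris_rm, iris1, by = features) %>%
  distinct()

nrow(iris_rm)
nrow(joined)
```

If the same feature values occur with more than one Species, `joined` keeps one row per combination, so it can end up with more rows than `iris_rm`.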
Bei
  • This doesn't work for me as the distinct function is still taking into account my 'species' column, so doesn't return the correct number of rows. I need to be able to join the data (ignoring the species column), but still have the species column attached. – MM1 Apr 20 '23 at 04:51
  • If I were to use: `new <- inner_join(iris_rm, iris1) %>% select(-Species) %>% distinct()` it obviously removes the column that I need to be present at the end – MM1 Apr 20 '23 at 04:52
  • So the duplicated rows in your dataset are not true duplicates (i.e. the species column could be different despite the rest of the columns being the same)? – Bei Apr 20 '23 at 05:07
  • Yes that is correct :) – MM1 Apr 20 '23 at 05:30
  • In that case, some of your data can belong to multiple groups. Depending on the goal of your analysis, perhaps you can collapse these groups into one single group using `iris2 <- group_by(iris1, Sepal.Length, Sepal.Width, Petal.Length, Petal.Width) %>% summarize(new_group = paste0(Species, collapse = ", "))`. Then you can use `inner_join(iris_rm, iris2)` to get the group variable and the correct number of rows you want. – Bei Apr 21 '23 at 01:51
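
A rough sketch of the collapsing idea from the last comment, under a few assumptions. The `conflict` row (row 1 relabelled as versicolor) is invented purely to mimic the "same measurements, different species" case; it is not real iris data. The names `features`, `lookup` and `Species_all` are also made up for illustration. One small variation on the comment's code: `unique()` is applied before pasting, so a label that repeats within a group is listed only once.

```r
library(dplyr)

# Columns that identify a row once Species is ignored (assumed name).
features <- c("Sepal.Length", "Sepal.Width", "Petal.Length", "Petal.Width")

# Hypothetical conflicting row: same measurements as row 1, different Species.
conflict <- iris[1, ]
conflict$Species <- factor("versicolor", levels = levels(iris$Species))

# Setup from the question plus the invented conflicting row.
iris1 <- rbind(iris, iris[rep(1, 20), ], conflict)

# Reduced dataset: Species dropped, duplicate feature rows removed.
iris_rm <- distinct(iris1[features])

# One row per feature combination, with every Species seen for it.
lookup <- iris1 %>%
  group_by(across(all_of(features))) %>%
  summarise(Species_all = paste(unique(Species), collapse = ", "),
            .groups = "drop")

# Each row of iris_rm matches exactly one row of lookup,
# so the row count is unchanged and the group label is attached.
new <- inner_join(iris_rm, lookup, by = features)

nrow(new) == nrow(iris_rm)
head(new, 3)
```

Whether a combined label such as "setosa, versicolor" is acceptable depends on what the grouping column is needed for afterwards; if each row must carry exactly one group, some other rule for resolving the conflicts would be needed.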