0

I am consolidating data to put it into a new data base and I am having some issues with replacing the old columns with the new Foreign Key (FK) variables. I am consolidating all my data into a long format and then cast it again to create the new table structure of the new data model. Due to some name duplicates for generic variables like 'total' and merging some old names into new ones, direct merging does not always work because the names would not match. There are some work arounds that I can use and maybe there are some other conceptual improvements to fix my issues. However, I am generally courious if there is a way to merge and replace a FK into a long data table, like it works for a wide table.

Searching for threads with "merging" or "long" always yields other melting problems, thats why I am asking myself.

I tried creating a generic example using 'mtcars'

library(data.table)
dt<-data.table(mtcars)
#Create Id to melt the data 
dt[,id:=1:.N]
dt_long<-melt(dt, id.vars="id")
#Create a unique sub table for 4 variables of mtcars with a new sub_id
merge_cols<-c("vs","am","gear","carb")
dt_sub<-dt[,..merge_cols]
dt_sub<-unique(dt_sub)
dt_sub[,id_sub:=1:.N]
# Easy to merge with orginal table and delete old columns
dt<-merge(dt,dt_sub, by=merge_cols )
set(dt, , merge_cols, NULL)
#Any way to do this with the long data frame? 
setorder(dt_long, id)
dt_long[variable %in% merge_cols,]
dt_sub[vs==0 & am==1 & gear ==4 & carb==4,]

For instance the first 4 rows should be replaced by one row with variable "sub_id" and value "4". Just if I were to melt the merged 'dt' from above using the wide original dt. So my final result should look like I melted the wide table after it has been merged with the sub table, but for each four rows of the 'merge_cols' we have one row with 'sub_id'

dt_final<-melt(dt,  id.vars="id"  )
dt_final[variable=="id_sub"]
Max M
  • 806
  • 14
  • 29
  • I do not understand what you mean by "For instance the first 4 rows should be replaced by one row with variable `sub_id` and value `4`". Please, can you [edit] your question and show the expected result? Thank you. – Uwe May 31 '20 at 18:26
  • @Uwe hope that helps? I just explicitly wrote the conditions for 'id==1' . Is there a way to do this generally. I want to to do this for every id and I can also understand if this is something that one just does not do, because one should use normal merges/left join of wide tables – Max M Jun 01 '20 at 07:46
  • Max, thank you for the additional explanation. However, I am afraid I will need to see the expected result to understand. Thank you. – Uwe Jun 01 '20 at 08:35
  • @Uwe the expected results is 'dt_final' – Max M Jun 01 '20 at 10:21

1 Answers1

1

If I understand correctly (but may be I'm missing the point), the question is how to merge or subset in long format. Perhaps, the OP is looking for something along the following lines.

The first step is to understand that subsetting is equivalent (to some extent) to merging / joining with a lookup table. So,

library(data.table) # just to highlight that data.table syntax is used here
dt_sub[vs == 0 & am == 1 & gear == 4 & carb == 4, ]

returns the same result as

lut_wide <- data.table(vs = 0, am = 1, gear = 4, carb = 4)
dt_sub[lut_wide, on = names(lut_wide)]
   vs am gear carb id_sub
1:  0  1    4    4      1

To do the same in long format, also the lookup table needs to be reshaped:

lut_long <- melt(lut_wide, measure.vars = names(lut_wide))
   variable value
1:       vs     0
2:       am     1
3:     gear     4
4:     carb     4

Alternatively, fread() can be used to create a lookup table in long format from scratch:

lut_long <- fread(
"variable, value
vs, 0
am, 1
gear, 4
carb, 4")

Now, we need to find those ids which fullfil all 4 conditions:

idx <- dt_long[lut_long, on = .(variable, value)][
  , which(.N == nrow(lut_long)), by = id][, .(id)]
idx
   id
1:  1
2:  2

Alternatively, we can write

idx <- dt_long[lut_long, on = .(variable, value)][
  , .N, by = id][N == nrow(lut_long), .(id)]

Finally, idx is used to subset dt_long (by joining):

dt_long[variable %in% merge_cols, ][idx, on = "id"]
   id variable value
1:  1       vs     0
2:  1       am     1
3:  1     gear     4
4:  1     carb     4
5:  2       vs     0
6:  2       am     1
7:  2     gear     4
8:  2     carb     4

Here, only the variables defined in merge_cols are considered.

Apparently, the result contains a duplicate. This can be removed by picking only the first row of idx:

dt_long[variable %in% merge_cols, ][first(idx), on = "id"]
   id variable value
1:  1       vs     0
2:  1       am     1
3:  1     gear     4
4:  1     carb     4
Uwe
  • 41,420
  • 11
  • 90
  • 134