Suppose I've got this data set to start with, in this silly layout:
originalDF <- data.frame(
Index = 1:14,
Field = c("Name", "Weight", "Age", "Name", "Weight", "Age", "Height", "Name", "Weight", "Age", "Height", "Name", "Age", "Height"),
Value = c("Sara", "115", "17", "Bob", "158", "22", "72", "Irv", "210", "42", "68", "Fred", "155", "65")
)
I want it to look like this:
Basically, I want to match the Weight, Age, and Height rows to the Name row above it. Splitting the data up is easy using dplyr
:
namesDF <- originalDF %>%
filter(Field == "Name")
detailsDF <- originalDF %>%
filter(!Field == "Name")
From here, using the Index (row number) seems the best way, i.e. match each row in detailsDF
with the entry in namesDF
that has the closest Index without going over. I used the fuzzyjoin
package and joined them with
fuzzy_left_join(detailsDF, namesDF, by = "Index", match_fun = list(`>`))
This sort of works, but it also joins each row in detailsDF
with EVERY row in namesDF
with a smaller Index number:
I came up with a solution using the distance to the next Index and filtering out the extra rows that way, but I want to avoid doing this; the actual source file will be over 200k rows, and the temporary resulting dataframe with the extra rows would be too big to fit into memory. Is there anything I can do here? Thanks!