How to infer missing values in a R data frame from other rows that have the data?

Question

In my example I have a table with products that have a unique ID in Store 1 and a different (but also unique) ID in Store 2.

As a rule, the same product has always the same ID in store 1 and always the same ID in store 2. i.e. if I know only one of these variables, I should be able to fill in the remaining ones.

However, I don't have a neat reference table or list that shows which Product and IDs belong together. All I have to start with is a table with lots of gaps like this:

Before

I would like to fill the gaps using the information that is already in the table like so:

After

Is there a R function that can do this or how would you go about this?

#Example data
df <- data.frame(
  c(NA, "Shovel", NA, NA, "Gloves", NA),
  c("W06", NA, "W06", "W11", "W11", NA),
  c("EF001", "EF001", NA, NA, "EF004", "EF004")
)
colnames(df) <- c("Product", "Store1_ID", "Store2_ID")


df2 <- data.frame(
  c("Shovel",NA, NA, NA, "Gloves", NA),
  c(NA, "W06", "W06", "W11", "W11", NA),
  c("EF001", "EF001", NA, NA, "EF004", "EF004")
)
colnames(df2) <- c("Product", "Store1_ID", "Store2_ID")

akrun · Answer 1 · 2022-05-04T21:00:40.117

Create a grouping index column, after looping across the 'Store' columns, apply the na.locf0 to replace the NA elements with previous non-NA values, if there are NAs as beginning element, then apply the na.locf0 with fromLast = TRUE on top of the first na.locf0, convert this to logical column with duplicated, negate (!) so that it returns TRUE for first non-duplicated element and FALSE for others, get the cumulative sum (cumsum), use pmax to find the max index elementwise to create the 'grp', then we use fill on the other columns

library(tidyr)
library(dplyr)
library(purrr)
df %>% 
  group_by(grp = invoke(pmax, across(starts_with('Store'), 
    ~ cumsum(!duplicated(zoo::na.locf0(zoo::na.locf0(.x), 
           fromLast = TRUE))), .names = "{.col}_new")))  %>% 
 fill(everything(), .direction = "downup") %>%
 ungroup %>% 
 select(-grp)
# A tibble: 6 × 3
  Product Store1_ID Store2_ID
  <chr>   <chr>     <chr>    
1 Shovel  W06       EF001    
2 Shovel  W06       EF001    
3 Shovel  W06       EF001    
4 Gloves  W11       EF004    
5 Gloves  W11       EF004    
6 Gloves  W11       EF004

The .names is used to create a new column, though in this case, it is not needed

Same output with df2

df2 %>% 
  group_by(grp = invoke(pmax, across(starts_with('Store'), 
    ~ cumsum(!duplicated(zoo::na.locf0(zoo::na.locf0(.x), 
     fromLast = TRUE))), .names = "{.col}_new"))) %>% 
  fill(everything(), .direction = "downup") %>%
  ungroup %>% 
  select(-grp)

-output

# A tibble: 6 × 3
  Product Store1_ID Store2_ID
  <chr>   <chr>     <chr>    
1 Shovel  W06       EF001    
2 Shovel  W06       EF001    
3 Shovel  W06       EF001    
4 Gloves  W11       EF004    
5 Gloves  W11       EF004    
6 Gloves  W11       EF004

TarJae · Accepted Answer · 2022-05-04T21:06:04.650

Update II: For the new dataset df2:

add the .direction argument to fill in both direction the grouping variable and then apply the fill_run funciton:

library(dplyr)
library(runner)

df2 %>% 
  fill(Store1_ID, .direction = "updown") %>% 
  group_by(Store1_ID) %>% 
  mutate(across(everything(), ~fill_run(., run_for_first = TRUE)))

  Product Store1_ID Store2_ID
  <chr>   <chr>     <chr>    
1 Shovel  W06       EF001    
2 Shovel  W06       EF001    
3 Shovel  W06       EF001    
4 Gloves  W11       EF004    
5 Gloves  W11       EF004    
6 Gloves  W11       EF004

Update to consider groupwise:

library(dplyr)
library(runner)

df %>% 
  fill(Store1_ID) %>% 
  group_by(Store1_ID) %>% 
  mutate(across(everything(), ~fill_run(., run_for_first = TRUE)))

   Product Store1_ID Store2_ID
  <chr>   <chr>     <chr>    
1 Shovel  W06       EF001    
2 Shovel  W06       EF001    
3 Shovel  W06       EF001    
4 Gloves  W11       EF004    
5 Gloves  W11       EF004    
6 Gloves  W11       EF004

.direction = "updown" seems to do the trick. The downside is that I still need to pick a column first (Store1_ID), and I am not sure if this will work in situations where I have different column headers. Nonetheless, this solved my problem for now and I accepted this answer because I find it easier to understand. — mandmeier, May 05 '22 at 16:09

How to infer missing values in a R data frame from other rows that have the data?

2 Answers2