I have a dataset in which one specific variable (Var1) has blank values, and I am trying to understand why. I use R (as a novice coder) and have two questions:

  1. Var1 is 60% complete: the field is entered with alphanumeric values, but 40% of entries are just blank. How do I write code to find which of the other variables in my dataset (Var2, Var3, Var4, ...) are most associated with a blank entry?

  2. One variable I am interested in is the store (let's call this variable Store). Can I run code to see whether the majority of blank Var1 entries come from only a few Stores? That is what I suspect: a few Stores just aren't recording Var1.

Thanks so much for your help.

Ryan S

2 Answers

Welcome to R, and to Stack Overflow. First things first: if you want people to help you, you need to give an example they can actually work with.

The first thing you'd want to do is give us some data that looks like yours. From your description, something like this should do:

df <-
  data.frame(
    Store = c('A', 'A', 'A', 'B', 'B', 'B', 'C', 'C', 'C'),
    Var1 = c(100, NA, 200, NA, NA, NA, 100, 150, 200),
    Var2 = c(30, 20, NA, NA, NA, 40, 20, 30, 50)
  )
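One caveat: the toy data above uses NA for the missing entries. If your blanks are actually empty strings (an assumption on my part, since you describe Var1 as alphanumeric with entries that are "just blank"), is.na() will not catch them. A minimal sketch of converting them first, using a hypothetical data frame called raw:

```r
# Hypothetical data where blanks are empty or whitespace-only strings
raw <- data.frame(
  Store = c("A", "A", "B", "B"),
  Var1 = c("x1", "", "  ", NA),
  stringsAsFactors = FALSE
)

# Treat empty and whitespace-only entries as missing
is_blank <- !is.na(raw$Var1) & trimws(raw$Var1) == ""
raw$Var1[is_blank] <- NA

# Count the entries that are now recognised as missing
n_missing <- sum(is.na(raw$Var1))
n_missing
```

If the data comes from a CSV file, passing na.strings = c("", "NA") to read.csv() should do the same conversion at import time.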

Regarding your second question, I would group by Store and count the missing values of Var1, like this:

library(tidyverse)
df %>% 
  group_by(Store) %>% 
  summarise(
    missing_count = sum(is.na(Var1)),
    total_count = n())
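To check whether a few Stores account for most of the blanks, the counts can be turned into rates and shares. This is only a sketch on the toy data above; the column names missing_rate and share_of_all_missing are ones I made up for illustration:

```r
library(dplyr)

# Same toy data as above (an assumption about what the real data looks like)
df <- data.frame(
  Store = c('A', 'A', 'A', 'B', 'B', 'B', 'C', 'C', 'C'),
  Var1 = c(100, NA, 200, NA, NA, NA, 100, 150, 200),
  Var2 = c(30, 20, NA, NA, NA, 40, 20, 30, 50)
)

store_summary <- df %>%
  group_by(Store) %>%
  summarise(
    missing_count = sum(is.na(Var1)),
    total_count = n(),
    .groups = "drop"
  ) %>%
  mutate(
    # Fraction of this Store's rows where Var1 is missing
    missing_rate = missing_count / total_count,
    # Fraction of all missing Var1 values that come from this Store
    share_of_all_missing = missing_count / sum(missing_count)
  ) %>%
  arrange(desc(missing_count))

store_summary
```

Stores at the top with a high share_of_all_missing are the ones to look at first. A high missing_rate with a low share just means a small Store that rarely records Var1.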
pyll
  • Thank you very much for your help (and also appreciate the advice) – Ryan S Feb 19 '21 at 03:21
  • No problem! If this answer solves your question, consider clicking the green check mark to select it as a solution so others who visit in the future can see what worked for you. – pyll Apr 23 '22 at 12:43

Assuming your data looks like this:

dat
  store var1 var2 var3
1   one   NA    2    k
2   one   NA    3    w
3   two    2    3    s
4  five    3    4    f
5 other    2    5    d
6  four    2    3    s
7 three    2    3    f
8   ten    7    5    g
9   one   NA    3    w

To count the missing values in each varX column:

colSums( is.na(dat[,2:4]) )
var1 var2 var3 
   3    0    0
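The counts above only show where values are missing, not what they are associated with. One possible way to explore that is to build a TRUE/FALSE indicator for a missing var1 and compare it against each of the other variables. This is a rough sketch on the example data; the names var1_missing and miss_by_var3 are made up for illustration:

```r
# Example data from above, rebuilt so this block runs on its own
dat <- data.frame(
  store = c("one", "one", "two", "five", "other", "four", "three", "ten", "one"),
  var1 = c(NA, NA, 2, 3, 2, 2, 2, 7, NA),
  var2 = c(2, 3, 3, 4, 5, 3, 3, 5, 3),
  var3 = c("k", "w", "s", "f", "d", "s", "f", "g", "w"),
  stringsAsFactors = FALSE
)

# TRUE where var1 is missing
dat$var1_missing <- is.na(dat$var1)

# Numeric variable: compare its mean between missing and non-missing rows
aggregate(var2 ~ var1_missing, data = dat, FUN = mean)

# Categorical variable: share of rows with missing var1 within each level
miss_by_var3 <- tapply(dat$var1_missing, dat$var3, mean)
sort(miss_by_var3, decreasing = TRUE)
```

Variables whose values differ clearly between the two groups are candidates for being associated with the blanks. On real data with enough rows, a logistic regression of the indicator on the other variables (glm with family = binomial) is a common next step, but it is not meaningful on a toy example this small.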

You can print out the affected stores with:

dat[ is.na(dat$var1), "store" ]
[1] "one" "one" "one"

A summary can be generated with table:

table( dat[ is.na(dat$var1), "store" ] )

one 
  3
Andre Wildberg