How to fill NAs with LOCF by factors in data frame, split by country

Question

I have the following data frame (simplified) with the country variable as a factor and the value variable has missing values:

country value
AUT     NA
AUT     5
AUT     NA
AUT     NA
GER     NA
GER     NA
GER     7
GER     NA
GER     NA

The following generates the above data frame:

data <- data.frame(country=c("AUT", "AUT", "AUT", "AUT", "GER", "GER", "GER", "GER", "GER"), value=c(NA, 5, NA, NA, NA, NA, 7, NA, NA))

Now, I would like to replace the NA values in each country subset using the method last observation carried forward (LOCF). I know the command na.locf in the zoo package. data <- na.locf(data) would give me the following data frame:

country value
AUT     NA
AUT     5
AUT     5
AUT     5
GER     5
GER     5
GER     7
GER     7
GER     7

However, the function should only be used on the individual subsets split by the country. The following is the output I would need:

country value
AUT     NA
AUT     5
AUT     5
AUT     5
GER     NA
GER     NA
GER     7
GER     7
GER     7

I can't think of an easy way to implement it. Before starting with for-loops, I was wondering if anyone has any idea as to how to solve this.

Many thanks!!

You might get a quicker response if you edited your question to include a reasonable test data structure. — IRTFM, Nov 29 '12 at 00:53
You want [zoo::na.locf()](http://www.inside-r.org/packages/cran/zoo/docs/na.locf) ! — smci, Sep 19 '14 at 22:19

score 16 · Answer 1 · edited Aug 21 '17 at 23:07

16

A modern version of the ddply solution is to use the package dplyr:

library(dplyr)
DF %>%
  group_by(county) %>% 
  mutate(value = na.locf(value, na.rm = F))

edited Aug 21 '17 at 23:07

MBB

347
3
18

answered Sep 19 '14 at 22:51

Gregor Thomas

136,190
20
167
294

2

+1 for the dplyr version. @Gregor, note that you need to add `na.rm = F` to the `na.locf()` call. Otherwise this throws an error. – Felix Apr 20 '15 at 09:01
is there a way to do this for more than one column at once? – Rafael Jul 13 '17 at 21:40
1

@RafaelMartins See `?mutate_at` – Gregor Thomas Jul 13 '17 at 21:42

Jilber Urbina · Accepted Answer · 2012-11-29T08:45:20.150

14

Here's a ddply solution. Try this

library(plyr)
ddply(DF, .(country), na.locf)
  country value
1     AUT  <NA>
2     AUT     5
3     AUT     5
4     AUT     5
5     GER  <NA>
6     GER  <NA>
7     GER     7
8     GER     7
9     GER     7

Edit From ddply help you can find that

.variables:  variables to split data frame by, 
as quoted variables, a formula or character vector.

so another alternatives to get what you want are:

ddply(DF, "country", na.locf)
ddply(DF, ~country, na.locf)

note that replacing .variables with DF$variable is not allowed, that's why you got an error when doing this.

DF is your data.frame

edited Nov 29 '12 at 08:45

answered Nov 29 '12 at 01:07

Jilber Urbina

58,147
10
114
138

Amazing, thanks! Exactly what I needed. I tried `ddply` before, using `ddply(DF, DF$country, na.locf)` and that didn't work. What is the difference in using the `.()` notation? – rp1 Nov 29 '12 at 01:52

score 8 · Answer 3 · answered Jan 13 '17 at 02:54

The tidyverse way, albeit not using locf, is:

library(tidyverse)

data %>% 
    group_by(country) %>% 
    fill(value)

Source: local data frame [9 x 2]
Groups: country [2]

country value
(fctr) (dbl)
1     AUT    NA
2     AUT     5
3     AUT     5
4     AUT     5
5     GER    NA
6     GER    NA
7     GER     7
8     GER     7
9     GER     7

nograpes · Answer 4 · 2012-11-29T16:50:16.547

6

Split the data.frame with by and use na.locf on the subsets:

do.call(rbind,by(data,data$country,na.locf))

If you would like to remove the row names:

do.call(rbind,unname(by(data,data$country,na.locf)))

edited Nov 29 '12 at 16:50

answered Nov 29 '12 at 01:03

nograpes

18,623
1
44
67

Thanks, that works as well. However, I would have to rename the row names again to `seq_len(nrow(data))`. Therefore, I chose the above answer. However, your solution might be computationally faster, since `ddply` seems to be quite slow with large datasets. – rp1 Nov 29 '12 at 01:55

G. Grothendieck · Answer 5 · 2018-07-27T18:38:38.810

4

If speed is a consideration then this unstack/stack solution is about 4 to 6 times faster than the others on my system although it does entail a slightly longer line of code:

stack(lapply(unstack(data, value ~ country), na.locf, na.rm = FALSE))

Another approach is:

transform(data, value = ave(value, country, FUN = na.locf0))

edited Jul 27 '18 at 18:38

answered Nov 29 '12 at 10:07

G. Grothendieck

254,981
17
203
341

smci · Answer 6 · 2014-09-21T00:18:17.977

4

You simply need to split by country, then a do either a zoo::na.locf() or na.fill, filling to the right. Here is an example explicitly showing the three-component arg syntax of na.fill:

library(plyr)
library(zoo)

data <- data.frame(country=c("AUT", "AUT", "AUT", "AUT", "GER", "GER", "GER", "GER", "GER"), value=c(NA, 5, NA, NA, NA, NA, 7, NA, NA))

# The following is equivalent to na.locf
na.fill.right <- function(...) { na.fill(..., list(left=NA,interior=NA,right="extend")) }

ddply(data, .(country), na.fill.right)

  country value
1     AUT  <NA>
2     AUT     5
3     AUT     5
4     AUT     5
5     GER  <NA>
6     GER  <NA>
7     GER     7
8     GER     7
9     GER     7

edited Sep 21 '14 at 00:18

answered Sep 19 '14 at 22:19

smci

32,567
20
113
146

@Gregor, so OP also wanted to split by country, I missed that and the `na.locf` mention, they were buried in the third paragraph. Works perfectly now. Normally the title and first paragraph should specify the question, I don't see why you didn't fix those up, I just did now. Any of you could and should have corrected that in the last 1.5 years. You can remove your downvote now. – smci Sep 21 '14 at 00:21

score 3 · Answer 7 · edited Jul 27 '18 at 16:21

I'm a little late to this conversation, but here is a data.table way, which will be much faster for larger data sets:

library(zoo)
library(data.table)

# Convert to data table
setDT(data)

data[, value := na.locf(value, na.rm = FALSE), by = country]

data
   country  value
1:     AUT     NA
2:     AUT      5
3:     AUT      5
4:     AUT      5
5:     GER     NA
6:     GER     NA
7:     GER      7
8:     GER      7
9:     GER      7

# And if you want to convert "data" back to a data frame...
setDF(data)

score 1 · Answer 8 · answered Oct 08 '18 at 03:37

A combination of the packages dplyr and imputeTS can do the job.

library(dplyr)
library(imputeTS)
data %>% group_by(country) %>% 
mutate(value = na.locf(value, na.remaining="keep"))

With the na.remaining parameter of the na.locf function of imputeTS you have additionally the option to choose, what to do with the trailing NAs.

These are the options:

"keep" - return the series with NAs
"rm" - remove remaining NAs
"mean" - replace remaining NAs by overall mean
"rev" - perform nocb / locf from the reverse direction

By choosing "mean" you would for example get a result with 7 for every GER in the specific example.

How to fill NAs with LOCF by factors in data frame, split by country

8 Answers8

Linked

Related