3

I have a panel data, i.e. t rows for each of n observations (nxt), such as

data("Grunfeld", package="plm")
head(Grunfeld)
firm year   inv  value capital
   1 1935 317.6 3078.5     2.8
   1 1936 391.8 4661.7    52.6
   1 1937 410.6 5387.1   156.9
   2 1935 257.7 2792.2   209.2
   2 1936 330.8 4313.2   203.4
   2 1937 461.2 4643.9   207.2

I want to make block bootstrapping, i.e. I want resample with replacement, taking a firm [i] with all the years in which it is observed. For instance, if year=1935:1937 and firm 1 is randomly drawn, I want that firm [1] will be in the new sample 3 times, corresponding to year=1935:1937. If it is re-drawn, then it must be again 3 for times. Furthermore, I need to apply my own function to the new bootstrapped sample and I need to do this 500 times. My current code is something like this:

library(boot)
boot.fun <- function(data) {
   est.boot = myfunction(y=Grunfeld$v1, x=Grunfeld$v2, other parameters)
   return(est.boot)
}
boot.sim <- function(data, mle) {
data =  sample(data, ?? ) #
return(data)
}

start.time = Sys.time()
result.boot <- boot(Grunfeld, myfunction( ... ), R=500, sim = "parametric",  
               ran.gen = boot.sim)
Sys.time() - start.time

I was thinking to resample by specifying in a correct way data = sample(data, ?? ) as it works smooth and clean, using as index the column firm. How could I do that? Is there any other more efficient alternative?

EDIT. I do not necessarily need a new boot.function. I just need a (possibly fast) code which allows to resample with replacement, then I ll put it inside the boot argument as ran.gen=code.which.works. The output should be a sample of the same dimension of the original, even though firms can be randomly picked twice or more (or not be picked). For instance the result could be

head(GrunfeldResampled)
firm year   inv  value capital
   2 1935 257.7 2792.2   209.2
   2 1936 330.8 4313.2   203.4
   2 1937 461.2 4643.9   207.2
   1 1935 317.6 3078.5    2.8
   1 1936 391.8 4661.7    52.6
   1 1937 410.6 5387.1   156.9
   2 1935 257.7 2792.2   209.2
   2 1936 330.8 4313.2   203.4
   2 1937 461.2 4643.9   207.2
   9 1935 317.6 3078.5   122.8
   9 1936 391.8 4661.7   342.6
   9 1937 410.6 5387.1   156.9

Basically I need each firm treated as a block, and therefore the resampling should apply to the whole block. Hope this clarifies

StupidWolf
  • 45,075
  • 17
  • 40
  • 72
Bob
  • 452
  • 6
  • 18
  • But in this case there is no element of randomness. Firm 1 appears once in each of the 3 years and you want all those 3 years returned. What's the bootstrapping needed for? Or is it that you want random firms to be picked (by some number provided as input for example like 5 firms?) and whichever gets picked, all of the years to be displayed? – LyzandeR Jan 26 '15 at 16:44
  • Also, if this is done by replacement and assuming that you want all years for a specific firm, what happens if the same company gets picked twice? Do you need all years twice then? And how is the number of firms chosen? Do you want to make a function in order to be able to specify it yourself? – LyzandeR Jan 26 '15 at 16:47
  • Hello. What I need is indeed firms to be picked randomly. In case a firm is picked twice, then it appears twice with all of the orresponding years. The number of firms in the original sample is `N`, each observed for `T` years: a balanced panel with `NxT` observations. Therefore I need a resampling with replacement which gives a sample of dimension `NxT` , – Bob Jan 26 '15 at 17:02
  • Would removing years from the last firm picked in order to have the same dimension `NxT` be acceptable? Otherwise the last firm should be picked according to an exact number of years so that the dimension `NxT` remains stable... – LyzandeR Jan 26 '15 at 17:28
  • In my sample every firm is observed for T=8 years, no more, no less. – Bob Jan 26 '15 at 17:33
  • Great. Thanks for clarifying. Apparently the same thing is true for the data set you provided. Have a look at my answer. – LyzandeR Jan 26 '15 at 17:57

2 Answers2

1

Apparently in this answer every firm is viewed for exactly 20 years, so I won't have a problem demonstrating:

data("Grunfeld", package="plm") #load data

Solution

#n is the the firms column, df is the dataframe
myfunc <- function(n,df) {      #define function
 unique_firms <- unique(n)      #unique firms
 sample_firms <- sample(unique_firms, size=length(unique_firms), replace=T ) #choose from unique firms randomly with replacement
 new_df <- do.call(rbind, lapply(sample_firms, function(x)  df[df$firm==x,] ))  #fetch all years for each randomly picked firm and rbind
}

a <- myfunc(Grunfeld$firm, Grunfeld) #run function 

Output

> str(a)
'data.frame':   200 obs. of  5 variables:
 $ firm   : int  4 4 4 4 4 4 4 4 4 4 ...
 $ year   : int  1935 1936 1937 1938 1939 1940 1941 1942 1943 1944 ...
 $ inv    : num  40.3 72.8 66.3 51.6 52.4 ...
 $ value  : num  418 838 884 438 680 ...
 $ capital: num  10.5 10.2 34.7 51.8 64.3 67.1 75.2 71.4 67.1 60.5 ...

As you can see dim is exactly the same as the input data.frame

For your data the solution will be:

myfunc <- function(n,df) {      #define function
  unique_firms <- unique(n)      #unique firms
  print(unique_firms)
  sample_firms <- sample(unique_firms, size=length(unique_firms), replace=T ) #choose from unique firms randomly with replacement
  new_df <- do.call(rbind, lapply(sample_firms, function(x)  df[df$country==x,] ))  #fetch all years for each randomly picked firm and rbind
}

and Output:

> str(a)
'data.frame':   848 obs. of  18 variables:
 $ isocode  : Factor w/ 106 levels "AGO","ALB","ARG",..: 82 82 82 82 82 82 82 82 61 61 ...
 $ time     : int  2 3 4 5 6 7 8 9 2 3 ...
 $ country  : num  80 80 80 80 80 80 80 80 59 59 ...
 $ year     : int  1975 1980 1985 1990 1995 2000 2005 2010 1975 1980 ...
 $ gdp      : num  184619 210169 199343 268870 305255 ...
 $ pop      : num  33.4 34.9 36.6 37.8 38.3 ...
 $ gdp_k    : num  5526 6022 5443 7117 7969 ...
 $ co2      : num  340353 431436 426881 431052 350874 ...
 $ co2_k    : num  10191 12333 11674 11407 9128 ...
 $ oecd     : int  1 1 1 1 1 1 1 1 1 1 ...
 $ LI       : int  0 0 0 0 0 0 0 0 0 0 ...
 $ LMI      : int  0 0 0 0 0 0 0 0 0 0 ...
 $ UMI      : int  0 0 0 0 0 0 0 0 0 0 ...
 $ HI       : int  1 1 1 1 1 1 1 1 1 1 ...
 $ gdpk     : num  5531 6018 5449 7118 7971 ...
 $ co2k     : num  10196 12355 11668 11412 9162 ...
 $ co2_k.lag: num  8595 10191 12333 11674 11407 ...
 $ gdp_k.lag: num  4730 5526 6022 5443 7117 ...
LyzandeR
  • 37,047
  • 12
  • 77
  • 87
  • Thanks for your answer. Your solution works with the Grunfeld sample. However, when applied to mine, `a` is an object with `0` observations. My id (`firms`) column is a factor columnt, with values being names of countries (`"Angola", "Brasil", ...`). I don't know if this can help. I get also this warning 50 messages `50: In is.na(e1) : is.na() applied to non-(list or vector) of type 'NULL'`. My data is a data.frame as Grunfeld is. – Bob Jan 26 '15 at 22:30
  • I just tried the above with a factor type `firm` and it works fine to be honest. – LyzandeR Jan 26 '15 at 22:39
  • Are you sure you re not missing something? I think it should work because factors are actually integers. You can give it a go, converting the factor column into character and retrying but it should work with factors... I can't really say anything else without the data... – LyzandeR Jan 26 '15 at 22:40
  • Or you could try converting this column to integer with `as.integer` as this would change it to exactly the above example just for a sense check. – LyzandeR Jan 26 '15 at 22:43
  • I am trying to understand what's wrong. Tried all the solutions you proposed, as.character, as.integer... nothing. – Bob Jan 26 '15 at 22:55
  • Hard to say without a reproducible example. Maybe you can `dput` a few records from your data set to check. I cannot tell without seeing your data. Also, make sure you re using my code as it is and that you re using the correct arguments. – LyzandeR Jan 26 '15 at 23:00
  • I am trying to understand what's wrong. Tried all the solutions you proposed, as.character, as.integer... nothing. Here the data http://www.filedropper.com/data_32. I used the function `plm.data` in the package `plm`on it. That's all – Bob Jan 26 '15 at 23:04
  • So, in my function you need to change `df[df$firm==x,]` to `df[df$country==x,]`. Sorry, I should have been clear about that. In this case your column is called country and not firm. Change that and it will run normally. I edited the answer as well to show you the results. – LyzandeR Jan 26 '15 at 23:34
  • Truth to be told, it was my fault since I completely overlooked that point. Works perfectly. As you may see, this turns out to be useful in case of panel data in which you want to estimate `y=a*y.lag + b*z` . The `boot` package does not offer such a straightforward solution, nor the function `sample` does. Thank you very much. – Bob Jan 26 '15 at 23:56
  • 1
    Really glad I could be of help :) – LyzandeR Jan 27 '15 at 00:04
0

You can do this with the "strata" parameter of the boot function. This is called stratified bootstrap. Changing the last line of your code:

result.boot <- boot(Grunfeld, boot.fun, R=500, sim = "ordinary",  
                strata = Grunfeld$firm)

i suppressed the parameter ran.gen & sim

I suggest theses changes to the boot function so it works properly:

boot.fun <- function(d, i) { # d being your data, i the set of indices)
   est.boot = myfunction(y=d[i ,]$v1, x=d[i, ]$v2, other parameters)
   return(est.boot)
}
agenis
  • 8,069
  • 5
  • 53
  • 102
  • Thank you. Is there any way to check if the bootstrapped samples are done properly? – Bob Jan 26 '15 at 13:10
  • not sure what you mean by "properly", but you can do table(result.boot$strata) to visualize the repartition of your stratae. however i'm not sure that you can retrieve the indices of the sampled lines for each single bootstrap. – agenis Jan 26 '15 at 13:18
  • Error in statistic(data, original, ...) : unused argument (original) – Bob Jan 26 '15 at 13:26
  • your boot function is probably badly structured. Please paste the entire code so we can reproduce the error, or see the ?boot for more details on how the statistic function should be passed. – agenis Jan 26 '15 at 13:29
  • I am sorry, but it looks like I am really slow. In my data the index column is labeled "isocode", or "firm" in the Grunfeld sample. Should I write `function(d, i)` or `function(d, isocode)`. In both case it does not work. Is it not simpler to resample data using `boot.sim` and `ran.gen `? – Bob Jan 26 '15 at 14:06