
In the past few days I have been trying to figure out how to run Fama-MacBeth regressions in R. The usual advice is to use pmg from the plm package, but every attempt I make returns the error that I have an insufficient number of time periods.

My dataset consists of 2,828,419 observations with 13 columns of variables on which I want to run multiple cross-sectional regressions. My firms are identified by seriesid, I have a date variable, and I want to run the following Fama-MacBeth regressions:

totret ~ size
totret ~ momentum
totret ~ reversal
totret ~ volatility
totret ~ value + size
totret ~ value + size + momentum
totret ~ value + size + momentum + reversal + volatility

I have been using this command:

fpmg <- pmg(totret ~ momentum, Data, index = c("date", "seriesid"))

Which returns: Error in pmg(totret ~ mom, Dataset, index = c("seriesid", "datem")) : Insufficient number of time periods

I have tried it with my dataset as a data.table, a data.frame, and a pdata.frame. Switching the order of the index does not help either.

My data contains NAs as well.

Can anyone fix this, or suggest a different way for me to do Fama-MacBeth?

asked by Bart; edited by Helix123
  • Looks like an issue with your data. There's a good guide for troubleshooting your data here (see Millo Giovanni comment): https://r.789695.n4.nabble.com/error-using-pvcm-on-unbalanced-panel-data-td1569157.html – cgrafe Aug 08 '19 at 19:33
  • Can you impute the missing data? Using `impute.knn` or another method? – josephjscheidt Aug 08 '19 at 20:01
  • I cannot impute the missing data. I have firms that are available at certain points in time, then missing for, say, the next 120 observations, and then available again for 100 observations. This needs to stay that way for the analysis. I have also tried deleting the missing values, but I still do not get any output – Bart Aug 11 '19 at 11:12

3 Answers


This is almost certainly due to having NAs in the variables in your formula. The error message is not very helpful - it is probably not a case of "too few time periods to estimate" and very likely a case of "there are firm/unit IDs that are not represented across all time periods" due to missing data being dropped.

You have two options: impute the missing data, or drop the observations with missing values (the latter is a quick way to check that the model runs at all before deciding on an approach that is valid for estimation).

If the missingness in your data is truly random, you might be okay just dropping the incomplete observations. Otherwise you should probably impute. A common strategy is to impute multiple times - at least 5 - estimate the model on each of the resulting data sets, and average the estimates. Amelia and mice are very strong imputation packages. I like Amelia because a single call imputes n times to produce that many data sets, and it is easy to exclude a set of variables from imputation (e.g., the id variable or time period) with the idvars parameter.
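
For a quick check that the model runs once the NAs are out of the way, something along these lines should work (a sketch only; it assumes the data frame Data, the column names from your question, and the same index order as your original call):

library(plm)

# keep only the columns used in this model and drop incomplete rows
vars <- c("totret", "momentum", "date", "seriesid")
Data_cc <- na.omit(Data[, vars])

fpmg <- pmg(totret ~ momentum, data = Data_cc, index = c("date", "seriesid"))
summary(fpmg)

If dropping rows is not acceptable, a multiple-imputation version with Amelia might look like this (again a sketch under the same naming assumptions):

library(Amelia)

# impute 5 times, leaving the id and time variables untouched
a.out <- amelia(Data[, vars], m = 5, idvars = c("date", "seriesid"))

# estimate on each completed data set and average the coefficients
fits <- lapply(a.out$imputations, function(d)
  pmg(totret ~ momentum, data = d, index = c("date", "seriesid")))
rowMeans(sapply(fits, coef))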

EDIT: I dug into the source code to see where the error is triggered, and here is the issue - again it is likely caused by missing data, but it interacts with your degrees of freedom:

...
# part of the code where error is triggered below, here is context:
# X = matrix of the RHS of your model including intercept, so X[,1] is all 1s
# k = number of coefficients, determined by length(coef(plm.model))
# ind = vector of ID values

# so t here is the minimum value from a count of occurrences for each unique ID
t <- min(tapply(X[,1], ind, length))

# then if the minimum number of times a single ID appears across time is
# less than the number of coefficients + 1, you do not have enough time
# points (for that ID/those IDs) to estimate.
if (t < (k + 1))
    stop("Insufficient number of time periods")

That is what triggers your error. So imputation is definitely a solution, but there may be just a single offending ID in your data, and importantly, once this condition is satisfied the model will run just fine even with missing data.
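
A quick way to see whether this check bites in your data is to count observations per index value before calling pmg - a minimal sketch, assuming the data frame Data and the column names from your question:

# complete cases for the variables in this particular model
cc <- na.omit(Data[, c("totret", "momentum", "date", "seriesid")])

# count rows per value of whichever variable you list first in `index`
obs_per_group <- table(cc$date)

min(obs_per_group)          # must be at least the number of coefficients + 1
head(sort(obs_per_group))   # the smallest groups are the likely offenders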

answered by Geoffrey Grimm
  • Will try this out. – Bart Aug 11 '19 at 11:13
  • try listing the ID first, then the time variable, in the `index` parameter - that is what `pdata.frame()` expects. As long as the ID with the fewest records across time has at least as many records as the number of coefficients being estimated plus 1, you should not see that error. – Geoffrey Grimm Aug 12 '19 at 15:06

In the meantime I have worked out the Fama-MacBeth regression in R. Starting from a data table with all of the characteristics in the rows, the following works and lets you either weight the cross-sectional regressions (here by market capitalization) or run them equally weighted (remove ", weights = marketcap" for equal weighting). totret is the total return variable and logmarket is the logarithm of market capitalization.

library(dplyr)

# one cross-sectional regression per date
logmarket <- df %>%
  group_by(date) %>%
  summarise(
    constant = summary(lm(totret ~ logmarket, weights = marketcap))$coefficients[1],
    rsquared = summary(lm(totret ~ logmarket, weights = marketcap))$r.squared,
    beta     = summary(lm(totret ~ logmarket, weights = marketcap))$coefficients[2]
  )

You obtain a data frame with the monthly alphas (constant), betas (beta), and R-squared values (rsquared).

To retrieve the coefficients with t-statistics in a data frame:

library(lmtest)

Summarystatistics <- as.data.frame(matrix(data = NA, nrow = 6, ncol = 1))
names(Summarystatistics) <- "logmarket"
row.names(Summarystatistics) <- c("constant", "t-stat constant", "beta", "t-stat beta", "R^2", "observations")
Summarystatistics[1, 1] <- mean(logmarket$constant)
Summarystatistics[2, 1] <- coeftest(lm(logmarket$constant ~ 1))[1, 3]
Summarystatistics[3, 1] <- mean(logmarket$beta)
Summarystatistics[4, 1] <- coeftest(lm(logmarket$beta ~ 1))[1, 3]
Summarystatistics[5, 1] <- mean(logmarket$rsquared)
Summarystatistics[6, 1] <- nrow(subset(df, !is.na(logmarket)))
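
coeftest(lm(x ~ 1)) above gives the t-statistic for the mean of the period-by-period estimates, which is the usual Fama-MacBeth second stage. If serial correlation in the monthly coefficient series is a concern, the same statistic can be computed with a Newey-West adjustment - a minimal sketch, assuming the sandwich package is available (the lag choice here is an assumption):

library(lmtest)
library(sandwich)

fit_beta <- lm(logmarket$beta ~ 1)
# Newey-West adjusted t-statistic of the mean beta
coeftest(fit_beta, vcov. = NeweyWest(fit_beta, lag = 12, prewhite = FALSE))[1, 3]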

answered by Bart

Some values of "seriesid" have only one observation, which is why pmg gives the error. If you do something like the following (with the variable names you use), the error goes away:

library(dplyr)

# flag and drop IDs that appear only once
try2 <- try2 %>%
  group_by(cusip) %>%
  mutate(flag = (if (length(cusip) == 1) {1} else {0})) %>%
  ungroup() %>%
  filter(flag == 0)
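
With those single-observation IDs removed, the pmg call should no longer trip the check. A minimal sketch, assuming try2 also contains the totret, momentum, and date columns from the question and that cusip plays the role of seriesid here:

library(plm)

# id first, then time, matching the index order in the call that produced the error
fpmg <- pmg(totret ~ momentum, data = try2, index = c("cusip", "date"))
summary(fpmg)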