I wish to run a Lasso model in R from Stata and then bring a resulting character list (the names of the subset coefficients) back into Stata as a macro (for example, a global).

At the moment I am aware of two options:

  1. I save a dta file and run an R script from Stata using shell:

    shell $Rloc --vanilla <"${LOC}/Lasso.R"
    

    This works from the saved dta file and allows me to run the Lasso model that I wish to run, but is not interactive, so I can't bring the relevant character list (with the names of subset variables) back into Stata.

  2. I run R interactively from Stata using rcall. However, rcall won't allow me to load a large enough matrix, even under max Stata memory. My predictive matrix Z (to be subset by Lasso) is 1,000 by 100 but when I run the command:

    rcall: X <- st.matrix(Z) 
    

    I receive an error stating:

    macro substitution results in line that is too long: The line resulting from substituting macros would be longer than allowed. The maximum allowed length is 645,216 characters, which is calculated on the basis of set maxvar.

Is there some way to interactively run R from Stata, which allows large matrices, such that I may bring a character list from R back into Stata as a macro?

Thanks in advance.

Leah Bevis
  • Have you tried this with user-written `rsource`? There are also several user-written `lasso` commands in Stata that might obviate the need for R. – dimitriy May 30 '18 at 21:25
  • What versions of Stata and `rcall` are you using? –  May 30 '18 at 21:35
  • Are (A) and (B) related? From your description they look like different problems but your last sentence is confusing. I will have a look at (B) but for (A) it looks like you may not be using `rcall` correctly. I recommend you try to run your lasso command without the `vanilla` mode and save the resulting string in R into a variable, which you can then bring back to Stata using either one rclass object or split it in 2+ if it exceeds 255 characters. –  May 30 '18 at 21:37
  • @DimitriyV.Masterov Correct me if I am wrong, but isn't `rsource` really no different from using `shell` to run R in batch mode? I think @LeahBevis wants to pass the objects directly into Stata. But you are correct that there are `lasso` commands for Stata, so they may be a better option than fiddling with R. –  May 30 '18 at 21:41
  • @PearlySpencer The last examples in the help file illustrate how to pass data back and forth. I wonder if it is possible to store the variable names in a column of a dataset that R passes back to Stata, execute `levelsof` on that column, and then use the resulting list after reopening the original data file. – dimitriy May 30 '18 at 22:48
  • @DimitriyV.Masterov You are correct but it looks more complicated? And limited both with respect to supported `dta` versions and the fact that it probably requires Stat/Transfer? I think the OP should clarify better what she wants. If it is just a string that she wants to bring back, then this could be returned using `rcall` to directly push back one or more character variables. And i say one or more because macros in Stata have limits. So if her list is long should be split in 2, 3 or more parts. `levelsof` could be a solution yes. –  May 30 '18 at 23:08
  • @PearlySpencer One can avoid ST by using `foreign` or `haven`. – dimitriy May 30 '18 at 23:10
  • I just don't understand what the matrix has to do with anything. `rcall` indeed appears to have issues with large matrices. But not sure how this is related to the macro. –  May 30 '18 at 23:11
  • @DimitriyV.Masterov yes i agree if the solution requires saving the data in a file. But the idea i think is to make it without one by directly passing the string to Stata. –  May 30 '18 at 23:12
  • @LeahBevis Is the character list the result of running your `lasso` command? And do you need to use the matrix as input to run the `lasso` command or is the command trying to create it? It would be helpful if you could provide the code. –  May 31 '18 at 00:39
  • PearlySpencer and Dimitriy - Apologies for poor clarity. (i) Stata 14.2, R 3.4.3, (ii) by contrasting (A) and (B) I mean that both methods should allow me pass data from my .dta file / from saved matrices to R, in order to run Lasso (glmnet package). The problem with A is that, while I can run an R script that will run my Lasso model, and can even end by saving a .dta file, I can't directly pass back macros/lists. This is why I was trying (B), but it seems that rcall won't allow me to pass a 1Kx100 matrix to R, which is what I need for my predictive matrix. – Leah Bevis May 31 '18 at 12:32
  • (iii) The character list I want to pass back contains the names of vectors/variables subset by Lasso. (I.e., the names of the subset coefs associated with cvfit = cv.glmnet().) Length varies, but 30-70 characters long. I'm aware that under method A, I can save the actual, subset coefficients in a .dta file, to be later used by Stata, but I'd prefer to pass the names back only. I could also save the character list as a .dta file, then extract names to a global w/in Stata, but this seems bulky. Ideally I'd like to pass the character list back as a macro. – Leah Bevis May 31 '18 at 12:48
  • @LeahBevis `rcall` does not appear to behave well with large matrices like the one you need. I think it would be best to save the string(s) as variables in a dataset and then read these into Stata. This requires a bit more work but it is certainly programmable. You could save the strings in separate variables or in one and use `levelsof` as @Dimitriy recommended. –  May 31 '18 at 12:50
  • (iv) Perhaps you are right that I should be using Stata's Lasso. I was attempting to use R because I know this algorithm well, and also because I want to improve my ability to interactively run R from Stata. (v) @DimitriyV.Masterov, saving a character vector via .dta file then using `levelsof` to obtain global makes sense, thanks. I think an interactive option might be more efficient, but that is a good work around, if I call R non-interactively (option A). (vi) I will look into `rsource`, have not used this. Previously I've used only `shell`, as in: `shell $Rloc --vanilla <"${LOC}/Script.R"` – Leah Bevis May 31 '18 at 12:56
  • (vii) So for total clarity, the problem with `rcall` is merely that I cannot seem to load the matrix of predictive variables. I had in my mind something like: `rcall: Y <- st.var(Xa1)` ; `rcall: X <- st.matrix(Z)` ; `rcall: cvfit = cv.glmnet(x=X, y=Y, alpha=1, type.measure = "mse", nfolds = 10)` ; `coef <- coef(cvfit, s = "lambda.1se")` ; `Xsubset <- as.data.frame(X[, coef@i[-1]])` ; then passing `names(Xsubset)` to Stata as global. But I'm stuck on `rcall: X <- st.matrix(Z)`, as apparently my Z is too large. – Leah Bevis May 31 '18 at 13:11
  • @PearlySpencer ok, that's useful. If `rcall` really can't work with large matrices, I'll go with the string export .dta file, import `levelsof` option. Though just to be sure, are there no other interactive ways to work in R, from Stata, that might handle larger matrices? – Leah Bevis May 31 '18 at 13:42
  • As @Dimitriy said there is `rsource` but i have no experience with it. From my experience, all these programs are useful for simple tasks but for more 'serious' work you need the real thing. –  May 31 '18 at 13:49

1 Answer

Below I will try to consolidate the comments into a hopefully useful answer.

Unfortunately, rcall does not appear to play nicely with large matrices like the one you need. I think it would be best to call R to run your script using the shell command and save the string(s) as variables in a dta file. This requires a bit more work but it is certainly programmable.

Then you could read these variables into Stata and manipulate them easily using built-in functions. For example, you could save the strings in separate variables, or in a single variable and use levelsof to collect its distinct values into a macro, as @Dimitriy recommended.

Consider the following toy example:

clear

input str50 string
"this is a string"
"A longer string is this"
"A string that is even longer is this one"
"How many strings do you have?"
end

levelsof string, local(newstr) 
`"A longer string is this"' `"A string that is even longer is this one"' `"How many strings do you have?"' `"this is a string"'

tokenize `"`newstr'"'

forvalues i = 1 / `: word count `newstr'' {
    display "``i''"
}

A longer string is this
A string that is even longer is this one
How many strings do you have?
this is a string
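
Applied to the Lasso workflow, the same idea might look like the sketch below. It is only a sketch under stated assumptions: the file names lasso_input.dta and selected.dta, the variable name varname, and the global SELECTED are hypothetical, and Lasso.R is assumed to end by writing the selected variable names into a one-column .dta file (for example with the `foreign` or `haven` packages mentioned in the comments). Only $Rloc, ${LOC}, and the shell call come from the question.

```stata
* Save the data that the R script will read (hypothetical file name)
save "${LOC}/lasso_input.dta", replace

* Run R in batch mode, exactly as in the question
shell $Rloc --vanilla <"${LOC}/Lasso.R"

* Assumption: Lasso.R wrote selected.dta containing a single string
* variable, varname, with one selected predictor name per row
preserve
use "${LOC}/selected.dta", clear
levelsof varname, local(sel) clean
restore

* Keep the names in a global for later use (hypothetical global name)
global SELECTED `sel'
display "$SELECTED"
```

The clean option makes levelsof return the names without compound double quotes, which is convenient here because variable names contain no spaces. Using preserve and restore keeps the original data in memory, and the local macro survives the restore.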

In my experience, programs like rcall and rsource are useful for simple tasks. However, they can become a real hassle for more complicated work, in which case I personally just resort to using the other software directly.

As @Dimitriy also indicated, there are now some community-contributed commands available for lasso, which may cover your needs so that you do not have to fiddle with R:

search lasso

5 packages found (Stata Journal and STB listed first)
-----------------------------------------------------

elasticregress from http://fmwww.bc.edu/RePEc/bocode/e
    'ELASTICREGRESS': module to perform elastic net regression, lasso
    regression, ridge regression / elasticregress calculates an elastic
    net-regularized / regression: an estimator of a linear model in which
    larger / parameters are discouraged.  This estimator nests the LASSO / and

lars from http://fmwww.bc.edu/RePEc/bocode/l
    'LARS': module to perform least angle regression / Least Angle Regression
    is a model-building algorithm that / considers parsimony as well as
    prediction accuracy.  This / method is covered in detail by the paper
    Efron, Hastie, Johnstone / and Tibshirani (2004), published in The Annals

lassopack from http://fmwww.bc.edu/RePEc/bocode/l
    'LASSOPACK': module for lasso, square-root lasso, elastic net, ridge,
    adaptive lasso estimation and cross-validation / lassopack is a suite of
    programs for penalized regression / methods suitable for the
    high-dimensional setting where the / number of predictors p may be large

pdslasso from http://fmwww.bc.edu/RePEc/bocode/p
    'PDSLASSO': module for post-selection and post-regularization OLS or IV
    estimation and inference / pdslasso and ivlasso are routines for
    estimating structural / parameters in linear models with many controls
    and/or / instruments. The routines use methods for estimating sparse /

sivreg from http://fmwww.bc.edu/RePEc/bocode/s
    'SIVREG': module to perform adaptive Lasso with some invalid instruments /
    sivreg estimates a linear instrumental variables regression / where some
    of the instruments fail the exclusion restriction / and are thus invalid.
    The LARS algorithm (Efron et al., 2004) is / applied as long as the Hansen