Is there a more efficient way to run three statsby regressions with different reference groups?

Question

I want to run these three regressions. Note that each has a different reference group, which is why I run them separately.

statsby _b, by(grp_iden) saving(reg_aaa.dta, replace): reg prezzo ib43.city_str i.marca_str, baselevels
statsby _b, by(grp_iden) saving(reg_bbb.dta, replace): reg prezzo ib6.city_str i.marca_str, baselevels
statsby _b, by(grp_iden) saving(reg_ccc.dta, replace): reg prezzo ib11.city_str i.marca_str, baselevels

However, before running each, I resort to the following:

Before running regression (1), I use: keep if rcode=="aaa"

Before running regression (2), I use: keep if rcode=="bbb"

Before running regression (3), I use: keep if rcode=="ccc"

Is there a way to run the three statsby regressions more efficiently, and perhaps without the need to drop observations from the sample before each respective regression?

Something like the following could work, but I would need to find a way to select different reference groups (i.e., different XX in ibXX.city_str) in each rcode set:

statsby _b, by(rcode grp_iden) saving(reg_ccc.dta, replace): reg prezzo ib11.city_str i.marca_str, baselevels

Efficient in terms of length of code, machine time, programmer time? One thing you can compare and we cannot is whether using an `if` qualifier in `statsby` is faster or slower than your solution. — Nick Cox, May 29 '19 at 06:21
A loop over `43 6 11` would make the code shorter. Is that what you most want? — Nick Cox, May 29 '19 at 06:23
In terms of length of code. Unfortunately, I can't use 'if' since doing so would require specifying the same 'base group' in each regression. — StatsScared, May 29 '19 at 17:46
My answer gives code for different `if` conditions. That's perfectly legal. If it's not what you want, then your question makes no sense to me. A `keep` command before calling up `statsby` should have the same effect as specifying observations to use on the command prefixed by `statsby`. — Nick Cox, May 29 '19 at 17:50
I addressed your comment as if it were independent from your provided answer and the code it contains. — StatsScared, May 29 '19 at 17:52

score 1 · Answer 1 · answered May 29 '19 at 07:33

You might use a loop over 43 6 11 and also over aaa bbb ccc:

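* Store aaa, bbb, ccc in the positional macros 1, 2, 3
tokenize "aaa bbb ccc"
local x = 1
foreach g in 43 6 11 {
    * The nested macro reference resolves x to 1, 2, 3 and then to aaa, bbb, ccc
    statsby _b, by(grp_iden) saving(reg_``x''.dta, replace): reg prezzo ib`g'.city_str i.marca_str if rcode == "``x''", baselevels
    local ++x
}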

I have very mixed feelings about such coding. Sure, you exploit common structure to make the code shorter. If the real problem included, say, 10 cases, a loop would clean up the code a lot. If the real problem were very similar to the one shown, you might lose much clarity, for yourself later, for people in your team, and for other people trying to understand your code. A sharp test is that if you didn't see how to do this yourself, it may be trickier than you should want to use. But it's also true that we only grow by seeing how to use language features, which then become part of our basic toolkit.

Efficiency always sounds better than its lack, but making code more clever but less clear is often not a good idea. The time gain from a loop is dubious: Stata in fact has to interpret the looping machinery, although the cost of that should be trivial. Always include time spent reading the code in your consideration.
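As a variation on the loop above, here is a minimal sketch that keeps the two lists side by side in named locals and picks matching elements by position, which avoids the nested macro reference. It assumes, as in the question, that `rcode` is a string variable and that the question's dataset is in memory. The local names `bases`, `codes`, `g`, and `c` are illustrative only, and the `///` line continuation assumes the code is run from a do-file.

```stata
* Parallel lists: the j-th base level goes with the j-th rcode value
local bases 43 6 11
local codes aaa bbb ccc

forvalues j = 1/3 {
    local g : word `j' of `bases'
    local c : word `j' of `codes'
    * Restrict each run with an if qualifier instead of keep
    statsby _b, by(grp_iden) saving(reg_`c'.dta, replace): ///
        regress prezzo ib`g'.city_str i.marca_str if rcode == "`c'", baselevels
}
```

Because `saving()` writes the results to disk and leaves the data in memory alone, no `keep` or reload should be needed between iterations.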


Nick Cox


This is very useful. I also agree with all your points. One question: what does the `local ++x` line do? – StatsScared May 29 '19 at 17:49
It increments a local by 1. See `help macro` and follow links. Also try `local i = 1` followed by `local ++i` and then look at the macro's value. – Nick Cox May 29 '19 at 17:59
An error appears when I run the code. It seems to be related to the `if rcode == "``x''"`, `tokenize "aaa bbb ccc"`, or `local x = 1` lines. The error in question appears after the first set of results is calculated: `regress prezzo ib43.city_str i.marca_str if rcode == aaa"" invalid name` – StatsScared May 29 '19 at 21:48
Just use quotes `"` in the `if` statement and the `saving()` option. By the way, you forgot to also un-upvote the answer which you found very useful. – May 29 '19 at 22:25
The code works fine on my computer with toy data. It is obvious the OP's data are different. – May 29 '19 at 23:26
@StatsScared Please provide example data using the `dataex` command. – May 29 '19 at 23:29
OK, I finally understood what was happening. The `tokenize` command always makes the regression loop through the values 1, 2, and 3. My `rcode` variable is stored as a long, and the values of interest ('aaa', 'bbb', and 'ccc') are only value labels. The underlying values are aaa = 3, bbb = 5, ccc = 7. Is there a way to constrain the loop to go over only these three numbers (3, 5, 7)? – StatsScared Jun 03 '19 at 23:35
There is but personally I will not respond since you clearly ignore our advice and thus do not value our efforts to help you. – Jun 03 '19 at 23:50
I understand providing example data is good practice, but is it a requirement when one can figure things out through code syntax? – StatsScared Jun 04 '19 at 00:00
Your question cannot be answered accurately without providing example data. If you had provided example data @NickCox's answer would be much more spot on and thus helpful not only for you but also for future readers. Please read the Stata [tag wiki](https://stackoverflow.com/tags/stata/info) for advice on how to ask Stata-related questions on here. Questions on Stack Overflow must be reproducible. – Jun 04 '19 at 00:03
I understand. But it's also true you just figured it out through code syntax yourself. You just don't want to share your answer because I didn't provide an example, which is odd. Sometimes certain datasets require a lot of work to anonymize, like in my case. I figured someone could help me based on syntax alone. – StatsScared Jun 04 '19 at 00:12
I figured it out because I have years of experience. Someone who reads your posts and is a beginner will not. The idea of posting on here is to create a repository that will be useful to future readers as well not just you. And you can either simulate the data easily or use one of Stata's toy datasets to provide a reproducible example. What is odd is that you expect us to provide accurate answers to inaccurate questions. What is even more odd is that you chose to penalize the answerer (in this case @NickCox) because he spent time to provide an answer based on data which you did not provide. – Jun 04 '19 at 00:42
Sorry, but I am in general agreement with @Pearly Spencer. I can only understand your comments as indicating that the code you gave in the question was not only not real, but also not even realistic in the sense that you are doing something quite different. I appreciate that confidential data cannot be posted here and that it's a burden to provide suitable examples, but it's not a burden we can take from you. It's elementary, but fundamental: the only successful basis for questions and answers is that you ask questions we can answer that also address your real misunderstanding. – Nick Cox Jun 04 '19 at 04:56
Pearly and Nick, I see your point. I wish I could always provide excerpts from my data using `dataex`. I would be the first and main one to benefit from this, plus it would save me time. A follow-up question, and I hope I'm not screwing myself over by asking here: is there a way in Stata to easily modify the names and underlying values of all variables in my dataset, while still keeping my dataset's structure (for instance, if a row is empty it should remain empty, and so on)? – StatsScared Jun 04 '19 at 12:39
I can see how modifying the underlying values of my variables may be troubling, in that the answers such a dataset elicits may apply only to the simulated dataset and not to my original one. Still, any pointers in the right direction on this would be helpful. Thank you. – StatsScared Jun 04 '19 at 12:40
There are many general recipes for renaming and modifying datasets to make them anonymous. But you can easily mess up something important for your purpose. I don't think any beats (a) finding a dataset bundled with Stata that suits or (b) inventing your own (where laziness puts a price on complexity). – Nick Cox Jun 04 '19 at 12:54
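
For the situation described in the comments above, where `rcode` turns out to be a numeric variable with value labels (3, 5, and 7 labelled aaa, bbb, and ccc), one possible sketch loops over the numeric codes directly. The pairing of base levels with codes is an assumption taken from the order in the question (43 with aaa, 6 with bbb, 11 with ccc), and the local names `bases`, `codes`, `names`, `g`, `c`, and `n` are illustrative only. It has not been run against the original data, which were never posted.

```stata
* Parallel lists: base level, numeric rcode value, and a name for the output file
local bases 43 6 11
local codes 3 5 7
local names aaa bbb ccc

forvalues j = 1/3 {
    local g : word `j' of `bases'
    local c : word `j' of `codes'
    local n : word `j' of `names'
    * rcode is numeric here, so the comparison uses no quotes
    statsby _b, by(grp_iden) saving(reg_`n'.dta, replace): ///
        regress prezzo ib`g'.city_str i.marca_str if rcode == `c', baselevels
}
```

The `///` continuation assumes a do-file; typed interactively, the `statsby` call would need to go on one line.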
