This question was also posted on Statalist. Here's my answer. I tend not to go for merges unless the problem starts with two or more files.
clear
input obs yr str4 var1 str4 var2 str4 var3
1 90 str1 str2 str3
1 91 str1 str4 str5
2 90 str3 str4
2 91 str4 str5
2 93 str3 str5
2 94 str7
end
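* Restructure to long: one row per identifier, year and variable slot
reshape long var , i(obs yr) j(which)
* Flag each nonmissing value the first year it appears for an identifier
bysort obs var (yr) : gen new = _n == 1 & !missing(var)
* Total the flags within each identifier-year and spread the total to every row
bysort obs yr : replace new = sum(new)
by obs yr : replace new = new[_N]
* Return to the original wide structure
reshape wide var, i(obs yr) j(which)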
(MORE) Further comments focused largely on efficiency, meaning here speed rather than space. (Storage space could be biting the poster.)
Without a restructure, here using reshape, the problem is a triple loop: over identifiers, over observations for each identifier and over variables. Possibly the two outer loops can be collapsed to one. But an explicit loop over observations is usually slow in Stata.
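For comparison, here is a rough sketch of what the loop approach might look like with the two outer loops collapsed into a single loop over observations. It is only an illustration, not a recommendation: it assumes the example data above (obs, yr and string variables var1-var3) are in memory and that no value repeats within a single row. The local macro names are arbitrary.

```stata
* Sketch only: assumes obs, yr and string variables var1-var3 are in memory
* and that no value repeats within a single row.
sort obs yr
gen new = 0
forvalues i = 1/`=_N' {
    foreach v of varlist var1-var3 {
        local value = `v'[`i']
        if `"`value'"' != "" {
            * count occurrences of this value in earlier years, same identifier
            local seen = 0
            foreach w of varlist var1-var3 {
                quietly count if obs == obs[`i'] & yr < yr[`i'] & `w' == `"`value'"'
                local seen = `seen' + r(N)
            }
            if `seen' == 0 {
                quietly replace new = new + 1 in `i'
            }
        }
    }
}
```

Every pass through the inner loop scans the whole dataset with count, which is why this kind of code slows down quickly as the number of observations grows.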
With the restructuring solutions proposed by Dimitriy and myself, by: operations go straight to compiled code and are relatively fast: reshape is interpreted code and entails file manipulations, so can be slow. On the other hand reshape can be fast to write down with some experience, and it really is worth acquiring the fluency with reshape which comes with experience. In addition to the help for reshape and the manual entry, see the FAQ on reshape I wrote at http://www.stata.com/support/faqs/data-management/problems-with-reshape/
Another consideration is what else you want to do with this kind of dataset. If there are going to be other problems of similar character, they will usually be easier with a long structure as produced by reshape, so keeping that structure will be a good idea.
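As a hypothetical example of such a follow-up problem, suppose you also wanted the number of distinct values ever recorded for each identifier. With the data still in long form (that is, before the final reshape wide), a sketch might be as follows. The variable names tagged and ndistinct are made up for illustration, and I am relying on the documented default that egen's tag() ignores missing values unless the missing option is specified.

```stata
* Sketch only: assumes the data are in long form, as left by
*     reshape long var , i(obs yr) j(which)
* tagged and ndistinct are hypothetical variable names.
egen byte tagged = tag(obs var)
egen ndistinct = total(tagged), by(obs)
```

In the wide structure the same count would need another loop over var1-var3, which is the kind of extra work the long structure avoids.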