
I have a large, unbalanced panel dataset in which each observation can take multiple string values in a given year, each stored in a separate variable:

obs    year   var1    var2    var3    newval
1      1990   str1    str2    str3    3
1      1991   str1    str4    str5    2
2      1990   str3    str4            2
2      1991   str4    str5            1
2      1993   str3    str5            0
2      1994   str7                    1

For each observation and year, I need to count how many of the string value(s) are "new", meaning that they do not appear among the values taken by that observation in any previous year. For example, for obs 1 in 1991, str1 already appeared in 1990, while str4 and str5 are new, so newval = 2.

How should I approach this problem in Stata?

Thank you.

Deni M

2 Answers


There's probably a more elegant way to do this.

The main idea is that I first reshape the data to long form, which makes it much easier to number the occurrences of each string in chronological order. I then aggregate with collapse, counting only the first instance in which each string appears, and finally merge the result back onto your original data.

#delimit ;

preserve;
    tempfile newval;

    reshape long var, i(obs year) j(s); // stack all the vars on top of each other
    bys obs var (year): gen n=_n if !missing(var); // number the appearance of each string in chronological order
    replace n=0 if n>1 & !missing(n); // only count the first instance

    collapse (sum) mynewval=n, by(obs year); // add up the counts
    save `newval';
restore;

merge 1:1 obs year using `newval', nogen; // bring the counts back onto the original data

compare newval mynewval; // check against the expected column in the question
#delimit cr
dimitriy

This question was also posted on Statalist. Here's my answer. I tend not to go for merges unless the problem starts with two or more files.

clear
input obs yr str4 var1 str4 var2 str4 var3
1  90  "str1"  "str2"  "str3"
1  91  "str1"  "str4"  "str5"
2  90  "str3"  "str4"  ""
2  91  "str4"  "str5"  ""
2  93  "str3"  "str5"  ""
2  94  "str7"  ""      ""
end
reshape long var, i(obs yr) j(which)                      // stack var1-var3 into a single variable
bysort obs var (yr) : gen new = _n == 1 & !missing(var)   // tag the first year each string appears for an obs
bysort obs yr : replace new = sum(new)                    // running count of such first appearances within obs-year
by obs yr : replace new = new[_N]                         // spread the obs-year total to every row of the group
reshape wide var, i(obs yr) j(which)                      // return to the original wide layout

(MORE) Further comments have focused largely on efficiency, meaning speed here rather than space. (Storage space could be biting the poster.)

Without a restructure (done here with reshape), the problem is a triple loop: over identifiers, over observations within each identifier, and over variables. Possibly the two outer loops can be collapsed into one, but an explicit loop over observations is usually slow in Stata; a rough sketch of that approach is below.
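For concreteness, here is a minimal sketch (mine, not from the thread) of that loop-based approach, assuming the question's layout with numeric obs and year and string variables var1-var3; newval2 and the local macro seen are names I am introducing. It collapses the two outer loops into one pass over rows sorted by obs and year, keeping a running list of the strings seen so far for the current obs.

* sketch only: count strings in var1-var3 not seen in earlier rows of the same obs
sort obs year
gen newval2 = 0
forvalues i = 1/`=_N' {
    * reset the list of strings seen so far whenever a new obs starts
    if `i' == 1 | obs[`i'] != obs[`i'-1] {
        local seen
    }
    foreach v of varlist var1 var2 var3 {
        local s = `v'[`i']
        * count the string only if non-empty and not already in the seen list
        if "`s'" != "" & !`: list s in seen' {
            quietly replace newval2 = newval2 + 1 in `i'
            local seen `seen' `s'
        }
    }
}

Even on a toy example you can see why this is slow: a replace is issued once per new string per row, all in interpreted code. The macro-list trick also assumes the strings contain no embedded blanks and that the number of distinct values per obs stays well below the macro length limit.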

With the restructuring solutions proposed by Dimitriy and myself, by: operations go straight to compiled code and are relatively fast; reshape is interpreted code and entails file manipulations, so it can be slow. On the other hand, reshape is quick to write down once you have some experience, and that fluency is well worth acquiring. In addition to the help for reshape and the manual entry, see the FAQ on reshape I wrote at http://www.stata.com/support/faqs/data-management/problems-with-reshape/
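If raw speed on your actual dataset is the worry, one way to settle it (a suggestion of mine, not from the thread) is to time both strategies on a copy of your data with Stata's timer command, pasting each solution where the placeholder comments sit:

* rough timing harness; paste each solution where the placeholder comments are
timer clear
preserve
timer on 1
* ... reshape-based solution here ...
timer off 1
restore, preserve
timer on 2
* ... loop-based solution here ...
timer off 2
restore
timer list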

Another consideration is what else you want to do with this kind of dataset. If there are going to be other problems of similar character, they will usually be easier with a long structure as produced by reshape, so keeping that structure will be a good idea.
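As one illustration (again a sketch of mine, not from the thread): with the long layout still in memory, before the final reshape wide, a related question such as "how many distinct strings does each obs ever use?" takes only two by: lines. The names first and n_distinct are mine.

* sketch: number of distinct non-missing strings ever used by each obs (long layout)
bysort obs var (yr) : gen byte first = _n == 1 & !missing(var)
bysort obs : egen n_distinct = total(first)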

Nick Cox
  • (+1) Considerably more elegant! – dimitriy Mar 22 '13 at 06:25
  • Hi Nick, many thanks, this is very helpful! If I may follow up: I take it from your (and Dimitriy's) answers that reshape is a more efficient way to deal with this issue than, for example, building lists of values / looping over observations as you suggested in "Counting distinct strings across a set of variables" (Stata FAQ). I'm asking because I have a big dataset with many string variables and reshape takes a long time. Thanks – Deni M Mar 22 '13 at 08:17
  • What's "efficient" here? Are you including your programmer time in any calculation? Your case includes new values in any variable, which implies keeping track of values seen so far in all variables. You could try building on the ideas in that FAQ. I don't have a gut feeling for which way would be faster. I knew I could write the solution above more quickly, which was my calculation. – Nick Cox Mar 22 '13 at 08:37
  • Right! The time to execute reshape command pales in comparison with the opportunity cost of programming time (huge for beginners). So, I am running with your code which I understand better. Lesson learned :-) – Deni M Mar 22 '13 at 09:04
  • Opportunity cost! I think you just identified yourself as an economist. See further details under (MORE) in my answer. – Nick Cox Mar 22 '13 at 10:01
  • Thanks for your comments under (MORE): these are the kinds of clarifications I was hoping for (i.e. reasons for using reshape over other approaches). Actually, I am not an economist, but I do use the concept quite often with my students. Bad or good habit? Depends on where you stand. – Deni M Mar 22 '13 at 11:57
  • Thinking about all the costs of different programming strategies is just good sense. I have no objection at all to insights from economics here. – Nick Cox Mar 22 '13 at 12:33