Stata: Generate sum / total by specific date ranges and save them as a new variable

Question

I work with panel data which contain several companies (id) and cover the period from 1.1.2008 to 1.1.2013 (date, year). I want to generate new variable (sum1) which contains a sum of daily observation for var1 for each company and specific time interval. If the interval was equal to each year I would use the function total():

 bysort id year: egen sum1=total(var1)

In my case however, the time interval is determined as an interval between two events. I have a special variable called event, which takes value of 1 if the event has occurred on special date and missing otherwise. There are 5 to 10 events for each company. The intervals between events are not equal; hence the first interval can contain 60 observations, the next interval 360 observations. The intervals are also not equal for different companies. The starting date for the first interval for each company is 1.1.2008. The starting date for the second interval is the date of the first event + 1 day. Besides I would like to account for missing values, so if all values of var1 for the company x are missing variable, sum1 for company x and specific interval must contain missing values and not 0.

My panel looks like this:

   id   date        year    var1    event  sum1(to gen)  event_id(to gen)
   1    1.1.2008    2008    25        .     95 (25+30+40)     1
   1    2.1.2008    2008    30        .     95 (25+30+40)     1
   ...........................................................1      
   1    31.4.2008   2008    40        1     95 (25+30+40)     1
   1    1.5.2008    2008    50        .     160 (50+50+60)    2
   1    2.5.2008    2008    50        .     160 (50+50+60)    2
   .........................................  ................2
   1   31.4.2009    2009    60        1     160 (50+50+60)    2 
   2    1.1.2008    2008    26        .     96 (26+30+40)     1 
   2    2.1.2008    2008    30        .     96 (26+30+40)     1
   ...........................................................1      
   2    31.6.2008   2008    40        1     96 (26+30+40)     1
   2    1.5.2008    2008    51        .     161 (51+50+60)    2
   2    2.5.2008    2008    50        .     161 (51+50+60)    2
   ...........................................................2
   2   31.6.2009    2009    60        1     161 (51+50+60)    2

I tried to write different loops (while, if), but I failed to do it correctly. I cannot use rolling as my intervals are not the same.

My other idea was to create the group identifier first (called event_id), which contains the event_id for each interval and each company. Then I could use bysort id event_id: egen sum1=total(var1), but unfortunately I do not have any idea how to do that. So, the variables event_id and sum1 in my panel do not exist and serve as an example for output I want to achieve.

I'm having some trouble understanding `event`. For the first instance, `event` = 1 for the first *and* last occurrence, whereas all other events are only marked with `event` = 1 in the last occurrence. Am I missing something? — ander2ed, Oct 08 '15 at 15:22
You are right, my explanation is misleading. I edited it a little bit. The event in my setting is a publication date of the annual report. That's why the event days are different for different companies or even some times for the same company. The sum1 is the sum of all observations for `var1` between two publication dates. As I do not have the data prior to 1.1.2008. I take as a starting point for the first interval 1.1.2008. — In777, Oct 08 '15 at 15:58

Nick Cox · Accepted Answer · 2015-10-08T16:26:24.630

I can make sense of the example with the following changes:

Dates 31 April and 31 June are typos for 1 day earlier.
Date 31.6.2008 should however be 30.4.2008.

That said, one trick of reversing time makes subdivision into spells easy. Given markers 1 for the ends of each spell, we can then cumulate backwards using sum(). The crucial small detail here is that sum() ignores missing values, or more precisely treats them as zero. Here that is entirely a feature, although not quite what the OP wants when applying egen, total().

Then reverse spell numbering, reverse time to the normal direction and apply egen as in other answers. Reversing and reversing back are both just negation using -. Sorting on date within panel is just cosmetic once we have a division into spells, but still the right thing to do.

For more on spells in Stata, see here

For hints from Statalist on how to provide data examples using dataex (SSC), which apply here too with minor modification, see here

clear *
input id str10 date year var1 event DesiredSum 
   1 1.1.2008  2008 25 . 95 
   1 2.1.2008  2008 30 . 95 
   1 30.4.2008 2008 40 1 95 
   1 1.5.2008  2008 50 . 160  
   1 2.5.2008  2008 50 . 160  
   1 30.4.2009 2009 60 1 160  
   2 1.1.2008  2008 26 . 96 
   2 2.1.2008  2008 30 . 96     
   2 30.4.2008 2008 40 1 96
   2 1.5.2008  2008 51 . 161
   2 2.5.2008  2008 50 . 161 
   2 30.6.2009 2009 60 1 161    
end

gen ddate = -daily(date, "DMY") 
bysort id (ddate): gen EVENT = sum(event) 
replace ddate = -ddate 
by id: replace EVENT = EVENT[_N] - EVENT + 1 
bysort id EVENT (ddate): egen Sum = total(var1), missing
assert Sum == DesiredSum 
list, sepby(id EVENT) 

     +-----------------------------------------------------------------------+
     | id        date   year   var1   event   Desire~m   ddate   EVENT   Sum |
     |-----------------------------------------------------------------------|
  1. |  1    1.1.2008   2008     25       .         95   17532       1    95 |
  2. |  1    2.1.2008   2008     30       .         95   17533       1    95 |
  3. |  1   30.4.2008   2008     40       1         95   17652       1    95 |
     |-----------------------------------------------------------------------|
  4. |  1    1.5.2008   2008     50       .        160   17653       2   160 |
  5. |  1    2.5.2008   2008     50       .        160   17654       2   160 |
  6. |  1   30.4.2009   2009     60       1        160   18017       2   160 |
     |-----------------------------------------------------------------------|
  7. |  2    1.1.2008   2008     26       .         96   17532       1    96 |
  8. |  2    2.1.2008   2008     30       .         96   17533       1    96 |
  9. |  2   30.4.2008   2008     40       1         96   17652       1    96 |
     |-----------------------------------------------------------------------|
 10. |  2    1.5.2008   2008     51       .        161   17653       2   161 |
 11. |  2    2.5.2008   2008     50       .        161   17654       2   161 |
 12. |  2   30.6.2009   2009     60       1        161   18078       2   161 |
     +-----------------------------------------------------------------------+

Thank you for suggesting a very helpful article about spells. I found your explanation and code very detailed and I learned some interesting syntax from you. — In777, Oct 09 '15 at 13:58

Brendan · Answer 2 · 2015-10-08T16:46:08.583

It looks like you are essentially trying to create totals for unique combinations of id and eventid, not id and year. Based on your example, the event date and "special date" flag (event) don't seem to matter in calculating the desired sum. Therefore

bysort id eventid: egen _sum = total(var1)

or more simply

egen _sum = total(var1) , by(id eventid)

should both give you the total you want. Regarding

Besides I would like to account for missing values, so if all values of var1 for the company x are missing variable, sum1 for company x and specific interval must contain missing values and not 0.

The missing option on egen total() should help take care of this condition.

Update

Not necessarily an improvement on the other answers, but yet another method (relying on the events being in the proper order in the raw data):

clear *
input id str10 date year var1 event DesiredSum 
   1 1.1.2008  2008 25 . 95 
   1 2.1.2008  2008 30 . 95 
   1 30.4.2008 2008 40 1 95 
   1 1.5.2008  2008 50 . 160  
   1 2.5.2008  2008 50 . 160  
   1 30.4.2009 2009 60 1 160  
   2 1.1.2008  2008 26 . 96 
   2 2.1.2008  2008 30 . 96     
   2 30.4.2008 2008 40 1 96
   2 1.5.2008  2008 51 . 161
   2 2.5.2008  2008 50 . 161 
   2 30.6.2009 2009 60 1 161    
end

gen _obs = _n
gen date2 = daily(date, "DMY")
format date2 %td

bys id (_obs): gen eventid = sum(date2 == td(01jan2008)) + sum(event[_n-1] == 1)
egen sum = total(var1) , by(id eventid)  missing

li , sepby(id eventid)

Thank you for your answer and your editing. I think, It was not clear in my question that the variable event_id does not exist and I would like to create it based on the variable event first. Only then I can use the code you suggested. The option `missing` is indeed were useful. — In777, Oct 08 '15 at 15:28
Indeed, I misunderstood. Thanks for the clarification. I've edited my answer. — Brendan, Oct 08 '15 at 16:39
Thank you once more for your help. I liked that your alternative approach is very short and efficient. — In777, Oct 09 '15 at 14:08

ander2ed · Answer 3 · 2015-10-08T15:53:01.740

If you are not opposed to re-coding event into something a bit easier to work with, the following should suffice. I'm also assuming here that event is used to flag the end of the time interval for which the event occurred (I make this assumption based on your sample data and my comment on the question).

clear *
input id str10 date year var1 event DesiredSum 
   1 1.1.2008  2008 25 . 95 
   1 2.1.2008  2008 30 . 95 
   1 31.4.2008 2008 40 1 95 
   1 1.5.2008  2008 50 . 160  
   1 2.5.2008  2008 50 . 160  
   1 31.4.2009 2009 60 1 160  
   2 1.1.2008  2008 26 . 96 
   2 2.1.2008  2008 30 . 96     
   2 31.6.2008 2008 40 1 96
   2 1.5.2008  2008 51 . 161
   2 2.5.2008  2008 50 . 161 
   2 31.6.2009 2009 60 1 161    
 end

 bysort id : gen i = _n  // to maintain sort order

 /* This section of code changes event so that 1 indicates the start of the 
    interval. This data structure makes more sense to me */
 replace event = 0 if mi(event)
 replace event = 2 if event[_n-1] == 1 & _n != 1
 replace event = event - 1 if event > 0
 replace event = 1 in 1

 gen event_id = event
 replace event_id = event_id+event_id[_n-1] if i != 1

 bysort id event_id : egen Sum = total(var1), missing

li id date event_id DesiredSum Sum, sepby(event_id)

Naturally, if you didn't want to change event, you could generate event2 = event to use in place of event.

Although I did something different in detail, but loosely similar in principle, this made preparing my own answer easier. — Nick Cox, Oct 08 '15 at 16:18
Always glad to help. As setting up the data input is typically the largest task for me in preparing an answer, I assume having that in place saved you a fair amount of time. — ander2ed, Oct 08 '15 at 16:27
Thank you for your detailed code. I was easy to understand all steps (one by one). As I am a beginner at STATA it is very important for me. It does function really well. — In777, Oct 09 '15 at 14:01

Stata: Generate sum / total by specific date ranges and save them as a new variable

3 Answers3

Update