0

Lets suppose we have the following table ("Purchases"):

Date                 Units_Sold             Brand       Year
18/03/2010                5                   A         2010
12/04/2010                2                   A         2010
22/05/2010                1                   A         2010
25/05/2010                7                   A         2010
11/08/2011                5                   A         2011
12/07/2010                2                   B         2010
22/10/2010                1                   B         2010
05/05/2011                7                   B         2011

And the same logic continues until the end of 2014, for different brands.

What I want to do is calculate the number of Units_Sold for every Brand, in each year. However, I don't want to do it for the calendar year, but for the actual year.

So an example of what I don't want:

proc sql;
create table Dont_Want as
select Year, Brand, sum(Units_Sold) as Unit_per_Year
from Purchases
group by Year, Brand;
quit;

The above logic is ok if we know that e.g. Brand "A" exists throughout the whole 2010. But if Brand "A" appeared on 18/03/2010 for the first time, and exists until now, then a comparison of Years 2010 and 2011 would not be good enough as for 2010 we are "lacking" 3 months.

So what I want to do is calculate:

for A: the sum from 18/03/2010 until 17/03/2011, then from 18/03/2011 until 17/03/2012, etc.

for B: the sum from 12/07/2010 until 11/07/2011, etc.

and so on for all Brands.

Is there a smart way of doing this?

Noob_Strider
  • 183
  • 2
  • 2
  • 14
  • Your question is not clear. What's the difference between what you **don't** want method and the method that you **want** ? – Kosala W Jan 05 '16 at 10:53
  • In the **don't want** method, the results I will get are the total Units_Sold for the calendar years (e.g. 2010). In the **want** method I would like to calculate the total Units_Sold of a full year of a Brand since its first appearance on the data-set. A full year for a brand is not limited to 01/01/2010 - 31/12/2010, but instead might be 18/03/2010 - 17/03/2011. Does that make sense? – Noob_Strider Jan 05 '16 at 11:00
  • So what constitute a year in this case? 365 days starting from the first date of appearance? – Kosala W Jan 05 '16 at 11:15
  • 1
    What about leap years? – Joe Jan 05 '16 at 16:17
  • @Joe, good point I had not thought about that (even though 2016 is a leap year!). However, for the needs of my study I doubt one day will make a difference. But in an academic environment your addition is 100% valid. – Noob_Strider Jan 05 '16 at 21:20

4 Answers4

3

Step 1: Make sure your dataset is sorted or indexed by Brand and Date

proc sort data=want;
     by brand date;
run;

Step 2: Calculate the start/end dates for each product

The idea behind the below code:

  1. We know that the first occurrence of the brand in the sorted dataset is the day in which the brand was introduced. We'll call this Product_Year_Start.

  2. The intnx function can be used to increment that date by 365 days, then subtract 1 from it. Let's call this date Product_Year_End.

  3. Since we now know the product's year end date, we know that if the date on any given row exceeds the product's year end date, we have started the next product year. We'll just take the calculated Product_Year_End and Product_Year_Start for that brand and bump them up by one year.

This is all achieved using by-group processing and the retain statement.

data Comparison_Dates;
    set have;
    by brand date;

    retain Product_Year_Start Product_Year_End;

    if(first.brand) then do;
        Product_Year_Start = date;
        Product_Year_End = intnx('year', date, 1, 'S') - 1;
    end;

    if(Date > Product_Year_End) then do;
        Product_Year_Start = intnx('year', Product_Year_Start, 1, 'S');
        Product_Year_End = intnx('year', Product_Year_End, 1, 'S');
    end;

    format Product_Year_Start Product_Year_End date9.;
run;

Step 3: Using the original SQL code, group instead by the new product start/end dates

proc sql;
    create table want as
    select catt(year(Product_Year_Start), '-', year(Product_Year_End) ) as Product_Year
         , Brand
         , sum(Units_Sold) as Unit_per_Year
    from Comparison_Dates
    group by Brand, calculated Product_Year
    order by Brand, calculated Product_Year;
quit;
Stu Sztukowski
  • 10,597
  • 1
  • 12
  • 21
  • Thanks for taking the time to organize your answer, it was for sure helpful and easy to understand for someone who hasn't come across some of the functions in your code e.g. intnx. – Noob_Strider Jan 05 '16 at 21:13
1

The following code is doing what you ask in a literal sense, for the earliest 'date' of each 'brand', it start aggregating 'unitssold', when hits 365 days mark, it resets count, and starts another cycle.

data have;
    informat date ddmmyy10.;
    input date units_sold brand $ year;
    format date date9.;
    cards;
18/03/2010                5                   A         2010
12/04/2010                2                   A         2010
22/05/2010                1                   A         2010
25/05/2010                7                   A         2010
11/08/2011                5                   A         2011
12/07/2010                2                   B         2010
22/10/2010                1                   B         2010
05/05/2011                7                   B         2011
;

proc sort data=have;
    by brand date;
run;

data want;
    do until (last.brand);
        set have;
        by brand date;

        if first.brand then
            do;
                Sales_Over_365=0;
                _end=intnx('day',date,365);
            end;

        if date <= _end then
            Sales_Over_365+units_sold;
        else
            do;
                output;
                Sales_Over_365=units_sold;
                _end=intnx('day',date,365);
            end;
    end;

    output;
    drop _end;
run;
Haikuo Bian
  • 906
  • 6
  • 7
  • Agreed @Joe. Some times you code without thinking, at the moment, it is definitely easier for me to do intnx() than just +365. – Haikuo Bian Jan 05 '16 at 17:47
  • @Joe I beg to differ. In such a community it's not only important to find a solution, but also to explore different ways of resolving a problem. Haikuo's "day" made me search further about the `intnx` function and explore its different options, since, in the answers section he is the only one who used "day" instead of "year". – Noob_Strider Jan 05 '16 at 21:17
  • @Noob_Strider Sure, and that was going to be my next suggestion - use `year`. But for days it's pretty pointless since they're single units... – Joe Jan 05 '16 at 22:44
1

You need to have a start date for each brand. For now we can use the first sale date, but that might not be what you want. Then you can classify each sales date into which year it is for that brand.

Let's start by creating a dataset from your sample data. The YEAR variable is not needed.

data have ;
  input Date Units_Sold Brand $ Year ;
  informat date ddmmyy10.;
  format date yymmdd10.;
cards;
18/03/2010 5 A 2010
12/04/2010 2 A 2010
22/05/2010 1 A 2010
25/05/2010 7 A 2010
11/08/2011 5 A 2011
12/07/2010 2 B 2010
22/10/2010 1 B 2010
05/05/2011 7 B 2011
;;;;

Now we can get the answer you want with an SQL query.

proc sql ;
  create table want as
   select brand
        , start_date
        , 1+floor((date - start_date)/365) as sales_year
        , intnx('year',start_date,calculated sales_year -1,'same')
            as start_sales_year format=yymmdd10.
        , sum(units_sold) as total_units_sold
  from
  ( select brand
        , min(date) as start_date format=yymmdd10.
        , date
        , units_sold
    from have
    group by 1
   )
  group by 1,2,3,4
  ;
quit;

This will produce this result:

                                               total_
                       sales_      start_      units_
Brand    start_date     year     sales_year     sold
  A      2010-03-18       1      2010-03-18      15
  A      2010-03-18       2      2011-03-18       5
  B      2010-07-12       1      2010-07-12      10
Tom
  • 47,574
  • 2
  • 16
  • 29
0

There is no straight forward way of doing it. You can do something like this.

To test the code, I saved your table in to a text file.

Then I created a class called Sale.

public class Sale
{
    public DateTime Date { get; set; }
    public int UnitsSold { get; set; }
    public string Brand { get; set; }
    public int Year { get; set; }
}

Then I populated a List<Sale> using the saved text file.

var lines = File.ReadAllLines(@"C:\Users\kosala\Documents\data.text");
var validLines = lines.Where(l => !l.Contains("Date")).ToList();//remove the first line.

List<Sale> sales = validLines.Select(l => new Sale()
        {
            Date = DateTime.Parse(l.Substring(0,10)),
            UnitsSold = int.Parse(l.Substring(26,5)),
            Brand = l.Substring(46,1),
            Year = int.Parse(l.Substring(56,4)),
        }).ToList();

//All the above code is for testing purposes. The actual code starts from here.
var totalUnitsSold = sales.OrderBy(s => s.Date).GroupBy(s => s.Brand);

        foreach (var soldUnit in totalUnitsSold)
        {
            DateTime? minDate = null;
            DateTime? maxDate = null;
            int total = 0;
            string brand = "";

            foreach (var sale in soldUnit)
            {
                brand = sale.Brand;
                if (minDate == null)
                {
                    minDate = sale.Date;
                }
                if ((sale.Date - minDate).Value.Days <= 365)
                {
                    maxDate = sale.Date;
                    total += sale.UnitsSold;
                }
                else
                {
                    break;
                }
            }
            Console.WriteLine("Brand : {0} UnitsSold Between {1} - {2} is {3}",brand, minDate.Value, maxDate.Value, total);
   }
Kosala W
  • 2,133
  • 1
  • 15
  • 20
  • Thanks for your response. Although I see the reasoning behind your code, I cannot test it as it is in a format different than Base SAS. – Noob_Strider Jan 05 '16 at 15:39