0

I have data that has a reference date and a publish date. Similar to economic reports which are published/released on different dates than they reference (i.e. Q4 GDP for 2014 references the date 12/31/2014 but is published the following week on 01/07/2015). Multiple references date values can be published on a single publish date. I want to be able to add data together which has similar structure just with misaligned and duplicate reference and publish dates.

Below is a sample of the data for Item A:

Publish_ItemA           Reference_ItemA         Value_ItemA
2002-01-10 00:00:00.000 2001-09-30 00:00:00.000 83
2002-02-14 00:00:00.000 2001-12-31 00:00:00.000 48
2002-05-23 00:00:00.000 2002-03-31 00:00:00.000 57
2002-08-15 00:00:00.000 2002-06-30 00:00:00.000 41
2002-12-31 00:00:00.000 2002-09-30 00:00:00.000 18
2003-02-13 00:00:00.000 2002-12-31 00:00:00.000 18
2003-05-22 00:00:00.000 2003-03-31 00:00:00.000 29
2003-08-21 00:00:00.000 2003-06-30 00:00:00.000 40
2003-12-31 00:00:00.000 2003-09-30 00:00:00.000 51
2004-12-16 00:00:00.000 2002-12-31 00:00:00.000 17
2004-12-16 00:00:00.000 2003-03-31 00:00:00.000 28
2004-12-16 00:00:00.000 2003-06-30 00:00:00.000 33
2004-12-16 00:00:00.000 2003-09-30 00:00:00.000 60
2004-12-16 00:00:00.000 2003-12-31 00:00:00.000 107

Below is a sample of the data for Item B:

Publish_ItemB           Reference_ItemB         Value_ItemB        
2001-01-25 00:00:00.000 2000-12-31 00:00:00.000 -207
2001-04-25 00:00:00.000 2000-12-31 00:00:00.000 -195
2001-04-25 00:00:00.000 2001-03-31 00:00:00.000 43
2001-07-19 00:00:00.000 2001-06-30 00:00:00.000 61
2001-10-18 00:00:00.000 2001-09-30 00:00:00.000 66
2002-01-17 00:00:00.000 2001-12-31 00:00:00.000 38
2002-04-24 00:00:00.000 2002-03-31 00:00:00.000 40
2002-07-18 00:00:00.000 2002-06-30 00:00:00.000 32
2002-10-17 00:00:00.000 2002-09-30 00:00:00.000 -45
2003-01-16 00:00:00.000 2002-12-31 00:00:00.000 -8
2003-04-24 00:00:00.000 2003-03-31 00:00:00.000 14
2003-07-17 00:00:00.000 2003-06-30 00:00:00.000 19
2003-10-23 00:00:00.000 2003-09-30 00:00:00.000 44
2004-01-22 00:00:00.000 2003-12-31 00:00:00.000 63

I would like to be able to do alignments and arithmetic with columns of values (i.e. itemAframe?Value_ItemA + itemBframe?Value_ItemB) and return a series with either the reference date or the publish date dependent on which was required.

Aligning with the reference date is easy because the dates are non-overlapping so there is no issue with a duplicate key, but returning the frame with a publish date is problematic because not all keys will be unique

Any suggestion would be much appreciated.

Thanks!

user1129988
  • 1,516
  • 4
  • 19
  • 32

1 Answers1

0

The answer depends on what you want to do when there are multiple values for a given (duplicate) publish day. Will there be the same number of keys in both of the frames? Do you have some way of aggregating the values (e.g. take average or sum them)?

For example, let's say that Publish and Reference are just integers:

let f =  
  frame [ "Publish" => Series.ofValues [ 1; 1; 2; 2 ]  
          "Reference" => Series.ofValues [ 1; 2; 3; 4 ]
          "Value" => Series.ofValues [ 10; 9; 11; 8] ]

You can get a frame with multi-level index (grouped by publish day like this):

f
|> Frame.groupRowsByInt "Publish"

Now your keys will be tuples - the first element is the "Publish" value and the second is the original row index (here, just an ordinal index - but you could also use the "Reference" date as the secondary part of the index). If you have some way of making the keys match at this point (e.g. there is a same number of duplicates in both frames and ordinal indexing is good enough), then you can just use the frames as they are now.

However, the next thing you can do is to create a series of frames, containing the groups:

f
|> Frame.groupRowsByInt "Publish"
|> Frame.nest

So, for example, if you wanted to get an average value for each Publish day, you could do:

f
|> Frame.groupRowsByInt "Publish"
|> Frame.nest
|> Series.mapValues (fun df -> df?Value |> Stats.mean)

Alternatively, you can create a series that has a list of values for each "Publish" date, but this will make further calculations harder:

f
|> Frame.groupRowsByInt "Publish"
|> Frame.nest
|> Series.mapValues (fun df -> df?Value.Values |> List.ofSeq)

Fundamentally, you need some indexing scheme that will uniquely identify rows in both of the frames, so that you can align them. The key could be just the "Publish" date or the "Publish" date together with something else.

Tomas Petricek
  • 240,744
  • 19
  • 378
  • 553
  • First, thanks Tomas.. Couldn't I just use the reference date as the unique key to aligned them? For instance, if I wanted to select all the values of Publish dates prior to a specific publish date and then take all the latest reference dates? – user1129988 Jul 08 '15 at 15:33