Handling Missing Data, Incomplete data is an issue faced by all Data Miners/ Developers (depending on job function) in the industry.
If I want to convert hourly data (24 records) to daily (1 record), what are the best practices to achieve this.
- use
having count(*) = 24
ingroup by
clause. So how/when to import those records which were < 24 and if they miss permanently. - Don't use
having count(*) = 24
clause, so all the data even if it is < 24 will be imported. So how to re-insert those values later on when they become 24.
Note I have primary key on date/time fields so I need to delete those incomplete records first before re-inserting (which I don't want to do).
Sample data
Hourly Table: PK on Time, ElementNum, RouteName
Time ElementNum RouteName NorthTraffic SouthTraffic
15-3-14 0:00 Element1 Far Yards 12.3 14.5
15-3-14 1:00 Element1 Far Yards 10 11
15-3-14 2:00 Element1 Far Yards 9 10
15-3-14 3:00 Element1 Far Yards 8.5 9.1
… .. .. .. ..
15-3-14 21:00 Element1 Far Yards 25 26.2
15-3-14 22:00 Element1 Far Yards 29.3 33.7
15-3-14 23:00 Element1 Far Yards 25.6 27.8
15-3-14 0:00 Element1 JJ Park 80.1 98.8
15-3-14 1:00 Element1 JJ Park 75.3 86
15-3-14 2:00 Element1 JJ Park 70.9 71.1
15-3-14 3:00 Element1 JJ Park 66.5 67.4
… … … … …
15-3-14 21:00 Element1 JJ Park 112.1 115.5
15-3-14 22:00 Element1 JJ Park 125.4 130.7
15-3-14 23:00 Element1 JJ Park 120 121.3
SQL query to convert Hourly to Daily; executes everyday at 0700 hrs:
SELECT
convert(varchar(10), Time, 120),
ElementNum, RouteName,
MAX(NorthTraffic), MAX(SouthTraffic),
MAX(NorthTraffic) + MAX(SouthTraffic) TotalTrafficMAX
FROM
HourlyTable
WHERE
Time = '15/03/2014'
GROUP BY
convert(varchar(10), Time, 120), ElementNum, RouteName
HAVING
count(*) = 24
Daily Table PK on Date, ElementNum, RouteName
Date ElementNum RouteName NorthTrafficMAX SouthTrafficMAX TotalTrafficMAX
15-3-14 Element1 Far Yards 29.3 33.7 63
15-3-14 Element1 JJ Park 125.4 130.7 256.1
Problem
If the data isn't populated in Hourly Tables before 0700hrs (which is quite common) the daily table will not be populated.
My workaround
- Don't add having clause, but the data in daily table will be inaccurate. Then as a 2nd step write stored procedures which will run every hour and updates the MAX column of daily table. This way is too cumbersome. Though I have done this kind of update in many other projects, but here I feel I should have some professional approach and above all next to those columns of MAXTraffic there are many more columns where I have to apply formulas, calculate utilizations, percentage etc. Writing update for each one of them will be a huge challenge.
Need Solution
Is there a real world professional solution for this kind of situations.