3

I am trying to gather simple statistics on data, such as the time a service is online, the time a service is offline, averages, etc. I've found a few solutions already, but they all rely on certain assumptions, such as the rows being back to back (ROW_NUMBER - 1) or there being only two states.

My data comes in the form of logs, always after the fact (i.e. no live data). The largest problem for me to figure out is that there are more than two states. Currently there are four possible states (Enabled, Disabled, Active, Inactive), and I'd like to be able to collect data on each.

I'm provided the data a single line at a time, containing a Service Name, Old Status, New Status, and a Timestamp. Currently the data is stored in a single table. I cannot change how the data is provided, but I can change how it is stored, and I'm beginning to think that table is my primary drawback.

Here is an example of how the data may end up in my table currently:

CREATE TABLE IF NOT EXISTS statusupdates (
  sid int UNIQUE, 
  fullname VARCHAR(64), 
  oldstatus VARCHAR(16), 
  newstatus VARCHAR(16), 
  time TIMESTAMP);

INSERT INTO statusupdates VALUES
(null, 'fictHTTP', 'Off', 'On', '2017-01-01 02:20:00'),
(null, 'faked', 'On', 'Inactive', '2017-01-01 02:25:00'),
(null, 'ipsum', 'Inactive', 'On', '2017-01-01 02:30:00'),
(null, 'resultd', 'On', 'Inactive', '2017-01-01 02:35:00'),
(null, 'ipsum', 'On', 'Active', '2017-01-01 02:40:00'),
(null, 'fictHTTP', 'On', 'Active', '2017-01-01 02:45:00'),
(null, 'faked', 'Inactive', 'Off', '2017-01-01 02:50:00'),
(null, 'ipsum', 'Active', 'Off', '2017-01-01 02:55:00'),
(null, 'resultd', 'Inactive', 'Off', '2017-01-01 03:00:00');

I believe one method I've found is to narrow it down to one item, such as resultd, with something like `SELECT fullname, newstatus, time FROM statusupdates WHERE fullname='resultd' ORDER BY time DESC;`. Then, with that data, do another query with the same method, but go one step forward (since it's in descending order) and get newstatus from that record. As I type that, it seems sloppy.

Alternatively, grab oldstatus and, in the second query, use it to find the newstatus of the following record. But again, this may be sloppy.

I know there is a way to combine those two theoretical queries as well. So, to summarize: I'm far in over my head, forgive me! In the end I'd like to see stats like total time, average time, etc. for each status. My biggest hurdle right now is getting a query to deliver a result, for example every timestamp entry for ipsum, in such a way that I can get the duration of time from the prior entry, and then repeat this until it's gone through all of the records.
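To make the goal concrete, here's roughly what I'm after for a single service, as a made-up Python sketch (all names and data hypothetical): consecutive rows paired up so each status yields a duration.

```python
from datetime import datetime

# Hypothetical rows for one service, already sorted by time,
# as (fullname, newstatus, time) tuples pulled from the table.
rows = [
    ("ipsum", "On",     "2017-01-01 02:30:00"),
    ("ipsum", "Active", "2017-01-01 02:40:00"),
    ("ipsum", "Off",    "2017-01-01 02:55:00"),
]

def durations(rows):
    """Pair each row with the next one; the earlier row's newstatus
    lasted from its own timestamp until the next row's timestamp."""
    out = []
    for (_, status, t1), (_, _, t2) in zip(rows, rows[1:]):
        d1 = datetime.strptime(t1, "%Y-%m-%d %H:%M:%S")
        d2 = datetime.strptime(t2, "%Y-%m-%d %H:%M:%S")
        out.append((status, (d2 - d1).total_seconds()))
    return out

print(durations(rows))  # [('On', 600.0), ('Active', 900.0)]
```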

Or perhaps I'm entirely overthinking this and am making it too complex by shoving all the data into one table, which I've done twice on this project thus far for unrelated items.

Additional thought: for a single instance, I could do `SELECT oldstatus, newstatus, time FROM statusupdates WHERE time = '2017-01-01 03:00:00'`. Then I could use oldstatus like this: `SELECT oldstatus, newstatus, time FROM statusupdates WHERE time < 'timeStamp' AND newstatus = 'oldStatus'`, and then subtract the two timestamps, which would give me the data for one example. But then, how to do it for the next step, and the next, until it's hit all of them?

Update, another thought: with a combination of some of your fantastic suggestions, what about reading the logs backward? Never mind; at that point it wouldn't matter which direction they were read in. When it encounters a status, create an incomplete record containing oldstatus and the timestamp as end_time. Then, when it encounters that service again, check whether newstatus = oldstatus and update the record with that timestamp as start_time.

This seems like it would cause a hell of a lot of overhead, though. Every record would have to be checked to see if it exists: if not, make one; if so, update it. Or maybe that's not too bad?
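Something like this, as a rough Python sketch (all names made up): a dict keyed by service name, so the "check if a record exists" step is just a hash lookup rather than a table scan.

```python
open_records = {}   # fullname -> (status we are currently in, its start time)
completed = []      # (fullname, status, start_time, end_time)

def on_log_line(fullname, oldstatus, newstatus, timestamp):
    # Close out the status we were in, if we've seen this service before.
    if fullname in open_records:
        status, start = open_records[fullname]
        if status == oldstatus:   # sanity check: records should chain
            completed.append((fullname, status, start, timestamp))
    # Open a record for the status we just entered.
    open_records[fullname] = (newstatus, timestamp)

on_log_line("ipsum", "Inactive", "On",     "2017-01-01 02:30:00")
on_log_line("ipsum", "On",       "Active", "2017-01-01 02:40:00")
on_log_line("ipsum", "Active",   "Off",    "2017-01-01 02:55:00")
print(completed)
```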

mrUlrik
  • 150
  • 11
  • this sounds like a stream of data. Did you look into Streams? – bichito Jan 17 '17 at 23:08
  • I feel like it would be easier if I did. But no, static text files far after they are written. – mrUlrik Jan 17 '17 at 23:09
  • Maybe you could aggregate the file data? – bichito Jan 17 '17 at 23:13
  • That is essentially what I'm trying to do. My biggest problem is how to build that in the first place. A single instance, I could do `SELECT old_status, new_status, time FROM statusupdates WHERE time = '2017-01-01 03:00:00'` Then I could use old_status like this, `SELECT old_status, new_status, time FROM statusupdates WHERE time < 'timeStamp' AND new_status = 'oldStatus'` Then subtract the two timestamps which would give me the data for one example. But, then how to do it for the next step, and the next, until it's hit all of them. – mrUlrik Jan 18 '17 at 00:07
  • Do a search for Java 8 streams reading a file line by line. There are several examples that may help you – bichito Jan 18 '17 at 02:44

2 Answers

3

Do you have access to window functions in your database? If so, you can get the value of the next row for each record (partitioned by fullname):

  select  fullname,
          newstatus,
          avg( time_diff ) as avg_time
  from    (
            select  fullname,
                    oldstatus,
                    newstatus,
                    time,
                    /* time of the next row for this fullname, minus this
                       row's time; lead() takes no frame clause */
                    lead( time ) over(
                      partition by fullname
                      order by time
                    ) - time as time_diff
            from    statusupdates
          ) as a
   group by fullname,
          newstatus

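As a sanity check outside HSQLDB, SQLite (3.25+) also has lead; here is a cut-down sketch driven from Python, with strftime('%s', ...) used to turn the text timestamps into epoch seconds for the subtraction (that part is SQLite-specific):

```python
import sqlite3

# Cut-down sample data, one service only.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE statusupdates (
  fullname VARCHAR(64), oldstatus VARCHAR(16),
  newstatus VARCHAR(16), time TIMESTAMP);
INSERT INTO statusupdates VALUES
('ipsum', 'Inactive', 'On',     '2017-01-01 02:30:00'),
('ipsum', 'On',       'Active', '2017-01-01 02:40:00'),
('ipsum', 'Active',   'Off',    '2017-01-01 02:55:00');
""")

rows = conn.execute("""
SELECT fullname, newstatus, AVG(time_diff) AS avg_time
FROM (
  SELECT fullname, newstatus,
         /* seconds until the next status change for this fullname */
         strftime('%s', LEAD(time) OVER (PARTITION BY fullname ORDER BY time))
           - strftime('%s', time) AS time_diff
  FROM statusupdates
) AS a
WHERE time_diff IS NOT NULL   -- the last row per service has no next row
GROUP BY fullname, newstatus
""").fetchall()

print(sorted(rows))  # [('ipsum', 'Active', 900.0), ('ipsum', 'On', 600.0)]
```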
EDIT

In the absence of window functions, you can get the next_time in a slightly more convoluted way:

select a.*,
       b.next_time
from   statusupdates as a
       left join
       (
       select a.fullname,
              a.time,
              min( b.time ) as next_time
       from   statusupdates as a
              left join
              statusupdates as b
              on a.fullname = b.fullname
              and a.time < b.time
       group by a.fullname,
              a.time
       ) as b
       on a.fullname = b.fullname
       and a.time = b.time
;
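The same self-join can be checked quickly in SQLite from Python (again a cut-down sketch of the sample data, not the real HSQLDB setup):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE statusupdates (
  fullname VARCHAR(64), oldstatus VARCHAR(16),
  newstatus VARCHAR(16), time TIMESTAMP);
INSERT INTO statusupdates VALUES
('ipsum', 'Inactive', 'On',     '2017-01-01 02:30:00'),
('ipsum', 'On',       'Active', '2017-01-01 02:40:00'),
('ipsum', 'Active',   'Off',    '2017-01-01 02:55:00');
""")

# For each row, find the earliest later timestamp for the same service.
rows = conn.execute("""
SELECT a.fullname, a.newstatus, a.time, b.next_time
FROM statusupdates AS a
LEFT JOIN (
  SELECT a.fullname, a.time, MIN(b.time) AS next_time
  FROM statusupdates AS a
  LEFT JOIN statusupdates AS b
    ON a.fullname = b.fullname AND a.time < b.time
  GROUP BY a.fullname, a.time
) AS b ON a.fullname = b.fullname AND a.time = b.time
ORDER BY a.time
""").fetchall()

# The last row per service has no successor, so its next_time is NULL.
print(rows)
```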
Alex
  • 1,633
  • 12
  • 12
  • Unfortunately I do not believe HSQLDB supports LEAD. :( – mrUlrik Jan 18 '17 at 01:58
  • Updated to try something that doesn't use lead :-) – Alex Jan 18 '17 at 02:26
  • Thank you very much for that! Great tutorial on sub queries, actually. I imagine to end up with the full set of data for all records of one status update, I assume I'll have to run that query for every line? Currently I'm at about 63,000 records. Of course, once I get this figured out I'll take that number down by specifying ranges. I'm not sure if SQL has some form of loop or some trickery I can perform to cause a loop. – mrUlrik Jan 18 '17 at 14:06
  • I'm glad this helped! I think you should be able to use the resulting query in a flexible way for your reporting. E.g. you could use it as a sub query, group by fullname and subtract time from next_time to get the total time elapsed. Then you can sum or average the time elapsed. – Alex Jan 19 '17 at 03:28
  • I tend to find that although you can do things with loops in SQL with certain tricks, it tends to be overly complex and not a good fit for what SQL is good at. Working on the data as a whole (with appropriate filters where needed) can be surprisingly quick, even with large tables. Grouping and aggregating is incredibly powerful. You could also look into indexing fields to improve performance. – Alex Jan 19 '17 at 03:32
  • Thank you very much for the help @Alex! – mrUlrik Jan 19 '17 at 23:29
1

You may want to reconsider your data structure for this, as:

statusUpdate {
  fullName,
  oldStatus,
  newStatus,
  startTime,
  endTime
}

Now you can easily write an SQL query to get your stats. Example:

select fullName, sum(endTime - startTime) from statusUpdate where oldStatus='Active' group by fullName

If you don't have any control over the database, then you could build one in memory, but that will be very costly if the data is large in volume.
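A rough sketch of that shape (SQLite via Python here just to show the query; I've collapsed to a single status column per interval, and strftime('%s', ...) is SQLite's way of converting the text timestamps to epoch seconds for the subtraction):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE statusUpdate (
  fullName VARCHAR(64), status VARCHAR(16),
  startTime TIMESTAMP, endTime TIMESTAMP);
INSERT INTO statusUpdate VALUES
('ipsum', 'On',     '2017-01-01 02:30:00', '2017-01-01 02:40:00'),
('ipsum', 'Active', '2017-01-01 02:40:00', '2017-01-01 02:55:00');
""")

# Total seconds each service spent in the 'Active' status.
rows = conn.execute("""
SELECT fullName,
       SUM(strftime('%s', endTime) - strftime('%s', startTime)) AS total_seconds
FROM statusUpdate
WHERE status = 'Active'
GROUP BY fullName
""").fetchall()

print(rows)  # [('ipsum', 900)]
```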

Edit

The solution from Alex seems to be the best so far, but if the database is completely out of your control, you can try to build your stats while parsing the log file, given that the log file is guaranteed to list records sorted by time. This may use less memory and can be fine-tuned further.

public class Aggregation {

    String fullName;
    String currStatus;
    Date currTime;

    // status -> list of time periods (ms) spent in that status
    Map<String, List<Long>> timePeriodListMap = new HashMap<>();

    public void add(Status status) {
        if (currTime == null) {
            // first record seen for this service
            fullName = status.fullName;
        } else {
            if (!fullName.equals(status.fullName)) {
                throw new RuntimeException("Wrong " + fullName);
            }
            if (!currStatus.equals(status.oldStatus)) {
                throw new RuntimeException("Previous record's newStatus is not this record's oldStatus");
            }
            if (currTime.compareTo(status.time) > 0) {
                throw new RuntimeException("Unsorted by time");
            }
            // close out the status we were in: its duration runs from
            // currTime up to this record's timestamp
            if (!timePeriodListMap.containsKey(status.oldStatus)) {
                timePeriodListMap.put(status.oldStatus, new ArrayList<Long>());
            }
            timePeriodListMap.get(status.oldStatus).add(status.time.getTime() - currTime.getTime());
        }
        currStatus = status.newStatus;
        currTime = status.time;
    }

}

Map<String, Aggregation> statusDB = new HashMap<String, Aggregation>();

// read from the file, one status record at a time
public void process(Status status) {
    if (!statusDB.containsKey(status.fullName)) {
        statusDB.put(status.fullName, new Aggregation());
    }
    statusDB.get(status.fullName).add(status);
}
old-monk
  • 799
  • 8
  • 20
  • This is fantastic. Thank you for the quick tutorial on nested queries and joins. With this information, I could then populate a table containing start and stop dates. That would simplify this dramatically. Now I'm pondering whether I should figure out how to do this before the database gets involved, using that same method. But off the top of my head, I imagine I'd have to load the entire file into memory first, because I'd be skipping ahead. – mrUlrik Jan 18 '17 at 15:12
  • I may use your solution due to performance issues. Thank you again. – mrUlrik Jan 19 '17 at 23:32