
I'm trying to reduce the size of a table that is growing at an alarming rate by consolidating similar entries. I have a working 'Group By' query that rolls up the data the way I want; I'm just not sure how to delete the existing entries and insert the new ones based on that select statement. The most straightforward approach would be to run the select, save the results into memory, delete the table entries, then insert the consolidated entries. But that seems inefficient and memory-hungry, so I'm hoping there is some SQL magic that makes this happen all at once.

This is my 'Group By' select:

select SUM(user_in), 
       SUM(user_out), 
       SUM(user_total), 
       name, 
       device_type, 
       device_os, 
       ownership, 
       host_name 
  FROM user_usage 
 GROUP BY name, 
          device_type, 
          device_os, 
          ownership, 
          host_name;

Thanks!
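
For reference, in PostgreSQL (which the comments below reveal this is) the select/delete/insert round trip can be collapsed into one statement with a data-modifying CTE. A minimal sketch, assuming user_usage has a serial id primary key and the columns used in the query above:

-- Delete everything and re-insert the rolled-up rows in a single statement.
-- The omitted id column picks up fresh values from its sequence (an assumption
-- about the table's definition).
WITH removed AS (
    DELETE FROM user_usage
    RETURNING *
)
INSERT INTO user_usage (user_in, user_out, user_total, name,
                        device_type, device_os, ownership, host_name)
SELECT SUM(user_in),
       SUM(user_out),
       SUM(user_total),
       name,
       device_type,
       device_os,
       ownership,
       host_name
  FROM removed
 GROUP BY name, device_type, device_os, ownership, host_name;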

In case anyone else ever runs into something like this: the following function accomplishes the consolidation very efficiently:

CREATE OR REPLACE FUNCTION consolidate_user_bandwidths(
  IN _tbl REGCLASS,
  IN _savetime TIMESTAMP
  ) RETURNS void AS
  $BODY$
    BEGIN
      -- Build the rolled-up rows in a temp table that is dropped automatically
      -- when the transaction commits. The column order must match the target
      -- table so the final INSERT ... SELECT * lines up.
      EXECUTE 'CREATE TEMP TABLE user_bandwidths_temp ON COMMIT DROP AS
            SELECT MAX(id) AS id, SUM(bandwidth_in) AS bandwidth_in,
                   SUM(bandwidth_out) AS bandwidth_out,
                   '''|| _savetime ||'''::timestamp AS created_at,
                   SUM(bandwidth_total) AS bandwidth_total, name,
                   device_type, device_os,
                   ownership, host_name
            FROM '|| _tbl ||'
            GROUP BY name, device_type,
                     device_os, ownership, host_name';
      -- Empty the original table, then refill it with the consolidated rows.
      EXECUTE 'TRUNCATE TABLE '|| _tbl;
      EXECUTE 'INSERT INTO '|| _tbl ||' SELECT * FROM user_bandwidths_temp';
    END;
  $BODY$
  LANGUAGE plpgsql VOLATILE;
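
A call might look like this; user_bandwidths_8 is just a stand-in for whichever day-old child table is being consolidated:

-- Hypothetical invocation: consolidate the (illustrative) eighth child table,
-- stamping the rolled-up rows with the current time as created_at.
SELECT consolidate_user_bandwidths('user_bandwidths_8'::regclass, now()::timestamp);
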
user3027875
  • What is your RDBMS? – Jorge Campos Mar 01 '17 at 21:37
  • If you're getting a lot of unwanted duplicates, you should have a look at the logic of the code that adds entries to the user_usage table – SpacePhoenix Mar 01 '17 at 21:38
  • Add table schema. – McNets Mar 01 '17 at 21:39
  • You could move data that is not queried (old data) to an [archive storage engine](https://dev.mysql.com/doc/refman/5.7/en/archive-storage-engine.html) – Dan Ionescu Mar 01 '17 at 21:39
  • One option is to create a temp table based on your query, delete all existing records from the actual table, insert all data from the temp table back into the actual table, and then drop the temp table. But I also strongly advise you to follow @SpacePhoenix's suggestion, because otherwise you will be facing the same issue again in the near future. – Jorge Campos Mar 01 '17 at 21:43
  • You could also maintain an aggregated table using a trigger that aggregates the data and then deletes the old data from your main table – Dan Ionescu Mar 01 '17 at 21:44
  • They aren't unwanted duplicates; we actually have charts that mine the data at the small increments we save it under. However, keeping data for every user every five minutes amasses a huge amount of data, so we've decided to start consolidating the data once it is seven days old and just keep what each user did for that entire day. We are already aggregating the data from a main table into child tables at the end of each day. – user3027875 Mar 01 '17 at 21:46
  • @Jorge Campos: We are using Postgres on top of a Rails server using ActiveRecord. I've managed to get this working using some Rails solutions, but it's taking well over an hour. Postgres ran the Group By query on 1M entries in ~2 minutes. – user3027875 Mar 01 '17 at 21:49
  • Then my suggestion stands. You can even create a procedure to do it. The ideal solution for your remaining problem would be to change the application's INSERT to an UPSERT (to increment the existing value) based on your grouped fields; see the sketch after these comments. That way you wouldn't have to worry about this problem again. – Jorge Campos Mar 02 '17 at 00:58
  • Your suggestion does not work. I need every entry for seven days. Once the entries become eight days old, I will start to consolidate them; I still need the new ones coming in to be stored separately so the data can be filtered in real time. The child tables are created at the end of every day and contain all the data from the last 24 hours. I'm trying to consolidate the 8th child table. – user3027875 Mar 02 '17 at 13:56
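
For completeness, the UPSERT that Jorge Campos describes would look roughly like this in PostgreSQL 9.5+; the unique index and the literal values are illustrative assumptions:

-- Assumes a unique index over the grouping columns, e.g.:
--   CREATE UNIQUE INDEX ON user_usage (name, device_type, device_os, ownership, host_name);
INSERT INTO user_usage (user_in, user_out, user_total, name,
                        device_type, device_os, ownership, host_name)
VALUES (100, 200, 300, 'jdoe', 'phone', 'ios', 'corporate', 'jdoe-iphone')
ON CONFLICT (name, device_type, device_os, ownership, host_name)
DO UPDATE SET user_in    = user_usage.user_in    + EXCLUDED.user_in,
              user_out   = user_usage.user_out   + EXCLUDED.user_out,
              user_total = user_usage.user_total + EXCLUDED.user_total;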

1 Answer


A possible solution is to use another aggregation table (e.g.: user_usage_aggregated). Periodically, run your query and write the output to the aggregation table. Then delete the raw data from user_usage.

If you want to keep summing up the data, you will need to update the aggregated table rather than insert into it.
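
A rough sketch of that periodic job, assuming user_usage_aggregated mirrors the columns of the GROUP BY query and that user_usage has a created_at column with a seven-day cutoff (both are assumptions based on the comments):

BEGIN;

-- Roll the old raw rows up into the aggregation table.
INSERT INTO user_usage_aggregated (user_in, user_out, user_total, name,
                                   device_type, device_os, ownership, host_name)
SELECT SUM(user_in), SUM(user_out), SUM(user_total),
       name, device_type, device_os, ownership, host_name
  FROM user_usage
 WHERE created_at < now() - interval '7 days'
 GROUP BY name, device_type, device_os, ownership, host_name;

-- Remove the raw rows that were just aggregated.
DELETE FROM user_usage
 WHERE created_at < now() - interval '7 days';

COMMIT;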

Ossin Java guy