
I am working on event analytics. I use Hadoop to process the logs and store some results in MySQL. This no longer works due to scalability issues, as logs keep coming in daily.

We need to show stats per year, month, week, day, and hour, along with filtering capability. Our sample can grow to 100k users, each using 20 websites each hour:
100,000 (users) * 20 (unique websites) * 2 (locations) * 24 (hours) = 96,000,000 (96 million records per day, max)

Our table looks like:
event_src_id, Time, User, Website, location, some stats

Some example queries are:

1) select website, sum(stats), count(distinct(user_id)) from table group by website;
2) select website, sum(stats), count(distinct(user_id)) from table where YEAR(Time) = 2009 group by website, MONTH(Time);
3) select website, sum(stats), count(distinct(user_id)) from table where event_src_id=XXXXXXXXXXX group by website;
4) select website, sum(stats), count(distinct(user_id)) from table where Time > '2014-01-01' and Time <= '2014-01-31' group by website;
5) select website, location, sum(stats), count(distinct(user_id)) from table group by website, location;
6) select website, sum(stats) as stats_val from table group by website order by stats_val desc limit 10;
   select location, sum(stats) as stats_val from table group by location order by stats_val desc limit 10;
7) delete from table where event_src_id=XXXXXXXXXXX; (may delete all 96M records)

I tried Elasticsearch for Hadoop, and it seems the insertion part can be fixed with that; I am more worried about the reading part. The aggregation framework seems to give some hope, but I could not make query 1 work: how do I group, sum, and count distinct at the same time? How can I best use Elasticsearch with Hadoop, given the scalability and performance requirements for OLAP-based queries? Any help will be appreciated.

maaz

2 Answers

Firstly, I do not think that using Elasticsearch for OLAP-like queries is a good idea. I would recommend using Dremel-like technologies (Impala, Tez, Storm, etc.) which support the SQL you specified. This approach has some advantages:

  • you do not have to transfer data from Hadoop into Elasticsearch.
  • you can use SQL.
  • you do not have to care about parsing JSON from Elasticsearch query responses.

Do not get me wrong, I love Elasticsearch/Logstash/Kibana, but for log collection and visualization. Of course it is possible to make some advanced queries, but it has some limitations which I ran into in my personal projects.

Also consider using Kibana; it is a great tool for data statistics in Elasticsearch, and you can do a lot with it.

Here are some example queries, as you requested (I have not tested them):

1)

{
  "aggs": {
    "website": {
      "terms": {
        "field": "website"
      },
      "aggs": {
        "sum_stats": {
          "sum": {
            "field": "stats"
          }
        },
        "distinct_user": {
          "cardinality": {
            "field": "user_id",
            "precision_threshold": 100
          }
        }
      }
    }
  }
}

(Note that metric aggregations like sum cannot take sub-aggregations, so sum_stats and distinct_user are siblings under the terms aggregation.)

Queries 2-6 are similar; use the aggregations from 1) with different filters, like this:

{
  "aggs": {
    "your_filter": {
      "filter": {
        "term": { "event_src_id": "XXXXXXXXXXX" }
      },
      "aggs": {
        "website": {
          "terms": {
            "field": "website"
          },
          "aggs": {
            "sum_stats": {
              "sum": {
                "field": "stats"
              }
            },
            "distinct_user": {
              "cardinality": {
                "field": "user_id",
                "precision_threshold": 100
              }
            }
          }
        }
      }
    }
  }
}
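
For query 4, a sketch in the same pattern with a date range filter instead (untested; it assumes the Time field is mapped as a date, and the aggregation name jan_2014 is arbitrary):

{
  "aggs": {
    "jan_2014": {
      "filter": {
        "range": {
          "Time": {
            "gt": "2014-01-01 00:00:00",
            "lte": "2014-01-31 00:00:00"
          }
        }
      },
      "aggs": {
        "website": {
          "terms": {
            "field": "website"
          },
          "aggs": {
            "sum_stats": {
              "sum": {
                "field": "stats"
              }
            },
            "distinct_user": {
              "cardinality": {
                "field": "user_id",
                "precision_threshold": 100
              }
            }
          }
        }
      }
    }
  }
}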

7) DELETE is quite easy with the delete-by-query API; the request body is:

    "query" : {
        "term" : { "event_src_id" : "XXXXXXXXXXX" }
    }
}
Radek Tomšej (edited by BenG)
  • Yes, using HDFS seems right, but can any of these be used for realtime? – maaz Sep 26 '14 at 10:22
  • Currently it is failing with AggregationInitializationException[Aggregator [sum_stats] of type [sum] cannot accept sub-aggregations] – maaz Sep 26 '14 at 10:22
  • Just switch the sum agg (sum_stats) with distinct_user, like this: `{ "aggs": { "website": { "terms": { "field": "website" }, "aggs": { "distinct_user": { "cardinality": { "field": "user_id", "precision_threshold": 100 }, "aggs": { "sum_stats": { "sum": { "field": "stats" } } } } } } } }` – Radek Tomšej Sep 26 '14 at 10:39
  • You can use HDFS for realtime. You are right, it is not ideal, but it is possible. I am also considering a similar architecture to yours - Logstash puts data into Elasticsearch, then visualize it in Kibana/Elasticsearch, and after that batch-download the data from ES into HDFS for OLAP purposes. Here is an interesting slideshare about near-realtime HDFS - http://www.slideshare.net/tobym/data-pipeline-at-tapad. This topic is truly interesting to me, so if you want to discuss it deeper you can write me an email. – Radek Tomšej Sep 26 '14 at 10:49
  • Nope, it is not giving sum_stats; also it is returning only 10 websites. How can I get all the websites (use pagination)? Cool, can you update your profile with an email so that I can email you? The stack used in the above slides is Vertica. – maaz Sep 26 '14 at 12:04
  • Just updated my profile :). To get more than 10 websites you need to change the [pagination](www.elasticsearch.org/guide/en/elasticsearch/reference/current/search-request-from-size.html). If you set it to 0 you should see all websites. Also use `_search?search_type=count` as the suffix of your POST method to see only aggregations. I think the reason you are not getting sum_stats is that the stats field in your ES index is not in a numeric format. – Radek Tomšej Sep 26 '14 at 14:04
  • Great, can you update your answer with a pagination example so that I can mark it? – maaz Sep 28 '14 at 12:31
  • Here you are `{ "from": 0, "size": 0, "aggs": { "website_stats": { "terms": { "field": "website" }, "aggs": { "count_users": { "cardinality": { "field": "user_id" } }, "sum_stats": { "sum": { "field": "stats" } } } } } }` – Radek Tomšej Sep 28 '14 at 14:33
  • Nope, not working; pagination is for hits, not for aggregations. Is there some other param required? – maaz Sep 28 '14 at 14:55
  • Oh, you are right. You should also add `"size": 0` to the terms agg: `{ "from": 0, "size": 0, "aggs": { "website_stats": { "terms": { "field": "website", "size": 0 }, "aggs": { "count_users": { "cardinality": { "field": "user_id" } }, "sum_stats": { "sum": { "field": "stats" } } } } } }` – Radek Tomšej Sep 28 '14 at 15:07
  • Thanks for the edits! Radek is correct, there is no pagination support for aggregations; using `"size": 0` will ensure all data is returned. – BenG Sep 29 '14 at 16:22

how to group and sum and distinct at same time

Aggregations can have sub-aggregations.

First, the group functionality corresponds to the terms aggregation and (sometimes) the top_hits aggregation. Second, there is a sum aggregation, a simple stats metric aggregation. Finally, your use of distinct in this case is to perform count(distinct), which corresponds to a cardinality aggregation; that can be approximate or exact depending on your needs.

7) delete from table where event_src_id=XXXXXXXXXXX; (may delete all 96M records)

There is a delete-by-query API which you can use, but be careful about high percentages of deleted documents; Lucene and Elasticsearch are not optimized for this, and you will incur overhead from delete markers in the data.
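
A sketch of the request (untested; it assumes Elasticsearch 1.x, where delete-by-query is part of the core API, and a hypothetical index named events):

DELETE /events/_query
{
  "query": {
    "term": { "event_src_id": "XXXXXXXXXXX" }
  }
}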

Examples

select website, sum(stats), count(distinct(user_id)) from table group by website

GET /_search
{
  "aggs": {
    "website_stats": {
      "terms": {
        "field": "website"
      },
      "aggs": {
        "sum_stats": {
          "sum": {
            "field": "stats"
          }
        },
        "count_users": {
          "cardinality": {
            "field": "user_id"
          }
        }
      }
    }
  }
}

select website, sum(stats), count(distinct(user_id)) from table where YEAR(Time) = 2009 group by website, MONTH(Time)

GET /_search
{
  "query": {
    "filtered": {
      "filter": {
        "range": {
          "Time": {
            "gte": "2009-01-01 00:00:00",
            "lt": "2010-01-01 00:00:00"
          }
        }
      }
    }
  },
  "aggs": {
    "monthly_stats": {
      "terms": {
        "field": "website"
      },
      "aggs": {
        "months": {
          "date_histogram": {
            "field": "Time",
            "interval": "month"
          },
          "aggs": {
            "sum_stats": {
              "sum": {
                "field": "stats"
              }
            },
            "count_users": {
              "cardinality": {
                "field": "user_id"
              }
            }
          }
        }
      }
    }
  }
}
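
For query 6 (the top 10 websites by total stats), the terms aggregation can be ordered by a metric sub-aggregation. A sketch in the same style (untested; top_websites is an arbitrary name, and stats_val mirrors the SQL alias):

GET /_search
{
  "aggs": {
    "top_websites": {
      "terms": {
        "field": "website",
        "size": 10,
        "order": { "stats_val": "desc" }
      },
      "aggs": {
        "stats_val": {
          "sum": {
            "field": "stats"
          }
        }
      }
    }
  }
}

The same shape with "field": "location" covers the second half of query 6.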
BenG
  • For completeness, can you share some example queries for 1 to 7? – maaz Sep 23 '14 at 18:32
  • added two examples -- hopefully this is enough to expand to the other queries – BenG Sep 24 '14 at 16:19
  • failing with AggregationInitializationException[Aggregator [sum_stats] of type [sum] cannot accept sub-aggregations]; – maaz Sep 26 '14 at 10:20
  • hmmm "sum_stats" does not have sub-aggregations, "months" has two sub-aggregations: "sum_stats" and "count_users" – BenG Sep 26 '14 at 15:31
  • I think it was missing a name for the top level aggregation; see edited examples – BenG Sep 26 '14 at 15:34