
I have indexed Twitter data in Elasticsearch (ES). There are 110 M unique Twitter user profiles and 650 M tweets. They are stored in separate indices: profiles in (index: twitter-profiles, type: profiles) and tweets in (index: twitter-tweets, type: tweets).

The user_id_str of the profile is attached to every tweet.

I am running into a problem getting the occurrence count for a specific user. I tried both a terms facet and a terms aggregation, but both fail with a PartialShardFailureException because there is too much data to process. I used the following query:

{
  "aggs": {
    "userCount": {
      "terms": { "field": "user_id_str" }
    }
  }
}

Then I tried another approach.

The second method I used was scan. Here I get the profile IDs from the profiles type and then search for each one in the tweets type. It gives me results, but a single result takes about 2 seconds. With 110 M users, that means I would have to wait for days.
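For illustration, the per-user lookup described above can be sketched as a request body that matches only one user's tweets. This is a minimal sketch, not the exact query I ran: the helper name `user_tweet_count_body` and the sample ID are made up, and it assumes `user_id_str` is indexed as not_analyzed so that a `term` filter matches the exact value.

```python
import json


def user_tweet_count_body(user_id):
    """Build a request body that matches only the tweets of one user.

    Hypothetical helper: a term filter wrapped in constant_score skips
    relevance scoring, which is not needed when only a count is wanted.
    """
    return {
        "query": {
            "constant_score": {
                "filter": {"term": {"user_id_str": user_id}}
            }
        }
    }


# "12345" is a made-up user ID used only for this example.
body = user_tweet_count_body("12345")
print(json.dumps(body, indent=2))
```

Sending a body like this to the `_count` endpoint of the twitter-tweets index should return the number of tweets for that one user, but it still costs one request per user, which is where the time goes.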

Please suggest a reasonable solution for this situation.

Sohail Ahmed
  • What is the mapping? Did you use not_analyzed on the mentioned field? How many shards do you use? How many nodes? – Jettro Coenradie Aug 10 '14 at 08:55
  • Yes, the field I am looking at is not_analyzed. There are 6 shards and three nodes running on Amazon EC2 servers. – Sohail Ahmed Aug 11 '14 at 05:32
  • Scan is used to go through all the data without sorting or scoring. If you want to aggregate over all the users (110 M), you can run into a memory problem. More shards and more nodes with more memory could be an option. Maybe try with a more limited dataset first and see the results then. – Jettro Coenradie Aug 11 '14 at 21:58

1 Answer


You could use a cardinality aggregation in combination with a term filter.
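One possible reading of this, as a rough sketch only: a term filter narrows the set of tweets, and a cardinality aggregation returns an approximate count of distinct `user_id_str` values in that set. The helper name `distinct_users_body`, the aggregation name `distinct_users`, and the `lang` field in the example are assumptions made for illustration; only `user_id_str` comes from the question.

```python
import json


def distinct_users_body(filter_field=None, filter_value=None,
                        precision_threshold=1000):
    """Build a search body with a cardinality aggregation on user_id_str.

    Hypothetical helper. The cardinality aggregation returns an
    approximate distinct count; precision_threshold trades memory
    for accuracy.
    """
    body = {
        "size": 0,  # no hits needed, only the aggregation result
        "aggs": {
            "distinct_users": {
                "cardinality": {
                    "field": "user_id_str",
                    "precision_threshold": precision_threshold,
                }
            }
        },
    }
    if filter_field is not None:
        # Optional term filter that restricts which tweets are aggregated.
        body["query"] = {
            "constant_score": {
                "filter": {"term": {filter_field: filter_value}}
            }
        }
    return body


# "lang" is an assumed field name, not taken from the question's mapping.
body = distinct_users_body("lang", "en")
print(json.dumps(body, indent=2))
```

Note that cardinality reports how many distinct users appear, not how many tweets each user has. If the term filter is placed on `user_id_str` itself, the total hit count of the response is the tweet count for that user.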

wahhzu