
I have a log file that contains playerId values; some players have multiple entries in the file. I want an exact distinct count of unique players, regardless of whether they have one or many entries in the log file.

Using the query below, Insights scans 497 records and returns 346 unique rows (346 is the number I want). Query:

fields @timestamp, @message
| sort @timestamp desc
| filter @message like /(playerId)/ 
| parse @message "\"playerId\": \"*\"" as playerId
| stats count(playerId) as CT by playerId

(Result: 497 records scanned, 346 rows returned)

If I change my query to use count_distinct instead, I get exactly the number I want. Example below:

fields @timestamp, @message
| sort @timestamp desc
| filter @message like /(playerId)/ 
| parse @message "\"playerId\": \"*\"" as playerId
| stats count_distinct(playerId) as CT 

(Result with the count_distinct function)

The problem with count_distinct, however, is that as the query expands to a larger timeframe and more records, the number of entries gets into the thousands or tens of thousands. At that point the numbers become approximations, due to the documented behaviour of Insights' count_distinct:

"Returns the number of unique values for the field. If the field has very high cardinality (contains many unique values), the value returned by count_distinct is just an approximation.".

Docs: https://docs.aws.amazon.com/AmazonCloudWatch/latest/logs/CWL_QuerySyntax.html
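For context on what "exact" means here: once the raw log lines are available outside Insights, an exact distinct count is straightforward. The sketch below (sample messages and values are illustrative, not from the actual logs) mirrors the query's parse pattern with a regex and counts unique IDs with a set:

```python
import re

# Regex equivalent of the Insights parse pattern "\"playerId\": \"*\""
PLAYER_ID = re.compile(r'"playerId": "([^"]+)"')

def exact_distinct_players(messages):
    """Return the exact number of unique playerId values in the messages."""
    seen = set()
    for message in messages:
        match = PLAYER_ID.search(message)
        if match:
            seen.add(match.group(1))
    return len(seen)

# Illustrative sample: three log lines, two unique players
sample = [
    '{"playerId": "aaa-111", "event": "login"}',
    '{"playerId": "bbb-222", "event": "login"}',
    '{"playerId": "aaa-111", "event": "logout"}',
]
print(exact_distinct_players(sample))  # 2
```

A set gives an exact count at the cost of holding every unique ID in memory, which is precisely the trade-off count_distinct avoids by approximating at high cardinality.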

This is not acceptable, as I require exact numbers. Playing with the query a little, I believe sticking with count() rather than count_distinct() is the answer, but I've not been able to reduce the result to a single number. Below is an example that does not work. Any thoughts?

Ex 1:

fields @timestamp, @message
| sort @timestamp desc
| filter @message like /(playerId)/ 
| parse @message "\"playerId\": \"*\"" as playerId
| stats count(playerId) as CT by playerId
| stats count(*) 

This fails with the error: "We are having trouble understanding the query."

To be clear, I'm looking for an exact count to be returned in a single row showing the number.
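One workaround (my own suggestion, not from the thread): keep the exact per-player query (`stats count(playerId) as CT by playerId`) and do the final `count(*)` step outside Insights. With boto3, `get_query_results` returns the rows as a list, and since each row corresponds to exactly one playerId, the distinct count is just the length of that list. A sketch using a hypothetical payload shaped like the API response:

```python
def distinct_count_from_results(results):
    """Count rows returned by a 'stats ... by playerId' query.

    Each row corresponds to exactly one playerId, so the number
    of rows is the exact distinct-player count.
    """
    return len(results)

# Hypothetical payload in the shape returned by
# logs_client.get_query_results(queryId=...)["results"]
sample_results = [
    [{"field": "playerId", "value": "aaa-111"}, {"field": "CT", "value": "2"}],
    [{"field": "playerId", "value": "bbb-222"}, {"field": "CT", "value": "1"}],
    [{"field": "playerId", "value": "ccc-333"}, {"field": "CT", "value": "4"}],
]
print(distinct_count_from_results(sample_results))  # 3
```

One caveat: Insights caps the number of rows a single query returns, so a very high-cardinality result set may need to be split across smaller time ranges and the partial counts merged.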

ElasticThoughts
1 Answer


What if we introduce a dummy field that's hardcoded to "1"? The idea is to take its min value so that it stays "1" even if the same playerId occurs more than once, and then sum that field across players.

The log entry might look like this:

[1]"playerId": "1b45b168-00ed-42fe-a977-a8553440fe1a"

Query:

fields @timestamp, @message
| sort @timestamp desc
| filter @message like /(playerId)/ 
| parse @message "[*]\"playerId\": \"*\"" as dummyValue, playerId
| stats sum(min(dummyValue)) as CT by playerId
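The min/sum idea can be sanity-checked outside Insights. This Python sketch (sample data invented) groups records by playerId, takes the min of the hardcoded dummy value 1 per player, and sums across players, which reduces to the exact distinct count:

```python
from collections import defaultdict

def sum_of_min_dummy(records):
    """Emulate: min(dummyValue) per playerId, then sum across players."""
    min_per_player = defaultdict(lambda: float("inf"))
    for player_id, dummy in records:
        min_per_player[player_id] = min(min_per_player[player_id], dummy)
    return sum(min_per_player.values())

# Every record carries the hardcoded dummy value 1; two unique players
records = [("aaa-111", 1), ("bbb-222", 1), ("aaa-111", 1)]
print(sum_of_min_dummy(records))  # 2, the exact distinct count
```

Because min(1) over any group is always 1, summing the per-player minimums counts each player exactly once, which is the effect the dummy field is meant to achieve.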


Marc Lopez
  • Deleting my previous comment. That might work; however, the logs do not have data in them that I can key off of, as in your example [1]. So, without that data, your example would not work in my situation. – ElasticThoughts Mar 13 '20 at 15:34