I have a log file which contains playerId values, some players have multiple entries in the file. I want to get an exact distinct count of to unique players, regardless of if they have 1 or multiple entries in the log file.
Using the query below it scans 497 records and finds 346 unique rows (346 is the number I want) Query:
fields @timestamp, @message
| sort @timestamp desc
| filter @message like /(playerId)/
| parse @message "\"playerId\": \"*\"" as playerId
| stats count(playerId) as CT by playerId
If I change my query to use count_distinct instead, I get exactly what I want. Example below:
fields @timestamp, @message
| sort @timestamp desc
| filter @message like /(playerId)/
| parse @message "\"playerId\": \"*\"" as playerId
| stats count_distinct(playerId) as CT
The problem with count_distinct however is that as the query expands to a larger timeframe/more records the number of entries get into the thousands, and tens of thousands. This presents an issue as the numbers become approximations, due to the nature of Insights count_distinct behaviour...
"Returns the number of unique values for the field. If the field has very high cardinality (contains many unique values), the value returned by count_distinct is just an approximation.".
Docs: https://docs.aws.amazon.com/AmazonCloudWatch/latest/logs/CWL_QuerySyntax.html
This is not acceptable, as I require exact numbers. Playing with the query a little, and sticking with count(), not count_distinct() I believe is the answer, however I've not been able to come to a single number... Examples which do not work... Any thoughts?
Ex 1:
fields @timestamp, @message
| sort @timestamp desc
| filter @message like /(playerId)/
| parse @message "\"playerId\": \"*\"" as playerId
| stats count(playerId) as CT by playerId
| stats count(*)
We are having trouble understanding the query.
To be clear, I'm looking for an exact count to be returned in a single row showing the number.