I have a lot of audit logs coming from the Azure Databricks clusters I am managing. The logs are simple application audit logs in JSON format. They contain information about jobs, clusters, notebooks, etc., and you can see a sample record here:
{
    "TenantId": "<your tenant id>",
    "SourceSystem": "|Databricks|",
    "TimeGenerated": "2019-05-01T00:18:58Z",
    "ResourceId": "/SUBSCRIPTIONS/SUBSCRIPTION_ID/RESOURCEGROUPS/RESOURCE_GROUP/PROVIDERS/MICROSOFT.DATABRICKS/WORKSPACES/PAID-VNET-ADB-PORTAL",
    "OperationName": "Microsoft.Databricks/jobs/create",
    "OperationVersion": "1.0.0",
    "Category": "jobs",
    "Identity": {
        "email": "mail@contoso.com",
        "subjectName": null
    },
    "SourceIPAddress": "131.0.0.0",
    "LogId": "201b6d83-396a-4f3c-9dee-65c971ddeb2b",
    "ServiceName": "jobs",
    "UserAgent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.108 Safari/537.36",
    "SessionId": "webapp-cons-webapp-01exaj6u94682b1an89u7g166c",
    "ActionName": "create",
    "RequestId": "ServiceMain-206b2474f0620002",
    "Response": {
        "statusCode": 200,
        "result": "{\"job_id\":1}"
    },
    "RequestParams": {
        "name": "Untitled",
        "new_cluster": "{\"node_type_id\":\"Standard_DS3_v2\",\"spark_version\":\"5.2.x-scala2.11\",\"num_workers\":8,\"spark_conf\":{\"spark.databricks.delta.preview.enabled\":\"true\"},\"cluster_creator\":\"JOB_LAUNCHER\",\"spark_env_vars\":{\"PYSPARK_PYTHON\":\"/databricks/python3/bin/python3\"},\"enable_elastic_disk\":true}"
    },
    "Type": "DatabricksJobs"
}
At the moment, I am storing the logs in Elasticsearch and I was planning to use its Anomaly Detection tool on this type of log. Therefore, I do not need to implement any algorithm, but rather choose the right attribute, perform the right aggregation, or perhaps combine several attributes in a multivariate analysis. However, I am not familiar with this topic, nor do I have the background for it. I have read Anomaly Detection: A Survey by Chandola et al., which was quite useful in pointing me to the right sub-field.
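To make the question more concrete, this is roughly the kind of Elastic ML anomaly detection job I was imagining, written as a minimal Python sketch that sends the job definition to the _ml/anomaly_detectors endpoint. The job name, bucket span, detector choices and credentials are placeholders I made up, not something I have validated:

import requests

ES_URL = "http://localhost:9200"            # placeholder cluster URL
JOB_ID = "databricks-audit-actions"         # hypothetical job name

# Minimal job: count events per ActionName over time, and flag ActionName
# values that are rare for a given user (population analysis).
job_config = {
    "description": "Anomaly detection on Databricks audit logs (sketch)",
    "analysis_config": {
        "bucket_span": "15m",
        "detectors": [
            {"function": "count", "partition_field_name": "ActionName"},
            {"function": "rare", "by_field_name": "ActionName",
             "over_field_name": "Identity.email"},
        ],
        "influencers": ["Identity.email", "SourceIPAddress"],
    },
    "data_description": {"time_field": "TimeGenerated"},
}

resp = requests.put(
    f"{ES_URL}/_ml/anomaly_detectors/{JOB_ID}",
    json=job_config,
    auth=("elastic", "changeme"),            # placeholder credentials
)
print(resp.status_code, resp.json())

I am not sure whether "count per action" and "rare action per user" are the detectors that make sense here, which is essentially my question.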
So, I have understood that I am dealing with time series, and depending on the kind of aggregation I perform I might face collective anomalies on sequence data (e.g. the ActionName field of these logs) or contextual anomalies on sequence data.
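As an example of what I mean by sequence data, this is how I was thinking of extracting per-session action sequences before looking for collective anomalies. It is a minimal pandas sketch; the column names match the log fields above, but the file name and the bigram-rarity heuristic are just illustrations, not a method I am committed to:

import pandas as pd
from collections import Counter

# One flattened audit record per row, with at least the TimeGenerated,
# SessionId and ActionName fields shown above (hypothetical export file).
df = pd.read_json("databricks_audit_logs.jsonl", lines=True)
df["TimeGenerated"] = pd.to_datetime(df["TimeGenerated"])

# Build one ordered sequence of actions per session.
sequences = (
    df.sort_values("TimeGenerated")
      .groupby("SessionId")["ActionName"]
      .apply(list)
)

# Count action-to-action transitions (bigrams) across all sessions; a session
# containing a globally rare transition is a candidate collective anomaly.
bigram_counts = Counter(
    (a, b) for seq in sequences for a, b in zip(seq, seq[1:])
)

def rarest_transition(seq):
    """Return the frequency of the least common transition in a session."""
    pairs = list(zip(seq, seq[1:]))
    return min((bigram_counts[p] for p in pairs), default=None)

suspicious = sequences[sequences.map(rarest_transition).fillna(float("inf")) <= 2]
print(suspicious.head())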
I was wondering whether you can point me in the right direction, since I haven't managed to find any related work on anomaly detection for audit logs. More specifically, which kinds of anomalies should I investigate, and which kind of aggregation would be beneficial?
Please keep in mind that I have quite a large amount of data. Moreover, I would appreciate any kind of feedback, even if it doesn't involve Elasticsearch; feel free to propose a whole unsupervised machine learning method for this kind of anomaly detection scenario rather than a simpler Elasticsearch use case.
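For completeness, this is the sort of fully unsupervised baseline I would otherwise try outside Elasticsearch, shown as a sketch only. It assumes the logs are exported to a pandas DataFrame; the per-user, per-hour action counts and the Isolation Forest parameters are assumptions on my part, and I would be glad to hear whether this aggregation even makes sense:

import pandas as pd
from sklearn.ensemble import IsolationForest

df = pd.read_json("databricks_audit_logs.jsonl", lines=True)   # hypothetical export
df["TimeGenerated"] = pd.to_datetime(df["TimeGenerated"])
df["user"] = df["Identity"].map(lambda i: i.get("email"))

# Aggregate into fixed time windows per user: how many times each action
# was performed in each hour.  Each (user, hour) pair becomes one sample.
features = (
    df.groupby(["user", pd.Grouper(key="TimeGenerated", freq="1H"), "ActionName"])
      .size()
      .unstack("ActionName", fill_value=0)
)

# Unsupervised anomaly detection on the aggregated counts.
model = IsolationForest(n_estimators=200, contamination=0.01, random_state=0)
features["anomaly_flag"] = model.fit_predict(features)

# -1 marks (user, hour) windows whose mix of actions looks unusual.
print(features[features["anomaly_flag"] == -1].head())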