
I have a large number of audit logs coming from the Azure Databricks clusters I am managing. They are simple application audit logs in JSON format, containing information about jobs, clusters, notebooks, etc. A sample record looks like this:

{
    "TenantId": "<your tenant id",
    "SourceSystem": "|Databricks|",
    "TimeGenerated": "2019-05-01T00:18:58Z",
    "ResourceId": "/SUBSCRIPTIONS/SUBSCRIPTION_ID/RESOURCEGROUPS/RESOURCE_GROUP/PROVIDERS/MICROSOFT.DATABRICKS/WORKSPACES/PAID-VNET-ADB-PORTAL",
    "OperationName": "Microsoft.Databricks/jobs/create",
    "OperationVersion": "1.0.0",
    "Category": "jobs",
    "Identity": {
        "email": "mail@contoso.com",
        "subjectName": null
    },
    "SourceIPAddress": "131.0.0.0",
    "LogId": "201b6d83-396a-4f3c-9dee-65c971ddeb2b",
    "ServiceName": "jobs",
    "UserAgent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.108 Safari/537.36",
    "SessionId": "webapp-cons-webapp-01exaj6u94682b1an89u7g166c",
    "ActionName": "create",
    "RequestId": "ServiceMain-206b2474f0620002",
    "Response": {
        "statusCode": 200,
        "result": "{\"job_id\":1}"
    },
    "RequestParams": {
        "name": "Untitled",
        "new_cluster": "{\"node_type_id\":\"Standard_DS3_v2\",\"spark_version\":\"5.2.x-scala2.11\",\"num_workers\":8,\"spark_conf\":{\"spark.databricks.delta.preview.enabled\":\"true\"},\"cluster_creator\":\"JOB_LAUNCHER\",\"spark_env_vars\":{\"PYSPARK_PYTHON\":\"/databricks/python3/bin/python3\"},\"enable_elastic_disk\":true}"
    },
    "Type": "DatabricksJobs"
}

At the moment I am storing the logs in Elasticsearch, and I was planning to use their Anomaly Detection tool on this type of log. Therefore, I do not need to implement any algorithm, but rather to choose the right attribute, perform the right aggregation, or maybe combine several attributes in a multivariate analysis. However, I am not familiar with this topic, nor do I have a background in it. I have read "Anomaly Detection: A Survey" by Chandola et al., which was quite useful for pointing me to the right sub-field. So, I have understood that I am dealing with time series, and that depending on the kind of aggregation I perform I might face collective anomalies on sequence data (e.g. the ActionName field of these logs) or contextual anomalies on sequence data.
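To make the aggregation idea concrete, here is a hypothetical sketch of what I have in mind (the index pattern, field names, and interval are my own assumptions, not a working setup): bucket the logs into fixed time windows with a date_histogram and count occurrences of each ActionName, which yields one count time series per action.

# Hypothetical sketch: turn raw audit events into per-action count time series.
# The index pattern, field names, and interval are assumptions for illustration.
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

query = {
    "size": 0,
    "aggs": {
        "per_interval": {
            "date_histogram": {"field": "TimeGenerated", "fixed_interval": "15m"},
            "aggs": {
                "per_action": {"terms": {"field": "ActionName.keyword"}}
            },
        }
    },
}

response = es.search(index="databricks-audit-*", body=query)

# Flatten the nested buckets into (timestamp, action, count) rows.
for window in response["aggregations"]["per_interval"]["buckets"]:
    for action in window["per_action"]["buckets"]:
        print(window["key_as_string"], action["key"], action["doc_count"])

A series like this is what I would then feed into an anomaly detection job, e.g. one detector per ActionName.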

I was wondering whether you could point me in the right direction, since I haven't managed to find any related work on anomaly detection for audit logs. More specifically: what kinds of anomalies should I investigate, and what kind of aggregation would be beneficial?

Please keep in mind that I have quite a large amount of data. Moreover, I would appreciate any kind of feedback, even if it doesn't involve Elasticsearch; feel free to propose a whole unsupervised machine learning method for this kind of anomaly detection scenario rather than a simpler Elasticsearch use case.
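To clarify what I mean by a whole unsupervised method, here is a minimal sketch of the kind of pipeline I could imagine, assuming I first extract simple per-user, per-window aggregates from the logs (the feature columns and values below are made up for illustration): fit scikit-learn's IsolationForest on the feature vectors and flag the most isolated windows as anomalous.

# Minimal sketch of an unsupervised alternative: score per-user, per-window
# feature vectors with an Isolation Forest. The columns are hypothetical
# aggregates that could be computed from the audit logs, not a fixed schema.
import pandas as pd
from sklearn.ensemble import IsolationForest

# One row per (user, 15-minute window); values are made up for illustration.
features = pd.DataFrame({
    "event_count":      [12, 9, 300, 11],  # total audit events in the window
    "distinct_actions": [3, 2, 14, 3],     # unique ActionName values
    "distinct_ips":     [1, 1, 5, 1],      # unique SourceIPAddress values
    "error_responses":  [0, 0, 40, 1],     # responses with statusCode >= 400
})

model = IsolationForest(n_estimators=100, contamination="auto", random_state=0)
labels = model.fit_predict(features)          # -1 flags outlier windows
scores = model.decision_function(features)    # lower scores = more anomalous

features["is_anomaly"] = labels
features["score"] = scores
print(features)

Whether something like this makes sense for audit logs, or whether the sequence-based view (collective/contextual anomalies) is the better framing, is exactly what I am unsure about.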

dadadima
