I have a Python server listening for POST requests from an external server. I expect two JSON documents for every incident that happens on the external server. One of the fields in the JSON documents is a unique_key, which can be used to identify that the two documents belong together. Upon receiving the JSON documents, my Python server indexes them into Elasticsearch. The two documents related to an incident are indexed in Elasticsearch as follows.

/my_index/doc_type/doc_1

/my_index/doc_type/doc_2

That is, the documents belong to the same index and have the same document type, but I don't have an easy way to know that the two documents are related. I want to do some processing before inserting into Elasticsearch, at the point where I can use the unique_key on the two documents to link them. What are your thoughts on doing some normalization across the two documents and merging them into a single JSON document? Keep in mind that I will be receiving a large number of such documents per second, so I need some temporary storage to store and process the JSON documents. Can someone suggest an approach to this problem?

As requested, I am adding the basic structure of the JSON documents here.

json_1

{
    "msg": "0",
    "tdxy": "1",
    "data": {
        "Metric": "true",
        "Severity": "warn",
        "Message": {
            "Session": "None",
            "TransId": "myserver.com-14d9e013794",
            "TransName": "dashboard.action",
            "Time": 0,
            "Code": 0,
            "CPUs": 8,
            "Lang": "en-GB",
            "Event": "false"
        },
        "EventTimestamp": "1433192761097"
    },
    "Timestamp": "1433732801097",
    "Host": "myserver.myspace.com",
    "Group": "UndefinedGroup"
}

json_2

{
    "Message": "Hello World",
    "Session": "4B5ABE9B135B7EHD49343865C83AD9E079",
    "TransId": "myserver.com-14d9e013794",
    "TransName": "dashboard.action",
    "points": [
        {
            "Name": "service.myserver.com:9065",
            "Host": "myserver.com",
            "Port": "9065"
        }
    ],
    "Points Operations": 1,
    "Points Exceeded": 0,
    "HEADER.connection": "Keep-Alive",
    "PARAMETER._": "1432875392706"
}

I have updated the code as per the suggestion.

if rx_buffer:
    txid = json.loads(rx_buffer)['TransId']
    if condition_1:
        res = es.index(index='its', doc_type='vents', id=txid, body=rx_buffer)
        print(res['created'])
    elif condition_2:
        res = es.update(index='its', doc_type='vents', id=txid, body={"f_vent":{"b_vent":rx_buffer}})

I get the following error.

 File "/usr/lib/python2.7/site-packages/elasticsearch/transport.py", line 307, in perform_request
    status, headers, data = connection.perform_request(method, url, params, body, ignore=ignore, timeout=timeout)
  File "/usr/lib/python2.7/site-packages/elasticsearch/connection/http_urllib3.py", line 89, in perform_request
    self._raise_error(response.status, raw_data)
  File "/usr/lib/python2.7/site-packages/elasticsearch/connection/base.py", line 105, in _raise_error
    raise HTTP_EXCEPTIONS.get(status_code, TransportError)(status_code, error_message, additional_info)
RequestError: TransportError(400, u'ActionRequestValidationException[Validation Failed: 1: script or doc is missing;]')
liv2hak
  • Are those two documents similar in structure (since they have the same mapping type) or do they carry different fields and are just related by the `unique_key`? Have you thought about using [`parent/child`](https://www.elastic.co/guide/en/elasticsearch/guide/current/parent-child.html) relationships and making one document the parent of the other? – Val May 31 '15 at 04:47
  • @Val - These documents carry different fields. They are essentially related by the `unique_key`. I haven't looked into the parent/child relationships. Will look into it now. – liv2hak May 31 '15 at 06:26
  • Ok, then I'd say that if they carry different fields, you have three options: 1) have two different mapping types and use parent/child relationship, 2) merge both together into a single document or 3) embed one document into another using [`nested` objects](https://www.elastic.co/guide/en/elasticsearch/reference/current/mapping-nested-type.html). If you can show some sample documents, we can better decide which option is best for you. – Val May 31 '15 at 06:56
  • @Val - Can I use nested objects if there is a delay of, say, `1 minute` between receiving the two JSON documents? What I am asking is: can I update a nested section inside a document already in Elasticsearch and keep my server stateless? – liv2hak Jun 02 '15 at 22:58
  • Yes, it's definitely possible to do it. You index the first document when it comes along and then you update it with the `my_index/doc_type/unique_key/_update` endpoint and add/modify the nested field containing the second document. – Val Jun 03 '15 at 02:22
  • @Val - I don't have the unique key `document_id`. I am using XPOST. – liv2hak Jun 03 '15 at 02:24
  • At this point you should show a bit more of your two documents. Please update your post with anything relevant so that we can better help you. – Val Jun 03 '15 at 02:26
  • @Val - I have added the skeletal structure of the two JSON documents so that you can get a better idea. – liv2hak Jun 03 '15 at 02:38
  • Thanks, so you're receiving `json1` first and then `json2` minutes later, correct? I guess your `unique_key` is `"TransId": "myserver.com-14d9e013794"` (i.e. the only field both documents have in common)? – Val Jun 03 '15 at 02:43
  • @Val - Yes. The time between the two documents is around 1 minute (60 secs). Yes, `TransId` is the `unique_key`. There are a couple of other fields that are shared, but I would like to preserve the structure as is. – liv2hak Jun 03 '15 at 02:45

1 Answer

The code below makes the assumption you're using the official elasticsearch-py library, but it's easy to transpose the code to another library.

We'd also probably need to create a specific mapping for your assembled document of type doc_type, but it heavily depends on how you want to query it later on.

Anyway, based on our discussion above, I would index json1 first:

from elasticsearch import Elasticsearch
es_client = Elasticsearch(hosts=[{"host": "localhost", "port": 9200}])

json1 = { ...JSON of the first document you've received... }

# extract the unique ID (the key is spelled "TransId" in the sample documents)
# note: you might want to only take 14d9e013794 and ditch "myserver.com-" if that prefix is always constant
doc_id = json1['data']['Message']['TransId']

# index the first document
es_client.index(index="my_index", doc_type="doc_type", id=doc_id, body=json1)
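
The ID extraction can be wrapped in a small helper that handles both document shapes from the question. This is only a sketch: `extract_doc_id` and its `strip_prefix` parameter are hypothetical names, and it assumes the constant prefix is everything up to the last hyphen, as in the sample `TransId` value.

```python
def extract_doc_id(doc, strip_prefix=False):
    """Return the TransId of either document shape (hypothetical helper)."""
    if "TransId" in doc:
        # shape of the second document: TransId sits at the top level
        trans_id = doc["TransId"]
    else:
        # shape of the first document: TransId sits under data.Message
        trans_id = doc["data"]["Message"]["TransId"]
    if strip_prefix:
        # assumption: the prefix ends at the last hyphen
        trans_id = trans_id.rpartition("-")[2]
    return trans_id


first = {"data": {"Message": {"TransId": "myserver.com-14d9e013794"}}}
second = {"TransId": "myserver.com-14d9e013794", "Message": "Hello World"}

full_id = extract_doc_id(first)
short_id = extract_doc_id(second, strip_prefix=True)
```

Whichever form you choose, use the same one for both documents so that the index call and the later update target the same document id.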

At this point json1 is stored in Elasticsearch. Then, when you later get your second document json2 you can proceed like this:

json2 = { ...JSON of the second document you've received... }

# extract the unique ID
# note: same remark about keeping only the second part of the id
doc_id = json2['TransId']

# make a partial update of your first document
es_client.update(index="my_index", doc_type="doc_type", id=doc_id, body={"doc": {"SecondDoc": json2}})

Note that SecondDoc can be any name of your choosing here; it's simply a nested field that will contain your second document.
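
One caveat with this flow is that a partial update fails when the target document has not been indexed yet. A possible way to make the two steps order-independent, sketched below, is to send both documents as partial updates with Elasticsearch's `doc_as_upsert` option, so that whichever document arrives first creates the entry. The helper names `first_doc_body` and `second_doc_body` are made up for illustration, and the `update` calls are left as comments because they need a running cluster.

```python
def first_doc_body(json1):
    """Update body that merges json1 into the document, creating it if absent."""
    return {"doc": json1, "doc_as_upsert": True}


def second_doc_body(json2, field="SecondDoc"):
    """Update body that stores json2 under `field`, creating the document if absent."""
    return {"doc": {field: json2}, "doc_as_upsert": True}


json2 = {"TransId": "myserver.com-14d9e013794", "Message": "Hello World"}
body = second_doc_body(json2)

# With a client as created earlier, the call would look like:
# es_client.update(index="my_index", doc_type="doc_type", id=json2["TransId"], body=body)
```

If the first document is still written with `index()` instead, it replaces the whole stored document, so this variant only helps when both sides use `update()`.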

At this point you should have a single document with the id myserver.com-14d9e013794 (or 14d9e013794 if you strip the constant prefix) and the following content:

{
  "msg": "0",
  "tdxy": "1",
  "data": {
    "Metric": "true",
    "Severity": "warn",
    "Message": {
      "Session": "None",
      "TransId": "myserver.com-14d9e013794",
      "TransName": "dashboard.action",
      "Time": 0,
      "Code": 0,
      "CPUs": 8,
      "Lang": "en-GB",
      "Event": "false"
    },
    "EventTimestamp": "1433192761097"
  },
  "Timestamp": "1433732801097",
  "Host": "myserver.myspace.com",
  "Group": "UndefinedGroup",
  "SecondDoc": {
    "Message": "Hello World",
    "Session": "4B5ABE9B135B7EHD49343865C83AD9E079",
    "TransId": "myserver.com-14d9e013794",
    "TransName": "dashboard.action",
    "points": [
      {
        "Name": "service.myserver.com:9065",
        "Host": "myserver.com",
        "Port": "9065"
      }
    ],
    "Points Operations": 1,
    "Points Exceeded": 0,
    "HEADER.connection": "Keep-Alive",
    "PARAMETER._": "1432875392706"
  }
}

Of course, you can do any processing on json1 and json2 before indexing/updating them.
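
To preview or unit-test the assembled document without touching Elasticsearch, the merge can be reproduced locally. The sketch below assumes the second document is stored under a new top-level field, as in the update call above; `assemble` is a hypothetical helper, not part of elasticsearch-py.

```python
import copy


def assemble(json1, json2, field="SecondDoc"):
    """Return a new dict equal to json1 with json2 embedded under `field`."""
    merged = copy.deepcopy(json1)          # leave the inputs untouched
    merged[field] = copy.deepcopy(json2)
    return merged


json1 = {
    "msg": "0",
    "data": {"Message": {"TransId": "myserver.com-14d9e013794"}},
    "Host": "myserver.myspace.com",
}
json2 = {
    "Message": "Hello World",
    "TransId": "myserver.com-14d9e013794",
    "points": [{"Host": "myserver.com", "Port": "9065"}],
}

merged = assemble(json1, json2)
```

Any normalization (renaming fields, converting the timestamp strings to numbers, and so on) can be applied to `json1` and `json2` before calling such a helper.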

Val
  • Thanks. You are a life saver :) – liv2hak Jun 03 '15 at 03:18
  • BTW, I am mainly planning to represent the data graphically using Kibana. – liv2hak Jun 03 '15 at 03:20
  • Gotcha, glad to help. Feel free to create new questions if you encounter issues with your new document. – Val Jun 03 '15 at 03:24
  • It seems es.update() works only if a document is already indexed; otherwise it doesn't create the document? – liv2hak Jun 04 '15 at 00:12
  • Correct, [`es.update()`](http://elasticsearch-py.readthedocs.org/en/master/api.html#elasticsearch.Elasticsearch.update) is only for doing a partial update of an existing document. If you need to create a new document or update a whole document you can use [`es.index()`](http://elasticsearch-py.readthedocs.org/en/master/api.html#elasticsearch.Elasticsearch.index) – Val Jun 04 '15 at 04:54
  • Can you please look at the updated question above? I am getting an error when I use es.update(). – liv2hak Jun 05 '15 at 03:31
  • In your update call the `body` parameter is not correct, here is the correct call: `es.update(index='its', doc_type='vents', id=txid, body={"doc":{"f_vent":{"b_vent":rx_buffer}}})` – Val Jun 05 '15 at 04:21
  • What are "doc", "f_vent" and "b_vent" here? I am trying to understand how the JSON is being encoded. If I use the above es.update(), Kibana 4 does not let me plot graphs using the fields in b_vent. – liv2hak Jun 09 '15 at 03:16
  • In your question, you had `res = es.update(index='its', doc_type='vents', id=txid, body={"f_vent":{"b_vent":rx_buffer}})` and I simply fixed it by adding the `"doc":{...}` part for the `es.update` to work correctly. – Val Jun 09 '15 at 03:24
  • You probably need to ask another question for this specific issue, the original issue has been answered and we're just cluttering it. – Val Jun 09 '15 at 03:47
  • Done: http://stackoverflow.com/questions/30722703/error-while-updating-a-document-in-elasticsearch-using-python-es-update. I will remove the details from this question. – liv2hak Jun 09 '15 at 03:53
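
For reference, the corrected update body from the comments can be sketched as follows. Besides wrapping the partial document in `"doc"`, this sketch parses `rx_buffer` with `json.loads` before embedding it. That second part is an assumption, not something confirmed in the thread: if `rx_buffer` is the raw request body (a string), embedding it unparsed would store `b_vent` as a single JSON-encoded string rather than as an object with its own fields, which might explain why Kibana cannot plot those fields. `build_update_body` is a hypothetical helper name.

```python
import json


def build_update_body(rx_buffer):
    """Parse the raw JSON string and wrap it for a partial update (hypothetical helper)."""
    event = json.loads(rx_buffer)                   # dict, not a raw string
    body = {"doc": {"f_vent": {"b_vent": event}}}   # "doc" is required by the update API
    return event["TransId"], body


rx_buffer = '{"TransId": "myserver.com-14d9e013794", "Message": "Hello World"}'
txid, body = build_update_body(rx_buffer)

# With the client from the question, the call would then be:
# es.update(index='its', doc_type='vents', id=txid, body=body)
```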