
I am attempting to use python to pull a JSON array from a file and input it into ElasticSearch. The array looks as follows:

{"name": [["string1", 1, "string2"],["string3", 2, "string4"], ... (variable length) ... ["string n-1", 3, "string n"]]}

ElasticSearch throws a TransportError(400, mapper_parsing_exception, failed to parse) when attempting to index the array. I discovered that ElasticSearch sometimes throws the same error whenever I try to feed it an array containing both strings and integers. So, for example, the following will sometimes crash and sometimes succeed:

import json
from elasticsearch import Elasticsearch

es = Elasticsearch()

test = json.loads('{"test": ["a", 1, "b"]}')
print(test)
es.index(index="test-index", doc_type="test", body=test)

This code is everything I could safely comment out without breaking the program. I put the JSON in the program instead of having it read from a file. The actual strings I'm inputting are quite long (or else I would just post them) and will always crash the program. Changing the JSON to "test": ["a"] makes it work. The current setup crashes if the previous run crashed, and works if the previous run worked. What is going on? Will some sort of mapping setup fix this? I haven't figured out how to define a mapping for a variable-length array. I'd prefer to take advantage of the schema-less input, but I'll take whatever works.

Amit
snowfire257
  • Is this the actual code you are using? Because as is, this won't work without a valid connection config provided to `Elasticsearch`. For example: `Elasticsearch(['http://user:secret@localhost:9200/'])`. Also, you need to provide a `doc_type` and an `id` when you are calling `index`: `es.index(index, body=test, id='my_id', doc_type='things')` – idjaw Feb 27 '16 at 16:56
  • This actual code will return the exception I stated. The `Elasticsearch()` method defaults to localhost:9200, and I have a local instance of Elasticsearch running. The `doc_type` and `id` are also handled perfectly fine by the defaults. This code executes successfully when not handling an array with mixed types. – snowfire257 Feb 27 '16 at 17:07
  • Maybe this will help you. But this is running perfectly fine for me using your data structure: http://pastebin.com/EQRgrk8M – idjaw Feb 27 '16 at 17:12
  • Hmm, I appreciate your help, but that doesn't fix the array case from the beginning for me, nor the behavior of rejecting the single list case after the array case has failed. Maybe the problem is with my Elasticsearch? I'll reinstall it. – snowfire257 Feb 27 '16 at 17:26
  • Sorry, I can't replicate. I've been trying to with different data structures, and as long as I pass valid json, it works. – idjaw Feb 27 '16 at 17:38
  • Just for the sake of thoroughness, could you try deleting the database and then adding an array like I referenced as the first input into the new database? That's the case that gets me 100% failure. If your setup works then I'll assume something is wrong with one of my distributions. Again, I really appreciate your help. – snowfire257 Feb 27 '16 at 17:46
  • Done. and everything is working fine for me. But someone just posted an interesting explanation you might want to check out. – idjaw Feb 27 '16 at 17:51

1 Answer


It is possible you are running into type conflicts with your mapping. Since you have expressed a desire to stay "schema-less", I am assuming you have not explicitly provided a mapping for your index. That works fine; just recognize that the first document you index will determine the schema for your index. In each document you index afterwards, any field with the same name must conform to the same type it had in the first document.

Elasticsearch has no issues with arrays of values. In fact, under the hood it treats all values as arrays (with one or more entries). What is slightly concerning is the example array you chose, which mixes string and numeric types. Since each value in your array gets mapped to the field named "test", and that field may only have one type, if the first value of the first document ES processes is numeric, it will likely assign that field the long type. Future documents containing a string that does not parse nicely into a number will then cause an exception in Elasticsearch.
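You can see the competing types without a cluster at all, just by inspecting the parsed JSON from your example; a minimal sketch:

```python
import json

# The failing document from the question: one field, mixed value types.
doc = json.loads('{"test": ["a", 1, "b"]}')

# Every element of the array is indexed into the single field "test".
# Dynamic mapping assigns that field ONE type from the first value it
# sees; later values of a different type risk a mapper_parsing_exception.
types_seen = sorted({type(v).__name__ for v in doc["test"]})
print(types_seen)  # ['int', 'str'] -- two types competing for one field
```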

Have a look at the documentation on Dynamic Mapping.

It can be nice to go schema-less, but in your scenario you may have more success by explicitly declaring a mapping on your index for at least some of the fields in your documents. If you plan to index arrays of mixed datatypes, you are better off declaring that field as a string type.
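A minimal sketch of such an explicit mapping, in ES 2.x syntax; the index name and the doc type `product` are assumptions for illustration:

```python
import json

# Explicit mapping forcing the mixed-type array field to "string"
# (ES 2.x syntax). Every value, including the integers, is then
# indexed as text, so no type conflict can occur.
mapping = {
    "mappings": {
        "product": {  # hypothetical doc_type
            "properties": {
                "name": {"type": "string"}
            }
        }
    }
}
print(json.dumps(mapping, indent=2))
```

You would supply this body when creating the index, e.g. `es.indices.create(index='products', body=mapping)`, before indexing any documents.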

BrookeB
    Thank you for the writeup! I looked into defining a mapping and read the official documentation, but I couldn't figure out how to define a variable-length list of fixed-length lists in a way that Elasticsearch would accept. I'd love any help you have on that front. I also considered that the database was interpreting the schema too rigidly and rejecting later arrays that didn't match, like you suggested, but even when I tried inputting just the first array I ran into the same crash, so I assumed that wasn't the issue. – snowfire257 Feb 27 '16 at 17:54
  • Sure, take a look at inner objects. https://www.elastic.co/guide/en/elasticsearch/reference/2.2/object.html. – BrookeB Feb 27 '16 at 17:57
  • Or, I may not be understanding what you mean by "fixed-length lists"? Can you elaborate a little more? – BrookeB Feb 27 '16 at 18:00
  • The array at the top is exactly what I'm inputting, and what's causing this exception to get thrown every time. Imagine a series of product reviews, each with a name, a review, and a rating score. Different products may have different numbers of reviews, but each review will have these three attributes. It tends to be around 10-30 reviews, and the review itself is around 1-3 sentences. I also have a lot of other product data that gets entered exactly as intended with my code, but trying to do the review array gets an exception thrown every time, even when I cut it down to a single review. – snowfire257 Feb 27 '16 at 18:13
  • Gotcha. Yeah, I believe this is happening because Elasticsearch needs a field name for each value you are trying to index. You have supplied an array of arrays, and ES has nothing to name the elements of your inner array. You will need to adjust your "name" field to be an `object` type, and include an array of objects for your reviews instead of an array of generic arrays. – BrookeB Feb 27 '16 at 18:33
  • Thanks so much for your help! I got a mapping to work and now the data is entering fine. Strangely enough, after experimenting with object definitions for a while, I discovered that the array would only input without throwing the error if I explicitly mapped it as a string. I have absolutely no idea why. But the JSON object still returns an array, integers and all, so it's working perfectly. Thanks for your push in the right direction. – snowfire257 Feb 27 '16 at 20:03
  • Glad it helped. Keep in mind that without using an object, you won't be able to query on specific fields in your "review" data. They will all be considered a "name". Essentially you lose your array of arrays, as it were, and all values get treated the same for searches. You will always get your original JSON back, however, as you discovered, because by default ES stores your original JSON doc un-parsed in a system field called `_source`. – BrookeB Feb 27 '16 at 21:43
  • @snowfire257 Btw, welcome to StackOverflow! If my answer was useful, consider accepting it so that others may find it when they search for a similar issue. Thanks! – BrookeB Feb 29 '16 at 14:37
  • Done! I still have no understanding of what the issue was, though, or why defining my array as a string solved things. The array is fully searchable as well. But regardless, your answer was very useful and deserves visibility. – snowfire257 Feb 29 '16 at 14:41
  • @snowfire257 Thanks! Did you have a look at the Dynamic Mapping documentation? I think the reason it is working now is that ES is collapsing all the elements of your arrays into one field, "name", and since you've declared it a `string`, it won't run into a type conflict (because everything can be safely treated as a string) – BrookeB Feb 29 '16 at 14:46
  • @snowfire257 My point about using `object` before was so you could search on a particular element of a particular array. For example, in your first inner array, ["string1", 1, "string2"], if you wanted to search where only the *first* element was "string1". You would not be able to do that because "string1" could appear anywhere in the array. Or, if you wanted to search where an inner array had BOTH "string1" and "string2". This you could also not do. But, if you don't need that kind of searchability, then an `object` is probably not necessary for you. – BrookeB Feb 29 '16 at 14:51
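The restructuring suggested in the comments, an array of objects instead of an array of arrays, can be sketched as follows. The field names `reviewer`, `rating`, and `text` are assumptions, since the post never names the three elements of each inner array:

```python
# Reshape the original array-of-arrays into an array of objects so that
# every value gets its own, consistently typed field name.
raw = {"name": [["string1", 1, "string2"], ["string3", 2, "string4"]]}

reviews = [
    {"reviewer": reviewer, "rating": rating, "text": text}
    for reviewer, rating, text in raw["name"]
]
doc = {"reviews": reviews}

# "reviewer"/"text" now map to string and "rating" to long, with no
# conflict, and individual elements become queryable by field name.
print(doc)
```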