
I have an index of ~113,000 documents. I'm trying to retrieve all of them, and I don't care about the score. Basically a select * from index;

I'm doing this in Python using elasticutils (I haven't found the time to switch to elasticsearch-dsl yet).

Running

S().indexes('da_userstats').query().count()  

completes in about 0.003 seconds.

Running

S().indexes('da_userstats').query()[0:113595].execute().objects 

is taking about 15 seconds.

From what I understand of the documentation, both should force execution, so I don't see why there is such a huge difference in time.

In the mapping I've tried marking the fields as not_analyzed, but it has had no effect. I really don't understand why the two calls differ by so many orders of magnitude.

@classmethod
def get_mapping(cls):
    return {
        'properties': {
            'id': {
                'type': 'integer',
                'index': 'not_analyzed',
                "include_in_all": False,
            },
            'email': {
                'type': 'string',
                'index': 'not_analyzed',
                "include_in_all": False
            },
            'username': {
                'type': 'string',
                'index': 'not_analyzed',
                "include_in_all": False
            },
            'date_joined': {
                'type': 'string',
                'index': 'not_analyzed',
                "include_in_all": False
            },
            'last_activity': {
                'type': 'string',
                'index': 'not_analyzed',
                "include_in_all": False
            },
            'last_activity_web': {
                'type': 'string',
                'index': 'not_analyzed',
                "include_in_all": False
            },
            'last_activity_ios': {
                'type': 'string',
                'index': 'not_analyzed',
                "include_in_all": False
            },
            # (the mapping is cut off at this point in the original post)
        }
    }
jhulme
  • This way of returning all documents is not the Elasticsearch way. Use [`size`/`from` or pagination](https://www.elastic.co/guide/en/elasticsearch/guide/current/pagination.html) to do this. If you had more documents or a smaller heap size, you would have run out of memory doing it like this. – Andrei Stefan Aug 04 '15 at 14:43
  • If you want to retrieve all the documents, and you don't care about the order, you may want to use the scroll and scan API, which is very fast. – MauricioRoman Aug 05 '15 at 00:44
  • @AndreiStefan in elasticutils slicing is the way of specifying size and from, and looking at pagination, they advise against doing it when you want to go through all the documents. – jhulme Aug 05 '15 at 09:19
  • @MauricioRoman Going to take a look at that, thanks. Sorting it afterwards in Python might be faster as well, because it's not sorting on every shard and then once more at the end. – jhulme Aug 05 '15 at 09:20
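
For reference, the scroll approach suggested in the comments could look roughly like the sketch below. It is not elasticutils code. It assumes a client object that exposes `search(index=..., body=..., scroll=..., size=...)` and `scroll(scroll_id=..., scroll=...)` and returns plain dicts, which is how the official elasticsearch-py client behaved in the versions current at the time; newer client versions may want different keyword arguments. The function name `iter_all_docs`, the page size and the keep-alive value are made up for illustration.

```python
def iter_all_docs(client, index, page_size=1000, keep_alive="2m"):
    """Yield every hit in `index`, fetching one scroll page at a time.

    `client` is assumed to behave like an elasticsearch-py client:
    search() opens the scroll and returns the first page, scroll()
    returns each following page until a page comes back empty.
    """
    resp = client.search(
        index=index,
        body={"query": {"match_all": {}}},  # no scoring needed
        scroll=keep_alive,                  # how long the server keeps the cursor
        size=page_size,                     # hits per round trip
    )
    while resp["hits"]["hits"]:
        for hit in resp["hits"]["hits"]:
            yield hit
        # Ask for the next page using the cursor from the previous response.
        resp = client.scroll(scroll_id=resp["_scroll_id"], scroll=keep_alive)
```

With a real client this would be used as `for hit in iter_all_docs(es, 'da_userstats'): ...`, reading the document from `hit['_source']`. Only one page is held in memory at a time, instead of asking for all 113595 hits in a single response.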
