I need to decide between a full Elasticsearch import and update/upsert for the following workflow.

Elasticsearch is serving as the backend of a business intelligence application.

Every day we collect roughly 200GB of log and analytics data from Hadoop as plain text files.

All existing indices in Elasticsearch are deleted and a new index for today's data is created.

The 200GB of data is imported into the new index.

The import takes about 3.5 hours, after which Elasticsearch serves the app for the next 24 hours until the next import kicks in and the cycle repeats.

We use the elasticsearch-php SDK to handle the bulk import, in case that helps.

The log data we import has the following format:

id: 30459,
age: 45,
country: US,
page_view_count: 4657

The data contains fields such as country, which never changes; age, which seldom changes; and page view count, which changes once in a while.

Roughly 80-85% of the 200GB is unchanged compared with the previous day's data.

I have the following two questions for the experts.

1- Since most of the data is the same, it seems obvious that we should just use upsert. I tried that, but the upsert process takes far too long, sometimes more than 8 hours. (I also tried the doc-style update command so that identical data would be skipped, with the same result.) Am I missing something obvious in the upsert process, or some Elasticsearch flag that I should toggle before the upsert and revert once it is done? What flags, settings, or cluster or node parameters should I check to make bulk upsert faster? (It should at least be faster than fresh indexing, no?)

2- Deleting the index and doing a fresh import every day seems counter-intuitive, but that process finishes in time and services are back online within 4 hours. Why is import faster than upsert in this specific case? I would not expect it to be, given all the work needed to index 200GB of data.

Can you point me in the right direction?

Waku-2

1 Answer

Why upsert does not help

You assume that updating a single value touches only part of the document. On the contrary, the whole document is re-indexed, because documents in ES are immutable. This is how the update API works.

Why is it slower?

When writing a whole new batch of documents, ES does only two things: index and write to disk (roughly speaking).

When updating partially, it does three things: retrieve, index, write.
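
To make the difference concrete, here is a minimal Python sketch (standard library only) of the two kinds of request body the Bulk API accepts, using the sample record from the question. The index name `logs-today` and both function names are made up for illustration, and the metadata lines assume a cluster version that does not require `_type`. The sketch only builds the payloads; sending them is left to whatever client is in use.

```python
import json


def bulk_index_body(index_name, records):
    """Build an NDJSON bulk body of plain `index` actions (full document writes)."""
    lines = []
    for rec in records:
        # Action/metadata line, followed by the document source line.
        lines.append(json.dumps({"index": {"_index": index_name, "_id": str(rec["id"])}}))
        lines.append(json.dumps(rec))
    # The Bulk API expects the body to end with a newline.
    return "\n".join(lines) + "\n"


def bulk_upsert_body(index_name, records):
    """Build an NDJSON bulk body of `update` actions with doc_as_upsert."""
    lines = []
    for rec in records:
        lines.append(json.dumps({"update": {"_index": index_name, "_id": str(rec["id"])}}))
        # `doc` is merged into the existing document; doc_as_upsert creates it if missing.
        lines.append(json.dumps({"doc": rec, "doc_as_upsert": True}))
    return "\n".join(lines) + "\n"


sample = [{"id": 30459, "age": 45, "country": "US", "page_view_count": 4657}]
print(bulk_index_body("logs-today", sample))
print(bulk_upsert_body("logs-today", sample))
```

With the `update` action, ES has to fetch the current document, merge `doc` into it, and re-index the result, which is the extra retrieve step mentioned above.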

What can be done to speed indexing up

There are some general recommendations for tuning indexing performance. I would recommend looking at index.refresh_interval first; you might consider increasing it or disabling it completely.
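
As a rough illustration of the refresh_interval suggestion, the Python sketch below turns refresh off before an import and restores it afterwards. `put_settings` is a hypothetical callable that the caller supplies to apply a settings body to an index (for example by wrapping a client's update-settings call), `run_import` stands in for the actual bulk load, and `logs-today` is a placeholder index name. The restore value of `"1s"` is the Elasticsearch default; use whatever your index normally runs with.

```python
def import_with_refresh_disabled(put_settings, index_name, run_import,
                                 restore_interval="1s"):
    """Run `run_import()` with index refresh disabled, then restore it.

    `put_settings(index_name, body)` is a hypothetical callable that applies
    `body` as index settings; it is not part of any particular SDK.
    """
    # "-1" disables periodic refresh for the index.
    put_settings(index_name, {"index": {"refresh_interval": "-1"}})
    try:
        return run_import()
    finally:
        # Restore even if the import fails, so periodic refreshes resume.
        put_settings(index_name, {"index": {"refresh_interval": restore_interval}})


# Usage with a stand-in recorder instead of a real cluster.
calls = []
result = import_with_refresh_disabled(
    lambda name, body: calls.append((name, body)),
    "logs-today",
    lambda: "imported",
)
print(calls)
```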

One small remark: you don't have to delete the old index while the new one is being created. You can use an alias to point to the current index.
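
A minimal sketch of the alias idea, again in Python: the function below builds the request body for the aliases API that moves an alias from yesterday's index to today's in a single request, so the app can keep querying the alias while the new index is loaded. The alias and index names are placeholders, not anything from the question's setup.

```python
import json


def alias_swap_body(alias, old_index, new_index):
    """Build an aliases API body that repoints `alias` from old_index to new_index."""
    return {
        "actions": [
            {"remove": {"index": old_index, "alias": alias}},
            {"add": {"index": new_index, "alias": alias}},
        ]
    }


# Hypothetical names: the app always queries "logs-current".
body = alias_swap_body("logs-current", "logs-yesterday", "logs-today")
print(json.dumps(body, indent=2))
```

Once the alias points at the new index, the old one can be deleted at leisure.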

Hope that helps!

Nikolay Vasiliev