I need to decide whether to use a full Elasticsearch import or update/upsert, based on the following workflow.
Elasticsearch is serving as the backend of a business intelligence application.
Every day we collect roughly 200 GB of log and analytics data from Hadoop as plain-text files.
All existing indices in Elasticsearch are deleted and a new index for today's data is created.
The 200 GB of data is imported into the new index.
The import takes about 3.5 hours, and Elasticsearch then serves the app for the next 24 hours, until the next import process kicks in and everything is repeated.
We are using the elasticsearch-php SDK to handle the bulk import, if that information helps.
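For context, the daily import is roughly the sketch below (simplified: the index name, batch size, and the readLogLines() reader are placeholders standing in for what our real code does with the Hadoop files):

```php
<?php
require 'vendor/autoload.php';

use Elasticsearch\ClientBuilder;

$client = ClientBuilder::create()->build();

$index     = 'logs-' . date('Y-m-d'); // placeholder: today's fresh index
$batchSize = 5000;                    // placeholder: docs per bulk request
$batch     = ['body' => []];

// readLogLines() is a placeholder for our parser over the Hadoop text files.
foreach (readLogLines('/data/today.txt') as $doc) {
    // Plain "index" action: every document is written as a new document.
    $batch['body'][] = ['index' => ['_index' => $index, '_id' => $doc['id']]];
    $batch['body'][] = [
        'age'             => $doc['age'],
        'country'         => $doc['country'],
        'page_view_count' => $doc['page_view_count'],
    ];

    if (count($batch['body']) >= 2 * $batchSize) { // 2 body lines per doc
        $client->bulk($batch);
        $batch = ['body' => []];
    }
}

if (!empty($batch['body'])) {
    $client->bulk($batch); // flush the final partial batch
}
```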
The log data that we are importing has the following format:
    {
        "id": 30459,
        "age": 45,
        "country": "US",
        "page_view_count": 4657
    }
Obviously, the data contains fields like country that will never change, fields like age that will seldom change, and a view count that might change once in a while.
Of the roughly 200 GB, I'd say about 80-85% of the data is unchanged compared with yesterday's data.
I have the following two questions for the experts.
1- It seems obvious that, since most of the data is the same, we should just use upsert. I tried that, but the upsert process takes far too long, sometimes more than 8 hours. (I also used the doc-style update command so that identical documents are skipped, with the same result.) Am I missing something obvious in the upsert process, perhaps an Elasticsearch flag I should toggle before the upsert and revert once the process is done? What flags, settings, or cluster/node parameters should I check to make the bulk upsert faster? At the very least it should be faster than fresh indexing, no?
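To make the question concrete, here is roughly what my upsert path looks like, together with the only kind of pre-load toggle I know of, disabling refresh and replicas around the load (a simplified sketch: the index name, batch size, revert values, and readLogLines() are placeholders, and detect_noop is set explicitly even though it defaults to true for doc updates):

```php
<?php
require 'vendor/autoload.php';

use Elasticsearch\ClientBuilder;

$client = ClientBuilder::create()->build();
$index  = 'logs'; // placeholder: the persistent index being upserted into

// Pre-load toggle: pause refreshes and drop replicas for the duration.
$client->indices()->putSettings([
    'index' => $index,
    'body'  => ['index' => ['refresh_interval' => '-1', 'number_of_replicas' => 0]],
]);

$batch = ['body' => []];
foreach (readLogLines('/data/today.txt') as $doc) { // placeholder reader
    // Doc-style update with doc_as_upsert: unchanged documents should be
    // detected as no-ops (detect_noop defaults to true for doc updates).
    $batch['body'][] = ['update' => ['_index' => $index, '_id' => $doc['id']]];
    $batch['body'][] = [
        'doc' => [
            'age'             => $doc['age'],
            'country'         => $doc['country'],
            'page_view_count' => $doc['page_view_count'],
        ],
        'doc_as_upsert' => true,
        'detect_noop'   => true,
    ];

    if (count($batch['body']) >= 10000) { // placeholder batch size
        $client->bulk($batch);
        $batch = ['body' => []];
    }
}
if (!empty($batch['body'])) {
    $client->bulk($batch);
}

// Revert the settings once the load is done (placeholder revert values).
$client->indices()->putSettings([
    'index' => $index,
    'body'  => ['index' => ['refresh_interval' => '1s', 'number_of_replicas' => 1]],
]);
```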
2- Deleting the index and doing a fresh import every day seems counter-intuitive, but that process finishes in time and the service is back online within 4 hours. Why is the import faster than the upsert in this specific case? I would expect the opposite, given all the work that has to be done to index 200 GB of data from scratch.
Can you point me in the right direction?