I have to store some millions of hotel rooms with some requirements:
- Hotel gives the numbers of identical rooms available - daily
- Price can change daily, this data are only stored in es, not indexed
- The index will only be used for search (no for monitoring) using the hotel's Geolocation
- Size: Let s say about 50k hotels, 10 rooms each, 1 year+ Availability => 200 millions
So we have to manage on a "daily" level. Each time a room is booked, on our application, the numbers of rooms should be updated, we also store "cache" from the partner (other hotel providers) working worldwide, we request them at a regular interval to update our cache.
I am pretty familiar with the elastic search, but I still hesitate between 2 mappings, I removed some fields (breakfast, amenities, smoking...) to keep it simple:
The first one, 1 document by room, each of them contains 365 children (one by day)
{
"mappings": {
"room": {
"properties": {
"room_code": {
"type": "keyword"
},
"hotel_id": {
"type": "keyword"
},
"isCancellable": {
"type": "boolean"
},
"location": {
"type": "geo_point"
}
"price_summary": {
"type": "keyword",
"index": false
}
}
},
"availability": {
"_parent": {
"type": "room"
},
"properties": {
"date": {
"type": "date",
"format": "date"
},
"number_available": {
"type": "integer"
},
"overwrite_price_summary": {
"type": "keyword",
"index": false
}
}
}
}
}
pros:
- Update, reindex will be isolated on the child level
- Only one index
- Adding future availabilities is easy (just adding child documents in a room)
cons:
- Query will be a little slower, because of the join (looping of availability children)
- Childs AND parents need to be returned, so the query would have to include an inner_hits.
- A lot of hotels create temporary rooms (for vacation, local event...), only available 1 month a year, for example, this add useless rooms for the 11 remaining months in the index.
The second: I create one index by month (Jan, Feb...) using nested documents instead of children.
{
"mappings": {
"room": {
"properties": {
"room_code": {
"type": "keyword"
},
"hotel_id": {
"type": "keyword"
},
"isCancellable": {
"type": "boolean"
},
"location": {
"type": "geo_point"
}
"price_summary": {
"type": "keyword",
"index": false
},
"availability": {
"type": "nested"
}
}
},
"availability": {
"properties": {
"day_of_month": {
"type": "integer"
},
"number_available": {
"type": "integer"
},
"overwrite_price_summary": {
"type": "keyword",
"index": false
}
}
}
}
}
pros:
- Faster, no join, smaller index
- Resolve the issue of the temporary room, thanks to the 12 monthly index
cons:
- Update, booking a room for 1 night will make reindex the room documents (of the matching month)
- If a customer is looking for a room with a check-in on the 31st March, for example, we will have to query 2 index, March and April
For the search/query, the second option is better in theory. The main problem is about the updates of the rooms: According to my production, about 30 million daily availabilities change / 24 hours. I also have to read/compare and update if needed, cache from the partner, about 130 million of reading / possible update every (one update for 10 reads) 12 hours (in means). I have 6 other indexed fields in my mappings on room level, this is not a lot, so maybe a nested solution is ok...
So, which one is the best?
note: I read this How to store date range data in elastic search (aws) and search for a range? But my case is a little different because of the daily information.
Any help/advice is welcome.