I have to store several million hotel rooms, with the following requirements:

  • Each hotel reports the number of identical rooms available, per day
  • Prices can change daily; this data is only stored in ES, not indexed
  • The index will only be used for search (not for monitoring), using the hotel's geolocation
  • Size: let's say about 50k hotels, 10 rooms each, 1+ year of availability => about 200 million daily records
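
The size estimate can be checked with a quick back-of-the-envelope calculation. This is only a sketch: the variable names are made up for illustration, and the inputs are the rounded figures from the list above.

```python
# Rough sizing, using the rounded figures from the requirements above.
hotels = 50_000
rooms_per_hotel = 10
days = 365  # "1 year+" in practice, so this is a lower bound

rooms = hotels * rooms_per_hotel
daily_records = rooms * days

print(f"rooms: {rooms:,}")
print(f"daily availability records: {daily_records:,}")
```

With a bit more than one year of availability, this lands near the 200 million mentioned above.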

So we have to manage availability at the daily level. Each time a room is booked in our application, the number of available rooms has to be updated. We also store a cache of data from partners (other hotel providers working worldwide), which we request at regular intervals to keep the cache up to date.

I am pretty familiar with Elasticsearch, but I am still hesitating between two mappings. I removed some fields (breakfast, amenities, smoking...) to keep it simple:

The first one: one document per room, each with 365 children (one per day).

{
  "mappings": {
    "room": {
      "properties": {
        "room_code": {
          "type": "keyword"
        },
        "hotel_id": {
          "type": "keyword"
        },
        "isCancellable": {
          "type": "boolean"
        },
        "location": {
          "type": "geo_point"
        },
        "price_summary": {
          "type": "keyword",
          "index": false
        }
      }
    },
    "availability": {
      "_parent": {
        "type": "room"
      },
      "properties": {
        "date": {
          "type": "date",
          "format": "date"
        },
        "number_available": {
          "type": "integer"
        },
        "overwrite_price_summary": {
          "type": "keyword",
          "index": false
        }
      }
    }
  }
}

pros:

  • Updates and reindexing are isolated at the child level
  • Only one index
  • Adding future availabilities is easy (just add child documents to a room)

cons:

  • Queries will be a little slower because of the join (looping over the availability children)
  • Children AND parents need to be returned, so the query would have to include inner_hits.
  • A lot of hotels create temporary rooms (for vacations, local events...) that are only available one month a year, for example. This adds useless rooms to the index for the 11 remaining months.
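
For reference, a search against this first mapping might look roughly like the sketch below. It is not a tested production query: the function name, the coordinates, and the `night_N` inner_hits names are made up for illustration, and it assumes that a room must have at least one unit available on every night of the stay. It only builds the request body; no client call is made.

```python
import json
from datetime import date, timedelta


def build_parent_child_query(lat, lon, distance, check_in, nights):
    """Build a search body for the parent/child mapping (option 1).

    One has_child clause is added per night, so a room only matches
    if every night of the stay has at least one unit available.
    """
    night_clauses = []
    for offset in range(nights):
        day = check_in + timedelta(days=offset)
        night_clauses.append({
            "has_child": {
                "type": "availability",
                "query": {
                    "bool": {
                        "filter": [
                            {"term": {"date": day.isoformat()}},
                            {"range": {"number_available": {"gte": 1}}},
                        ]
                    }
                },
                # inner_hits names must be unique within one request
                "inner_hits": {"name": f"night_{offset}"},
            }
        })

    return {
        "query": {
            "bool": {
                "filter": [
                    {
                        "geo_distance": {
                            "distance": distance,
                            "location": {"lat": lat, "lon": lon},
                        }
                    },
                    *night_clauses,
                ]
            }
        }
    }


# Hypothetical search: within 5 km of a point, 2 nights from 31 March.
body = build_parent_child_query(48.8566, 2.3522, "5km", date(2021, 3, 31), 2)
print(json.dumps(body, indent=2))
```

Each extra night adds one more join clause, which is where the "a little slower" con above comes from.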

The second one: one index per month (Jan, Feb...), using nested documents instead of children.

{
  "mappings": {
    "room": {
      "properties": {
        "room_code": {
          "type": "keyword"
        },
        "hotel_id": {
          "type": "keyword"
        },
        "isCancellable": {
          "type": "boolean"
        },
        "location": {
          "type": "geo_point"
        },
        "price_summary": {
          "type": "keyword",
          "index": false
        },
        "availability": {
          "type": "nested"
        }
      }
    },
    "availability": {
      "properties": {
        "day_of_month": {
          "type": "integer"
        },
        "number_available": {
          "type": "integer"
        },
        "overwrite_price_summary": {
          "type": "keyword",
          "index": false
        }
      }
    }
  }
}

pros:

  • Faster: no join, smaller index
  • Solves the temporary-room issue, thanks to the 12 monthly indices

cons:

  • Updates: booking a room for one night forces a reindex of the whole room document (in the matching month's index)
  • If a customer is looking for a room with a check-in on 31 March, for example, we will have to query two indices, March and April
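
To make the cross-month case concrete, here is a minimal sketch of how a stay could be split across the monthly indices. The index naming (`rooms-01` to `rooms-12`) and both function names are hypothetical, and the nested clause assumes the `availability` objects carry the `day_of_month` and `number_available` fields from the mapping above.

```python
from datetime import date, timedelta


def nights_by_monthly_index(check_in, nights, prefix="rooms-"):
    """Group the nights of a stay by the monthly index that holds them.

    The index naming ("rooms-01" .. "rooms-12") is hypothetical.
    Returns a dict mapping index name -> list of day_of_month values.
    """
    grouped = {}
    for offset in range(nights):
        day = check_in + timedelta(days=offset)
        index_name = f"{prefix}{day.month:02d}"
        grouped.setdefault(index_name, []).append(day.day)
    return grouped


def nested_night_clause(day_of_month):
    """Nested clause: the given day must have at least one unit left."""
    return {
        "nested": {
            "path": "availability",
            "query": {
                "bool": {
                    "filter": [
                        {"term": {"availability.day_of_month": day_of_month}},
                        {"range": {"availability.number_available": {"gte": 1}}},
                    ]
                }
            },
        }
    }


# Check-in on 31 March for 2 nights: one night in March, one in April.
stay = nights_by_monthly_index(date(2021, 3, 31), 2)
for index_name, days in stay.items():
    clauses = [nested_night_clause(d) for d in days]
    print(index_name, days, len(clauses))
```

Since the March and April documents for the same room live in different indices, the two result sets would presumably have to be intersected on the application side (for example on `room_code`).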

For search, the second option is better in theory. The main problem is the room updates: in production, about 30 million daily availabilities change every 24 hours. I also have to read and compare the cache from the partners, and update it if needed: about 130 million reads every 12 hours on average, with roughly one update per 10 reads. I have 6 other indexed fields at the room level in my mappings, which is not a lot, so maybe the nested solution is OK...
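
Converted to per-second rates, those figures look roughly like the sketch below. The variable names are made up, and the one-update-per-10-reads ratio is read from the figures above as an assumption. The write-amplification part relies on the fact that nested objects are stored as separate Lucene documents alongside their root document, so changing one day rewrites the whole room block.

```python
# Back-of-the-envelope write load, from the figures in the paragraph above.
SECONDS_PER_HOUR = 3600

own_changes = 30_000_000      # availability changes per 24 hours
partner_reads = 130_000_000   # partner cache reads per 12 hours
update_ratio = 1 / 10         # assumption: one update per 10 reads

own_updates_per_s = own_changes / (24 * SECONDS_PER_HOUR)
partner_updates_per_s = partner_reads * update_ratio / (12 * SECONDS_PER_HOUR)
total_updates_per_s = own_updates_per_s + partner_updates_per_s

# Nested: one update rewrites the room plus every nested day in that month.
lucene_docs_per_nested_update = 31 + 1  # up to 31 days + the room itself
# Parent/child: one update rewrites only the affected child document.
lucene_docs_per_child_update = 1

nested_docs_per_s = total_updates_per_s * lucene_docs_per_nested_update
child_docs_per_s = total_updates_per_s * lucene_docs_per_child_update

print(f"updates/s: {total_updates_per_s:.0f}")
print(f"Lucene docs rewritten/s (nested): {nested_docs_per_s:.0f}")
print(f"Lucene docs rewritten/s (parent/child): {child_docs_per_s:.0f}")
```

This ignores bulk batching and segment merges, so it only gives an order of magnitude for comparing the two mappings, not a benchmark.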

So, which one is the best?

Note: I read "How to store date range data in elastic search (aws) and search for a range?", but my case is a little different because of the daily information.

Any help/advice is welcome.

Brian Tompsett - 汤莱恩
LeBigCat
  • That is so specific and has tradeoffs going in both directions that it's rather unlikely anyone will be able to give a concrete answer (unless they had exactly the same scenario). That's why there is https://github.com/elastic/rally: You need to try it. – xeraa Dec 25 '20 at 21:37
  • Yeah, in fact if this is our own data, the nested model should be OK, because it won't be updated so often. But for the external data pushed by partners, maybe raven or a managed queue could control the number of update queries sent to ES. If anyone has similar child mappings and could share some benchmarks, that would be appreciated. – LeBigCat Dec 27 '20 at 01:38
