
I am using an AWS OpenSearch cluster to store historical data in daily indices (one index per date: 2023.03.24, 2023.03.23, etc.). Each index has a 1:1 primary-to-replica shard ratio and holds roughly 10 million records. Recently we made a change to store the entire data of each record in an additional field on the same record, as a gzip-compressed binary blob: we gzip the record's JSON, base64-encode the result, and store it in a new field called 'compressed' whose mapping type is keyword.
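Roughly, the indexing-side transformation looks like this (a minimal sketch using only Python's standard library; the actual OpenSearch client calls are omitted and the field names follow the description above):

import base64
import gzip
import json

# Mapping for the new field as described above: the blob is typed as keyword.
COMPRESSED_FIELD_MAPPING = {
    "properties": {
        "compressed": {"type": "keyword"}
    }
}

def add_compressed_field(record: dict) -> dict:
    """Gzip the record's JSON, base64-encode it, and attach it as 'compressed'."""
    raw = json.dumps(record, separators=(",", ":")).encode("utf-8")
    blob = base64.b64encode(gzip.compress(raw)).decode("ascii")
    return {**record, "compressed": blob}

# Each document is indexed as the original fields plus the new blob field.
doc = add_compressed_field({"field_1": "value1", "field_2": "value2"})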

However, after making this change, our index size has jumped by almost 500%! This was not expected: even with 0% compression, the size should at most have doubled. Can anyone think of possible reasons for this massive increase, and how we can store this kind of data in OpenSearch/Elasticsearch more space-efficiently?

Stokes
  • Interesting case, I can help you investigate it. Q1: Can you share the mapping and the index sizes? Q2: What is your main aim in copying all fields into one field? Q3: How did you copy all fields into one field, with copy_to? https://www.elastic.co/guide/en/elasticsearch/reference/current/copy-to.html – Musab Dogan Mar 25 '23 at 12:13
  • @MusabDogan The index sizes before adding the compressed field were around 70-100 GB; after adding it, they are 500-600 GB. I gzipped the fields manually and put the result in the compressed field before indexing. Previous record: { "field_1": "value1", "field_2": "value2" } New record: { "field_1": "value1", "field_2": "value2", "compressed": base64encode(gzip(Previous record)) } The mapping for the compressed field is: "mapping": { "compressed": { "type": "text", "fields": { "keyword": { "type": "keyword", "ignore_above": 256 } } } } – Stokes Mar 27 '23 at 19:46
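For illustration only (this is not from the thread above): if the blob only needs to be retrievable rather than searchable, a mapping along the following lines would keep it out of the inverted index and doc_values entirely. It assumes OpenSearch's binary field type and the best_compression index codec, with placeholder types for the other fields; verify both settings against your cluster version.

# Sketch of index settings/mappings that avoid indexing the blob at all.
# Assumes the 'binary' field type (accepts a base64 string, not searchable)
# and the 'best_compression' codec; the other field types are placeholders.
LEANER_INDEX_BODY = {
    "settings": {
        "index": {
            "codec": "best_compression"  # heavier compression for stored fields
        }
    },
    "mappings": {
        "properties": {
            "field_1": {"type": "keyword"},
            "field_2": {"type": "keyword"},
            # Kept only in _source / stored fields, never analyzed or doc_valued.
            "compressed": {"type": "binary"},
        }
    },
}

# Applied at index-creation time, e.g. with opensearch-py:
#   client.indices.create(index="2023.03.24", body=LEANER_INDEX_BODY)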

0 Answers