I am currently changing my Elasticsearch schema. I previously had a single type, Product, in my index, with a nested field Product.users. I now want two separate indices, one for Product and another for User, with the links between them handled in code.

I use the Reindex API to copy all my Product documents to the new index, removing the Product.users field with this script:

ctx._source.remove('users');
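
For context, a minimal sketch of what the whole request body might look like, written as a Python dict. The index names `product_v1` and `product_v2` are placeholders I made up for illustration, not my real ones:

```python
import json


def product_reindex_body(source_index, dest_index):
    """Build a _reindex body that copies documents and drops the 'users' field."""
    return {
        "source": {"index": source_index},
        "dest": {"index": dest_index},
        # The Painless script runs once per document being copied.
        "script": {
            "lang": "painless",
            "source": "ctx._source.remove('users');",
        },
    }


# Hypothetical index names, shown only to print the resulting JSON.
print(json.dumps(product_reindex_body("product_v1", "product_v2"), indent=2))
```

The printed JSON is what would be sent to POST _reindex.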

But I don't know how to reindex all my Product.users entries into the new User index: in the script I get an ArrayList of users, and I want to create one User document for each of them.

Does anyone know how to achieve that?

LordWeedlle

3 Answers

For those who may face this situation: I finally ended up reindexing the nested users field with the scroll and bulk APIs.

  • Use the scroll API to fetch batches of Product documents
  • For each batch, iterate over its Product documents
  • For each document, iterate over Product.users
  • Create a new User document for each entry and add it to a bulk request
  • Send the bulk request once the Product batch has been processed

It does the job <3
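
The steps above could be sketched roughly as follows in Python. This is not my exact code: the index names `product` and `user` are placeholders, and reusing a user's `id` field as the document `_id` is an assumption about what the nested objects contain. The wiring relies on `helpers.scan` and `helpers.bulk` from the official elasticsearch Python client, which scroll and batch for you (bulk requests are sent per chunk of actions rather than strictly per scroll batch).

```python
from typing import Any, Dict, Iterable, Iterator


def user_actions(
    products: Iterable[Dict[str, Any]], user_index: str = "user"
) -> Iterator[Dict[str, Any]]:
    """Yield one bulk 'index' action per entry in each product's users array."""
    for hit in products:
        source = hit.get("_source", {})
        for user in source.get("users") or []:
            action = {"_index": user_index, "_source": dict(user)}
            # Assumption: users carry an "id" field. Reusing it as _id makes the
            # same user seen under several products end up as a single document.
            if "id" in user:
                action["_id"] = str(user["id"])
            yield action


def reindex_users(es, product_index: str = "product", user_index: str = "user"):
    """Scroll over all products and bulk-index their users.

    `es` is an elasticsearch.Elasticsearch client; index names are placeholders.
    """
    # Imported here so user_actions stays usable without the client installed.
    from elasticsearch import helpers

    hits = helpers.scan(
        es, index=product_index, query={"query": {"match_all": {}}}
    )
    # helpers.bulk consumes the generator lazily and sends it in chunks.
    return helpers.bulk(es, user_actions(hits, user_index))
```

Keeping the transformation in `user_actions` separate from the I/O makes it easy to check on a few hand-written hits before running it against the cluster.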

LordWeedlle

What you need is called ETL (Extract, Transform, Load).

Most of the time it is handier to write a small Python script that does exactly what you want, but with Elasticsearch there is one tool I love: Apache Spark with the elasticsearch-hadoop connector.

Logstash can sometimes do the trick too, but with Spark you get:

  • SQL syntax, or Java/Scala/Python code
  • very fast Elasticsearch reads and writes, because the work is distributed (one Spark task per ES shard)
  • fault tolerance (a worker crashes? no problem)
  • clustering (ideal if you have billions of documents)

Use it with Apache Zeppelin (a notebook with Spark packaged and ready) and you will love it!
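
As a rough idea of what this could look like for the question's case, here is a PySpark sketch, untested against a real cluster. The index names `product` and `user`, the node address, and the assumption that `users` is an array of objects are all placeholders. It expects the elasticsearch-hadoop connector to be on the Spark classpath and uses its `org.elasticsearch.spark.sql` data source.

```python
from typing import Dict, Optional


def es_options(nodes: str = "localhost:9200", array_fields: str = "users") -> Dict[str, str]:
    """Connector settings; both default values are placeholders."""
    return {
        "es.nodes": nodes,
        # Elasticsearch mappings do not mark arrays, so name them explicitly.
        "es.read.field.as.array.include": array_fields,
    }


def explode_users(
    spark,
    product_index: str = "product",
    user_index: str = "user",
    options: Optional[Dict[str, str]] = None,
) -> None:
    """Read products, emit one row per nested user, write rows to the user index."""
    # Imported lazily so the helper above works without pyspark installed.
    from pyspark.sql.functions import col, explode

    opts = options or es_options()
    products = (
        spark.read.format("org.elasticsearch.spark.sql")
        .options(**opts)
        .load(product_index)
    )
    # explode() turns each element of the users array into its own row;
    # "user.*" then flattens the struct into top-level columns.
    users = products.select(explode(col("users")).alias("user")).select("user.*")
    (
        users.write.format("org.elasticsearch.spark.sql")
        .options(**opts)
        .mode("append")
        .save(user_index)
    )
```

In a Zeppelin notebook, where a Spark session is already provided, calling `explode_users(spark)` would be the whole job.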

Thomas Decaux

The simplest solution I can think of is to run the reindex command twice: once selecting the Product fields and reindexing them into the new_Products index, and once for the users:

POST _reindex
{
  "source": {
    "index": "Product",
    "type": "_doc",
    "_source": ["fields", "to keep in", "new Products"],
    "query": {
      "match_all": {}
    }
  },
  "dest": {
    "index": "new_Products"
  }
}

You should then be able to run the reindex a second time, into the new_User index, selecting only Product.users in that second pass.
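
To make the two passes concrete, here is a small Python sketch that builds both request bodies. The field names `name` and `price` are made up for illustration, and the index names are lowercase placeholders because Elasticsearch requires lowercase index names.

```python
import json
from typing import Any, Dict, Iterable


def reindex_body(source_index: str, dest_index: str, fields: Iterable[str]) -> Dict[str, Any]:
    """Build a _reindex body that copies only the listed source fields."""
    return {
        "source": {
            "index": source_index,
            "_source": list(fields),
            "query": {"match_all": {}},
        },
        "dest": {"index": dest_index},
    }


# Pass 1: keep the product fields (hypothetical names), leaving users out.
product_pass = reindex_body("product", "new_products", ["name", "price"])
# Pass 2: keep only the users field.
user_pass = reindex_body("product", "new_user", ["users"])

print(json.dumps(user_pass, indent=2))
```

Keep in mind that source filtering only trims fields: the second pass still writes one document per Product, each holding its users array, so producing one document per user would need an extra client-side step.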

Muhammad Ali