
I have several million docs that I need to move into a new index, but there is a condition on which docs should flow into it. Say I have a field named offsets that needs to be queried against. The values I need to query for in the offsets field are [1,7,99,32, ....., 10000432] (a very large list).

Does anyone have thoughts on how I can move the specific docs with those values into a new Elasticsearch index? My first thought was reindexing with a query, but there is no pattern to the offsets list.

Would it be a Python loop appending each doc to a new index? Looking for any guidance. Thanks.
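For reference, this is roughly the loop I have in mind (index and field names are placeholders, and I haven't run this at scale):

from elasticsearch import Elasticsearch

client = Elasticsearch("http://localhost:9200")  # placeholder cluster address

offsets = [1, 7, 99, 32, 10000432]  # the real list has millions of entries

for offset in offsets:
    # pull the docs whose offsets field matches this value...
    resp = client.search(
        index="source_index",
        body={"query": {"term": {"offsets": offset}}, "size": 1000},
    )
    # ...and copy each hit into the new index, keeping its _id
    for hit in resp["hits"]["hits"]:
        client.index(index="new_index", id=hit["_id"], body=hit["_source"])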


2 Answers


Are the documents really large, or can you add them to a JSONL file for bulk ingestion? In what form is the selector list, the one shown as "[1,7,99,32, ....., 10000432]"?

I'd do it in Pandas, but here is an idea in ES parlance. Whatever you do, do use the _bulk API, or the job will never finish.
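As a rough sketch of the bulk side with the Python client (cluster address, index names, and ids are placeholders):

from elasticsearch import Elasticsearch, helpers

client = Elasticsearch("http://localhost:9200")  # placeholder cluster address

# pull the matching docs (however you end up selecting them)...
docs = client.search(
    index="source_index",
    body={"query": {"ids": {"values": ["1", "4", "100"]}}, "size": 10000},
)["hits"]["hits"]

# ...and push them in one _bulk call instead of one request per doc
helpers.bulk(
    client,
    ({"_index": "dest_index", "_id": d["_id"], "_source": d["_source"]} for d in docs),
)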

You can run a query based upon a file, as in GET my_index/_search?_file="myquery_file"

You can put all the ids into a file, myquery_file, as below:

{
  "query": {
    "ids" : {
      "values" : ["1", "4", "100"]
    }
  },
  "format": "jsonl"
}

and output as JSONL to ingest.
You can do the above with the reindex API:

{
  "source": {
    "index": "source",
    **"query": {
      "match": {
        "company": "cat"
      }
    }**
  },
  "dest": {
    "index": "dest",
    "routing": "=cat"
  }
}
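Putting the two together, a rough sketch for your case would be one reindex call carrying the ids query instead of the match query (names and ids are placeholders):

from elasticsearch import Elasticsearch

client = Elasticsearch("http://localhost:9200")  # placeholder cluster address

client.reindex(
    body={
        "source": {
            "index": "source",
            "query": {"ids": {"values": ["1", "4", "100"]}},
        },
        "dest": {"index": "dest"},
    },
    wait_for_completion=False,  # run it as a background task
)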
  • So you're saying: search for my specific docs -> IDs into a JSON value, and then reindex based on the ids? – ColeGulledge Dec 13 '22 at 13:18
  • How many documents are we talking about? If not too many, just script a query in a text file and add the ids computationally. Otherwise, use Python to create an iterator on the query results, filter the ids you want, pack them into bulk ingests, and do it that way (rough sketch after these comments). You can also use Pandas to filter the query results based upon the vector of doc ids and then pack them into bulk inserts. – svanschalkwyk Dec 14 '22 at 16:04
  • Yeah, we're talking ~2 million. It's a tough question, because I have to filter the index for VERY specific values of a specific field. There is no real pattern to simplify the query; it's finding a needle in a haystack. So if I get the ids of all the docs I need based on the specific field, could I then bulk insert? – ColeGulledge Dec 14 '22 at 17:55
  • You have to query to get the appropriate docs prior to the insert. How big is an individual doc? Looks like a job for Python and maybe Pandas. – svanschalkwyk Dec 15 '22 at 18:39
  • Yeah, since I had to query individually, I may as well insert with a reindex while I'm there; see my answer below. Thanks for the help – ColeGulledge Dec 15 '22 at 18:59
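A rough sketch of the iterate-and-filter idea from the comments above, assuming the Python client and placeholder index/field names:

from elasticsearch import Elasticsearch, helpers

client = Elasticsearch("http://localhost:9200")  # placeholder cluster address

wanted = {1, 7, 99, 32, 10000432}  # in practice, the full offsets list

def kept_docs():
    # stream every doc out of the source index...
    for hit in helpers.scan(client, index="source_index", query={"query": {"match_all": {}}}):
        # ...keep only the ones whose offsets value is in the wanted set...
        if hit["_source"].get("offsets") in wanted:
            # ...and re-emit them as bulk index actions for the new index
            yield {"_index": "dest_index", "_id": hit["_id"], "_source": hit["_source"]}

helpers.bulk(client, kept_docs())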

Unfortunately, I was facing a time crunch and had to throw in a personalized loop to move a very specific subset of docs over to the new index.

import pandas as pd
from elasticsearch import Elasticsearch

client = Elasticsearch("http://localhost:9200")  # adjust to your cluster

df = pd.read_csv('C://code//part_1_final.csv')

# Offsets are the "unique" values I need to identify the docs by.
# There is no pattern in these values, thus I must go one by one.
offsets = df['OFFSET'].tolist()

missedDocs = []

for i in offsets:
    print(i)
    try:
        # one reindex call per offset, copying every matching doc into the new index
        client.reindex({
            "source": {
                "index": "<source_index>",
                "query": {
                    "bool": {
                        "must": [
                            { "match": { "<index_field_1>": "1" } },
                            { "match": { "<field_that_needs_values_to_match>": i } }
                        ]
                    }
                }
            },
            "dest": {
                "index": "<dest_index>"
            }
        })
    except Exception:
        # remember which offsets failed so they can be retried later
        missedDocs.append(i)
        print('DOC ERROR')

