Using a unique identifier in Solr indexed field name

Question

I have the following structure in my documents:

doc: 1

{
  "123e4567-e89b-12d3-a456-426655440000": {
    "order_id": "100",
    "qty": 27
  },
  "321e7654-e89b-21d3-a654-426655441111": {
    "order_id": "234",
    "qty": 12
  }
}

doc: 2

{
  "123e4567-e89b-12d3-a456-426655440000": {
    "order_id": "101",
    "qty": 27
  },
  "789ab763-a56b-87bb-a654-873655442222": {
    "order_id": "345",
    "qty": 23
  }
}

The UUID keys at the document root are customer identifiers, and each nested object is an order that customer made.

The only queries I care about are simple single-field lookups by customer identifier and order identifier, to find a customer's orders:

customer_idx?q=*:*&fq=123e4567-e89b-12d3-a456-426655440000.order_id:*&sort=123e4567-e89b-12d3-a456-426655440000.order_id asc&rows=3

or a particular one:

customer_idx?q=*:*&fq=123e4567-e89b-12d3-a456-426655440000.order_id:101&rows=1

Would it be OK, from a performance perspective, to define the dynamicField on the customer identifier? In that case I will end up with hundreds of thousands or millions of fields in a single schema.

<dynamicField name="*.order_id" type="string" indexed="true" stored="false" multiValued="false" />

I understand that a large number of indexed fields would affect performance and memory consumption if I used many of them in a single query, since Lucene creates an array with one entry per document for every field I query or sort on. But would it be a problem to have hundreds of thousands or millions of fields if I only ever query one of them at a time?

If not, what would be a better solution?

Thanks.

UPDATE: updated the query examples. Added filter, sort and limit, in case it matters.

Could you elaborate? With a query like q=123e4567-e89b-12d3-a456-426655440000:* you already have tons of fields, right? — Mysterion, Dec 10 '17 at 16:15
Yes, there are a few hundred thousand indexed fields by now. Solr's performance looks OK, and queries like this have been ultra fast so far. — boxx, Dec 10 '17 at 16:31
I'm more curious, how you're figuring out which field (e.g. uuid) to query? — Mysterion, Dec 10 '17 at 17:34

score 0 · Answer 1 · answered Dec 10 '17 at 19:57

The main problem with queries like these comes when you start sorting the result set. The FieldCache (which you may be able to avoid if you're using docValues now) is populated with an int for each document in the index (keyed by docid) describing its position, and even if only a small number of documents have the field, the whole array is generated. There was a patch available to create a sparse list instead, listing only those documents that do contain the field.

Anyhow, the easy fix is to transform your data structure to only use a single field for each query type:

customer_id:123e4567-e89b-12d3-a456-426655440000
customer_id_order_id:123e4567-e89b-12d3-a456-426655440000_101

.. so you get one cache for each field regardless of how many fields you have.
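As a rough sketch of what the two lookups from the question might look like against this layout (the helper name `order_params` and a separate `order_id` field used for sorting are assumptions, not something the question's schema defines):

```python
from urllib.parse import urlencode


def order_params(customer_id, order_id=None, rows=3):
    """Build Solr select parameters for the single-field layout.

    Assumes documents carry customer_id and customer_id_order_id,
    plus a hypothetical order_id field used only for sorting.
    """
    if order_id is None:
        # All orders of one customer, sorted on one shared field.
        return {
            "q": "*:*",
            "fq": f'customer_id:"{customer_id}"',
            "sort": "order_id asc",
            "rows": rows,
        }
    # One particular order: exact match on the combined key.
    return {
        "q": "*:*",
        "fq": f'customer_id_order_id:"{customer_id}_{order_id}"',
        "rows": 1,
    }


# Query string to append to customer_idx/select?
query_string = urlencode(order_params("123e4567-e89b-12d3-a456-426655440000", "101"))
```

Every customer now filters and sorts on the same field names, so the field name no longer changes from query to query.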

You can also break your documents into two separate documents, one for each customer/order_id combination, and thus, query them as regular documents (instead of having two values inside each document).
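A minimal sketch of that split, assuming the nested shape shown in the question. The function name `flatten` and the `id` scheme are made up for illustration; the other field names follow the example above.

```python
def flatten(doc_id, nested):
    """Turn {customer_uuid: {"order_id": ..., "qty": ...}} into flat docs,
    one per customer/order combination."""
    docs = []
    for customer_id, order in nested.items():
        docs.append({
            # Assumed unique-key scheme; any stable unique value works.
            "id": f"{doc_id}_{customer_id}",
            "customer_id": customer_id,
            "order_id": order["order_id"],
            "customer_id_order_id": f"{customer_id}_{order['order_id']}",
            "qty": order["qty"],
        })
    return docs


doc1 = {
    "123e4567-e89b-12d3-a456-426655440000": {"order_id": "100", "qty": 27},
    "321e7654-e89b-21d3-a654-426655441111": {"order_id": "234", "qty": 12},
}
flat = flatten(1, doc1)  # two flat documents, ready to be indexed
```

With this layout the schema needs only a handful of fixed fields instead of one dynamic field per customer.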

MatsLindh


If I understand correctly, the problem you described is only relevant for large result sets. If there are at most 10 orders per customer by design (so the result set contains 10 items in the worst case), would the FieldCache still be a problem? Could you please elaborate on how it works or share a link to documentation? – boxx Dec 11 '17 at 06:19
If you're not sorting or faceting on the values, you won't have a problem - the FieldCache is used primarily for sorting (and is a Lucene concept that Solr can't do anything about). In those cases the number of documents in the result set isn't relevant, just the total index size. DocValues changes this as well, IIRC. Otherwise you'll probably see more evictions from Solr's caches, since Solr must keep a larger number of distinct result sets available in memory (you can benchmark this). If you're not seeing any issues, keep using it as you do and replace it if problems arise. – MatsLindh Dec 11 '17 at 08:16
@EliJohnes That would depend on your internal usage of Solr and I can't really answer that - note that using cursorMarks (and maybe a few other features) requires you to use a tie breaker - which usually will be a sort by the unique id field. Make sure that your id field is a field type that can utilize doc values and enable those in that case. – MatsLindh Jun 06 '22 at 07:57
Thanks MatsLindh. After checking the code closely, I see that we explicitly add a sort param on id at the end (as a tie breaker, I think). I guess this is how the id field ends up in the field cache. Please let me know your thoughts on this. – Eli Johnes Jun 06 '22 at 09:02
That would probably be the cause, yes. – MatsLindh Jun 06 '22 at 09:33
Hi @MatsLindh, even after removing the id field from the sorting and faceting logic, I can still see the id field occupying the field cache. Please let me know your thoughts on this. – Eli Johnes Jul 18 '22 at 08:16
