Vespa - Proton: Custom bucketing & Query

Question

References:

id scheme

Format: id:<namespace>:<document-type>:<key/value-pairs>:<user-specified>

http://docs.vespa.ai/documentation/content/buckets.html
http://docs.vespa.ai/documentation/content/idealstate.html

It is possible to structure data with user-defined bucketing logic by using the 32 LSB of the document id (the n / g selections).

However, the query logic isn't very clear on how to route queries to a specific bucket range based on a decision taken in advance.

E.g., it is possible to split data into time ranges (start-time/end-time) if I can define n (a number) that compresses the range. All documents tagged this way will end up in the same bucket (which will then follow its usual course of splitting on number of documents / size, as configured).

However, how do I write a search query on data indexed in this manner? Is it possible to tell the processor to choose a specific bucket, or a range of buckets (in case the distribution algorithm has moved buckets)?

Jon · Accepted Answer · 2017-10-12T14:33:56.090

4

You can choose one bucket in a query by specifying the streaming.groupname query property.

Either in the http request by adding

&streaming.groupname=[group]

or in a Searcher by

query.properties().set("streaming.groupname", "[group]");
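
As a minimal sketch of the HTTP variant, the request URL could be assembled like this. The endpoint `http://localhost:8080/search/`, the `query` parameter, the search term, and the group name `user42` are assumptions chosen for illustration; only `streaming.groupname` comes from the answer above.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

public class StreamingGroupQuery {

    // Appends the streaming.groupname property to a search request.
    // Endpoint and "query" parameter name are illustrative assumptions.
    static String buildUrl(String endpoint, String query, String group) {
        return endpoint
                + "?query=" + encode(query)
                + "&streaming.groupname=" + encode(group);
    }

    // URL-encodes a single parameter value as UTF-8.
    static String encode(String value) {
        return URLEncoder.encode(value, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        System.out.println(buildUrl("http://localhost:8080/search/", "video", "user42"));
    }
}
```

The group value must match the g= (or n=) part of the document ids that were fed, otherwise the query looks in a different bucket.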

If you want multiple buckets, use the parameter streaming.selection instead, which accepts any document selection expression: http://docs.vespa.ai/documentation/reference/document-select-language.html

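To specify e.g. two buckets, set streaming.selection (in the HTTP request or in a Searcher) to

id.group=="[group1]" or id.group=="[group2]"

(A document belongs to exactly one group, so the two conditions must be combined with or; with and, no document would match.)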
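
A rough sketch of building such a selection for several groups, and encoding it for use in an HTTP request, might look as follows. The class and method names are hypothetical, and the sketch assumes group names contain no double quotes (it does no escaping). Because each document belongs to a single group, the conditions are joined with `or`.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;
import java.util.List;
import java.util.stream.Collectors;

public class StreamingSelection {

    // Builds a document selection matching any of the given groups.
    // Assumes group names need no escaping (no double quotes inside).
    static String groupSelection(List<String> groups) {
        return groups.stream()
                .map(g -> "id.group==\"" + g + "\"")
                .collect(Collectors.joining(" or "));
    }

    // Renders the selection as a URL-encoded streaming.selection parameter.
    static String asQueryParameter(String selection) {
        return "&streaming.selection="
                + URLEncoder.encode(selection, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String selection = groupSelection(List.of("group1", "group2"));
        System.out.println(selection);
        System.out.println(asQueryParameter(selection));
    }
}
```

In a Searcher, the unencoded string would be passed to query.properties().set("streaming.selection", ...) in the same way as streaming.groupname.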

See http://docs.vespa.ai/documentation/streaming-search.html

Note that streaming search should only be used when each query only needs to search one or a few buckets. It avoids building reverse indexes, which is cheaper in that special case (only).

edited Oct 12 '17 at 14:33

answered Oct 11 '17 at 07:32

As one can reason, streaming mode is quite costly and does not support stemming. With a custom bucketing criterion while indexing a very large dataset (> 10 billion documents), employing key-value pairs seems a good idea. However, doing so forces me to use streaming search, and the size of the data I want to operate on will not go well with the high cost. Any suggestions for indexing in a manner that allows custom buckets (colocating the documents) while getting the best performance and smallest size? – shwetank Oct 11 '17 at 08:25
More specifically, if I don't go with the key-value colocation scheme, how do I optimize query/search to target an exhaustive set that I can potentially define while indexing? E.g. all video ads purchased in a time range (which is a known and growing set on an ad server). If this data is let loose into the open bucketing scheme, what would be the least costly lookup, keeping 0 visibility-delay in mind? – shwetank Oct 11 '17 at 08:28
Streaming search avoids reverse indexes, which makes sense when you only ever search a small chunk of the total data in each query, and those chunks can be predetermined (typically used for personal data). – Jon Oct 12 '17 at 14:31
More discussion about your use case in https://github.com/vespa-engine/vespa/issues/3709#issuecomment-335912442 – Jon Oct 12 '17 at 14:32

score 0 · Answer 2 · answered Oct 11 '17 at 08:03

The &streaming.* parameters are described here: http://docs.vespa.ai/documentation/reference/search-api-reference.html#streaming.groupname

This only applies to document types that are configured with mode=streaming. For the default mode, which is index, you cannot control the query routing: http://docs.vespa.ai/documentation/reference/services-content.html#document
