Faceting on part of a string

Question

Let's say I've got some documents in an index. One of the fields is a url. Something like...

{"Url": "Server1/Some/Path/A.doc"},
{"Url": "Server1/Some/OtherPath/B.doc"},
{"Url": "Server1/Some/C.doc"},
{"Url": "Server2/A.doc"},
{"Url": "Server2/Some/Path/B.doc"}

I'm trying to extract counts by paths for my search results. This would presumably be query-per-branch.

Eg:

Initial query:
    Server1: 3
    Server2: 2

Server1 Query:
    Some: 3

Server1/Some Query:
    Path: 1
    OtherPath: 1

Now I can broadly see 2 ways to approach this and I'm not a great fan of either.

Option 1: Scripting. mvel seems to be limited to mathematical operations (at least I can't find a string split in the docs) so this would have to be in Java. That's possible but it feels like a lot of overhead if there are a lot of records.

Option 2: Store the path parts alongside the document...

{"Url": ..., "Parts": ["1|Server1","2|Some","3|Path"]},
{"Url": ..., "Parts": ["1|Server1","2|Some","3|OtherPath"]},
{"Url": ..., "Parts": ["1|Server1","2|Some"]},
{"Url": ..., "Parts": ["1|Server2"]},
{"Url": ..., "Parts": ["1|Server2","2|Some","3|Path"]}

This way I could do something like: for URLs starting with 'Server1/Some', facet on parts starting with 3|. This feels so horribly hackish.
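For illustration, a minimal Python sketch of the pre-processing behind Option 2. The function names `depth_tagged_parts` and `facet_counts` are hypothetical, the last URL segment is assumed to be the file name, and the counting is done in memory only to show what the facet would be expected to return. It is not an Elasticsearch call.

```python
from collections import Counter


def depth_tagged_parts(url):
    """Return the directory segments of a URL, each tagged with its 1-based depth."""
    segments = [s for s in url.split("/") if s]
    directories = segments[:-1]  # assumption: the last segment is the file name
    return [f"{depth}|{name}" for depth, name in enumerate(directories, start=1)]


def facet_counts(urls, url_prefix, depth):
    """Count the parts at `depth` among the URLs that sit under `url_prefix`."""
    tag = f"{depth}|"
    counts = Counter()
    for url in urls:
        # Match whole segments only, so "Server1/Some" does not match "Server1/SomeOther".
        if url_prefix and not url.startswith(url_prefix + "/"):
            continue
        for part in depth_tagged_parts(url):
            if part.startswith(tag):
                counts[part[len(tag):]] += 1
    return dict(counts)


urls = [
    "Server1/Some/Path/A.doc",
    "Server1/Some/OtherPath/B.doc",
    "Server1/Some/C.doc",
    "Server2/A.doc",
    "Server2/Some/Path/B.doc",
]

print(facet_counts(urls, "", 1))              # initial query
print(facet_counts(urls, "Server1", 2))       # Server1 query
print(facet_counts(urls, "Server1/Some", 3))  # Server1/Some query
```

The three calls correspond to the initial, Server1, and Server1/Some queries listed earlier.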

What's a good way to do this? I can do as much pre-processing as required but need the counts to be coming from ES as it's the count of results from a query that is important.

Geert-Jan · Answer 1 · 2013-05-14T18:59:33.227


Given a doc with the url /a/b/c,

have a multivalued field url and (using preprocessing) index every prefix of the path as a value: /a, /a/b, /a/b/c
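As a rough sketch of that preprocessing step in Python (the helper name `path_prefixes` is made up for this example, and it only builds the values; indexing them is left out):

```python
def path_prefixes(path):
    """Return every ancestor prefix of a slash-separated path, shortest first."""
    segments = [s for s in path.split("/") if s]
    return ["/" + "/".join(segments[:i]) for i in range(1, len(segments) + 1)]


print(path_prefixes("/a/b/c"))
```

Each document would then carry the returned list as the values of its multivalued field.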

edit

When you want to constrain the counts to paths of a certain depth, you could define multiple multivalued fields as described above. Each field would represent a particular depth.

The ES-client should contain logic to decide which depth (and thus which field) to query for facets.

Still feels like a hack though, and indeed without control of data you could end up with lots of fields for this.
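A possible sketch of the field-per-depth idea and the client-side logic, in Python. The field naming scheme `path_depth_N` and both function names are assumptions made for this example, not anything defined by ES.

```python
def split_path(path):
    """Split a slash-separated path into its non-empty segments."""
    return [s for s in path.split("/") if s]


def per_depth_fields(path, field_prefix="path_depth_"):
    """Build one field per depth, each holding the path prefix of that depth."""
    segments = split_path(path)
    return {
        f"{field_prefix}{depth}": ["/" + "/".join(segments[:depth])]
        for depth in range(1, len(segments) + 1)
    }


def facet_field_for(selected_path, field_prefix="path_depth_"):
    """Pick the field to facet on: one level below the currently selected path."""
    return f"{field_prefix}{len(split_path(selected_path)) + 1}"


print(per_depth_fields("/a/b/c"))
print(facet_field_for(""))          # nothing selected yet: facet on the first level
print(facet_field_for("/Server1"))  # drilled into /Server1: facet on the second level
```

The client would filter on the selected path and request a facet on the field returned by `facet_field_for`, so only the direct children come back.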

edited May 14 '13 at 18:59

answered May 14 '13 at 14:20

Geert-Jan


That's fine but if I want to get (say) counts per server there's no way to do that as I don't know which of the multivalues I need - only to dump the _entire_ index - effectively building the whole tree in one go. Hence my addition of a counter in my `Parts` suggestion – Basic May 14 '13 at 16:00
Don't think you would need preprocessing, just use the [path hierarchy tokenizer](http://www.elasticsearch.org/guide/reference/index-modules/analysis/pathhierarchy-tokenizer/) which gives you the same result in terms of indexed tokens. – javanna May 14 '13 at 18:36
@Basic not sure why this is not fine. Do you want to get the second level not only for a specific server but for all servers? Maybe have a look [here](http://www.springyweb.com/2012/01/hierarchical-faceting-with-elastic.html). – javanna May 14 '13 at 18:38
@Basic: do you mean you want to automatically retrieve all possible paths on a certain level? For instance, filtering on `/Server1`, retrieve all paths of depth = 2? Then indeed the above wouldn't suffice, as doing a term facet would also return all paths with depth > 2 as well as `/Server1` itself. – Geert-Jan May 14 '13 at 18:43
@javanna That link seems to be describing my option #2 with a better way of tokenizing. Nice to know someone else had the same thought. I wasn't aware of that tokenizer either, thanks. As to why this answer isn't quite the fit, can you give an example of how I'd get counts for either: a) each server (but not all subdirectories) or b) all folders directly under a given path, e.g. `/Server1/*` but not `/Server1/*/*/...`? Most likely I'm missing something. – Basic May 14 '13 at 18:43
@Geert-Jan That's it exactly: `=2` not `>2`. I'm dealing with large numbers of documents and don't want to load the entire "tree" until the user drills down. – Basic May 14 '13 at 18:44
@Basic: yeah a pretty common use-case I would think. I'm pretty dumbfounded that I don't see an easy answer :S. Let me think it over a bit – Geert-Jan May 14 '13 at 18:46
this is getting a comment-nightmare, but if your hierarchy is relatively flat (depth N is not that large) you could have several fields, 1 per depth and have logic in your client to decide which field to query. This logic could probably be moved to ES with some scripting as well. Still feels hackish I admit. – Geert-Jan May 14 '13 at 18:49
In this case, we're crawling documents from large corporate networks and have no control over depth. Windows boxes have a max total path length of 256 so presumably, worst-case with insane 1-character directory names, <128. *nix on the other hand allows up to 4k absolute paths... Of course, any nutter who let their directories be built that way deserves a kicking but that's not under my control. Feel free to continue commenting on the OP and thanks for your assistance – Basic May 14 '13 at 18:54
Just to go ahead with this comment nightmare :) [This](https://github.com/elasticsearch/elasticsearch/issues/1076) elasticsearch feature, hopefully coming with 1.0 is interesting. As GeertJan said having a field per level would be a better solution in your case. Have a look [here](http://jaibeermalik.wordpress.com/2013/03/19/elasticsearch-faceted-search-for-hierarchical-data/) then. – javanna May 14 '13 at 19:28
