I'm currently embedding Elasticsearch as a search interface into an existing application. The application is a classic 3-tier application with an Oracle SQL database.

I have the entity 'Person' (a database table) with the following attributes:

  • first name
  • last name
  • full name (first name and last name concatenated)
  • person number
  • company name
  • a list of addresses, each with street, zip code, city, phone and email

So far, I have mapped it 1:1 into Elasticsearch: every database column becomes a property. Synchronisation and full load of the data are no problem. But I'm struggling to provide a "good" search experience, as there are many different things to pay attention to:

  • Fuzzy search (tolerance of one or two edits)
  • Wildcard search (if I type "Ange", it should also find results with "Angelina")
  • E-mail address search (I'm already using the uax_url_email tokenizer in combination with the keyword datatype)

As far as I can tell, multi_match with type cross_fields would be a good choice, but it can't do fuzzy or wildcard search. Type best_fields is also not an option, because it can't do wildcard search (as far as I know?). most_fields is also not suited, and phrase matching can't do fuzziness.

Because of that, I'm currently using simple_query_string, example:

In the search field, I enter "Tom fisher". The resulting simple_query_string query is:

(tom* | tom~1)+(fisher* | fisher~1)
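To make the construction explicit, here is a minimal Python sketch of how such a query string could be generated from the raw input. The function name build_simple_query and the field names in the request body are hypothetical, and operator characters in the user input are not escaped here, so this is only an illustration of the pattern, not a hardened implementation.

```python
def build_simple_query(user_input: str, max_edits: int = 1) -> str:
    """Build a simple_query_string expression from free-text input.

    Each whitespace-separated term becomes (term* | term~N): a prefix match
    OR a fuzzy match. The groups are joined with '+', which is the AND
    operator in simple_query_string syntax.
    """
    terms = user_input.lower().split()
    groups = [f"({term}* | {term}~{max_edits})" for term in terms]
    return "+".join(groups)


# Request body as it could be sent to the _search endpoint.
# The field names are assumptions; adapt them to the real mapping.
search_body = {
    "query": {
        "simple_query_string": {
            "query": build_simple_query("Tom fisher"),
            "fields": ["first_name", "last_name", "company_name"],
        }
    }
}
```

With the input "Tom fisher" and the default max_edits of 1, the function produces exactly the expression shown above.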

My question now is: would it be a bad idea to just have one field, "entity_content", which contains the content of all fields? This would be as if I had a .txt document with all the information about the person.

  • What are the advantages/disadvantages?
skeeks

1 Answer

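Elasticsearch used to provide an _all field by default, which is exactly such a catch-all field: the values of all other fields are stored in it, regardless of which field they come from. Note that _all was deprecated in Elasticsearch 6.0 and removed in 7.0; on current versions, the copy_to mapping parameter is the way to build a custom catch-all field. The documentation describes the trade-offs of _all as follows: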

The _all field can be useful, especially when exploring new data using simple filtering. However, by concatenating field values into one big string, the _all field loses the distinction between short fields (more relevant) and long fields (less relevant). For use cases where search relevance is important, it is better to query individual fields specifically.

The _all field is not free: it requires extra CPU cycles and uses more disk space. If not needed, it can be completely disabled or customised on a per-field basis.
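As a rough illustration of the customised variant, the sketch below builds an index mapping in which selected fields are copied into a single "entity_content" field via copy_to. It assumes a typeless mapping as used by Elasticsearch 7 and later; the field names and the helper person_mapping are hypothetical, and the nested address list is left out for brevity.

```python
import json


def person_mapping(catch_all: str = "entity_content") -> dict:
    """Return an index mapping that copies selected fields into one field.

    The individual fields stay searchable on their own, so relevance-sensitive
    queries can still target them, while the catch-all field serves simple
    "search everything" queries.
    """
    text_field = {"type": "text", "copy_to": catch_all}
    return {
        "mappings": {
            "properties": {
                "first_name": dict(text_field),
                "last_name": dict(text_field),
                "company_name": dict(text_field),
                # Identifiers are usually matched exactly, hence keyword.
                "person_nr": {"type": "keyword", "copy_to": catch_all},
                # The catch-all target itself is a plain analysed text field.
                catch_all: {"type": "text"},
            }
        }
    }


if __name__ == "__main__":
    print(json.dumps(person_mapping(), indent=2))
```

The resulting dictionary could be passed as the body of an index-creation request. Keeping the individual fields alongside the catch-all field costs extra disk space, but it preserves the option to boost or query specific fields later.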

Mysterion