
I'm trying to find a solution that allows students to search jobs based on a role query.

I've managed to get exactly what I want using cross_fields, but I lose fuzziness.

Here is my dataset for testing purposes:

POST _bulk
{"index":{"_index":"duarte-search-role","_id":"Job 1"}}
{"title":"Marine Biologist 1","overview":"Marine Biologist","opportunity_type_name":"Graduate Job","expired":false}
{"index":{"_index":"duarte-search-role","_id":"Job 2"}}
{"title":"Marine Biologist 2","overview":"No keyword","opportunity_type_name":"Graduate Job","expired":false}
{"index":{"_index":"duarte-search-role","_id":"Job 3"}}
{"title":"Comparison Job 3","overview":"Marine Biologist","opportunity_type_name":"Graduate Job","expired":false}
{"index":{"_index":"duarte-search-role","_id":"Job 4"}}
{"title":"Marine Biologist 4","overview":"Marine Biologist","opportunity_type_name":"Internship","expired":false}
{"index":{"_index":"duarte-search-role","_id":"Job 5"}}
{"title":"Marine Biologist 5","overview":"No keyword","opportunity_type_name":"Internship","expired":false}
{"index":{"_index":"duarte-search-role","_id":"Job 6"}}
{"title":"Comparison Job 6","overview":"Marine Biologist","opportunity_type_name":"Internship","expired":false}

I want to search across all fields using fuzziness.

For example if someone types "Marine Biologist"

  • Job 1 comes first because it has the word in both title and overview
  • Job 4 comes after for the same reason
  • Job 2 comes after because it has the word in the title
  • etc

If someone searches for "Graduate Marine Biologist"

  • Job 1 comes first because it has the word "Marine Biologist" in both title and overview and it has "Graduate" in the opportunity type.
  • Job 2 comes second because it has the word "Marine Biologist" in the title and "Graduate" in the opportunity type.
  • etc

If someone searches for "Marine Biologist Internship"

  • Job 4 comes first because it has the word "Marine Biologist" in both title and overview and it has "Internship" in the opportunity type.
  • Job 5 comes second because it has the word "Marine Biologist" in the title and "Internship" in the opportunity type.
  • etc

I can achieve exactly the results described above using this query:

GET /duarte-search-role/_search?search_type=dfs_query_then_fetch
{ 
  "query": {
    "bool": {
      "must": [
        {
          "multi_match": {
            "query": "Marine Biologist Internship",
            "fields": [
              "title^100",
              "overview^50",
              "opportunity_type_name^30"
            ],
            "operator": "and",
            "type": "cross_fields",
            "tie_breaker": 1
          }
        }
      ],
      "filter": [
        {
          "term": {
            "expired": false
          }
        }
      ]
    }
  },
  "sort": [
    {
      "_score": {
        "order": "desc"
      }
    },
    {
      "application_close_date": {
        "order": "asc"
      }
    }
  ],
  "from": 0,
  "size": 8
}

The problem is that cross_fields doesn't support fuzziness, and I want to tolerate things like spelling errors instead of relying on the student to type every word exactly.

Is there a way to rewrite the above in OpenSearch to achieve the same results, but with fuzziness?

Thanks!

kyuubi
  • `...cross_fields is usually only useful on short string fields that all have a boost of 1. Otherwise boosts, term freqs and length normalization contribute to the score in such a way that the blending of term statistics is not meaningful anymore.` Since all your fields have a different boost, it's interesting that you get "perfect" results. Have you tried the [`combined_fields` query](https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-combined-fields-query.html)? Note that it also doesn't support fuzziness, but just curious. – Val Jul 21 '23 at 11:16
  • Unfortunately, OpenSearch doesn't support combined_fields; I did come across it. I'm not experienced enough with OpenSearch to confirm what you're saying about boosts. All I can say is that, for the acceptance criteria I stated (and similar test data sets), cross_fields is the only type that yields the correct results, and the boosts work as expected as well. – kyuubi Jul 22 '23 at 13:03
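
One direction that might be worth trying, shown here only as an untested sketch: since fuzziness is rejected for cross_fields but accepted for the most_fields and best_fields types, the "every term must appear in at least one field" behaviour of cross_fields with operator "and" can be approximated by emitting one fuzzy multi_match clause per search term inside bool.must. The helper below just builds that request body in Python. The function name, the whitespace split (which is not the same as the index analyzer's tokenization) and the choice of most_fields are assumptions for illustration; the fields, boosts and the expired filter are copied from the query in the question. Scores will not be identical to cross_fields, because term statistics are no longer blended across fields, so the ordering would need to be checked against the test data.

```python
import json

# Fields and boosts copied from the cross_fields query in the question.
FIELDS = ["title^100", "overview^50", "opportunity_type_name^30"]


def fuzzy_per_term_query(text, fields, fuzziness="AUTO"):
    """Build a search body with one fuzzy multi_match clause per term.

    Hypothetical helper: every whitespace-separated term becomes its own
    clause in bool.must, so each term has to match (fuzzily) in at least
    one of the fields, roughly like cross_fields with operator "and".
    """
    per_term_clauses = [
        {
            "multi_match": {
                "query": term,
                "fields": fields,
                # most_fields adds up the scores of all matching fields,
                # which is an assumption standing in for tie_breaker: 1.
                "type": "most_fields",
                "fuzziness": fuzziness,
            }
        }
        for term in text.split()
    ]
    return {
        "query": {
            "bool": {
                "must": per_term_clauses,
                "filter": [{"term": {"expired": False}}],
            }
        }
    }


# Deliberately misspelled input to show where fuzziness would come into play.
body = fuzzy_per_term_query("Marin Biologst Internship", FIELDS)
print(json.dumps(body, indent=2))
```

The printed JSON could be sent as the body of the same GET _search request used above, with the sort, from and size sections added back as needed.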
