
Setup:

  • Rails 4
  • MySQL
  • ThinkingSphinx

I have a model (Record) in my app with almost 500 million rows. The model has 32 fields, but only two matter for this particular Sphinx search: name and token. name is the field I am searching against with Sphinx, and token is the value I want returned so I can perform other actions with it in Rails.

My index setup is:

ThinkingSphinx::Index.define :records, :with => :real_time do
  # fields
  indexes name
  indexes token

  # attributes
  has token, as: :token_attr, type: :string
  # < several additional attributes >
end

What I want to do is query Sphinx on :records matching against name and have it return distinct token strings in an array.

Here's what I have:

Record.search("red", indices: %w(records), max_matches: num_tokens_i_need, group_by: :token_attr)

... where num_tokens_i_need is generally somewhere in the thousands (less than 10,000).

The above query takes between 5 and 8 minutes to complete. However, when I simply do:

Record.search("red", indices: %w(records), max_matches: num_tokens_i_need).map(&:token).uniq

The search is incredibly fast (returning several million records in a couple hundred milliseconds), but I don't end up with num_tokens_i_need distinct tokens, because the .uniq call shrinks the result set after the fact.

Basically, what I need is a fast Sphinx search that gives me back an exact number of distinct tokens for a given term (such as "red").

If seeing my sphinx.conf or anything else would be helpful, please let me know.

CDub

1 Answer


The Sphinx docs note that grouping is done in memory, so to get grouped search results, every matching document's attributes need to be in memory at some point. Given there are several million documents in your Record index, I'm guessing this is the cause of the slowness.

Keep in mind that in your second example, millions of records may match your query, but they're not all being returned by Sphinx (and the matching is done purely on fields; attributes aren't involved), which is part of why that query is much faster.

Some thoughts on better ways forward:

  • If you just want the tokens from Record instances where the name matches exactly, then SQL is probably the better tool for this job. Even with partial matches, your database's own matching (LIKE, or its fuzzy matching) may be quicker. There's a rough sketch of this after the list.
  • If you're only after the number of distinct tokens, rather than the token values themselves, then Sphinx really isn't the right tool for the job. It isn't built with aggregation in mind, which is why it's not tuned for the query you're running.
  • If the keyword values (in your examples, red) are a known set (rather than user-provided), perhaps you can cache the values and recalculate them on a regular basis (once a day?). The second sketch below shows one way to approach that.
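
For the first suggestion, here's a minimal sketch of the SQL route, assuming your records table has an index on name for the exact-match case, and reusing your num_tokens_i_need variable. The LIKE pattern is purely illustrative:

# Exact match on name: let MySQL do the deduplication and the limiting.
tokens = Record.where(name: "red").distinct.limit(num_tokens_i_need).pluck(:token)

# Partial match via LIKE (or your database's fuzzy matching). Slower than an
# exact match, but still a single query with no in-memory grouping in Sphinx.
tokens = Record.where("name LIKE ?", "%red%").distinct.limit(num_tokens_i_need).pluck(:token)

And for the caching idea, a rough sketch assuming the search terms are a known set and Rails.cache is configured. The method name, cache key, and KNOWN_TERMS list are all hypothetical:

KNOWN_TERMS = %w(red blue green)

def distinct_tokens_for(term)
  # The entry expires daily, so the first lookup after expiry recalculates it.
  Rails.cache.fetch("distinct_tokens/#{term}", expires_in: 24.hours) do
    Record.where(name: term).distinct.limit(10_000).pluck(:token)
  end
end

# Usage:
tokens = distinct_tokens_for("red")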

None of these are clear winners, but hopefully they help you find a better solution.

pat