
We have a deduplication problem when our data is spread across multiple indexes and a particular id exists in more than one of them.

A straight select returns X records, but a group by returns counts that add up to more than X. As stated above, we have tracked this back to the offending id existing in more than one index.

Sphinx is smart enough to deduplicate the records when doing the straight select, but doesn't when bucketing them for a group by.

Of course it would be better not to have the duplicates, and we'll hopefully find a way to deal with that, but for the time being, is there a way to tell Sphinx to deduplicate on group by as well?
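To make the mismatch concrete, here is a small Python simulation of the symptom. It is not Sphinx code and makes no claim about Sphinx internals; the index contents, the `category` attribute, and all function names are made up for illustration.

```python
from collections import Counter

# Hypothetical rows (id, category) as they might sit in two separate indexes.
index_a = [(1, "red"), (2, "blue"), (3, "red")]
index_b = [(3, "red"), (4, "blue")]  # id 3 also lives in index_a


def straight_select(*indexes):
    """Merge rows and keep one row per id, like the deduplicated select."""
    seen = {}
    for index in indexes:
        for doc_id, category in index:
            seen[doc_id] = category  # a later index overwrites an earlier one
    return seen


def naive_group_counts(*indexes):
    """Bucket every row from every index without looking at ids."""
    return Counter(category for index in indexes for _, category in index)


def deduped_group_counts(*indexes):
    """Deduplicate by id first, then bucket the surviving rows."""
    return Counter(straight_select(*indexes).values())


rows = straight_select(index_a, index_b)
naive = naive_group_counts(index_a, index_b)
deduped = deduped_group_counts(index_a, index_b)

print(len(rows), sum(naive.values()), sum(deduped.values()))
```

In this sketch the naive bucketing counts id 3 once per index, so its totals exceed the number of rows the select returns, while deduplicating by id before bucketing brings the two back in line.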

Adam Morgan
  • I don't know Sphinx, but I'm assuming it's removing the 'dupes' because the data is identical. In your group by query, the columns are probably not all the same, which is why they show up. I'm sure this can be solved by tweaking your SQL query, assuming your data model is solid. – lrossy Oct 16 '15 at 17:51
  • I think the dups are removed in the aggregate group by because we specifically bucket against the id column, but with the straight select, it doesn't look at the ids... I could probably add in a distinct(), I guess; I just think that'll slow it down a lot. Maybe the only option, though. – Adam Morgan Oct 16 '15 at 19:00

0 Answers