
I am building a web app for a client who has ~1M rows of data in their MySQL database. The data looks approximately like this:

Products

+----+-----------+-------+-------------+--------------------+
| Id | Name      | Size  | Colour      | 4 other attributes |
+----+-----------+-------+-------------+--------------------+
| 1  | product 1 | S,M,L | red, green  | ... other options  |
| 2  | product 2 | XL    | blue, green | ... other options  |
+----+-----------+-------+-------------+--------------------+

On the Products page in my app, there is:

  • Sidebar column that holds filters (6 filter groups from the attributes above with their respective options)
  • Main area where users see the product results (50 at a time with paging)

When the user clicks on a filter option (e.g. colour=blue), I run a query to update the results in the main area (e.g. all products that have colour=blue).

What query can I perform to fetch the remaining filter options for the other filter groups?

For example, if the user selects colour=blue, that will reduce the number of options in other filter groups such as size (if you look at the sample data above, the only option for size that will remain will be XL).

I can't get the filter options from all results of the colour=blue query, because there are 100k+ of these and I only fetch 50 at a time.

Step 1: All options available.

Sizes
[ ] S
[ ] M
[ ] L
[ ] XL

Colours
[ ] red
[ ] green
[ ] blue

Step 2: User selects "blue", sizes options are updated.

Sizes
[ ] XL

Colours
[ ] red
[ ] green
[x] blue

Here is an example from another site doing something similar. (This site also has millions of products listed.)

Step 1: Initial load, no filters selected.


Step 2: One of the filter options is selected. The results on the right update and the filter options in the other filter groups in the sidebar update too.


philipxy
reddfox
  • Really you need a refactor of the database: sizes stored in a sizes table, with links made through an xref table `product_size`. Ditto with colour. – RiggsFolly Feb 09 '23 at 12:46
  • To re-iterate the earlier comments, your primary challenge here is dealing with denormalised data, meaning you won't be able to take advantage of indexes or seek operations, and you'll always just be running table scans. [Something like this](https://dbfiddle.uk/aVonq298) would be a lot better in terms of efficient searching, and if a load of old application code is reliant on the existing structure you can always reconstruct this with a view to stop existing select queries breaking while you refactor. – GarethD Feb 09 '23 at 12:57
  • Thanks! Good point on refactoring the DB. It will indeed create a lot of side effects, but I can handle them if needed. How do I handle my main problem though - filtering/updating the filter options? (e.g. user selects size M and the colour filter group updates to only show available options from the filtered results) Do I need another table for that too? There are currently 6 different filter groups and the client might add more in the future. (Edited the question to visualise this better) – reddfox Feb 09 '23 at 13:47
  • In a project I worked on I faced a similar problem. I solved it by splitting the queries that get the paged data from the ones that give grouped counts. You can use a single endpoint that executes different queries, or use two endpoints: one to get the paged results and one to retrieve the count of items for each category (in this case you could first get the paged result and then update the filter pane). Anyway, a good refactoring of the db is essential to get good performance. I don't think that loading the whole dataset from the database is the solution. – Roberto Ferraris Feb 09 '23 at 14:08
  • 2
    You need to go back to your client and (politely!) get them to revisit their requirements. Having the app display a large volume of records has no benefit to the users so the generic requirement to show all records returned by any filter needs to be challenged and a more sensible requirement agreed e.g. as each filter option is selected disable any other options that do not apply to the resulting dataset - but don’t display the dataset until all options selected and the user presses “Apply” – NickW Feb 09 '23 at 14:35
  • The real database has ~ 1M products with 6 attribute/filter groups having 20-250 options available in each (e.g. different brands, different types of products, etc.). All combinations between these are possible. They also add new products and attributes (filter options) every day. – reddfox Feb 09 '23 at 17:55
  • 2
    1M rows is like nothing. Simply do the initial query and filter the resulting 100k rows programmatically on the server. Probably you are using wrong routines (e.g. fetch every row one by one instead of all at once). When your data set gets really large (~100M) then the trick is to use product categories. For example in category "DVDs" you will never have a size (S/M/L). So for every category you have max. 20 attributes. Also the SQL layout needed to change. When it comes to big data (> 1000M?) you have to split products by category and use multiple servers. – zomega Feb 11 '23 at 20:23
  • What exactly is the question? PS It's not clear what you are working on as a deliverable, exactly what queries it would involve, how you are stuck querying or otherwise coding for it or relevant parts of it, or what exactly is your question. What are these steps? Are they trying to illustrate "when users apply any of the filters"? How is "because some queries can return 100k+ rows" a problem? [mre] Please clarify via edits, not comments. Please don't insert "EDIT"s/"UPDATE"s, just make your post the best presentation as of edit time. [ask] [Help] – philipxy Feb 12 '23 at 04:02
  • There is no "best". [Strategy for “Which is better” questions](https://meta.stackexchange.com/q/204461/266284) [Why was my "shopping list" question closed?](https://meta.stackexchange.com/q/158809/266284) [Why is “Can someone help me?” not an actual question?](https://meta.stackoverflow.com/q/284236/3404097) – philipxy Feb 12 '23 at 04:03
  • [Is storing a delimited list in a database column really that bad?](https://stackoverflow.com/q/3653462/3404097) Googling 'site:stackoverflow.com sql delete value from list in column' ... 2nd hit [How to delete specific value from a column in SQL row?](https://stackoverflow.com/q/43813512/3404097) Again: What exactly is your specific researched non-duplicate question, with [mre] if stuck coding? – philipxy Feb 12 '23 at 10:37
  • I am referring to updating the UI. This is a React application. I perform API requests to fetch data (via SQL queries) and then use it to update the UI. I don't perform any updates to the SQL tables at this point but can change them if refactoring is needed (outside of the scope of the React app). If this is unclear, I will update my question again to try to make it more clear. – reddfox Feb 12 '23 at 11:22
  • 1
    **"Please clarify via edits, not comments."** – philipxy Feb 12 '23 at 11:30
  • 1
    What backend technology are you using for your API here, this is something that is going to be best handled in between the database and the front end, i.e. where you can aggregate and summarise your data without needing to send millions of records between the back end and front end. – GarethD Feb 13 '23 at 16:19
  • Currently using PHP to pass data to React via the API endpoints. – reddfox Feb 13 '23 at 19:52

3 Answers


The question is: "What query can I perform to fetch the remaining filter options for the other filter groups?"

Short answer

Use a brute force query that fetches the data and tallies the results.

Long Answer

Here are some tips on what needs to be done, along with some suggestions for improving performance.

  1. Categorize the products into maybe 100 different groupings (CDs, dresses, cameras, auto parts, etc). This will (1) become the first column of the PRIMARY KEY and (2) limit the heavy queries that will follow.
  2. Study the data to find which groupings are relevant (price, size, color, F-stop, etc), and be sure to re-study the data periodically. The filters will look stale if the RAM-size options still start at 1MB and stop at 256MB. Be aware that words like "size" have different meanings for T-shirts vs disk drives, and no meaning for things that come in a single size.
  3. Brute force search the entire group whenever the user checks (or unchecks) a filter box.
  4. Split the data into 2 tables (and lots of ancillary tables). One has just the columns that are deemed worth filtering on. One has all the other info (description, URLs of images, etc). The first table will be used for searching and for rebuilding the groupings.
  5. Keep track of what people filter on. (Who cares about the color of the RAM chips!) Use this to eliminate useless filters. It will be harder to discover new filters (eg, when disk drives added an "SSD" option).
  6. Be aware that users will want to check multiple checkboxes in each filter. (eg, 2 price ranges or 2 sizes)
  7. Be sure to plan the "pagination" carefully; that can be a performance killer.
  8. For filtering, don't "normalize": that is, keep "XL" as "XL", not as a numeric id. However, you may need a number to order the options in a sensible way (XS<S<M<L<XL). Do keep genuinely numeric values as numbers (eg, megabytes) so that they can be easily tossed into the buckets.

The performance will not be great, but I have suggested some things above to avoid a big table scan:

  • Force the user to start with a grouping that drastically limits the amount of data to scan.
  • Index it appropriately so that the scan is of "consecutive" rows.
  • Vertically split -- search values vs the rest of the stuff.

Maybe...

For counts (left side of page)

SELECT  COUNT(*) AS total,
        SUM(genre = 'Hip Hop') AS "Hip Hop",
        ...
    FROM search_table
    WHERE category = 'CD'
      AND artist = '...'   -- already picked items

For the first 20 to show (main part of page)

SELECT  i.*
    FROM search_table AS s
    JOIN info AS i  ON i.id = s.id
    WHERE s.category = 'CD'
      AND s.artist = '...'   -- already picked items
    ORDER BY s.price ASC     -- user-chosen sort order
    LIMIT 0, 20              -- for pagination
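The pair of queries above can be sketched end to end. The following uses SQLite via Python's sqlite3 purely for illustration; the table names, columns, and sample rows are all invented, and MySQL accepts the same SUM-over-a-boolean idiom for the facet counts:

```python
import sqlite3

# Hypothetical schema and data, just to illustrate the two-query pattern.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE search_table (id INTEGER PRIMARY KEY, category TEXT, genre TEXT, price REAL);
CREATE TABLE info (id INTEGER PRIMARY KEY, title TEXT);
INSERT INTO search_table VALUES
  (1, 'CD', 'Hip Hop', 9.99), (2, 'CD', 'Rock', 12.50),
  (3, 'CD', 'Hip Hop', 7.25), (4, 'DVD', 'Rock', 15.00);
INSERT INTO info VALUES (1, 'Album A'), (2, 'Album B'), (3, 'Album C'), (4, 'Movie D');
""")

# Counts for the sidebar: one pass over the already-filtered group.
total, hiphop, rock = conn.execute("""
    SELECT COUNT(*),
           SUM(genre = 'Hip Hop'),
           SUM(genre = 'Rock')
      FROM search_table
     WHERE category = 'CD'
""").fetchone()

# First page of results for the main area.
page = conn.execute("""
    SELECT i.id, i.title
      FROM search_table AS s
      JOIN info AS i ON i.id = s.id
     WHERE s.category = 'CD'
     ORDER BY s.price ASC
     LIMIT 20 OFFSET 0
""").fetchall()

print(total, hiphop, rock, page[0])  # → 3 2 1 (3, 'Album C')
```

The key point is that the count query scans only the rows already restricted by category (and any picked filters), never the whole table, and the paged query touches the wide `info` table only for the rows it actually displays.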

This is, however, complicated by not needing/wanting a column color for CDs, nor a column genre for most other items. I would solve this in either of 2 ways:

  • 100 search tables (one per category), each with appropriate search categories as columns (with suitable datatypes).

      PRIMARY KEY (price, id),
      INDEX(id)
    
  • A single table with most of the search criteria in a JSON string.

      PRIMARY KEY (category, price, id),
      INDEX(id)
    

Notes:

  • In either approach, there would probably be a column for price, assuming that most categories need a filter on such.

  • In either approach, the filtering by table or by PK would shrink down the search to about 1/100 of the rows.
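The single-table-with-JSON option can be sketched as follows. This uses SQLite's json_extract purely for illustration (MySQL has the analogous JSON_EXTRACT / ->> operator); the schema and rows are invented:

```python
import json
import sqlite3

# One search table for all categories; category-specific attributes live in a JSON column.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE search_table (id INTEGER PRIMARY KEY, category TEXT, price REAL, attrs TEXT)")
rows = [
    (1, 'CD',      9.99, json.dumps({"genre": "Hip Hop"})),
    (2, 'CD',     12.50, json.dumps({"genre": "Rock"})),
    (3, 'camera', 450.00, json.dumps({"megapixels": 24})),
]
conn.executemany("INSERT INTO search_table VALUES (?, ?, ?, ?)", rows)

# A category-specific filter expressed against the JSON column; the leading
# category predicate is what shrinks the scan to ~1/100 of the rows.
hits = conn.execute("""
    SELECT id FROM search_table
    WHERE category = 'CD'
      AND json_extract(attrs, '$.genre') = 'Hip Hop'
""").fetchall()
print(hits)  # [(1,)]
```

Note that the JSON predicate itself is not indexable in the plain form shown; the category prefix of the PRIMARY KEY is doing the heavy lifting, and the JSON attributes are filtered by scanning within that slice.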

A third option is to PARTITION BY either KEY or LIST and have 100 partitions. Perhaps fewer; for example, various types of clothing might work well in the same grouping. Or media (CD, etc) or vehicles (cars, trucks).

PRIMARY KEY(price, cat, id),  -- cat not needed first
INDEX(id)
Rick James

To restrict the list of sizes when the user selects a colour that does not have all sizes, generate a filter query that returns all combinations that actually exist.

You would get back JSON that looks like:

{"filters": [
  {"color": "green", "size": "sm"},
  {"color": "green", "size": "md"},
  {"color": "green", "size": "lg"},
  {"color": "green", "size": "xl"},
  {"color": "blue",  "size": "xl"},
  ...
]}

On the client, keep an array of selected colours and an array of selected sizes. When these are updated, redraw the filter lists using filterOptions.filter(f => {..only give me f where both fields exist in their respective filter arrays}), then use this filtered array to filter again into the relevant dimensions. With more than two dimensions this can grow quite quickly, so if the dimensions differ by product type you might need another layer of abstraction to keep the filtering generic.
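The cross-filtering described above can be sketched as follows. This is Python purely for brevity (the combos list and the available_options helper are invented here); the same logic maps directly onto Array.prototype.filter in the React client:

```python
# The combination list returned by the server (see the JSON above).
combos = [
    {"color": "green", "size": "sm"},
    {"color": "green", "size": "md"},
    {"color": "green", "size": "lg"},
    {"color": "green", "size": "xl"},
    {"color": "blue",  "size": "xl"},
]

def available_options(combos, selections):
    """For each dimension, return the options still reachable given the
    selections made in the *other* dimensions (an empty set = no filter)."""
    result = {}
    dims = {d for c in combos for d in c}
    for dim in sorted(dims):
        # A combo counts toward `dim` if it satisfies every OTHER dimension's selection.
        matching = [
            c for c in combos
            if all(not sel or c[d] in sel
                   for d, sel in selections.items() if d != dim)
        ]
        result[dim] = sorted({c[dim] for c in matching})
    return result

# User ticks "blue"; the size list collapses to xl, while the colour list
# still shows every colour so the user can switch or add selections.
print(available_options(combos, {"color": {"blue"}, "size": set()}))
# → {'color': ['blue', 'green'], 'size': ['xl']}
```

Excluding the current dimension from its own filter is what keeps the other options in that group visible (matching how the example site in the question behaves).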

Aaron Reese
  • The others are also right, the database structure is going to give you horrible code logic and performance issues. – Aaron Reese Feb 14 '23 at 11:08
  • Thanks. There will be more than 2 dimensions indeed and each has 20-250 possible options, so the number of combinations will be significant. This is why I also posted an example from another site doing something similar, as they also have millions of products. I will definitely look into refactoring the DB as well. The FE is based on React, if that is relevant. – reddfox Feb 14 '23 at 17:27

Perform 2 queries, calling the same stored routine: the first (param_query_mode=1) is called to get the main result set, and the second (param_query_mode=2) is called to build the new filter set. The two calls are issued simultaneously as AJAX requests.

Both queries use the same initial stage which builds a set of IDs from the product table based on the initial search criteria and the current filters. The results represent all matching records and are stored in a temp table containing just IDs.

Then the first query (param_query_mode=1) builds a result set from the temp table, filtering out all but the required 50 (or whatever -- parameter controlled). This result set contains the fields needed to display those 50 records.

The second query (param_query_mode=2) also builds its result set from the temp table, but it contains what is needed to display the filters across all results from the initial stage, not just the current page.

I do not have to deal with tables with millions of records, but the approach works well in my context.
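The shared-temp-table approach can be sketched as follows, using SQLite via Python purely for illustration (the products schema and data are invented; in MySQL the temp table would live inside the stored routine and both query modes would run against it):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE products (id INTEGER PRIMARY KEY, name TEXT, colour TEXT, size TEXT);
INSERT INTO products VALUES
  (1, 'product 1', 'red',   'S'),
  (2, 'product 2', 'blue',  'XL'),
  (3, 'product 3', 'blue',  'XL'),
  (4, 'product 4', 'green', 'M');
""")

# Initial stage shared by both query modes: materialise the matching IDs once.
conn.execute(
    "CREATE TEMP TABLE matched_ids AS SELECT id FROM products WHERE colour = 'blue'"
)

# param_query_mode = 1: one page of full rows for the main area.
page = conn.execute("""
    SELECT p.id, p.name FROM products AS p
    JOIN matched_ids AS m ON m.id = p.id
    ORDER BY p.id LIMIT 50 OFFSET 0
""").fetchall()

# param_query_mode = 2: remaining filter options over ALL matches, not just the page.
sizes = [r[0] for r in conn.execute("""
    SELECT DISTINCT p.size FROM products AS p
    JOIN matched_ids AS m ON m.id = p.id
    ORDER BY p.size
""")]

print(page, sizes)  # → [(2, 'product 2'), (3, 'product 3')] ['XL']
```

Because both result sets derive from the same ID set, the page and the sidebar can never disagree about which products are "in" the current filter state.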

Nikkorian