
I'm developing a PHP application for a real-estate portal, and I have a MySQL database with a Properties table. This table has about 500,000 rows and about 20 columns for property features. Assume that each feature is an integer.

Examples of features:

  • number of rooms in the apartment (1-10)
  • type of building (1-20)
  • condition of building (1-10)

I need to implement a web page with a multi-filter, real-time navigation panel. The idea is that users can select several features to be equal to particular values, for example: 1-2 rooms ("1", "2"), a building from the 1960s ("6"), condition average, good, or excellent ("3", "4", "5"), in a particular city.

The key requirement is that users can see the number of matching properties next to each feature filter value, taking the currently selected filters into account.

This is an example of how it should look:

Rooms: 1[X], 2[X], 3[ ] (15000 more), 4[ ] (10000 more)
State: bad[ ] (1000 more), average[X], excellent[X]
Year: 1950[ ] (19000), 1960[ ] (20000), 1970[ ] (18000)
City: A[ ] (25000), B[ ] (18000), C[ ] (30000)
Price: <100000[ ] (20000), >100000[ ] (30000)

The "[ ]" stands for empty checkbox (i.e. filter value is not selected), and the "[X]" stands for checked checkbox (i.e. filter value is selected).

When the user selects a particular feature value, let's say City = A, the numbers should change because they are now limited by the previous selections:

Rooms: 1[X], 2[X], 3[ ] (5000 more), 4[ ] (5000 more)
State: bad[ ] (1000 more), average[X], excellent[X]
Year: 1950[ ] (19000), 1960[ ] (20000), 1970[ ] (18000)
City: A[X], B[ ] (+4000), C[ ] (+3000)
Price: <100000[ ] (5000), >100000[ ] (6000)

I have tried to use the following SQL query for each feature (X):

SELECT featureX, COUNT(*) AS num FROM properties WHERE selectedFeature1 = Val1 AND selectedFeature2 = Val2 ... AND selectedFeatureN = ValN GROUP BY featureX;

However, a set of these queries takes several seconds, and I need this to work in real time, i.e. under 200 ms on the backend.

I have tried to keep the whole table in shared memory, but unserialize() on an array of 500,000 records also takes about 1 s.

The only solution I have found is to develop a standalone application (for example, in Node.js) which keeps the data in memory as an array of objects and provides an API to the main PHP application. This app would refresh its data from the MySQL database every minute.

However, before starting to implement this, I would like to ask whether there is any solution based on MySQL itself for this kind of task.

If not, is there a purely PHP solution?

Do you have any general recommendations on how to approach this kind of task?

  • Consider pre-computation and/or heavy caching. There's probably no need to actually run this query on every single click against a live database; a slightly outdated cached/pre-computed copy is probably good enough. Beyond this, everything is indexable, either using MySQL indices or by constructing special indices just for the specific combinations of filters and their possible counts (not in MySQL in that case; a sketch of this idea follows these comments). The one thing I'd get away from is the idea that you need to construct a single `WHERE` clause that matches the selected filters; take the data apart further and you gain many options. – deceze Feb 22 '16 at 14:56
  • The issue with caching or pre-computation is that there are too many combinations of the 20 existing feature columns, each of which has about 10 different values: 10 ^ 20. We need a real-time solution with a TTL of about 1 minute. – Anatoliy Kir Feb 22 '16 at 15:04
  • TTL of 1 minute? How many new or changed listings will happen in 1 minute? I suspect 1 hour would be "good enough". – Rick James Feb 22 '16 at 23:50
  • Well, the concern for 1 minute is the use case when a user submits his/her property ad to the database and would like to be able to find it using the filter in order to make sure it can be discovered by others. – Anatoliy Kir Feb 23 '16 at 07:50
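One way to make the "special indices" idea from deceze's comment concrete (outside MySQL) is a per-(feature, value) bitset: one bit per property, set when the property has that value. A facet count then amounts to OR-ing the bitsets of the checked values within each feature, AND-ing across features, and popcounting the result. The following is only a sketch under assumed names and limits (BitmapIndex, kNumFeatures, kMaxValues are not from the question; feature values are assumed to be small integers, in line with the ~10-20 distinct values per feature mentioned above):

    #include <array>
    #include <bit>        // std::popcount (C++20)
    #include <cstddef>
    #include <cstdint>
    #include <vector>

    constexpr int kNumFeatures = 20;
    constexpr int kMaxValues   = 32;   // assumption: each feature value fits in 0..31

    // One packed bitset per (feature, value): bit i is set iff property i has that value.
    struct BitmapIndex {
        std::size_t num_rows = 0;
        std::vector<uint64_t> bits[kNumFeatures][kMaxValues];

        void build(const std::vector<std::array<uint8_t, kNumFeatures>>& rows) {
            num_rows = rows.size();
            const std::size_t n_words = (num_rows + 63) / 64;
            for (auto& per_feature : bits)
                for (auto& words : per_feature)
                    words.assign(n_words, 0);
            for (std::size_t i = 0; i < num_rows; ++i)
                for (int f = 0; f < kNumFeatures; ++f)
                    bits[f][rows[i][f]][i / 64] |= uint64_t(1) << (i % 64);
        }
    };

    // Count rows matching the current filters: OR within a feature, AND across features.
    // selected[f] lists the checked values for feature f; an empty list means "no filter".
    std::size_t count_matching(const BitmapIndex& idx,
                               const std::array<std::vector<uint8_t>, kNumFeatures>& selected) {
        const std::size_t n_words = (idx.num_rows + 63) / 64;
        std::vector<uint64_t> acc(n_words, ~uint64_t(0));        // start from "all rows"
        if (idx.num_rows % 64)                                   // clear padding bits in the last word
            acc.back() = (uint64_t(1) << (idx.num_rows % 64)) - 1;
        for (int f = 0; f < kNumFeatures; ++f) {
            if (selected[f].empty()) continue;                   // feature not filtered
            std::vector<uint64_t> any(n_words, 0);
            for (uint8_t v : selected[f])                        // OR the checked values
                for (std::size_t w = 0; w < n_words; ++w)
                    any[w] |= idx.bits[f][v][w];
            for (std::size_t w = 0; w < n_words; ++w)            // AND across features
                acc[w] &= any[w];
        }
        std::size_t total = 0;
        for (uint64_t w : acc)
            total += std::popcount(w);
        return total;
    }

With 500,000 rows each bitset is roughly 62.5 KB, or about 40 MB for 20 features × 32 values. Producing the number shown next to each value means one count_matching call per (feature, value), with that feature's own selection temporarily replaced by the single value in question; each call is just bitwise ANDs and popcounts.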

1 Answer


Not everything can be solved in a database.

You must look at all 500K rows. It is unlikely that any index will help with more than a small fraction of the possible queries. So... I suggest you keep the entire dataset in RAM and use some fast processing (e.g. C++) to run through all of it.

The data should be (in MySQL terms) TINYINT UNSIGNED, i.e. something like unsigned char in C. The data can probably be stored in 20 bytes, one byte per feature. That's only about 10 MB in C++, 30-100 MB in MySQL, and 400 MB in PHP. MySQL could store the data in a BINARY(20) column to simplify the fetch at the expense of insert/update.
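As a rough sanity check on those sizes, a minimal sketch (the names Property and kNumFeatures are illustrative, not from the answer):

    #include <cstdint>
    #include <vector>

    constexpr int kNumFeatures = 20;   // one byte per feature, values 0..255

    // Mirrors 20 TINYINT UNSIGNED columns (or one BINARY(20) column): one byte per feature.
    struct Property {
        uint8_t features[kNumFeatures];
    };
    static_assert(sizeof(Property) == kNumFeatures, "packed, no padding");

    int main() {
        std::vector<Property> rows(500'000);
        // 500,000 rows * 20 bytes = 10,000,000 bytes, i.e. roughly the 10 MB estimated above.
        (void)rows;
        return 0;
    }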

Use MySQL as the "source of truth" an periodically load it into your processing engine (you suggested 1 minute and Node.js). Then focus on optimizing the counting.

Most languages these days are 'interpretive', hence carry some amount of overhead. C and C++ are two of the few that run at 'machine' level. I'm pretty sure they can do the 10M calculations (500K rows × 20 features) in under 200 ms; I doubt any interpretive language can.
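A minimal sketch of that brute-force pass, assuming the 20-byte-per-row layout above; the names (Selection, FacetCounts, count_facets) are illustrative, not from the answer. It performs the 10M per-feature checks in a single scan and, as an addition to what the answer describes, derives every facet count in that same pass by observing that a row contributes to a feature's counters only if it fails at most that one feature's filter:

    #include <array>
    #include <bitset>
    #include <cstdint>
    #include <vector>

    constexpr int kNumFeatures = 20;
    constexpr int kNumValues   = 256;                        // a feature value is one byte

    struct Property { uint8_t features[kNumFeatures]; };     // same 20-byte layout as above

    // selected[f].test(v) == true means value v of feature f is checked;
    // an all-zero bitset means feature f is not filtered at all.
    using Selection = std::array<std::bitset<kNumValues>, kNumFeatures>;

    // counts[f][v] = properties that would match if value v of feature f were (also)
    // checked, given the current selections on all *other* features.
    using FacetCounts = std::array<std::array<uint32_t, kNumValues>, kNumFeatures>;

    FacetCounts count_facets(const std::vector<Property>& rows, const Selection& selected) {
        FacetCounts counts{};                                // zero-initialised
        for (const Property& p : rows) {
            int failed_feature = -1;
            int failures = 0;
            for (int f = 0; f < kNumFeatures; ++f) {
                const auto& allowed = selected[f];
                if (allowed.any() && !allowed.test(p.features[f])) {
                    failed_feature = f;
                    if (++failures > 1) break;               // cannot contribute anywhere
                }
            }
            if (failures == 0) {
                // Matches every filter: counts toward its own value in every feature.
                for (int f = 0; f < kNumFeatures; ++f)
                    ++counts[f][p.features[f]];
            } else if (failures == 1) {
                // Fails only one filtered feature: it is exactly a "(N more)" candidate there.
                ++counts[failed_feature][p.features[failed_feature]];
            }
        }
        return counts;
    }

For the question's first example, the Rooms entry of `selected` would have bits 1 and 2 set and the State entry bits 3-5; the Rooms counter for value 3 is then the "(15000 more)" figure, and the Year counter for 1960 the plain "(20000)". One sequential pass over ~10 MB of contiguous data like this should take on the order of milliseconds, well inside the 200 ms budget.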

Rick James
  • Thank you, it looks like this is an acceptable solution! I think we will move forward with C++ and an in-RAM database. Would you recommend any good/modern C++ framework to implement a multi-threaded server that can periodically update data from MySQL and at the same time answer API requests? How about Boost Asio? – Anatoliy Kir Feb 23 '16 at 07:54
  • Consider having 2 _processes_ running with a pair of shared memory segments (the db). The reloader would load the new data into a segment, then SIGnal the other one to switch to it (by changing a global pointer it uses). – Rick James Feb 23 '16 at 15:40
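The two-segment idea in that last comment is essentially double buffering: rebuild an idle copy, then flip a pointer so new queries see it. Below is a single-process sketch of the same idea, using two buffers and an atomic index instead of two processes and a SIGnal (the names, e.g. DoubleBuffer, are illustrative; it also assumes the reload period is far longer than any single query, so readers of the old buffer finish before it is overwritten again):

    #include <atomic>
    #include <chrono>
    #include <cstdint>
    #include <thread>
    #include <vector>

    struct Property { uint8_t features[20]; };     // 20-byte rows, as in the answer

    // Two buffers; `active` tells readers which one to scan.
    struct DoubleBuffer {
        std::vector<Property> buf[2];
        std::atomic<int> active{0};

        const std::vector<Property>& current() const {
            return buf[active.load(std::memory_order_acquire)];
        }

        // Rebuild the idle buffer (slow, readers unaffected), then switch instantly.
        template <typename LoadFn>
        void reload(LoadFn load_from_mysql) {
            const int idle = 1 - active.load(std::memory_order_relaxed);
            buf[idle] = load_from_mysql();
            active.store(idle, std::memory_order_release);
        }
    };

    int main() {
        DoubleBuffer db;
        std::thread reloader([&db] {
            for (;;) {
                // Stand-in for a "SELECT ... FROM properties" fetch into a flat array.
                db.reload([] { return std::vector<Property>(500'000); });
                std::this_thread::sleep_for(std::chrono::minutes(1));
            }
        });
        // ... serve API requests here, each one scanning db.current() ...
        reloader.join();
        return 0;
    }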