
I have a tool which compares one string with, on average, 250k strings from a database. Two tables are used during the compare process: categories and categories_strings. The strings table has around 2.5 million rows, while the pivot table, categories_strings, contains 7 million rows.

My query is pretty simple: select the string columns, join the pivot table, add a WHERE clause to specify the category and set a limit of 10,000. I run this query in a loop, so every batch is 10,000 strings. To execute the whole script faster I use the seek method instead of MySQL's OFFSET, which was way too slow on huge offsets. Then comparison with common algorithms such as simple text similarity, Levenshtein etc. is performed on each batch. This part is simple.
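To make it concrete, the loop looks roughly like this (a minimal sketch, assuming PDO, made-up table/column names such as strings.value and categories_strings.string_id, and PHP's built-in similar_text()/levenshtein() standing in for two of the comparison algorithms):

```php
<?php
// Hypothetical schema: strings(id, value), categories_strings(category_id, string_id)
$pdo = new PDO('mysql:host=localhost;dbname=app;charset=utf8mb4', 'user', 'pass');

$categoryId = 123;                       // category to compare against
$needle     = 'string to compare';       // the input string
$batchSize  = 10000;
$lastId     = 0;                         // seek-method cursor instead of OFFSET

$sql = "SELECT s.id, s.value
        FROM strings s
        JOIN categories_strings cs ON cs.string_id = s.id
        WHERE cs.category_id = :category
          AND s.id > :lastId
        ORDER BY s.id
        LIMIT $batchSize";
$stmt = $pdo->prepare($sql);

$results = [];
do {
    // Load the next batch of up to 10,000 strings.
    $stmt->execute(['category' => $categoryId, 'lastId' => $lastId]);
    $rows = $stmt->fetchAll(PDO::FETCH_ASSOC);

    // CPU-heavy part: compare the needle with every string in the batch.
    foreach ($rows as $row) {
        similar_text($needle, $row['value'], $percent);
        $distance = levenshtein($needle, $row['value']);
        $results[$row['id']] = ['similar' => $percent, 'lev' => $distance];
        $lastId = $row['id'];            // advance the seek cursor
    }
} while (count($rows) === $batchSize);
```

The condition s.id > :lastId combined with ORDER BY s.id is the seek method that replaced the OFFSET-based pagination.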

The question starts here.

On my laptop (Lenovo X230) the whole process for, say, 250k compared strings takes 7.4 seconds to load the SQL results, 13.3 seconds to compare all rows, and then 0.1 seconds for sorting and transforming for the view.

I also have a small dedicated server, with the same PHP version and the same MySQL. The web server doesn't matter, as I run the script from the command line right now. While on my laptop it takes roughly 20 seconds in total, on the server it takes... 120 seconds.

So, what is the most important factor affecting execution time for a long-running PHP program? All I can think of is the CPU, which on the dedicated server is worse: an Intel(R) Atom(TM) CPU N2800 @ 1.86GHz. Memory consumption is pretty low, about 2-4%. CPU usage, however, is around 60% on my laptop and 99.7-100% on the server.

Is the CPU the most important factor in this case? Is there any way to split the work, for example into several processes which in total would take less time? Besides that, how can I monitor CPU usage and find out which part of the script is the most expensive?
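To show what I mean by splitting: something like the sketch below, forking a few worker processes on the CLI (this assumes the pcntl extension is available; runWorker() is a hypothetical placeholder for the batch loop above, with each worker handling only every N-th batch). As for finding the most expensive part, I could wrap each stage (query, compare, sort) in hrtime() / getrusage() calls and compare wall-clock time against CPU time, or run the script under a profiler such as Xdebug's profiler.

```php
<?php
// Rough sketch of splitting the comparison across several processes.
// Assumes the pcntl extension (CLI only); runWorker() is a placeholder.

$workers    = 4;       // e.g. one per CPU core
$categoryId = 123;     // hypothetical category id
$pids       = [];

for ($i = 0; $i < $workers; $i++) {
    $pid = pcntl_fork();
    if ($pid === -1) {
        exit("fork failed\n");
    }
    if ($pid === 0) {
        // Child process: it must open its own DB connection (a connection
        // does not survive a fork) and write its results to a file or other
        // IPC mechanism so the parent can merge them afterwards.
        runWorker($i, $workers, $categoryId);
        exit(0);
    }
    $pids[] = $pid;     // parent keeps track of the child
}

// Parent: wait for all children to finish before merging results.
foreach ($pids as $pid) {
    pcntl_waitpid($pid, $status);
}
```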

ficus
    Does the report need to be generated in real-time? You could have it prepare the report offline as a cron-job running, say, every 20 minutes and the interactive portion simply fetches the pre-prepared report. Without more details of exactly what your data looks like, how you're processing it and what you're attempting to achieve it's difficult to give more than vague suggestions – GordonM Jan 10 '19 at 09:44
  • The report will be part of a service; each user will be able to compare any string in any category. Then, of course, I will queue all requests, fire some events on finish etc. The main question for now is how to inspect the CPU usage of such a script, and whether I am correct that the CPU has a huge impact on a long-running PHP process, i.e. whether I should invest in a server with a better CPU and, say, less memory (just for the comparing service). – ficus Jan 10 '19 at 09:50
  • What fields are in the tables, what's the query and what indexes are in place for the tables involved? – SpacePhoenix Jan 10 '19 at 10:11
  • @SpacePhoenix The pivot table contains two id columns forming a primary key. In the strings table there is a primary key and unique keys on the two columns which are included in the query. The query is pretty much explained above; it's really as simple as it can be: select two columns, join the pivot, add WHERE category = id of category and limit the result to 10,000 rows. I am quite happy with the DB performance though. It can probably be slightly improved, but still, I don't think the DB matters for the question itself. Thank you for your answer :-) – ficus Jan 10 '19 at 11:04
  • For each string are you running the query a single time, then having PHP do the checking? – SpacePhoenix Jan 10 '19 at 19:34
  • No, there is one query for every 10,000 rows (strings). Then I compare each string in the batch; when finished, another query is executed (the next 10,000 rows) and so on. A simple loop with query chunks. – ficus Jan 11 '19 at 08:25
  • When the required search string is found within the string, using a given method, do you then move to the next string or continue using each method to search the string? If you do it in chunks of 5,000 strings, what effect does it have on CPU usage, memory usage and time to complete the task (also give it a try in chunks of 2,500 strings)? How often do strings get added or modified? – SpacePhoenix Jan 13 '19 at 03:17
  • I tried smaller and bigger chunks. It works best at ~10k; with smaller chunks the total comparing time is almost the same but the database work takes longer. It goes like this: get a batch of 10k, compare every string with 4 algorithms, save the results to an array, get another 10k, repeat. A different chunk size doesn't affect the CPU much; at first it was done in just one query and the total time was only slightly bigger than now. The hardest part for the CPU is the comparing process. Strings are added every day; they never change - only some of their attributes sometimes do. – ficus Jan 14 '19 at 08:40
