
We're running a site for booking salmon fishing licenses. The site has no problem handling the traffic 364 days a year. The 365th day is when the license sale opens, and that's where the problem occurs. The servers are struggling more and more each year due to increased traffic, and we have to further optimize our booking query.

The licenses are divided into many different types (tbl_license_types), and each license type is connected to one or more fishing zones (tbl_zones).

Each license type can have a seasonal quota, which is a single value set as an integer field in tbl_license_types.

Each zone can have a daily quota, a weekly quota and a seasonal quota. The daily quota is the same for all days, and the seasonal quota is of course a single value, so daily and seasonal quotas are integer fields in tbl_zones. The weekly quota, however, differs from week to week, so those values are stored in the separate table tbl_weekly_quotas.

Bookings can be for one or more consecutive dates, but are stored only as From_date and To_date in tbl_shopping_cart (and tbl_bookings). For each booking attempt made by a user, the quotas have to be checked against bookings already allowed in both tbl_shopping_cart and tbl_bookings.

To be able to count/group by date, we use a view called view_season_calendar with a single column containing all the dates of the current season.
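
For reference, here is a minimal sketch of the schema as described above. Only the table names, the From_date/To_date columns and the quota fields are given in the text; every other column name (license_type_id, zone_id, num_licenses, season_date, week_no and so on) is a placeholder I've assumed for illustration, and the real tables of course hold more columns.

-- Placeholder schema sketch; column names other than From_date/To_date and
-- the quota fields are assumptions made for illustration only.
CREATE TABLE tbl_license_types (
    license_type_id INT PRIMARY KEY,
    seasonal_quota  INT                    -- single value per license type
);

CREATE TABLE tbl_zones (
    zone_id        INT PRIMARY KEY,
    daily_quota    INT,                    -- same limit for every day
    seasonal_quota INT
);

CREATE TABLE tbl_weekly_quotas (
    zone_id      INT,
    week_no      INT,                      -- quota differs per week
    weekly_quota INT,
    PRIMARY KEY (zone_id, week_no)
);

-- tbl_shopping_cart has the same booking columns as tbl_bookings
CREATE TABLE tbl_bookings (
    booking_id      INT AUTO_INCREMENT PRIMARY KEY,
    license_type_id INT,
    zone_id         INT,                   -- simplification: one zone per row
    num_licenses    INT,
    From_date       DATE,
    To_date         DATE
);

-- view_season_calendar: a single column (assumed here to be named season_date)
-- containing every date of the current season.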

In the beginning we used a transaction where we first made a query to check the quotas, and if quotas allowed we would use a second query to insert the booking to tbl_bookings.

However, that gave a lot of deadlocks under relatively moderate traffic, so we redesigned it as a single query (pseudo-code):

INSERT INTO tbl_bookings (_booking_)
WHERE _lowest_found_quota_ >= _requested_number_of_licenses_

where _lowest_found_quota_ is a ~330-line SELECT with multiple subqueries, in which the related tables are joined multiple times in order to check all quotas.

Example: A user wants to book license type A for zones 5 and 6, from 2020-05-19 to 2020-05-25. The system needs to

  • count previous bookings of license type A against the license type A seasonal quota.
  • count previous bookings in zone 5 for each of the seven dates against the zone 5 daily quota.
  • do the same count for zone 6.
  • count previous bookings in zone 5 for each of the two weeks the dates fall in, against the zone 5 weekly quota.
  • do the same count for zone 6.
  • count all previous bookings in zone 5 for the current season against the zone 5 seasonal quota.
  • do the same count for zone 6.

If all are within quotas, insert the booking.
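
To illustrate in MySQL syntax, here is a simplified version of just the second of those checks (the zone 5 daily quota), using the placeholder column names from the schema sketch above. The production query combines all seven checks (plus the corresponding counts from tbl_shopping_cart) into the single _lowest_found_quota_ expression inside the guarded INSERT ... SELECT.

-- Headroom left under the zone 5 daily quota for the requested dates.
-- Column names are the assumed placeholders from the schema sketch.
SELECT MIN(per_day.headroom) AS lowest_daily_headroom
FROM (
    SELECT z.daily_quota - IFNULL(SUM(b.num_licenses), 0) AS headroom
    FROM view_season_calendar AS cal
    JOIN tbl_zones AS z
          ON z.zone_id = 5
    LEFT JOIN tbl_bookings AS b
          ON b.zone_id = 5
         AND cal.season_date BETWEEN b.From_date AND b.To_date
    WHERE cal.season_date BETWEEN '2020-05-19' AND '2020-05-25'
    GROUP BY cal.season_date, z.daily_quota
) AS per_day;

-- The booking is allowed only if lowest_daily_headroom (and the six other
-- headroom values) is >= the requested number of licenses.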

As I said, this was working well earlier, but due to the higher traffic load we now need to optimize this query further. I have some thoughts on how to do this:

  1. Using isolation level READ UNCOMMITTED on each booking until the quotas for the requested zones/license type are nearly full, then falling back to the default REPEATABLE READ. As long as there is plenty of quota left, the count doesn't need to be 100% correct. This will greatly reduce lock waits and deadlocks, right?
  2. Creating one or more views that keep a count of all bookings for each date, week, zone and license type, then using those views in the WHERE clauses of the insert (see the sketch after this list).
  3. If doing no. 2, use READ UNCOMMITTED in the views. If the views report that the relevant quota is nearly full, cancel the INSERT and start a new one with the design we're using today. (Hopefully traffic levels will have come down before the quotas fill up.)
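
For idea no. 2, a minimal sketch of what such a counting view might look like (again using the placeholder column names assumed earlier); note that MySQL views are not materialized, so the counts are still computed at query time:

-- Per-zone, per-date booking counts (idea no. 2); similar views would be
-- needed per week, per season and per license type, and for tbl_shopping_cart.
CREATE VIEW view_zone_daily_counts AS
SELECT b.zone_id,
       cal.season_date,
       SUM(b.num_licenses) AS booked
FROM view_season_calendar AS cal
JOIN tbl_bookings AS b
      ON cal.season_date BETWEEN b.From_date AND b.To_date
GROUP BY b.zone_id, cal.season_date;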

I would greatly appreciate thoughts on how the query can be made as efficient as possible.

aanders77

2 Answers

  • See if you can upgrade the AWS instance starting a day before the season opens, then downgrade after the rush. It's a small price to pay for what might be a plenty big performance boost.

  • Rather than the long complex query for counting, decrement some counters as you go; a sketch follows this group of bullets. (This may or may not help, so play around with the idea.)

  • Your web server has some limit on the number of connections it will handle; limit that rather than letting 2K users get into MySQL and stumble over each other. Think of what a grocery store is like when the aisles are so crowded that no one is getting finished!

  • Be sure to use "transactions", but don't let them be too broad. If they encompass too many things, the traffic will degenerate to single file (and/or transactions will abort with deadlocks).

  • Do as much as you can outside of transactions -- such as collecting and checking user names/addresses, etc. If you do this after issuing the license, be ready to undo the license if something goes wrong. (This should be done in code, not via ROLLBACK.)
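
A minimal sketch of the counter idea (second bullet), assuming a hypothetical tbl_zone_day_counters table that is seeded before the season with one row per zone and date:

-- Hypothetical counter table -- seed it before the season opens with one row
-- per zone and date, remaining = that zone's daily quota.
CREATE TABLE tbl_zone_day_counters (
    zone_id      INT,
    booking_date DATE,
    remaining    INT,
    PRIMARY KEY (zone_id, booking_date)
);

-- Atomically claim 2 licenses for zone 5 on each requested date; rows that
-- no longer have enough headroom are simply not updated.
UPDATE tbl_zone_day_counters
SET remaining = remaining - 2
WHERE zone_id = 5
  AND booking_date BETWEEN '2020-05-19' AND '2020-05-25'
  AND remaining >= 2;

-- If ROW_COUNT() is less than the number of requested dates, at least one
-- day was full: add the licenses back and reject the booking.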

(More)

  • VIEWs are syntactic sugar; they do not provide any performance benefit or isolation. OTOH, if you make "materialized" views, there might be something useful.

  • A long "History list" is a potential performance problem (especially CPU). This can happen when lots of connections are in the middle of a transaction at the same time -- each needs to hang onto its 'snapshot' of the dataset.

  • Wherever possible, terminate transactions as soon as possible -- even if you turn around and start a new one. An example in Data Warehousing is to do the 'normalization' before starting the main transaction. (Probably this example does not apply to your app.)

  • Ponder having a background task computing the quotas. The hope is that the regular tasks can run faster by not having the computation inside their transactions.

  • A technique used in the reservation industry (and this sounds somewhat like your item 1): plow ahead with minimal locking. At the last moment, have the shortest possible transaction to make the reservation and verify that the room (plane seat, etc.) is still available. A sketch follows this group of bullets.

  • If the whole task can be split into (1) read some stuff, then (2) write (and reread to verify that the thing is still available), then... If the read step is heavier than the write step, add more Slaves ('replicas') and use them for the read step. Reserve the Master for the write step. Note that Replicas are easy (and cheap) to add and toss for a brief time.
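
A rough sketch of that reservation-industry pattern, reusing the hypothetical tbl_zone_day_counters table from the earlier sketch; everything else (cart handling, user details) happens outside this transaction:

START TRANSACTION;

-- Lock only the counter rows for the requested zone and dates.
SELECT booking_date, remaining
FROM tbl_zone_day_counters
WHERE zone_id = 5
  AND booking_date BETWEEN '2020-05-19' AND '2020-05-25'
FOR UPDATE;

-- Application code verifies every returned row has remaining >= 2; if so:
UPDATE tbl_zone_day_counters
SET remaining = remaining - 2
WHERE zone_id = 5
  AND booking_date BETWEEN '2020-05-19' AND '2020-05-25';

INSERT INTO tbl_bookings (license_type_id, zone_id, num_licenses, From_date, To_date)
VALUES (1, 5, 2, '2020-05-19', '2020-05-25');

COMMIT;   -- or ROLLBACK if any date was already full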

Rick James
  • These are good suggestions. I did upgrade the AWS a few days before (as we do every season), but apparently not enough this year. Regarding your last two bullets, we already did that some seasons ago, and I don't think we have much room for improvement there. I think maybe no. 3 is the most important one, because the single EC2 instance was cruising at 20-30% CPU while both RDS instances were struggling at 100%. – aanders77 May 22 '20 at 06:14
  • @aanders77 Additional information request - two sets - primary and read replica. RAM size, # cores, any SSD or NVME devices on MySQL Host server? Post on pastebin.com and share the links. From your SSH login root, Text results of: B) SHOW GLOBAL STATUS; after minimum 24 hours UPTIME C) SHOW GLOBAL VARIABLES; D) SHOW FULL PROCESSLIST; F) SHOW ENGINE INNODB STATUS; G) SELECT name, count FROM information_schema.innodb_metrics ORDER BY name; for server workload tuning analysis to provide suggestions. – Wilson Hauck May 23 '20 at 12:41
  • @WilsonHauck I've closed down the read replica now, but both DB servers were instance type db.t3.small (details here: https://aws.amazon.com/rds/instance-types/). Pastebin: https://pastebin.com/fPiQvM3s Thanks! – aanders77 May 25 '20 at 07:24
  • @aanders77 What is your storage type from this list? Amazon RDS provides three volume types to best meet the needs of your database workloads: General Purpose (SSD), Provisioned IOPS (SSD), and Magnetic. Analysis in process. Thanks – Wilson Hauck May 28 '20 at 11:54
  • Is that an innodb_buffer_pool_size of 1GB on a "server" with only 2GB of RAM? Dangerously high setting. Is it swapping? – Rick James May 28 '20 at 21:52
  • (Meanwhile, that pastebin is hard to use because key and value are on separate lines.) – Rick James May 28 '20 at 21:54
  • @WilsonHauck sorry, I had a few really busy days. The storage is general purpose SSD. – aanders77 May 29 '20 at 20:27
  • @RickJames yes, you are right about the innodb_buffer_pool_size. But I was monitoring the AWS console during the peak, and there was plenty of "freeable memory" left at all times, it never went below 828M. – aanders77 May 29 '20 at 20:30

Rate Per Second = RPS

Suggestions to consider for your AWS Parameter group

innodb_lru_scan_depth=100  # from 1024 to conserve 90% of CPU cycles used for function every second
innodb_flush_neighbors=2  # from 1 to reduce time required to lower innodb_buffer_pool_pages_dirty count when busy
read_rnd_buffer_size=128K  # from 512K to reduce handler_read_rnd_next RPS of 3,227
innodb_io_capacity=1900  # from 200 to use more of SSD IOPS
log_slow_verbosity=query_plan,explain  # to make slow query log more useful
have_symlink=NO  # from YES for some protection from ransomware 

You will find these changes will cause transactions to complete processing more quickly. For additional assistance, view my profile and Network profile for contact info and free downloadable Utility Scripts to assist with performance tuning. On our FAQ page you will find "Q. How can I find JOINS or QUERIES not using indexes?" to assist in reducing your select_scan RPhr of 1,173. Com_rollback averages 1 every ~2,700 seconds, usually correctable with consistent read order in maintenance queries.

Wilson Hauck