I need to profile a web application from a system performance standpoint and find out, for typical user actions (accessing the home page, logging in, ...), where the delay comes from. The website uses the following components:

- Apache (serving a PHP Drupal website) running on port 80 on the public interface
- MySQL (website database) running on port 3306 on the loopback
- Solr (used by the website for search) running on port 8983 on the loopback
- LDAP (used by the website for authentication) running on port 389 on the loopback
- Alfresco (used by the website as a DMS) running on port 8080 on the loopback

One solution would obviously be to trace the PHP code for the various calls that are made to Solr, LDAP, Alfresco, and MySQL. I'm no PHP expert, but it sounds like a time-consuming task.

As an alternative I was thinking the following: since communication between the web application (Apache) and the other components (Solr, LDAP, Alfresco, MySQL) occurs on the loopback, we can discard the network as a source of delay. Therefore, if we add up the time that connections between these components remain open, we get a pretty good idea of how long the web application spends on Solr, LDAP, Alfresco and MySQL. The good thing with this method is that, from a network standpoint, it doesn't make a difference whether it's LDAP, MySQL or something else. Accordingly, once you've found a way to measure the delay for one component, you can apply the same method to the others.

Q1: Do you think measuring network connection times is a good way of identifying where delays are coming from, or are there pitfalls (e.g. keep-alive)?

If so, one option could be to write a bash script which uses tcpdump in text output mode, greps for the [S] and [F]/[R] flags, and calculates the time difference between the start and end of each connection from the tcpdump timestamps. The good thing is that the script could spit out a table with a summary of the time spent on each component, and the script could be re-used for every test. The bad thing is that writing the script could take a long time; a rough attempt is sketched below.
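To make the idea concrete, here is a rough sketch of such a script. The "Flags [S]," line format and the awk field positions match reasonably recent tcpdump versions, but verify them against yours; the interface (lo) and the packet count are assumptions too, and it must run as root:

    #!/bin/bash
    # Sum up, per destination port, how long TCP connections stayed open.
    # -c bounds the capture so the summary actually runs; replace it with
    # "-r trace.pcap" to post-process a capture saved earlier with -w.
    tcpdump -l -tt -n -c 1000 -i lo \
        'tcp[tcpflags] & (tcp-syn|tcp-fin|tcp-rst) != 0' 2>/dev/null |
    awk '
      {
        ts = $1; src = $3; dst = $5; sub(/:$/, "", dst); flags = $7
        if (flags ~ /S/ && flags !~ /\./) {
          # initial SYN (no ACK bit): remember when the connection opened
          key = src "-" dst
          start[key] = ts
          port = dst; sub(/.*\./, "", port); portof[key] = port
        } else if (flags ~ /[FR]/) {
          # first FIN or RST seen: add the open time to the per-port total
          key = src "-" dst; rkey = dst "-" src
          if (key in start)       { total[portof[key]]  += ts - start[key];  delete start[key]  }
          else if (rkey in start) { total[portof[rkey]] += ts - start[rkey]; delete start[rkey] }
        }
      }
      END {
        print "dst_port  total_open_seconds"
        for (p in total) printf "%-9s %.3f\n", p, total[p]
      }'

Because the totals are keyed by destination port, the same script covers MySQL (3306), Solr (8983), LDAP (389) and Alfresco (8080) without any per-protocol logic.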

Q2: Is there anything to bear in mind if using that tcpdump script option?

The other option I can think of is to use tcpdump in binary mode with -w, then review the trace manually in Wireshark by setting the Time Display Format to "Seconds Since Previous Displayed Packet", filtering on the TCP SYN and FIN/RST flags, and adding up the times manually. The good thing about this option is that I have a binary trace I can review for other things, and I'm less likely to get the time calculation wrong. The bad thing is that I will have to repeat the time calculation manually for every trace I make.
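If the trace is saved with -w anyway, a middle ground is tshark (Wireshark's command-line counterpart), which can print per-conversation statistics, including each connection's duration, so the adding-up is automatic (the capture file name is a placeholder):

    # Per-TCP-conversation summary (packets, bytes, duration) from a saved capture.
    tshark -r trace.pcap -q -z conv,tcp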

Q3: Is there anything to bear in mind if using that wireshark option?

Q4: Can you think of more ways of finding out network connection times?

Max
  • This time no other comment from me, but I must say this: it seems you have a very heavy software stack. Drupal is not very lightweight. Solr can be a resource beast. Alfresco sure requires horsepower. LDAP can be either fast or slow, depending on which LDAP server you use and how you have configured it. MySQL can be slow if proper indexes & other tuning are not in place. I hope you have a powerful server; otherwise it might not be any individual piece of software causing the lag, but the overall work your server constantly has to do. – Janne Pikkarainen Nov 29 '11 at 13:52
  • I know it's bad to have all this on one box. We have plans to move to a better infrastructure with all components on separate servers. Do you think the tcpdump/Wireshark ideas are good? – Max Nov 29 '11 at 15:23

1 Answer


The only way to really do this, whether local or remote, is to profile each component:

  1. Have Drupal print the start and end times of each page generation.
  2. Have MySQL print the time for each query (see the sketch after this list).
  3. Print "start" and "end" times before/after each Solr, LDAP, or Alfresco request.

Delays attributable to Drupal/Apache are equal to the total time minus the delay of each sub-component.
Delays attributable to MySQL are equal to the query duration plus a (relatively) constant factor.
Delays attributable to each other component are equal to their end/out time minus their start/in time.
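The total time in the first line can be measured from the outside without touching any code, e.g. with curl's built-in timers (the URL is a placeholder); everything the sub-components add is contained in time_total:

    # Client-side view of one request: connection setup, first byte, total.
    curl -s -o /dev/null \
         -w 'connect: %{time_connect}s  first byte: %{time_starttransfer}s  total: %{time_total}s\n' \
         http://localhost/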

Yes, doing this will in fact make your application EVEN SLOWER, but it will also point out the components that are responsible for the most lag, so you can look deeper into them and optimize.

Note that running everything on the same system also introduces resource (CPU, disk, RAM) contention issues, which you should check before even thinking about profiling: your results will never be accurate if your system is at or over its working capacity.


Delays attributable to the network (especially connections to "yourself") are almost certainly negligible -- pursue these only after eliminating all other possibilities.

voretaq7