6

I have a web app that uses Guids as the PK in the DB for an Employee object and an Association object.

One page in my app returns a large amount of data showing all Associations all Employees may be a part of.

So right now, I am sending to the client essentially a bunch of objects that look like:

{assocation_id: guid, employees: [guid1, guid2, ..., guidN]}

It turns out that many employees belong to many associations, so I am sending down the same Guids for those employees over and over again in these different objects. For example, it is possible that I am sending down 30,000 total guids across all associations in some cases, of which there are only 500 unique employees.

I am wondering if it is worth me building some kind of lookup index that I also send to the client like

{ 1: Guid1, 2: Guid2 ... } 

and replacing all of the Guids in the objects I send down with those ints,

or if simply gzipping the response will compress it enough that this extra effort is not worth it?

Note: please don't get caught up in the details of if I should be sending down 30,000 pieces of data or not -- this is not my choice and there is nothing I can do about it (and I also can't change Guids to ints or longs in the DB).

Davis Dimitriov

6 Answers

6

You wrote the following at the end of your question:

Note: please don't get caught up in the details of if I should be sending down 30,000 pieces of data or not -- this is not my choice and there is nothing I can do about it (and I also can't change Guids to ints or longs in the DB).

I think that is your main problem. If you don't solve it, you may be able to reduce the size of the transferred data by a factor of 10, for example, but the main problem remains. Let us think about the question: why should so much data be sent to the client (to the web browser)?

The data on the client side is needed to display some information to the user. No monitor is large enough to show 30,000 items on one page, and no user is able to grasp so much information. So I am sure that you display only a small part of the information. In that case you should send only the small part that you display.

You don't describe how the GUIDs will be used on the client side. If you need the information during row editing, for example, you can transfer the data only when the user starts editing. In that case you need to transfer the data for only one association.

If you need to display the GUIDs directly, then you can't display all the information at once, so you can send the information for one page only. If the user starts to scroll or clicks the "next page" button, you can send the next portion of data. In this way you can dramatically reduce the size of the transferred data.

If you have no possibility of redesigning this part of the application, you can implement your original suggestion: by replacing a GUID such as "{7EDBB957-5255-4b83-A4C4-0DF664905735}" or "7EDBB95752554b83A4C40DF664905735" with a number like 123, you reduce the size of the GUID from 34 characters to 3. If you additionally send an array of "guid mapping" elements like

123:"7EDBB95752554b83A4C40DF664905735",

you can reduce the original size of the data from 30000*34 = 1020000 (about 1 MB) to 500*39 + 30000*3 = 19500+90000 = 109500 (about 110 KB), taking the 500 unique employees from the question and 39 characters per mapping element. So you can reduce the size of the data roughly 10 times. Compressing dynamic data on the web server can reduce the size further.
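
To make the idea concrete, here is a minimal sketch of building such a mapping on the server. It is written in Python purely for illustration (the question does not say which server technology is used), and the names `build_index`, `lookup` and `associations` are hypothetical. It assumes the records have the shape shown in the question.

```python
import json
import uuid


def build_index(records):
    """Replace each employee GUID with a small integer and return the mapping.

    `records` is assumed to be a list of
    {"assocation_id": guid, "employees": [guid, ...]} dicts, as in the question.
    """
    guid_to_id = {}
    compact = []
    for record in records:
        ids = []
        for guid in record["employees"]:
            # The first time a GUID is seen it gets the next free integer;
            # afterwards the already assigned integer is reused.
            ids.append(guid_to_id.setdefault(guid, len(guid_to_id) + 1))
        compact.append({"assocation_id": record["assocation_id"], "employees": ids})
    # Invert the mapping so the client can find a GUID by its integer.
    lookup = {short_id: guid for guid, short_id in guid_to_id.items()}
    return lookup, compact


# Tiny made-up sample: three employees shared by two associations.
employees = [uuid.uuid4().hex for _ in range(3)]
records = [
    {"assocation_id": uuid.uuid4().hex, "employees": [employees[0], employees[1]]},
    {"assocation_id": uuid.uuid4().hex, "employees": [employees[1], employees[2], employees[0]]},
]
lookup, compact = build_index(records)

# Note: json.dumps turns the integer keys of `lookup` into strings.
payload = json.dumps({"lookup": lookup, "associations": compact})
```

On a sample this small the mapping overhead can outweigh the saving; the gain only shows up when the same GUIDs repeat many times, as in the 30,000 versus 500 case from the question.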

In any case you should examine why your page is so slow. If the program works on a LAN, then transferring even 1 MB of data can be quick enough. Probably the page is slow while placing the data on the web page. I mean the following: if you modify some element on the page, the positions of all existing elements have to be recalculated. If you work with disconnected DOM objects first and then place the whole portion of data on the page at once, you can improve the performance dramatically. You didn't post in the question which technology you use in your web application, so I don't include any examples. If you use jQuery, for example, I could give an example that makes clearer what I mean.

Oleg
  • Sometimes the developer is given requirements they cannot change, despite the logic of an alternative approach. I think Davis is pretty clear in indicating that's the situation here. – Random Mar 27 '12 at 15:30
  • @Random: If one can change the format of the server response, for example by replacing the GUIDs with indexes into a `[Guid1, Guid2, ...]` array, then one *can* change the communication protocol between the server and the client. We know too little about the problem. I wanted to point out that transferring 30,000 GUIDs in total for one page is definitely *far more than is required to display the information on the page*. I suppose that if one analyses the problem more from this angle, one can reduce the size of the transferred data many times over. – Oleg Mar 27 '12 at 15:39
  • I don't necessarily disagree. And the information in your answer is useful. I'm only stating that since Davis appears to understand this also, it limits the applicability of your answer to his specific problem. – Random Mar 27 '12 at 15:49
  • @Oleg I agree with this in general, but it is 100% critical that 30,000 Guids be transferred on this page. I originally tried to limit this and received massive push back from clients. – Davis Dimitriov Mar 30 '12 at 13:49
  • @DavisDimitriov: Could you explain how the data will be used? You posted too little information. For example, I can imagine that the page has some editing capability. If the user selects a row in the table, he can click an edit button to edit it (or double-click the row, for example). Either way, when the user starts editing, or selects the row by double-clicking it, the client can make an Ajax request to get the data *for that row only from the server*. If the server can send the data quickly enough (less than 0.5 sec, for example), you will need to hold only the row id (`assocation_id`). – Oleg Mar 30 '12 at 15:23
  • @DavisDimitriov: What I just wrote is only an example. My experience is that customers care only about the performance and functionality of the client and have no interest in the details of the implementation. Nevertheless I try to be careful because I don't know your application at all. – Oleg Mar 30 '12 at 15:26
2

The lookup index you propose is nothing other than a "custom" compression scheme. As amdmax stated, this will improve your performance if you have a lot of repeated GUIDs, but so will gzip.

IMHO, the extra effort of writing the custom coding will not be worth it.

Oleg states correctly that it might be worth fetching the data only when the user needs it. But this of course depends on your specific requirements.

gzm0
1

if simply gzipping the response will compress it enough that this extra effort is not worth it?
The answer is: Yes, it will.

Compressing the data will remove redundant parts as well as it can (depending on the algorithm); they are restored on decompression.

To be sure, just generate the data both uncompressed and compressed and compare the results. You can count the duplicate GUIDs to calculate how big your data block would be with the dictionary compression method. But I guess gzip will do better because it can also compress the syntactic elements like braces, colons, etc. inside your data object.
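
The comparison suggested above can be sketched as follows, in Python for illustration only. The data is synthetic: 1000 associations with 30 members each, drawn from 500 employees, is an assumption chosen to match the 30,000 total and 500 unique GUIDs mentioned in the question. The helper names `make_records`, `with_lookup` and `sizes` are made up. Real data will compress differently, so treat the printed numbers as a method, not a result.

```python
import gzip
import json
import random
import uuid


def make_records(n_associations=1000, n_employees=500, members=30, seed=0):
    """Generate synthetic records: 1000 * 30 = 30,000 GUIDs, 500 of them unique."""
    rng = random.Random(seed)
    employees = [str(uuid.UUID(int=rng.getrandbits(128))) for _ in range(n_employees)]
    return [
        {
            "assocation_id": str(uuid.UUID(int=rng.getrandbits(128))),
            "employees": rng.sample(employees, members),
        }
        for _ in range(n_associations)
    ]


def with_lookup(records):
    """Replace employee GUIDs with small integers and attach the lookup table."""
    guid_to_id = {}
    compact = [
        {
            "assocation_id": r["assocation_id"],
            "employees": [guid_to_id.setdefault(g, len(guid_to_id) + 1) for g in r["employees"]],
        }
        for r in records
    ]
    lookup = {i: g for g, i in guid_to_id.items()}
    return {"lookup": lookup, "associations": compact}


def sizes(payload):
    """Return (raw size, gzipped size) in bytes of the compact JSON encoding."""
    raw = json.dumps(payload, separators=(",", ":")).encode("utf-8")
    return len(raw), len(gzip.compress(raw))


records = make_records()
plain_raw, plain_gz = sizes(records)
index_raw, index_gz = sizes(with_lookup(records))
print(f"plain:  {plain_raw:>9} bytes raw, {plain_gz:>9} bytes gzipped")
print(f"lookup: {index_raw:>9} bytes raw, {index_gz:>9} bytes gzipped")
```

Running the same four measurements against a real response from the page is what actually settles the question.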

Tarion
  • It turns out after doing some testing that it is about 50% more data to transfer the whole thing gzip'd than it is to do the dictionary compression. Unfortunately quite substantial – Davis Dimitriov Mar 30 '12 at 13:51
0

So what you are trying to accomplish is dictionary compression, right? http://en.wikibooks.org/wiki/Data_Compression/Dictionary_compression Instead of Guids, which are 16 bytes long, you will get ints, which are 4 bytes long, plus a dictionary of key-value pairs that associates each Guid with some int value, right? It will decrease your transfer time when many objects use the same id, but it will cost CPU time before the transfer to compress and after the transfer to decompress. So what is the amount of data you transfer? Is it MB / GB / TB? And is there any good reason to compress it before sending?

amdmax
  • Small integers **serialized as JSON** take far less space than a GUID, well under half. Compare `"{7EDBB957-5255-4b83-A4C4-0DF664905735}"` or `"7EDBB95752554b83A4C40DF664905735"` with `499` (34 or 3 characters). – Oleg Mar 21 '12 at 09:03
0

I do not know how dynamic your data is, but I would

  • on a first call send two directories/dictionaries mapping short ids to long GUIDs, one for your associations and one for your employees, e.g. {1: AssoGUID1, 2: AssoGUID2,...} and {1: EmpGUID1, 2:EmpGUID2,...}. These directories may also contain additional information on the Association and Employee instances; I suspect you do not simply display GUIDs

  • on subsequent calls just send the index of Employees per Association { 1: [2,4,5], 3:[2,4], ...}, the key being the association short id and the array values being the short ids of the employees. Given your description, building the reverse index (Employee to Associations) may give a smaller result, at the cost of more processing

Then it's all down to associative array manipulation, which is straightforward in JS.

Again, if your data is (very) dynamic server side, the two directories will soon be obsolete and maintaining synchronization may cost you a lot.
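
A rough sketch of the two directories, the index and the reverse index described above. It is in Python for illustration rather than JS, and the function names are made up. It assumes each association appears once in the input and that records have the shape shown in the question.

```python
def build_directories(records):
    """Build the two short-id directories and the association -> employees index."""
    assoc_ids, emp_ids = {}, {}
    index = {}
    for record in records:
        # Short ids are handed out in order of first appearance, starting at 1.
        a = assoc_ids.setdefault(record["assocation_id"], len(assoc_ids) + 1)
        index[a] = [emp_ids.setdefault(g, len(emp_ids) + 1) for g in record["employees"]]
    assoc_dir = {i: g for g, i in assoc_ids.items()}
    emp_dir = {i: g for g, i in emp_ids.items()}
    return assoc_dir, emp_dir, index


def reverse_index(index):
    """Turn association -> employees into employee -> associations."""
    by_employee = {}
    for assoc_id, employee_ids in index.items():
        for employee_id in employee_ids:
            by_employee.setdefault(employee_id, []).append(assoc_id)
    return by_employee


def expand(assoc_dir, emp_dir, index):
    """Client-side step: rebuild the original records from directories and index."""
    return [
        {"assocation_id": assoc_dir[a], "employees": [emp_dir[e] for e in emps]}
        for a, emps in index.items()
    ]
```

Whether the forward or the reverse index is smaller depends on the data, so it is worth serializing both and comparing.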

Bruno Grieder
0

I would start by answering the following questions:

What are the performance requirements? Are there size requirements? Speed requirements? What is the minimum performance that is truly needed?

What are the current performance metrics? How far are you from the requirements?

You characterized the data as possibly being mostly repeats. Is that the normal case? If not, what is?

The 2 options you listed above sound reasonable and trivial to implement. Try creating a look-up table and see what performance gains you get on actual queries. Try zipping the results (with look-ups and without), and see what gains you get.

In my experience, if you're not TOO far from the goal, meeting performance requirements is often a matter of trial and error.

If those options don't get you close to the requirements, I would take a step back and see if the requirements are reasonable in the time you have to solve the problem.

What you do next depends on which performance goals are not being met. If it is size, you're starting to be limited if you're required to send the entire association list every time. Is that truly a requirement? Can you send the entire list once, and then just updates?
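
If sending the full list once and then only updates turns out to be allowed, the update step could look roughly like this. It is a sketch in Python for illustration. The `added`/`removed` format and both function names are assumptions, not something from the question, and memberships are represented as a mapping from association id to a list of employee ids.

```python
def diff_memberships(previous, current):
    """Return, per association, the employee ids added and removed since `previous`."""
    changes = {}
    for assoc_id in previous.keys() | current.keys():
        before = set(previous.get(assoc_id, ()))
        after = set(current.get(assoc_id, ()))
        added, removed = sorted(after - before), sorted(before - after)
        if added or removed:
            changes[assoc_id] = {"added": added, "removed": removed}
    return changes


def apply_changes(previous, changes):
    """Client-side step: apply a diff to the previously received memberships."""
    updated = {a: set(e) for a, e in previous.items()}
    for assoc_id, change in changes.items():
        members = updated.setdefault(assoc_id, set())
        members.update(change["added"])
        members.difference_update(change["removed"])
    # Associations left with no members are dropped; member lists come back sorted.
    return {a: sorted(e) for a, e in updated.items() if e}
```

This only pays off if the memberships change slowly compared with how often the page is loaded.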

kevingallagher