
I am trying to process millions of records from my table (about 30 GB in size) and I am currently doing it using paging (MySQL 5.1.36). The query I use in my for loop is:

select blobCol from large_table 
where name= 'someKey' and city= 'otherKey' 
order by name
LIMIT <pageNumber*pageSize>, <pageSize>

This works perfectly fine for about 500K records. I am using a page size of 5000, and after page 100 the queries start slowing down dramatically. The first ~80 pages are extracted in 2-3 seconds each, but from around page 130 onward each page takes about 30 seconds to retrieve, at least until page 200. One of my queries has about 900 pages, and that would take too long.
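For reference, the paging loop looks roughly like the sketch below, using Hibernate's createSQLQuery API (the one mentioned in the comments); the class and variable names are only illustrative.

    import java.util.List;

    import org.hibernate.SQLQuery;
    import org.hibernate.Session;

    // Sketch of the paging loop described above. Uses Hibernate's native
    // createSQLQuery API; class and variable names are illustrative.
    public class BlobPager {
        private static final int PAGE_SIZE = 5000;

        public static void processAll(Session session, String nameKey, String cityKey) {
            for (int page = 0; ; page++) {
                SQLQuery q = session.createSQLQuery(
                        "select blobCol from large_table"
                        + " where name = :name and city = :city"
                        + " order by name");
                q.setString("name", nameKey);
                q.setString("city", cityKey);
                q.setFirstResult(page * PAGE_SIZE);  // becomes the LIMIT offset
                q.setMaxResults(PAGE_SIZE);          // becomes the LIMIT page size
                List<?> rows = q.list();
                if (rows.isEmpty()) {
                    break;                           // past the last page
                }
                for (Object blob : rows) {
                    // process one LONGBLOB value here
                }
            }
        }
    }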

The table structure is (type is MyISAM)
    name char(11)
    id int             // name & id form a composite key
    city varchar(80)   // indexed
    blobCol longblob

What can I do to speed it up? The EXPLAIN for the query shows this:

select_type: SIMPLE
possible_keys: city
key: city
type: ref
key_len: 242
ref: const
rows: 4293720
Extra: Using where; Using filesort

In case it helps, the my.cnf for my server (24 GB RAM, 2 quad-core processors) has these entries:

  key_buffer_size = 6144M
  max_connections = 20
  max_allowed_packet = 32M
  table_open_cache = 1024
  sort_buffer_size = 256M
  read_buffer_size = 128M
  read_rnd_buffer_size = 512M
  myisam_sort_buffer_size = 128M
  thread_cache_size = 16
  tmp_table_size = 128M
  max_heap_table_size = 64M
randomThought
  • I would suggest looking into Hibernate projections (http://docs.jboss.org/hibernate/core/3.3/reference/en/html/querycriteria.html#querycriteria-projection). Also found this on SO, which does not use projections: http://stackoverflow.com/questions/168084/is-there-a-more-efficient-way-of-making-pagination-in-hibernate-than-executing-se – Shahzeb Sep 13 '11 at 01:02
  • I am running this through the createSQLQuery API, not createQuery, which uses Hibernate entities. I tried using the createCriteria methods and it caused my JVM to throw out-of-memory exceptions since it was keeping all that data in memory. This at least works, though it is a tiny bit slow – randomThought Sep 13 '11 at 01:07

3 Answers


Here is what I did; it reduced the total execution time by a factor of 10.

What I realized from the execution plan of my original query was that it was using filesort to sort all the results and ignoring the indexes. That is a bit of a waste.

My test database: 5M records, 20 GB. Table structure is the same as in the question.

Instead of getting blobCol directly in the first query, I first get the value of 'name' at the beginning of every page. Run this query repeatedly until it returns 0 results, and each time add the result to a list:

SELECT name
FROM my_table
where id = <anyId> // I use the id column for partitioning so I need this here
order by name
limit <pageSize * pageNumber>, 1

Since the page number is not known in advance, start with value 0 and keep incrementing until the query returns nothing. You could also do a select count(*), but that itself might take long and will not help optimize anything. Each query took about 2 seconds to run once the page number exceeded ~60.

For me, the page size was 5000 so I got a list of 'name' strings at position 0, 5001, 10001, 15001 and so on. The number of pages turned out to be 1000 and storing a list of 1000 results in memory is not expensive.
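As a sketch, this boundary-collection step could look like the following in Java (again via Hibernate's createSQLQuery; the class and variable names are illustrative):

    import java.util.ArrayList;
    import java.util.List;

    import org.hibernate.SQLQuery;
    import org.hibernate.Session;

    // Sketch of the boundary-collection step: fetch the 'name' value that starts
    // each page until no row comes back. Identifiers are illustrative.
    public class PageHeaderCollector {
        private static final int PAGE_SIZE = 5000;

        public static List<String> collect(Session session, int partitionId) {
            List<String> headers = new ArrayList<String>();
            for (int page = 0; ; page++) {
                SQLQuery q = session.createSQLQuery(
                        "select name from my_table where id = :id order by name");
                q.setInteger("id", partitionId);
                q.setFirstResult(page * PAGE_SIZE);  // offset of the page's first row
                q.setMaxResults(1);                  // only the boundary row is needed
                Object header = q.uniqueResult();
                if (header == null) {
                    break;                           // past the last page
                }
                headers.add((String) header);
            }
            return headers;
        }
    }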

Now, iterate through the list and run this query

SELECT blobCol
FROM my_table
where name >= <pageHeader>
and name < <nextPageHeader>
and city="<any string>"
and id= 1

This will run N times, where N = size of list obtained previously. Since 'name' is the primary key col, and 'city' is also indexed, EXPLAIN shows that this calculation is performed in memory using the index.

Now, each query takes 1 second to run, instead of the original 30-40. So combining the pre-processing time of 2 seconds per page, total time per page is 3-4 seconds instead of 30-40.
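Put together, the second step could look roughly like this (a sketch; 'headers' is the list built above, and the identifiers are illustrative):

    import java.util.List;

    import org.hibernate.SQLQuery;
    import org.hibernate.Session;

    // Sketch of the second step: walk consecutive page-boundary names and fetch
    // each page with a range condition on 'name' instead of a large OFFSET.
    public class PageFetcher {

        public static void fetchAll(Session session, List<String> headers,
                                    String city, int partitionId) {
            for (int i = 0; i < headers.size(); i++) {
                boolean hasUpper = i + 1 < headers.size();  // last page has no upper bound
                String sql = "select blobCol from my_table"
                        + " where name >= :lower"
                        + (hasUpper ? " and name < :upper" : "")
                        + " and city = :city and id = :id";
                SQLQuery q = session.createSQLQuery(sql);
                q.setString("lower", headers.get(i));
                if (hasUpper) {
                    q.setString("upper", headers.get(i + 1));
                }
                q.setString("city", city);
                q.setInteger("id", partitionId);
                for (Object blob : q.list()) {
                    // process one LONGBLOB value here
                }
            }
        }
    }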

If anyone has a better solution, or if there is something glaringly wrong with this one, please let me know.

randomThought

You can make your query more exact so the limit is lower.

SELECT name, id, blobCol
FROM large_table
WHERE name > 'SomeKey' OR
      (name = 'SomeKey' AND id >= 'OtherKey')
ORDER BY name, id
LIMIT PageSize

but update "SomeKey" and "OtherKey" after each database call, setting them to the name and id values from the last row of the previous page.
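A sketch of what that could look like in Java (using Hibernate's createSQLQuery as in the question; the seed values and all identifiers are illustrative):

    import java.util.List;

    import org.hibernate.SQLQuery;
    import org.hibernate.Session;

    // Sketch of keyset paging: remember the last (name, id) seen and pass it
    // into the next query instead of using a growing OFFSET.
    public class KeysetPager {
        private static final int PAGE_SIZE = 5000;

        public static void processAll(Session session) {
            String lastName = "";              // assumed to sort before any real name
            int lastId = Integer.MIN_VALUE;    // assumed to sort before any real id
            while (true) {
                SQLQuery q = session.createSQLQuery(
                        "select name, id, blobCol from large_table"
                        + " where name > :name or (name = :name and id > :id)"
                        + " order by name, id");
                q.setString("name", lastName);
                q.setInteger("id", lastId);
                q.setMaxResults(PAGE_SIZE);    // LIMIT PageSize, no offset
                List<?> rows = q.list();
                if (rows.isEmpty()) {
                    break;
                }
                for (Object rowObj : rows) {
                    Object[] row = (Object[]) rowObj;
                    // row[0] = name, row[1] = id, row[2] = blobCol
                }
                Object[] last = (Object[]) rows.get(rows.size() - 1);
                lastName = (String) last[0];
                lastId = ((Number) last[1]).intValue();
            }
        }
    }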

psr
  • What do you mean by "the limit is lower"? And I need to query results with city, which is indexed but is not one of the primary key columns. Wouldn't > make it match more than the intended string, and how is it better than just putting =? – randomThought Sep 13 '11 at 01:21
  • Where you use LIMIT PageSize instead of LIMIT <offset>, <pagesize>, because each time you pass the key you left off at. In this case you pass 2 values because your comment says it's a 2-column composite key. I'm assuming it's a clustered index on the primary key and not hashed, though. If it's hashed you have to do WHERE city > 'somecity' OR (city = 'somecity' AND name > 'SomeKey') OR (city = 'somecity' AND name = 'SomeKey' AND id > 'OtherKey'). This ought to be very fast if you have the ability to pass new values of 'somecity', 'SomeKey' and 'OtherKey' for each page. – psr Sep 13 '11 at 16:57

I've tried the same in the past with an Oracle 10g database and got the same result (my table had 60 million rows). The first pages were retrieved quickly, but as the page number increased the query got too slow. There's not much you can do with the indexes, as they look correct, and I'm not sure what you can achieve by tuning the database configuration. I guess I had different requirements, but the only solution I found was to dump the data to files. If you have a limited set of values for name, you can get rid of that column and generate n tables, one for each known value of name. If the name values are unknown, then I don't know the solution to this.

You can retrieve small sets of data from very large tables, but retrieving large sets of data takes a lot of time and pagination doesn't help you at all. You have to preprocess, by dumping to files or by generating other tables to partition the data.
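A minimal sketch of that per-value-table preprocessing, assuming the set of name values is small, known, and trusted (the values are concatenated into the DDL here); plain JDBC, and all identifiers are illustrative:

    import java.sql.Connection;
    import java.sql.Statement;
    import java.util.List;

    // Sketch: split large_table into one table per known 'name' value so each
    // table can be scanned without the name column. Assumes the values are
    // trusted, since they are concatenated into the statement text.
    public class TableSplitter {

        public static void split(Connection conn, List<String> knownNames) throws Exception {
            Statement st = conn.createStatement();
            try {
                for (String name : knownNames) {
                    String target = "large_table_" + name;   // e.g. large_table_someKey
                    st.executeUpdate("create table " + target
                            + " select id, city, blobCol from large_table"
                            + " where name = '" + name + "'");
                }
            } finally {
                st.close();
            }
        }
    }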

martincho
  • Every name value is unique for this recordset, unfortunately. The table is partitioned on the int id column (~20 partitions), and within every partition the name column is unique. I have only tested this with 1 partition so far. I'm not sure what's going to happen to performance once other partitions start getting filled in. – randomThought Sep 13 '11 at 02:04
  • MySQL has ORDER BY optimizations, but they don't work if the WHERE and the ORDER BY keys are different, as in my case. – randomThought Sep 13 '11 at 02:08
  • Yeah, whatever solution you find, the key is to spend time preprocessing somehow. This looks simple but it is not at all. – martincho Sep 13 '11 at 02:19
  • Actually, I did just find a solution that reduces retrieval time to 5 seconds instead of 30. I'll post it once I run it over my data set and verify it's actually working. – randomThought Sep 13 '11 at 02:22
  • I added my solution. It might work well for your use case as well if the index requirements are similar – randomThought Sep 13 '11 at 06:41