I am trying to process millions of records from my table (about 30 GB) and I am currently doing it using paging (MySQL 5.1.36). The query I use in my for loop is
select blobCol from large_table where name= 'someKey' and city= 'otherKey' order by name LIMIT <pageNumber*pageSize>, <pageSize>
This works perfectly fine for about 500K records. I use a page size of 5000, and after page 100 the queries start slowing down dramatically. The first ~80 pages are extracted in 2-3 seconds each, but after around page 130 each page takes about 30 seconds to retrieve, at least until page 200. One of my queries has about 900 pages, and that would take far too long.
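The loop is essentially the following (a minimal, self-contained sketch in Python with an in-memory SQLite table standing in for the MySQL one; the data and page size are made up for illustration):

```python
import sqlite3

# Stand-in for the real table: same columns, tiny fake data set.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE large_table (name TEXT, id INTEGER, city TEXT, blobCol BLOB)")
conn.executemany(
    "INSERT INTO large_table VALUES ('someKey', ?, 'otherKey', ?)",
    [(i, b"blob-payload") for i in range(25)],
)

PAGE_SIZE = 10  # the real job uses 5000

page_number = 0
total = 0
while True:
    # Same LIMIT <pageNumber*pageSize>, <pageSize> shape as the MySQL query.
    rows = conn.execute(
        "SELECT blobCol FROM large_table "
        "WHERE name = 'someKey' AND city = 'otherKey' "
        "ORDER BY name LIMIT ?, ?",
        (page_number * PAGE_SIZE, PAGE_SIZE),
    ).fetchall()
    if not rows:
        break
    total += len(rows)  # process the page here
    page_number += 1

print(page_number, total)  # 3 pages, 25 rows
```

The slowdown comes from the growing OFFSET: MySQL still has to generate and discard all the rows before the requested page.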
The table structure is (engine is MyISAM):

name char(11)     // name & id form a composite key
id int
city varchar(80)  // indexed
blobCol longblob
What can I do to speed it up? The EXPLAIN for the query shows this:
select_type: SIMPLE
possible_keys: city
key: city
type: ref
key_len: 242
ref: const
rows: 4293720
Extra: Using where; Using filesort
In case it helps, the my.cnf for my server (24 GB RAM, two quad-core processors) has these entries:
key_buffer_size = 6144M
max_connections = 20
max_allowed_packet = 32M
table_open_cache = 1024
sort_buffer_size = 256M
read_buffer_size = 128M
read_rnd_buffer_size = 512M
myisam_sort_buffer_size = 128M
thread_cache_size = 16
tmp_table_size = 128M
max_heap_table_size = 64M
Here is what I did; it reduced the total execution time by a factor of 10.
What I realized from the execution plan of my original query was that it was using filesort to sort all results and ignoring the indexes. That is a bit of a waste.
My test database: 5M records, 20 GB in size. Table structure is the same as in the question.
Instead of getting blobCol directly in the first query, I first get the value of 'name' for the beginning of every page. Run this query repeatedly until it returns no results, and each time add the result to a list:
SELECT name FROM my_table WHERE id = <anyId> -- I use the id column for partitioning, so I need it here
ORDER BY name LIMIT <pageSize * pageNumber>, 1
Since the page number is not known in advance, start with 0 and keep incrementing until the query returns no rows. You could also do a SELECT COUNT(*), but that itself might take long and will not help optimize anything. Each query took about 2 seconds to run once the page number exceeded ~60.
For me, the page size was 5000, so I got the 'name' strings at offsets 0, 5000, 10000, and so on. The number of pages turned out to be 1000, and storing a list of 1000 results in memory is not expensive.
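The boundary-collection step can be sketched as follows (Python with an in-memory SQLite table as a stand-in for MySQL; the table contents and page size are invented for the demo):

```python
import sqlite3

# Hypothetical miniature version of the table, just to show the loop shape.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE my_table (name TEXT, id INTEGER, city TEXT, blobCol BLOB)")
conn.executemany(
    "INSERT INTO my_table VALUES (?, 1, 'X', ?)",
    [(f"name{i:04d}", b"payload") for i in range(23)],
)

PAGE_SIZE = 10  # 5000 in the real run

# Collect the 'name' at the start of each page; stop when a page is empty.
page_headers = []
page_number = 0
while True:
    row = conn.execute(
        "SELECT name FROM my_table WHERE id = 1 "
        "ORDER BY name LIMIT ?, 1",
        (page_number * PAGE_SIZE,),
    ).fetchone()
    if row is None:
        break
    page_headers.append(row[0])
    page_number += 1

print(page_headers)  # ['name0000', 'name0010', 'name0020']
```

Each iteration fetches only a single small 'name' value, so even the deep-offset queries are far cheaper than pulling whole blob pages.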
Now, iterate through the list and run this query:
SELECT blobCol FROM my_table
WHERE name >= <pageHeader> AND name < <nextPageHeader> AND city = '<any string>' AND id = 1
This will run N times, where N is the size of the list obtained previously. Since 'name' is the primary key column and 'city' is also indexed, EXPLAIN shows that this lookup is performed in memory using the index.
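The second pass looks like this (again a Python/SQLite sketch with made-up data; note that the last page has no next header, so the sketch handles it with only a lower bound, which the answer above does not spell out):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE my_table (name TEXT, id INTEGER, city TEXT, blobCol BLOB)")
conn.executemany(
    "INSERT INTO my_table VALUES (?, 1, 'X', ?)",
    [(f"name{i:04d}", b"payload") for i in range(23)],
)

# Page headers as collected in step 1 (offsets 0, 10, 20 for this demo data).
page_headers = ["name0000", "name0010", "name0020"]

fetched = 0
for i, header in enumerate(page_headers):
    if i + 1 < len(page_headers):
        # Range scan between two known keys -- this is what lets the
        # index be used instead of a filesort over a huge OFFSET.
        rows = conn.execute(
            "SELECT blobCol FROM my_table "
            "WHERE name >= ? AND name < ? AND city = 'X' AND id = 1",
            (header, page_headers[i + 1]),
        ).fetchall()
    else:
        # Last page: no next header, so only a lower bound.
        rows = conn.execute(
            "SELECT blobCol FROM my_table "
            "WHERE name >= ? AND city = 'X' AND id = 1",
            (header,),
        ).fetchall()
    fetched += len(rows)

print(fetched)  # all 23 rows retrieved across 3 pages
```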
Now each query takes about 1 second to run instead of the original 30-40. Adding the pre-processing time of ~2 seconds per page, the total time per page is 3-4 seconds instead of 30-40.
If anyone has a better solution, or if there is something glaringly wrong with this one, please let me know.