I <3 Indexes!
Posted by nuxi on 2009-Nov-30 at 21:04:58 in Computers (Login to reply)
My friend recently asked for some assistance optimizing some SQL. You might want to read the code on his page, I don't have a pretty syntax highlighter in my blog. I don't usually use MySQL because I prefer Postgres, but the offer a free beer prompted me to take a peek anyway.
His database schema was as follows:
CREATE TABLE IF NOT EXISTS `MyDatabase`.`TRK` (
`id` int( 11 ) NOT NULL AUTO_INCREMENT PRIMARY KEY ,
`latitude` float NULL ,
`longitude` float NULL ,
`altitude` float NULL ,
`date` date NULL ,
`time` time NULL
) ENGINE = MYISAM DEFAULT CHARSET latin1 AUTO_INCREMENT=1 ;
CREATE UNIQUE INDEX lldt ON `MyDatabase`.`TRK` (`latitude`,`longitude`,`date`,`time`);
And the query that needed help is:
SELECT latitude, longitude, date, time, CONCAT(date, ' ', time) AS trk_dt_0,
'$dat1 $tim1' AS trk_dt_1 FROM TRK WHERE UNIX_TIMESTAMP(CONCAT(date, ' ', time)) >
UNIX_TIMESTAMP('$dat1 $tim1') ORDER BY UNIX_TIMESTAMP(trk_dt_0) ASC LIMIT 1;
Generally speaking the first step of optimization should be ensuring that the WHERE clause of your query is using an index. In this case, the WHERE clause is not even directly working on fields of the database. A short description of how this query runs would be:
1. Retrieve a row from the database
2. Combine the date and time fields
3. Compare them against the specified value
4. If it matches add to the result set
5. Repeat until end of database reached
6. Order the results
7. Select the first one
So the first step in optimizing this query is to just combine date and time fields in the database. The two halves can easily be seperated in the PHP application if necessary.
CREATE TABLE IF NOT EXISTS `MyDatabase`.`TRK` (
`id` int( 11 ) NOT NULL AUTO_INCREMENT PRIMARY KEY ,
`latitude` float NULL ,
`longitude` float NULL ,
`altitude` float NULL ,
`datetime` datetime NULL ,
) ENGINE = MYISAM DEFAULT CHARSET latin1 AUTO_INCREMENT=1 ;
CREATE UNIQUE INDEX lldt ON `MyDatabase`.`TRK` (`latitude`,`longitude`,`datetime`);
Obviously this had to be done through a more complex conversion script for the live database. You would do that by adding the new column, running a script to copy the data from the seperate date and time fields into the combined field, and then delete the two old columns. So the new SQL query looks like this:
SELECT latitude, longitude, datetime, '$dat1 $tim1' AS trk_dt_1 FROM TRK WHERE
datetime > '$dat1 $tim1' ORDER BY datetime ASC LIMIT 1;
At the very least we have less typing for the programmer, but this alone apparently brought the runtime from over 11 hours to 45 minutes! Consider the steps now used:
1. Retrieve a row from the database
2. Compare the datetime against the specified value
3. If it matches add to the result set
4. Repeat until end of database reached
5. Order the results
6. Select the first one
The next step in optimization is to actually add our index to the datetime field.
CREATE UNIQUE INDEX dt ON `MyDatabase`.`TRK` (`datetime`);
Now the WHERE clause of the SQL query operates against the index instead of the full table. Searching an index is much faster than searching the database as a whole. A full row only needs to be retrieved if it is part of the result set.
So now the steps taken by the database engine are:
1. Retrieve a row from the index
2. Compare the datetime against the specified limit
3. If it matches add to full row from the database to the result set
4. Repeat until end of index reached
5. Order the results
6. Select the first one
Another benefit in this case is that our results are also sorted by the index and generally speaking the index itself is sorted...
1. Retrieve a row from the index
2. Compare the datetime against the specified value
3. If it matches, return the full row from the database and exit
4. Repeat until end of index reached
Now the first row that matches will cause the search to just exit. This optimization brought the overall runtime from 45 minutes to 3 seconds! Thats right, it apparently ran in 3 seconds.
My friend Peter suggested another optimization that he has told me about in the past. One SELECT query that returns 10 rows is much faster than 10 SELECT queries that return one row each. The application in this case is actually requesting every row of the database twice! First it requests every database row, and then for each one it requests the row immediately following it chronologically. With a little redesign the SQL query in the loop could be removed:
The redesigned PHP with only a single SQL query
Another minor optimization you can usually perform is defining fields as NOT NULL unless you actually need to store null data. I read a blog post once that claimed NULL was the biggest mistake in SQL design. Whether or not that is true is way above my head, but you can definitely squeak a tiny bit more performance out by not allowing NULL if you don't have to.
CREATE TABLE IF NOT EXISTS `MyDatabase`.`TRK` (
`id` int( 11 ) NOT NULL AUTO_INCREMENT PRIMARY KEY ,
`latitude` float NOT NULL DEFAULT '0',
`longitude` float NOT NULL DEFAULT '0',
`altitude` float NOT NULL DEFAULT '0',
`datetime` datetime NOT NULL DEFAULT '0000-00-00 00:00:00',
) ENGINE = MYISAM DEFAULT CHARSET latin1 AUTO_INCREMENT=1 ;
What I've mentioned above is nearly universally true, although creating a lot of different indexes will eventually slow down your database. What I mention below is more in the deep magic category, it may help or hurt your performance. You should always explore the benefits using things like the EXPLAIN command. EXPLAIN is not a standard SQL command, so here is the Postgres and MySQL documentation.
One trick on tables with lots of fields where we regularly need a very small subset of the data is to simply put all the fields in our index. My friend's database already had such an index, but the EXPLAIN command showed it was unused.
CREATE UNIQUE INDEX lldt ON `MyDatabase`.`TRK` (`latitude`,`longitude`, `datetime`);
By redefining this index with datetime first, MySQL will use it when running the search.
CREATE UNIQUE INDEX lldt ON `MyDatabase`.`TRK` (`datetime`, `latitude`,`longitude`);
So now the operation of the database is:
1. Retrieve a row from the index
2. Compare the datetime against the specified value
3. If it matches, return the row from the index and exit
4. Repeat until end of index reached
Combining the two indexes on my friend's database will help the performance of INSERT, UPDATE, and DELETE, but it may end up slowing down SELECT. Only by actually testing it would you be able to tell if it was worth it.