Fast MySQL Range Queries on MaxMind GeoIP Tables
A few weeks ago I read Jeremy Cole’s post on querying MaxMind GeoIP tables but I didn’t know what all that geometric magic was about so I dropped a comment about how we do it here on WordPress.com. (Actually, Nikolay beat me to it.) Jeremy ran some benchmarks and added them to his post. He discovered that my query performed favorably.
Today I saw an article referencing that comment and I wished I had published it here, so here it goes. There is a bonus at the end to make it worth your while if you witnessed the original discussion.
The basic problem is this: you have a MySQL table with columns that define the upper and lower bounds of mutually exclusive integer ranges and you need the row for which a given integer fits within the range and you need it fast.
The basic solution is this: you create an index on the upper bound column and find the first row for which that value is greater than or equal to the given value.
The logic is this: MySQL scans the integer index in ascending order. Every range below the matching range will have an upper bound less than the given value. The first range with an upper bound not less than the given value will include that value if the ranges are contiguous.
Assuming contiguous ranges (no possibility of falling between ranges) this query will find the correct row very quickly:
SELECT * FROM ip2loc WHERE ip_to >= 123456789 LIMIT 1
The MySQL server can find the row with an index scan, a sufficiently fast operation. I can’t think of a faster way to get the row (except maybe reversing the scan when the number is known to be in the upper half of the entire range).
The bonus is this: because the time to scan the index is related to the length of the index, you should keep the index as small as possible. Nikolay found that our GeoIP table had gaps between some ranges and decided to rectify this condition by filling in the gaps with “no country” rows, ensuring that the query would return “no country” instead of a wrong country. I would advise against doing that because it lengthens the index and adds precious query time. Instead, check that the found range’s lower bound is less than or equal to the given value after you have retrieved the row.