Fast MySQL Range Queries on MaxMind GeoIP Tables

A few weeks ago I read Jeremy Cole’s post on querying MaxMind GeoIP tables but I didn’t know what all that geometric magic was about so I dropped a comment about how we do it here on WordPress.com. (Actually, Nikolay beat me to it.) Jeremy ran some benchmarks and added them to his post. He discovered that my query performed favorably.

Today I saw an article referencing that comment and I wished I had published it here, so here it goes. There is a bonus at the end to make it worth your while if you witnessed the original discussion.

The basic problem is this: you have a MySQL table with columns that define the upper and lower bounds of mutually exclusive integer ranges and you need the row for which a given integer fits within the range and you need it fast.

The basic solution is this: you create an index on the upper bound column and find the first row for which that value is greater than or equal to the given value.

The logic is this: MySQL scans the integer index in ascending order. Every range below the matching range will have an upper bound less than the given value. The first range with an upper bound not less than the given value will include that value if the ranges are contiguous.

Assuming contiguous ranges (no possibility of falling between ranges) this query will find the correct row very quickly:

SELECT * FROM ip2loc WHERE ip_to >= 123456789 LIMIT 1

The MySQL server can find the row with an index scan, a sufficiently fast operation. I can’t think of a faster way to get the row (except maybe reversing the scan when the number is known to be in the upper half of the entire range).

The bonus is this: because the time to scan the index is related to the length of the index, you should keep the index as small as possible. Nikolay found that our GeoIP table had gaps between some ranges and decided to rectify this condition by filling in the gaps with “no country” rows, ensuring that the query would return “no country” instead of a wrong country. I would advise against doing that because it lengthens the index and adds precious query time. Instead, check that the found range’s lower bound is less than or equal to the given value after you have retrieved the row.

Previous Post

6 Comments

  1. I’ve actually used this technique for a couple years now after realizing not only does it speed up the search but reduces the db table to just over 1mb which can be cached far better (and dropping the unneeded columns)

    However I discovered it’s a little more accurate and gives better “missing” results if you do it backwards using the ENDING column and descend – searching backwards essentially. MySQL does it just as fast, and if not found, the next lower result is better.

    I’ve had to manually patch the maxmind ranges about two dozen times now. The free db has several holes and inaccuracies, especially with ISPs like AOL. It also lists EU for several spots that should be more country specific. We should share results when we find holes and make a group project for patches?

  2. Tim

     /  July 12, 2008

    We also refer to the FAQs in another company for some useful insights about the steps required to speed up the query performance.

    http://www.ip2location.com/faqs-ip-country.aspx

  3. I love it when I’m googling up a problem and the answer turns up on a co-worker’s blog. Even better when it means the code I need is already in our repo.

  4. Joao

     /  September 21, 2010

    Try the following and you will be amazed as I was.

    Add a primary index to (ip_to, ip_from, country), assuming your table has the 3 basic fields (ip_from, ip_to, country), and execute the following query:

    SELECT `country` FROM `ip2country` WHERE 123456789 BETWEEN `ip_from` AND `ip_to` LIMIT 1;

    Note that the index has to start with ip_to and not ip_from… and by adding both ip_from and country to the same index mysql will use the “Using index” optimization.

    The other trick is to add the LIMIT 1 together with BETWEEN! If you don’t it will be slower, don’t ask me why.

    In order to use the “Using index” optimization the index has to be on all 3 fields (ip_to, ip_from, country)… or just (ip_to) if you don’t care about the optimization or are using MEMORY tables which don’t use that optimization anyway. Btw, MEMORY tables are faster than MyISAM, but make sure the index is of type “BTREE” and not “HASH” as that’s the default for memory tables.

    • John

       /  November 18, 2010

      I used the “limit 1″ trick with great success. I have indexes only on the individual fields (ip_from, ip_to) and saw a great performance increase just by adding the limit 1. Thanks.

  5. I just posted a more detailed explanation to my previous post.

Follow

Get every new post delivered to your Inbox.

Join 1,670 other followers

%d bloggers like this: