Atomically update serialized PHP arrays in MySQL

Okay, okay, it’s hard to find a use case for this when it’s so obvious that the correct way to handle one-to-many is with JOIN. But if you’re already committed to your schema and you decide you need to append serialized PHP data to a row atomically, you can cons serialized values with this query:

INSERT INTO tbl
  …
  serialized = "i:1;"
  ON DUPLICATE KEY UPDATE
    serialized = CONCAT(
      'a:3:{i:0;s:4:"cons";i:1;',
      VALUES(serialized),
      'i:2;',
      serialized,
      '}'
    )

After you have performed this three times with the serialized values 1, 2, and 3, the row contains this:

'a:3:{i:0;s:4:"cons";i:1;i:3;i:2;a:3:{i:0;s:4:"cons";i:1;i:2;i:2;a:2:{i:0;s:4:"cons";i:1;i:1;}}}'

After unserializing, deconstruct it with this function:

function decons($list) {
    $res = array();
    while ( $list != array() ) {
        if ( $list[0] === 'cons' ) {
            array_unshift( $res, $list[1] );
            $list = $list[2];
        } else {
            array_unshift( $res, $list );
            break;
        }
    }
    return $res;
}

The result:

array(1, 2, 3)

I haven’t actually used it (probably never will) but you are welcome to try this at home!

Proving that this is stupid is left as an exercise for the reader.

Idea drop: db::multi_query

We usually do things sequentially in PHP. Any point where we can get two things done at the same time is an opportunity to reduce the total execution time. It is generally safe to execute processes in parallel when no process has a side effect that may impact any other process. Most MySQL SELECT statements fall into this class. Under certain conditions it might help to run these queries in parallel.

So I got to thinking: does PHP provide a function to issue a query and return immediately, instead of waiting while the database churns? Such a function would let me spread a batch of queries across several servers and then collect the results. It should not involve forking or making OS calls. Maybe mysql_unbuffered_query is that function.

Assuming it is, here is the basis for a parallel query system in PHP. It is obviously incomplete. I may complete it and try it out when I need to query a dataset partitioned across separate MySQL instances. The function connect_reserved($query) returns a locked MySQL link identifier, opening new connections as needed. The function release($query) removes the lock, returning the link to the pool of available database connections.

function multi_query($queries);
    foreach ( $queries as $i => $query ) {
        $link = connect_reserved($query);
        $res[$i] = mysql_unbuffered_query($query, $link);
        $ret[$i] = array();
    }

    do {
        foreach ( $res as $i => $r ) {
            if ( $row = mysql_fetch_row($r) ) {
                $ret[$i][] = $row;
            } else {
                release($queries[$i]);
                unset($res[$i]);
            }
        }
    } while ( count($res) );

    return $ret;
}

I have not used mysql_unbuffered_query. For best results, it should return as soon as the server determines that the query is valid. I have assumed that it can return while the database is still looking for records. (If it can not, this whole idea should be forgotten.) This oversimplified diagram helps illustrate how the queries run in parallel. The green line shows the beneficial overlap of query processing time.

quasi-parallel queries in PHP

Fast MySQL Range Queries on MaxMind GeoIP Tables

A few weeks ago I read Jeremy Cole’s post on querying MaxMind GeoIP tables but I didn’t know what all that geometric magic was about so I dropped a comment about how we do it here on WordPress.com. (Actually, Nikolay beat me to it.) Jeremy ran some benchmarks and added them to his post. He discovered that my query performed favorably.

Today I saw an article referencing that comment and I wished I had published it here, so here it goes. There is a bonus at the end to make it worth your while if you witnessed the original discussion.

The basic problem is this: you have a MySQL table with columns that define the upper and lower bounds of mutually exclusive integer ranges and you need the row for which a given integer fits within the range and you need it fast.

The basic solution is this: you create an index on the upper bound column and find the first row for which that value is greater than or equal to the given value.

The logic is this: MySQL scans the integer index in ascending order. Every range below the matching range will have an upper bound less than the given value. The first range with an upper bound not less than the given value will include that value if the ranges are contiguous.

Assuming contiguous ranges (no possibility of falling between ranges) this query will find the correct row very quickly:

SELECT * FROM ip2loc WHERE ip_to >= 123456789 LIMIT 1

The MySQL server can find the row with an index scan, a sufficiently fast operation. I can’t think of a faster way to get the row (except maybe reversing the scan when the number is known to be in the upper half of the entire range).

The bonus is this: because the time to scan the index is related to the length of the index, you should keep the index as small as possible. Nikolay found that our GeoIP table had gaps between some ranges and decided to rectify this condition by filling in the gaps with “no country” rows, ensuring that the query would return “no country” instead of a wrong country. I would advise against doing that because it lengthens the index and adds precious query time. Instead, check that the found range’s lower bound is less than or equal to the given value after you have retrieved the row.

Follow

Get every new post delivered to your Inbox.

Join 1,665 other followers