Updates from February, 2012

  • Andy Skelton 3:45 pm on February 14, 2012 Permalink  

    Atomically update serialized PHP arrays in MySQL 

    Okay, okay, it’s hard to find a use case for this when it’s so obvious that the correct way to handle one-to-many is with JOIN. But if you’re already committed to your schema and you decide you need to append serialized PHP data to a row atomically, you can cons serialized values with this query:

    INSERT INTO tbl
      …
      serialized = "i:1;"
      ON DUPLICATE KEY UPDATE
        serialized = CONCAT(
          'a:3:{i:0;s:4:"cons";i:1;',
          VALUES(serialized),
          'i:2;',
          serialized,
          '}'
        )
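
    For concreteness, here is a minimal table this could run against. The
    name and columns are assumptions; the "…" above stands for whatever
    else your schema sets:

    CREATE TABLE tbl (
      id         BIGINT UNSIGNED NOT NULL,
      serialized LONGTEXT NOT NULL,
      PRIMARY KEY (id)
    );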

    After you have performed this three times with the serialized values 1, 2, and 3, the row contains this:

    'a:3:{i:0;s:4:"cons";i:1;i:3;i:2;a:3:{i:0;s:4:"cons";i:1;i:2;i:2;i:1;}}'

    After unserializing, deconstruct it with this function:

    function decons($list) {
        $res = array();
        // Walk the chain of cons cells, collecting values in reverse order.
        while ( is_array($list) && isset($list[0]) && $list[0] === 'cons' ) {
            array_unshift( $res, $list[1] );
            $list = isset($list[2]) ? $list[2] : array();
        }
        // A non-cons tail is the first value ever inserted; keep it too.
        if ( $list !== array() ) {
            array_unshift( $res, $list );
        }
        return $res;
    }

    The result:

    array(1, 2, 3)
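
    For example, a hypothetical read path against the sketch table above:

    $res = mysql_query( "SELECT serialized FROM tbl WHERE id = 1" );
    $row = mysql_fetch_object( $res );
    print_r( decons( unserialize( $row->serialized ) ) ); // 1, 2, 3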

    I haven’t actually used it (probably never will) but you are welcome to try this at home!

    Proving that this is stupid is left as an exercise for the reader.

     
  • Andy Skelton 5:46 pm on February 11, 2009 Permalink  

    Idea drop: db::multi_query 

    We usually do things sequentially in PHP. Any point where we can get two things done at the same time is an opportunity to reduce the total execution time. It is generally safe to execute processes in parallel when no process has a side effect that may impact any other process. Most MySQL SELECT statements fall into this class. Under certain conditions it might help to run these queries in parallel.

    So I got to thinking: does PHP provide a function to issue a query and return immediately, instead of waiting while the database churns? Such a function would let me spread a batch of queries across several servers and then collect the results. It should not involve forking or making OS calls. Maybe mysql_unbuffered_query is that function.

    Assuming it is, here is the basis for a parallel query system in PHP. It is obviously incomplete. I may complete it and try it out when I need to query a dataset partitioned across separate MySQL instances. The function connect_reserved($query) returns a locked MySQL link identifier, opening new connections as needed. The function release($query) removes the lock, returning the link to the pool of available database connections.

    function multi_query($queries) {
        $res = $ret = array();
        foreach ( $queries as $i => $query ) {
            // connect_reserved() hands back a locked link, opening a new
            // connection if none is free; release() returns it to the pool.
            $link = connect_reserved($query);
            $res[$i] = mysql_unbuffered_query($query, $link);
            $ret[$i] = array();
        }

        // Round-robin over the open result sets, draining one row from each
        // per pass, until every query has been fully fetched.
        do {
            foreach ( $res as $i => $r ) {
                if ( $row = mysql_fetch_row($r) ) {
                    $ret[$i][] = $row;
                } else {
                    release($queries[$i]);
                    unset($res[$i]);
                }
            }
        } while ( count($res) );

        return $ret;
    }
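
    A hypothetical call, assuming the data is partitioned across two shards
    and connect_reserved() maps each query to the right server:

    $results = multi_query( array(
        "SELECT * FROM posts_0 WHERE user_id = 42",
        "SELECT * FROM posts_1 WHERE user_id = 42",
    ) );
    // $results[0] and $results[1] hold the fetched rows, keyed like the input.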

    I have not used mysql_unbuffered_query. For best results, it should return as soon as the server determines that the query is valid. I have assumed that it can return while the database is still looking for records. (If it cannot, this whole idea should be forgotten.) This oversimplified diagram helps illustrate how the queries run in parallel. The green line shows the beneficial overlap of query processing time.

    [Diagram: quasi-parallel queries in PHP]

     
    • Otto 6:28 pm on February 11, 2009 Permalink

      Don’t think that’ll work.

      “Note: The benefits of mysql_unbuffered_query() come at a cost: You cannot use mysql_num_rows() and mysql_data_seek() on a result set returned from mysql_unbuffered_query(). You also have to fetch all result rows from an unbuffered SQL query, before you can send a new SQL query to MySQL.”

      So you can’t send a new query until you’ve fetched all the previous rows.

    • Otto 6:33 pm on February 11, 2009 Permalink

      Additional: Whoops, I didn’t see that you were using multiple DB connections. In that case, yes, this’ll work, but you’ll have to have one DB connection per parallel query.

      Also, when you do an unbuffered query, then it does wait for the database search to actually occur, and it even retrieves the first row of the result set into php for you. It just doesn’t retrieve the remaining rows from mySQL until you actually ask for them with a fetch_row.

    • Andy 6:59 pm on February 11, 2009 Permalink

      What about this:

      On the other hand, you can start working on the result set immediately after the first row has been retrieved: you don’t have to wait until the complete SQL query has been performed.

    • apokalyptik 1:30 am on February 12, 2009 Permalink

      it's times like these when I really wish PHP had some form of threading…

    • apokalyptik 2:28 am on February 12, 2009 Permalink

      you might be able to emulate something like this with pcntl_fork()….

      I don't know how well this will paste… but… this illustrates the idea…

      function multi_query($queries) {
          $sockets = $results = array();
          foreach ( $queries as $idx => $query ) {
              // fork_query() (defined in the full paste) forks a child that
              // runs the query and writes each row back over a socket pair
              // as a base64-encoded serialized array($idx, $row).
              $sockets[$idx] = fork_query($query, $idx);
              $results[$idx]['query'] = $query;
              $results[$idx]['rows'] = array();
          }
          while ( true ) {
              $poll = array_values($sockets);
              if ( false === socket_select($poll, $w = null, $e = null, 0) )
                  break;
              foreach ( $poll as $idx => $sock ) {
                  $res = unserialize(base64_decode(socket_read($sock, 80960, PHP_NORMAL_READ)));
                  if ( $res === false ) {
                      socket_close($sock);
                      break;
                  }
                  $id = $res[0];
                  $results[$id]['rows'][] = $res[1];
              }
              foreach ( $sockets as $idx => $val ) {
                  if ( 'Socket' != get_resource_type($val) )
                      unset($sockets[$idx]);
              }
              if ( !count($sockets) )
                  break;
          }
          return $results;
      }

      print_r(multi_query(array("blah1", "blah2", "blah3")));

    • apokalyptik 2:29 am on February 12, 2009 Permalink

      uhh… how about this time with less suckage… http://pastebin.com/f7c242f82

    • Felix Geisendörfer 6:09 am on February 12, 2009 Permalink

      apokalyptik: pcntl_fork is not perfect but you can do some threading-type-of-stuff with it.

    • Joseph Scott 10:16 pm on March 23, 2009 Permalink

      The MySQLi Poll function (when built with mysqlnd) might be good for this:

      http://us.php.net/manual/en/mysqli.poll.php
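
      For the record, a minimal sketch of that API (PHP 5.3+ with mysqlnd;
      the host, credentials, and queries here are made up):

      $queries = array("SELECT SLEEP(1), 1", "SELECT SLEEP(1), 2");
      $links = array();
      foreach ($queries as $query) {
          $link = mysqli_connect("127.0.0.1", "user", "pass", "test");
          $link->query($query, MYSQLI_ASYNC); // returns without waiting for rows
          $links[] = $link;
      }
      $results = array();
      $pending = $links;
      while ($pending) {
          $read = $error = $reject = $pending;
          if (!mysqli::poll($read, $error, $reject, 1)) {
              continue; // nothing finished within one second
          }
          foreach ($read as $link) {
              if ($result = $link->reap_async_query()) {
                  $results[] = $result->fetch_all();
                  $result->free();
              }
              unset($pending[array_search($link, $pending, true)]); // done polling this link
          }
      }
      print_r($results);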

  • Andy Skelton 5:47 am on December 16, 2007 Permalink
    Tags: GeoIP, MaxMind   

    Fast MySQL Range Queries on MaxMind GeoIP Tables 

    A few weeks ago I read Jeremy Cole’s post on querying MaxMind GeoIP tables but I didn’t know what all that geometric magic was about so I dropped a comment about how we do it here on WordPress.com. (Actually, Nikolay beat me to it.) Jeremy ran some benchmarks and added them to his post. He discovered that my query performed favorably.

    Today I saw an article referencing that comment and I wished I had published it here, so here it is. There is a bonus at the end to make it worth your while if you witnessed the original discussion.

    The basic problem is this: you have a MySQL table with columns that define the upper and lower bounds of mutually exclusive integer ranges and you need the row for which a given integer fits within the range and you need it fast.

    The basic solution is this: you create an index on the upper bound column and find the first row for which that value is greater than or equal to the given value.

    The logic is this: MySQL scans the integer index in ascending order. Every range below the matching range will have an upper bound less than the given value. The first range with an upper bound not less than the given value will include that value if the ranges are contiguous.

    Assuming contiguous ranges (no possibility of falling between ranges) this query will find the correct row very quickly:

    SELECT * FROM ip2loc WHERE ip_to >= 123456789 ORDER BY ip_to LIMIT 1

    The MySQL server can find the row with an index scan, a sufficiently fast operation. I can’t think of a faster way to get the row (except maybe reversing the scan when the number is known to be in the upper half of the entire range).

    The bonus is this: because the time to scan the index is related to the length of the index, you should keep the index as small as possible. Nikolay found that our GeoIP table had gaps between some ranges and decided to rectify this condition by filling in the gaps with “no country” rows, ensuring that the query would return “no country” instead of a wrong country. I would advise against doing that because it lengthens the index and adds precious query time. Instead, check that the found range’s lower bound is less than or equal to the given value after you have retrieved the row.
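
    A sketch of that full lookup (the column names are assumed; IPs are
    stored as unsigned integers):

    $ip  = sprintf( '%u', ip2long( '8.8.8.8' ) );
    $res = mysql_query( "SELECT * FROM ip2loc WHERE ip_to >= $ip ORDER BY ip_to LIMIT 1" );
    $row = mysql_fetch_assoc( $res );
    // Verify the lower bound in PHP rather than padding the table with
    // "no country" rows, which would only lengthen the index.
    $country = ( $row && $row['ip_from'] <= $ip ) ? $row['country'] : 'no country';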

     
    • _ck_ 12:59 pm on March 17, 2008 Permalink | Reply

      I’ve actually used this technique for a couple of years now, after realizing that it not only speeds up the search but also reduces the db table to just over 1MB, which can be cached far better (after dropping the unneeded columns).

      However I discovered it’s a little more accurate and gives better “missing” results if you do it backwards using the ENDING column and descend – searching backwards essentially. MySQL does it just as fast, and if not found, the next lower result is better.

      I’ve had to manually patch the maxmind ranges about two dozen times now. The free db has several holes and inaccuracies, especially with ISPs like AOL. It also lists EU for several spots that should be more country specific. We should share results when we find holes and make a group project for patches?

    • Tim 5:06 am on July 12, 2008 Permalink | Reply

      We also refer to another company's FAQs for some useful insights into the steps required to speed up query performance.

      http://www.ip2location.com/faqs-ip-country.aspx

    • Alex 11:29 pm on January 28, 2009 Permalink | Reply

      I love it when I’m googling up a problem and the answer turns up on a co-worker’s blog. Even better when it means the code I need is already in our repo.

    • Joao 3:51 pm on September 21, 2010 Permalink | Reply

      Try the following and you will be amazed as I was.

      Add a primary index to (ip_to, ip_from, country), assuming your table has the 3 basic fields (ip_from, ip_to, country), and execute the following query:

      SELECT `country` FROM `ip2country` WHERE 123456789 BETWEEN `ip_from` AND `ip_to` LIMIT 1;

      Note that the index has to start with ip_to and not ip_from… and by adding both ip_from and country to the same index mysql will use the “Using index” optimization.

      The other trick is to add the LIMIT 1 together with BETWEEN! If you don’t it will be slower, don’t ask me why.

      In order to use the “Using index” optimization the index has to be on all 3 fields (ip_to, ip_from, country)… or just (ip_to) if you don’t care about the optimization or are using MEMORY tables which don’t use that optimization anyway. Btw, MEMORY tables are faster than MyISAM, but make sure the index is of type “BTREE” and not “HASH” as that’s the default for memory tables.
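
      A sketch of the setup Joao describes, using his table and query:

      ALTER TABLE `ip2country`
        ADD PRIMARY KEY (`ip_to`, `ip_from`, `country`);

      EXPLAIN SELECT `country` FROM `ip2country`
        WHERE 123456789 BETWEEN `ip_from` AND `ip_to` LIMIT 1;
      -- The Extra column should report "Using where; Using index".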

      • John 12:59 am on November 18, 2010 Permalink | Reply

        I used the “limit 1” trick with great success. I have indexes only on the individual fields (ip_from, ip_to) and saw a great performance increase just by adding the limit 1. Thanks.

    • joaoptm 7:59 am on September 22, 2010 Permalink | Reply

      I just posted a more detailed explanation to my previous post.
