Automatically open remote files in local emacs

I prefer to edit text locally in emacs. Most of the files I edit reside on remote servers so I use TRAMP to open remote files locally. What kills me is using emacs remotely via terminal when a shell command invokes $EDITOR (e.g. svn commit). With my new setup, the default editor on the remote machine is my local emacs. I love this.

First, I configure SSH to forward a remote port to my machine. This means that whenever the remote machine tries to connect to itself on that port (localhost:9999), it actually connects to port 9999 on my local OS X machine. I like to keep these details in my ssh_config file (my local ~/.ssh/config):

Host des
User wpdev
ControlMaster auto
ControlPath ~/.ssh/des.sock
RemoteForward 9999 localhost:9999

(I use abbreviated hostnames to save keystrokes. There is a matching entry in my hosts file.)

Second, I configure my local emacs to start the server and copy the server file to the remote host. The server file tells emacsclient how to connect to the server. Adding this to emacs-startup-hook adds a few seconds to my emacs startup time but I rarely start emacs more than once in a day so that’s fine. This is in my local ~/.emacs:

(setq server-use-tcp t
      server-port    9999)
;; Start the Emacs server, then copy the server file (which holds the
;; port and auth key) to the remote host so emacsclient there can use it.
(defun server-start-and-copy ()
  (server-start)
  (copy-file "~/.emacs.d/server/server" "/des:.emacs.d/server/server" t))
(add-hook 'emacs-startup-hook 'server-start-and-copy)

Third, I create a bash script on the remote host which calls emacsclient with the necessary TRAMP path prefixed to its arguments. (If you try running emacsclient remotely without the TRAMP path you’ll get an empty emacs buffer.) Here is the script I put in remote ~/bin/ec and then chmod +x:

#!/bin/bash
# Wrap emacsclient: prefix each file argument with the TRAMP path so the
# local Emacs server opens the remote file over SSH.

params=()
for p in "$@"; do
    if [ "$p" == "-n" ]; then
        # pass the "no wait" flag through untouched
        params+=( "$p" )
    elif [ "${p:0:1}" == "+" ]; then
        # pass +LINE[:COLUMN] arguments through untouched
        params+=( "$p" )
    else
        # resolve to an absolute path, then prefix the TRAMP path
        params+=( "/ssh:des:$(readlink -f "$p")" )
    fi
done
emacsclient "${params[@]}"

Finally, I set up $EDITOR on the remote machine. I also add my bin directory to $PATH so I can invoke ec. This is in my remote ~/.bashrc:

export PATH=~/bin:$PATH
export EDITOR=~/bin/ec

That’s it! More elegant solutions are possible but my new tool is sufficiently sharp and I have work to do!

Batcache for WordPress

[I meant to publicize this after a period of quiet testing and feedback but the watchdogs at WLTC upended the kitten bag and forced my hand. Batcache comes with all the usual disclaimers. If you try it on a production server expect the moon to fall on your head.]

People say WordPress can’t perform under pressure. The way most people set it up, that’s true. For those who host their blog for $7.99 a month (do they also run Vista on an 8086?) the best bet is to serve static pages rather than dynamic pages. Donncha’s WP-Super-Cache does that brilliantly. I’ve seen it raise a server’s capacity for blog traffic by one hundred times or more. It’s a cheapskate’s dream.

WP-Super-Cache is good for anyone with a single web server with a writable wp-content/cache directory. To them, the majority, I say use WP-Super-Cache. What about enterprises with multiple servers that don’t share disk space? If you can’t or won’t use file-based caching, I have something for you. It’s based on what WordPress.com uses. It’s Batcache.

Batcache will protect you

Batcache implements a deliberately simple caching model that shields your database and web servers from traffic spikes: after a document has been requested X times in Y seconds, the document is cached for Z seconds and all new users are served the cached copy.

New users are defined as anybody who hasn’t interacted with your domain—once they’ve left a comment or logged in, their cookies will ensure they get fresh pages. People arriving from Digg won’t notice that the comments are a minute or two behind but they’ll appreciate your site being up.

You don’t need PHP skills to install Batcache but you do have to get Memcached working first. That can be easy or hard. We use Memcached because it’s awesome. Once you know how to install it, you can create the same kind of distributed, persistent cache that underpins web giants like WordPress.com and Facebook.

What Batcache does

The first thing Batcache does is decide whether the visitor is eligible to receive cached documents. If their cookies don’t show evidence of previous interaction on that domain, they are eligible. Next, it decides whether the request is eligible for caching. For example, Batcache won’t interfere when a comment is being posted.
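
A rough sketch of those two checks, for illustration only: is_eligible_visitor() and is_eligible_request() are made-up names, and the cookie names simply follow WordPress conventions. This is not Batcache’s actual code.

<?php
// Illustration only: decide whether this visitor may be served from the
// cache. WordPress sets cookies such as wordpress_logged_in_*,
// comment_author_* and wp-postpass_* once someone logs in, comments, or
// enters a post password.
function is_eligible_visitor() {
    foreach ( array_keys( $_COOKIE ) as $name ) {
        if ( preg_match( '/^(wordpress_logged_in|comment_author|wp-postpass)_/', $name ) ) {
            return false; // recognized user: always serve a fresh page
        }
    }
    return true; // anonymous visitor: may receive a cached copy
}

// The request must be cacheable too; posting a comment never is.
function is_eligible_request() {
    return 'GET' === $_SERVER['REQUEST_METHOD'];
}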

If the visitor and the request are eligible, Batcache enters its traffic metering routine. By default it looks for URLs that receive more than two hits from unrecognized users in two minutes. When a URL’s traffic crosses that threshold, Batcache caches the document for five minutes. You can configure these numbers any way you like, or turn off traffic metering and send documents right to the cache.
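
In rough PHP, the metering model boils down to something like the sketch below. This is an illustration of the logic, not Batcache’s source: it assumes the eligibility checks above have passed, that $memcache is a connected Memcache object, and it uses the default numbers from the paragraph above (2 hits in 120 seconds, cache for 300 seconds).

<?php
// Illustration of the traffic meter described above; not Batcache's source.
$times   = 2;    // hits required before caching kicks in
$seconds = 120;  // window in which those hits must occur
$max_age = 300;  // how long a cached copy is then served

$url_key   = md5( $_SERVER['HTTP_HOST'] . $_SERVER['REQUEST_URI'] );
$page_key  = "page_$url_key";
$meter_key = "meter_$url_key";

if ( $page = $memcache->get( $page_key ) ) {
    echo $page; // eligible visitor, cached copy still fresh
    exit;
}

// Count this request. add() only succeeds when the key is absent, so the
// counter quietly resets itself every $seconds.
$memcache->add( $meter_key, 0, 0, $seconds );
$hits = $memcache->increment( $meter_key );

if ( $hits > $times ) {
    // Popular URL: capture the generated page and cache it for $max_age.
    function cache_output( $html ) {
        global $memcache, $page_key, $max_age;
        $memcache->set( $page_key, $html, 0, $max_age );
        return $html;
    }
    ob_start( 'cache_output' );
}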

Once a document has been cached, it is served to eligible visitors until it expires. This is one place where Batcache is different. Most other caches delete cached documents as soon as the underlying data changes. Batcache doesn’t care if it’s serving old data because “old” is relative (and configurable).

What Batcache doesn’t do

It doesn’t guarantee a current document. I repeat this because reliable cache invalidation is a typical feature that was purposefully omitted from Batcache. There is a routine in the included plugin that tries to trigger regeneration of updated and commented posts but in some situations a document will still live in the cache until it expires. This routine will be improved over time but it is only an afterthought.

Batcache doesn’t automatically know the difference between document variants. Variants exist when two requests for the same URL can yield two different documents. Common examples are user-agent-dependent variants formatted for mobile devices and referrer-dependent variants with Google search terms highlighted. In these cases you MUST take extra steps to inform Batcache about variants to avoid serving a variant to the wrong audience. The source code includes examples of how to turn off caching of uncommon variants (search term highlighting) or cache common variants separately (mobile versions).
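
Conceptually, handling a variant means either folding the distinguishing trait into the cache key or declining to cache the request at all. Here is a sketch of both approaches; $cache_key and is_mobile_browser() are made-up names standing in for your own key and detection code, not Batcache’s API.

<?php
// Illustration only.

// 1. Cache common variants separately: make the variant part of the key,
//    so mobile visitors never receive the desktop copy (or vice versa).
if ( is_mobile_browser( $_SERVER['HTTP_USER_AGENT'] ) ) {
    $cache_key .= '|mobile';
}

// 2. Skip caching for uncommon variants: a referrer from a Google search
//    triggers term highlighting, so just let those requests through.
if ( ! empty( $_SERVER['HTTP_REFERER'] ) &&
     false !== strpos( $_SERVER['HTTP_REFERER'], 'google.com/search' ) ) {
    return; // bail out of the cache script entirely for this request
}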

Where Batcache is going

I want to make Batcache easier to configure by adding a configuration page and storing the main settings in memcached as well as the database. This way you won’t have to deploy a code change to update the configuration. However, conditional configurations (e.g. “never cache URLs matching some pattern”) and variant detection will probably always live in PHP.
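
A conditional configuration of that sort might be nothing more than a short test at the top of the cache script. A hypothetical example, with a made-up URL pattern:

<?php
// Hypothetical conditional configuration kept in PHP: never cache feeds
// or anything under /wp-admin/.
if ( preg_match( '#^/(feed|wp-admin)/#', $_SERVER['REQUEST_URI'] ) ) {
    return; // skip the cache entirely for this request
}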

I want to have Batcache serve correct headers more reliably. On some servers it can detect the headers that were sent with a newly generated page and serve them again from the cache. But when that doesn’t work you will have to take extra steps to serve certain headers. For example you must specify the Content-Encoding header in the Batcache configuration or add it to php.ini. I want this sort of thing to be done automatically for all server setups.

I know that Batcache is not ideal for most WordPress installations. It saves us a lot of headaches and expense at WordPress.com, so maybe it can help other large installations. If you try it, I want to hear from you whether it worked and how well. I am also keen to see what new configurations and modifications you use.

As always, this software is provided without claims or warranties. It’s so experimental that it doesn’t even have a version number! Until the project grows to need its own blog, keep an eye on the Trac browser for updates.

Fast MySQL Range Queries on MaxMind GeoIP Tables

A few weeks ago I read Jeremy Cole’s post on querying MaxMind GeoIP tables but I didn’t know what all that geometric magic was about so I dropped a comment about how we do it here on WordPress.com. (Actually, Nikolay beat me to it.) Jeremy ran some benchmarks and added them to his post. He discovered that my query performed favorably.

Today I saw an article referencing that comment and I wished I had published it here, so here goes. There is a bonus at the end to make it worth your while if you witnessed the original discussion.

The basic problem is this: you have a MySQL table with columns that define the upper and lower bounds of mutually exclusive integer ranges, you need the row whose range contains a given integer, and you need it fast.

The basic solution is this: you create an index on the upper bound column and find the first row for which that value is greater than or equal to the given value.

The logic is this: MySQL scans the integer index in ascending order. Every range below the matching range will have an upper bound less than the given value. The first range with an upper bound not less than the given value will include that value if the ranges are contiguous.

Assuming contiguous ranges (no possibility of falling between ranges) this query will find the correct row very quickly:

SELECT * FROM ip2loc WHERE ip_to >= 123456789 ORDER BY ip_to ASC LIMIT 1

The MySQL server can find the row with an index scan, a sufficiently fast operation. I can’t think of a faster way to get the row (except maybe reversing the scan when the number is known to be in the upper half of the entire range).

The bonus is this: because the time to scan the index is related to the length of the index, you should keep the index as small as possible. Nikolay found that our GeoIP table had gaps between some ranges and decided to rectify this by filling in the gaps with “no country” rows, ensuring that the query would return “no country” instead of a wrong country. I would advise against doing that because it lengthens the index and costs precious query time. Instead, after you have retrieved the row, check that the found range’s lower bound is less than or equal to the given value.
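
Putting the query and the gap check together in PHP might look like the sketch below. The column names ip_from, ip_to, and country, the index on ip_to, and the mysqli connection in $db are all assumptions about your particular table.

<?php
// Sketch of the range lookup plus the post-fetch gap check.
// Assumes a table like ip2loc (ip_from, ip_to, country) with an index on
// ip_to, and a connected mysqli object in $db.
$ip = sprintf( '%u', ip2long( $_SERVER['REMOTE_ADDR'] ) ); // unsigned on 32-bit builds

$result = $db->query( "SELECT * FROM ip2loc WHERE ip_to >= $ip ORDER BY ip_to ASC LIMIT 1" );
$row    = $result ? $result->fetch_assoc() : null;

// The index scan returns the first range that ends at or above $ip. If
// $ip fell into a gap between ranges, that range's lower bound is too high.
if ( ! $row || $row['ip_from'] > $ip ) {
    $country = 'no country';
} else {
    $country = $row['country'];
}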

Free tech gear

Here’s a clever WordPress site: Take My Tech is giving away used gadgets by selecting a random commenter when the number of comments on a gadget reaches one hundred. The site will pay for shipping (and hopefully make the owner a few bucks) with the revenues from advertisements. It is a bit of a gamble for the owner—craigslist or ebay would be the more obvious choices—but I find it clever all the same.

The lottery process, including checking for duplicate comment emails/IPs, could be automated with a plugin. I suggested building on top of my own Cap Comments plugin.

Proposal: Multipart Web Requests

Here’s a little idea that might improve the web for everyone. I don’t know how to draft or submit a Request For Comments—I could have read RFC 2026 (BCP 9) but I wasn’t interested—but if anyone would like to see this through, I hope you’ll contact me.

We could improve the overall performance and reduce the request load on most of the world wide web if servers supported a way of sending, in one response, some or all of the resources that will certainly or likely be requested (GET) as a result of parsing the requested resource.

A sufficient implementation might be possible using existing RFCs. Perhaps a media range of “Multipart” in the Accept request-header field could be used to announce that the client can accept such responses.

The ideal implementation might include optimizations for client and proxy caches, such as a Cached request-header whose value would list any already-cached items, along with their Last-Modified, ETag, or other conditional request-header values, so the server could exclude those items from the response.

Static files served in this manner might be parsed by the web server in order to discover which, if any, other resources (images, audio, stylesheets, DTDs, etc.) should be sent. The server might take cues from a saved list, such as a cache of previous parsing results or a manually generated list.

Servers generating resources dynamically might take cues from the program state, or the page generation scripts might cue the server by passing data directly or as a value of an Include response-header. When a proxy detects the Include response-header along with a single-part response, it may assume that the server was incapable of providing a multi-part response and convert the response into a multi-part response if the proxy has a valid copy or wishes to pre-fetch a copy of any or all of the resources indicated by Include.

A typical request might proceed in this way:

  1. Client requests /index.html from example.com with Accept: Multipart
  2. Server finds index.html and discovers that its display will require a PNG file as a background.
  3. Server responds:
    {response-headers}
    {body-part (text/html)}
    --boundary
    {body-part (image/png)}
    --boundary--

In the preceding example, there is no explicit request for the background image, but the client receives it in the payload of the initial response.
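
To make the example concrete, here is a guess at what such an exchange might look like on the wire, borrowing multipart/mixed syntax from MIME and using Content-Location to name each part. The Multipart token in the Accept header is the hypothetical piece of this proposal; the rest is ordinary HTTP.

GET /index.html HTTP/1.1
Host: example.com
Accept: text/html, image/png, Multipart

HTTP/1.1 200 OK
Content-Type: multipart/mixed; boundary=frontier

--frontier
Content-Type: text/html

<html>...</html>
--frontier
Content-Type: image/png
Content-Location: /background.png

...PNG data...
--frontier--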

Specialized user agents may use the Accept request-header to specify which types of media they prefer to receive, but there ought to be a way for agents to specify which types of media they prefer not to receive in multi-part responses. For example, a screen reader probably would want the server to bundle audio but not images. A Reject request-header could indicate unwanted media types, or a zero q-value in the Accept header might serve to prevent the server or any proxies from attaching these types.

A proxy handling a non-multi-part request from a client may request resources in multi-part mode and then cache and serve the individual parts as if each had been requested singly. Proxies may construct multi-part responses from parts retrieved individually and they may append additional parts according to any cue, such as by rendering web pages with the engine of their choice (perhaps taking a hint from the User-Agent request-header).

The existence of an entity in a multi-part response should not cause the user agent to display, execute, or otherwise handle the entity. User agents accepting multi-part responses should not store or execute any part which was not used during the course of rendering or interacting with the requested resource.

Does anybody else think that something like this could be beneficial?

DJ Parrot

I found this system for controlling your computer by whistling at it: whistle while you work. This is something I would like to try out… if I could debug the sound system in my Linux box. It hasn’t worked since an update got installed a few months ago. Anyway, it seems like a fun thing to use.

What would I do with this? I don’t know, probably nothing great. Maybe control my music player from across the room. I like the idea of controlling a computer remotely without any electronic devices on my person. Would it still work if the baseline noise level were as loud as I like my music?

This would be a cool interface for a pet-controlled computer application. Parrots can whistle. If I had a pet parrot I’d find things that he could control with a whistle. How about a music player? Let the bird have a whistle for Next Track, Start and Stop. Let the parrot dictate the playlist. Give him a last.fm account.