Real-time WordPress.com subscription

Sometimes RSS isn’t fast enough. We’ve been experimenting with faster blog subscription delivery using Jabber to push the messages. When you use Jabber to subscribe to blogs you get the news as soon as it is published. Now almost every post and comment on WordPress.com blogs is published this way and you can subscribe to these streams using almost any Jabber client. Messages are typically delivered within one second of publication. You can also publish to your blogs by typing instant messages. Soon comments will be appearing in Jabber chat rooms.

Let me break it down. Jabber (XMPP) is an instant messaging protocol. There are dozens of free clients (programs) that can connect to Jabber services, thousands of Jabber servers, and millions of daily Jabber users around the world. Even some phones can connect to it, including the iPhone with an appropriate app. So you can probably use Jabber. The primary exception is people behind company or government firewalls that block XMPP ports, but we’ll have a web-based solution for them soon.

Before I give you a link I have to tell you that this service is experimental. If you use it, you are a tester so please wear your white lab coat. A few dozen people have been using it for several months with very few hiccups, but hiccups are still to be expected. Even so, most of us at Automattic rely on it daily to surface and accelerate the discussions on our private blogs. Finally I must tell you that we have not yet worked out all the business angles, so the feature set and limitations may change to accommodate our inevitable need to feed the monkey.

Now I give you im.wordpress.com, which I demonstrated at the CrunchUp last week. Every WordPress.com account is automatically linked to a Jabber account on im.wordpress.com. We have compiled instructions for setting up some popular Jabber clients.

If you want my personal recommendation for a Jabber client, my choice for Mac OS X is Adium. If you already use iChat, just stick with iChat. Some of my coworkers who run Windows have chosen Pidgin. I also sometimes use Psi, which is available for Mac, Windows, and Linux.

Here is some info for people familiar with XMPP. This service is based on XEP-0060 (Publish-Subscribe) acting as a front-end for WordPress blogs. It started as a simple firehose for our commercial partners and grew from there. People subscribing with Jabber clients don’t need Pubsub. They send simple commands to a chat bot and their items are delivered as XHTML-IM from the blog’s URL. The bot speaks XEP-0060 on their behalf. If you can speak XEP-0060, you can connect to pubsub.im.wordpress.com and subscribe to nodes. The nodes for this blog are /blogs/andy.wordpress.com/ for posts, /blogs/andy.wordpress.com/comments/ for all comments, and /blogs/andy.wordpress.com/2009/07/16/real-time-wordpress-com-subscription/ for comments on this post. Node discovery and item discovery and retrieval are not implemented. Reasonable subscription and traffic limits will be imposed. If you are looking for a complete feed of all our blogs and comments, try the firehose.
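
To make the node naming concrete, here is a minimal sketch that builds the three kinds of node names listed above. The function names are hypothetical, and the rule for turning a permalink into a node is inferred only from the three examples in this post, so treat it as an assumption rather than a specification of the service.

```python
from urllib.parse import urlsplit

# Pubsub service address named in the post.
PUBSUB_SERVICE = "pubsub.im.wordpress.com"


def posts_node(blog_host: str) -> str:
    """Node for new posts on a blog, e.g. /blogs/andy.wordpress.com/."""
    return f"/blogs/{blog_host}/"


def comments_node(blog_host: str) -> str:
    """Node for all comments on a blog."""
    return f"/blogs/{blog_host}/comments/"


def post_comments_node(permalink: str) -> str:
    """Node for comments on one post, derived from its permalink.

    Assumption: the node is /blogs/<host>/<permalink path>/, as the
    example for this post suggests.
    """
    parts = urlsplit(permalink)
    path = parts.path.strip("/")
    return f"/blogs/{parts.netloc}/{path}/"
```

An XEP-0060 client would send its subscribe request to PUBSUB_SERVICE with one of these strings as the node. Since node discovery is not implemented, the client has to construct the names itself.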

Batcache for WordPress

[I meant to publicize this after a period of quiet testing and feedback but the watchdogs at WLTC upended the kitten bag and forced my hand. Batcache comes with all the usual disclaimers. If you try it on a production server expect the moon to fall on your head.]

People say WordPress can’t perform under pressure. The way most people set it up, that’s true. For those who host their blog for $7.99 a month (do they also run Vista on an 8086?) the best bet is to serve static pages rather than dynamic pages. Donncha’s WP-Super-Cache does that brilliantly. I’ve seen it raise a server’s capacity for blog traffic by one hundred times or more. It’s a cheapskate’s dream.

WP-Super-Cache is good for anyone with a single web server with a writable wp-content/cache directory. To them, the majority, I say use WP-Super-Cache. What about enterprises with multiple servers that don’t share disk space? If you can’t or won’t use file-based caching, I have something for you. It’s based on what WordPress.com uses. It’s Batcache.

Batcache will protect you

Batcache implements a very simplistic caching model that shields your database and web servers from traffic spikes: after a document has been requested X times in Y seconds, the document is cached for Z seconds and all new users are served the cached copy.

New users are defined as anybody who hasn’t interacted with your domain—once they’ve left a comment or logged in, their cookies will ensure they get fresh pages. People arriving from Digg won’t notice that the comments are a minute or two behind but they’ll appreciate your site being up.

You don’t need PHP skills to install Batcache but you do have to get Memcached working first. That can be easy or hard. We use Memcached because it’s awesome. Once you know how to install it you can create the same kind of distributed, persistent cache that underpins web giants like WordPress.com and Facebook.

What Batcache does

The first thing Batcache does is decide whether the visitor is eligible to receive cached documents. If their cookies don’t show evidence of previous interaction on that domain they are eligible. Next it decides whether the request is eligible for caching. For example, Batcache won’t interfere when a comment is being posted.
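
The two eligibility checks can be sketched roughly like this. This is Python for illustration only, not Batcache's PHP. The cookie name prefixes and the rule that only GET and HEAD requests are cacheable are assumptions made for the example.

```python
# Cookie name prefixes assumed to signal a login or a previous comment.
# Illustrative only; the authoritative list lives in Batcache's source.
INTERACTION_PREFIXES = ("wordpress_logged_in", "comment_author")


def visitor_is_eligible(cookies: dict) -> bool:
    """True when no cookie shows evidence of previous interaction."""
    return not any(name.startswith(INTERACTION_PREFIXES) for name in cookies)


def request_is_eligible(method: str) -> bool:
    """True for plain reads; a POST (such as a comment being submitted)
    always bypasses the cache. Assumption: only GET and HEAD qualify."""
    return method.upper() in ("GET", "HEAD")
```

Only when both functions return True would the request move on to traffic metering.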

If the visitor and the request are eligible, Batcache enters its traffic metering routine. By default it looks for URLs that receive more than two hits from unrecognized users in two minutes. When a URL’s traffic crosses that threshold, Batcache caches the document for five minutes. You can configure these numbers any way you like, or turn off traffic metering and send documents right to the cache.
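
Here is a minimal sketch of that metering rule, with plain dictionaries standing in for Memcached. The class and parameter names are hypothetical, and the fixed counting window is an assumption; the defaults mirror the numbers above (more than two hits in two minutes, cached for five minutes).

```python
import time


class TrafficMeter:
    """Cache a URL once it gets more than `times` hits within `seconds`."""

    def __init__(self, times=2, seconds=120, max_age=300, clock=time.time):
        self.times = times      # hit threshold; 0 caches on the first request
        self.seconds = seconds  # length of the counting window
        self.max_age = max_age  # how long a cached document lives
        self.clock = clock      # injectable so the sketch can be tested
        self.hits = {}          # url -> (window_start, count)
        self.cache = {}         # url -> (expires_at, document)

    def fetch(self, url, generate):
        """Return (document, served_from_cache)."""
        now = self.clock()
        entry = self.cache.get(url)
        if entry and entry[0] > now:
            return entry[1], True
        document = generate()
        if self._count_hit(url, now) > self.times:
            self.cache[url] = (now + self.max_age, document)
        return document, False

    def _count_hit(self, url, now):
        # Assumption: a fixed window that restarts once it has elapsed.
        start, count = self.hits.get(url, (now, 0))
        if now - start >= self.seconds:
            start, count = now, 0
        count += 1
        self.hits[url] = (start, count)
        return count
```

With the defaults, the third request inside the window is generated and stored, and later eligible requests get that copy until it expires. Nothing in the sketch invalidates an entry early, which matches the behavior described below.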

Once a document has been cached, it is served to eligible visitors until it expires. This is one place where Batcache is different. Most other caches delete cached documents as soon as the underlying data changes. Batcache doesn’t care if it’s serving old data because “old” is relative (and configurable).

What Batcache doesn’t do

It doesn’t guarantee a current document. I repeat this because reliable cache invalidation is a typical feature that was purposefully omitted from Batcache. There is a routine in the included plugin that tries to trigger regeneration of updated and commented posts but in some situations a document will still live in the cache until it expires. This routine will be improved over time but it is only an afterthought.

Batcache doesn’t automatically know the difference between document variants. Variants exist when two requests for the same URL can yield two different documents. Common examples are user agent-dependent variants formatted for mobile devices and referrer-dependent variants with Google search terms highlighted. In these cases you MUST take extra steps to inform Batcache about variants to avoid serving a variant to the wrong audience. The source code includes examples of how to turn off caching of uncommon variants (search term highlighting) or cache common variants separately (mobile versions).
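
The idea behind variant handling can be sketched as a cache key that includes whatever makes two responses for the same URL differ. This is an illustration in Python, not the examples shipped with Batcache. The detection helpers are deliberately crude and their names are hypothetical.

```python
import hashlib


def is_mobile(user_agent: str) -> bool:
    # Crude stand-in for real device detection.
    return any(token in user_agent for token in ("iPhone", "Android", "Mobile"))


def has_search_referrer(referrer: str) -> bool:
    # Crude stand-in: a Google URL that carries a query parameter.
    return "google." in referrer and "q=" in referrer


def cache_key(url, user_agent="", referrer=""):
    """Return a cache key, or None when the response should not be cached."""
    if has_search_referrer(referrer):
        # Uncommon variant (search term highlighting): skip the cache.
        return None
    # Common variant (mobile layout): cache it under its own key.
    variants = {"mobile": is_mobile(user_agent)}
    raw = url + "|" + "|".join(f"{k}={v}" for k, v in sorted(variants.items()))
    return hashlib.sha256(raw.encode("utf-8")).hexdigest()
```

Desktop and mobile visitors to the same URL get different keys, so neither is served the other's document.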

Where Batcache is going

I want to make Batcache easier to configure by adding a configuration page and storing the main settings in memcached as well as the database. This way you won’t have to deploy a code change to update the configuration. However, conditional configurations (e.g. “never cache URLs matching some pattern”) and variant detection will probably always live in PHP.

I want to have Batcache serve correct headers more reliably. On some servers it can detect the headers that were sent with a newly generated page and serve them again from the cache. But when that doesn’t work you will have to take extra steps to serve certain headers. For example you must specify the Content-Encoding header in the Batcache configuration or add it to php.ini. I want this sort of thing to be done automatically for all server setups.

I know that Batcache is not ideal for most WordPress installations. It saves us a lot of headaches and expense at WordPress.com, so maybe it can help other large installations. If you try it, I want to hear from you whether it worked and how well. I am also keen to see what new configurations and modifications you use.

As always, this software is provided without claims or warranties. It’s so experimental that it doesn’t even have a version number! Until the project grows to need its own blog, keep an eye on the Trac browser for updates.

Automattic Stats for self-hosted WordPress

The new Automattic Stats plugin is available for download. It lets self-hosted WordPress bloggers use the exact same traffic metrics system we provide to WordPress.com users. It tracks post and page views, referrers, search terms, and clicks on your external links. It takes moments to install if you already have a WordPress blog and a WordPress.com API key. And it’s totally free.

Although the code is almost exclusively my work, I must give thanks for Matt’s guidance, Barry’s systems wrangling, and Rudy’s barbecue, each of which was indispensable.

The rest of this post will cover technical details of the system, how it works and why it’s cool. If you have a question I didn’t answer, leave a comment and I’ll do my best to answer.

How does it work?

The plugin adds a tiny image to your blog and that image is hosted on our servers. Every time your blog is viewed by a browser with JavaScript enabled, the browser downloads that image and we see a new line in our server logs. We then process the server logs and insert the data into a big MySQL database that we use to generate the lists and charts on your stats page.

There’s a little more to it than that. The plugin adds the post ID and referrer to the image URL so we know what the visitor is looking at and where they came from. We examine the referrer and if it looks like a search engine, we sift out the search terms and save those instead. Our servers also communicate with your blog from time to time, such as when you update the title of a post.
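
As a rough illustration of both halves, the sketch below builds an image URL carrying the post ID and referrer, then pulls search terms out of a referrer. It is not the actual stats code. The host name, the query parameter names, and the table of search engines are all placeholders chosen for the example.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

# Assumption: which query parameter each search engine uses for its terms.
SEARCH_PARAMS = {"google": "q", "bing": "q", "yahoo": "p"}


def tracking_url(blog_id, post_id, referrer):
    """Build the image URL the browser requests (placeholder host and names)."""
    query = urlencode({"blog": blog_id, "post": post_id, "ref": referrer})
    return "https://stats.example.invalid/g.gif?" + query


def search_terms(referrer):
    """Return the search terms if the referrer looks like a search engine."""
    parts = urlsplit(referrer)
    labels = parts.netloc.lower().split(".")
    for engine, param in SEARCH_PARAMS.items():
        if engine in labels:
            values = parse_qs(parts.query).get(param)
            if values:
                return values[0]
    return None  # not a recognized search engine: keep the referrer as-is
```

A log processor along these lines would call search_terms on each logged referrer and store the terms when it returns something, or the referrer itself otherwise.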

What makes it fast?

When you run your own stats system, your blog server has to do a lot of extra work to track each visit. We take that load off your server to keep it snappy.

By serving the JavaScript from stats.wordpress.com, we take advantage of the browser cache so that no matter how many blogs are visited, the script is only loaded once per week.

Clicks are reported asynchronously. Rather than the more common method of mangling URLs and forcing the visitor to wait during a redirect, the click stats are tracked using elements of AJAX. Your hrefs are safe and your visitors experience no delays.

The tracking gif loads fast because WordPress.com infrastructure just plain rocks.

What’s with the smiley face?

When we started developing stats for WordPress.com in 2005, Matt thought it would be cute. That’s his artwork.

No doubt, people will want to hide the smiley face. There are wrong ways to do this. Basically, anything that causes the image not to be loaded by the browser will break your stats.

Applying “display:none” to the image will break your stats. Don’t do it. If you want to hide the smiley face, add this to your stylesheet:

img#wpstats{width:0px;height:0px;overflow:hidden}

Why do my links point to WordPress.com?

All stats reports are rendered by our servers. We designed it this way for a lot of reasons. It’s faster this way because your server doesn’t have to connect to our server every time you look at your stats. It’s also better because we can update the reporting UI without forcing you to upgrade your plugin.

How much traffic can you handle?

The stats hardware is currently handling millions of views every day and we’re nowhere near capacity. We built this system with growth in mind. The software is ready to run on as many servers as we allocate for the purpose. Growing pains are inevitable but if we’ve done our job, you will never feel them.

Can I install this on my non-WordPress sites?

The short answer is that the system only supports WordPress blogs.

The long answer is that anyone with a thorough understanding of WordPress and XMLRPC could clone the plugin to work with other blogging platforms. I can’t prevent it, I won’t discourage it, I do expect it, and I don’t even mind it. There are pitfalls, however, and I do not plan to document the requirements. Here be dragons.

Anyone found abusing the system, causing undue loads on the servers, or inflicting headaches on me or Barry or anyone else, will be subject to having their API Key revoked and their name written in giant, fiery letters across the night sky to be cursed by all who see it. Please don’t abuse this free service.

Why does the date change before/after midnight?

To keep things fast and consistent, we are ignoring time zones and keeping all stats in UTC.
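
A small example shows why a view can land on the "wrong" day. The sketch assumes each view is assigned to the calendar date of its UTC timestamp, and the function name is hypothetical.

```python
from datetime import datetime, timedelta, timezone


def stats_day(moment: datetime) -> str:
    """Return the UTC calendar date a view is counted under.

    `moment` should be timezone-aware; a naive datetime would be
    interpreted as local time by astimezone().
    """
    return moment.astimezone(timezone.utc).date().isoformat()


# A view at 6:30 pm in a UTC-7 time zone is already the next day in UTC.
pacific = timezone(timedelta(hours=-7))
evening_view = datetime(2007, 5, 4, 18, 30, tzinfo=pacific)
counted_under = stats_day(evening_view)
```

For a blogger in that zone the stats day rolls over at 5 pm local time, which is the shift the question is asking about.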

SXSW Begins

I’m at SXSW with Matt and Barry, waiting for Glenda’s panel “How to Rawk SXSW” to begin. We’re working on the site whenever we get a chance but the real life is pretty interesting down here. So cut us a little slack. 😉

Sandbox 0.6.1 Live

We just deployed Sandbox 0.6.1 as a zip file and here on WordPress.com. These minor markup changes may affect your blog’s appearance:

  • There are now previous/next navigation links above the content as well as below. These can be selected with #nav-above and #nav-below.
  • The dl/dt/dd at the top of the archives have been simplified. No more crazy default styles to contend with.
  • The entry-meta separators now have a “metasep” class. This makes Asides much easier to do!
  • Whitespace has been removed from #globalnav to accommodate a problem in Internet Explorer.
  • The pesky abbr tag is now wrapped in a div with an “entry-date” class.

In reviewing category.php, Scott noted that it didn’t make sense for each post to have links to the current category archive. With some custom code, we now display “Also posted in…” if there are any other categories to show. See it on my WordPress category archive page.

One more change comes to mind: I removed a whole lot of code from almost every Sandbox template file. I’ll tell you about it in another post.

If you have published a Sandbox stylesheet, thank you and please update it as soon as possible!

Upcoming Sandbox markup changes

This is a small warning to all who have adopted Sandbox and begun to create your stylesheet:

The markup is going to change tonight. The changes will be small but noticeable. We’re changing some minor classes, adding some classes, removing some presentational text and removing the definition lists. These changes will go live on WordPress.com in a few hours. I’ll (probably) post more details and a link to an SVN repo and a download when it’s up.