pmuellr is Patrick Mueller

Sunday, June 14, 2009

offline web application cache abuse

I've been a user of hand-held computers since 1997, when IBM bought me a Palm Pilot to use in a customer demo I was putting together. Since then, I've been through a number of Palm devices, and up till last week had been using a Nokia N800 device. This week, I caved, and bought myself an iPod Touch.

One of the main uses I've had for these devices is reading. On the Palm, I used the wonderful iSilo program to convert HTML to some binary format that the Palm reader rendered as well as it could - good enough - even on a 160x160 pixel display. On the N800, HTML content could just be copied to an SD card, and then you could view it via the built-in web browser using file: URLs.

It turns out that I've been able to find plenty of material over the years, in HTML, that renders well enough on my hand-held devices. Either ready made, or built via scraping, or whatever.

what to use on an iPhone or iPod Touch

So, what are the options for reading on the iPhone/iPod Touch? Stanza is the only generic reader I've looked at so far, and it's not bad. It's particularly nice to have access to FeedBooks and Project Gutenberg books, which can be downloaded and then read later if you aren't online. Compare that to Google Books, which appears to only support reading while online. There's also Kindle, but I don't have any need to buy books; there's plenty of free content to be had. A bigger problem for me with Kindle is its 1995 level of HTML support; I like to read technical content with code samples, etc.

Beyond generic readers, there are also content-specific readers from content producers like the New York Times, the BBC, and AP News. The BBC one is especially strange. When displaying a full article, they seem to be displaying the same content that's on their web page, with various navigation and other links eating up space on the left and right sides of the page. You can double-tap the middle content to get that to zoom, but why wouldn't they do that in the first place? If you actually happen to click a link on the page, even to another story on the BBC, it exits their app and launches a browser on that page. Amazing, and not in a good sense.

Stanza seems like the most useful reader at present, at least for literature. I'm not really happy about the page-turning metaphor; I prefer to scroll pages vertically. I can probably learn to live with it.

But how can I get my own content on there, or more generally, rando HTML content on there?

EPUB

If you look at the formats of "documents" that Stanza supports, one particular format is EPUB. I've not been able to make my way through the morass that is the specifications surrounding this format, but if you're interested, here's how to get started: an EPUB file is a .zip file, so expand one with your favorite zip utility. You'll find a number of XML files inside, and you can kinda get a clue for how everything fits together by browsing those files.

The "meat" of an EPUB document consists of XHTML, CSS, and image files. Hmmm. Sounds like the web. Could you take some existing web site and easily "EPUB" it?

HTML5 Offline Application Cache

Turns out, there's something in HTML5, and more importantly, available on iPhone and iPod Touch devices, that can take an existing web site and make it available even if you aren't connected to a network. There is a W3C Working Group Note for this available as "Offline Web Applications". Apple has documentation on their support of this in the document "HTML 5 Offline Application Cache". See also the WHAT-WG version under development.

The basic idea is to do the least possible work. You need to create a new "manifest" file which lists all the files which should be cached for offline usage. And you need to point to that file from the manifest attribute of the <html> element of your document. That's all! The browser will arrange to cache all the content listed in the manifest.
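A minimal sketch of those two pieces, in Python. It builds the text of a manifest from a list of paths and adds a manifest attribute to an <html> start tag. The function names and the book.manifest file name are made up for illustration; the parts that come from the spec are the "CACHE MANIFEST" first line and the manifest attribute itself.

```python
import re


def manifest_text(paths):
    """Return cache manifest text: the required header, then one path per line."""
    lines = ["CACHE MANIFEST"]
    lines.extend(paths)
    return "\n".join(lines) + "\n"


def add_manifest_attribute(html, manifest_url):
    """Add manifest="..." to the first <html> start tag, unless it already has one."""

    def patch(match):
        tag = match.group(0)
        if re.search(r"\bmanifest\s*=", tag, re.IGNORECASE):
            return tag  # already points at a manifest, leave it alone
        # Drop the closing ">" and append the attribute before re-closing the tag.
        return tag[:-1] + ' manifest="%s">' % manifest_url

    return re.sub(r"<html\b[^>]*>", patch, html, count=1, flags=re.IGNORECASE)


if __name__ == "__main__":
    print(manifest_text(["index.html", "toc/index.html"]), end="")
    print(add_manifest_attribute("<html><body></body></html>", "book.manifest"))
```

Paths in the manifest are resolved relative to the manifest's own URL, so plain relative paths are enough when the manifest sits at the top of the site.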

I decided to try out how well this works with Mark Pilgrim's "Dive Into Python" book. I downloaded a copy of the book as separate HTML files, ran find on the result to create the manifest, whacked the <html> elements with a multi-file search-and-replace in my text editor, set up the files on my machine's local server, and browsed to it from my iPod Touch. Worked like a champ.
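The find and search-and-replace steps could also be scripted. Here's a rough sketch in Python; it is not the commands actually used above, just one way to get the same result. All the names (list_files, write_manifest, patch_html_files, book.manifest) are hypothetical, and it assumes the manifest lives at the top of the book directory and that every file under that directory should be cached. Files are patched as raw bytes so no assumption is made about their text encoding.

```python
import os
import re

# Matches the first <html ...> start tag, in any letter case.
HTML_TAG = re.compile(rb"<html\b[^>]*>", re.IGNORECASE)


def list_files(root):
    """Return sorted paths of all files under root, relative to root, using '/'."""
    found = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            rel = os.path.relpath(os.path.join(dirpath, name), root)
            found.append(rel.replace(os.sep, "/"))
    return sorted(found)


def write_manifest(root, manifest_name="book.manifest"):
    """Write a manifest at the top of root listing every other file; return the list."""
    paths = [p for p in list_files(root) if p != manifest_name]
    manifest_path = os.path.join(root, manifest_name)
    with open(manifest_path, "w", encoding="utf-8", newline="\n") as out:
        out.write("CACHE MANIFEST\n")
        for path in paths:
            out.write(path + "\n")
    return paths


def patch_html_files(root, manifest_name="book.manifest"):
    """Add a manifest attribute to each .html/.htm file; return how many changed."""
    patched = 0
    for rel in list_files(root):
        if not rel.lower().endswith((".html", ".htm")):
            continue
        # The attribute is a URL relative to the page, so climb out of subdirectories.
        url = "../" * rel.count("/") + manifest_name
        path = os.path.join(root, *rel.split("/"))
        with open(path, "rb") as f:
            data = f.read()
        match = HTML_TAG.search(data)
        if match is None or b"manifest=" in match.group(0).lower():
            continue  # no <html> tag, or it was already patched
        new_tag = match.group(0)[:-1] + b' manifest="' + url.encode("ascii") + b'">'
        data = data[:match.start()] + new_tag + data[match.end():]
        with open(path, "wb") as f:
            f.write(data)
        patched += 1
    return patched
```

Calling write_manifest("diveintopython") followed by patch_html_files("diveintopython") would cover both steps, where the directory name is whatever the book was unpacked into. Running it on a copy of the files is a good idea, since the HTML is rewritten in place.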

It seems to take 20-30 seconds to get all the files downloaded, and during that time there is no indication of what's going on in the browser. I was tail'ing my server's access log to tell when I was done. Presumably some of those events specified in the WHAT-WG version of the spec will help out with issues like that. After the files got downloaded, I could turn the wifi off on my device, and continue to browse the content.
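One thing worth checking on the local server: per the spec and Apple's document, the manifest needs to be served with the text/cache-manifest MIME type. Here's a minimal sketch, assuming Python's built-in http.server stands in for the local server and that the manifest file uses a .manifest extension. Both are assumptions, as are the ManifestHandler and serve names and port 8000. This server also writes one log line per request to the terminal, which gives much the same view as tailing an access log.

```python
import functools
from http.server import HTTPServer, SimpleHTTPRequestHandler


class ManifestHandler(SimpleHTTPRequestHandler):
    """Static file handler that knows the cache manifest MIME type."""

    # Copy the stock extension table and add an entry for .manifest files
    # (the extension is an assumption; use whatever your manifest is named).
    extensions_map = {
        **SimpleHTTPRequestHandler.extensions_map,
        ".manifest": "text/cache-manifest",
    }


def serve(root, port=8000):
    """Serve files under root until interrupted; each request is logged to stderr."""
    handler = functools.partial(ManifestHandler, directory=root)
    with HTTPServer(("", port), handler) as httpd:
        httpd.serve_forever()
```

Calling serve("diveintopython") would then serve that (hypothetical) directory on port 8000, and the log lines stop scrolling once the device has fetched everything in the manifest.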

"Dive Into Python" seemed like a good test case: non-trivial markup, especially code snippets; a significant amount of content (3 MB fully expanded); and presumably not really designed to be read on a hand-held device. I'm satisfied with the results, though I think most people will find the font sizes too small. The font sizes are too small for me when displaying in portrait mode, so make sure you try landscape as well. Tough decision point there: make the fonts bigger and you'll either get text wrapping or have to do side-to-side scrolling.
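To see how much content a manifest is asking the browser to hold on to, you could total up the files it lists. A rough sketch in Python, with hypothetical names (manifest_entries, total_bytes, book.manifest). It assumes the manifest sits at the top of the site directory and that its entries are relative file paths rather than full URLs; it skips comments, blank lines, and anything under a section header other than CACHE:.

```python
import os


def manifest_entries(manifest_path):
    """Return the explicitly cached entries listed in a cache manifest file."""
    with open(manifest_path, encoding="utf-8") as f:
        lines = f.read().splitlines()
    entries = []
    section = "CACHE"  # entries before any section header are cached explicitly
    for line in lines[1:]:  # the first line is the CACHE MANIFEST header
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        if line.endswith(":"):
            section = line[:-1]  # e.g. NETWORK: or FALLBACK:
            continue
        if section == "CACHE":
            entries.append(line)
    return entries


def total_bytes(root, manifest_name="book.manifest"):
    """Sum the on-disk sizes of the files the manifest lists, relative to root."""
    entries = manifest_entries(os.path.join(root, manifest_name))
    return sum(
        os.path.getsize(os.path.join(root, *entry.split("/")))
        for entry in entries
    )
```

This only measures what sits on the server's disk; what the browser actually stores, and what limits it applies, is a separate matter.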

I've got the files available on one of my servers, http://diveintopython-cached.muellerware.org/, which I'll leave up until my hosting provider kills me - each access to the page from an enabled browser will download the whole book (if it's not already been downloaded). Presumably this might work on other WebKit-based browsers, like the one on an Android device, or the Palm Pre.

On the iPhone or iPod Touch, after displaying the initial page, add a bookmark by clicking the + button, and then the "Add to Home Screen" button. Remember to wait 30 seconds or so after the initial page load before doing this, to make sure all the content gets downloaded. This will add the book as a new "app" on the main application screen. I've also done the appropriate horkiness to get the cover of Mark's book to show up as the application icon.

questions

This experience raises a number of questions regarding the HTML5 Offline Application Cache support:

  • How much space does the browser set aside for this cache? I assume there's a fixed upper limit for the cache, but does it compete with all other browser caching? Or is there a per-site or per-url cache limit?

  • How can I be informed when my cache content is thrown out of the cache? Is the entire set of files listed in the manifest thrown out atomically?

  • Is there enough functionality in the existing and proposed APIs to make reliable use of this capability?

I'm left wondering if it wouldn't be better to just provide a full-function JavaScript API to the cache in general, although clearly that could be done in addition to this declarative approach. My fear is that the declarative approach might take you 80% of the way there, but leave out critical capability in that missing 20%.

More experimentation required. Have at it.

Or maybe this is just simple abuse of the capability. Instead of caching web sites, you could certainly get more control by having all the book content stored in a client-side SQL database, and then build an HTML application to navigate through it. It's just a lot more work.