Wednesday, August 12, 2009

$weeks[-1]; # Home Stretch

We're into the last few days of the 2009 GSoC. Things have been moving very well for me and yesterday I was able to land the last major feature I wanted to implement in MojoX::UserAgent before the end of the program: opportunistic request pipelining. That is, the UserAgent (UA) will pipeline your HTTP requests when possible, based on three different settings:
  1. the maximum number of connections per host;
  2. the maximum number of consecutive pipelined requests;
  3. the strategy you want to use (see below).
Three strategies are available. The first (and current default) is not to pipeline at all. So, if you allow, say, 2 connections per host, then one request at a time will be sent on each of those connections, and any other request spooled into the UA beyond that is simply kept waiting until a connection becomes available, on a first-come, first-served basis.

The second strategy is what I call "horizontal" mode. In this mode, the UA prefers to spread things out over as many connections as possible before starting to pipeline. So if you allow 2 connections per host and 3 consecutive pipelined requests, and then spool up three requests (to the same host) and run them, you'll wind up with one request going out by itself on one connection and one two-request pipeline on the other.

In contrast, when using the last strategy, which I call "vertical", the UA prefers to pipeline as much as possible before spreading things out over many connections. So in this case, spooling and running three requests to the same host (also with maxconnections=2 and maxpipereqs=3) will result in all three requests being pipelined over a single connection. I hope that made sense...
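
To make the three strategies concrete, here is a rough sketch in Python. It is not the Perl code in MojoX::UserAgent: the function name `plan_dispatch`, the strategy label "none" and the parameter names are made up for illustration, and it only takes a static snapshot of where a batch of same-host requests would go, whereas the real UA dispatches requests as connections free up.

```python
from typing import List, Tuple


def plan_dispatch(requests: List[str], strategy: str,
                  max_connections: int, max_pipe_reqs: int
                  ) -> Tuple[List[List[str]], List[str]]:
    """Assign same-host requests to connections.

    Returns (per-connection queues, requests left waiting).
    All names here are hypothetical, not MojoX::UserAgent's API.
    """
    if strategy not in ("none", "horizontal", "vertical"):
        raise ValueError(f"unknown strategy: {strategy}")
    # Without pipelining, each connection carries one request at a time.
    depth = 1 if strategy == "none" else max_pipe_reqs
    capacity = max_connections * depth
    to_send, waiting = requests[:capacity], requests[capacity:]
    connections: List[List[str]] = [[] for _ in range(max_connections)]
    for i, request in enumerate(to_send):
        if strategy == "vertical":
            slot = i // depth            # fill one pipeline before using the next
        else:
            slot = i % max_connections   # spread across connections first
        connections[slot].append(request)
    return connections, waiting


for strategy in ("none", "horizontal", "vertical"):
    print(strategy, plan_dispatch(["r1", "r2", "r3"], strategy, 2, 3))
```

With 2 connections and up to 3 pipelined requests, the three requests end up as one-per-connection with one waiting ("none"), as a two-request pipeline plus a single request ("horizontal"), or as a single three-request pipeline ("vertical").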

In any case, basic session cookies are there and auto-following of redirects is also there. There are still plenty of rough edges and things that are simply not there yet (e.g. no proxy support, cookies don't expire, and so on). But in the last few days I rather hope to write up some documentation and code up a single upstream change in Mojo that will have a positive impact on pipelining performance. Once that's done, I'll call it a day for GSoC (and publish some reflexions here). But I do plan to continue working (voluntarily and on an ongoing basis) on MojoX::UserAgent and distribute releases through CPAN. In fact, my PAUSE id request came through yesterday. And I have a few other ideas that may just turn into projects...

Monday, August 3, 2009

We now resume our regularly scheduled programming

1, 2, 1, 2 - is this thing on?

OK, it's about time I updated this blog. For a couple of weeks after the midterm checkpoint, some external events interfered significantly with my progress. But I am now back to forging ahead, so to speak. I'll be adding one week to my schedule to compensate for the time lost - my original schedule went to the "suggested pencils down" date of August 10th, but I'll keep going full time until August 17th, the "firm" pencils down date. My main goal will be to get MojoX::UserAgent to a usable state. I would like it to be fully asynchronous (using the LWPng model as inspiration), with support for persistent connections, automatically following redirections, basic (i.e. session) cookies and (hopefully) opportunistic request pipelining.

I've already created the source repository, and in the last week have gotten to a point where it makes simple requests, invokes an asynchronous callback upon completion and automatically follows redirects. I have basic, in-memory cookie storage implemented, and am now working on per-request cookie retrieval. And I have discovered the "wonderful" world of cookie "standards" - or lack thereof. Multiple specifications, ambiguities and de-facto standard behavior established by dominant user-agents and even by quirks of major banking sites (example 1, example 2 (note that these read almost like entertaining short stories, at least for me (YMMV))). It even looks like the IETF's http-state working group is being resurrected. Ah well, here I go...

Saturday, July 11, 2009

Midterm Reflexions: Just About On Track

They say time flies when you're having fun... I would tend to agree! I'm about to fill out the Google midterm survey, and thought I would post some midterm reflexions here. First I guess I should mention recent accomplishments, since I didn't post last weekend.

Recent Events
First up, I ran into a blog post by Mark Stosberg that discussed some issues with cookie handling in Mojo. I contacted Mark via email, and he was kind enough to provide me with further details. Testing revealed the issue was still present. I'm not nearly as familiar with the various cookie specs, so some reading was required, but in the end I determined that the problem wasn't specific to cookies, but rather to HTTP headers in general. Of course, nowadays, one hopes that all Web developers know enough to scrub user-provided data before sticking it into things like cookies, to avoid unpleasantness. However, the (simple) fix I committed puts some armor in Mojo's boots, so to speak, and will thus make it harder for Web developers to shoot themselves in the foot.

Next up was a dark corner of the spec that threw me for a loop... It again has to do with 100-continue. I should note that I think Mojo's already fairly ahead of the game here. Of the browsers, my reading leads me to believe only Opera supports 100-continue as of yet. And, for example, Mojo's client code already implements behavior coming in JDK7. That being said, the spec allows for a problematic situation for servers. Clients are told not to wait forever to be told to continue by the server. Therefore, if a server tells a client not to continue with a request, and receives more data on the connection, it could be a new request, or it could be that the client didn't wait long enough and started sending the body of the previous request before it received the message from the server not to do this. How do you tell which is which? In most cases, it should be obvious, but unfortunately it's easy to think of a case where it's impossible for the server to make the distinction. After some reflection and some discussion on IRC, I decided to implement the "most prudent" behavior - the server should close the connection after sending the response to a declined request for a 100-continue.
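
The "most prudent" behavior can be sketched roughly as follows. This is Python, not Mojo's Perl, and the function name `answer_expectation` is invented for the example. The choice of 417 as the declining status is also an assumption, since a server could decline with a different final status; the point is only that a declined expectation is answered with a final response and the connection is then closed.

```python
from typing import Dict, Tuple


def answer_expectation(headers: Dict[str, str],
                       accept_body: bool) -> Tuple[bytes, bool]:
    """Return (bytes to send now, whether to close the connection afterwards).

    Hypothetical helper; status 417 for a declined request is an assumption.
    """
    normalized = {name.lower(): value for name, value in headers.items()}
    if normalized.get("expect", "").lower() != "100-continue":
        return b"", False  # no expectation: nothing to send before the body
    if accept_body:
        return b"HTTP/1.1 100 Continue\r\n\r\n", False
    # Declined: whatever arrives next could be a late request body or a new
    # request, so announce the close and drop the connection afterwards.
    response = (b"HTTP/1.1 417 Expectation Failed\r\n"
                b"Connection: close\r\n"
                b"Content-Length: 0\r\n\r\n")
    return response, True


print(answer_expectation({"Expect": "100-continue"}, accept_body=False))
```

Closing the connection sidesteps the ambiguity entirely: the server never has to guess whether the next bytes are a body or a request.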

But implementing this wasn't trivial, and it took me a while to figure out the most elegant way to implement the desired behavior. Which leads me into some more general observations.

On debugging & patching
One thing that's particular about this project so far is that it's been mostly patches & fixes to the Mojo codebase. I've spent a lot more time doing this than I expected at the start. I don't mind - in fact, I quite enjoy this kind of work. But that means the linecount metric is really low. It's much better to spend a bit of extra time and implement a fix in a way that really agrees with the overall architecture of the software, and often that means you can really reduce the linecount of the fix. And if you didn't write the software in the first place, you need to make extra-sure you really understand what's going on. More time, less code, yet better? I think so, because I feel an elegant, small fix is less likely to introduce new problems than a more complex, bolted-on one. The goal is good (software) architecture.

On The Tools
You need a good base to start with though - it's difficult to renovate a building that is on the verge of falling down because as soon as you start to work on something, everything falls apart. If I may toot the team horn, I have found that Mojo is well-engineered software. It was theoretically possible that I might find a problem or an issue that would require significant refactoring. That has simply not been the case so far.

I'm fairly impressed with Git and Github. I'm still a bit nervous when using rebase, but I can see how it makes things much easier for project maintainers when contributors ask them to integrate their code. However, rebasing makes it harder for me to track my own work using the network graph, as it squashes the revision history and may even change the reference point at which a branch was created. So neither you nor I can easily see how long I worked on a branch or how many commits went into it.

Perl? Well, Perl5 is Perl5 and it's been like meeting with an old friend you haven't seen in a long time - mostly a fun experience. You know there are a few character flaws, but you've learned to live with them and can re-adjust when they pop up. I must say that I do wish Perl6 was here by now. Rakudo is moving fast so maybe/hopefully by the end of the year. There seems to be some controversy in the community as to Perl5's release schedule and support policy... It's probably not appropriate at this point for me to take sides on that.

On The Community
The people I've been interacting with on a daily basis - both from Mojo and The Perl Foundation - have been nothing but helpful and present. Sebastian Riedel, in particular, has been fantastic, answering many questions, discussing spec, Perl and design issues, and constructively criticizing my submitted patches.

On My Progress
Looking at my proposal's schedule, I'd say I'm just about on track. There is one significant discrepancy. Since I've spent a lot more time hacking Mojo than I thought I would, I've actually skipped one of my deliverables - the "blackbox test suite". My mentor has told me he doesn't feel that that's a big deal and Sebastian also said I've fixed many more things in Mojo than he thought I would. So if you look at the beginning of my list of deliverables:
  1. Whitebox tests using Test::More [3] and integrating into Mojo’s current testing framework. Most tests will be concentrated on the Mojo::Message class and focus on HTTP/1.1 [4] content parsing, with emphasis on edge cases.
  2. A blackbox test suite to run against Mojo’s built-in server using an appropriate HTTP/1.1 client library (most likely libcurl [5]). Tests to include some cases of adversarial stance (e.g. deliberately malformed requests).
  3. If necessary, patches to Mojo that enable it to pass the test suites.
I would say that I've delivered #1 and skipped #2, because #3 has been much more important than expected. Also, many of the things I fixed were better suited to either whitebox testing or testing using raw telnet and/or a specially designed "fake server" that deliberately generates events that would otherwise be statistically rare occurrences on a real network with a real server.

So there you go. Next step: MojoX::UserAgent!

Friday, July 10, 2009

State Transitions in Mojo

Next up will be a "midterm" post, but I just want to post this first, because... well... because it was more fun on a Friday night to do this analysis. (Yes, I could also have gone out. But you know, I did that last night! Yes, really!) As you'll see, fun is obviously in the eye of the beholder, because this is unlikely to interest anyone but a very small handful of Mojo developers.

So a lot of the classes in Mojo are stateful. They inherit from the Stateful class. However, that class is pretty barebones, and "states" are fundamentally just a string property on an object. So there's no single place you can go to see the list of allowable states, and no documentation (let alone verification) of allowed state transitions. The first step in addressing this is to collect some data on the various stateful subclasses and see how they actually behave. So I instrumented the Stateful class and had it print out all state transitions. I then ran the Mojo test suite (including some non-default tests), which generated a little over 2100 state transitions. I fed these results to Graph::Easy (which I discovered thanks to a blog post on MojoX::Routes::AsGraph by Marcus Ramberg) in order to generate state diagrams. We may therefore consider this post as a kind of "documenting-on-the-run"...
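
For readers who want to try the same kind of instrumentation, here is a minimal sketch of the idea in Python. It is not Mojo::Stateful itself: the `Stateful` class below is a stand-in, the initial state name "start" and the `Headers` subclass with its state names are assumptions made for the example, and it emits Graphviz DOT text rather than Graph::Easy's own input format.

```python
from collections import Counter


class Stateful:
    """Stand-in base class: the state is just a string property."""

    # Shared by all subclasses: (class name, old state, new state) -> count
    transitions: Counter = Counter()

    def __init__(self) -> None:
        self._state = "start"  # assumed initial state

    @property
    def state(self) -> str:
        return self._state

    @state.setter
    def state(self, new_state: str) -> None:
        # The instrumentation: record every transition before applying it.
        key = (type(self).__name__, self._state, new_state)
        Stateful.transitions[key] += 1
        self._state = new_state


def to_dot(class_name: str) -> str:
    """Render the recorded transitions of one class as a DOT digraph."""
    lines = [f'digraph "{class_name}" {{']
    for (cls, old, new), count in sorted(Stateful.transitions.items()):
        if cls == class_name:
            lines.append(f'  "{old}" -> "{new}" [label="{count}"];')
    lines.append("}")
    return "\n".join(lines)


class Headers(Stateful):  # hypothetical subclass, for illustration only
    pass


h = Headers()
h.state = "headers"
h.state = "done"
print(to_dot("Headers"))
```

Each edge label is the number of times that transition was seen, which is what the numbers on the arrows in the diagrams represent.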

Here are the results, in increasing order of complexity.

First up, Mojo::Stateful:

Simple enough. The number on the arrow indicates the number of times that particular state transition occurred during the testing. (That gives an indication of how much "exercise" the current test suite gives each state transition.) Next comes Mojo::Headers:

is still pretty simple:

And Mojo::Message::Request is barely more complicated:

At the same level of complexity, Mojo::Filter::Chunked:

One more step up are Mojo::Content:

and Mojo::Content::Multipart:

Finally, here comes the punchline: the two classes I've been working most heavily on/with.


And Mojo::Pipeline:

There you go! Simple as pie, right?

Monday, June 29, 2009

Week 5: Learning About the Chainsaw

So last Monday as I was doing some testing for proper handling of POST requests (or, more technically, non-idempotent requests) in a pipeline, Sebastian Riedel told me over IRC that really the highest priority now should be proper handling of unexpected 1xx responses in the client code. So I started to look at that, and it seemed a bit messy. Most HTTP implementations, including Mojo's, use a state machine premised on the model that you write out a request, and then you get a response. But with 1xx informational responses, the spec asks clients for a lot. Clients should be able to handle an arbitrary number of informational responses before the real response comes along. So now each request may generate multiple responses, though only one of them is final. Worse, those responses may come at any time, even before you're done sending the request.
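
As a rough illustration of what "an arbitrary number of informational responses" means for a client, here is a small Python sketch. It is not how Mojo's state machine is written, and the function name `read_until_final` is invented. It works on a complete byte buffer and ignores the special case of 101 (Switching Protocols); it relies only on the fact that 1xx responses consist of a status line and headers with no body.

```python
from typing import List, Tuple


def read_until_final(stream: bytes) -> Tuple[List[int], int, bytes]:
    """Skip any number of 1xx responses at the front of `stream`.

    Returns (informational status codes seen, final status code,
    bytes following the final response's header block).
    """
    informational: List[int] = []
    rest = stream
    while True:
        head, separator, rest = rest.partition(b"\r\n\r\n")
        if not separator:
            raise ValueError("incomplete response head")
        status_line = head.split(b"\r\n", 1)[0].decode("ascii")
        code = int(status_line.split(" ", 2)[1])
        if 100 <= code < 200:
            # Informational: header block only, so keep reading.
            informational.append(code)
            continue
        return informational, code, rest


raw = (b"HTTP/1.1 100 Continue\r\n\r\n"
       b"HTTP/1.1 102 Processing\r\n\r\n"
       b"HTTP/1.1 200 OK\r\nContent-Length: 2\r\n\r\nok")
print(read_until_final(raw))
```

The harder part described in the post, informational responses arriving while the request is still being written, is a matter of when this reading happens, which a buffer-based sketch like this cannot show.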

I started implementing this in small steps. I first looked at the existing code for 100-Continue responses, that is the special case where the client explicitly asks to receive an informational response before sending the main body of a request. Understanding this would likely allow me to implement the more general case in the least intrusive way possible with respect to the codebase's existing architecture. In the process, I found some small issues with that code, made it more resilient, committed my changes to a new branch and pushed that out to Github. I then moved to the easier case of receiving an unexpected informational response: when it doesn't interrupt the writing out of the request. That wasn't too hard, and again after committing, I pushed out those changes to Github. There followed a flash of inspiration on how to handle the case when the writing of a request is interrupted: where I expected to have to write a delicate piece of code enacting exceptions to the normal flow of states, I wound up with a single line change! I was pretty happy with that. Add a POD update, a typo fix, some test cases and a last code change as I realized thanks to the tests that the code wasn't handling multiple subsequent informational responses properly.

So by the time I sent a pull request, I was about six commits out on my branch. Sebastian didn't like that. Multiple commits make it harder for him to evaluate the global diff with his master branch. I used Git to generate the overall diff and nopasted it. Not enough: a merge would still import my commit history and unduly "pollute" the commit history of the master branch. So I tried to merge my changes out to a new branch that would only be one commit away from master. No go, since a merge takes the commit history with it.

So after some research, the answer was to learn how to squash commits with git-rebase. I must say, doing this felt a little like learning to sculpt with a chainsaw - it feels a little risky. Both because you "lose" your commit history and because you're messing with the revision control software in a way that seems a little dangerous. But Git is apparently fully up to the task. And as long as no one is already dependent on your branch, you can even push your changes out to Github by applying a little --force. Github will appropriately update its revision history network graph. Impressive! Maybe I'll eventually learn to trust the chainsaw...

Anyhow, that solved the problem, and Sebastian says from now on he'll insist on squashed commits before pull requests. And since this post is already way too long, I'll just add that my other accomplishment for the week was adding a "safe post" option to change how POST requests are handled in a Pipeline object. I also had to rebase that, and it still felt pretty weird, but Git seems pretty solid...

Monday, June 22, 2009

Week 4: About last week

I was hoping to be able to update my blog this weekend, but alas, life had other plans... Anyway, last week my main achievement was to teach the client side of Mojo about (pipelined) HEAD requests. The changes required were fairly significant. This is explained by the fact that in the Mojo architecture, pipelined requests are handled by a Pipeline object; in turn a Pipeline has a bunch of Transactions, each transaction has a Request and a Response, which are in turn Messages, which use Content filters. And all of these objects are Stateful. Changes had to trickle all the way down, and so I also had to make sure I understood things all the way down. I could see three ways to implement my change, and after choosing one, implementing it and sending a pull request, I was pretty happy that Sebastian agreed that the method I picked made the most sense. Integrating into an open source project is not just about writing code, it is also about understanding the sense of design of the main architect(s) - at least if you're interested in getting your changes merged into the mainline project. Perl philosophy may have TIMTOWTDI at its heart, but some ways to do it are more equal than others, and software developers tend to be opinionated...
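
Why HEAD needs special care in a pipeline can be shown with a small Python sketch (again, not Mojo's code; `split_pipelined` is an invented name). A response to HEAD may carry a Content-Length header but never a body, so the client has to remember which request each response answers. The sketch handles only Content-Length framing, not chunked encoding, and assumes the whole byte stream is already in hand.

```python
from collections import deque
from typing import List


def split_pipelined(methods: List[str], stream: bytes) -> List[bytes]:
    """Split a stream of pipelined responses into bodies, one per request.

    `methods` lists the request methods in the order they were sent.
    """
    pending = deque(methods)  # responses arrive in request order
    bodies: List[bytes] = []
    while pending:
        method = pending.popleft()
        head, _, stream = stream.partition(b"\r\n\r\n")
        lines = head.decode("ascii").split("\r\n")
        status = int(lines[0].split(" ")[1])
        headers = {}
        for line in lines[1:]:
            name, _, value = line.partition(":")
            headers[name.strip().lower()] = value.strip()
        length = int(headers.get("content-length", "0"))
        # HEAD responses (and 204/304) have no body, whatever the header says.
        if method == "HEAD" or status in (204, 304):
            length = 0
        bodies.append(stream[:length])
        stream = stream[length:]
    return bodies


one = b"HTTP/1.1 200 OK\r\nContent-Length: 5\r\n\r\n"
print(split_pipelined(["HEAD", "GET"], one + one + b"hello"))
```

A parser that ignored the request method would swallow the first five bytes of the second response as the "body" of the first and lose its place in the stream.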

Now I'm thinking more and more about MojoX::UserAgent. But one thing I am worried about is that I am diverging somewhat from my initial GSoC plan. Not in the overall direction and the goal to write a UserAgent, but in some of the intermediate deliverables. This is something I will need to discuss with my mentor to see how it should be handled...

Sunday, June 14, 2009

Week 3: Over To The Client-Side

Week 3 has been a bit slow, I must say, as obligations outside of GSoC intruded upon my coding time. The level of concentration required makes it difficult to debug when one doesn't get long uninterrupted stretches of time to do it. This week, I first spent a bit of time chasing a problem that has the daemon tests hanging on Ubuntu - and probably other Debian-derived distros. I traced things down to the way dash does signal handling, but the problem is "exotic enough" - a non-default test on a particular platform (see here for something similar) - that I didn't commit a fix, and simply moved on to client-side testing.

I suspected that the client code in Mojo might exhibit some issues similar to those I found on the server-side when dealing with pipelined requests. Testing the client-side requires a bit more setup than server-side. When testing the server, one can simply use telnet as a low-level client and have full control over how requests are sent to the server. Now I had to set up a simple "fakeserver" to serve responses precisely how I wanted them to be sent in order to try to trigger bugs in the client. I soon found a problem, and I also discovered that the test infrastructure had no clean way to build a test case to demonstrate the issue, so I have started building that. And that's where I'm at: down a couple of commits on a branch, with a few more to come before I think the changes should be merged back into mainline Mojo...
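
The general shape of such a "fakeserver" might look like the following Python sketch. It is not the one used for the Mojo tests, and the name `serve_script` and its parameters are invented. It accepts a single connection on a free localhost port, waits for some request bytes, then sends back a scripted list of chunks exactly as given, with an optional pause between them.

```python
import socket
import threading
import time
from typing import List, Tuple


def serve_script(chunks: List[bytes],
                 delay: float = 0.0) -> Tuple[int, threading.Thread]:
    """Start a one-shot scripted server; return (port, server thread)."""
    listener = socket.create_server(("127.0.0.1", 0))  # port 0: pick a free one
    port = listener.getsockname()[1]

    def run() -> None:
        with listener:
            conn, _ = listener.accept()
            with conn:
                conn.recv(65536)  # wait for (part of) the client's request
                for chunk in chunks:
                    conn.sendall(chunk)  # sent exactly as scripted
                    time.sleep(delay)    # lets the test control timing

    thread = threading.Thread(target=run, daemon=True)
    thread.start()
    return port, thread
```

A test would point the client under test at `127.0.0.1` and the returned port, and script the chunks to split responses at awkward places or to interleave informational responses, which is hard to provoke against a real server.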