Wednesday, August 12, 2009

$weeks[-1]; # Home Stretch

We're into the last few days of the 2009 GSoC. Things have been moving very well for me and yesterday I was able to land the last major feature I wanted to implement in MojoX::UserAgent before the end of the program: opportunistic request pipelining. That is, the UserAgent (UA) will pipeline your HTTP requests when possible, based on three different settings:
  1. the maximum number of connections per host;
  2. the maximum number of consecutive pipelined requests;
  3. the strategy you want to use (see below).
Three strategies are available. The first (and current default) one is not to pipeline at all. So, if you allow, say, 2 connections per host, then one request at a time will be sent on each of those connections, and any other request spooled into the UA beyond that is simply kept waiting until there is an available connection, on a first come first served basis. The second strategy is what I call "horizontal" mode. In this mode, the UA prefers to spread things out over as many connections as possible before starting to pipeline. So if you allow 2 connections per host, and three consecutive pipelined requests, and then spool up three requests (to the same host) and run them, you'll wind up with one request going out by itself on one connection and one two-request pipeline on the other. In contrast, when using the last strategy, which I call "vertical", the UA prefers to pipeline as much as possible before spreading things out over many connections. So in this case, spooling and running three requests to the same host (also with maxconnections=2 and maxpipereqs=3) will result in all three requests being pipelined over a single connection. I hope that made sense...
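
As a rough illustration of the three strategies, here is a minimal Python sketch. It is not MojoX::UserAgent's actual code: the function name, the strategy labels and the parameter names are assumptions made up for this example, and it only models how many requests land on each connection.

```python
def plan_requests(strategy, n_requests, max_connections, max_pipe_reqs):
    """Return (requests per connection, requests left waiting).

    Hypothetical model of the three strategies described above:
    "none", "horizontal" and "vertical".
    """
    if strategy not in ("none", "horizontal", "vertical"):
        raise ValueError(f"unknown strategy: {strategy}")

    # Without pipelining, each connection carries one request at a time.
    depth = 1 if strategy == "none" else max_pipe_reqs
    loads = [0] * max_connections
    waiting = 0

    for _ in range(n_requests):
        # Connections that can still accept another request.
        open_slots = [i for i, load in enumerate(loads) if load < depth]
        if not open_slots:
            waiting += 1  # kept in the queue, first come first served
            continue
        if strategy == "vertical":
            # Prefer the fullest connection that still has room.
            target = max(open_slots, key=lambda i: loads[i])
        else:
            # Prefer the emptiest connection (spread out first).
            target = min(open_slots, key=lambda i: loads[i])
        loads[target] += 1

    return loads, waiting


if __name__ == "__main__":
    for name in ("none", "horizontal", "vertical"):
        print(name, plan_requests(name, 3, 2, 3))
```

With 2 connections, a pipeline depth of 3 and three requests, this model reproduces the cases described above: one request per connection with one left waiting, a 2 + 1 split in horizontal mode, and all three on a single connection in vertical mode.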

In any case, basic session cookies are there and auto-following of redirects is also there. There are still plenty of rough edges and things that are simply not there yet (e.g. no proxy support, cookies don't expire, and so on). In the last few days, I hope to write up some documentation and code up a single upstream change in Mojo that will have a positive impact on pipelining performance. Once that's done, I'll call it a day for GSoC (and publish some reflexions here). But I do plan to continue working (voluntarily and on an ongoing basis) on MojoX::UserAgent and distribute releases through CPAN. In fact, my PAUSE ID request came through yesterday. And I have a few other ideas that may just turn into projects...

Monday, August 3, 2009

We now resume our regularly scheduled programming

1, 2, 1, 2 - is this thing on?

OK, it's about time I updated this blog. For a couple of weeks after the midterm checkpoint, some external events interfered significantly with my progress. But I am now back to forging ahead, so to speak. I'll be adding one week to my schedule to compensate for the time lost - my original schedule went to the "suggested pencils down" date of August 10th, but I'll keep going full time until August 17th, the "firm" pencils down date. My main goal will be to get MojoX::UserAgent to a usable state. I would like it to be fully asynchronous (using the LWPng model as inspiration), with support for persistent connections, automatically following redirections, basic (i.e. session) cookies and (hopefully) opportunistic request pipelining.

I've already created the source repository, and in the last week have gotten to a point where it makes simple requests, invokes an asynchronous callback upon completion and automatically follows redirects. I have basic, in-memory cookie storage implemented, and am now working on per-request cookie retrieval. And I have discovered the "wonderful" world of cookie "standards" - or lack thereof. Multiple specifications, ambiguities and de-facto standard behavior established by dominant user-agents and even by quirks of major banking sites (example 1, example 2 (note that these read almost like entertaining short stories, at least for me (YMMV))). It even looks like the IETF's http-state working group is being resurrected. Ah well, here I go...

Saturday, July 11, 2009

Midterm Reflexions: Just About On Track

They say time flies when you're having fun... I would tend to agree! I'm about to fill out the Google midterm survey, and thought I would post some midterm reflexions here. First I guess I should mention recent accomplishments, since I didn't post last weekend.

Recent Events
First up, I ran into a blog post by Mark Stosberg that discussed some issues with cookie handling in Mojo. I contacted Mark via email, and he was kind enough to provide me with further details. Testing revealed the issue was still present. I'm not nearly as familiar with the various cookie specs, so some reading was required, but in the end I determined that the problem wasn't specific to cookies, but rather to HTTP headers in general. Of course, nowadays, one hopes that all Web developers know enough to scrub user-provided data before sticking it into things like cookies, to avoid unpleasantness. However, the (simple) fix I committed puts some armor in Mojo's boots, so to speak, and will thus make it harder for Web developers to shoot themselves in the foot.

Next up was a dark corner of the spec that threw me for a loop... It again has to do with 100-continue. I should note that I think Mojo's already fairly ahead of the game here. Of the browsers, my reading leads me to believe only Opera supports 100-continue as of yet. And, for example, Mojo's client code already implements behavior coming in JDK7. That being said, the spec allows for a problematic situation for servers. Clients are told not to wait forever to be told to continue by the server. Therefore, if a server tells a client not to continue with a request, and receives more data on the connection, it could be a new request, or it could be that the client didn't wait long enough and started sending the body of the previous request before it received the message from the server not to do this. How do you tell which is which? In most cases, it should be obvious, but unfortunately it's easy to think of a case where it's impossible for the server to make the distinction. After some reflection and some discussion on IRC, I decided to implement the "most prudent" behavior - the server should close the connection after sending the response to a declined request for a 100-continue.
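
To make the "most prudent" behavior concrete, here is a small Python sketch of the decision a server faces when a request carries `Expect: 100-continue`. It is an illustration only, not Mojo's implementation: the function name and the returned dictionary are invented for this example, and 417 Expectation Failed is just one possible final status for a declined request.

```python
def answer_expect_continue(accept_body, final_status="417 Expectation Failed"):
    """Decide how to answer a request that sent 'Expect: 100-continue'.

    Hypothetical helper: returns the bytes to send and whether the
    connection should be closed afterwards.
    """
    if accept_body:
        # Tell the client to go ahead; the final response comes later,
        # once the body has been read, so the connection stays open.
        return {"send": b"HTTP/1.1 100 Continue\r\n\r\n", "close": False}

    # Declined: send a final response and close the connection, so that
    # any body bytes the client may already have started sending can never
    # be mistaken for the start of a new pipelined request.
    response = (
        f"HTTP/1.1 {final_status}\r\n"
        "Connection: close\r\n"
        "Content-Length: 0\r\n"
        "\r\n"
    ).encode("ascii")
    return {"send": response, "close": True}
```

Closing the connection trades a little efficiency for certainty: the server never has to guess whether the next bytes are a late request body or a fresh request.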

But implementing this wasn't trivial, and it took me a while to figure out the most elegant way to implement the desired behavior. Which leads me into some more general observations.

On debugging & patching
One thing that's particular about this project so far is that it's been mostly patches & fixes to the Mojo codebase. I've spent a lot more time doing this than I expected at the start. I don't mind - in fact, I quite enjoy this kind of work. But that means the linecount metric is really low. It's much better to spend a bit of extra time and implement a fix in a way that really agrees with the overall architecture of the software, and often that means you can really reduce the linecount of the fix. And if you didn't write the software in the first place, you need to make extra-sure you really understand what's going on. More time, less code, yet better? I think so, because I feel an elegant, small fix is less likely to introduce new problems than a more complex, bolted-on one. The goal is good (software) architecture.

On The Tools
You need a good base to start with though - it's difficult to renovate a building that is on the verge of falling down because as soon as you start to work on something, everything falls apart. If I may toot the team horn, I have found that Mojo is well-engineered software. It was theoretically possible that I might find a problem or an issue that would require significant refactoring. That has simply not been the case so far.

I'm fairly impressed with Git and Github. I'm still a bit nervous when using rebase, but I can see how it makes things much easier for project maintainers when contributors ask them to integrate their code. However, rebasing makes it harder for me to track my own work using the network graph, as it squashes the revision history and may even change the reference point at which a branch was created. So neither you nor I can easily see how long I worked on a branch or how many commits went into it.

Perl? Well, Perl5 is Perl5 and it's been like meeting with an old friend you haven't seen in a long time - mostly a fun experience. You know there are a few character flaws, but you've learned to live with them and can re-adjust when they pop-up. I must say that I do wish Perl6 was here by now. Rakudo is moving fast so maybe/hopefully by the end of the year. There seems to be some controversy in the community as to Perl5's release schedule and support policy... It's probably not appropriate at this point for me to take sides on that.

On The Community
The people I've been interacting with on a daily basis, both from Mojo and The Perl Foundation, have been nothing but helpful and present. Sebastian Riedel, in particular, has been fantastic, answering many questions, discussing spec, Perl and design issues, and constructively criticizing my submitted patches.

On My Progress
Looking at my proposal's schedule, I'd say I'm just about on track. There is one significant discrepancy. Since I've spent a lot more time hacking Mojo than I thought I would, I've actually skipped one of my deliverables - the "blackbox test suite". My mentor has told me he doesn't feel that that's a big deal and Sebastian also said I've fixed many more things in Mojo than he thought I would. So if you look at the beginning of my list of deliverables:
  1. Whitebox tests using Test::More [3] and integrating into Mojo’s current testing framework. Most tests will be concentrated on the Mojo::Message class and focus on HTTP/1.1 [4] content parsing, with emphasis on edge cases.
  2. A blackbox test suite to run against Mojo’s built-in server using an appropriate HTTP/1.1 client library (most likely libcurl [5]). Tests to include some cases of adversarial stance (e.g. deliberately malformed requests).
  3. If necessary, patches to Mojo that enable it to pass the test suites.
I would say that I've delivered #1 and skipped #2, because #3 has been much more important than expected. Also, many of the things I fixed were better suited to either whitebox testing or testing using raw telnet and/or a specially designed "fake server" that deliberately generates events which would otherwise be statistically rare occurrences on a real network with a real server.

So there you go. Next step: MojoX::UserAgent!

Friday, July 10, 2009

State Transitions in Mojo

Next up will be a "midterm" post, but I just want to post this first, because... well... because it was more fun on a Friday night to do this analysis. (Yes, I could also have gone out. But you know, I did that last night! Yes, really!) As you'll see, fun is obviously in the eye of the beholder, because this is unlikely to interest anyone but a very small handful of Mojo developers.

So a lot of the classes in Mojo are stateful. They inherit from the Stateful class. However, that class is pretty barebones, and "states" are fundamentally just a string property on an object. So there's no single place you can go to see the list of allowable states, and no documentation (let alone verification) of allowed state transitions. The first step in addressing this is to collect some data on the various stateful subclasses and see how they actually behave. So I instrumented the Stateful class and had it print out all state transitions. I then ran the Mojo test suite (including some non-default tests), which generated a little over 2100 state transitions. I fed these results to Graph::Easy (which I discovered thanks to a blog post on MojoX::Routes::AsGraph by Marcus Ramberg) in order to generate state diagrams. We may therefore consider this post as a kind of "documenting-on-the-run"...
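
The instrumentation idea can be sketched in a few lines of Python. This is only an analogy to what was done in Perl: the class below is a stand-in for Mojo's Stateful base class, the initial state name and the `Transaction` subclass are assumptions for the example, and the output is a plain edge list rather than Graph::Easy input.

```python
from collections import Counter


class Stateful:
    """Minimal stand-in for a base class whose state is just a string."""

    # Shared tally of (class name, old state, new state) transitions.
    transitions = Counter()

    def __init__(self, initial="start"):
        self._state = initial

    @property
    def state(self):
        return self._state

    @state.setter
    def state(self, new_state):
        # Record every transition before applying it.
        key = (type(self).__name__, self._state, new_state)
        Stateful.transitions[key] += 1
        self._state = new_state


class Transaction(Stateful):
    """Hypothetical subclass, used only to show per-class tallies."""


def edge_list():
    """Return sorted 'Class: old -> new (count)' lines for all transitions."""
    return [
        f"{cls}: {old} -> {new} ({count})"
        for (cls, old, new), count in sorted(Stateful.transitions.items())
    ]


if __name__ == "__main__":
    for _ in range(2):
        tx = Transaction()
        tx.state = "write"
        tx.state = "done"
    print("\n".join(edge_list()))
```

Each counted edge corresponds to one arrow in a state diagram, with the count as its label.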

Here are the results, in increasing order of complexity.

First up, Mojo::Stateful:


Simple enough. The number on the arrow indicates the number of times that particular state transition occurred during the testing. (That gives an indication of how much "exercise" the current test suite gives each state transition.) Next comes Mojo::Headers:

Mojo::Message::Response is still pretty simple:

And Mojo::Message::Request is barely more complicated:

At the same level of complexity, Mojo::Filter::Chunked:

One more step up are Mojo::Content:


and Mojo::Content::Multipart:

Finally, here comes the punchline: the two classes I've been working most heavily on/with.

Mojo::Transaction:


And Mojo::Pipeline:

There you go! Simple as pie, right?

Monday, June 29, 2009

Week 5: Learning About the Chainsaw

So last Monday as I was doing some testing for proper handling of POST requests (or, more technically, non-idempotent requests) in a pipeline, Sebastian Riedel told me over IRC that really the highest priority now should be proper handling of unexpected 1xx responses in the client code. So I started to look at that, and it seemed a bit messy. Most HTTP implementations, including Mojo's, use a state machine premised on the model that you write out a request, and then you get a response. But with 1xx informational responses, the spec asks clients for a lot. Clients should be able to handle an arbitrary number of informational responses before the real response comes along. So now each request may generate multiple responses, though only one of them is final. Worse, those responses may come at any time, even before you're done sending the request.
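
The "many responses, one final" rule can be shown with a tiny Python sketch. It is a simplification that only looks at status codes (the function name is made up, and real client code has to parse the full responses as they arrive):

```python
def collect_responses(status_codes):
    """Split a stream of status codes for ONE request into
    (informational codes seen, final code or None if still waiting)."""
    interim = []
    for code in status_codes:
        if 100 <= code < 200:
            # Informational: note it and keep waiting for the real answer.
            interim.append(code)
        else:
            # First non-1xx status is the final response for this request.
            return interim, code
    return interim, None
```

For example, `collect_responses([100, 102, 200])` yields two informational responses followed by the final 200, while `collect_responses([100])` reports that the final response has not arrived yet.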

I started implementing this in small steps. I first looked at the existing code for 100-Continue responses, that is the special case where the client explicitly asks to receive an informational response before sending the main body of a request. Understanding this would likely allow me to implement the more general case in the least intrusive way possible with respect to the codebase's existing architecture. In the process, I found some small issues with that code, made it more resilient, committed my changes to a new branch and pushed that out to Github. I then moved to the easier case of receiving an unexpected informational response: when it doesn't interrupt the writing out of the request. That wasn't too hard, and again after committing, I pushed out those changes to Github. There followed a flash of inspiration on how to handle the case when the writing of a request is interrupted: where I expected to have to write a delicate piece of code enacting exceptions to the normal flow of states, I wound up with a single line change! I was pretty happy with that. Add a POD update, a typo fix, some test cases and a last code change as I realized thanks to the tests that the code wasn't handling multiple subsequent informational responses properly.

So by the time I sent a pull request, I was about six commits out on my branch. Sebastian didn't like that. Multiple commits make it harder for him to evaluate the global diff with his master branch. I used Git to generate the overall diff and nopasted it. Not enough: a merge would still import my commit history and unduly "pollute" the commit history of the master branch. So I tried to merge my changes out to a new branch that would only be one commit away from master. No go, since a merge takes the commit history with it.

So after some research, the answer was to learn how to squash commits with git-rebase. I must say, doing this felt a little like learning to sculpt with a chainsaw - it feels a little risky. Both because you "lose" your commit history and because you're messing with the revision control software in a way that seems a little dangerous. But Git is apparently fully up to the task. And as long as no one is already dependent on your branch, you can even push your changes out to Github by applying a little --force. Github will appropriately update its revision history network graph. Impressive! Maybe I'll eventually learn to trust the chainsaw...

Anyhow, that solved the problem, and Sebastian says from now on he'll insist on squashed commits before pull requests. And since this post is already way too long, I'll just add that my other accomplishment for the week was adding a "safe post" option to change how POST requests are handled in a Pipeline object. I also had to rebase that, and it still felt pretty weird, but Git seems pretty solid...

Monday, June 22, 2009

Week 4: About last week

I was hoping to be able to update my blog this weekend, but alas, life had other plans... Anyway, last week my main achievement was to teach the client side of Mojo about (pipelined) HEAD requests. The changes required were fairly significant. This is explained by the fact that in the Mojo architecture, pipelined requests are handled by a Pipeline object; in turn a Pipeline has a bunch of Transactions, each transaction has a Request and a Response, which are in turn Messages, which use Content filters. And all of these objects are Stateful. Changes had to trickle all the way down, and so I also had to make sure I understood things all the way down. I could see three ways to implement my change, and after choosing one, implementing it and sending a pull request, I was pretty happy that Sebastian agreed that the method I picked made the most sense. Integrating into an open source project is not just about writing code, it is also about understanding the sense of design of the main architect(s) - at least if you're interested in getting your changes merged into the mainline project. Perl philosophy may have TIMTOWTDI at its heart, but some ways to do it are more equal than others, and software developers tend to be opinionated...
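
One reason HEAD is awkward for a pipelining client comes straight from the HTTP/1.1 message-length rules: a response to HEAD has headers (possibly including Content-Length) but never a body, so the response parser has to know which request it is answering. The Python sketch below illustrates that rule only; it says nothing about how the change was actually implemented in Mojo, and the function names are invented for the example.

```python
def response_has_body(request_method, status_code):
    """Apply the HTTP/1.1 rule for when a response carries a message body."""
    if request_method.upper() == "HEAD":
        return False  # headers only, whatever Content-Length says
    if 100 <= status_code < 200 or status_code in (204, 304):
        return False  # these statuses never include a body
    return True


def match_pipeline(request_methods, status_codes):
    """Pair pipelined requests with responses in order (responses arrive
    in the order the requests were sent) and flag which ones have a body."""
    return [
        (method, status, response_has_body(method, status))
        for method, status in zip(request_methods, status_codes)
    ]
```

For a pipeline of GET followed by HEAD, both answered with 200, only the first response should be read as having a body; reading a body for the second would swallow the start of whatever response comes next.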

Now I'm thinking more and more about MojoX::UserAgent. But one thing I am worried about is that I am diverging somewhat from my initial GSoC plan. Not in the overall direction and the goal to write a UserAgent, but in some of the intermediate deliverables. This is something I will need to discuss with my mentor to see how it should be handled...

Sunday, June 14, 2009

Week 3: Over To The Client-Side

Week 3 has been a bit slow, I must say, as obligations outside of GSoC intruded upon my coding time. The level of concentration required makes it difficult to debug when one doesn't get long uninterrupted stretches of time to do it. This week, I first spent a bit of time chasing a problem that has the daemon tests hanging on Ubuntu - and probably other Debian-derived distros. I traced things down to the way dash does signal handling, but the problem is "exotic enough" - a non-default test on a particular platform (see here for something similar) - that I didn't commit a fix, and simply moved on to client-side testing.

I suspected that the client code in Mojo might exhibit some issues similar to those I found on the server-side when dealing with pipelined requests. Testing the client-side requires a bit more setup than server-side. When testing the server, one can simply use telnet as a low-level client and have full control over how requests are sent to the server. Now I had to set up a simple "fakeserver" to serve responses precisely how I wanted them to be sent in order to try to trigger bugs in the client. I soon found a problem, and I also discovered that the test infrastructure had no clean way to build a test case to demonstrate the issue, so I have started building that. And that's where I'm at: down a couple of commits on a branch, with a few more to come before I think the changes should be merged back into mainline Mojo...
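
For readers who have not built one, a "fakeserver" can be very small. The Python sketch below is a generic illustration of the idea, not the one written for Mojo: it sends pre-canned response bytes in exactly the pieces and order requested, with a pause between them, and uses a local socket pair so no real network or port is needed. All names and the canned responses are made up for the example.

```python
import socket
import threading
import time


def fake_server(conn, pieces, delay=0.0):
    """Send each canned piece verbatim, pausing in between, then close."""
    try:
        for piece in pieces:
            conn.sendall(piece)
            time.sleep(delay)
    finally:
        conn.close()


def run_demo():
    """Drive the fake server over a socket pair and return what a client sees."""
    client_side, server_side = socket.socketpair()
    pieces = [
        b"HTTP/1.1 100 Continue\r\n\r\n",  # unexpected interim response
        b"HTTP/1.1 200 OK\r\nContent-Length: 2\r\n\r\n",
        b"ok",  # body deliberately sent as a separate write
    ]
    worker = threading.Thread(target=fake_server, args=(server_side, pieces, 0.01))
    worker.start()

    received = bytearray()
    while True:
        data = client_side.recv(4096)
        if not data:  # the server closed its end
            break
        received += data

    worker.join()
    client_side.close()
    return bytes(received)


if __name__ == "__main__":
    print(run_demo())
```

In a real test the reading side would be the client under test, and the interesting part is choosing where to split the pieces and how long to pause so that rare timing conditions happen every time.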

Friday, June 5, 2009

Week 2: Ghosts, Deeper Bugs, First Feature...

Another week gone by. This week I learned two important, though obvious, lessons. The first lesson is: if a test case fails, do not touch the test case. The second lesson is: if a test case fails, do not touch the test case. At the beginning of the week I thought there were still problems with requests using chunked transfer-encoding with trailing headers. I was able to create a failing test case and proceeded to launch another (lengthy!) bug-hunt expedition. But by the time I found a fix, I'd modified the test case ever so slightly - where I should have created other, new cases instead. When I reported my "fix" to Sebastian Riedel, the first thing he did was try my test case, and promptly tell me that it was passing on his latest master! He prefaced this news by saying "You're not going to like this..." Indeed. Had I spent hours chasing and fixing a non-existent bug? I spent some time thinking I'd gone mad, before realizing what I'd thought was a trivial change in the test-case actually prevented the bug from being triggered. Once I realized this, Sebastian found a fix that was much more elegant than what I'd proposed. Tough start to the week, but hey, that's how you learn, right? I hope so...

I was able to redeem myself by tracking down and fixing a very difficult transient bug having to do with requests expecting a 100-Continue followed by other pipelined requests. Isolating the problem took two full days, and Sebastian commented over IRC that it doesn't get much more complicated than this, protocol-wise. Anyway, so my tally for the week is:
  • always reach trailing_headers state for chunked messages
    (not my fix, but my catch);
  • problem with duplicate start line on 100-continued request with a
    following pipeline;
  • 5-second delay problem when pipelining two requests, the second
    of which includes an Expect 100-Continue;
  • allow applications to override the default continue handler.
The last of these is my first feature(tte). It has allowed me to start testing what happens client-side when a server doesn't reply with a 100 Continue to a client that requests one. This has in turn led me to start thinking about how I will implement Mojo::UserAgent... I've also discovered some issues with the daemon tests on my version of Ubuntu which I've not been able to hunt down just yet. All in all, a more intense week than the first...

Friday, May 29, 2009

Week 1: The Thrill Of The Hunt

We're already at the end of the first week - hard to believe! Well for what it's worth, I think it's been a good week. I've spent a lot less time writing test cases than I thought I would, and that's because I spent a lot more time actually chasing real bugs in the Mojo codebase.

Some people don't like to debug code. In fact, I'd venture that most coders wouldn't rank that activity too highly. But I actually really like it! And I think I'm pretty good at it. I enjoy spelunking through code trying to figure out what makes it tick and also trying to throw odd things at a piece of code to see how it reacts. This week, I committed the following fixes to Mojo:
  • problem with 5-second delay on multiple request pipeline;
  • support HEAD requests on server-side;
  • chunked request parsing needs to add a Content-Length header;
  • problem on chunked request with trailing headers.
Discussions about the HTTP spec with Sebastian Riedel over IRC also prompted him to make a few commits having to do with the Transfer-Encoding header and to perform some clean-up in Mojo::Transaction and Mojo::Pipeline. See my Github network graph for all the gory details if you want.
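
Two of the fixes listed above concern chunked transfer-encoding, so here is a compact Python sketch of what decoding a chunked body with trailing headers involves. It is a simplified illustration rather than Mojo's parser: it assumes the whole message body is already in memory, does no error handling, and the function name is made up. It also computes a Content-Length for the decoded body, in the spirit of the third fix in the list.

```python
def decode_chunked(raw):
    """Decode a complete chunked body.

    Returns (body bytes, trailer headers, derived headers).
    """
    body = bytearray()
    pos = 0

    # Chunks: "<hex size>[;extensions]\r\n<data>\r\n", ended by a 0-size chunk.
    while True:
        line_end = raw.index(b"\r\n", pos)
        size_field = raw[pos:line_end].split(b";", 1)[0]  # drop extensions
        size = int(size_field.strip(), 16)
        pos = line_end + 2
        if size == 0:
            break
        body += raw[pos:pos + size]
        pos += size + 2  # skip the data and the CRLF that follows it

    # Trailing headers: zero or more "Name: value" lines, then an empty line.
    trailers = {}
    while True:
        line_end = raw.index(b"\r\n", pos)
        line = raw[pos:line_end]
        pos = line_end + 2
        if not line:
            break
        name, _, value = line.partition(b":")
        trailers[name.strip().decode("ascii")] = value.strip().decode("ascii")

    # Once decoded, the body length is known and can be exposed as a header.
    derived = {"Content-Length": str(len(body))}
    return bytes(body), trailers, derived
```

A streaming parser has the harder job of doing the same thing incrementally, which is where state handling (such as always reaching the trailing-headers state, even when there are no trailers) comes in.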

One thing I've found is that, of course, since I am debugging a codebase I am just getting familiar with, I am much slower than someone like Sebastian or my mentor Viacheslav Tikhanovskii. What this implies is that when I find a bug, I'm better off not telling them about it over IRC right away, because then they pounce all over it and I don't get the learning experience that chasing it down would have given me... :) I'm still learning my way around git, github and the Perl debugger, but all in all things are coming along nicely. Anyway, I feel I'm already having a positive impact on the project, if I may say so myself. And I'm having fun!

Saturday, May 23, 2009

My Project: HTTP/1.1 Compliance Testing and User-Agent Development for the Mojo Web Framework

[Note: The following is a slightly abbreviated version of the proposal I submitted.]

Abstract


When building Web applications, developers expect the tools they use to be fully interoperable through clean protocol implementations. This project aims to ensure full HTTP/1.1 compliance in the Mojo Web Framework, through both whitebox and blackbox testing. Time permitting, Mojo’s client code will also be exercised through the development of a smart User-Agent similar to LWP’s.

Benefits to the Perl/Open Source Community


Ensuring protocol compliance through extensive testing will help speed the adoption of Mojo, the next generation Web framework, by boosting developer confidence in the code’s correctness and robustness. Frameworks like Mojo [1] and Catalyst [2] bring cutting edge Web development to Perl, and enrich the palette of tools that the Open Source movement offers to Web developers. TIMTOWTDI, and this project will strengthen the position of Mojo, Perl and the Open Source movement in the world of Web development.

Deliverables

  1. Whitebox tests using Test::More [3] and integrating into Mojo’s current testing framework. Most tests will be concentrated on the Mojo::Message class and focus on HTTP/1.1 [4] content parsing, with emphasis on edge cases.
  2. A blackbox test suite to run against Mojo’s built-in server using an appropriate HTTP/1.1 client library (most likely libcurl [5]). Tests to include some cases of adversarial stance (e.g. deliberately malformed requests).
  3. If necessary, patches to Mojo that enable it to pass the test suites.
  4. If time permits, a smart User-Agent class (MojoX::UserAgent) similar to LWP’s [6] and exercising all features of HTTP/1.1 supported by the Mojo::Client class.
  5. If time permits, a port of the test suite implemented in "2." (above) using the new MojoX::UserAgent.

Project Details


Description & Goals
When it comes to protocol implementations, the devil is in the details, and edge cases can trigger bugs that are very difficult to trace for end-users. Therefore, thorough test coverage is paramount to instill user confidence. Weighing in at 176 pages, RFC2616 (the HTTP/1.1 specification document) defines a fairly complex protocol. As a result, many implementations are limited to a small subset of the functionality defined in the specification. However, the Mojo Web development framework explicitly aims to provide a “[f]ull stack HTTP 1.1 client/server implementation” [7].

This project’s primary goal is thus to increase the test coverage of Mojo against the HTTP/1.1 specification, through both whitebox and blackbox testing. A secondary goal is to leverage the advanced protocol features implemented by Mojo to develop a compelling User-Agent class that may (hopefully) outperform other similar Perl solutions.

Preliminary Investigations

Preliminary investigations for this project were conducted in the week of March 23-30th. It was found that Mojo already implements a nice whitebox testing framework (using Test::More), but that the coverage of the test cases needs to be improved. I have already submitted a small patch and test case to implement an HTTP/1.1 "SHOULD" requirement. Preliminary blackbox testing has uncovered a major issue with request pipelining that has led Sebastian Riedel to embark on a significant refactoring of Mojo’s HTTP handling core.

Risks & Risk Mitigation

This project is dependent on the refactoring of some of Mojo’s core code, which has been undertaken by Sebastian Riedel in order to implement the main feature found to be broken in preliminary investigations (see above). This refactoring both makes thorough testing more important and introduces the slight risk that refactoring might still be ongoing when the project starts. Also, since it has been a long time since I worked as a software developer (see Bio section), it is difficult for me to reliably make scheduling estimates.

In order to mitigate these risks, I have planned a fairly flexible schedule, with a focus on the more pressing needs first (i.e. testing) with new and exciting – but less central – features coming second. It is my hope that I will continue working on MojoX::UserAgent long after the GSoC 2009 is over. Even if that part of the work were to get pushed back, the project as a whole should still make a significant contribution to Mojo’s robustness and correctness.

Project Schedule

  • Preparation Stage (before May 23rd)
    Refresh my memory on RFC2616 (HTTP/1.1 spec); start playing with Mojo, git & the Perl debugger; hang on Mojo’s IRC channel and read its mailing list.
  • May 23rd - June 6th (two weeks)
    Write whitebox tests against Mojo::Message for proper request and response parsing. Leverage test framework in message.t using Test::More.
  • June 6th
    Submit code changes through Github.
  • June 7th - June 27th (three weeks)
    Find and evaluate suitability of HTTP/1.1 client libraries (LWP, libcurl, ??) for writing a blackbox test suite exercising all protocol-level features of Mojo. Implement blackbox test suite.
  • June 28th - July 6th (9 days)
    Patch Mojo code so that it passes all tests.
  • July 6th
    Submit code changes through Github.
    Midterm evaluation checkpoint.
  • July 6th - 27th (3 weeks)
    Start implementing MojoX::UserAgent class. This class will, at least initially, only support the HTTP protocol, but will include such features as persistent connection management, transparent redirection handling, request pipelining and session cookie support.
  • July 27th - August 10th (2 weeks)
    Port the blackbox test suite to MojoX::UserAgent.
  • August 10th
    'Pencils down' date. Submit code into Github.

Bio

My story likely diverges significantly from the average GSoC applicant profile, but then TIMTOWTDI, right? After working in (closed source) software development for over eight years, I opted several years ago to reorient myself towards... anthropology! I have completed a master’s degree in that field, and embarked on a PhD, but I am now getting the urge to code again.

I have always been a FOSS advocate and a Perl fan, so TPF seems like a natural fit for me to put my skills where my values are, and start contributing to Open Source. I am especially well-suited to work on HTTP/1.1 testing for Mojo, since one of my most vividly remembered accomplishments from my previous life as a software developer involved upgrading a Web security application based on an HTTP proxy model to support HTTP/1.1.

References

[1] http://mojolicious.org/
[2] http://www.catalystframework.org/
[3] http://search.cpan.org/~mschwern/Test-Simple/lib/Test/More.pm
[4] http://www.ietf.org/rfc/rfc2616.txt
[5] http://curl.haxx.se/
[6] http://search.cpan.org/~gaas/libwww-perl-5.825/lib/LWP.pm
[7] http://mojolicious.org/