John Fremlin's blog: teepeedee2 achieves 10k requests/second

Posted 2009-08-09 03:50:00 GMT

The humiliation of having teepeedee2 play second fiddle to C implementations was weighing heavily on my mind, so today I spent a few hours squeezing a bit more fat out of the HTTP processing.

One of the major motivating factors for making tpd2 was the idea from the C10k website that it should be possible to get much better performance out of a webserver than is currently normal. The 10k goal looked very far away at the beginning of the project, and many people said it was impossible from Lisp, which after all is a very dynamic language.

[Figure: Request throughput of a few webservers (SVG image).]

Yes, tpd2 has broken the 10k requests/s barrier on one core.

This is a big moment for me psychologically (and a testament to the excellent work done by the SBCL hackers on their Lisp implementation).

The significance is that teepeedee2 presents a new level of speed for dynamic websites. The processing of GET parameters, building up of dynamic HTML and so on take less than 0.1ms — on my laptop, probably even less on a modern server CPU. Additionally, because of its scalable timeouts and use of epoll, teepeedee2 can handle many AJAX polling clients extremely efficiently. This opens up a world of opportunity for interactive web applications that simply can't be implemented on traditional platforms.

The two competitive (but slower) web application frameworks, ULib and kloned, are based on custom template languages that can embed arbitrary C++ code.

The biggest obstacle was that the automatic code transforms from cl-cont mean that simply using local functions (i.e. flets and labels) causes memory to be allocated at runtime (inefficient funcallable/cc objects are created). Therefore I fiddled with the HTTP parsing to do more inside a without-call/cc. The result was a huge match-bind for cl-irregsexp.

Given that a program has a (correct) performance-oriented design, it's generally not very useful to look at profiling data, except to locate performance bugs where the implementation does not meet the design (e.g. this issue with cl-cont) or to do micro-optimizations. I had mostly concentrated on getting a good architectural design for teepeedee2, and hadn't done much micro-optimization based on profiles till now. Based on the profile output from sb-profile, I inlined a few timeout-related functions and a few miscellaneous functions, and rewrote the IP-address-to-string routine (together these changes boosted performance by about 10%).
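
The profile-then-optimize loop described above might look roughly like the following in SBCL. This is a minimal sketch, not teepeedee2's actual code: the names put-octet and ipv4-to-string and the whole implementation are hypothetical stand-ins for the real routine. The only real API used is SBCL's built-in deterministic profiler (sb-profile:profile, sb-profile:report, sb-profile:unprofile).

```lisp
;; Hypothetical sketch; PUT-OCTET and IPV4-TO-STRING are illustrative names,
;; not functions from teepeedee2.

;; The INLINE proclamation has to come before the DEFUN so that the compiler
;; keeps the definition around for inlining into later callers.
(declaim (inline put-octet))
(defun put-octet (n out pos)
  "Write the decimal digits of octet N into OUT starting at POS.
Returns the next free index in OUT."
  (declare (type (unsigned-byte 8) n)
           (type simple-string out)
           (type fixnum pos))
  (when (>= n 100)
    (setf (schar out pos) (digit-char (floor n 100)))
    (incf pos))
  (when (>= n 10)
    (setf (schar out pos) (digit-char (mod (floor n 10) 10)))
    (incf pos))
  (setf (schar out pos) (digit-char (mod n 10)))
  (1+ pos))

(defun ipv4-to-string (octets)
  "Render a vector of four octets as a dotted-quad string without using FORMAT."
  (let ((out (make-string 15))          ; "255.255.255.255" is the longest case
        (pos 0))
    (dotimes (i 4)
      (setf pos (put-octet (aref octets i) out pos))
      (when (< i 3)
        (setf (schar out pos) #\.)
        (incf pos)))
    (subseq out 0 pos)))

;; Profile the outer function only. Calls to an inlined function are compiled
;; away at the call site, so they would not be counted by the profiler.
#+sbcl
(progn
  (sb-profile:profile ipv4-to-string)
  (let ((addr (make-array 4 :element-type '(unsigned-byte 8)
                            :initial-contents '(127 0 0 1))))
    (dotimes (i 100000)
      (ipv4-to-string addr)))
  (sb-profile:report)                   ; prints per-function time and consing
  (sb-profile:unprofile ipv4-to-string))
```

One design point worth noting: because inlined functions vanish from the profile, it seems sensible to profile first and only then decide what to inline, which matches the order described above.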

The result is this:

$ schedtool -a 0 -e ab -n 100000 -c10 http://localhost:3000/test?name=John
This is ApacheBench, Version 2.3 <$Revision: 655654 $>
Copyright 1996 Adam Twiss, Zeus Technology Ltd,
Licensed to The Apache Software Foundation,
Benchmarking localhost (be patient)
Completed 10000 requests
Completed 20000 requests
Completed 30000 requests
Completed 40000 requests
Completed 50000 requests
Completed 60000 requests
Completed 70000 requests
Completed 80000 requests
Completed 90000 requests
Completed 100000 requests
Finished 100000 requests
Server Software:        
Server Hostname:        localhost
Server Port:            3000
Document Path:          /test?name=John
Document Length:        19 bytes
Concurrency Level:      10
Time taken for tests:   8.839 seconds
Complete requests:      100000
Failed requests:        0
Write errors:           0
Total transferred:      5800000 bytes
HTML transferred:       1900000 bytes
Requests per second:    11313.29 [#/sec] (mean)
Time per request:       0.884 [ms] (mean)
Time per request:       0.088 [ms] (mean, across all concurrent requests)
Transfer rate:          640.79 [Kbytes/sec] received
Connection Times (ms)
              min  mean[+/-sd] median   max
Connect:        0    0   0.1      0       8
Processing:     0    1   0.5      1      39
Waiting:        0    1   0.5      1      39
Total:          0    1   0.5      1      39
Percentage of the requests served within a certain time (ms)
  50%      1
  66%      1
  75%      1
  80%      1
  90%      1
  95%      1
  98%      1
  99%      2
 100%     39 (longest request)
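
As a sanity check, the summary lines in the ab output above can be re-derived from the raw counts it prints. The sketch below does that arithmetic in Common Lisp. All variable and function names are illustrative, and because ab times the run with more precision than the printed 8.839 seconds, the results should agree with ab's only to within rounding.

```lisp
;; Re-derive ab's summary lines from the raw counts it printed.
;; Names are illustrative; they are not part of ab or teepeedee2.
(defparameter *requests* 100000)        ; "Complete requests"
(defparameter *seconds* 8.839d0)        ; "Time taken for tests" (rounded as printed)
(defparameter *concurrency* 10)         ; "Concurrency Level"
(defparameter *total-bytes* 5800000)    ; "Total transferred"
(defparameter *body-bytes* 1900000)     ; "HTML transferred"

(defun requests-per-second ()
  (/ *requests* *seconds*))

(defun ms-per-request ()
  "Mean time one client waits for one request, in milliseconds."
  (/ (* *concurrency* *seconds* 1000) *requests*))

(defun ms-per-request-overall ()
  "Mean wall-clock time per request across all concurrent clients, in milliseconds."
  (/ (* *seconds* 1000) *requests*))

(defun kbytes-per-second ()
  (/ *total-bytes* 1024 *seconds*))

(defun header-bytes-per-response ()
  "Bytes per response that are not document body."
  (/ (- *total-bytes* *body-bytes*) *requests*))

(format t "requests/s:                ~,2f~%" (requests-per-second))
(format t "ms/request (per client):   ~,3f~%" (ms-per-request))
(format t "ms/request (overall):      ~,3f~%" (ms-per-request-overall))
(format t "Kbytes/s:                  ~,2f~%" (kbytes-per-second))
(format t "header bytes per response: ~d~%" (header-bytes-per-response))
```

The overall time per request is simply the per-client time divided by the concurrency level of 10. Each response is 58 bytes in total, 19 of body and 39 of headers, so this benchmark mostly measures per-request overhead rather than bandwidth.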

I started tpd2 like this:

schedtool -a 1 -e sbcl --load bench.lisp

where bench.lisp was:

(handler-bind ((error (lambda(c) (declare (ignore c)) (invoke-restart 'CONTINUE))))
  (asdf:oos 'asdf:load-op 'teepeedee2))
(in-package #:tpd2.user)
(defsite *bench*)
(with-site (*bench*)
 (defpage "/test" (name) :create-frame nil
   (<h1 "Hello " name)))
(http-start-server 3000)

The hardware is my aging Panasonic Y7 laptop — an Intel(R) Core(TM)2 Duo CPU L7700 @ 1.80GHz, running Linux 2.6.31-5-generic #24-Ubuntu, and SBCL

The (now) fastest is an awesome framework called ULib by Stefano Casazza. It is in C++, uses select for portability(!) to MS Windows, and of course compiles dynamic pages to machine code. It once scored 11169.22/s, which is just a smidgeon less than teepeedee2, but normally scores much less (about 9k/s), so teepeedee2 is the fastest in my book. However, I hope to be able to blog more about ULib because it's quite interesting and maybe Stefano will be able to improve it to topple teepeedee2 from the top spot.

I guess this means mission complete for teepeedee2. The external APIs need to be designed and documented if anybody wants to use it, and I would be delighted to accept patches.

UPDATE 20090819 — Kloned and Ulib do not use limited scripting languages. They can embed arbitrary C++. (Thanks to Stefano Barbato.)

UPDATE 20091028 — Added nginx's perl mode.

UPDATE 20091231 — Note that ULib is now the winner — dammit! :-(

Since you have macros to do your own object system, have you tried changing the macros to use CLOS instead and measure the performance difference? I think this would be extremely interesting to know.

Posted 2009-08-14 08:08:04 GMT by Anonymous

I do use CLOS. The my-defun macro, my, etc. are just syntactic sugar. Many classes are defined with defstruct rather than defclass, but this is still within CLOS. The low-level socket functions are CLOS generic functions, for example.

Posted 2009-08-16 10:21:36 GMT by John Fremlin

c10k is about 10k concurrent connections, not about 10k requests per second.

Posted 2009-10-12 08:08:52 GMT by Anonymous

C10K is about 10k concurrent clients. I agree that it would be better to have a benchmark with more concurrent connections, as that would be closer to the real world where many clients use persistent connections.

I hope that this would show tpd2 as faster than ULib, as everything in tpd2 is supposed to be independent of the number of connections, whereas ULib uses select, which is not. However, Stefano has bested me repeatedly and the only way to check is to actually run a benchmark.

I am working on getting a better benchmark together as I don't like apachebench for this.

Posted 2010-02-13 20:34:10 GMT by John Fremlin
