John Fremlin's blog: How to optimise Common Lisp under SBCL: profiling (draft)

Posted 2012-06-03 00:48:00 GMT

Profiling — Your program is working but the performance does not match your back-of-the-envelope calculations. Unfortunately, the performance of a Lisp program is not as easy to estimate from the source code you write as that of a program in a more machine-orientated language like C. Start with a good benchmark that represents the workload you want to improve.

SBCL comes with an instrumentation profiler (sb-profile), which even measures its overhead quite accurately. Use it to profile all the functions in the packages you're interested in.

In my opinion, the most useful thing the (instrumentation) profile tells you is the number of times a function is called. This helps you check your understanding of how your algorithm is applying to the data in your benchmark.

If you've no idea which packages you're interested in, then use the timer interval profiler (statistical profiler) called sb-sprof to give you a rough idea, then run the sb-profile instrumentation profiler on the functions you identify. In general, don't start working from the timer interval profile. The sampling interval is 10ms by default, so you may need to run your benchmarks for a few minutes before the results become statistically significant.
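To give an idea, a minimal sb-sprof run might look like this; run-benchmark here is a hypothetical stand-in for your own workload:

```lisp
;; The statistical profiler ships with SBCL but must be loaded.
(require :sb-sprof)

;; RUN-BENCHMARK is hypothetical; substitute your own workload.
;; Samples are taken on a timer tick and attributed to whatever code
;; happens to be running, so run long enough to accumulate a
;; statistically meaningful number of samples.
(sb-sprof:with-profiling (:max-samples 50000 :report :flat :reset t)
  (run-benchmark))
```

The :flat report lists functions by the number of samples that landed in them, which is usually enough to pick out the packages worth handing to sb-profile:profile.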

CL-USER> (sb-profile:profile "TPD2.LIB" "TPD2.IO" "TPD2.HTTP" "TPD2.WEBAPP" "TPD2.USER")
Then run the function you want to profile under (time ...) so that you can check everything adds up.
CL-USER> (time (tpd2:event-loop))
Feel free to interrupt the function whenever you like (with C-c C-c in SLIME, for example).
Evaluation took:
  36.055 seconds of real time
  18.970000 seconds of total run time (12.810000 user, 6.160000 system)
  [ Run times consist of 1.620 seconds GC time, and 17.350 seconds non-GC time. ]
  52.61% CPU
  3 forms interpreted
  45,963,318,375 processor cycles
  1,318,048,816 bytes consed

before it was aborted by a non-local transfer of control.

; Evaluation aborted.
CL-USER> (sb-profile:report)
measuring PROFILE overhead..done
  seconds  |     consed    |   calls   |  sec/call  |  name  
---------------------------------------------------------------
     2.062 |   303,620,368 |   100,000 |   0.000021 | TEEPEEDEE2.HTTP::DISPATCHER-RESPOND
     1.936 |       132,400 |   100,000 |   0.000019 | TEEPEEDEE2.IO::SOCKET-WRITEV
     1.370 |       190,832 |   100,000 |   0.000014 | TEEPEEDEE2.IO::SOCKET-CLOSE
     0.452 |     8,242,144 |   200,000 |   0.000002 | TEEPEEDEE2.HTTP::URL-ENCODING-DECODE
     0.346 |        75,072 |   100,087 |   0.000003 | TEEPEEDEE2.IO::SOCKET-READ
     0.156 |         2,176 |   100,000 |   0.000002 | TEEPEEDEE2.IO::SET-FD-NONBLOCK
     0.154 |    24,993,440 |   100,167 |   0.000002 | TEEPEEDEE2.IO::SOCKET-ACCEPT
     0.136 |    83,715,536 |   100,000 |   0.000001 | TEEPEEDEE2.IO::MAKE-RECVBUF
     0.136 |        44,864 |   100,000 |   0.000001 | TEEPEEDEE2.IO::SOCKET-SUPPORTS-WRITEV
     0.130 |        23,440 |   100,000 |   0.000001 | TEEPEEDEE2.HTTP::FIND-DISPATCHER
     0.126 |           192 |   100,000 |   0.000001 | TEEPEEDEE2.IO::DEREGISTER-FD
     0.126 |     2,678,960 |     6,325 |   0.000020 | TEEPEEDEE2.IO::SYSCALL-RETRY-EPOLL_WAIT
     0.114 |    77,051,136 |   100,000 |   0.000001 | TEEPEEDEE2.IO::CON-INIT
     0.100 |    16,466,608 |   100,000 |   0.000001 | TEEPEEDEE2.IO::%MAKE-CON
     0.036 |         4,176 |   100,000 |  0.0000004 | TEEPEEDEE2.LIB::TIMEOUT-POSITION
     0.022 |            16 |   106,402 |  0.0000002 | TEEPEEDEE2.IO::CON-SET-CALLBACK
     0.016 |             0 |   100,000 |  0.0000002 | TEEPEEDEE2.IO:CON-DEFAULT-TIMEOUT-FUNCTION
     0.016 |             0 |   100,000 |  0.0000002 | TEEPEEDEE2.IO::RECVBUF-PREPARE-READ
     0.016 |         5,104 |   100,002 |  0.0000002 | TEEPEEDEE2.IO:CON-PEER-INFO
     0.010 |        65,456 |   100,000 |  0.0000001 | TEEPEEDEE2.LIB::TIMEOUT-MERGE
     0.010 |             0 |        21 |   0.000476 | TEEPEEDEE2.LIB:NEXT-TIMEOUT
     0.009 |             0 |     6,402 |   0.000001 | TEEPEEDEE2.IO::REGISTER-FD
     0.006 |             0 |     6,402 |   0.000001 | TEEPEEDEE2.IO::EPOLL-CTL
     0.004 |        62,112 |   100,000 |  0.0000000 | TEEPEEDEE2.IO:HANGUP
     0.004 |         5,920 |   400,000 |  0.0000000 | TEEPEEDEE2.LIB:BYTE-VECTOR=-FOLD-ASCII-CASE
     0.002 |             0 |     6,402 |  0.0000004 | TEEPEEDEE2.IO::SOCKET-REGISTER
     0.000 |        23,248 |   100,000 |   0.000000 | TEEPEEDEE2.LIB:MAKE-TIMEOUT
...
---------------------------------------------------------------
     7.495 | 1,318,044,720 | 3,570,437 |            | Total

estimated total profiling overhead: 6.07 seconds
overhead estimation parameters: 2.e-8s/call, 1.7e-6s total profiling, 6.4e-7s internal profiling

The first check is that the numbers of calls to the hotspot functions are as you expect. You can probably figure out the expected number of calls quite easily by thinking about your algorithm. If they don't match your expectations, then you are probably doing the same work multiple times, and that's easy to avoid.

Now, optimise the hotspot functions. When you have given up on improving the logic, declare them inline, try to make local functions and variables dynamic extent, and follow the optimisation notes emitted by SBCL. Use the disassemble function to check for unnecessary boxing. After each change, check that you actually made a difference, and if not, revert it.
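As a sketch of the kind of declarations involved (the function names here are made up for illustration):

```lisp
;; Request inlining of a small hot helper at its call sites.
(declaim (inline square))
(defun square (x)
  (declare (type fixnum x))
  (* x x))

(defun hot-loop (n)
  (declare (type fixnum n) (optimize speed))
  ;; A temporary buffer that never escapes the function can be
  ;; stack-allocated with DYNAMIC-EXTENT, avoiding heap consing.
  (let ((buffer (make-array 16 :element-type 'fixnum :initial-element 0)))
    (declare (dynamic-extent buffer))
    (dotimes (i n)
      (incf (aref buffer (mod i 16)) (square (mod i 1000))))
    (reduce #'+ buffer)))

;; Inspect the compiled code for unnecessary boxing or full calls.
(disassemble 'hot-loop)
```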

Generally, compiling with (proclaim '(optimize speed)) won't make much difference (maybe a couple of percent), other than to spew out the handy optimisation notes.

One function sticks out in this profile as being quite inefficient given the small amount of work it needs to do: TEEPEEDEE2.HTTP::URL-ENCODING-DECODE. In fact, for this benchmark it needn't do any work, as there are no + or % characters in the URL passed to tpd2 from apachebench. However, it unnecessarily allocates some space for storing the replacement at the start, so I modified cl-irregsexp not to do that.

The TEEPEEDEE2.HTTP::URL-ENCODING-DECODE hotspot has gone. Now the profile looks like this:

  seconds  |     consed    |   calls   |  sec/call  |  name  
---------------------------------------------------------------
     1.762 |   202,819,584 |   100,000 |   0.000018 | TEEPEEDEE2.HTTP::DISPATCHER-RESPOND
     1.606 |        96,000 |   100,000 |   0.000016 | TEEPEEDEE2.IO::SOCKET-WRITEV
     1.150 |    12,653,920 |   100,000 |   0.000012 | TEEPEEDEE2.IO::SOCKET-CLOSE
     0.574 |   204,096,352 |   100,000 |   0.000006 | TEEPEEDEE2.HTTP::HTTP-SERVE-PROCESS-HEADERS
     0.484 |    63,509,248 |   100,088 |   0.000005 | TEEPEEDEE2.IO::SOCKET-ACCEPT
     0.376 |        25,904 |   100,212 |   0.000004 | TEEPEEDEE2.IO::SOCKET-READ
     0.132 |        10,000 |   200,000 |   0.000001 | TEEPEEDEE2.HTTP::URL-ENCODING-DECODE
...
Before this change, the function TEEPEEDEE2.HTTP::HTTP-SERVE-PROCESS-HEADERS registered no time at all on the profile, and was consing half as much:
     0.000 |   119,114,816 |   100,000 |   0.000000 | TEEPEEDEE2.HTTP::HTTP-SERVE-PROCESS-HEADERS

Perhaps the extra time is due to the garbage collector? I patched the profile instrumentation to show the garbage collection time spent in each function.

However, the GC time does not explain the problem, which in any case disappeared on this run.

TPD2> (sb-profile:report)
  seconds  |     gc     |     consed    |   calls   |  sec/call  |  name  
------------------------------------------------------------------------------
     2.064 |      0.000 |        78,864 |   100,007 |   0.000021 | TEEPEEDEE2.IO::SOCKET-WRITEV
     1.954 |      0.130 |   395,228,336 |   100,007 |   0.000020 | TEEPEEDEE2.HTTP::DISPATCHER-RESPOND
     1.564 |      0.000 |        52,272 |   100,007 |   0.000016 | TEEPEEDEE2.IO::SOCKET-CLOSE
     0.534 |      0.290 |   251,960,960 |   100,007 |   0.000005 | TEEPEEDEE2.IO::MAKE-RECVBUF
     0.292 |      0.000 |    30,088,016 |   102,959 |   0.000003 | TEEPEEDEE2.IO::SOCKET-ACCEPT
     0.274 |      0.000 |        13,184 |   100,013 |   0.000003 | TEEPEEDEE2.IO::SOCKET-READ
     0.155 |      0.000 |     3,377,408 |     7,369 |   0.000021 | TEEPEEDEE2.IO::SYSCALL-RETRY-EPOLL_WAIT
...
---------------------------------------------------------------
     7.642 |      0.900 | 1,573,041,632 | 3,584,339 |            | Total

Looking at the instrumentation code revealed that the time used is measured in milliseconds. As all the functions I am using definitely take less than 0.1ms each (the request rate handled is more than 10k/s), the times given by the profiler may not be very accurate: they only capture the functions that happen to be running when a 1ms tick fires, and due to the periodicity of the work, some functions may unfairly catch many ticks while others avoid them altogether. The mystery of the varying consing is not explained, however.

The profiler suggests that the dispatcher-respond (i.e. the anonymous function handling the page itself) takes a lot of time. This is easy to test by running the anonymous function repeatedly under time, which brings up some surprises.
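For example, something along these lines; dispatcher-fn is a hypothetical variable standing in for the anonymous page-handling function:

```lisp
;; Calling the page handler in a tight loop under TIME isolates its
;; cost from the surrounding I/O and event-loop machinery.
;; DISPATCHER-FN is hypothetical; bind it to the real handler.
(time
 (loop repeat 100000
       do (funcall dispatcher-fn)))
```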

To be continued . . .
