John Fremlin's blog: How to optimise Common Lisp under SBCL: planning

Posted 2009-12-21 11:43:00 GMT

Planning the architecture — Your typical software project fulfils many functions. Before starting to optimise, you have to decide which functions you wish to make faster. For example, in tpd2 you can define new URL (slowly) and request it over HTTP (quickly). Once you have decided on a set of performance requirements, you can consider whether it is possible to achieve it with your system. This means you need to figure out a realistic numeric value for the performance requirements. Saying fast as possible probably means that you have no way of quantifying the benefit of faster performance and so are wasting your energies pointlessly optimising something that doesn't need to go fast.

Hopefully, your existing design already meets your performance requirements. In that case, fantastic. You don't need anything more. Implement it knowing you needn't worry about performance.

For an example of a sensible requirement that can invalidate a design: you want to serve 10k requests/s, and it takes around 300us to start a new thread. Then you cannot have one new thread for each request. There is no point starting out with a design that is based around one thread for each request.

You have to estimate the minimum amount of actual work your approach will require. This requires common sense or experience. Is the bottleneck resource going to be network bandwidth? Disk seeks? Division operations on the CPU? JPEG image decodes? Is there enough RAM?

A very common source of unnecessary work is in changing from one data representation to another at module boundaries (for example, YUV to RGB image conversions). Often it can dominate the actual, useful work. When I was hacking on embedded video codecs (H.264 etc.), simply copying and converting took a large slice of processing. The same with Unicode conversion in early versions of tpd2.

If you are unsure about what is important and want to test your theories, then run a few mock-up experiments with made up data. The time macro is very helpful for this. For example,

(let ((s (make-string 1000000 :initial-element #\x)))
     (time (sb-ext:octets-to-string (sb-ext:string-to-octets s
     	    	       :external-format :utf-8) :external-format
Evaluation took:
  0.146 seconds of real time
  0.150000 seconds of total run time (0.150000 user, 0.000000 system)
  [ Run times consist of 0.040 seconds GC time, and 0.110 seconds non-GC time. ]
  102.74% CPU
  260,730,414 processor cycles
  21,773,968 bytes consed

So it takes approximately 260 cycles per character to convert from Unicode to UTF-8 and back (including GCing the resulting mess). Suppose each request involves 1kB of data; then we have to process 10MB/s of data. If we convert to UTF-8 and back with SBCL's slow string conversion, we will use 2.6 GHz just doing that. So we cannot afford to do it, and we can figure that out before barrelling ahead down an impossible path.

PS. If you want to run more automated experiments, you can get cycle counts with this macro (be careful to repeat tests until you are confident that the results are statistically significant).

(defmacro with-returning-cycles (&body body)
  (alexandria:with-unique-names (h0 l0 h1 l1)
   `(multiple-value-bind (,h0 ,l0) 
      (locally ,@body)
      (multiple-value-bind (,h1 ,l1)
	(sb-impl::elapsed-cycles ,h0 ,l0 ,h1 ,l1)))))

Post a comment