Posted 2013-05-06 07:50:42 GMT
1 watching live
In 2001, before I started university, I interned at a company making radio controlled heating valves: why not use code review I asked? Palpably, the quality of technical decisions in open source software like the Linux kernel was much better for discussion around direction — sometimes descending into frankly ad hominem insults but resulting at least in some degree of consideration of alternatives. On the other hand, who wants a layer of bureaucracy? And so we opted not to.
Since coming to Facebook, where code reviews are strongly encouraged and almost enforced, I've done more review than coding — about three to one — which is personally a little frustrating as writing code is more fun. But one reason I do so many reviews is that it is not always easy to get changes in: there are large swathes of the code-base, lying unmaintained, where proposed changes can go unreviewed forever and finding someone who is able to spend the time to consider the ramifications of a modification is often tricky.
What are the duties of a reviewer? There is a school of thought which suggests that these to not extend to verifying the software for correctness. I would disagree — with the exception that if the description of how the change is tested is an outright fabrication, then the reviewer is responsible for independently assessing the correctness of both the implementation and the assumptions underlying it, including a duty to insist on a proper plan for empirically observing the behaviour of the program. Beyond that, the reviewer should consider the consequences in terms of the wider ecosystem of the change (does it increase load on another system or impose technical debt in terms of fragility to subsequent changes), and should consider alternative approaches. The issue of coding style, especially superficial formatting, should not be the main focus of discussion.
The duties of the coder, the reviewee, comprise foremost a duty to ensure a proper review, which means submitting comprehensible (and therefore small) patches to a reviewer who is capable of understanding their consequences — and sometimes this means insisting on additional consideration of some subtlety that the author may have missed.
The question of how strongly opinions should be expressed in the discussion of a patch is largely a personal preference and in some open source communities vitriolic and scathing remarks are not uncommon (Linus Torvalds being infamous for this). My personal opinion is that the delivery of the message is less important than the content, and the reasoning behind it, which should be made clear. And if the reviewer expresses concerns, the onus is on the reviewee, as supplicant, to placate those, or alternatively to find another more convivial reviewer, rather than to try to bully a change through the process. However, civility and a lighthearted sense of humour are most pleasant to work with!
Sadly, in moments of highest pressure the review process is most circumvented: when the change is very large or even beyond a few hundred lines it is most time-consuming to review, so it becomes tempting to skip the process: but this is exactly when consideration of alternatives can have the greatest benefit. Similarly, when there is a very proximate deadline of some sort it is tempting to short-circuit the review, but exactly then are bugs and wrong decisions most damaging, as there is by definition little time to observe and correct them. Reviews here are most essential and I feel that an additional process requirement of a third pair of eyes might actually be beneficial.
At the end of the day, it's almost certainly easier and quicker to rewrite some code than debug it years later. Good code review means better code, better mutual understanding, better systems and therefore better morale.
Posted 2013-05-05 23:00:00 GMT
1 watching live
The C++ library and toolset liblinear is awesome for sparse large (20M row+) logistic regression — using past data to predict probabilities of an occurrence.
Unfortunately, it has a few gotchas that can catch you out though when using the train and predict functionality.
— interacting features must be done before passing to the package, and text feature labels have to be turned into packed feature indices.
— features indices are labeled starting from 1 not 0 (the first feature has index 1). If using the C++ interface, to indicate the end of features for a row use a feature_node with index = -1.
— only solver mode 0 (L2 regularisation) and solver mode 6 (L2 regularisation) are for logistic regression, the others are for SVM.
— to benefit from regularisation, scale features appropriately (e.g. divide by standard deviation) or else features that have a wide range of values will be penalised.
— the C parameter controlling the degree of regularisation decreases regularization the larger it becomes. To get more regularization make it smaller (e.g. 0.001). To get sparse feature selection, use solver 6 (L1 regularisation penalty) with small C.
This is a great package. Thanks to Dean for much advice, and many thanks to the authors of it at the Machine Learning and Data Mining Group at NTU in Taiwan!
Post a comment
Posted 2013-01-21 22:30:44 GMT
1 watching live
On the way to yesterday's Bay Area Lisp meet-up, which was fascinating and had great talks by many speakers and a very generous giveaway of memorabilia by Paul McJones, I made a little game of balls bouncing around — when they collide the ball with the greatest mojo wins. Thanks to a couple of suggestions from Ron it turned into something quite fun.
The collisions of the balls were calculated by stepping the motion for one frame and then checking for overlap; clearly this is manifestly unfair in a plethora of circumstances. On the way back to San Francisco, I set about improving the game by solving the quadratic equation for the moment of collision of two balls moving with constant velocities (clearly it is in general quadratic as it can be formulated as a polynomial in time and there are two solution: first intersection when the balls start overlapping and final intersection when they eventually pass through each other).
I wished to return the solutions from the following expression
(solve-for time
(- (^2 (+ (ball-r a) (ball-r b)))
(+ (^2 (+ (- (ball-x a) (ball-x b)) (* time (- (ball-vx a) (ball-vx b)))))
(^2 (+ (- (ball-y a) (ball-y b)) (* time (- (ball-vy a) (ball-vy b)))))))
Ric gave a talk at the meetup about the importance of choosing the correct point of abstraction for a software project. In this case I decided not to rearrange the equation by hand; I simply wrote out an equation for the distance between the edges of the balls, employing an abstraction in the form of the unwritten solve-for macro. To solve the quadratic one could write a numeric function like Newton-Raphson and that would be one potential implementation of the solve-for macro, but there is an elementary analytic technique called completing the squares which is preferable in the case of a quadratic. I wished to implement the rearrangement of terms automatically as it is error prone and the result is difficult to interpret or modify.
To that end I implemented the analytic solve-for on the train home. And most of the time was spent in trying to debug the essentially correct approach. My initial assessment was that the main task would be the rearrangement of forms to collect the coefficients of the powers of time in the expanded equation. This in the end went well; it is easy to test step by step after all. Where I stumbled again and again was in the reliable discovery of solutions once the coefficients had been determined.
How can this be? The formula is just (-b ± √ (b2 - 4ac))/2a for the solutions to the quadratic ax2 + bx + c = 0. The case where a is zero is handled separately; and I did it separately — when the coefficient a could be statically determined to be always 0. When it became zero because the balls were moving at the same velocity, the solver would crash.
The bug which confused me the most was that this code
(case (signnum ,b2-4ac) (-1 ...) (0 ...) (1 ...))
dealing with the cases of the balls touching but never overlapping, overlapping then parting, and never touching, did not function as I expected despite my testcases. It transpires to my surprise that signnum is defined to return, not a member of the set of fixnums -1, 0, 1 but these numbers in the same type as the original input which causes this case statement to fall through when passed a float. As most CPUs provide enough information on comparison to distinguish these three eventualities in a single instruction it is rather sad not to be able to exploit it idiomatically. To discover this bug I wrote an alternative to the solve-for that used a simple bisection searching for a change in sign to triangulate the source of my confusion.
Finally, having discovered all these cases, I consider that I should have abstracted out the coefficient solver; a function that takes coefficients and returns the solutions, rather than implementing it inline in the solve-for macro, which should not have expanded focused on the task of doing the rearrangement, which it achieved very successfully.
The mistake I made was in not understanding deeply enough the correct prototype or function signature for the coefficient solver function: after reflection and discussion with my friend Richard Smith, I believe it should take in as input the symbolic representation of the equation with all the constants resolved to the velocities of the balls in question, and then as output return an object which can respond to queries for the next solution at a time greater than or equal to a given time. This would allow one to handle tougher functions that may oscillate very rapidly and cross zero an infinite number of times in a finite interval.
Post a comment
Posted 2013-01-09 07:25:57 GMT
1 watching live
Media center computers are ideally without keyboard. However sometimes they need a software patch. Servers need neither input devices nor screen. How to run commands on them if they lose their Internet connexion? And how to do so securely?
Here I present vii-secure-autorun, a system for running commands from removable media like USB drives and DVDs, with the guarantee that only code from trusted sources can affect the machine. With these udev rules it will attempt to mount and check the signature on any ext2 filesystem labeled vii-secure-auto
ACTION=="add", ENV{ID_FS_LABEL}=="vii-secure-auto", ENV{ID_FS_TYPE}=="ext2", ENV{UDISKS_PRESENTATION_HIDE}:="1", RUN+="/etc/vii-secure-autorun/vii-secure-autorun signed-execute-dev $env{DEVNAME}"
Of course, by simply automounting the removable filesystem it may be possible to exploit bugs in the filesystem drivers and so on, so caveat emptor.
vii-secure-autorun signed-execute-dev /dev/sda1 — mount the device and execute the code on it, umount it, etc.
gpg --export | vii-secure-autorun import-keys — add keys to the trusted keychain
vii-secure-autorun package-sign directory — make a tarball of the files in the directory and sign it; the file that will be executed on unpacking is vii-secure-autorun-exec
Hope it's useful, it is to me!
Post a comment
Posted 2013-01-03 15:28:01 GMT
1 watching live
Proud to be included in Vsevolod Dyomkin's lisp hacker interview series.
Post a comment
Posted 2012-10-07 05:03:47 GMT
1 watching live
Here's a presentation on manardb (slides) at the Bay Area Lisp and Scheme Users Group.
The group had some great suggestions for reducing the cost of the pointer rewriting — one of the best suggestions is maintaining a second list for each object type for the gaps so that they can be filled in again. This would need a simple datastructure: maybe an array of locations with a counter at the head. When freeing an object try to write it at the position below the counter and decrement the counter, or if the counter is at zero add it to the end. When allocating try to use the location at the counter and increment the counter.
Thanks to Nick Allen for organising and to everybody who attended!
Hey John,
You are a hero --- keeping the CL flame burning. I miss our beer sessions,
Greg
Posted 2012-10-06 22:34:50 GMT by
Post a comment
Posted 2012-09-04 00:55:18 GMT
1 watching live
Access to my single user account on my personal Ubuntu computers is effectively a passport to root access because I type in my password to sudo in an X terminal and it's pretty easy to sniff all X keystrokes. Therefore, the extra security afforded by having to type a password is minimal. Why not acknowledge this fact and avoid having to type in a password for sudo? Why not indeed?
To set up passwordless sudo perform the following commands
echo %sudo ALL=NOPASSWD: ALL | sudo tee /etc/sudoers.d/sudo-group sudo chmod 0440 /etc/sudoers.d/sudo-group sudo adduser $USER sudo
Post a comment
Posted 2012-08-18 21:37:14 GMT
1 watching live
What's hot for aspiring dotcom billionaires? Building apps for phones. And justifiably — there's a huge opportunity to quickly ship something to improve people's lives. But there's a diversity of mobile platforms. There are a few companies trying to make cross-platform development frameworks but generally the decision is between focusing on Apple iOS or Android or making a browser based HTML5 app — probably packaged as an app with a webview browser wedged inside it.
The HTML idea is initially attractive and seems to make sense from a business standpoint, because it promises to reduce the effort needed to cover a large market. But this is an illusion. Why?
True, cobbling together something in HTML5 may be possible. But the user experience will inevitably be bad:
— offline access is a really important part of the mobile user experience. Devices go offline all the time: in buildings, in transit, in reception blackspots: in real life an always on connexion is not really available yet. The reason people have an app not a website is that they want to access it on the move and that means that ideally it should provide a decent user experience offline, as much as realistically possible.
— integration with other apps on the platform, user interface conventions, and access to device specific APIs like camera, accelerometer, notifications etc. is much more painful and even impossible.
— performance and responsiveness will be worse. From network access to user interface widgets, native implementations have access to alternatives that are simply unavailable in HTML5.
Now these things are well-understood and very obvious a priori. But they're not the real killer: the real killer is that a webview HTML5 solution is incredibly expensive to polish and make smooth. How do you control the loading experience when to control the loading experience you inevitably need to load a bunch of JavaScript? And then the bugs, the bugs: from SSL not working well on old Android 2 webviews, to crashing on some API calls, to handling loss of connectivity to server-side error responses, to handling running out of memory, it's vastly difficult to go from a demo to a polished production ready system. Essentially, it's a technical trap: doing easy things is easy but getting the details right is impossible sometimes and very tricky and tiresome generally.
In the end, HTML5 doesn't save development time: it increases it and the extra time is spent fiddling with bugs in broken browser implementations. In the end, it's quicker to ship one app per platform, a better user experience, better for morale, and better for finding market fit.
HTML5 caching?
Posted 2012-08-19 18:02:02 GMT by
What if your plan is to support say 10 different platforms. Write and maintain 10 different apps?
Posted 2012-08-20 10:38:12 GMT by
- Offline access: do you know HTML5 local storage and session storage?
- Integration with device: do you know PhoneGap?
- About performance, we could discuss depending what you want to do, although normaly a native application works better.
Posted 2012-08-20 10:57:35 GMT by
And of course, HTML5 cache
Posted 2012-08-20 10:59:07 GMT by
Amazing points.
Except:
HTML5 can work offline. Gmail does it.
Device integration is happening now.
Performance. Flash may beat HTML5 games, but do you need ultra-responsive interfaces for a note-taking app?
Bugs. So you're saying that native apps never have bugs? Cool.
Development time. So your desktop app works on Windows, Mac and Linux and updates instantly?
HTML5 may not be the answer to everything, but you're making points which are ridiculous.
Posted 2012-08-20 17:59:05 GMT by
Post a comment
Posted 2012-08-15 15:37:55 GMT
1 watching live
On a webpage you can mark part of an image as clickable by including a <map> element specifying the clickable region. In a fine example however of how poorly thought out and badly specified the web standards are, this clickable region is specified in pixels and if the image is resized (zoomed to make it fit so that one screen pixel is not one image pixel) the area will not be adjusted to compensate.
How to work around the idiotic oversight? Well in JavaScript with JQuery, if you get hold of the height and the original height of the image (assuming an aspect preserving transformation), this will do it:
$('area').each(function() {
orig_coords = $(this).attr('orig-coords');
if (!orig_coords) {
orig_coords = $(this).attr('coords');
$(this).attr('orig-coords', orig_coords);
} scale = height / orig_height;
$(this).attr('coords', orig_coords.split(',').map(function(x) {
return Math.round(x * scale)
}).join(','));
});
tpd2 is still developing?
Posted 2012-08-18 08:44:27 GMT by
Yes, we need to know more about tpd2. Whatever happened to tpd1
Posted 2013-01-10 08:11:45 GMT by
Post a comment
Posted 2012-07-19 06:58:16 GMT
1 watching live
I was interviewing a candidate and, given their background, I was surprised to find that they didn't know that hashtables have non-constant time insertions. A constant time insertion hashtable with constant time lookup by key is a sleight of hand: an unsustainable impractical proposition like a fixed space representation of numbers. Eventually it runs out.
Now hashtables can have amortized O(1) element access given fixed size elements, but anybody who has tried to build up a big index knows that they have much worse than O(n) time cost for inserting n elements.
The reason is that occasionally the table has to be rebalanced. And this rebalancing takes some time.
To maintain the O(1) element access time the buckets of the hashtable must be bounded in size, k. Therefore when one bucket has exceeded this size the table must be rebalanced. Even if we are very generous with the number of buckets, when a large number of items are inserted one bucket is likely to overflow (the birthday problem again). At that point the table must be rebalanced.
Assuming an entirely new hashtable has to be created on rebalancing, then the cost is O(n) for one rebalancing. Clearly the number of rebalancings must increase as n increases, though this is as far as I know not a pretty formula. Therefore the time complexity of rebalancing is O(n) for the actual insertions, then in addition O(r(n)) where r(n) is an increasing function of n, for the rebalancings. So the insertion is definitely more than O(n). For k = 1, by the birthday problem I think this comes to O(n3/2) but for higher ks I am uncertain.
For a concrete example, I wrote the following program using a decent hashtable implementation and timing the inserting of batches of 10M new elements.
#include <unordered_map> #include <iostream> #include <boost/timer/timer.hpp>int main() { std::unordered_map<double, int> m; int i = 0;
for (;;) { boost::timer::auto_cpu_timer _; for (int j = 0; 10000000 > j; ++j) { m[i] = i++; } } }
which demonstrates decidedly non-linear running time for insertions
8.633190s wall, 8.330000s user + 0.290000s system = 8.620000s CPU (99.8%) 13.141784s wall, 12.780000s user + 0.330000s system = 13.110000s CPU (99.8%) 7.938090s wall, 7.740000s user + 0.190000s system = 7.930000s CPU (99.9%) 21.005692s wall, 20.410000s user + 0.560000s system = 20.970000s CPU (99.8%) 8.781604s wall, 8.520000s user + 0.240000s system = 8.760000s CPU (99.8%) 8.410468s wall, 8.240000s user + 0.150000s system = 8.390000s CPU (99.8%)
If you have a reasonable bound on the size of a hashtable (so if you reach that bound you promise to delete before you insert) it probably does make sense to model insertions as O(1), I suppose.
Posted 2012-07-20 23:04:15 GMT by
Right: but in practice you will often scale up the size of the machine to match the size of the desired hash-table. So suppose you know you need to store a billion 64-bit entries -> 64 keys: you might scrape by with a 64GB machine. But then there is almost certainly going to be some rebalancing as you build up this index. . .
Posted 2012-07-23 07:19:29 GMT by
Post a comment
Post a comment