John Fremlin's blog: A little guide to liblinear logistic regression

Posted 2013-05-05 22:00:00 GMT

The C++ library and toolset liblinear is awesome for sparse, large-scale (20M+ row) logistic regression: using past data to predict the probability of an occurrence.

Unfortunately, it has a few gotchas that can catch you out when using the train and predict functionality.

— feature interactions must be computed before passing data to the package, and text feature labels have to be turned into packed numeric feature indices.
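A minimal sketch of that label-to-index packing (the `FeatureIndexer` helper and `format_row` are my own illustration, not part of liblinear; the output string follows liblinear's `label index:value ...` text format):

```cpp
#include <cassert>
#include <map>
#include <sstream>
#include <string>
#include <utility>
#include <vector>

// Hypothetical helper (not part of liblinear): assigns each text feature
// label a packed index the first time it is seen, starting from 1 as
// liblinear requires.
struct FeatureIndexer {
    std::map<std::string, int> index;  // label -> 1-based feature index
    int get(const std::string& label) {
        auto it = index.find(label);
        if (it != index.end()) return it->second;
        int id = static_cast<int>(index.size()) + 1;  // indices start at 1
        index[label] = id;
        return id;
    }
};

// Format one training row in liblinear's text format:
// "<label> <index>:<value> ..." with indices in ascending order.
std::string format_row(int label,
                       const std::vector<std::pair<int, double>>& feats) {
    std::ostringstream out;
    out << label;
    for (const auto& f : feats) out << ' ' << f.first << ':' << f.second;
    return out.str();
}
```

Interaction features (e.g. "country=US AND device=mobile") would be concatenated into a single label before being handed to the indexer, since the package itself will not cross features for you.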

— feature indices are numbered starting from 1, not 0 (the first feature has index 1). If using the C++ interface, indicate the end of a row's features with a feature_node whose index = -1.
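A sketch of building one sparse row for the C++ interface; the `feature_node` struct here mirrors the one declared in liblinear's linear.h (reproduced so the example is self-contained), and `make_row` is my own illustrative helper:

```cpp
#include <cassert>
#include <vector>

// Mirrors the struct declared in liblinear's linear.h, included here so
// the sketch compiles on its own.
struct feature_node {
    int index;
    double value;
};

// Build one row: 1-based ascending indices, zeros skipped (the data is
// sparse), terminated by a sentinel node with index = -1.
std::vector<feature_node> make_row(const std::vector<double>& dense) {
    std::vector<feature_node> row;
    for (std::size_t i = 0; i < dense.size(); ++i) {
        if (dense[i] != 0.0)
            row.push_back({static_cast<int>(i) + 1, dense[i]});  // 1-based
    }
    row.push_back({-1, 0.0});  // end-of-row sentinel
    return row;
}
```

Forgetting the -1 terminator is a classic way to make the library read past the end of your row.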

— only solver mode 0 (L2 regularisation) and solver mode 6 (L1 regularisation) are for logistic regression (newer versions also add solver 7, the dual form of L2-regularised logistic regression); the others are for SVM.

— to benefit from regularisation, scale features appropriately (e.g. divide by standard deviation). Otherwise the penalty is applied unevenly: a feature measured in large units needs only a small weight and is effectively under-penalised, while one measured in small units needs a large weight and is penalised heavily.
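A minimal sketch of scaling one feature column to unit standard deviation (my own helper, not something liblinear provides). Note it divides by the standard deviation without subtracting the mean: centering would turn the zeros of a sparse feature into non-zeros and destroy sparsity.

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Divide a feature column by its (population) standard deviation so the
// regularisation penalty treats all features comparably. Zeros stay zero,
// preserving sparsity.
std::vector<double> scale_to_unit_stddev(const std::vector<double>& col) {
    double mean = 0.0;
    for (double v : col) mean += v;
    mean /= col.size();
    double var = 0.0;
    for (double v : col) var += (v - mean) * (v - mean);
    double sd = std::sqrt(var / col.size());
    std::vector<double> out;
    out.reserve(col.size());
    for (double v : col) out.push_back(sd > 0.0 ? v / sd : v);
    return out;
}
```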

— the C parameter controls the degree of regularisation inversely: the larger C is, the weaker the regularisation. To regularise more, make it smaller (e.g. 0.001). To get sparse feature selection, use solver 6 (L1 regularisation penalty) with a small C.
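Putting the last two points together, here is a sketch of configuring a strongly L1-regularised logistic regression. The solver ids are copied from liblinear's linear.h; the `parameter` struct here is a trimmed-down mirror showing only the fields this sketch sets (the real one in linear.h has more, e.g. class weights), and `sparse_lr_params` is my own helper:

```cpp
#include <cassert>

// Solver ids as in liblinear's linear.h (only the two primal logistic
// regression solvers shown).
enum { L2R_LR = 0, L1R_LR = 6 };

// Trimmed-down mirror of liblinear's struct parameter; the real struct
// in linear.h has additional fields.
struct parameter {
    int solver_type;
    double eps;  // stopping tolerance
    double C;    // inverse regularisation strength
};

// Strong L1 regularisation for sparse feature selection: solver 6 with a
// small C.
parameter sparse_lr_params() {
    parameter p;
    p.solver_type = L1R_LR;  // 6: L1-regularised logistic regression
    p.eps = 0.01;
    p.C = 0.001;             // smaller C = more regularisation
    return p;
}
```

The equivalent on the command line would be passing `-s 6 -c 0.001` to the train tool.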

This is a great package. Thanks to Dean for much advice, and many thanks to the authors of it at the Machine Learning and Data Mining Group at NTU in Taiwan!
