Posted 2013-05-05 22:00:00 GMT
The C++ library and toolset liblinear is awesome for large, sparse (20M+ row) logistic regression: using past data to predict the probability of an occurrence.
Unfortunately, it has a few gotchas that can catch you out when using the train and predict functionality.
— feature interactions must be constructed before the data is passed to the package, and text feature labels have to be converted into packed integer feature indices (see the file-format example after this list).
— feature indices are numbered starting from 1, not 0 (the first feature has index 1). If using the C++ interface, indicate the end of a row's features with a feature_node whose index is -1 (see the C interface sketch below).
— only solver 0 (L2 regularisation), solver 6 (L1 regularisation) and, in recent versions, solver 7 (the dual form of L2-regularised logistic regression) train logistic regression; the other solvers are SVMs.
— to benefit from regularisation, scale features appropriately (e.g. divide each feature by its standard deviation); otherwise the single penalty is applied unevenly, and a feature measured on a small scale, which needs a large weight, is penalised much more heavily than a feature whose values span a wide range (a scaling sketch follows this list).
— the C parameter works inversely to the strength of regularisation: the larger C is, the less regularisation you get. To regularise more heavily, make C smaller (e.g. 0.001), as in the command-line example below. For sparse feature selection, use solver 6 (the L1 penalty) with a small C.
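To make the indexing and regularisation points concrete: each line of a training file for the bundled train tool is a label followed by index:value pairs for the non-zero features only, with indices starting at 1 and listed in ascending order, for example

    +1 1:0.5 3:1.0 12:0.25
    -1 2:1.0 3:1.0 40:0.8

and a typical run (the file names, solver choice and C value here are illustrative, not taken from any particular job) looks like

    ./train -s 0 -c 0.001 train.txt model.txt
    ./predict -b 1 test.txt model.txt predictions.txt

where -s 0 selects L2-regularised logistic regression, -c 0.001 asks for heavy regularisation, and -b 1 tells predict to output probability estimates (only available for the logistic regression solvers).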
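For the C++ interface, here is a minimal sketch of building a sparse problem by hand, terminating each row with an index of -1, and asking for predicted probabilities. It assumes a recent liblinear (linear.h, with problem::y as doubles); the toy rows and parameter values are made up for illustration.

    #include <cstdio>
    #include "linear.h"   // liblinear's C interface

    int main()
    {
        // Two training rows; indices are 1-based, ascending, and each row
        // ends with a feature_node whose index is -1.
        feature_node row0[] = { {1, 0.5}, {3, 1.0}, {-1, 0.0} };
        feature_node row1[] = { {2, 1.0}, {3, 1.0}, {-1, 0.0} };
        feature_node *rows[] = { row0, row1 };
        double labels[] = { +1, -1 };

        problem prob;
        prob.l = 2;       // number of rows
        prob.n = 3;       // number of features (highest index used)
        prob.y = labels;
        prob.x = rows;
        prob.bias = -1;   // negative means no bias term is appended

        parameter param = {};          // zero-initialise every field
        param.solver_type = L2R_LR;    // solver 0: L2-regularised logistic regression
        param.C = 0.001;               // small C = strong regularisation
        param.eps = 0.01;              // stopping tolerance

        const char *err = check_parameter(&prob, &param);
        if (err) { std::fprintf(stderr, "%s\n", err); return 1; }

        model *m = train(&prob, &param);

        // Probabilities for a new row, again terminated with index -1.
        feature_node query[] = { {1, 1.0}, {-1, 0.0} };
        double probs[2];
        predict_probability(m, query, probs);
        std::printf("P(label %d) = %g\n", m->label[0], probs[0]);

        free_and_destroy_model(&m);
        return 0;
    }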
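liblinear does not scale features for you. A hedged sketch of the divide-by-standard-deviation idea mentioned above, written densely for clarity (for genuinely sparse data you would scale only the stored non-zero values so as not to destroy sparsity, and you would reuse the training-set deviations on the test set):

    #include <cmath>
    #include <vector>

    // Divide each feature column by its standard deviation, computed on the
    // training rows. Illustrative only; not part of liblinear.
    void scale_by_stddev(std::vector<std::vector<double>> &rows)
    {
        if (rows.empty()) return;
        const std::size_t n_features = rows[0].size();
        for (std::size_t j = 0; j < n_features; ++j) {
            double mean = 0.0, sum_sq = 0.0;
            for (const auto &r : rows) mean += r[j];
            mean /= rows.size();
            for (const auto &r : rows) sum_sq += (r[j] - mean) * (r[j] - mean);
            const double sd = std::sqrt(sum_sq / rows.size());
            if (sd > 0.0)
                for (auto &r : rows) r[j] /= sd;
        }
    }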
This is a great package. Thanks to Dean for much advice, and many thanks to its authors at the Machine Learning and Data Mining Group at NTU in Taiwan!