Are you sure it's bias? If the balance of categories
in the training data reflects the data on which the
classifier will be evaluated, it's what we'd call
an imbalance rather than a bias.
There are two issues you run into with serious
data imbalance. First, if it's language data,
you won't see some character sequences/tokens in
the training data for categories with little data.
That makes it look like they're predictive of
the larger category. You can mitigate this somewhat
by stoplisting -- removing low count tokens.
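As a rough illustration of the stoplisting idea, here is a minimal Python sketch that drops tokens whose total count falls below a threshold. The function name prune_rare_tokens and the cutoff min_count=2 are assumptions made for the example, not part of any library API; in practice you'd apply the same pruning in whatever feature extractor you're using.

```python
from collections import Counter

def prune_rare_tokens(docs, min_count=2):
    """Drop tokens whose total count across all docs is below min_count.

    docs is a list of token lists. min_count is an assumed cutoff; tune it
    on held-out data.
    """
    counts = Counter(tok for doc in docs for tok in doc)
    return [[tok for tok in doc if counts[tok] >= min_count] for doc in docs]

# Toy data: "cheap", "pills" and "agenda" each occur only once.
docs = [
    ["cheap", "pills", "now"],
    ["meeting", "now"],
    ["meeting", "agenda", "now"],
]
pruned = prune_rare_tokens(docs, min_count=2)
```

The singleton tokens are removed before training, so they can't be mistaken for evidence in favor of the larger category.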
Second, the basic estimation mechanism behind logistic
regression weights the error on each training instance
equally. Thus it in some sense tries harder on the
bigger categories, because they contribute more
instances, and so more weight, to the total error.
I'm afraid we don't have weighted training, which
is one way to deal with this. A crude approximation
is to duplicate the small category instances one or
more times and add those to the training data. If each
example appears five times in the training data, that
example gets upweighted by five. That will let
you bring the whole thing into balance, but can lead
to a lot of variance in the estimate due to the
artificially inflated low counts.
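The duplication trick can be sketched in a few lines of Python. This is only an illustration under assumptions: training data is taken to be a list of (features, label) pairs, the helper name oversample_by_duplication is made up, and each category is repeated a whole number of times so that it roughly matches the largest one.

```python
from collections import Counter

def oversample_by_duplication(examples):
    """Repeat each instance so every category roughly matches the largest.

    examples is a list of (features, label) pairs. Each instance is copied a
    whole number of times, which approximates weighting it by that factor.
    """
    sizes = Counter(label for _, label in examples)
    largest = max(sizes.values())
    balanced = []
    for features, label in examples:
        copies = max(1, round(largest / sizes[label]))
        balanced.extend([(features, label)] * copies)
    return balanced

# Toy data: six "big" instances and two "small" ones.
examples = [(f"doc{i}", "big") for i in range(6)]
examples += [(f"rare{i}", "small") for i in range(2)]
balanced = oversample_by_duplication(examples)
label_counts = Counter(label for _, label in balanced)
```

Here each "small" instance ends up in the data three times. The copies carry no new information, which is where the extra variance in the estimate comes from.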
What we wind up doing in practice is that we usually
have a big pile of unlabeled data. So we'll run
the classifier trained on the imbalanced training data
and cherry-pick highly ranked documents for the
underrepresented category and add them to the training
data. It requires more manual supervision, but it's
usually worth it if you're looking for high recall
and precision in the smaller categories. You should
make sure to keep the rejected highly-ranked examples,
too -- they help the classifier learn the difference between
one category and another.
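The selection step might look something like the Python sketch below. Everything here is an assumption for illustration: the classifier output is taken to be (doc_id, {category: probability}) pairs, pick_candidates is a made-up helper, and the 0.8 threshold and 100-document limit are arbitrary. The human_labels dict stands in for the manual review.

```python
def pick_candidates(scored, category, threshold=0.8, limit=100):
    """Return ids of the documents scored highest for the given category.

    scored is an iterable of (doc_id, {category: probability}) pairs.
    threshold and limit are assumed values, not recommendations.
    """
    ranked = sorted(
        ((probs.get(category, 0.0), doc_id) for doc_id, probs in scored),
        reverse=True,
    )
    return [doc_id for p, doc_id in ranked if p >= threshold][:limit]

# Toy classifier output on unlabeled documents.
scored = [
    ("d1", {"rare": 0.95, "common": 0.05}),
    ("d2", {"rare": 0.10, "common": 0.90}),
    ("d3", {"rare": 0.85, "common": 0.15}),
]
to_review = pick_candidates(scored, "rare")

# Hypothetical reviewer decisions: d1 really is "rare", d3 is not.
human_labels = {"d1": "rare", "d3": "common"}

# Keep both the accepted and the rejected documents, each under the label
# the reviewer assigned.
training = [("d0", "common")]
training.extend((doc_id, human_labels[doc_id]) for doc_id in to_review)
```

Note that d3 goes into the training data too, labeled with its corrected category. Those are the rejected highly-ranked examples.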
- Bob Carpenter
On 9/12/11 3:56 PM, Yogesh wrote:
> I am using the LogisticRegressionClassifier for multi-class classification.
> There is a lot of bias in my train data; few categories have way more train
> documents than the others.
> Is there a way to correct this bias to get more accurate results? Does the
> API have a Class for this?
> - Yogesh