I missed responding to Lena Tenenboim's question on the group
a while back, so here goes.
> I am going to use LingPipe LMClassifier for multi-label news
What to do is going to depend heavily on whether the
categories are exhaustive (every input falls in a
category) and exclusive (no input falls into two categories).
Exhaustive and exclusive categories are the situation all of our
classifiers were designed for (including the new ones).
> I think about two ways to do it. First way is to use
> DynamicLMClassifier assigning to the content all categories with
> conditionalProbability or jointLog2Probability above the defined
> threshold. Here I fear from defining threshold, I can't even imagine
> what it should be.
If the above conditions are met, you don't have to define
a threshold. Just take the best-scoring category.
If you want to try to reject some inputs because they don't
match any categories, the best thing to look at is cross-entropy
rates, which are reflected in the score() method in the resulting
classifications from the LM-based classifiers. You need to look
at scores for matching and non-matching docs and use that to set
a threshold.
Another way to do it would be to have a rejection/junk
or none-of-the-above category. But those are tricky to train,
because there's no way to get a balance of all the docs that
aren't one of your categories.
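Here's a minimal sketch of the threshold-based rejection described above.
It isn't LingPipe code: the class name, the method, and the scores map are
all made up for illustration. I'm assuming you've already pulled a
per-character log2 score for each category out of the classification
(negative numbers, closer to zero meaning a better fit), and that minScore
is a threshold you set empirically.

```java
import java.util.Map;
import java.util.Optional;

// Hypothetical helper, not part of LingPipe.
final class RejectingClassifier {

    /**
     * Returns the best-scoring category, or empty if even the best
     * category scores below minScore (i.e. the input is rejected).
     *
     * @param scores   category -> per-character log2 score (assumed input)
     * @param minScore rejection threshold, set empirically
     */
    static Optional<String> bestOrReject(Map<String, Double> scores, double minScore) {
        String best = null;
        double bestScore = Double.NEGATIVE_INFINITY;
        for (Map.Entry<String, Double> e : scores.entrySet()) {
            if (e.getValue() > bestScore) {
                bestScore = e.getValue();
                best = e.getKey();
            }
        }
        // Keep the winner only if it clears the threshold.
        return (best != null && bestScore >= minScore)
                ? Optional.of(best)
                : Optional.empty();
    }
}
```

With no threshold (minScore set to negative infinity) this reduces to just
taking the best-scoring category.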
> The second way is to build BinaryLMClassifier for each category and
> assign the content with all "true" categories. In the second case I
> also have to define some crossEntropyThreshold.
This is the better approach if the categories are not
mutually exclusive and exhaustive. Again, you have
the choice of negative models or thresholds. Again, set the thresholds
empirically on a per-category basis. The way to do this is
with cross-validation. Divide your data up into ten piles per
category. Then for each pile, train on all the other piles and test
on that pile. Choose the settings that work best across the piles.
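The cross-validation recipe above might look roughly like this for a single
category. Everything here is a sketch under assumptions: the Example class,
the trainer function (train on a list of examples, get back a scorer where
higher means a better match), and the candidate thresholds are all
hypothetical, and I've picked F1 pooled over the folds as the measure of
"works best", which you may well want to change.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;
import java.util.function.ToDoubleFunction;

// Hypothetical helper, not part of LingPipe.
final class ThresholdCv {

    static final class Example {
        final String text;
        final boolean inCategory;
        Example(String text, boolean inCategory) {
            this.text = text;
            this.inCategory = inCategory;
        }
    }

    /** Returns the candidate threshold with the best pooled F1 over all folds. */
    static double pickThreshold(List<Example> data, int numFolds, double[] candidates,
                                Function<List<Example>, ToDoubleFunction<String>> trainer) {
        int[] tp = new int[candidates.length];
        int[] fp = new int[candidates.length];
        int[] fn = new int[candidates.length];
        for (int fold = 0; fold < numFolds; ++fold) {
            // Every numFolds-th example goes into the held-out pile.
            List<Example> train = new ArrayList<>();
            List<Example> test = new ArrayList<>();
            for (int i = 0; i < data.size(); ++i) {
                if (i % numFolds == fold) test.add(data.get(i));
                else train.add(data.get(i));
            }
            ToDoubleFunction<String> scorer = trainer.apply(train);
            for (Example ex : test) {
                double score = scorer.applyAsDouble(ex.text);
                for (int c = 0; c < candidates.length; ++c) {
                    boolean accepted = score >= candidates[c];
                    if (accepted && ex.inCategory) ++tp[c];
                    else if (accepted) ++fp[c];
                    else if (ex.inCategory) ++fn[c];
                }
            }
        }
        int best = 0;
        double bestF1 = -1.0;
        for (int c = 0; c < candidates.length; ++c) {
            int denom = 2 * tp[c] + fp[c] + fn[c];
            double f1 = denom == 0 ? 0.0 : (2.0 * tp[c]) / denom;
            if (f1 > bestF1) {
                bestF1 = f1;
                best = c;
            }
        }
        return candidates[best];
    }
}
```

You'd call this once per category with numFolds set to 10, and shuffle the
data first if it's ordered by date or source.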
> And also I am
> afraid of multiplicity of classifiers (I have about 400 categories)
> and as a result huge reduce of classification time performance.
It depends on the amount of training data you have.
You can also prune the models through the counters
underlying the language models.
The character LMs run at 1-2M chars/second, so 500
categories would run at 2-4K chars/second.
Generally, when people try to scale something like this, they
look for a so-called "blocking" (de-duplication/linkage terminology),
"fast match" (speech reco terminology), or "canopy" (Andrew
McCallum's term) step to find possible matches efficiently before
running more expensive classifiers on the candidates.
A good way to do this is with hierarchical classification.
Then you just classify at the top level, take any likely
candidates, then proceed down the hierarchy. That way,
you don't waste time trying to separate tennis from golf
when it's not an article about sports.
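A rough sketch of that top-down routing, assuming a hypothetical Node
class that pairs a category name with some scorer (higher meaning a better
match) and a single minScore cutoff shared by all levels. None of these
names come from LingPipe, and in practice you'd probably want a separate
cutoff per node.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.ToDoubleFunction;

// Hypothetical helper, not part of LingPipe.
final class HierarchicalRouter {

    static final class Node {
        final String name;
        final ToDoubleFunction<String> scorer; // assumed: higher = better match
        final List<Node> children = new ArrayList<>();

        Node(String name, ToDoubleFunction<String> scorer) {
            this.name = name;
            this.scorer = scorer;
        }

        Node add(Node child) {
            children.add(child);
            return this;
        }
    }

    /**
     * Returns the leaf categories reached by descending only through
     * nodes that score at or above minScore.
     */
    static List<String> classify(List<Node> level, String text, double minScore) {
        List<String> result = new ArrayList<>();
        for (Node node : level) {
            if (node.scorer.applyAsDouble(text) < minScore) {
                continue; // unlikely candidate: skip its whole subtree
            }
            if (node.children.isEmpty()) {
                result.add(node.name);
            } else {
                result.addAll(classify(node.children, text, minScore));
            }
        }
        return result;
    }
}
```

The saving comes from the continue: if the sports node doesn't clear the
cutoff, the tennis and golf scorers never run.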
> Also, I saw you've added a PerceptronClassifier in the last release.
> What is your expectation regarding its performance comparably to
> DynamicLMClassifier. Would you suggest using it instead?
I haven't actually played around much with perceptrons, and haven't
done any large-scale evals. The main advantage of perceptrons is
that they provide large-margin discriminative training over arbitrary
features. So it'll largely depend on what you use as features. You
could use character n-grams of various lengths, token n-grams,
tokens plus part-of-speech tags, or whatever.
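To make the first of those feature choices concrete, here's a small sketch
of counting character n-grams over a range of lengths. It's plain Java, not
LingPipe's feature extractor interface; the class and method names are
made up, and wiring the resulting map into a classifier is left out.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper, not part of LingPipe.
final class CharNGramFeatures {

    /**
     * Counts every character n-gram of length minN..maxN (inclusive).
     * Works on Java chars (UTF-16 code units), so characters outside
     * the basic multilingual plane count as two.
     */
    static Map<String, Integer> extract(String text, int minN, int maxN) {
        Map<String, Integer> counts = new HashMap<>();
        for (int n = minN; n <= maxN; ++n) {
            for (int i = 0; i + n <= text.length(); ++i) {
                counts.merge(text.substring(i, i + n), 1, Integer::sum);
            }
        }
        return counts;
    }
}
```

For example, extract("abab", 1, 2) gives a=2, b=2, ab=2, ba=1. Token
n-grams work the same way over a token list instead of a string.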
There are two main problems with perceptrons. One, they're
essentially binary classifiers. You can make them multi-way
by just using their scores, but our implementation won't be very
efficient for that (because the basis/support vectors won't be shared).
The second main problem is that perceptrons aren't dynamic -- you train
them all at once iteratively. At least that's true of the averaged
perceptron implementation in LingPipe. They're also very costly to
train and run in both size and compute time.
So I'm afraid you'll have to make do with this weasely answer.
If you do evals and get results, feel free to post to the list.
The TF/IDF classifier should scale well, but the KNN classifier
stores everything and is thus even more difficult to scale
than the perceptrons.
- Bob Carpenter