LingPipe 2.2.1 patched for Java 1.4
I rebuilt LingPipe 2.2.1 on the web site
so that it would work with Java 1.4. I
hope this clears up the major/minor version errors.
- Bob Carpenter
I have a question regarding the sentiment API.
The LMClassifier does a good job in getting the
best category among "pos" and "neg". How can I get a
rank/probability/confidence number from the
classifiers, so that
a) If the classifier is VERY confident, I will accept
the pos/neg classification.
b) If the classifier is LESS confident, I may direct
the classification to another process.
c) If the classifier is NOT confident at all, I will
throw away the item being classified.
--- Bob Carpenter <carp@...> wrote:
> I rebuilt LingPipe 2.2.1 on the web site
> so that it would work with Java 1.4. I
> hope this clears up the major/minor version errors.
> - Bob Carpenter
--- Sanjay Singh wrote:
> Hi,
> I have a question regarding the sentiment API.
> The LMClassifier does a good job in getting the
> best category among "pos" and "neg". How can I get a
> rank/probability/confidence number from the
> classifiers, so that
> a) If the classifier is VERY confident, I will accept
> the pos/neg classification.
> b) If the classifier is LESS confident, I may direct
> the classification to another process.
> c) If the classifier is NOT confident at all, I will
> throw away the item being classified.
You'll have to get that through the API -- the tutorial
doesn't cover it (yet, anyway).
The classification API is set up to run through
implementations of a very simple interface:
Classification classify(Object input);
Depending on the implementation, you may be able to
cast the resulting classification to a more specific
class. The language-model based classifiers return
instances of classify.JointClassification. Sorting
through the inheritance, a JointClassification provides
the following methods:
RankedClassification extends Classification
    String category(int rank);
ScoredClassification extends RankedClassification
    double score(int rank);
ConditionalClassification extends ScoredClassification
    double conditionalProbability(int rank);
JointClassification extends ConditionalClassification
    double jointLog2Probability(int rank);
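Given a trained classifier and an input, a loop like this will dump
the whole n-best list with everything the hierarchy exposes (an
untested sketch; the cast assumes an LM-based classifier, which
returns a JointClassification as described above):

JointClassification c
    = (JointClassification) classifier.classify(input);
for (int rank = 0; rank < c.size(); ++rank)
    System.out.println(rank
        + " " + c.category(rank)                // RankedClassification
        + " " + c.score(rank)                   // ScoredClassification
        + " " + c.conditionalProbability(rank)  // ConditionalClassification
        + " " + c.jointLog2Probability(rank));  // JointClassification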
The conditional probability is an estimate of the probability
of the category given the input, and is the thing to use
to set thresholds. The joint probability includes the
probability of the object being classified, and is thus
not scaled to be comparable across different object inputs.
So if your categories are "pos" and "neg", you can get LingPipe's
estimate of "pos" confidence by:
String input = ...;
ConditionalClassification classification
    = (ConditionalClassification) classifier.classify(input);
double posConfidence = Double.NEGATIVE_INFINITY;
if (classification.size() > 0 && classification.category(0).equals("pos"))
    posConfidence = classification.conditionalProbability(0);
else if (classification.size() > 1 && classification.category(1).equals("pos"))
    posConfidence = classification.conditionalProbability(1);
This is being extra paranoid in case for some reason the
classifier doesn't return enough results -- this shouldn't
happen with the LM classifiers built in the usual way. So
this could be simplified to:
double posConfidence = classification.conditionalProbability(posIndex); // posIndex: rank at which "pos" appears
I should provide a warning: if the input is very long, these
probabilities will quickly approach 0 or 1 and may even round off
to 0 or 1 with 64-bit floating-point arithmetic. This is a problem
with the underlying model's naivete -- it doesn't account for
any topical or longer-distance conditional probability structure,
so its confidence tends to get exaggerated. (This is
a problem with almost all text classifiers, and is an even greater
problem with speech-based acoustic classifiers.)
A better thing to use for the ranking is the cross-entropy rate,
which is defined to be the joint log2 probability divided by the
length of the input:

double xEntropyRate
    = classification.jointLog2Probability(posIndex) / input.length();
The unit here is bits-per-character -- it's what it would cost
to compress the input using the positive model in an entropy coder.
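If you don't want to assume which rank "pos" lands at, a small
helper along these lines would pull out the rate for any category
(hypothetical code, not part of the LingPipe API):

// Hypothetical helper -- not in LingPipe itself.
// Per-character joint log2 probability for the given category,
// or negative infinity if the category isn't in the ranking.
// Log2 probabilities are <= 0, so values closer to zero mean
// a better per-character fit.
static double xEntropyRate(JointClassification c,
                           String category, int inputLength) {
    for (int rank = 0; rank < c.size(); ++rank)
        if (c.category(rank).equals(category))
            return c.jointLog2Probability(rank) / inputLength;
    return Double.NEGATIVE_INFINITY;
}

For example, xEntropyRate(c, "pos", input.length()).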
I'm afraid we don't provide any help in setting thresholds
for these -- I'd suggest inspecting your results and setting
them empirically.
One more suggestion: if you have only a positive model, then
you can actually set up a classifier based on this kind of
threshold without having to train a negative model. Sometimes
this works better, especially if the negative instances do not
form a coherent topical/genre/etc. collection.
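For instance, with a character-level process language model --
a sketch, not tested, assuming com.aliasi.lm.NGramProcessLM's
train(CharSequence) and log2Estimate(CharSequence) methods and
a threshold you tune by hand:

NGramProcessLM posLM = new NGramProcessLM(6); // 6-gram char model
String[] posTexts = ...;                      // positive training data
for (int i = 0; i < posTexts.length; ++i)
    posLM.train(posTexts[i]);

String input = ...;
// log2Estimate is <= 0; values closer to zero mean a better fit
double bitsPerChar = posLM.log2Estimate(input) / input.length();
boolean acceptAsPos = bitsPerChar > threshold;

Anything that compresses too poorly under the positive model just
gets rejected, with no negative model in sight.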