Thank you very much.
The application is to find law firm and accounting firm names in certain
financial and legal documents. We are looking at training on our own data.
Speed is less critical than accuracy. First-best results are
fine. No need for nested entities. Very few types. Yes, we have an existing list
that we can use for training, and we can keep improving it.
I will check out the rescoring chunker. Thanks again.
From: colloquialdotcom <carp@...
Sent: Tue, July 27, 2010 11:09:26 AM
Subject: Re: [LingPipe] Help in running the NE tutorial with Genetag data.
> I am a newbie in NLP. I am currently evaluating Python NLTK,
> Stanford-Ner and LingPipe for a project for my company.
Glad to see you made this much progress.
What's the application? Are you looking for a pre-built
model or to train on your own data? What kind of performance
requirements do you have in terms of speed and accuracy?
Do you need n-best results or are first-best results OK?
Do you need scaled probabilities to compare entities across
sentences or documents? Do you need nested entities?
Are there lots of entity types or only a few? Do you have
external resources to help train models, like dictionaries,
gazetteers, annotated data, etc.?
> The values for the new tests above were taken directly from genetag.tag.
> I removed the tags and was expecting a confidence of near-100 for "serum
> LH" and "factor IX". However, I only get 0.1574 for "serum LH" and
> 0.5877 for "factor IX".
I don't think you're doing anything wrong.
It all comes down to the contextual statistics and
what the models are designed to do.
> [java] 0 0.7681 (39, 41) GENE LH
> [java] 1 0.1574 (33, 41) GENE serum LH
> [java] 0 0.5877 (18, 27) GENE factor IX
> [java] 1 0.5037 (77, 86) GENE factor IX
> [java] 2 0.2418 (25, 27) GENE IX
> [java] 3 0.1796 (77, 94) GENE factor IX antigen
> [java] 4 0.0696 (13, 27) GENE hand factor IX
> [java] 5 0.0294 (84, 86) GENE IX
> [java] 6 0.0196 (84, 94) GENE IX antigen
> [java] 7 0.0006 (7, 27) GENE other hand factor IX
> [java] 8 0.0000 (13, 24) GENE hand factor
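As an aside, if you want to post-process the n-best output above, the lines are easy to pull apart. Here's a rough sketch; the field layout (rank, confidence, (start, end), type, text) is just inferred from the sample output, not taken from any LingPipe API:

```python
import re

# Rough sketch: parse n-best chunk lines like those above into records.
# The format is inferred from the sample console output shown here.
LINE = re.compile(r"(\d+)\s+([\d.]+)\s+\((\d+),\s*(\d+)\)\s+(\S+)\s+(.+)")

def parse_chunk_line(line):
    m = LINE.search(line)
    rank, conf, start, end, chunk_type, text = m.groups()
    return int(rank), float(conf), (int(start), int(end)), chunk_type, text

print(parse_chunk_line("[java] 0 0.5877 (18, 27) GENE factor IX"))
# (0, 0.5877, (18, 27), 'GENE', 'factor IX')
```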
You might want to look at the rest of the training data.
You'll see lots of instances of "LH" labeled as
a whole gene on its own, and lots of instances of "serum"
that aren't labeled as part of a gene. In these cases, the
model's confidence estimates simply follow those contextual statistics.
The demos you're looking at in LingPipe are for HMMs
and HMM-like parsers. The confidence chunker you're looking
at is intentionally designed with heavy smoothing (toward
uniform, away from the training data) in order to be usable
for very high recall applications.
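To see why heavy smoothing toward uniform deflates confidence scores, here's a toy numeric sketch. The single-parameter interpolation, vocabulary size, and probabilities are all illustrative assumptions, not the actual model internals:

```python
# Toy illustration of interpolation smoothing: mix the model's estimate
# with a uniform distribution. Numbers are made up; this is not the
# LingPipe implementation.

def smooth(p_model, vocab_size, lam):
    """lam is the weight on the model; (1 - lam) goes to uniform."""
    return lam * p_model + (1.0 - lam) / vocab_size

p = 0.95                                  # model is nearly certain of a chunk
print(round(smooth(p, 10_000, 0.99), 4))  # light smoothing: 0.9405
print(round(smooth(p, 10_000, 0.10), 4))  # heavy smoothing: 0.0951
```

The heavily smoothed score is pulled almost all the way down to uniform, which is what makes such a model useful for high-recall applications: nothing gets a confidence so low it falls off the n-best list entirely.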
You should use our rescoring chunker for better accuracy.
There you'll find confidences that'll be closer to what
you'd expect given the training data.
You can also turn down the smoothing by lowering the
interpolation ratios in the models. In the limit,
you'll get a maximum likelihood model with no smoothing
that'll reproduce the training annotations as closely as
is possible. The reason to add more smoothing is that
new data often differs from the training data, and the
tighter you (over)fit a model to training data, the worse
it performs on held-out data.
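A toy sketch of that interpolation-ratio idea (Jelinek-Mercer style mixing; the counts and the backoff distribution here are invented for illustration): as the weight on the maximum likelihood estimate goes to 1, the model reproduces the training counts exactly.

```python
from collections import Counter

# Invented counts for illustration; not real GeneTag statistics.
counts = Counter({"gene": 8, "serum": 2})
total = sum(counts.values())
backoff = 1.0 / len(counts)   # crude uniform backoff over the two words

def p_interp(word, lam):
    """lam = weight on the maximum likelihood estimate from counts."""
    return lam * counts[word] / total + (1.0 - lam) * backoff

print(round(p_interp("gene", 1.0), 4))  # 0.8  : pure ML, fits training exactly
print(round(p_interp("gene", 0.5), 4))  # 0.65 : pulled toward the backoff
```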
You could also train up a conditional random field (CRF),
which is more work, but can be more accurate, especially
if you have external resources. CRFs implement all the
same run-time interfaces in LingPipe as HMMs, but are
trained differently. There's a separate tutorial.
- Bob Carpenter