## Lesson learnt

Message 1 of 5, Jun 18, 2004
I have just returned from "Odyssey04: The Speaker and Language Recognition
Workshop" (http://www.odyssey04.org)
where I presented my first paper that makes use of probability theory as
logic. The "speaker detection" problem (called "speaker verification" in
commercial applications) is the topic of primary interest in this community.
It is a clearly defined binary hypothesis testing problem:

H1: The speakers in the two given speech segments are the same.
H2: The speakers in the two given speech segments are different.

The primary problem of the speaker recognition engineer is to build a
machine to compute the likelihood ratio / Bayes factor / odds ratio (or
whatever you want to call it):

p( data | H1, state-of-knowledge ) / p( data | H2, state-of-knowledge )

But, as always, there are difficult questions about the prior P(H1 |
state-of-knowledge). The engineer can respond:
1. This is not my problem - it is the problem of the user and depends on the
specific application.
2. Or, as I did: If the user has "no relevant prior knowledge", simply take
the maxent prior distribution P(H1 | K) = 0.5.
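The division of labour in (1) can be made concrete: the engineer delivers the likelihood ratio, and the user combines it with a prior via posterior odds = likelihood ratio x prior odds. A minimal sketch (the function name and numbers are illustrative, not from the paper):

```python
def posterior_h1(likelihood_ratio, prior_h1):
    """Combine the engineer's likelihood ratio with the user's prior:
    posterior odds = likelihood ratio * prior odds."""
    prior_odds = prior_h1 / (1.0 - prior_h1)
    posterior_odds = likelihood_ratio * prior_odds
    return posterior_odds / (1.0 + posterior_odds)

# With the maxent prior P(H1|K) = 0.5, the posterior is driven by the
# likelihood ratio alone: an LR of 10 gives P(H1|data) = 10/11.
p = posterior_h1(10.0, 0.5)
```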

In the long discussions that followed I realized that (2) is incomplete. (I
did not apply the principle of indifference carefully enough!) It is NOT a
good idea to say:

"I don't know" --> P(H1 | K) = P(H2 | K).

To be able to say this, one must be much more specific. ONLY if you can
truly say that your state of prior knowledge leaves the two hypotheses
exchangeable can you assign
P(H1 | K) = P(H2 | K).

This became clear to me because in most practical applications of speaker
detection the hypotheses cannot reasonably be exchanged. If it bothers you
to exchange the hypotheses, it means you do have reason to prefer one
over the other, and some attempt must be made to quantify this preference
numerically.

Niko
Message 2 of 5, Jun 18, 2004
Niko Brummer wrote:

>H1: The speakers in the two given speech segments are the same.
>H2: The speakers in the two given speech segments are different.
>
>The primary problem of the speaker recognition engineer is to build a
>machine to compute the likelihood ratio / Bayes factor / odds ratio (or
>whatever you want to call it):
>
>p( data | H1, state-of-knowledge ) / p( data | H2, state-of-knowledge )
>

When I tried to do similar things in the context of speech (NOT speaker)
recognition (specifically, choosing the number of mixture components to
use in a Gaussian mixture for the emission pdf of an HMM), I ran into a
problem with the faulty independence assumption that is pervasive in
speech recognition circles. That is, feature vectors at different times
are considered independent, conditional on the HMM state. This
assumption is not even roughly true, especially when you add dynamic
features (estimated first and second derivatives of base feature values).
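A sketch of how such dynamic features are typically computed, by linear regression over a few past and future frames (the window half-length N=2 and the edge-padding policy are illustrative assumptions):

```python
import numpy as np

def delta(features, N=2):
    """Estimate per-frame time derivatives of a (T, D) feature array by
    linear regression over N past and N future frames; the edges are
    padded by repeating the first/last frame."""
    T = features.shape[0]
    padded = np.concatenate([np.repeat(features[:1], N, axis=0),
                             features,
                             np.repeat(features[-1:], N, axis=0)])
    denom = 2.0 * sum(n * n for n in range(1, N + 1))
    out = np.zeros_like(features, dtype=float)
    for n in range(1, N + 1):
        out += n * (padded[N + n:N + n + T] - padded[N - n:N - n + T])
    return out / denom

base = np.random.randn(100, 13)   # e.g. 13 cepstral coefficients per frame
d1 = delta(base)                  # first-derivative ("delta") features
d2 = delta(d1)                    # second-derivative ("delta-delta") features
augmented = np.hstack([base, d1, d2])
```

By construction, neighbouring augmented vectors are deterministic functions of overlapping base frames, which is exactly the dependency that the conditional-independence assumption ignores.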

The net effect of this faulty model is that the effective amount of data
you have is exaggerated. Thus when you try to do model comparison to
choose the number of mixture components using this faulty model, it
favors use of too many mixture components. There are two approaches to
improving this model, both of which have problems:

1. Auto-regress current feature values on previous feature values. This
doesn't work as well for recognition itself, as the estimated first and
second derivatives seem to do a better job of discrimination. (The
estimated derivatives are calculated from both past and FUTURE feature
values, using a time delay.)

2. Build a maximum-entropy model that accounts for the definitions of
the estimated derivatives in terms of the other features, and thereby
also accounts for dependencies. This requires computation of certain
normalization constants, and the computation turns out to be very demanding.

So how do you deal with this problem of exaggerating the effective
amount of data when dependencies are ignored? Have you found a
computationally tractable way of accounting for the dependencies?
Message 3 of 5, Jun 21, 2004

> When I tried to do similar things in the context of speech (NOT speaker)
> recognition (specifically, choosing the number of mixture components to
> use in a Gaussian mixture for the emission pdf of an HMM), I ran into a
> problem with the faulty independence assumption that is pervasive in
> speech recognition circles. That is, feature vectors at different times
> are considered independent, conditional on the HMM state. This
> assumption is not even roughly true, especially when you add dynamic
> features (estimated first and second derivatives of base feature values).

Just a quick comment on the use of the term "assumption" and the concept of
the "truth" of this assumption:
The best thing for me about probability theory as logic was that it made me
feel a bit better about using probability as a tool. When using terms like
"estimating" probability distributions (under faulty assumptions) one always
feels somewhat uneasy. Now I can rather say: "this is
my state of knowledge from which I ASSIGN probability distributions and this
is the best we can do from this state of knowledge." Then again, this is of
course no excuse not to try to improve your state of knowledge. The
assignment of a probability model that takes speech frames as
interchangeable is of course done from a state of knowledge which
effectively "ignores" temporal relationships.

> The net effect of this faulty model is that the effective amount of
> data you have is exaggerated.

Agreed. See my final comment on how I deal with this problem.

> Thus when you try to do model comparison to
> choose the number of mixture components using this faulty model, it
> favors use of too many mixture components.

In the much simpler problem of speaker detection, the number of mixture
components often turns out to be unimportant. In the "text-independent"
flavour of speaker recognition with which I am involved, even the temporal
dependence given by an HMM does not seem to help. It has been shown
experimentally by many researchers that you can do best by just using a
degenerate HMM, simply called a Gaussian Mixture Model (GMM), which has no
state transition probabilities. This makes the speech frames completely
interchangeable. You simply use one huge GMM (typically 256 to 2048
components) for a speaker model and a similar GMM for a speaker-independent
background model. Then you:
(1) Train a speaker specific model by using the background model as prior
and applying a maximum a posteriori (MAP) algorithm on some training data for
that speaker.
(2) The "score" that is then used in the final decision stage is the
likelihood ratio between the speaker specific and the speaker-independent
models.

If you do this, the accuracy is very insensitive to the number of GMM
components. (I think this is a nice practical illustration of what Jaynes
said in PTLOS: "A folk theorem".)
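A toy sketch of the scoring part of this recipe, with diagonal-covariance GMMs (the MAP training step is omitted, and all names and shapes here are illustrative):

```python
import numpy as np

def gmm_loglik(X, weights, means, variances):
    """Per-frame log-likelihood of X (T, D) under a diagonal-covariance GMM,
    treating frames as independent, i.e. exchangeable."""
    per_component = []
    for w, m, v in zip(weights, means, variances):
        diff = X - m
        ll = -0.5 * (np.sum(diff * diff / v, axis=1)
                     + np.sum(np.log(2.0 * np.pi * v)))
        per_component.append(np.log(w) + ll)
    return np.logaddexp.reduce(np.stack(per_component), axis=0)

def detection_score(X, speaker_gmm, background_gmm):
    """Average per-frame log-likelihood ratio of the speaker model
    vs the speaker-independent background model."""
    return float(np.mean(gmm_loglik(X, *speaker_gmm)
                         - gmm_loglik(X, *background_gmm)))
```

In a real system the speaker GMM would be MAP-adapted from the background model and would have hundreds to thousands of components; here each model is just any (weights, means, variances) triple.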

> There are two approaches to improving this model, both of which have
> problems:
>
> 1. Auto-regress current feature values on previous feature values.

This was tried some time ago for text-independent speaker recognition (for
example by Bimbot), but seemed not to have borne much fruit.

> This doesn't work as well for recognition itself, as the estimated
> first and second derivatives seem to do a better job of discrimination.
> (The estimated derivatives are calculated from both past and FUTURE
> feature values, using a time delay.)

Sure, these so-called delta features (typically calculated over two past and
two future frames) are used in almost every speaker recognition machine.
(In the Odyssey conference I saw some examples of carefully selected delta
features with much longer time-spans that seemed to carry good speaker
information. I have to study this in more detail.)

> 2. Build a maximum-entropy model that accounts for the definitions of
> the estimated derivatives in terms of the other features, and thereby
> also accounts for dependencies. This requires computation of certain
> normalization constants, and the computation turns out to be very
> demanding.

This is probably ultimately the correct route to go.

> So how do you deal with this problem of exaggerating the effective
> amount of data when dependencies are ignored? Have you found a
> computationally tractable way of accounting for the dependencies?

The answer to the last question is "no", but there is a way of "dealing with
this problem":
A. Regard the model likelihood ratio, see my (2) above, as simply being a
"score" that carries information about the unknown speaker hypothesis.
B. Then apply a secondary modeling stage to obtain an hypothesis likelihood
ratio:
p(score|H1, secondary model) / p(score|H2, secondary model).

The modeling problem in B is much easier because the score is only
one-dimensional. It must be stressed that if only the score is used as input
for stage B, then (by the "data processing inequality") no information can be
gained in step B. The information that is lost because of the frame
exchangeability of the GMMs cannot be restored in this way. What B does do is
help one to make better decisions, because, if done correctly, it will not
give ridiculously large likelihood ratios.
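Stage B can be sketched with the simplest parametric choice: one Gaussian per hypothesis, fitted by maximum likelihood to held-out scores (the Gaussian form and the toy score distributions are illustrative assumptions, not what any particular system uses):

```python
import numpy as np

def fit_gaussian(scores):
    """Maximum-likelihood fit of a 1-D Gaussian: returns (mean, variance)."""
    return float(np.mean(scores)), float(np.var(scores))

def log_gauss(x, mean, var):
    return -0.5 * (np.log(2.0 * np.pi * var) + (x - mean) ** 2 / var)

def secondary_llr(score, target_params, nontarget_params):
    """log p(score|H1, secondary model) - log p(score|H2, secondary model)."""
    return log_gauss(score, *target_params) - log_gauss(score, *nontarget_params)

# Toy held-out scores (illustrative):
rng = np.random.default_rng(0)
target_scores = rng.normal(2.0, 1.0, 1000)       # same-speaker trials
nontarget_scores = rng.normal(-1.0, 1.0, 1000)   # different-speaker trials
t_params = fit_gaussian(target_scores)
n_params = fit_gaussian(nontarget_scores)
```

With equal-variance Gaussians the score-to-LLR mapping is roughly linear; with heavier-tailed or mixture fits, extreme scores in the tails can behave much worse, as the last message in this thread points out.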

(If anybody is interested, I can send them my Odyssey paper which explains
some of these issues in more detail.)

Niko
Message 4 of 5, Jun 24, 2004
Niko Brummer wrote:

>The assignment of a probability model that takes speech frames as
>interchangeable is of course done from a state of knowledge which
>effectively "ignores" temporal relationships.
>
But that is not, in fact, our state of knowledge. We KNOW that frames
are correlated, because of the overlap of the feature-extraction windows
and the response time-scale of the human vocal apparatus. Furthermore,
knowing the definition of derived (estimated derivative) features in
terms of base features, we have large portions of the complete feature
vector that we know to be actual functions of the preceding feature
vectors. The point is, it's often acceptable to ignore some of the
information we have if it doesn't make a large difference, but in this
case the information we are ignoring DOES make a large difference.

>In the much simpler problem of speaker detection, the number of mixture
>components often turns out to be unimportant. In "text-independent" flavour
>of speaker recognition which I am involved with, even the temporal
>dependence given by an HMM does not seem to help. It has been shown
>experimentally by many researchers that you can do best by just using a
>degenerate HMM, simply called a Gaussian Mixture Model (GMM) which has no
>state transition probabilities. This makes the speech frames completely
>interchangeable. You simply use one huge GMM (typically 256 to 2048
>components) for a speaker model and a similar GMM for a speaker independent
>background model. Then you:
>(1) Train a speaker specific model by using the background model as prior
>and applying a maximum aposteriori (MAP) algorithm on some training data for
>that speaker.
>(2) The "score" that is then used in the final decision stage is the
>likelihood ratio between the speaker specific and the speaker-independent
>models.
>

Yes, but you still have a problem of exaggerating the frame evidence by
ignoring dependencies. I predict that, as a result, if you interpret
your score as an odds ratio, you will be overconfident -- you will tend
to compute probabilities of almost exactly 1 or 0 that you have a
speaker match. (Does this match your experience?) Given the experience
in speech recognition with language models, I would also suggest that
this exaggeration of the frame evidence may be swamping any contribution
from transition probabilities, which is why they don't seem to help. As
a purely heuristic hack (because doing the right thing will take a lot
of thought and may be computationally difficult), you could attenuate
the acoustic evidence -- i.e., use

a * log(frame likelihood) + log(transition probabilities)

where 0 < a < 1. Something similar to this is done in speech
recognition for language models, which otherwise have no effect on
recognition accuracy.
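The heuristic above could be sketched as follows (the default scale factor and the toy values are illustrative):

```python
def attenuated_score(frame_logliks, log_transition_probs, a=0.3):
    """Heuristic fix for exaggerated frame evidence: scale the summed
    acoustic log-likelihoods by a factor 0 < a < 1 before adding the
    transition log-probabilities, so the latter are not swamped."""
    assert 0.0 < a < 1.0
    return a * sum(frame_logliks) + sum(log_transition_probs)
```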

>>So how do you deal with this problem of exaggerating the effective
>>amount of data when dependencies are ignored? Have you found a
>>computationally tractable way of accounting for the dependencies?
>>
>>
>
>The answer to the last question is "no", but there is a way of "dealing with
>this problem":
>A. Regard the model likelihood ratio, see my (2) above, as simply being a
>"score" that carries information about the unknown speaker hypothesis.
>B. Then apply a secondary modeling stage to obtain an hypothesis likelihood
>ratio:
>p(score|H1, secondary model) / p(score|H2, secondary model).
>

How does this secondary modeling stage work? Do you just treat the
results of the first stage as a black box and fit some parametric form
to p(score | H, S.M.) ?

>(If anybody is interested, I can send them my Odyssey paper which explains
>some of these issues in more detail.)
>
Yes, I would like to see this.
Message 5 of 5, Jun 25, 2004

> >The assignment of a probability model that takes speech frames as
> >interchangeable is of course done from a state of knowledge which
> >effectively "ignores" temporal relationships.
> >
> But that is not, in fact, our state of knowledge. We KNOW that frames
> are correlated,

I agree 100%.

> Yes, but you still have a problem of exaggerating the frame evidence by
> ignoring dependencies. I predict that, as a result, if you interpret
> your score as an odds ratio, you will be overconfident -- you will tend
> to compute probabilities of almost exactly 1 or 0 that you have a
> speaker match. (Does this match your experience?)

1. In the speaker detection literature, the score is almost never
interpreted as an odds ratio or directly used to calculate a posterior.
Rather, the game that is played is to empirically select a score threshold
that gives minimum empirical error (or, more generally, minimum average
cost over some tuning data).
2. This raw model log-likelihood score is most often subjected to a further
normalization stage, usually involving a "cohort" of models from other
speakers. This tends to normalize the non-target score distribution to have
unit variance. In practice, therefore, you don't see these huge
log-likelihoods.
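The cohort normalization in (2) can be sketched as a z-norm-style transformation (a sketch only; real systems differ in how the cohort scores are obtained):

```python
import numpy as np

def cohort_normalize(raw_score, cohort_scores):
    """Shift and scale a raw log-likelihood score by the mean and standard
    deviation of scores from a cohort of other-speaker models, so that the
    non-target score distribution has roughly zero mean and unit variance."""
    mu = float(np.mean(cohort_scores))
    sigma = float(np.std(cohort_scores))
    return (raw_score - mu) / sigma
```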

> Given the experience in speech recognition with language models, I
> would also suggest that this exaggeration of the frame evidence may be
> swamping any contribution from transition probabilities, which is why
> they don't seem to help.

Agreed.

> >B. Then apply a secondary modeling stage to obtain an hypothesis
> >likelihood ratio:
> >p(score|H1, secondary model) / p(score|H2, secondary model).
> >
>
> How does this secondary modeling stage work? Do you just treat the
> results of the first stage as a black box and fit some parametric form
> to p(score | H, S.M.) ?

1. Half the speaker detection community don't do this. They simply threshold
the score to make decisions. The thresholds are of course prior and cost
dependent. This practice is to some extent encouraged by the yearly NIST
speaker recognition evaluations, which year after year evaluate the
technology via the average cost of decisions over some evaluation database,
each year with the same prior and costs.
2. The other half of the community work on so-called "forensic" speaker
recognition where the ideal is to present the judge/jury with an odds ratio
based on the speech evidence (weight of evidence). It is then implied that
the judge/jury should use a suitable prior to get a posterior and then
threshold this posterior with a principle like "beyond a reasonable doubt"
:-). In the last few years the forensic workers have published a few papers
on how to do this secondary modeling stage. And yes, they use various
methods to fit (usually by maximum likelihood) various parametric forms
(usually Gaussian mixture models) to the score distributions. But I see a
serious problem with this kind of modeling: for extreme score values (which
can easily be obtained for speech samples acquired over previously unseen
speech channels) in the tails of the distributions, anything can happen. A
"log odds ratio" of large magnitude and arbitrary sign can be obtained,
which is clearly not satisfactory.

My paper addressed the problem of evaluating the work of the second group.
The NIST evaluations evaluate quality of decisions, but not the "quality of
odds ratios". I'll send you (Kevin) a copy in a separate email.

Niko