- I have just returned from "Odyssey04: The Speaker and Language Recognition

Workshop" (http://www.odyssey04.org)

where I presented my first paper that makes use of probability theory as

logic. The "speaker detection" problem (called "speaker verification" in

commercial applications) is the topic of primary interest in this community.

It is a clearly defined binary hypothesis testing problem:

H1: The speakers in the two given speech segments are the same.

H2: The speakers in the two given speech segments are different.

The primary problem of the speaker recognition engineer is to build a

machine to compute the likelihood ratio / Bayes factor / odds ratio (or

whatever you want to call it):

p( data | H1, state-of-knowledge ) / p( data | H2, state-of-knowledge )

But, as always, there are difficult questions about the prior P(H1 |

state-of-knowledge). The engineer can respond:

1. This is not my problem - it is the problem of the user and depends on the

specific application.

2. Or, as I did: If the user has "no relevant prior knowledge", simply take

the maxent prior distribution P(H1 | K) = 0.5.
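For concreteness, the likelihood ratio and the prior combine by Bayes' rule in odds form. A minimal Python sketch (the function name is mine, purely illustrative):

def posterior_p_h1(likelihood_ratio, prior_p_h1=0.5):
    """Posterior P(H1 | data, K) from the likelihood ratio and the prior P(H1 | K)."""
    prior_odds = prior_p_h1 / (1.0 - prior_p_h1)
    posterior_odds = likelihood_ratio * prior_odds
    return posterior_odds / (1.0 + posterior_odds)

# With the maxent prior P(H1 | K) = 0.5 the prior odds are 1, so the
# posterior odds equal the likelihood ratio itself.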

In the long discussions that followed I realized that (2) is incomplete. (I

did not apply the principle of indifference carefully enough!) It is NOT a

good idea to say:

"I don't know" --> P(H1 | K) = P(H2 | K).

To be able to say this, one must be much more specific. ONLY if you can

truly say that your state of prior knowledge leaves the two hypotheses

exchangeable can you assign

P(H1 | K) = P(H2 | K).

This became clear to me because in most practical applications of speaker

detection the hypotheses cannot be reasonably exchanged. If it bothers you

to exchange the hypotheses, it means you do have reason to prefer the one

over the other, and some attempt must be made to numerically quantify

this preference.

Niko


Niko Brummer wrote:

>H1: The speakers in the two given speech segments are the same.
>H2: The speakers in the two given speech segments are different.
>
>The primary problem of the speaker recognition engineer is to build a
>machine to compute the likelihood ratio / Bayes factor / odds ratio (or
>whatever you want to call it):
>
>p( data | H1, state-of-knowledge ) / p( data | H2, state-of-knowledge )
>

When I tried to do similar things in the context of speech (NOT speaker)
recognition (specifically, choosing the number of mixture components to

use in a Gaussian mixture for the emission pdf of an HMM), I ran into a

problem with the faulty independence assumption that is pervasive in

speech recognition circles. That is, feature vectors at different times

are considered independent, conditional on the HMM state. This

assumption is not even roughly true, especially when you add dynamic

features (estimated first and second derivatives of base feature values).

The net effect of this faulty model is that the effective amount of data

you have is exaggerated. Thus when you try to do model comparison to

choose the number of mixture components using this faulty model, it

favors use of too many mixture components. There are two approaches to

improving this model, both of which have problems:

1. Auto-regress current feature values on previous feature values. This

doesn't work as well for recognition itself, as the estimated first and

second derivatives seem to do a better job of discrimination. (The

estimated derivatives are calculated from both past and FUTURE feature

values, using a time delay.)

2. Build a maximum-entropy model that accounts for the definitions of

the estimated derivatives in terms of the other features, and thereby

also accounts for dependencies. This requires computation of certain

normalization constants, and the computation turns out to be very demanding.

So how do you deal with this problem of exaggerating the effective

amount of data when dependencies are ignored? Have you found a

computationally tractable way of accounting for the dependencies?

- Kevin


Please see below for several comments:

> When I tried to do similar things in the context of speech (NOT speaker)
> recognition (specifically, choosing the number of mixture components to
> use in a Gaussian mixture for the emission pdf of an HMM), I ran into a
> problem with the faulty independence assumption that is pervasive in
> speech recognition circles. That is, feature vectors at different times
> are considered independent, conditional on the HMM state. This
> assumption is not even roughly true, especially when you add dynamic
> features (estimated first and second derivatives of base feature values).

Just a quick comment on the use of the term "assumption" and the concept of
the "truth" of this assumption:

The best thing for me about probability theory as logic was to make me feel

a bit better about using probability as a tool. When using terms like

"estimating" probability distributions (under faulty assumptions) one always

feels apologetic about this tool. It feels so much better to say: "This is

my state of knowledge from which I ASSIGN probability distributions and this

is the best we can do from this state of knowledge." Then again, this is of

course no excuse to not try to improve your state of knowledge. The

assignment of a probability model that takes speech frames as

interchangeable is of course done from a state of knowledge which

effectively "ignores" temporal relationships.

> The net effect of this faulty model is that the effective amount of data
> you have is exaggerated.

Agreed. See my final comment on how I deal with this problem.

> Thus when you try to do model comparison to
> choose the number of mixture components using this faulty model, it
> favors use of too many mixture components.

In the much simpler problem of speaker detection, the number of mixture
components often turns out to be unimportant. In the "text-independent" flavour

of speaker recognition which I am involved with, even the temporal

dependence given by an HMM does not seem to help. It has been shown

experimentally by many researchers that you can do best by just using a

degenerate HMM, simply called a Gaussian Mixture Model (GMM) which has no

state transition probabilities. This makes the speech frames completely

interchangeable. You simply use one huge GMM (typically 256 to 2048

components) for a speaker model and a similar GMM for a speaker-independent

background model. Then you:

(1) Train a speaker specific model by using the background model as prior

and applying a maximum a posteriori (MAP) algorithm on some training data for

that speaker.

(2) The "score" that is then used in the final decision stage is the

likelihood ratio between the speaker specific and the speaker-independent

models.

If you do this, the accuracy is very insensitive to the number of GMM

components. (I think this is a nice practical illustration of what Jaynes

said in PTLOS: "A folk theorem".)
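For concreteness, here is a minimal numpy sketch of the scoring step in (2): the average per-frame log-likelihood ratio between a speaker GMM and the background GMM. (The MAP adaptation in (1) is omitted, and the function names and diagonal-covariance parameterisation are my own illustration, not any particular toolkit.)

import numpy as np

def gmm_loglik(frames, weights, means, variances):
    """Per-frame log-likelihood under a diagonal-covariance GMM.
    frames: (T, D); weights: (M,); means, variances: (M, D)."""
    diff = frames[:, None, :] - means[None, :, :]                 # (T, M, D)
    log_norm = -0.5 * np.log(2 * np.pi * variances).sum(axis=1)   # (M,)
    log_comp = log_norm[None, :] - 0.5 * (diff ** 2 / variances[None, :, :]).sum(axis=2)
    log_weighted = log_comp + np.log(weights)[None, :]            # (T, M)
    m = log_weighted.max(axis=1, keepdims=True)                   # log-sum-exp
    return m[:, 0] + np.log(np.exp(log_weighted - m).sum(axis=1))

def gmm_ubm_score(frames, speaker_gmm, background_gmm):
    """Average per-frame log-likelihood ratio: speaker model vs. background model."""
    return np.mean(gmm_loglik(frames, *speaker_gmm) - gmm_loglik(frames, *background_gmm))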

> There are two approaches to improving this model, both of which have problems:
>
> 1. Auto-regress current feature values on previous feature values.

This was tried some time ago for text-independent speaker recognition (for
example by Bimbot), but seemed not to have borne much fruit.

> This doesn't work as well for recognition itself, as the estimated first and
> second derivatives seem to do a better job of discrimination. (The
> estimated derivatives are calculated from both past and FUTURE feature
> values, using a time delay.)

Sure, these so-called delta features (typically calculated over two past and
two future frames) are used in almost every speaker recognition machine.
(In the Odyssey conference I saw some examples of carefully selected delta
features with much longer time-spans that seemed to carry good speaker
information. I have to study this in more detail.)
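For readers not familiar with them, here is a rough Python sketch of the usual regression-style delta computation over two past and two future frames (the window length and weights are the common textbook recipe, not necessarily what any specific system uses):

import numpy as np

def delta(features, K=2):
    """Regression-style first-derivative estimate over K past and K future frames.
    features: (T, D) array of base features (e.g. cepstra); returns (T, D)."""
    T = len(features)
    # Repeat the edge frames so that every frame has K neighbours on each side.
    padded = np.concatenate([np.repeat(features[:1], K, axis=0),
                             features,
                             np.repeat(features[-1:], K, axis=0)])
    denom = 2.0 * sum(k * k for k in range(1, K + 1))
    out = np.zeros_like(features, dtype=float)
    for k in range(1, K + 1):
        out += k * (padded[K + k : K + k + T] - padded[K - k : K - k + T])
    return out / denom

# Second derivatives ("delta-deltas") are simply the delta of the deltas.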

> 2. Build a maximum-entropy model that accounts for the definitions of
> the estimated derivatives in terms of the other features, and thereby
> also accounts for dependencies. This requires computation of certain
> normalization constants, and the computation turns out to be very demanding.

This is probably ultimately the correct route to go.

> So how do you deal with this problem of exaggerating the effective
> amount of data when dependencies are ignored? Have you found a
> computationally tractable way of accounting for the dependencies?

The answer to the last question is "no", but there is a way of "dealing with
this problem":

A. Regard the model likelihood ratio (see my (2) above) as simply being a

"score" that carries information about the unknown speaker hypothesis.

B. Then apply a secondary modeling stage to obtain an hypothesis likelihood

ratio:

p(score|H1, secondary model) / p(score|H2, secondary model).

The modeling problem in B is much easier because the score is only one

dimensional. It must be stressed that if only the score is used as input for

stage B, then (by the "data processing inequality") no information can be gained

in step B. The information that is lost because of the frame exchangeability

of the GMM's cannot be restored in this way. What B does do is to help one

to make better decisions, because if done correctly, it will not give

ridiculously large likelihood ratios.
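To sketch what stage B can look like in practice: fit a simple parametric form to target and non-target scores on held-out trials and read off the ratio. Here is a deliberately simple single-Gaussian version in Python (the function names are mine; as comes up later in this thread, the behaviour of such fits in the tails is a real concern):

import numpy as np

def fit_gaussian(scores):
    """Fit a 1-D Gaussian (mean, std) to a set of held-out calibration scores."""
    return np.mean(scores), np.std(scores)

def log_pdf(x, mu, sigma):
    return -0.5 * np.log(2 * np.pi * sigma ** 2) - (x - mu) ** 2 / (2 * sigma ** 2)

def secondary_log_lr(score, target_params, nontarget_params):
    """log p(score | H1, secondary model) - log p(score | H2, secondary model)."""
    return log_pdf(score, *target_params) - log_pdf(score, *nontarget_params)

# Usage: fit on held-out same-speaker and different-speaker trial scores,
# then map each new raw score to a calibrated log-likelihood ratio.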

(If anybody is interested, I can send them my Odyssey paper which explains

some of these issues in more detail.)

Niko


Niko Brummer wrote:

>The assignment of a probability model that takes speech frames as
>interchangeable is of course done from a state of knowledge which
>effectively "ignores" temporal relationships.
>

But that is not, in fact, our state of knowledge. We KNOW that frames
are correlated, because of the overlap of the feature-extraction windows
and the response time-scale of the human vocal apparatus. Furthermore,
knowing the definition of derived (estimated derivative) features in
terms of base features, we have large portions of the complete feature
vector that we know to be actual functions of the preceding feature
vectors. The point is, it's often acceptable to ignore some of the
information we have if it doesn't make a large difference, but in this
case the information we are ignoring DOES make a large difference.

>In the much simpler problem of speaker detection, the number of mixture
>components often turns out to be unimportant. In the "text-independent" flavour
>of speaker recognition which I am involved with, even the temporal
>dependence given by an HMM does not seem to help. It has been shown
>experimentally by many researchers that you can do best by just using a
>degenerate HMM, simply called a Gaussian Mixture Model (GMM) which has no
>state transition probabilities. This makes the speech frames completely
>interchangeable. You simply use one huge GMM (typically 256 to 2048
>components) for a speaker model and a similar GMM for a speaker-independent
>background model. Then you:
>(1) Train a speaker specific model by using the background model as prior
>and applying a maximum a posteriori (MAP) algorithm on some training data for
>that speaker.
>(2) The "score" that is then used in the final decision stage is the
>likelihood ratio between the speaker specific and the speaker-independent
>models.
>

Yes, but you still have a problem of exaggerating the frame evidence by
ignoring dependencies. I predict that, as a result, if you interpret

your score as an odds ratio, you will be overconfident -- you will tend

to compute probabilities of almost exactly 1 or 0 that you have a

speaker match. (Does this match your experience?) Given the experience

in speech recognition with language models, I would also suggest that

this exaggeration of the frame evidence may be swamping any contribution

from transition probabilities, which is why they don't seem to help. As

a purely heuristic hack (because doing the right thing will take a lot

of thought and may be computationally difficult), you could attenuate

the acoustic evidence -- i.e., use

a * log(frame likelihood) + log(transition probabilities)

where 0 < a < 1. Something similar to this is done in speech

recognition for language models, which otherwise have no effect on

recognition accuracy.
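To make the hack concrete, a tiny Python sketch of how the attenuation would enter a single-path score accumulation; the weight a and the function name are illustrative only:

import numpy as np

def attenuated_path_score(frame_logliks, log_transitions, a=0.3):
    """Path score with the acoustic (frame) evidence attenuated by 0 < a < 1.
    frame_logliks: (T,) log p(frame_t | state on the path);
    log_transitions: (T-1,) log transition probabilities along the path."""
    return a * np.sum(frame_logliks) + np.sum(log_transitions)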

>>So how do you deal with this problem of exaggerating the effective
>>amount of data when dependencies are ignored? Have you found a
>>computationally tractable way of accounting for the dependencies?
>
>The answer to the last question is "no", but there is a way of "dealing with
>this problem":
>A. Regard the model likelihood ratio (see my (2) above) as simply being a
>"score" that carries information about the unknown speaker hypothesis.
>B. Then apply a secondary modeling stage to obtain an hypothesis likelihood
>ratio:
>p(score|H1, secondary model) / p(score|H2, secondary model).
>

How does this secondary modeling stage work? Do you just treat the
results of the first stage as a black box and fit some parametric form
to p(score | H, S.M.) ?

>(If anybody is interested, I can send them my Odyssey paper which explains
>some of these issues in more detail.)
>

Yes, I would like to see this.

- Kevin


Please see below:

> >The assignment of a probability model that takes speech frames as
> >interchangeable is of course done from a state of knowledge which
> >effectively "ignores" temporal relationships.
> >
> But that is not, in fact, our state of knowledge. We KNOW that frames
> are correlated,

I'm 100% agreed.

> Yes, but you still have a problem of exaggerating the frame evidence by
> ignoring dependencies. I predict that, as a result, if you interpret
> your score as an odds ratio, you will be overconfident -- you will tend
> to compute probabilities of almost exactly 1 or 0 that you have a
> speaker match. (Does this match your experience?)

Two comments here:

1. In the speaker detection literature, the score is almost never

interpreted as an odds ratio or directly used to calculate a posterior.

Rather, the game that is played is to empirically select a score threshold

that will give minimum empirical error (or more generally minimum average

cost over some tuning data).

2. This raw model log-likelihood score is most often subjected to a further

normalization stage usually involving a "cohort" of models from other

speakers. This tends to normalize the non-target score distribution to have

unity variance. In practice therefore you don't see these huge

log-likelihoods.
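For illustration, a minimal Python sketch in the spirit of this cohort-style normalization (my own simplified version, not a specific published recipe):

import numpy as np

def cohort_normalize(raw_score, cohort_scores):
    """Normalize a raw log-likelihood score against a cohort of other speakers' models.
    cohort_scores: scores of the same test segment against the cohort models,
    used as an estimate of the non-target score distribution."""
    mu = np.mean(cohort_scores)
    sigma = np.std(cohort_scores)
    return (raw_score - mu) / sigma   # non-target scores -> roughly zero mean, unit variance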

> Given the experience
> in speech recognition with language models, I would also suggest that
> this exaggeration of the frame evidence may be swamping any contribution
> from transition probabilities, which is why they don't seem to help.

Agreed.

> >B. Then apply a secondary modeling stage to obtain an hypothesis likelihood
> >ratio:
> >p(score|H1, secondary model) / p(score|H2, secondary model).
> >
>
> How does this secondary modeling stage work? Do you just treat the
> results of the first stage as a black box and fit some parametric form
> to p(score | H, S.M.) ?

Two comments:

1. Half the speaker detection community don't do this. They simply threshold

the score to make decisions. The thresholds are of course prior and cost

dependent. This practice is to some extent encouraged by the yearly NIST

speaker recognition evaluations which year after year evaluate the

technology via the average cost of decisions over some evaluation database,

each year with the same prior and costs.

2. The other half of the community work on so-called "forensic" speaker

recognition where the ideal is to present the judge/jury with an odds ratio

based on the speech evidence (weight of evidence). It is then implied that

the judge/jury should use a suitable prior to get a posterior and then

threshold this posterior with a principle like "beyond a reasonable doubt"

:-). In the last few years the forensic workers have published a few papers

on how to do this secondary modeling stage. And yes, they use various

methods to fit (usually by maximum likelihood) various parametric forms

(usually Gaussian mixture models) to the score distributions. But I see a

serious problem with this kind of modeling: For extreme score values (which

can easily occur for speech samples acquired over previously unseen speech

channels) in the tails of the distributions, anything can happen. A "log

odds ratio" of large magnitude and arbitrary sign can be obtained, which is

clearly not satisfactory.

My paper addressed the problem of evaluating the work of the second group.

The NIST evaluations evaluate quality of decisions, but not the "quality of

odds ratios". I'll send you (Kevin) a copy in a separate email.

Niko