
Lesson learnt

  • Niko Brummer (Message 1 of 5, Jun 18, 2004)
      I have just returned from "Odyssey04: The Speaker and Language Recognition
      Workshop" (http://www.odyssey04.org)
      where I presented my first paper that makes use of probability theory as
      logic. The "speaker detection" problem (called "speaker verification" in
      commercial applications) is the topic of primary interest in this community.
      It is a clearly defined binary hypothesis testing problem:

      H1: The speakers in the two given speech segments are the same.
      H2: The speakers in the two given speech segments are different.

      The primary problem of the speaker recognition engineer is to build a
      machine to compute the likelihood ratio / Bayes factor / odds ratio (or
      whatever you want to call it):

      p( data | H1, state-of-knowledge ) / p( data | H2, state-of-knowledge )


      But, as always, there are difficult questions about the prior P(H1 |
      state-of-knowledge). The engineer can respond:
      1. This is not my problem - it is the problem of the user and depends on the
      specific application.
      2. Or, as I did: If the user has "no relevant prior knowledge", simply take
      the maxent prior distribution P(H1 | K) = 0.5.
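
      (For concreteness, here is a minimal sketch, in Python and not from the
      original post, of how such a likelihood ratio would be combined with a
      prior; with the maxent choice P(H1 | K) = 0.5 the posterior odds are just
      the likelihood ratio itself.)

      def posterior_h1(likelihood_ratio, prior_h1=0.5):
          # posterior odds = likelihood ratio * prior odds (Bayes' rule in odds form);
          # with prior_h1 = 0.5 the prior odds are 1, so the posterior odds equal the LR
          prior_odds = prior_h1 / (1.0 - prior_h1)
          posterior_odds = likelihood_ratio * prior_odds
          return posterior_odds / (1.0 + posterior_odds)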

      In the long discussions that followed I realized that (2) is incomplete. (I
      did not apply the principle of indifference carefully enough!) It is NOT a
      good idea to say:

      "I don't know" --> P(H1 | K) = P(H2 | K).

      To be able to say this, one must be much more specific. Only if you can
      truly say that your state of prior knowledge leaves the two hypotheses
      exchangeable can you assign
      P(H1 | K) = P(H2 | K).

      This became clear to me because in most practical applications of speaker
      detection the hypotheses cannot reasonably be exchanged. If it bothers you
      to exchange the hypotheses, it means you do have reason to prefer the one
      over the other, and some attempt must be made to numerically quantify
      this preference.

      Niko
    • Kevin S. Van Horn (Message 2 of 5, Jun 18, 2004)
        Niko Brummer wrote:

        >H1: The speakers in the two given speech segments are the same.
        >H2: The speakers in the two given speech segments are different.
        >
        >The primary problem of the speaker recognition engineer is to build a
        >machine to compute the likelihood ratio / Bayes factor / odds ratio (or
        >whatever you want to call it):
        >
        >p( data | H1, state-of-knowledge ) / p( data | H2, state-of-knowledge )
        >

        When I tried to do similar things in the context of speech (NOT speaker)
        recognition (specifically, choosing the number of mixture components to
        use in a Gaussian mixture for the emission pdf of an HMM), I ran into a
        problem with the faulty independence assumption that is pervasive in
        speech recognition circles. That is, feature vectors at different times
        are considered independent, conditional on the HMM state. This
        assumption is not even roughly true, especially when you add dynamic
        features (estimated first and second derivatives of base feature values).
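
        (To make the point concrete: under that conditional-independence
        assumption the segment log-likelihood is just a sum of per-frame
        log-likelihoods, so every frame is counted as an independent
        observation. A minimal sketch, assuming per-frame log-likelihoods from
        some frame model, not Kevin's actual system:)

        import numpy as np

        def segment_loglik(frame_logliks):
            # Conditional-independence assumption: log p(x_1..x_T | state) is the
            # sum of per-frame terms, so correlated frames are counted as if they
            # were independent and the apparent evidence grows with the frame count.
            return np.sum(frame_logliks)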

        The net effect of this faulty model is that the effective amount of data
        you have is exaggerated. Thus when you try to do model comparison to
        choose the number of mixture components using this faulty model, it
        favors use of too many mixture components. There are two approaches to
        improving this model, both of which have problems:

        1. Auto-regress current feature values on previous feature values. This
        doesn't work as well for recognition itself, as the estimated first and
        second derivatives seem to do a better job of discrimination. (The
        estimated derivatives are calculated from both past and FUTURE feature
        values, using a time delay.)

        2. Build a maximum-entropy model that accounts for the definitions of
        the estimated derivatives in terms of the other features, and thereby
        also accounts for dependencies. This requires computation of certain
        normalization constants, and the computation turns out to be very demanding.

        So how do you deal with this problem of exaggerating the effective
        amount of data when dependencies are ignored? Have you found a
        computationally tractable way of accounting for the dependencies?
      • Niko Brummer (Message 3 of 5, Jun 21, 2004)
          Kevin, please see below for several comments:

          > When I tried to do similar things in the context of speech (NOT speaker)
          > recognition (specifically, choosing the number of mixture components to
          > use in a Gaussian mixture for the emission pdf of an HMM), I ran into a
          > problem with the faulty independence assumption that is pervasive in
          > speech recognition circles. That is, feature vectors at different times
          > are considered independent, conditional on the HMM state. This
          > assumption is not even roughly true, especially when you add dynamic
          > features (estimated first and second derivatives of base feature values).

          Just a quick comment on the use of the term "assumption" and the concept of
          the "truth" of this assumption:
          The best thing for me about probability theory as logic was that it made me
          feel a bit better about using probability as a tool. When one speaks of
          "estimating" probability distributions (under faulty assumptions) one always
          feels apologetic about the tool. It feels so much better to say: "This is my
          state of knowledge, from which I ASSIGN probability distributions, and this
          is the best we can do from this state of knowledge." Then again, this is of
          course no excuse not to try to improve your state of knowledge. The
          assignment of a probability model that takes speech frames as
          interchangeable is of course done from a state of knowledge which
          effectively "ignores" temporal relationships.

          > The net effect of this faulty model is that the effective amount of data
          > you have is exaggerated.

          Agreed. See my final comment on how I deal with this problem.

          > Thus when you try to do model comparison to choose the number of mixture
          > components using this faulty model, it favors use of too many mixture
          > components.

          In the much simpler problem of speaker detection, the number of mixture
          components often turns out to be unimportant. In the "text-independent"
          flavour of speaker recognition which I am involved with, even the temporal
          dependence given by an HMM does not seem to help. It has been shown
          experimentally by many researchers that you can do best by just using a
          degenerate HMM, simply called a Gaussian Mixture Model (GMM), which has no
          state transition probabilities. This makes the speech frames completely
          interchangeable. You simply use one huge GMM (typically 256 to 2048
          components) for a speaker model and a similar GMM for a speaker-independent
          background model. Then you:
          (1) Train a speaker-specific model by using the background model as prior
          and applying a maximum a posteriori (MAP) algorithm on some training data for
          that speaker.
          (2) The "score" that is then used in the final decision stage is the
          likelihood ratio between the speaker-specific and the speaker-independent
          models.

          If you do this, the accuracy is very insensitive to the number of GMM
          components. (I think this is a nice practical illustration of what Jaynes
          said in PTLOS: "A folk theorem".)
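
          (As a minimal sketch of the scoring in step (2), assuming two
          already-trained GMMs, e.g. scikit-learn's GaussianMixture, whose
          score_samples method returns per-frame log-likelihoods; this is an
          illustration, not the exact system described above:)

          import numpy as np
          from sklearn.mixture import GaussianMixture

          def gmm_ubm_score(frames, speaker_gmm: GaussianMixture, ubm: GaussianMixture):
              # Raw detection "score": average per-frame log-likelihood ratio between
              # the MAP-adapted speaker model and the speaker-independent background model.
              return np.mean(speaker_gmm.score_samples(frames) - ubm.score_samples(frames))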

          > There are two approaches to improving this model, both of which have
          > problems:
          >
          > 1. Auto-regress current feature values on previous feature values.

          This was tried some time ago for text-independent speaker recognition (for
          example by Bimbot), but seemed not to have borne much fruit.


          > This doesn't work as well for recognition itself, as the estimated first
          > and second derivatives seem to do a better job of discrimination. (The
          > estimated derivatives are calculated from both past and FUTURE feature
          > values, using a time delay.)

          Sure, these so-called delta features (typically calculated over two past and
          two future frames) are used in almost every speaker recognition machine.
          (At the Odyssey conference I saw some examples of carefully selected delta
          features with much longer time-spans that seemed to carry good speaker
          information. I have to study this in more detail.)
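
          (For reference, the usual delta computation is a linear regression over a
          few neighbouring frames; a minimal sketch with a two-frame window on either
          side, not tied to any particular toolkit:)

          import numpy as np

          def delta_features(frames, N=2):
              # frames: array of shape (T, d) of base feature vectors (e.g. cepstra).
              # Standard regression: d_t = sum_n n*(c_{t+n} - c_{t-n}) / (2*sum_n n^2),
              # with the edges handled by repeating the first and last frames.
              padded = np.pad(frames, ((N, N), (0, 0)), mode="edge")
              denom = 2.0 * sum(n * n for n in range(1, N + 1))
              T = frames.shape[0]
              deltas = np.zeros(frames.shape, dtype=float)
              for n in range(1, N + 1):
                  deltas += n * (padded[N + n:N + n + T] - padded[N - n:N - n + T])
              return deltas / denom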

          > 2. Build a maximum-entropy model that accounts for the definitions of
          > the estimated derivatives in terms of the other features, and thereby
          > also accounts for dependencies. This requires computation of certain
          > normalization constants, and the computation turns out to be very demanding.

          This is probably the ultimately correct route to go.

          > So how do you deal with this problem of exaggerating the effective
          > amount of data when dependencies are ignored? Have you found a
          > computationally tractable way of accounting for the dependencies?

          The answer to the last question is "no", but there is a way of "dealing with
          this problem":
          A. Regard the model likelihood ratio (see my (2) above) as simply being a
          "score" that carries information about the unknown speaker hypothesis.
          B. Then apply a secondary modeling stage to obtain a hypothesis likelihood
          ratio:
          p(score|H1, secondary model) / p(score|H2, secondary model).

          The modeling problem in B is much easier because the score is only
          one-dimensional. It must be stressed that if only the score is used as input
          for stage B, then (by the "data processing inequality") no information can be
          gained in step B. The information that is lost because of the frame
          exchangeability of the GMMs cannot be restored in this way. What B does do is
          help one to make better decisions because, if done correctly, it will not give
          ridiculously large likelihood ratios.
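
          (A minimal sketch of one simple version of stage B, assuming a single
          Gaussian is fitted to each class of scores; this is an illustrative choice,
          not the method of the Odyssey paper:)

          import numpy as np
          from scipy.stats import norm

          def fit_score_models(target_scores, nontarget_scores):
              # Fit one Gaussian to the scores observed under each hypothesis.
              p1 = norm(np.mean(target_scores), np.std(target_scores))
              p2 = norm(np.mean(nontarget_scores), np.std(nontarget_scores))
              return p1, p2

          def score_llr(score, p1, p2):
              # log[ p(score | H1, secondary model) / p(score | H2, secondary model) ]
              return p1.logpdf(score) - p2.logpdf(score)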

          (If anybody is interested, I can send them my Odyssey paper which explains
          some of these issues in more detail.)

          Niko
        • Kevin S. Van Horn (Message 4 of 5, Jun 24, 2004)
            Niko Brummer wrote:

            >The assignment of a probability model that takes speech frames as
            >interchangeable is of course done from a state of knowledge which
            >effectively "ignores" temporal relationships.
            >
            But that is not, in fact, our state of knowledge. We KNOW that frames
            are correlated, because of the overlap of the feature-extraction windows
            and the response time-scale of the human vocal apparatus. Furthermore,
            knowing the definition of derived (estimated derivative) features in
            terms of base features, we have large portions of the complete feature
            vector that we know to be actual functions of the preceding feature
            vectors. The point is, it's often acceptable to ignore some of the
            information we have if it doesn't make a large difference, but in this
            case the information we are ignoring DOES make a large difference.

            >In the much simpler problem of speaker detection, the number of mixture
            >components often turns out to be unimportant. In the "text-independent"
            >flavour of speaker recognition which I am involved with, even the temporal
            >dependence given by an HMM does not seem to help. It has been shown
            >experimentally by many researchers that you can do best by just using a
            >degenerate HMM, simply called a Gaussian Mixture Model (GMM), which has no
            >state transition probabilities. This makes the speech frames completely
            >interchangeable. You simply use one huge GMM (typically 256 to 2048
            >components) for a speaker model and a similar GMM for a speaker-independent
            >background model. Then you:
            >(1) Train a speaker-specific model by using the background model as prior
            >and applying a maximum a posteriori (MAP) algorithm on some training data for
            >that speaker.
            >(2) The "score" that is then used in the final decision stage is the
            >likelihood ratio between the speaker-specific and the speaker-independent
            >models.
            >

            Yes, but you still have a problem of exaggerating the frame evidence by
            ignoring dependencies. I predict that, as a result, if you interpret
            your score as an odds ratio, you will be overconfident -- you will tend
            to compute probabilities of almost exactly 1 or 0 that you have a
            speaker match. (Does this match your experience?) Given the experience
            in speech recognition with language models, I would also suggest that
            this exaggeration of the frame evidence may be swamping any contribution
            from transition probabilities, which is why they don't seem to help. As
            a purely heuristic hack (because doing the right thing will take a lot
            of thought and may be computationally difficult), you could attenuate
            the acoustic evidence -- i.e., use

            a * log(frame likelihood) + log(transition probabilities)

            where 0 < a < 1. Something similar to this is done in speech
            recognition for language models, which otherwise have no effect on
            recognition accuracy.
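
            (A minimal sketch of the attenuation, with the scale factor treated purely
            as a tuning parameter; the value shown is illustrative only:)

            def combined_log_score(frame_loglik, transition_logprob, a=0.3):
                # Heuristic: down-weight the (exaggerated) acoustic evidence before
                # adding the transition log-probability; 0 < a < 1, tuned empirically.
                return a * frame_loglik + transition_logprob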

            >>So how do you deal with this problem of exaggerating the effective
            >>amount of data when dependencies are ignored? Have you found a
            >>computationally tractable way of accounting for the dependencies?
            >>
            >>
            >
            >The answer to the last question is "no", but there is a way of "dealing with
            >this problem":
            >A. Regard the model likelihood ratio (see my (2) above) as simply being a
            >"score" that carries information about the unknown speaker hypothesis.
            >B. Then apply a secondary modeling stage to obtain a hypothesis likelihood
            >ratio:
            >p(score|H1, secondary model) / p(score|H2, secondary model).
            >

            How does this secondary modeling stage work? Do you just treat the
            results of the first stage as a black box and fit some parametric form
            to p(score | H, S.M.) ?

            >(If anybody is interested, I can send them my Odyssey paper which explains
            >some of these issues in more detail.)
            >
            Yes, I would like to see this.
          • Niko Brummer (Message 5 of 5, Jun 25, 2004)
              Kevin, please see below:

              > >The assignment of a probability model that takes speech frames as
              > >interchangeable is of course done from a state of knowledge which
              > >effectively "ignores" temporal relationships.
              > >
              > But that is not, in fact, our state of knowledge. We KNOW that frames
              > are correlated,

              I agree 100%.

              > Yes, but you still have a problem of exaggerating the frame evidence by
              > ignoring dependencies. I predict that, as a result, if you interpret
              > your score as an odds ratio, you will be overconfident -- you will tend
              > to compute probabilities of almost exactly 1 or 0 that you have a
              > speaker match. (Does this match your experience?)

              Two comments here:
              1. In the speaker detection literature, the score is almost never
              interpreted as an odds ratio or directly used to calculate a posterior.
              Rather, the game that is played is to empirically select a score threshold
              that will give minimum empirical error (or, more generally, minimum average
              cost over some tuning data).
              2. This raw model log-likelihood score is most often subjected to a further
              normalization stage, usually involving a "cohort" of models from other
              speakers. This tends to normalize the non-target score distribution to have
              unit variance. In practice, therefore, you don't see these huge
              log-likelihoods.
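
              (A minimal sketch of such a cohort normalization, in the spirit of the
              "T-norm" family of methods; the details of a real system differ:)

              import numpy as np

              def cohort_normalize(raw_score, cohort_scores):
                  # Standardize the raw log-likelihood score against the scores the same
                  # test segment obtains from a cohort of other-speaker models, so the
                  # non-target score distribution has roughly zero mean and unit variance.
                  return (raw_score - np.mean(cohort_scores)) / np.std(cohort_scores)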

              > Given the experience in speech recognition with language models, I would
              > also suggest that this exaggeration of the frame evidence may be swamping
              > any contribution from transition probabilities, which is why they don't
              > seem to help.

              Agreed.

              > >B. Then apply a secondary modeling stage to obtain a hypothesis
              > >likelihood ratio:
              > >p(score|H1, secondary model) / p(score|H2, secondary model).
              > >
              >
              > How does this secondary modeling stage work? Do you just treat the
              > results of the first stage as a black box and fit some parametric form
              > to p(score | H, S.M.)?

              Two comments:
              1. Half the speaker detection community don't do this. They simply threshold
              the score to make decisions. The thresholds are of course prior and cost
              dependent. This practice is to some extent encouraged by the yearly NIST
              speaker recognition evaluations which year after year evaluate the
              technology via the average cost of decisions over some evaluation database,
              each year with the same prior and costs.
              2. The other half of the community work on so-called "forensic" speaker
              recognition, where the ideal is to present the judge/jury with an odds ratio
              based on the speech evidence (weight of evidence). It is then implied that
              the judge/jury should use a suitable prior to get a posterior and then
              threshold this posterior with a principle like "beyond a reasonable doubt"
              :-). In the last few years the forensic workers have published a few papers
              on how to do this secondary modeling stage. And yes, they use various
              methods to fit (usually by maximum likelihood) various parametric forms
              (usually Gaussian mixture models) to the score distributions. But I see a
              serious problem with this kind of modeling: for extreme score values (which
              can easily arise for speech samples acquired over previously unseen speech
              channels), out in the tails of the distributions, anything can happen. A
              "log odds ratio" of large magnitude and arbitrary sign can be obtained,
              which is clearly not satisfactory.
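
              (To illustrate the tail problem, assuming a single Gaussian fitted to each
              score class, with unequal variances; the numbers are made up:)

              from scipy.stats import norm

              p1 = norm(2.0, 1.0)   # hypothetical target score model
              p2 = norm(0.0, 1.5)   # hypothetical non-target score model (wider)

              # The log odds ratio is quadratic in the score, so far out in the tails
              # its magnitude explodes and its sign is decided by whichever fitted
              # variance is larger: an extreme "target-looking" score of 30 here gives
              # a huge NEGATIVE log odds ratio.
              for s in (3.0, 10.0, 30.0):
                  print(s, p1.logpdf(s) - p2.logpdf(s))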

              My paper addressed the problem of evaluating the work of the second group.
              The NIST evaluations evaluate the quality of decisions, but not the "quality
              of odds ratios". I'll send you (Kevin) a copy in a separate email.

              Niko