Don't hit me, but...
Some years ago, I worked for a startup called The Mission Corporation, which intended to provide access to online government services, healthcare, shopping, and entertainment via natural speech. The user interacted with an animated spokesperson for each service -- for setting up a medical appointment, say, they'd be speaking to an "admitting nurse". The service would be delivered over the Internet, but using a TV and a set-top box as the platform.
I worked on goal recognition -- determining the user's purpose from what they said. We used a limited-vocabulary speech recognizer from the Oregon Graduate Institute. My plan was to do two passes on each speech segment: the first pass would use a vocabulary set designed to coarsely sort the utterance into one of the main categories (those listed above); once the main category was chosen, the same utterance would be re-run through a category-specific classifier. I did a small demo using just decision trees. These output only the "best" category, not a probability per category, but I always said they had to be replaced by a probabilistic classifier that could also detect classifier failure or problems with the input, and fall back to simpler question-and-answer fixups. (Detection of classifier failure and input errors is a recommendation from Eric Horvitz of MSR -- look for the Bayesian Receptionist project.)
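To make the two-pass idea concrete, here's a toy sketch of what I mean -- coarse categorization, then a category-specific pass, with a confidence threshold that triggers a question-and-answer fallback instead of guessing. The category names, keyword tables, and the 0.6 threshold are all invented for illustration; the real system used a speech recognizer's output, not keyword overlap.

```python
# Toy two-pass goal recognizer with a probabilistic fallback.
# All vocabularies and the threshold are hypothetical.

# Pass 1: coarse keyword sets per top-level category.
COARSE = {
    "healthcare": {"doctor", "appointment", "nurse", "prescription"},
    "shopping":   {"buy", "order", "price", "cart"},
    "government": {"license", "tax", "permit", "renew"},
}

# Pass 2: finer goals within each category.
FINE = {
    "healthcare": {
        "schedule_appointment": {"appointment", "schedule", "doctor"},
        "refill_prescription":  {"refill", "prescription", "pharmacy"},
    },
    "shopping": {
        "place_order": {"buy", "order", "cart"},
        "check_price": {"price", "cost", "much"},
    },
    "government": {
        "renew_license": {"renew", "license"},
        "pay_taxes":     {"tax", "taxes", "pay"},
    },
}

def classify(vocab_sets, words, threshold=0.6):
    """Score each label by keyword overlap, normalize to a pseudo-probability,
    and return (label, confidence). Returns (None, conf) when no label is
    confident enough -- the signal to fall back to a Q&A fixup."""
    scores = {label: len(words & kws) for label, kws in vocab_sets.items()}
    total = sum(scores.values())
    if total == 0:
        return None, 0.0
    label, best = max(scores.items(), key=lambda kv: kv[1])
    conf = best / total
    return (label, conf) if conf >= threshold else (None, conf)

def recognize_goal(utterance):
    words = set(utterance.lower().split())
    category, _ = classify(COARSE, words)
    if category is None:
        return ("fallback", "which service do you need?")
    goal, _ = classify(FINE[category], words)
    if goal is None:
        return ("fallback", f"what would you like to do in {category}?")
    return (category, goal)
```

The point of the threshold is exactly the failure detection mentioned above: an ambiguous utterance (say, one that matches "renew" and "buy" equally) drops below confidence and routes to a clarifying question rather than a wrong guess.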
The bulk of the effort went into 3D graphics -- trying to make realistic animated avatars. Those of us on the machine learning side considered this putting the cart before the horse -- get the interaction right and forget the realism. Those of us with some HCI background considered it wrong from a usability standpoint too -- people are much more forgiving of mistakes by a machine if it's *obvious* that it's a machine, if its presentation is cute or whimsical (i.e. not presented as "superior" to the user, nor as "knowing all"), and if it acknowledges and apologizes for its mistakes. We said, use a simple cartoon -- preferably something a bit goofy.
And we always worried that the need for this sort of hand-holding was ephemeral -- maybe it would be better to put the effort into a mixed interface (voice plus other controls) without the avatar, into training people on the interface, and into getting user feedback to improve the experience.
Why can I tell you this? Because TMC is a dot-gone. It died partly of internal competition, secrecy, and infighting, but also because the bulk of the work and funding went into the 3D graphics rather than getting a simple working system out the door. So I'd be more impressed by something where the effort wasn't spent so heavily on presentation and visuals. I'd rather have Mr. Clippy, if he were capable of accurately detecting that the user was having a problem and offering useful solutions. IMO, that's where the work is needed.