Re: Methodology for assessing results from multiple search engines?
May 1, 2008

Hi Lee,
Have you looked at the methodology adopted by the IR community, e.g.
TREC (the Text REtrieval Conference)?
I'm not sure how applicable it is to environments where you can't
control most aspects of the setup (e.g. the documents, the queries,
the relevance judgements, and so on) but if it's 'scientific rigour'
you're after then TREC is certainly the way to go.
It's been a while since I looked at any of the TREC data but as I
recall much of it should be usable straight out of the box. A
key part of the methodology is to use a standardised test collection,
and apply relevance judgements that were acquired *in advance*, so
that they are independent of any particular test run (and hence avoid
any 'post-hoc' rationalisation).
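To make that concrete, here's a rough sketch (the queries, document ids, and function names are my own invention, not TREC tooling) of scoring one engine's run against judgements that were fixed in advance:

```python
# Hypothetical sketch: score one engine's results against fixed,
# pre-acquired relevance judgements ("qrels"), so every test run is
# measured against the same independent ground truth.

# qrels: query -> set of document ids judged relevant in advance
qrels = {
    "expense report": {"doc1", "doc4", "doc7"},
    "vpn setup": {"doc2", "doc3"},
}

def precision_at_k(results, relevant, k=10):
    """Fraction of the top-k returned results that were judged relevant."""
    top_k = results[:k]
    if not top_k:
        return 0.0
    return sum(1 for doc in top_k if doc in relevant) / len(top_k)

# One engine's ranked results for each test query
run = {
    "expense report": ["doc1", "doc9", "doc4"],
    "vpn setup": ["doc5", "doc2"],
}

scores = {q: precision_at_k(run[q], qrels[q], k=10) for q in qrels}
```

Because the judgements never change between runs, you can swap in a second engine's results and compare the scores directly.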
For the scoring methodology most folks use some variant of the F-
measure, which is designed to balance out precision and recall.
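In its balanced form the F-measure is just the harmonic mean of precision and recall; a minimal sketch (the document ids below are made up for illustration):

```python
# Minimal sketch of the F-measure over one query's results.
# beta=1 gives the balanced F1; beta > 1 weights recall more heavily,
# beta < 1 weights precision more heavily.

def f_measure(retrieved, relevant, beta=1.0):
    """F-measure for a list of retrieved docs vs. the judged-relevant set."""
    hits = len(set(retrieved) & set(relevant))
    if hits == 0:
        return 0.0
    precision = hits / len(retrieved)
    recall = hits / len(relevant)
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)

# Example: 3 of 5 retrieved docs are relevant; 6 docs are relevant in total,
# so precision = 0.6, recall = 0.5
print(f_measure(["d1", "d2", "d3", "d4", "d5"],
                ["d1", "d3", "d5", "d6", "d7", "d8"]))  # ≈ 0.545
```

The harmonic mean punishes lopsided systems: an engine that returns everything (perfect recall, terrible precision) still scores poorly.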
--- In SearchCoP@yahoogroups.com, "Lee Romero" <pekadad@...> wrote:
> Hi all - I'm familiar with the general process of evaluating
> packages against requirements. I'm currently looking at that
> with search engines (as hinted at in my previous posts about the
> Google Search Appliance).
> Question for you all - Obviously, search result relevance is one of
> those requirements that you would use in evaluating a search engine.
> Do you have any methodology for assessing a search engine in terms
> how it ranks results for particular keywords? It's obviously
> impossible to answer the question, "Does it always get relevance
> exactly right for all possible searches?" because A) you never know
> what particular searches users might use and B) each user's
> expectations of relevance are different (I'm sure there are other
> things as well that make it impossible to do this :-) ).
> Anyway - I'm trying to figure out a way to generally compare the
> results from one engine against another.
> One thought would be to identify the top searches already used by
> users (say the top 20 or 50 or 100 or however many you want to deal
> with). Then ask some sort of random sample of users (how to do that
> is another question) to try each of those searches for each search
> engine and provide an assessment of how good the results returned
> were. (Maybe ask them to score on a scale from 0 = nothing relevant,
> through a few relevant results, 5 = more relevant results, up to
> 10 = all relevant results - forcing them to use the full scale to
> ensure a spread of the scores.)
> Then you can average the perception across each search term to score
> that search term. Then average across all of the search terms to get
> a general "score" for relevance for the engine?
> Seems like a lot of holes in that idea - it's relatively easy to
> identify the searches, but is it fair to constrain to those? How do
> you identify a test population? Does averaging scores (either of the
> two averages above) make sense?
> Thanks for your insights!
> Lee Romero