Re: Methodology for assessing results from multiple search engines?
- Hi Lee,
Have you looked at the methodology adopted by the IR community, e.g., TREC?
I'm not sure how applicable it is to environments where you can't
control most aspects of the setup (e.g. the documents, the queries,
the relevance judgements, and so on) but if it's 'scientific rigour'
you're after then TREC is certainly the way to go.
It's been a while since I looked at any of the TREC data but as I
recall much of it should be directly usable straight out of the box. A
key part of the methodology is to use a standardised test collection,
and apply relevance judgements that were acquired *in advance*, so
that they are independent of any particular test run (and hence avoid
any 'post-hoc' rationalisation).
For the scoring methodology most folks use some variant on the F-
measure, which is designed to balance out precision and recall:
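For readers unfamiliar with it, the F-measure combines precision and recall into a single score; here is a minimal sketch of the general F-beta form (the example numbers are made up for illustration):

```python
def f_measure(relevant_retrieved, retrieved, relevant, beta=1.0):
    """General F-beta measure. beta > 1 favours recall, beta < 1 favours
    precision; beta = 1 (the common F1) balances the two."""
    if retrieved == 0 or relevant == 0:
        return 0.0
    precision = relevant_retrieved / retrieved
    recall = relevant_retrieved / relevant
    if precision == 0 and recall == 0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)

# Example: 8 of the top 10 results are relevant, out of 20 relevant docs total.
print(f_measure(8, 10, 20))  # balanced F1 for precision 0.8, recall 0.4
```

With precision 0.8 and recall 0.4, the balanced F1 works out to 8/15 (about 0.53) — a single number that penalizes an engine for being strong on only one of the two.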
--- In SearchCoP@yahoogroups.com, "Lee Romero" <pekadad@...> wrote:
> Hi all - I'm familiar with the general process of evaluating
> packages against requirements. I'm currently looking at that
> with search engines (as hinted at in my previous posts about the
> Google Search Appliance).
> Question for you all - Obviously, search result relevance is one of
> those requirements that you would use in evaluating a search engine.
> Do you have any methodology for assessing a search engine in terms
> how it ranks results for particular keywords? It's obviously
> impossible to answer the question, "Does it always get relevance
> exactly right for all possible searches?" because A) you never know
> what particular searches users might use and B) each user's
> expectations of relevance are different (I'm sure there are other
> things as well that make it impossible to do this :-) ).
> Anyway - I'm trying to figure out a way to generally compare the
> results from one engine against another.
> One thought would be to identify the top searches already used by
> users (say the top 20 or 50 or 100 or however many you want to deal
> with). Then ask some type of random sampling of users (how to do that
> is another question) to try each of those searches for each search
> engine and provide an assessment of how good the results returned
> were. (Maybe ask them to score on a scale of 0 = nothing relevant, 1 =
> a few relevant results, 5 = more relevant results, 10 = all relevant
> results - forcing them to use that to ensure a spread of the numbers?)
> Then you can average the perception across each search term to score
> that search term. Then average across all of the search terms to get
> a general "score" for relevance for the engine?
> Seems like a lot of holes in that idea - it's relatively easy to
> identify the searches, but is it fair to constrain to those? How to
> identify a test population? Does averaging scores (either one of the
> two averages above) make sense?
> Thanks for your insights!
> Lee Romero
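The two-level averaging Lee proposes (average the user ratings within each search term, then average the per-term scores) can be sketched as follows; the ratings data is made up purely for illustration:

```python
from statistics import mean

# Hypothetical 0-10 user ratings per query, per engine.
ratings = {
    "engine_a": {"expense report": [7, 5, 10], "vpn setup": [3, 5, 1]},
    "engine_b": {"expense report": [5, 5, 7], "vpn setup": [7, 10, 5]},
}

def engine_score(per_query_ratings):
    """Average the ratings within each query, then average across queries,
    so that heavily rated queries don't dominate the overall score."""
    per_query = {q: mean(r) for q, r in per_query_ratings.items()}
    return mean(per_query.values())

for engine, data in sorted(ratings.items()):
    print(engine, round(engine_score(data), 2))
```

Averaging per query first (rather than pooling all ratings) keeps each search term equally weighted even when different numbers of users rated each one — which matters for exactly the "does averaging make sense" question Lee raises.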
Jim here again! Avi is right on with her analysis.
Another consideration is out of the box relevance versus tuned
relevance. The issue with an intranet is that it has a broad variety of
knowledge and many different types of searches and searchers. With broad,
general subjects it is easy to see whether out-of-the-box search gives
better relevance than tools that require tuning. What is difficult is
finding esoteric information; if you don't have the background
with the content, you will need a CoP to help you get there.
As I noted earlier "out of the box relevance" can appear to be very
high. But when we tried to find some esoteric information and needed
to use multiple words, we became less satisfied with the
results. We tested using our actual corpus but added a set of
controlled content elements for the test search to the collections.
This controlled content contained our test documents, including
documents with intentional misspellings of trade names (to test the
data dictionary and spell check) and documents with and without metadata,
to see if the metadata had an impact.
If I recall correctly, we had nearly 20 different types of documents
and content that was added so we could search for it and see how the
controlled content was ranked against the total corpus. We soon
could see that even documents we had intentionally tried to make
important could be ranked much lower than you would expect. Using
compound and phrase searches did not work as well as we expected
without some tuning.
We also knew we would want to eventually preload search queries with
personalized profile information so that we could improve the
expected results for different people in the company. These were
location (UK, US, Brazil), job role (sales, field engineer, design
engineer), hierarchical position in the organization (employee,
supervisor/manager, upper management), etc.
I hope this helps
Out of the box relevancy is a difficult thing to judge and only one relevancy factor to consider out of many. I'm guessing that there is no search engine with good out-of-the-box relevancy unless you are crawling a medium-sized set of high-quality, web-only content - and then would that test data realistically represent your company's content? There are so many issues with relevancy that are enterprise-specific and can only be ferreted out over time.
- When you initially crawl web content you're probably going to get all sorts of anomalies such as infinite calendars, duplicate pages, looping pages, error pages (that don't return HTTP errors) and others.
- Users will expect home pages to rank highly yet most home pages don't have a lot of meaningful content or metadata.
- Users are going to expect Google-like results from your enterprise content - so right from the start your survey results are going to be skewed by unrealistic expectations.
- You may find that, even with all of the cool-sounding Bayesian pattern matching and other whiz-bang algorithms, keyword density is still going to overtake relevancy and return documents that seem completely irrelevant. For example, we have a departmental web site called "Engineering Test Equipment." Those three words are so overused in so many documents that the sheer number of times they appear across tens of thousands of documents outweighs the exact combination of those three words appearing once in the web site's title.
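The keyword-density effect described here can be illustrated with a crude term-frequency scorer (the documents and scoring are made up, and real engines are far more sophisticated, but the failure mode is the same):

```python
def tf_score(doc_words, query_terms):
    """Naive relevance: total count of query-term occurrences in the body,
    with no credit for the exact phrase or for a title match."""
    return sum(doc_words.count(t) for t in query_terms)

query = ["engineering", "test", "equipment"]
# The department home page mentions each query word once...
home_page = "engineering test equipment department home welcome".split()
# ...while a long report repeats the individual words many times.
long_report = ("test plan for equipment under test engineering review "
               "test results equipment log test").split()

print(tf_score(home_page, query))    # 3
print(tf_score(long_report, query))  # 7 - outranks the actual home page
```

Without phrase or field (title) boosting, the verbose report wins on raw term counts even though the home page is the obviously "right" answer — which is exactly why the tuning tools listed below matter.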
Another (possibly better) factor to consider when selecting a search engine vendor is:
- What tools are available to create edited content (Best Bets)?
- What tuning tools are available to remove duplicates via the similar URLs and/or based upon the content?
- What tools are available for weighting documents based upon content or URL?
- Will the search engine allow you to adjust relevancy based on distance from the root? (method for ranking homepages higher - works well on some web sites but not others)
- Will the search engine allow you to exclude/include pages based upon parts of the URL, the hostname or content (i.e., a list of stop words)?
- What tools/parameters allow you to adjust the relevancy during the query (to broaden or narrow the focus of the query)? For example, we found out that searching for employee data works better with the "relevancy" parameter set to 50. Web content works okay at 70. Product/parts data works better by logically ANDing the search terms together.
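The narrow-versus-broaden trade-off behind a query-time "relevancy" knob can be sketched as a toy matcher (the function and documents here are invented; the vendor parameter named above is engine-specific):

```python
def matches(doc_text, terms, require_all=True):
    """Toy matcher: AND all terms together to narrow the result set
    (good for parts data), or accept any term to broaden it
    (good for sparse web content)."""
    words = doc_text.lower().split()
    found = [t.lower() in words for t in terms]
    return all(found) if require_all else any(found)

docs = ["replacement part number 12345 for pump",
        "sales report for the pump line"]
terms = ["part", "pump"]
print([matches(d, terms, require_all=True) for d in docs])   # [True, False]
print([matches(d, terms, require_all=False) for d in docs])  # [True, True]
```

Logically ANDing the terms drops the sales report, which is why strict matching suits precise data like product/parts records while looser matching suits heterogeneous web content.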
Lastly, there is going to be a lot of value you can add in your user interface design that will likely have the biggest impact on relevancy. This is where you can focus the user's activity to provide better relevancy based upon YOUR business and tailored to you business. I heard someone at the ESS talk about guiding your user into an "Advanced Search" without them explicitly using advanced search. That is a great suggestion!
Here are some ways that you're going to improve search relevancy based upon your user interface design:
- Pick your search paradigm well, whether it be direct, navigational, faceted, contextual, relational or universal (after 2.5 years I don't think we've gotten this figured out yet).
- Provide a single, high-level navigation based upon content type; this will yield better relevancy than a single search box for everything and trying to federate relevancy (in my opinion).
- Group content via a taxonomy or categorization.
- Allow site/content owners to contribute Best Bets.
- Apply search tactically to search-related problems.
- Take a "perpetual beta" approach by releasing frequent iterations of the GUI to determine what works and what doesn't.
I completely agree with Avi's approach so don't misunderstand me. You've got to evaluate relevancy and performance relative to each vendor. I'm just suggesting that good relevancy will best be achieved by the hard work you do after you deploy the platform. Therefore the tools and resources provided by the vendor will be key to tuning the engine to your specific needs.
I think Rennie Walker from Wells Fargo was one of the people declaring the concept of relevancy dead. Certainly to the end user relevancy is king, but Rennie is right on. The concept of search engine relevancy within the enterprise is tainted by the success of consumer search technologies.
One Relevancy To Rule Them All, One Relevancy To Find Them, One Relevancy To Bring Them All And In The Darkness Bind Them!
- Avi and all,
I know it's been quite a while since you posted this, but I've been
looking through old posts and trying to determine how to compare the
relevancy of different vendors for in-house testing.
I was wondering if you could expand on "Relevance Ranking - compare
with current clicks." Do you have a suggested way of doing this?
We're in the process of putting together an RFP for new search
software. Since we're a state government (Oregon), we have to go
through a public bid process, which requires assigning scores. We're
planning to do in-house testing of the top vendors and have several
criteria we're planning to use to score vendors. But we're having a
difficult time coming up with a scoring methodology for relevancy.
Thank you in advance for any advice,
--- In SearchCoP@yahoogroups.com, "Avi Rappoport" <analyst@...> wrote:
> I think it's much easier to compare results than to evaluate just
one search engine. Here's
> my standard process:
> - Create a test suite (use existing search logs if possible)
> -- Simple and complex queries
> -- Spelling, typing and vocabulary errors
> -- Force matching edge-case issues - many matches, few matches, no matches
> -- Save results pages as HTML for later checking
> - Analyze the differences among them
> -- Variations in indexing
> -- Retrieval & response time
> -- Relevance Ranking - compare with current clicks
> I find using the current search reports for popular queries and
looking at the popular
> results from those queries to be extremely important, as it takes
out most of my personal
> biases and expectations.
> However, I wouldn't give a single number for "relevance" as there's
no way to measure that
> properly. But you may be able to say that one search is relatively
better than another
> within some categories. For example, when I did the article for
Network World, I
> discovered one search engine (no longer sold) that was significantly
worse than the others
> in most kinds of search.
> I hope that helps,
> --- In SearchCoP@yahoogroups.com, "Lee Romero" <pekadad@> wrote:
> > Hi all - I'm familiar with the general process of evaluating software
> > packages against requirements. I'm currently looking at that problem
> > with search engines (as hinted at in my previous posts about the
> > Google Search Appliance).
> > Question for you all - Obviously, search result relevance is one of
> > those requirements that you would use in evaluating a search engine.
> > Do you have any methodology for assessing a search engine in terms of
> > how it ranks results for particular keywords? It's obviously
> > impossible to answer the question, "Does it always get relevance
> > exactly right for all possible searches?" because A) you never know
> > what particular searches users might use and B) each user's
> > expectations of relevance are different (I'm sure there are other
> > things as well that make it impossible to do this :-) ).
> > Anyway - I'm trying to figure out a way to generally compare the
> > results from one engine against another.
> > One thought would be to identify the top searches already used by
> > users (say the top 20 or 50 or 100 or however many you want to deal
> > with). Then ask some type of random sampling of users (how to do that
> > is another question) to try each of those searches for each search
> > engine and provide an assessment of how good the results returned
> > were. (Maybe ask them to score on a scale of 0 = nothing relevant, 1 =
> > a few relevant results, 5 = more relevant results, 10 = all relevant
> > results - forcing them to use that to ensure a spread of the numbers?)
> > Then you can average the perception across each search term to score
> > that search term. Then average across all of the search terms to get
> > a general "score" for relevance for the engine?
> > Seems like a lot of holes in that idea - it's relatively easy to
> > identify the searches, but is it fair to constrain to those? How to
> > identify a test population? Does averaging scores (either one of the
> > two averages above) make sense?
> > Thanks for your insights!
> > Lee Romero
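Avi's "Relevance Ranking - compare with current clicks" suggestion can be made concrete by scoring each engine's top-k list against the URLs users already click for that query in the existing search reports; a rough sketch, with made-up data:

```python
def clicks_overlap_at_k(engine_results, popular_clicked, k=10):
    """Fraction of the engine's top-k results that appear among the URLs
    users already click for this query (from current search reports)."""
    top_k = engine_results[:k]
    if not top_k:
        return 0.0
    return sum(1 for url in top_k if url in popular_clicked) / len(top_k)

# Hypothetical: the current click log says these URLs matter for "benefits".
popular = {"/hr/benefits", "/hr/benefits/enroll", "/hr/faq"}
engine_a = ["/hr/benefits", "/news/2008", "/hr/faq"]
print(clicks_overlap_at_k(engine_a, popular, k=3))
```

Because the "right answers" come from observed user behavior rather than the evaluator's own judgment, this comparison avoids much of the personal bias Avi mentions.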
- Hi Crystal - I took the input I received from this thread and my
approach to assessing the search engines I was looking at is described
below. I never did reply back here to say thanks to everyone for
their input, so thanks (!) to Avi, Jim, Tony and Tim for your insights
- they did help.
Anyway - here's what I did (sorry for the length, but I hope some of
the details are of value):
First - I split the assessment into two parts. One part was a more
purely "requirements based" assessment which allowed me to include a
measure of things like: Type of file systems that can be indexed,
ability to control the web crawler, power of the administration
interface in other ways, etc. The second part was to measure the
quality of search results. Then you can get an overall picture of the
effectiveness and power of the search engines by using both of those
measures. It would be possible (I believe) to mathematically combine
those two measures but I did not.
For the first part, I used a simplified quality function deployment
matrix - I identified the various requirements to consider and
assigned them a weight (level of importance); based on some previous
experiences, I forced the weights to be either 10 (very important -
probably "mandatory" in a semantic sense), a 5 (desirable but not
absolutely necessary) or a 1 (nice to have) - this provides a better
spread in the final outcome, I believe.
Then I reviewed the search engines against those requirements and
assigned them a "score" which, again, was measured as a 10 (met out of
the box), a 5 (met with some level of configuration), a 1 (met with
some customization - i.e., probably some type of scripting or similar,
but not configuration through an admin UI) and a 0 (does not meet and
can not meet).
The overall "score" for an engine was then measured as the sum of the
score for each requirement times that requirement's weight.
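The weighted-sum scoring described above (weights of 10/5/1, scores of 10/5/1/0) can be sketched as follows; the requirement names are invented examples, not Lee's actual list:

```python
# Hypothetical requirements: (weight, score) pairs using the scales above.
# Weight: 10 = mandatory, 5 = desirable, 1 = nice to have.
# Score: 10 = met out of the box, 5 = met with configuration,
#        1 = met with customization, 0 = does not / cannot meet.
requirements = {
    "crawl file shares": (10, 10),
    "admin UI control":  (10, 5),
    "spell check":       (5, 10),
    "custom templates":  (1, 0),
}

def overall_score(reqs):
    """Sum of (requirement weight x requirement score) per engine."""
    return sum(weight * score for weight, score in reqs.values())

print(overall_score(requirements))  # 100 + 50 + 50 + 0 = 200
```

Forcing the weights and scores onto the coarse 10/5/1(/0) scales, as Lee notes, spreads the totals out so that engines differing on mandatory requirements can't be rescued by piles of nice-to-haves.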
To measure the quality of search results, I took Avi's insights and
identified a set of specific searches that I wanted to measure. I
identified the candidate searches by looking at the log files for the
existing search solution on the site and pulling out a few searches
that fell into each category Avi identified.
I assumed I did not necessarily know the "right" targets for these
searches, so I enlisted some volunteers among a group of knowledgeable
employees (content managers on the web site) who could complete a
survey I put together. The survey included a section where the
participant had to execute each search against each search engine (the
survey provided a link to do the search - so the participants did not
have to actually go to a search screen somewhere and enter the terms
and search - this was important to keep it somewhat simpler). The
participants were then asked to score the quality of the results for
each search engine (on a scale of 1-5).
The survey also included some other questions about presentation of
results, performance, etc. (even though we did not customize search
result templates or tweak anything in the searches, we wanted to get a
general sense of usability) and also included a section where users
could define and rate their own searches.
The results from the survey were then analyzed to get an overall
measure of quality of results across this candidate set of searches
for each search engine - basically doing some aggregation of the
different searches into average scores or similar.
With the engines we were looking at, the results were that one was
better on the administration / architectural requirements and the
other was better on the search results - which makes for an
interesting decision, I think.
There are some issues with this approach, I know, but at least it does
make the analysis somewhat quantitative and one can then discuss
things like, "What should the weight of this requirement be? Why is
this requirement scored X for this engine? What's more important -
how well the engine fits into our architecture and how easy it is to
administer or the search results presented to the end user?", instead
of more subjective / emotional issues.
I hope this level of detail helps you move forward with your own research!
On Mon, Aug 18, 2008 at 7:58 PM, crystalkoregon wrote:
> Avi and all,
> I know it's been quite a while since you posted this, but I've been
> looking through old posts and trying to determine how to compare the
> relevancy of different vendors for in-house testing.
> I was wondering if you could expand on "Relevance Ranking - compare
> with current clicks." Do you have a suggested way of doing this?
> We're in the process of putting together an RFP for new search
> software. Since we're a state government (Oregon), we have to go
> through a public bid process, which requires assigning scores. We're
> planning to do in-house testing of the top vendors and have several
> criteria we're planning to use to score vendors. But we're having a
> difficult time coming up with a scoring methodology for relevancy.
> Thank you in advance for any advice,
> Crystal Knapp
Thanks for the detailed information. This definitely helps! It’s good to know that we’re mostly on the right track.
We were thinking of incorporating several of the scoring criteria you suggested and giving different weights for each criterion, but we have been struggling with how to assign scores for the interface and relevancy. I like your suggestion of allowing users to assign scores for these, following controlled testing. We’re worried that they might be biased toward a specific search engine, but that bias will still be there even after the purchase. You’re confirming my hunch that using user feedback to score relevancy is valid.
I’m also not surprised to hear that you preferred the administration of one vendor but the results of another. I’m worried we might run into the same issue.
Thanks so much,
- On Tue, Aug 19, 2008 at 2:34 PM, Crystal Knapp wrote:
> Lee,[LR] Well, that assumes I'm on the right track :-) Hopefully I am, though.
> Thanks for the detailed information. This definitely helps! It's good to
> know that we're mostly on the right track.
>[LR] I think that general approach works well. I would have preferred
> We were thinking of incorporating several of the scoring criteria you
> suggested and giving different weights for each criterion, but we have been
> struggling with how to assign scores for the interface and relevancy. I
> like your suggestion of allowing users to assign scores for these, following
> controlled testing.
to do my own assessment while also being able to guard against bias,
but I was balancing between the effort / cost of setting up the
assessment and level of confidence one can lend to the results. With
more effort, I think I could increase the confidence (i.e., reduce the
"error rate") by doing things like having the search results use the
exact same presentation (and not have any reference to the underlying
engine visible in the results) and also by increasing the # of people
involved. Despite that, I still have confidence in the outcome at
least at the level of making a sound decision (one could still say,
"This one isn't high enough in this area" or "That one is too low in that area").
> We're worried that they might be biased toward a[LR] Yes, that's very possible (likely?). Biases that participants
> specific search engine, but that bias will still be there even after the
> purchase. You're confirming my hunch that using user feedback to score
> relevancy is valid.
have will definitely influence their perception if they know which
engine is producing which results.
>[LR] I would not be surprised if that happened. Even if it does, you
> I'm also not surprised to hear that you preferred the administration of one
> vendor but the results of another. I'm worried we might run into the same issue.
can still then ask yourselves - "Is it better to have an X% better
search experience when we have to do Y amount of work to get this
engine to work in our infrastructure?" Depending on X and Y, your
answer will change but at least you can have the discussion.
> Thanks so much,
I finally got around to writing up my comments on A/B testing for evaluating search engines. See my post on "kitten war" testing: http://wunderwood.org/most_casual_observer/2008/09/search_evaluation_by_kitten_wa.html
Thanks for the wonderful summary. We've already selected a vendor; what we are using this for is to improve the usability of the presentation and confirm enhancements. There are a lot of people out there that quote hearsay or have a predetermined solution in mind. Your example was perfect...
Google is Cuter: Or, "brand names work". One of our Ultraseek customers did a blind kitten war test between Ultraseek and Google. Ultraseek was preferred 75% of the time. Some executive found this hard to believe and asked that they try it again with the logos attached to the pages. The second time, people preferred Google over half the time.
Thanks for the summary again - I will use some of your ideas in good health. Have virtual coffee on me.
--- In SearchCoP@yahoogroups.com, Walter Underwood <wunderwood@...> wrote:
> I finally got around to writing up my comments on A/B testing for evaluating
> search engines. See my post on "kitten war" testing:
- Yes, thanks, Walter. :-)
In case anyone's interested, I've written up my previous post (in
response to Crystal) about a methodology for evaluating search engines.
I also ended up putting together a set of requirements groupings that
one might consider in doing an evaluation of search engines as a
follow-up to the above post (and because a former co-worker of mine
happened to ask me that question recently):
Hopefully, of some use to someone out there :-)
I point to your post, Walter, when discussing the known issues with
the particular way I went about doing my own evaluation, so thanks for
writing that up.
On Wed, Sep 17, 2008 at 10:26 AM, Jim <jim.smith@...> wrote:
> Thanks for the wonderful summary. We've already selected a vendor, what we
> are using this for is to improve the usability of the presentation and
> confirm enhancements. There are a lot of people out there that quote hearsay
> or have a predetermined solution in mind. Your example was perfect...