
Re: Methodology for assessing results from multiple search engines?

  • Tony Rose
    Message 1 of 12, May 1, 2008
      Hi Lee,

      Have you looked at the methodology adopted by the IR community, e.g.,
      TREC?

      http://trec.nist.gov/

      I'm not sure how applicable it is to environments where you can't
      control most aspects of the setup (e.g. the documents, the queries,
      the relevance judgements, and so on) but if it's 'scientific rigour'
      you're after then TREC is certainly the way to go.

      It's been a while since I looked at any of the TREC data but as I
      recall much of it should be directly usable straight out of the box. A
      key part of the methodology is to use a standardised test collection,
      and apply relevance judgements that were acquired *in advance*, so
      that they are independent of any particular test run (and hence avoid
      any 'post-hoc' rationalisation).

      For the scoring methodology most folks use some variant on the F-
      measure, which is designed to balance out precision and recall:

      http://en.wikipedia.org/wiki/Information_retrieval#F-measure
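
      For illustration, here is a minimal sketch (in Python, with invented
      document ids and relevance judgements) of computing precision, recall
      and the balanced F-measure for a single query:

          def f_measure(retrieved, relevant, beta=1.0):
              # retrieved: list of document ids returned by the engine
              # relevant: set of ids judged relevant in advance (per TREC practice)
              retrieved_set = set(retrieved)
              hits = len(retrieved_set & relevant)
              precision = hits / len(retrieved_set) if retrieved_set else 0.0
              recall = hits / len(relevant) if relevant else 0.0
              if precision + recall == 0:
                  return precision, recall, 0.0
              f = (1 + beta ** 2) * precision * recall / (beta ** 2 * precision + recall)
              return precision, recall, f

          # Hypothetical query: 3 of the 4 results are among the 5 relevant docs
          print(f_measure(["d1", "d2", "d3", "d9"], {"d1", "d2", "d3", "d4", "d5"}))
          # -> (0.75, 0.6, ~0.667)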

      Best regards,
      Tony


      --- In SearchCoP@yahoogroups.com, "Lee Romero" <pekadad@...> wrote:
      >
      > Hi all - I'm familiar with the general process of evaluating software
      > packages against requirements. I'm currently looking at that problem
      > with search engines (as hinted at in my previous posts about the
      > Google Search Appliance).
      >
      > Question for you all - Obviously, search result relevance is one of
      > those requirements that you would use in evaluating a search engine.
      > Do you have any methodology for assessing a search engine in terms of
      > how it ranks results for particular keywords? It's obviously
      > impossible to answer the question, "Does it always get relevance
      > exactly right for all possible searches?" because A) you never know
      > what particular searches users might use and B) each user's
      > expectations of relevance are different (I'm sure there are other
      > things as well that make it impossible to do this :-) ).
      >
      > Anyway - I'm trying to figure out a way to generally compare the
      > results from one engine against another.
      >
      > One thought would be to identify the top searches already used by
      > users (say the top 20 or 50 or 100 or however many you want to deal
      > with). Then ask some type of random sampling of users (how to do that
      > is another question) to try each of those searches for each search
      > engine and provide an assessment of how good the results returned
      > were. (Maybe ask them to score on a scale of 0 = nothing relevant, 1 =
      > a few relevant results, 5 = more relevant results, 10 = all relevant
      > results - forcing them to use that to ensure a spread of the numbers?)
      >
      > Then you can average the perception across each search term to score
      > that search term. Then average across all of the search terms to get
      > a general "score" for relevance for the engine?
      >
      > Seems like a lot of holes in that idea - it's relatively easy to
      > identify the searches, but is it fair to constrain to those? How to
      > identify a test population? Does averaging scores (either one of the
      > two averages above) make sense?
      >
      > Thanks for your insights!
      > Lee Romero
      >
    • Jim
      Message 2 of 12, May 2, 2008
        Lee.
        Jim here again! Avi is right on with her analysis.

        Another consideration is out-of-the-box relevance versus tuned
        relevance. The issue with an intranet is that it has a broad variety
        of knowledge and different types of searches and searchers. With
        broad, general subjects it is easy to see whether out-of-the-box
        search gives better relevance than tools that require tuning. What is
        difficult is finding esoteric information, and if you don't have the
        background with the content you will need a CoP to help you get there.

        As I noted earlier, "out of the box relevance" can appear to be very
        high. But when we tried to find some esoteric information and needed
        to use multiple words, we became less satisfied with the results. We
        tested using our actual corpus but added a set of controlled content
        elements for the test searches to the collections. This controlled
        content contained our test documents, including documents with
        intentional misspellings of trade names (to test the data dictionary
        and spell check) and documents with and without metadata, to see
        whether the metadata had an impact.

        If I recall correctly, we had nearly 20 different types of documents
        and content that was added so we could search for it and see how the
        controlled content was ranked against the total corpus. We soon
        could see that even documents we had intentionally tried to make
        important could be ranked much lower than you would expect. Compound
        and phrase searches did not work as well as we expected without some
        tuning.

        We also knew we would eventually want to preload search queries with
        personalized profile information so that we could improve the
        expected results for different people in the company. The profile
        attributes were location (UK, US, Brazil), job role (sales, field
        engineer, design engineer), hierarchical position in the organization
        (employee, supervisor/manager, upper management), etc.

        I hope this helps
        jim
      • Tim
        Message 3 of 12, May 5, 2008

          Out of the box relevancy is a difficult thing to judge and only one relevancy factor to consider out of many.  I'm guessing that there is no search engine with good out-of-the-box relevancy unless you are crawling a medium-sized set of high-quality, web-only content - and then would that test data realistically represent your company's content?  There are so many issues with relevancy that are enterprise-specific and can only be ferreted out over time.

          • When you initially crawl web content, you're probably going to get all sorts of anomalies such as infinite calendars, duplicate pages, looping pages, error pages (that don't return HTTP errors) and others.
          • Users will expect home pages to rank highly yet most home pages don't have a lot of meaningful content or metadata.
          • Users are going to expect Google-like results from your enterprise content - so right from the start your survey results are going to be skewed by unrealistic expectations.
          • You may find that, even with all of the cool-sounding Bayesian pattern matching and other whiz-bang algorithms, keyword density is still going to overtake relevancy and return documents that seem completely irrelevant.  For example, we have a departmental web site called "Engineering Test Equipment."  Those three words are so overused in so many documents that the sheer number of times they appear across tens of thousands of documents outweighs the exact combination of those three words appearing once in the web site's title.

          Other (possibly better) factors to consider when selecting a search engine vendor are:

          • What tools are available to create edited content (Best Bets)?
          • What tuning tools are available to remove duplicates via the similar URLs and/or based upon the content?
          • What tools are available for weighting documents based upon content or URL?
          • Will the search engine allow you to adjust relevancy based on distance from the root? (a method for ranking home pages higher - works well on some web sites but not others)
          • Will the search engine allow you to exclude/include pages based upon parts of the URL, the hostname or content (i.e., a list of stop words)?
          • What tools/parameters allow you to adjust the relevancy during the query (to broaden or narrow the focus of the query)?  For example, we found out that searching for employee data works better with the "relevancy" parameter set to 50.  Web content works okay at 70.   Product/parts data works better by logically ANDing the search terms together.

          Lastly, there is going to be a lot of value you can add in your user interface design that will likely have the biggest impact on relevancy.  This is where you can focus the user's activity to provide better relevancy based upon YOUR business and tailored to your business.  I heard someone at the ESS talk about guiding your user into an "Advanced Search" without them explicitly using advanced search.  That is a great suggestion!

          Here are some ways that you're going to improve search relevancy based upon your user interface design:

          • Pick your search paradigm well, whether it be direct, navigational, faceted, contextual, relational or universal (after 2.5 years I don't think we've gotten this figured out yet)
          • Providing a single, high-level navigation based upon content type is going to yield better relevancy than a single search box for everything and trying to federate relevancy (in my opinion).
          • Grouping content via a taxonomy or categorization
          • Allow site/content owners to contribute Best Bets
          • Applying search tactically to search-related problems
          • Take a "perpetual beta" approach by releasing frequent iterations of the GUI to determine what works and what doesn't work. 

          I completely agree with Avi's approach so don't misunderstand me.  You've got to evaluate relevancy and performance relative to each vendor.  I'm just suggesting that good relevancy will best be achieved by the hard work you do after you deploy the platform.  Therefore the tools and resources provided by the vendor will be key to tuning the engine to your specific needs.

          I think it was Rennie Walker from Wells Fargo who was one of the people declaring the concept of relevancy dead.  Certainly to the end user relevancy is king, but Rennie is right on.  The concept of search engine relevancy within the enterprise is tainted by the success of consumer search technologies.

          Tim

          One Relevancy To Rule Them All, One Relevancy To Find Them, One Relevancy To Bring Them All And In The Darkness Bind Them!

        • crystalkoregon
            Message 4 of 12, Aug 18, 2008
            Avi and all,

            I know it's been quite a while since you posted this, but I've been
            looking through old posts and trying to determine how to compare the
            relevancy of different vendors for in-house testing.

            I was wondering if you could expand on "Relevance Ranking - compare
            with current clicks." Do you have a suggested way of doing this?

            We're in the process of putting together an RFP for new search
            software. Since we're a state government (Oregon), we have to go
            through a public bid process, which requires assigning scores. We're
            planning to do in-house testing of the top vendors and have several
            criteria we're planning to use to score vendors. But we're having a
            difficult time coming up with a scoring methodology for
            relevancy/accuracy.

            Thank you in advance for any advice,
            Crystal Knapp


            --- In SearchCoP@yahoogroups.com, "Avi Rappoport" <analyst@...> wrote:
            >
            > I think it's much easier to compare results than to evaluate just
            > one search engine. Here's my standard process:
            >
            > - Create a test suite (use existing search logs if possible)
            > -- Simple and complex queries
            > -- Spelling, typing and vocabulary errors
            > -- Force matching edge-case issues - many matches, few matches, no matches
            > -- Save results pages as HTML for later checking
            >
            > - Analyze the differences among them
            > -- Variations in indexing
            > -- Retrieval & response time
            > -- Relevance Ranking - compare with current clicks
            >
            > I find using the current search reports for popular queries and
            > looking at the popular results from those queries to be extremely
            > important, as it takes out most of my personal biases and expectations.
            >
            > However, I wouldn't give a single number for "relevance" as there's
            > no way to measure that properly. But you may be able to say that one
            > search is relatively better than another within some categories. For
            > example, when I did the article for Network World, I discovered one
            > search engine (no longer sold) that was significantly worse than the
            > others in most kinds of search.
            >
            > I hope that helps,
            >
            > Avi
          • Lee Romero
              Message 5 of 12, Aug 19, 2008
              Hi Crystal - I took the input I received from this thread, and my
              approach to assessing the search engines I was looking at is described
              below. I never did reply back here to say thanks to everyone for
              their input, so thanks (!) to Avi, Jim, Tony and Tim for your insights
              - they did help.

              Anyway - here's what I did (sorry for the length, but I hope some of
              the details are of value):

              First - I split the assessment into two parts. One part was a more
              purely "requirements based" assessment which allowed me to include a
              measure of things like: Type of file systems that can be indexed,
              ability to control the web crawler, power of the administration
              interface in other ways, etc. The second part was to measure the
              quality of search results. Then you can get an overall picture of the
              effectiveness and power of the search engines by using both of those
              measures. It would be possible (I believe) to mathematically combine
              those two measures but I did not.

              For the first part, I used a simplified quality function deployment
              (QFD) matrix - I identified the various requirements to consider and
              assigned them a weight (level of importance); based on some previous
              experiences, I forced the weights to be either 10 (very important -
              probably "mandatory" in a semantic sense), a 5 (desirable but not
              absolutely necessary) or a 1 (nice to have) - this provides a better
              spread in the final outcome, I believe.

              Then I reviewed the search engines against those requirements and
              assigned them a "score" which, again, was measured as a 10 (met out of
              the box), a 5 (met with some level of configuration), a 1 (met with
              some customization - i.e., probably some type of scripting or similar,
              but not configuration through an admin UI) and a 0 (does not meet and
              cannot be made to meet).

              The overall "score" for an engine was then measured as the sum of the
              score for each requirement times that requirement's weight.
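
              As a purely hypothetical sketch of that weighted-sum calculation (the
              requirements, weights and scores below are invented; the 10/5/1
              weights and 10/5/1/0 scores are the scales described above):

                  # Weights: 10 = mandatory, 5 = desirable, 1 = nice to have
                  # Scores: 10 = out of the box, 5 = configuration, 1 = customization, 0 = cannot meet
                  weights = {
                      "index file shares": 10,
                      "control web crawler": 10,
                      "best bets": 5,
                      "did-you-mean": 1,
                  }
                  scores = {
                      "Engine A": {"index file shares": 10, "control web crawler": 5, "best bets": 5, "did-you-mean": 0},
                      "Engine B": {"index file shares": 5, "control web crawler": 10, "best bets": 1, "did-you-mean": 10},
                  }
                  for engine, engine_scores in scores.items():
                      total = sum(engine_scores[req] * weight for req, weight in weights.items())
                      print(engine, total)
                  # Engine A: 10*10 + 5*10 + 5*5 + 0*1 = 175
                  # Engine B: 5*10 + 10*10 + 1*5 + 10*1 = 165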


              To measure the quality of search results, I took Avi's insights and
              identified a set of specific searches that I wanted to measure. I
              identified the candidate searches by looking at the log files for the
              existing search solution on the site and pulling out a few searches
              that fell into each category Avi identified.
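
              (A minimal sketch of pulling candidate searches out of a log, assuming
              a simplified log format with one raw query per line - a real log will
              need its own parsing:)

                  from collections import Counter

                  def top_queries(log_path, n=50):
                      # Count identical query strings and return the n most frequent
                      with open(log_path) as log:
                          counts = Counter(line.strip().lower() for line in log if line.strip())
                      return counts.most_common(n)

                  # e.g. top_queries("search.log", n=20) -> [("expense report", 412), ...]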

              I assumed I did not necessarily know the "right" targets for these
              searches, so I enlisted some volunteers among a group of knowledgeable
              employees (content managers on the web site) who could complete a
              survey I put together. The survey included a section where the
              participant had to execute each search against each search engine (the
              survey provided a link to do the search - so the participants did not
              have to actually go to a search screen somewhere and enter the terms
              and search - this was important to keep it somewhat simpler). The
              participants were then asked to score the quality of the results for
              each search engine (on a scale of 1-5).

              The survey also included some other questions about presentation of
              results, performance, etc. (even though we did not customize search
              result templates or tweak anything in the searches, we wanted to get a
              general sense of usability) and also included a section where users
              could define and rate their own searches.

              The results from the survey were then analyzed to get an overall
              measure of quality of results across this candidate set of searches
              for each search engine - basically doing some aggregation of the
              different searches into average scores or similar.
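
              As a sketch of that aggregation - assuming the survey responses were
              exported as (participant, search, engine, rating) rows on the 1-5
              scale described above; the rows here are invented:

                  from collections import defaultdict
                  from statistics import mean

                  rows = [
                      ("p1", "expense report", "Engine A", 4),
                      ("p1", "expense report", "Engine B", 2),
                      ("p2", "expense report", "Engine A", 5),
                      ("p2", "expense report", "Engine B", 3),
                  ]

                  per_search = defaultdict(list)
                  for _participant, search, engine, rating in rows:
                      per_search[(engine, search)].append(rating)

                  # Average ratings per (engine, search), then average those per engine
                  search_means = {key: mean(vals) for key, vals in per_search.items()}
                  for engine in sorted({e for e, _ in search_means}):
                      overall = mean(v for (e, _), v in search_means.items() if e == engine)
                      print(engine, round(overall, 2))   # Engine A 4.5, Engine B 2.5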

              With the engines we were looking at, the results were that one was
              better on the administration / architectural requirements and the
              other was better on the search results - which makes for an
              interesting decision, I think.

              There are some issues with this approach, I know, but at least it does
              make the analysis somewhat quantitative and one can then discuss
              things like, "What should the weight of this requirement be? Why is
              this requirement scored X for this engine? What's more important -
              how well the engine fits into our architecture and how easy it is to
              administer or the search results presented to the end user?", instead
              of more subjective / emotional issues.

              I hope this level of detail helps you move forward with your own research!

              Regards
              Lee Romero

              On Mon, Aug 18, 2008 at 7:58 PM, crystalkoregon
              <crystal.knapp@...> wrote:
              > Avi and all,
              >
              > I know it's been quite a while since you posted this, but I've been
              > looking through old posts and trying to determine how to compare the
              > relevancy of different vendors for in-house testing.
              >
              > I was wondering if you could expand on "Relevance Ranking - compare
              > with current clicks." Do you have a suggested way of doing this?
              >
              > We're in the process of putting together an RFP for new search
              > software. Since we're a state government (Oregon), we have to go
              > through a public bid process, which requires assigning scores. We're
              > planning to do in-house testing of the top vendors and have several
              > criteria we're planning to use to score vendors. But we're having a
              > difficult time coming up with a scoring methodology for
              > relevancy/accuracy.
              >
              > Thank you in advance for any advice,
              > Crystal Knapp
              >
              >
            • Crystal Knapp
                Message 6 of 12, Aug 19, 2008

                Lee,

                 

                Thanks for the detailed information. This definitely helps!  It’s good to know that we’re mostly on the right track.

                 

                We were thinking of incorporating several of the scoring criteria you suggested and giving different weights for each criterion, but we have been struggling with how to assign scores for the interface and relevancy.  I like your suggestion of allowing users to assign scores for these, following controlled testing.  We’re worried that they might be biased toward a specific search engine, but that bias will still be there even after the purchase.  You’re confirming my hunch that using user feedback to score relevancy is valid.

                 

                I’m also not surprised to hear that you preferred the administration of one vendor but the results of another.  I’m worried we might run into the same issue.

                 

                Thanks so much,

                Crystal

                 


              • Lee Romero
                  Message 7 of 12, Aug 19, 2008
                  On Tue, Aug 19, 2008 at 2:34 PM, Crystal Knapp
                  <crystal.knapp@...> wrote:
                  > Lee,
                  >
                  >
                  >
                  > Thanks for the detailed information. This definitely helps! It's good to
                  > know that we're mostly on the right track.
                  >

                  [LR] Well, that assumes I'm on the right track :-) Hopefully I am, though.

                  >
                  >
                  > We were thinking of incorporating several of the scoring criteria you
                  > suggested and giving different weights for each criterion, but we have been
                  > struggling with how to assign scores for the interface and relevancy. I
                  > like your suggestion of allowing users to assign scores for these, following
                  > controlled testing.

                  [LR] I think that general approach works well. I would have preferred
                  to do my own assessment while also being able to guard against bias,
                  but I was balancing between the effort / cost of setting up the
                  assessment and level of confidence one can lend to the results. With
                  more effort, I think I could increase the confidence (i.e., reduce the
                  "error rate") by doing things like having the search results use the
                  exact same presentation (and not have any reference to the underlying
                  engine visible in the results) and also by increasing the # of people
                  involved. Despite that, I still have confidence in the outcome at
                  least at the level of making a sound decision (one could still say,
                  "This one isn't high enough in this area" or "That one is too low
                  here.")

                  > We're worried that they might be biased toward a
                  > specific search engine, but that bias will still be there even after the
                  > purchase. You're confirming my hunch that using user feedback to score
                  > relevancy is valid.

                  [LR] Yes, that's very possible (likely?). Biases that participants
                  have will definitely influence their perception if they know which
                  engine is producing which results.

                  >
                  >
                  >
                  > I'm also not surprised to hear that you preferred the administration of one
                  > vendor but the results of another. I'm worried we might run into the same
                  > issue.

                  [LR] I would not be surprised if that happened. Even if it does, you
                  can still then ask yourselves - "Is it better to have an X% better
                  search experience when we have to do Y amount of work to get this
                  engine to work in our infrastructure?" Depending on X and Y, your
                  answer will change but at least you can have the discussion.

                  Good Luck!
                  Lee

                  >
                  >
                  >
                  > Thanks so much,
                  >
                  > Crystal
                • Walter Underwood
                    Message 8 of 12, Sep 16, 2008
                    I finally got around to writing up my comments on A/B testing for evaluating search engines. See my post on “kitten war” testing: http://wunderwood.org/most_casual_observer/2008/09/search_evaluation_by_kitten_wa.html

                    wunder
                  • Jim
                      Message 9 of 12, Sep 17, 2008

                      Thanks for the wonderful summary.  We've already selected a vendor; what we are using this for is to improve the usability of the presentation and to confirm enhancements.  There are a lot of people out there who quote hearsay or have a predetermined solution in mind.  Your example was perfect...

                      Google is Cuter: Or, "brand names work". One of our Ultraseek customers did a blind kitten war test between Ultraseek and Google. Ultraseek was preferred 75% of the time. Some executive found this hard to believe and asked that they try it again with the logos attached to the pages. The second time, people preferred Google over half the time.
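
                      To make the blind-preference idea concrete, here is a small sketch of
                      scoring a "kitten war" style test: count which engine each judge
                      preferred (with labels hidden) and check whether the split could
                      plausibly be chance, using an exact two-sided sign test. The vote
                      counts below are invented for the example:

                          from math import comb

                          def sign_test_p(wins_a, wins_b):
                              # Exact two-sided sign test against a 50/50 null (ties excluded)
                              n = wins_a + wins_b
                              observed = abs(wins_a - n / 2)
                              return sum(comb(n, k) for k in range(n + 1)
                                         if abs(k - n / 2) >= observed) * 0.5 ** n

                          # Hypothetical blind test: 30 of 40 judges preferred engine A
                          wins_a, wins_b = 30, 10
                          print(wins_a / (wins_a + wins_b))     # 0.75 preference rate
                          print(sign_test_p(wins_a, wins_b))    # ~0.002 - very unlikely under a 50/50 null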

                      Thanks for the summary again - I will use some of your ideas in good health. Have virtual coffee on me.

                      --- In SearchCoP@yahoogroups.com, Walter Underwood <wunderwood@...> wrote:
                      >
                      > I finally got around to writing up my comments on A/B testing for evaluating
                      > search engines. See my post on "kitten war" testing:
                      > http://wunderwood.org/most_casual_observer/2008/09/search_evaluation_by_kitten_wa.html
                      >
                      > wunder
                      >

                    • Lee Romero
                        Message 10 of 12, Oct 1, 2008
                        Yes, thanks, Walter. :-)

                        In case anyone's interested, I've written up my previous post (in
                        response to Crystal) about a methodology for evaluating search engines
                        to share:

                        http://blog.leeromero.org/2008/09/30/evaluating-and-selecting-a-search-engine/

                        I also ended up putting together a set of requirements groupings that
                        one might consider in doing an evaluation of search engines as a
                        follow-up to the above post (and because a former co-worker of mine
                        happened to ask me that question recently):

                        http://blog.leeromero.org/2008/10/01/categories-of-search-requirements/

                        Hopefully, of some use to someone out there :-)

                        I point to your post, Walter, when discussing the known issues with
                        the particular way I went about doing my own evaluation, so thanks for
                        writing that up.

                        Regards
                        Lee Romero

                        On Wed, Sep 17, 2008 at 10:26 AM, Jim <jim.smith@...> wrote:
                        > Thanks for the wonderful summary. We've already selected a vendor, what we
                        > are using this for is to improve the usability of the presentation and
                        > confirm enhancements. There are lot of people out there that quote hearsay
                        > or have a predetermined solution in mind. Your example was perfect...
                        >