
Re: [agile-usability] Re: Online Usability Tests

  • Todd Zaki Warfel
    Message 1 of 29, Mar 20, 2008

      On Mar 20, 2008, at 1:17 PM, William Pietri wrote:
      I think we're talking about different kinds of quantitative methods.

      The choice I typically see isn't 12 local vs 100 remote. It's 4 versus 1,000. Or 10,000. Or 100,000. And actual users doing actual tasks versus recruited users doing requested tasks.

      Well, 4 isn't enough. I wouldn't recommend fewer than 5, and typically 8-12 is best; otherwise you don't have enough to start seeing significance in the patterns. That's one reason 1,000 is going to yield better results with web tracking: you don't have enough participants in your qualitative study.

      [...] It turned out there were a number of minor user interface issues, none affecting more than 2% of order attempts, which is well below the power of a 12-user study to resolve. And several related to different international styles of entering addresses, which we couldn't have solved with a local user study anyhow. The cost-benefit ratio was also much better; from inception to recommendations, it was under two person-days of work.

      A couple of things here: 
      1. I would suspect that the "minor user interface issues" would have been easily corrected simply by having a good, informed interaction designer or usability specialist assess the interface. 

      2. Did you do a 12-user study on this interface? I'll bet that if you did, you would have found the same issues; I've done this literally hundreds of times. If you didn't, how would you know that it's beyond what you can find in a 12-person study? We use web metrics to help identify key abandonment areas, then in-person field studies to find out the why. For example, we had a client who had significant abandonment on one of their cart screens, but didn't know exactly which fields were the cause. They could have spent time coding up the fields with JS to track every single one and figure it out (a rough sketch of what that tracking might look like is below). Instead we did a quick study with 12 people and found that it was a combination of two fields causing the problem on that screen, and exactly why they were an issue. Problem fixed. 

      Just a different approach. And yes, we used a mix of qual and quant, something we do quite often.
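
      If you did want to go the JS route first, a rough sketch of that kind of per-field tracking (in TypeScript, with a made-up form selector and logging endpoint) might look like:

        // Rough sketch only: log when each checkout field loses focus and whether
        // it was left empty or failed validation. The #checkout selector and the
        // /field-events endpoint are invented.
        document.querySelectorAll<HTMLInputElement>("#checkout input").forEach((field) => {
          field.addEventListener("blur", () => {
            const event = {
              field: field.name,
              leftEmpty: field.value.trim() === "",
              failedValidation: !field.checkValidity(),
              at: Date.now(),
            };
            // sendBeacon keeps working even if the user navigates away right after
            navigator.sendBeacon("/field-events", JSON.stringify(event));
          });
        });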


      I'm also fond of tracking metrics. In-person user testing is good for things you know to ask about, but you need live site metrics to catch things that you didn't even know had changed. 

      Not sure I agree with that. It might be the way you (or the person doing the testing) are conducting the tests. Our testing format uses an open format with a discovery method. We have some predefined tasks based on goals that we know/hope people are trying to accomplish with the site. This info comes from combined sources (e.g. metrics, sales, marketing, customer service, customer feedback). However, that's not all of it; we always include open-ended discovery time to watch for things we don't expect, anticipate, or couldn't plan for: unexpected tasks. We've done this in pretty much every test in the last couple of years, and every time we find a number of new features, functions, and potential lines of revenue for our client. 

      And once you have some metrics, A/B testing can really pay off. Suppose you want to know which of three new landing page versions increases signups by 5%. You can't do that with a 12-person user test. But you can put each version up in parallel for a week with users randomly assigned to each, and get thousands or tens of thousands of data points.

      True. Our method selection is goal driven. What's your goal? That drives your method. Just to provide the counterpoint: the downside of A/B testing the way you're suggesting is that while it will tell you that one model increased signups by 5%, it won't tell you why. A quick 12-person study will tell you why and give you guidance on which one would probably increase signups. You then take that info and challenge/validate it with a quantitative study like you suggest. Or the reverse: take your A/B results and do a supplemental 12-person study to find out why. 

      Answering the why will give you far more from a design insight perspective than just seeing what happened.
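
      For what it's worth, the quantitative check itself doesn't take much machinery. A rough two-proportion comparison, sketched in TypeScript with invented traffic numbers, might look like:

        // Back-of-the-envelope check, not a substitute for a real stats package:
        // is the difference in signup rate between two variants bigger than noise?
        function twoProportionZ(conv1: number, n1: number, conv2: number, n2: number): number {
          const p1 = conv1 / n1;
          const p2 = conv2 / n2;
          const pooled = (conv1 + conv2) / (n1 + n2);
          const se = Math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2));
          return (p2 - p1) / se;
        }

        // Invented numbers: 10,000 visitors per variant, 4.0% vs 4.4% signing up.
        console.log(twoProportionZ(400, 10000, 440, 10000).toFixed(2));
        // Prints about 1.41; the usual 95% bar is roughly |z| > 1.96, so even this
        // lift isn't clearly outside noise yet, which is why sample size matters.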


      Cheers!

      Todd Zaki Warfel
      President, Design Researcher
      Messagefirst | Designing Information. Beautifully.
      ----------------------------------
      Contact Info
      Voice: (215) 825-7423
      Email: todd@...
      ----------------------------------
      In theory, theory and practice are the same.
      In practice, they are not.

    • William Pietri
      Message 2 of 29, Mar 25, 2008
        Hi, Todd. I think we agree entirely that people should do both
        quantitative and qualitative research, and both have their strengths. A
        few responses to minor points.

        Todd Zaki Warfel wrote:
        > Well, 4 isn't enough. [...]
        > 1. I would suspect that the "minor user interface issues" would have
        > been easily corrected with simply having a good informed interaction
        > designer or usability specialist assess the interface.
        >
        > 2. Did you do a 12 user study on this interface?

        I work mainly with startups and small companies. Perhaps you can do a
        12-user study more effectively than they can, but it's well beyond
        something a lot of small shops can afford to do on a regular basis.
        Doing 4-6 people once a month is more their speed.

        It may be that their designers don't meet the level you consider good.
        However, they are generally the best the company has been able to find.
        Whatever practices I recommend have to work in that context, which
        frequently includes people wearing multiple hats.

        > I'll bet that if you did, you would have found the same issues—I've
        > done this literally hundreds of times.

        I'm sure you win that bet a lot, but in this case you would have lost.

        One substantial cause of failure was international addresses. The cost
        of a multi-continent usability study surely makes sense for some people,
        but not for the sums involved in this case. At the cost of a couple of
        days' time around the office, though, it paid off nicely.

        > If you didn't, how would you know that it's beyond what you can find
        > from a 12 person study?

        Well, I said that because of a little math. Perhaps I'm doing it wrong,
        but if only 1-2% of people have some issue, the odds of finding that
        particular issue in a 12-person test don't seem particularly high. And
        if only 1 of 12 has a problem, it would be hard to say whether it's a
        pattern or a fluke. Whereas with 10,000 data points, you'll be able to
        do solid ROI calculations so that you know which fixes are worth the effort.
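
        Spelling that out, and assuming each user independently has the same chance of hitting the issue, a quick sketch of the arithmetic:

          // Chance that an n-person test surfaces an issue at least once, assuming
          // each participant independently hits it with probability p.
          const atLeastOnce = (p: number, n: number) => 1 - Math.pow(1 - p, n);

          console.log(atLeastOnce(0.01, 12).toFixed(2)); // ~0.11 for a 1% issue
          console.log(atLeastOnce(0.02, 12).toFixed(2)); // ~0.22 for a 2% issue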

        > We use web metrics to help identify key abandonment areas, then
        > in-person field studies to find out the why. For example [...]

        I find it weird to keep saying this, but I really like in-person
        studies. I think they are the bee's knees. Honest. I do them every chance
        I get.

        The only reason I got involved in this thread was to mention how on-line
        testing indeed can capture some things you said it couldn't, and to
        mention how some people I know are doing it. Nobody should feel obliged
        to do it that way, and they certainly shouldn't stop doing in-person
        studies.

        William
      • Todd Zaki Warfel
        Message 3 of 29, Mar 25, 2008

          On Mar 25, 2008, at 4:10 AM, William Pietri wrote:
          One substantial cause of failure was international addresses. The cost of a multi-continent usability study surely makes sense for some people[...]

          This comes down to a recruiting issue. You don't need to do a multi-continent study to find this; you can use Craigslist to recruit international people to solve this issue. During the study, just have them use their address from home, or from the country they came from before they got here. 

          Now, if this is something you don't think of to test in the study, then that's another story. 

          Quick question for you: how did you find this with the on-line study? What did you do to measure/find something like this using an on-line study? The reason I ask is that it would be nice for others to know the technique, so they could use it to look for this when they're testing (which is why I included the Craigslist item above).

          Well, I said that because of a little math. Perhaps I'm doing it wrong, but if only 1-2% of people have some issue, the odds of finding that particular issue in a 12-person test don't seem particularly high. And if only 1 of 12 has a problem, it would be hard to say whether it's a pattern or a fluke. Whereas with 10,000 data points, you'll be able to do solid ROI calculations so that you know which fixes are worth the effort.

          Aw, see, that's the flaw in the equation. First, what you're looking for is a pattern. Second, what you're looking for might be a smaller issue by sheer numbers, but one you know is significant. For example, we recently had an issue I cited earlier with someone getting hung up at registration due to a space in their name. That was one person, but it's something you know is a showstopper. In short, it depends on what the issue is. Some of "knowing this" only comes with time and experience. Some of it is a no-brainer. Some of these items are easier to see with 10,000 data points. Most of them are going to be something you'll see a pattern of with 5-6 people and confirm with 8-12. 

          The point is that if you start to see it in a few people in a 12-person study, you're going to see it in hundreds or thousands with 10,000. I think the issue is that people get hung up on sheer numbers instead of percentages. We rarely have issues that aren't found by 70% or more of participants in an 8-12 person study. Either way, we make sure that when we report we include:
          1. The percentage of people reporting it (e.g. 7/10 experienced this)
          2. The percentage of people it was an issue for (e.g. 8/10)
          3. If it's a small number, the number of people affected and why we think it might require further investigation.

          For example, if we did a 10-person study:
          An issue affecting 70% of the 100% who encountered it would be reported as 7/10.
          An issue affecting 50% of the 80% who encountered it would be reported as 4/8.

          It's particularly important to be accurate about the reporting. Simply stating "4 of our participants" isn't accurate; you need to indicate that 8 came across the issue and that of those 8, 4 (or 50%) had trouble with it.

          I find it weird to keep saying this, but I really like in-person studies. I think they are the bee's knees. Honest. I do them every chance I get.

          Obviously, I agree, but I also think that remote studies are extremely beneficial. One of the biggest benefits is that participants are using their own machine, so you get to see how they access things (e.g. bookmarks, pop-up blockers). Additionally, it enables you to do research with geographically dispersed audiences. 

          For example, we did an ethnographic-based study last year for a client who had employees across the world. We did 48 interviews. We couldn't afford to fly there (time/budget), so we used remote screen sharing and phones to do the research. We had some very interesting findings and remote studies were the only way we could have done this.


          Cheers!

          Todd Zaki Warfel
          President, Design Researcher
          Messagefirst | Designing Information. Beautifully.
          ----------------------------------
          Contact Info
          Voice: (215) 825-7423
          Email: todd@...
          ----------------------------------
          In theory, theory and practice are the same.
          In practice, they are not.

        • William Pietri
          Message 4 of 29, Mar 25, 2008
            Todd Zaki Warfel wrote:
            >
            > Quick question for you, how did you find this w/the on-line study?
            > What did you do to measure/find something like this using an on-line
            > study? The reason I ask is that it would be nice for others to know
            > the technique so they could use it to look for this when they are
            > testing (reason I included the Craigslist item above).

            We took a running site and instrumented things so that we could see the
            raw submissions from every sign-up attempt, successful or failed. Then
            we let it run for a while and sifted through the failures looking for
            patterns. Some of the issues discovered involved the three-way
            interaction of the user, our code, and external credit-card processors,
            who have yet different ideas of what constitutes a valid address.
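
            If it helps to make that concrete, the sifting step amounts to something like this sketch (the JSON-lines log format and file name are invented, not what we actually used):

              // Group failed sign-up attempts by the reported error and the country,
              // then eyeball the biggest buckets.
              import { readFileSync } from "fs";

              const counts = new Map<string, number>();

              for (const line of readFileSync("signup-attempts.log", "utf8").split("\n")) {
                if (line.trim() === "") continue;
                const attempt = JSON.parse(line);
                if (attempt.ok) continue;
                const key = `${attempt.error ?? "unknown"} / ${attempt.country ?? "??"}`;
                counts.set(key, (counts.get(key) ?? 0) + 1);
              }

              [...counts.entries()]
                .sort((a, b) => b[1] - a[1])
                .slice(0, 20)
                .forEach(([key, n]) => console.log(n, key));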

            That definitely doesn't catch everything, as the user has to get as far
            as clicking the submit button, which is why next time I try this I'd
            like to do a little AJAX instrumentation, uploading a record of
            keypresses, pauses, mouse movements, and the like. That still won't
            catch everything, of course. Like Col. Prescott, I like to see the
            whites of their eyes. And as you say, it can still leave the "why" a puzzle.


            > The point is that if you start to see it in a few people in a 12
            > person study, you're going to see it in hundreds or thousands with 10,000.

            I'm certainly not denying that there are a lot of great issues that will
            show up in a 12-person study, and that those issues will also show up in
            larger samples. I'm concerned about the opposite case. If I see it in 100
            of 10,000 (1%), then I may not see it in 1 of 12 people (8%), and I'm even
            less likely to see it in 1 of 5 (20%).

            I suspect there's also a question of relative expertise. I think you
            mentioned you have done hundreds of these studies, and clearly you have
            spent a lot of time thinking about the usability of interfaces. The kind
            and volume of issues that you can surface in a 5-person study are
            probably far beyond what a self-taught designer can manage in between
            cranking out HTML and tweaking the JavaScript. That may explain how you
            extract so much value from them.


            William
          • Todd Zaki Warfel
            Message 5 of 29, Mar 25, 2008
              Sorry, should have been more specific. What I'm interested in is exactly how you "instrumented things," or what you did to capture everything so you were able to tell exactly what individual fields/items were the culprit. I'd love to know more about this technique as an option to use in the future. 
              On Mar 25, 2008, at 4:56 PM, William Pietri wrote:
              We took a running site and instrumented things so that we could see the 
              raw submissions from every sign-up attempt, successful or failed.


              Cheers!

              Todd Zaki Warfel
              President, Design Researcher
              Messagefirst | Designing Information. Beautifully.
              ----------------------------------
              Contact Info
              Voice: (215) 825-7423
              Email: todd@...
              ----------------------------------
              In theory, theory and practice are the same.
              In practice, they are not.

            • William Pietri
              Message 6 of 29, Mar 25, 2008
                Todd Zaki Warfel wrote:
                > Sorry, should have been more specific. What I'm interested in is
                > exactly how you "instrumented things," or what you did to capture
                > everything so you were able to tell exactly what individual
                > fields/items were the culprit. I'd love to know more about this
                > technique as an option to use in the future.

                Ah, I see. Sorry for the confusion.

                On one occasion, where credit card usage was the primary focus, the
                system had already been designed to record every request sent to the
                credit card processor, plus processor responses, so that was a rich seam
                of data to mine. After various UI changes, we then monitored to make
                sure that we indeed solved the problems.

                On another, we found the particular place in the code where a
                form was submitted, and logged as much information as possible,
                including IP, browser info, and the details of the form submission. Then
                a little perl cleaned the output enough to feed it to a business
                analyst, who worked in collaboration with the designer to figure out
                what particular failures meant and how to fix them.
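
                The cleaning step itself is nothing fancy. As a rough illustration (sketched in TypeScript rather than the actual Perl, with an invented raw-log format and field names), it amounts to something like:

                  // Strip anything sensitive and flatten each submission into a CSV row
                  // the analyst can open in a spreadsheet.
                  import { readFileSync, writeFileSync } from "fs";

                  const rows = readFileSync("raw-submissions.log", "utf8")
                    .split("\n")
                    .filter((line) => line.trim() !== "")
                    .map((line) => JSON.parse(line))
                    .map((r) =>
                      // deliberately leaves out names, full addresses, card numbers, etc.
                      [r.timestamp, r.browser, r.result, r.fields?.country ?? ""].join(",")
                    );

                  writeFileSync(
                    "submissions.csv",
                    ["timestamp,browser,result,country", ...rows].join("\n")
                  );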

                On a third project, we started out wondering about these things early, and so
                had one bit of code near the heart of things that logged every bit of
                user input. I don't remember any formal studies that used it, but we'd
                often use it to answer some particular question, or to see what users
                were up to. That was especially handy when discussing whether or not
                users would really do something.

                The imagined ajaxification of this would be a little more complicated,
                collecting client-side events (like keypresses and mouse movements) and
                state information (transaction ids, current state of forms) and
                uploading them via background asynchronous requests. It'd be exciting to
                dig through that data, but one of the first things I'd look at is failed
                client-side validation. Another would be the amount of time and rework
                for individual fields.
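
                Concretely, the client-side half might look something like the sketch below. The /ux-events endpoint, the form selector, and the upload interval are all invented, and a real version would need throttling, sampling, and a privacy review before recording anything this fine-grained.

                  // Buffer keypress/blur/mouse events plus a bit of form state, then ship
                  // whatever has accumulated in the background every few seconds.
                  const buffer: Array<Record<string, unknown>> = [];

                  const record = (type: string, detail: Record<string, unknown>) =>
                    buffer.push({ type, at: Date.now(), ...detail });

                  document.querySelectorAll<HTMLInputElement>("form#signup input").forEach((field) => {
                    field.addEventListener("keydown", () => record("keypress", { field: field.name }));
                    field.addEventListener("blur", () =>
                      record("blur", { field: field.name, valid: field.checkValidity() })
                    );
                  });

                  document.addEventListener("mousemove", (e) =>
                    record("mouse", { x: e.clientX, y: e.clientY })
                  );

                  setInterval(() => {
                    if (buffer.length > 0) {
                      navigator.sendBeacon("/ux-events", JSON.stringify(buffer.splice(0)));
                    }
                  }, 5000);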

                People who are doing advanced work in this area include Netflix and
                Google. They had a great BayCHI presentation which is, alas, not up on
                the web yet, but I will mention it here when it is.

                Hoping that helps,

                William