- Message 1 of 29, Mar 20 1:16 PM
On Mar 20, 2008, at 1:17 PM, William Pietri wrote:
Well, 4 isn't enough. I wouldn't recommend any less than 5 and typically 8-12 is best, otherwise you don't have enough to start seeing significance in patterns. That's going to be one reason that 1000 is going to yield better results with web tracking: you don't have enough in your qualitative study.
A couple of things here:
1. I would suspect that the "minor user interface issues" would have been easily corrected with simply having a good informed interaction designer or usability specialist assess the interface.
2. Did you do a 12 user study on this interface? I'll bet that if you did, you would have found the same issues; I've done this literally hundreds of times. If you didn't, how would you know that it's beyond what you can find from a 12 person study?
We use web metrics to help identify key abandonment areas, then in-person field studies to find out the why. For example, we had a client who had a significant abandonment in one of their cart screens, but didn't know exactly which fields. They could have spent time coding up the fields w/JS to track every single one to figure it out. Instead we did a quick study w/12 people and found out that it was a combination of two fields that was causing the problem on that screen and exactly why they were an issue. Problem fixed.
Just a different approach. And yes, we used a mix of qual and quan, something we do quite often.
Not sure I agree with that. Might be the way you (or the person doing the testing) are conducting the tests. Our testing uses an open format with a discovery method. We have some predefined tasks based on goals that we know/hope people are trying to accomplish with the site. This info comes from combined sources (e.g. metrics, sales, marketing, customer service, customer feedback). However, that's not all of it: we always include open-ended discovery time to watch for things we don't expect, anticipate, or couldn't plan for (unexpected tasks). We've done this in pretty much every test in the last couple of years and every time find a number of new features, functions, and potential lines of revenue for our client.
True. Our method selection is goal driven. What's your goal? That drives your method.
Just to provide the counterpoint to that, the downside of A/B the way you're suggesting is that while it will tell you that one model increased signup by 5%, it won't tell you why. A quick 12 person study will tell you why and give you guidance on which one would probably increase sign-up. You then take that info and challenge/validate it with a quantitative study like you suggest. Or the reverse: take your A/B and do a supplemental 12 person study to find out why.
Answering the why will give you far more from a design insight perspective than just seeing what happened.
- Message 2 of 29, Mar 25 1:10 AM
Hi, Todd. I think we agree entirely that people should do both
quantitative and qualitative research, and both have their strengths. A
few responses to minor points.
Todd Zaki Warfel wrote:
> Well, 4 isn't enough. [...]
> 1. I would suspect that the "minor user interface issues" would have
> been easily corrected with simply having a good informed interaction
> designer or usability specialist assess the interface.
> 2. Did you do a 12 user study on this interface?
I work mainly with startups and small companies. Perhaps you can do a
12-user study more effectively than they can, but it's well beyond
something a lot of small shops can afford to do on a regular basis.
Doing 4-6 people once a month is more their speed.
It may be that their designers don't meet the level you consider good.
However, they are generally the best the company has been able to find.
Whatever practices I recommend have to work in that context, which
frequently includes people wearing multiple hats.
> I'll bet that if you did, you would have found the same issues--I've
> done this literally hundreds of times.
I'm sure you win that bet a lot, but in this case you would have lost.
One substantial cause of failure was international addresses. The cost
of a multi-continent usability study surely makes sense for some people,
but not for the sums involved in this case. At the cost of a couple of
days' time around the office, though, it paid off nicely.
> If you didn't, how would you know that it's beyond what you can find
> from a 12 person study?
Well, I said that because of a little math. Perhaps I'm doing it wrong,
but if only 1-2% of people have some issue, the odds of finding that
particular issue in a 12-person test don't seem particularly high. And
if only 1 of 12 has a problem, it would be hard to say whether it's a
pattern or a fluke. Whereas with 10,000 data points, you'll be able to
do solid ROI calculations so that you know which fixes are worth the effort.
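The "little math" here can be written out. A rough sketch, assuming each participant independently runs into the issue with probability p (an idealization, since real test participants are not a random sample of users): the chance that at least one of n participants hits it is 1 - (1 - p)^n. The function name is illustrative.

```python
def chance_of_seeing(p: float, n: int) -> float:
    """Probability that at least one of n independent participants hits an
    issue affecting a fraction p of users. Independence is an assumption."""
    return 1.0 - (1.0 - p) ** n


# Issue rates of 1% and 2%, for 5-person and 12-person tests.
for p in (0.01, 0.02):
    for n in (5, 12):
        print(f"p={p:.0%}, n={n}: {chance_of_seeing(p, n):.1%}")
```

Under that assumption, a 12-person test has roughly an 11% chance of surfacing a 1% issue at all and roughly a 22% chance for a 2% issue, and a single occurrence still would not show whether it is a pattern.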
> We use web metrics to help identify key abandonment areas, then
> in-person field studies to find out the why. For example [...]
I find it weird to keep saying this, but I really like in-person
studies. I think they are the bee's knees. Honest. I do them every chance
I get.
The only reason I got involved in this thread was to mention how on-line
testing indeed can capture some things you said it couldn't, and to
mention how some people I know are doing it. Nobody should feel obliged
to do it that way, and they certainly shouldn't stop doing in-person
- Message 3 of 29, Mar 25 5:19 AM
On Mar 25, 2008, at 4:10 AM, William Pietri wrote:
This comes down to a recruiting issue. You don't need to do a multi-continent study to find this; you can use Craigslist to recruit international people to solve this issue. During the study, just have them use their address from home, or the country they came from before they got here, to see this.
Now, if this is something you don't think of to test in the study, then that's another story.
Quick question for you, how did you find this w/the on-line study? What did you do to measure/find something like this using an on-line study? The reason I ask is that it would be nice for others to know the technique so they could use it to look for this when they are testing (reason I included the Craigslist item above).
> Well, I said that because of a little math. Perhaps I'm doing it wrong, but if only 1-2% of people have some issue, the odds of finding that particular issue in a 12-person test don't seem particularly high. And if only 1 of 12 has a problem, it would be hard to say whether it's a pattern or a fluke. Whereas with 10,000 data points, you'll be able to do solid ROI calculations so that you know which fixes are worth the effort.
Aw, see that's the flaw in the equation. First, what you're looking for is a pattern. Second, what you're looking for is something that might be a smaller issue by sheer number, but you know it's significant. For example, we recently had an issue I cited earlier with someone getting hung up at registration due to a space in their name. That was one person, but it's something you know is a show stopper. In short, it depends on what the issue is. Some of "knowing this" only comes with time and experience. Some of it is a no-brainer. Some of these items are easier to see with 10,000 data points. Most of them are going to be something where you'll see a pattern with 5-6 people and confirm it with 8-12.
The point is that if you start to see it in a few people in a 12 person study, you're going to see it in hundreds or thousands with 10,000. I think the issue is that people get hung up on sheer numbers instead of percentages. We rarely have issues that aren't found in 70% or more of participants in an 8-12 person study. Either way, we make sure that when we report we include:
1. Percent of people reporting (e.g. 7/10 experienced this)
2. Percentage of people it was an issue with (e.g. 8/10)
3. If it's a small number, then the number of people and why we think it might require further investigation.
For example if we did a 10 person study:
Issue 70% of 100% reporting would be 10/10
Issue 50% of 80% reporting 4/8
Being accurate in the reporting is particularly important. Simply stating "4 of our participants" isn't accurate; you need to indicate that 8 came across the issue and, of those 8, 4 (50%) had trouble with it.
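As a small illustration of that reporting rule, here is a hedged sketch of a helper that always states both numbers: how many participants came across the item, and how many of those had trouble. The function name and the wording of the output are assumptions, not the format used in the actual reports.

```python
def report_issue(total: int, encountered: int, had_trouble: int) -> str:
    """Describe an issue as 'X of N came across it; Y of X had trouble'."""
    if total <= 0 or not (0 <= had_trouble <= encountered <= total):
        raise ValueError("need 0 <= had_trouble <= encountered <= total and total > 0")
    line = f"{encountered}/{total} participants came across this"
    if encountered:
        # Percentage is taken over those who encountered it, not over everyone.
        pct = 100 * had_trouble / encountered
        line += f"; {had_trouble}/{encountered} of them ({pct:.0f}%) had trouble with it"
    return line


print(report_issue(total=10, encountered=8, had_trouble=4))
```

For the 10-person example, this prints "8/10 participants came across this; 4/8 of them (50%) had trouble with it" rather than a bare "4 participants".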
> I find it weird to keep saying this, but I really like in-person studies. I think they are the bee's knees. Honest. I do them every chance I get.
Obviously, I agree, but I also think that remote studies are extremely beneficial. One of the biggest benefits is that they're using their own machine and you get to view how they access things (e.g. bookmarks, pop-up blockers). Additionally, it enables you to do research w/geographically dispersed audiences.
For example, we did an ethnographic-based study last year for a client who had employees across the world. We did 48 interviews. We couldn't afford to fly there (time/budget), so we used remote screen sharing and phones to do the research. We had some very interesting findings and remote studies were the only way we could have done this.
- Message 4 of 29, Mar 25 1:56 PM
Todd Zaki Warfel wrote:
> Quick question for you, how did you find this w/the on-line study?
> What did you do to measure/find something like this using an on-line
> study? The reason I ask is that it would be nice for others to know
> the technique so they could use it to look for this when they are
> testing (reason I included the Craigslist item above).
We took a running site and instrumented things so that we could see the
raw submissions from every sign-up attempt, successful or failed. Then
we let it run for a while and sifted through the failures looking for
patterns. Some of the issues discovered involved the three-way
interaction of the user, our code, and external credit-card processors,
who have yet different ideas of what constitutes a valid address.
That definitely doesn't catch everything, as the user has to get as far
as clicking the submit button, which is why next time I try this I'd
like to do a little AJAX instrumentation, uploading a record of
keypresses, pauses, mouse movements, and the like. That still won't
catch everything, of course. Like Col. Prescott, I like to see the
whites of their eyes. And as you say, it can still leave the "why" a puzzle.
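The server-side part described above (record every sign-up attempt, then sift the failures for patterns) might look roughly like the sketch below. It assumes each attempt was logged as one JSON line; the field names `ok`, `error`, and `country` and the sample records are hypothetical, made up for illustration and not taken from the actual project.

```python
import json
from collections import Counter


def failure_patterns(log_lines):
    """Tally failed attempts by (error, country), most common first."""
    counts = Counter()
    for line in log_lines:
        attempt = json.loads(line)
        if not attempt["ok"]:
            counts[(attempt["error"], attempt["country"])] += 1
    return counts.most_common()


# Made-up sample log, one JSON record per sign-up attempt.
sample = [
    '{"ok": true, "error": null, "country": "US"}',
    '{"ok": false, "error": "invalid_postcode", "country": "GB"}',
    '{"ok": false, "error": "invalid_postcode", "country": "GB"}',
    '{"ok": false, "error": "card_declined", "country": "US"}',
]
for (error, country), n in failure_patterns(sample):
    print(f"{n:3d}  {error}  {country}")
```

Grouping by more than one attribute is the point: a failure that looks rare overall can turn out to be concentrated in one group of users.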
> The point is that if you start to see it in a few people in a 12
> person study, you're going to see it in hundreds or thousands with 10,000.
I'm certainly not denying that there are a lot of great issues that will
show up in a 12-person study, and that those issues will show up in larger
samples. I'm concerned about the opposite case. If I see it in 100 of
10,000 (1%), then I may not see it in 1 of 12 people (8%), and I'm even
less likely to see it in 1 of 5 (20%).
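To put a number on the opposite case: a sketch of how many participants it would take to have a given chance of seeing an issue at least once, again assuming participants hit it independently with probability p. The function name and the 95% default are illustrative choices.

```python
import math


def participants_needed(p: float, confidence: float = 0.95) -> int:
    """Smallest n with 1 - (1 - p)**n >= confidence, assuming independence."""
    if not (0 < p < 1 and 0 < confidence < 1):
        raise ValueError("p and confidence must be strictly between 0 and 1")
    return math.ceil(math.log(1 - confidence) / math.log(1 - p))


print(participants_needed(0.01))  # issue affecting 1% of users
print(participants_needed(0.20))  # issue affecting 20% of users
```

Under these assumptions a 1% issue needs on the order of 300 participants for a 95% chance of appearing even once, while a 20% issue needs 14.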
I suspect there's also a question of relative expertise. I think you
mentioned you have done hundreds of these studies, and clearly you have
spent a lot of time thinking about the usability of interfaces. The kind
and volume of issues that you can surface in a 5-person study are
probably much superior to what a self-taught designer can do in between
extract so much value from them.
- Message 5 of 29, Mar 25 2:03 PM
Sorry, should have been more specific. What I'm interested in is exactly how you "instrumented things," or what you did to capture everything so you were able to tell exactly what individual fields/items were the culprit. I'd love to know more about this technique as an option to use in the future.
On Mar 25, 2008, at 4:56 PM, William Pietri wrote:
- Message 6 of 29, Mar 25 3:58 PM
Todd Zaki Warfel wrote:
> Sorry, should have been more specific. What I'm interested in is
> exactly how you "instrumented things," or what you did to capture
> everything so you were able to tell exactly what individual
> fields/items were the culprit. I'd love to know more about this
> technique as an option to use in the future.
Ah, I see. Sorry for the confusion.
On one occasion, where credit card usage was the primary focus, the
system had already been designed to record every request sent to the
credit card processor, plus processor responses, so that was a rich seam
of data to mine. After various UI changes, we then monitored to make
sure that we indeed solved the problems.
On another, we found the particular place in the code to which a
form was submitted, and logged as much information as possible,
including IP, browser info, and the details of the form submission. Then
a little Perl cleaned the output enough to feed it to a business
analyst, who worked in collaboration with the designer to figure out
what particular failures meant and how to fix them.
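A minimal sketch of that second approach, written in Python rather than the Perl mentioned above: log each form submission as a JSON line, then flatten the log to CSV for someone to review in a spreadsheet. All function names, record keys, and sample values are assumptions for illustration, not the original code.

```python
import csv
import io
import json
from datetime import datetime, timezone


def log_submission(log, ip, user_agent, fields, ok):
    """Append one form submission to `log` (a writable text file object).
    A real system should leave out card numbers, passwords, and the like."""
    record = {
        "time": datetime.now(timezone.utc).isoformat(),
        "ip": ip,
        "user_agent": user_agent,
        "ok": ok,
        "fields": fields,
    }
    log.write(json.dumps(record) + "\n")


def to_csv(log_lines, field_names):
    """Flatten logged records into CSV text, one column per form field."""
    out = io.StringIO()
    writer = csv.writer(out)
    writer.writerow(["time", "ip", "user_agent", "ok", *field_names])
    for line in log_lines:
        r = json.loads(line)
        values = [r["fields"].get(name, "") for name in field_names]
        writer.writerow([r["time"], r["ip"], r["user_agent"], r["ok"], *values])
    return out.getvalue()


# Usage with an in-memory log and made-up values.
log = io.StringIO()
log_submission(log, "203.0.113.7", "ExampleBrowser/1.0",
               {"country": "DE", "postcode": "10115"}, ok=False)
print(to_csv(log.getvalue().splitlines(), ["country", "postcode"]))
```

Keeping the raw record as structured data and doing the cleanup in a separate step means the analyst's view can change without touching the logging code.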
On a third project, we started out wondering these things early, and so
had one bit of code near the heart of things that logged every bit of
user input. I don't remember any formal studies that used it, but we'd
often use it to answer some particular question, or to see what users
were up to. That was especially handy when discussing whether or not
users would really do something.
The imagined ajaxification of this would be a little more complicated,
collecting client-side events (like keypresses and mouse movements) and
state information (transaction ids, current state of forms) and
uploading them via background asynchronous requests. It'd be exciting to
dig through that data, but one of the first things I'd look at is failed
client-side validation. Another would be the amount of time and rework
for individual fields.
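The collection side of this would live in the browser, but the analysis of the uploaded events could be sketched as follows. It assumes a hypothetical event format of (seconds, field, kind) tuples with kind being "focus" or "blur", and treats every return to a field after the first visit as rework; both are simplifying assumptions.

```python
from collections import defaultdict


def field_stats(events):
    """events: (seconds, field, kind) tuples in time order, kind is
    "focus" or "blur". Returns per-field focused time and rework count,
    where rework is the number of visits after the first."""
    focused_at = {}
    seconds = defaultdict(float)
    visits = defaultdict(int)
    for t, field, kind in events:
        if kind == "focus":
            focused_at[field] = t
            visits[field] += 1
        elif kind == "blur" and field in focused_at:
            seconds[field] += t - focused_at.pop(field)
    return {f: {"seconds": seconds[f], "rework": visits[f] - 1} for f in visits}


# Made-up session: the user returns to the postcode field after the card field.
events = [
    (0.0, "name", "focus"), (4.0, "name", "blur"),
    (4.0, "postcode", "focus"), (9.0, "postcode", "blur"),
    (9.0, "card", "focus"), (20.0, "card", "blur"),
    (20.0, "postcode", "focus"), (31.0, "postcode", "blur"),
]
for field, s in field_stats(events).items():
    print(field, s)
```

In this made-up session the postcode field stands out with 16 seconds of focus and one revisit, which is the kind of signal that would prompt a closer look.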
People who are doing advanced work in this area include Netflix and
Google. They had a great BayCHI presentation which is, alas, not up on
the web yet, but I will mention it here when it is.
Hoping that helps,