Yesterday we had to roll back some production code to its previous build. It was essentially a performance problem. The new bits kept timing out for the types of requests users were making from the system.

Our poor, battered support team had to send out a user notice that said something like “Sorry, users, for all the timeouts. We have just rolled back the code until we can fix the problems. Oh, and by the way, that means you’ll no longer have those new features we just gave you.”

Shortly thereafter, the head of my department indirectly asked me why these issues were not caught in testing. …This is the stuff nightmares are made of. I felt like throwing up, resigning, or crying. Once my heart started beating again, I realized I had logged a bug for these problems. Exhale now.

The bug was fairly accurate, stating that users would get timeout errors if they attempted anything beyond a specific threshold. During a bug meeting, it was downgraded in priority and the team decided to move forward with the release. In hindsight, there was a bit of groupthink going on. We were so anxious to get the release out that we figured the users would put up with the performance problems. Boy, were we wrong.

Being a tester is scary. At any given moment, all hell may break loose in production. And when it does, asking why the tester didn’t find it is, of course, the fairest of questions.

8 comments:

  1. Anonymous said...

    Good post, Eric.

    As the developer of those reports, I have mixed emotions about what happened.

    On one hand, it's work; so I don't want to seem irrationally concerned. There's more to life, of course.

    But, on the other hand, I ask myself, "Why didn't I question what we were doing more?"

    On top of that, I was really giving my all to get those reports out. I really wanted that release to be the turning of the tide--where we actually delivered a lot of reports on time, and where we hopefully made a huge leap into the backlog of reports, which would ultimately make way for cooler things for us to work on afterwards.

    So, after all that, well, you know the rest.

    There are two explanations I can come up with for why it happened the way it did:

    1. There was a lot of groupthink.
    2. There was a little groupthink, but at the end of the day it does not matter. Reports are of so little importance that no one seems very bothered by what happened. Because you and I were so involved as dev/tester, our perception of the problem is magnified.

    If it's the former, then our whole team could use some serious self-examination.

    If it's the latter, then we should probably just "chillax".

    In either case, let's chillax.

    Again, it's just work, so I don't want to seem irrationally concerned. There's more to life, of course.

    Love the blog.

  2. Mark Waite said...

    Cem Kaner tells a similar story in the Black Box Software Testing Bug Advocacy course by the Association for Software Testing. He was working as a consultant for a large electronics manufacturer, and they asked him to evaluate failures in the field. He found that in many (possibly even most) cases, the field-reported failure had already been reported by a tester before the release of the product. Unfortunately, those reports had either suffered from "groupthink" as you described, or were so poorly worded that the triage team did not understand the actual severity, or had not been sufficiently investigated by the tester to present the "true severity" of the problem.

    The Bug Advocacy course is free to members of the Association for Software Testing, although it does have a prerequisite: you must pass the Black Box Software Testing Foundations course (also free to members).

  3. Mark Waite said...

    Thanks for posting about your challenges! It is much easier to post about things that went well, or looked good on the outside. When we switched to Extreme Programming a few years ago, we made grave mistakes by assuming that automated testing was enough, and shipped some bad releases. We learned more about testing (and how to improve it) from those bad releases than from the good releases that followed.

  4. Alex said...

    The tester is not solely responsible for quality. That is the job of the entire team. I think part of the reason it's easy for teams to fall into this groupthink about bugs is that the testers are marginalized to the fringes and looked at as a hindrance more than a help to the team -- i.e., a roadblock to getting out the features the team really wants to release.

    If a bug like that existed, the feature should never have been released, period. If you have so many bugs that you are commonly making these types of decisions, it suggests that you're trying to create too many features simultaneously. Doing this has the appearance of working faster, but it is actually slowing you down, as exemplified by the effort of release rollbacks, patches, bug management, etc.

    I know, I know, you're dealing with reality on your team, and you should be commended as always for working as well as you can in your context.

  5. Mark Waite said...

    I disagree with Alex. I think it is a case of 20/20 hindsight to declare "If a bug like that existed, the feature should never have been released, period".

    When I've made these types of release decisions (and later worried that I made the wrong decision), the decisions were being made in the context (and enthusiasm) of releasing an exciting new version, with new capabilities which the users requested. There were some known problems, but the "lens of the moment" made me think that the new capabilities were so worthwhile that it would be unwise to delay the release. When I made those types of decisions, I was biased towards what Joel Spolsky calls "shipping is a feature" and away from shipping with fewer known bugs.

    If the timeouts had been infrequent or undetected (due to light user workload, or infrequent exercise of the timeout cases), then I suspect the new version would have been heralded as a great success.

    I'm not in Eric's company, so I can't be sure his team made the same types of mistakes I made in my projects. I've certainly made worse mistakes than he describes (shipping flawed software to paying customers, then having to apologize for those flaws).

  6. axis tech said...

    These types of challenges are faced by testers most of the time during any testing phase of a software build. There is no reason to become scared or nervous.

  7. Mark Waite said...

    I disagree with axis tech. There are plenty of reasons for a tester to find things "scary" or to be "nervous".

    If there are people who depend on my work (which I hope there are), then there should be some "scary" in wanting to assure that I've done excellent work for them.

    If there are people who will be disrupted if my work does not arrive soon enough, or correctly enough, or respond fast enough, then I think it is proper to be nervous, actively trying to assure their needs are met.

    Possibly the words "scary" and "nervous" don't convey enough of the "business dignity" associated with programming and testing, but they certainly do resonate with the feelings I've had in the trenches of programming and testing.

  8. Hagai Jacobson said...

    Scarier moment in testing: while flight-testing a rocket launch system (and sitting inside the launcher), after firing the first rocket in the salvo, all the red lights turned on and the words "Hardware Failure / Software Failure" appeared on screen while the Arm/Fire button lights blinked frantically, even though the system was still on safe. (The rest of the salvo was still inside the launcher, and we had no way to know whether it was armed or not.)

    After getting on the comms and a very stressful 10 minutes, we were told to turn everything off, then on again, and try to disarm everything.

    Thankfully, it worked.


