Your application under test (AUT) probably interfaces with external systems. Fact: these external systems will go down at some point while a user is attempting to use your AUT.

Here is the obvious test:

  1. Take ExternalSystemA down. If this is outside your control, simulate ExternalSystemA’s outage by changing where your AUT points for ExternalSystemA.
  2. Trigger whatever user operations cause your AUT to interface with ExternalSystemA.
Expected Results: The user gets a friendly message indicating some functionality is blocked at this time. The support team gets an error alert indicating ExternalSystemA is not responding.
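The expected behavior above can be sketched in a few lines. This is a minimal, hedged sketch, not the AUT's real code: `ExternalSystemA`, `alert_support`, and `get_order_details` are all hypothetical names invented for illustration, and the "outage" is simulated by raising `ConnectionError`.

```python
# Hypothetical sketch: what "friendly message + support alert" might look like
# when ExternalSystemA is down. All names here are invented for illustration.

class ExternalSystemA:
    """Stand-in for the real external system; in this sketch it is 'down'."""
    def fetch(self, order_id):
        raise ConnectionError("ExternalSystemA is not responding")

alerts = []

def alert_support(message):
    # A real AUT might page the support team or write to a monitoring system.
    alerts.append(message)

def get_order_details(system, order_id):
    try:
        return system.fetch(order_id)
    except ConnectionError as exc:
        # Alert support with the root cause, but show the user something gentle.
        alert_support(f"ExternalSystemA is not responding: {exc}")
        return "Order details are unavailable right now; other features still work."

result = get_order_details(ExternalSystemA(), order_id=42)
```

The point of the test is that both outputs happen: the user-facing fallback and the support alert naming the failed system.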



We executed the above test for 6 or 7 external systems and got our AUT robust enough to block only minimal functionality while providing good communication to users and the support team. However, just when we were getting cocky, we encountered a slight variation on the first test that crippled our AUT. Here is the test we missed for each external system.

  1. Put ExternalSystemA into a state where it is up and running but cannot respond to your AUT within the amount of time your AUT is willing to wait. Note: We were able to simulate this by taking down ExternalSystemB, which gets called by ExternalSystemA.
  2. Trigger whatever user operations cause your AUT to interface with ExternalSystemA.
Expected Results: The user gets a friendly message indicating some functionality is blocked at this time. The support team gets an error alert indicating ExternalSystemB is not responding.
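The difference from the first test is that a plain connection-error handler never fires: the system accepts the call but does not answer in time, so the AUT must enforce its own deadline. Here is a minimal sketch of that, using Python's standard `concurrent.futures` to impose a timeout; the slow call, the deadline value, and the message text are all hypothetical.

```python
# Hypothetical sketch: ExternalSystemA is "up" but blocked on ExternalSystemB,
# so it responds slower than the AUT is willing to wait.
import concurrent.futures
import time

def slow_external_call():
    """Simulates ExternalSystemA stalled waiting on ExternalSystemB."""
    time.sleep(0.2)  # longer than the AUT's deadline below
    return "data"

with concurrent.futures.ThreadPoolExecutor(max_workers=1) as pool:
    future = pool.submit(slow_external_call)
    try:
        data = future.result(timeout=0.05)  # the AUT's deadline
        outcome = "ok"
    except concurrent.futures.TimeoutError:
        # Same friendly degradation as a hard outage, even though the
        # system is technically up.
        outcome = "Some functionality is blocked at this time."
```

If the AUT has no deadline at all, this test tends to surface as hung threads or frozen screens rather than a tidy error, which is exactly the "crippled" behavior described above.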

2 comments:

  1. wwcd (what would crayons do) said...

    I have a hard time reconciling my feelings on this issue. On the one hand, in a perfect world you'd like a friendly message for the user, letting them "continue working" when a system goes down instead of just crashing.

    On the other hand, when you allow a system to "continue working" while displaying "not found", etc., there are ramifications to letting the user proceed, such as printing a report that contains the missing data, or saving missing data into a database. I remember an email from one user saying that "not found" should never again appear on a final printed report.

    Deciding when it's important to *not* allow the user to continue vs. when to allow it makes the implementation inconsistent across the board.

    It also forces the developer to spend a lot of time placing try/catches everywhere, since we call external systems in a lot of places. That makes the code brittle: it carries extra logic that might be triggered only 1% of the year, when the external system is down.

    I guess in my mind the benefits don't outweigh the costs, given how rarely it could occur, but I suppose there could be that rare time we need something to run regardless of whether we can get the data, and then try to fix it up afterwards somehow.
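[Editor's note: the try/catch-everywhere cost raised in the comment above can be reduced by centralizing the handling. A hedged sketch, assuming Python; the decorator name, fallback value, and `lookup_customer` are all hypothetical.]

```python
# Hypothetical sketch: one decorator wraps every external-system call, so the
# rarely-triggered error logic lives in a single place instead of at each site.
import functools

alerts = []

def external_call(fallback):
    """Decorator: on connection/timeout failure, record an alert and return a fallback."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            try:
                return func(*args, **kwargs)
            except (ConnectionError, TimeoutError) as exc:
                alerts.append(f"{func.__name__} failed: {exc}")
                return fallback
        return wrapper
    return decorator

@external_call(fallback="not found")
def lookup_customer(customer_id):
    # Simulated outage of the external system this call depends on.
    raise ConnectionError("ExternalSystemA down")

value = lookup_customer(7)
```

Whether "not found" may flow onward into reports or the database is still the business decision discussed below; the decorator only keeps that decision in one place.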

  2. Eric Jacobson said...

    wwcd,

    Those are good reasons not to allow system usage to continue during interface outages. Said tests could have different expected results, e.g., the user gets a message that says "try again later" or something.

    It comes down to a business decision, so testers/devs don't have to decide on the expected results.

    For testers who don't know the expected results, these tests may still be worth running. Instead of resolving each with a Pass/Fail, the tester would just need to explain the results to the business.

    Your comment also raises the question, "can the users get themselves into trouble by using our product when ExternalSystemA is down?" A good tester should investigate this.

    When referring to user flexibility in our app, I've often heard Rob say, "we give our users enough rope to hang themselves."

    I think several have.


