Sam

In case of CrowdStrike, improve QA

In the event of an OS-level issue, are you able to continue working on another device or computer?

A global outage caused by a simple defect in an automatic update has been all over the news recently. It grounded flights, caused chaos in healthcare and, less critically but still painfully, stopped thousands of engineering teams from doing any work.

Machines have outages, “the cloud” is always online.

In the event of a computer malfunction, what can you do? Are you able to pick up another laptop and carry on? Or are you stuck with changes that you had not pushed, or with configuration files that you cannot recover easily?

In the constantly evolving world of cloud computing, being able to access systems from anywhere has become one of the main reasons so many companies are making the move.

Servers running in a room at the office are becoming a thing of the past. In the cloud you can spin up a new one in a few moments, quicker than a support team can scramble to your location. And in the event of a more widespread issue (unlikely, but it happened recently), support will take even longer to reach you, leaving you powerless to change anything.

Far fewer providers of online tooling had any issues, and where they did, the outage was contained and dealt with faster than at companies relying on on-prem equipment.

Adding a QA layer can multiply confidence.

Say you had a CI/CD server running all your tests in the office. What happens if this is impacted by the same issues? You are now stuck, unable to run your checks, and the company is no longer able to deliver on any functionality or fixes.

If some of your confidence lived online, you would still be able to assert and verify that things are working, even if at a degraded rate.
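As a rough illustration of "degraded but still running", here is a minimal Python sketch that picks the first test runner whose health check passes and trims the suite if that runner has a limit. Everything in it is an assumption made for the example: the `Runner` class, the `is_available` health checks, the runner names and the `max_tests` limit are all hypothetical, not part of any real CI product or of DoesQA.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Sequence, Tuple


@dataclass
class Runner:
    """A place where tests can run (hypothetical model for this sketch)."""
    name: str
    is_available: Callable[[], bool]   # hypothetical health check
    max_tests: Optional[int] = None    # None means "run everything"


def choose_runner(runners: Sequence[Runner]) -> Optional[Runner]:
    """Return the first runner whose health check passes, or None."""
    for runner in runners:
        try:
            if runner.is_available():
                return runner
        except Exception:
            # A health check that blows up is treated as "unavailable".
            continue
    return None


def plan_run(tests: Sequence[str], runners: Sequence[Runner]) -> Tuple[str, List[str]]:
    """Decide where to run and which tests fit on that runner."""
    runner = choose_runner(runners)
    if runner is None:
        return ("none", [])
    selected = list(tests)
    if runner.max_tests is not None:
        # Degraded mode: keep only the first (highest priority) tests.
        selected = selected[: runner.max_tests]
    return (runner.name, selected)


if __name__ == "__main__":
    office = Runner("office-ci", is_available=lambda: False)
    cloud = Runner("cloud", is_available=lambda: True, max_tests=2)
    print(plan_run(["login", "checkout", "search"], [office, cloud]))
```

The point of the ordering is that the on-prem runner stays the default, and the online one only takes over when the first is unreachable. Listing tests in priority order means the degraded run still covers what matters most.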

Running tests in the cloud has become common practice and, more recently, best practice. Repeatable, scalable test solutions are difficult to run internally, and large amounts of time and attention go into keeping the lights on rather than delivering value.

Let someone else have a sleepless night.

If online tooling goes down, you can raise a support ticket, and track its progress. It's not your problem, it’s theirs.

As a tooling vendor, we have made many areas of the application stack failure-tolerant. By this I mean we use Application Performance Monitoring (APM) services alongside anomaly detection to ensure we are aware as soon as something happens.
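For readers unfamiliar with anomaly detection, the idea can be sketched in a few lines of Python. This is a deliberately simple rolling z-score over response times, written purely as an illustration. It is not a description of how our monitoring (or any particular APM product) works, and the function name, window size and threshold are all assumptions chosen for the example.

```python
from statistics import mean, stdev
from typing import List, Sequence


def find_anomalies(
    samples: Sequence[float],
    window: int = 5,
    threshold: float = 3.0,
) -> List[int]:
    """Return indexes of samples that sit more than `threshold` standard
    deviations away from the mean of the `window` samples before them.

    `window` must be at least 2, because a standard deviation needs
    two or more data points.
    """
    anomalies: List[int] = []
    for i in range(window, len(samples)):
        history = samples[i - window:i]
        mu = mean(history)
        sigma = stdev(history)
        if sigma == 0:
            # Perfectly flat history: flag any change at all.
            if samples[i] != mu:
                anomalies.append(i)
            continue
        if abs(samples[i] - mu) / sigma > threshold:
            anomalies.append(i)
    return anomalies


if __name__ == "__main__":
    # Hypothetical response times in milliseconds, with one spike.
    latencies_ms = [100, 102, 98, 101, 99, 100, 450, 101]
    print(find_anomalies(latencies_ms))
```

A real setup would feed an alerting channel rather than print, and would likely use something more robust than a fixed window, but the principle is the same: compare what is happening now against what normal recently looked like.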

We have also invested heavily in disaster recovery, multi-region support and regular backups. This allows us to react to a large event like the recent one whilst minimising the impact on our users.

It’s probably worth mentioning here, without wanting to sound like we are bragging, that during all that chaos we did not have a single second of outage. Not a single service went offline, and no customer was affected. Sure, alerts were going off, but the processes and procedures that we had spent months building all fell into place. The stress was on us, not our customers.

So, what do I do?

It's great to have local running capabilities. But if you want to ensure that your testing is able to handle issues like the recent ones, consider placing some of your checks on cloud-enabled tools like DoesQA.

I am not advocating dropping all your hard work, but how much easier would the conversation be if you could say “Hey, we cannot run anything here, but I can grab any laptop and still run a large number of tests online”?

Used together with what you already have, online-enabled testing can drastically reduce stress, and allow you to deliver value from anywhere, anytime.

Until next time!
