
We didn't build AI features because everyone else is. We built them because understanding test failures at scale was taking too long. Here's how we approached it, what we optimised for, and why Automation Intelligence extends DoesQA rather than replacing what already works.
AI is not the product. It's a tool.
So many companies are rushing to become "AI-powered," but in doing so, they're watering down their original offering. What makes them unique? Their soul.
Let's settle the argument up front: AI is a great tool when used responsibly. It's also a market full of false hope and underdelivered promises.
When we decided to leap into this hype-driven world, we didn't read about how "AI is revolutionising testing" or anything like that. We looked at where our users spent most of their time and what could be handed off to something AI is actually good at: large-scale data summarisation.
The problem that pushed us here
DoesQA is the fastest way to create reliable, scalable, and powerful web tests. But because of that speed, our users quickly faced a growing issue.
If you have a pack of 100 tests running on every commit, and some of them suddenly fail, understanding what went wrong becomes increasingly difficult. The tests themselves aren't the bottleneck. The time lost is in reading results, tracing failures, and figuring out what actually changed.
So the first thing we built was a foundational set of tooling. Something we could integrate into our platform to extend it, not replace the years of hard work that went into building DoesQA in the first place.
The shopping list
There were a few things I needed before writing a single line of code:
Model agnostic. I'm not getting into bed with a single model provider. I need to use the right model for the right job, and allow cost optimisation too. I don't need an all-singing-all-dancing vision model to do spell checking.
Microservice architecture. Each part of this foundation needs to be self-sufficient, extendable, and most importantly, not become monolithic.
Hallucinations kept to a minimum. It's common to throw thousands of contextual tokens at a model and hope it figures out what's important. Not for us.
Extend the platform, not replace it. We've spent a lot of time thinking about our product, building something reliable and dynamic. We're not going to sell out on AI.
Gets better over time. We have a wealth of data. Over 5,000,000 tests and a heap of other useful information. Instead of starting from scratch, how can we tune this to become more personalised per account?
So, let's go shopping.
Model agnostic
There are many connectors for LLMs, but since we're already AWS-first, Bedrock quickly became the favourite. From there, I was able to subscribe to many model providers and models. Easy. Next on the list.
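To make "the right model for the right job" concrete, here's a minimal sketch of per-task model routing. The task names and routing table are illustrative, not our actual configuration; the model IDs are standard Bedrock identifiers, and the chosen ID would feed straight into a Bedrock Converse call.

```python
# Hypothetical per-task routing table: task names are illustrative, not
# DoesQA's actual configuration. Cheap models for simple jobs, richer
# models only where the task demands it.
MODEL_ROUTES = {
    "spell_check": "anthropic.claude-3-haiku-20240307-v1:0",        # cheap, fast
    "failure_summary": "anthropic.claude-3-5-sonnet-20240620-v1:0", # deeper reasoning
}

def pick_model(task: str) -> str:
    """Return the Bedrock model ID routed for this task."""
    try:
        return MODEL_ROUTES[task]
    except KeyError:
        raise ValueError(f"No model route configured for task: {task}")

# The result plugs into the Bedrock runtime, e.g.:
# bedrock.converse(modelId=pick_model("failure_summary"), messages=[...])
print(pick_model("spell_check"))
```

Because every service asks the router rather than hard-coding a model, swapping providers or downgrading a task to a cheaper model is a one-line config change.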
Microservice architecture
Each service is independent. Its own judge, jury, and executioner. I would never have built a monolithic system, purely because they're very hard to extend, maintain, and understand over time.
Reducing hallucinations
Everyone knows (or should know) that AI is the absolute king of making stuff up. It cannot say no to you. It wants to achieve the goal given in the prompt. So, very careful planning is needed to reduce this to a bare minimum.
This part of the foundation took the most time and a lot of testing. Instead of letting the model randomly generate data, we gave it the option to ask for more. That request then fetches content from our real CMS and knowledge bases, at both the account level and globally, to avoid the LLM filling in the blanks itself.
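The "ask for more" pattern can be sketched as a simple loop: the model either returns a summary or requests specific context, and each request is answered with real data from the CMS or knowledge base rather than left for the model to invent. Everything here is an assumed shape; the real service interfaces are internal.

```python
# Sketch of the "ask for more" loop. The message shapes, tool names, and
# knowledge-base lookups are assumptions for illustration only.
def summarise_failure(call_model, fetch_context, failure, max_rounds=3):
    """Loop until the model returns a summary instead of a context request."""
    messages = [{"role": "user", "content": failure}]
    for _ in range(max_rounds):
        reply = call_model(messages)
        if reply["type"] == "summary":
            return reply["text"]
        # The model asked for more context: fetch real data from the
        # knowledge base instead of letting it fill in the blanks.
        document = fetch_context(reply["query"])
        messages.append({"role": "assistant", "content": reply["query"]})
        messages.append({"role": "user", "content": document})
    return "Unable to summarise with available context."

# Stubbed model: first turn requests context, second turn summarises.
turns = iter([
    {"type": "need_context", "query": "selector history for #email"},
    {"type": "summary", "text": "Input was reset by a late re-render."},
])
result = summarise_failure(lambda messages: next(turns),
                           lambda query: f"kb result for: {query}",
                           "assertion failed on #email")
print(result)
```

The `max_rounds` cap matters: a model that keeps asking for context without converging gets cut off rather than allowed to loop forever.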
This is where the real value sits. Deep understanding of what the test is meant to do, what went wrong, a clear single objective of what to do next, and historical analysis if it's available.
For a user, this means a layered, practical breakdown rather than a wall of logs:
What was the test meant to do? Instead of having to read every step or trust the name of the test, the first section of the summary tells you exactly what was being tested.
What went wrong? Instead of throwing an error into a model and asking "what happened here", we provide the step that failed, the context of 10 to 15 steps before it, and rich artifacts like screenshots and console logs. An assertion might fail when checking that a field has a specific value. A generic summary would say "check that this field has the right value." Not very helpful. We can identify that the input was blocked or reset. The model has both text and visual aids, so the chances of fabricating an explanation are very, very small.
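Building that window of preceding steps is itself simple; the value is in what gets attached to each step. A minimal sketch of the slicing, with illustrative field names:

```python
# Minimal sketch of the failure context window: the failing step plus up
# to 15 steps before it. In practice each step would carry artifacts
# (screenshots, console logs); here steps are plain strings for brevity.
def context_window(steps, failed_index, lookback=15):
    start = max(0, failed_index - lookback)
    return steps[start:failed_index + 1]

steps = [f"step-{i}" for i in range(40)]
window = context_window(steps, failed_index=20)
print(len(window))
```

Capping the lookback keeps the prompt focused on the steps that plausibly caused the failure instead of flooding the model with the whole run.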
Has this happened before? Every test is linked to its run and its flow. Flows are linked to other runs, other tests. From a single failure, we can go back in time and understand: did the test change? If so, what changed, and could it have caused this failure? If not, we look over previous data and screenshots to identify differences.
Contextualised, clear, and real data means the model isn't doing heavy lifting. It's summarising rich data into something digestible and actionable. We're not taking anything away from our product. We've done the hard work to get this data into good shape. Now we're using all of it to show users real value.
Another pillar of the platform
This sits alongside what already makes DoesQA what it is. The intuitive flow builder allows one-of-a-kind test creation using branching and a colourful, easy-to-use interface. Our running infrastructure scales beyond what other platforms can handle, into the tens of thousands of concurrent runners. Now we have a reporting layer that deeply understands what you're testing and what has happened before.
DoesQA Automation Intelligence gets smarter over time, becoming more personalised and more relevant to our users every single day.
This is only the beginning of what we'll build on this foundation. The ideas and improvements to our platform from here are limitless.
Watch this space.
I'll be writing more over the coming weeks about the challenges we've faced and how we've overcome them, and I'm excited to share this with you all.