Most robotics engineers I've spoken to agree that simulation is valuable, and most of them use it in some form. But there's a big difference between running sims and building a testing culture that actually makes your team faster. In my experience, the gap usually comes down to the same five mistakes, made in roughly the same order.
1. Treating simulation as a final check rather than a continuous loop
The most common pattern I see is simulation being used as a gate at the end of a sprint. The feature is done, it works locally, now let's run a sim before we merge. This feels responsible, but it misses the point.
By the time you're running that final check, multiple engineers have made changes that interact with each other in ways nobody has fully mapped. When something breaks, the debugging process becomes archaeology. Which commit was it? Whose change? When did it start failing?
The whole value of automated simulation in a CI pipeline is that you find out immediately, on a per-commit basis, whether something broke. Not at the end of a sprint when you've already moved on mentally. The cost of a failure found at commit time is minutes. The cost of one found at sprint review is days.
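To make that concrete, here's a rough sketch of what a per-commit check can look like, written as a pytest test that CI runs on every push. The scenario names and the `run_sim_scenario.sh` script are placeholders; assume that script launches your sim headless, runs one short scenario, and exits non-zero on failure.

```python
# test_sim_smoke.py - a per-commit simulation smoke test (sketch).
# Assumes a hypothetical scripts/run_sim_scenario.sh that launches the
# simulator headless, runs one scenario, and exits non-zero on failure.
import subprocess

import pytest

SCENARIOS = ["straight_line", "obstacle_stop"]  # placeholder scenario names


@pytest.mark.parametrize("scenario", SCENARIOS)
def test_scenario_passes(scenario):
    result = subprocess.run(
        ["./scripts/run_sim_scenario.sh", scenario],
        capture_output=True,
        text=True,
        timeout=300,  # fail fast: a hung sim should break the build, not stall it
    )
    assert result.returncode == 0, (
        f"Scenario '{scenario}' failed on this commit:\n{result.stdout}\n{result.stderr}"
    )
```

The exact runner doesn't matter; what matters is that this runs on every commit, not at the end of the sprint.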
2. Not versioning your simulation environment
This one is subtle but it compounds badly over time. Your robot code is version controlled. Your simulation environment probably isn't, at least not properly.
The problem is that ROS package management doesn't lock dependency versions by default. Rebuild the same application a few weeks later and you can end up with different underlying dependencies, even from identical source code. The result is that your sim results from last month aren't actually comparable to your sim results this month, even if the robot code didn't change. You lose the ability to track performance over time, catch regressions, or understand what changed when something breaks.
The fix is containerisation. Docker or similar tools let you snapshot the entire simulation environment, not just the code, so that a test run from six months ago is genuinely reproducible today. It's additional setup upfront, but without it you're not really doing regression testing, you're just running experiments.
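As a rough sketch of what that looks like in practice, here's the shape of a Dockerfile that pins the environment rather than just the code. The image digest and package versions below are placeholders; the point is that everything the sim depends on is spelled out and frozen.

```dockerfile
# Pin the base image to an exact digest, not just a tag, so a rebuild
# months later starts from the same bytes. (Digest below is a placeholder.)
FROM ros:humble-ros-base@sha256:<digest>

# Pin apt packages to exact versions instead of "whatever is current today".
# The package name and version here are illustrative placeholders.
RUN apt-get update && apt-get install -y \
        ros-humble-gazebo-ros-pkgs=<pinned-version> \
    && rm -rf /var/lib/apt/lists/*

# Bake the worlds, launch files, and test scenarios into the image so a
# run is reproducible from the image alone.
COPY sim/ /opt/sim/
```

Once the environment lives in an image like this, "the sim results from six months ago" becomes an image tag you can actually pull and re-run.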
3. Chasing realism at the expense of repeatability
This is probably the most counterintuitive one. Engineers naturally want their simulation to be as physically accurate as possible, and that instinct makes sense. But for the purposes of CI, realism is less important than repeatability.
Here's the underlying problem: ROS, which most robotics teams are building on, has no way to run deterministically even in testing mode. At a low level, the same code on the same machine will behave differently between runs due to scheduling and transport layer non-determinism. Add in a high-fidelity physics sim and you've multiplied the sources of variance.
The result is flaky tests. A test that passes seven times out of ten isn't a test, it's noise. And flaky tests do something particularly damaging to engineering culture: they train people to re-run rather than investigate, until eventually the CI results get ignored entirely.
For a continuous testing pipeline to work, you need sim results you can trust. That means optimising for a simulation environment that gives you the same result on the same input every time, even if it's slightly less physically faithful than you'd like. You can always have a separate, more realistic sim for pre-hardware validation. But your CI sim needs to be boring and reliable.
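A toy illustration of the property you're optimising for: the "simulation" below is deliberately trivial, a seeded, fixed-timestep point mass standing in for whatever your real sim does, but it demonstrates the contract a CI sim should satisfy, namely that two runs with the same inputs produce identical results.

```python
# deterministic_sim_check.py - illustrates the repeatability contract for a CI sim.
# The "simulation" is a stand-in: a seeded, fixed-timestep 1-D point mass.
import random


def run_scenario(seed: int, steps: int = 1000, dt: float = 0.01) -> list[float]:
    """Run a toy point-mass sim with seeded sensor noise and a fixed timestep."""
    rng = random.Random(seed)          # all randomness flows from one explicit seed
    position, velocity = 0.0, 1.0
    trajectory = []
    for _ in range(steps):
        noise = rng.gauss(0.0, 0.001)  # seeded noise, not wall-clock or OS entropy
        velocity += -0.1 * position * dt
        position += (velocity + noise) * dt
        trajectory.append(position)
    return trajectory


def test_same_input_same_output():
    # The property a CI sim needs: identical inputs give identical outputs,
    # so any diff between runs is a real regression, not scheduling noise.
    assert run_scenario(seed=42) == run_scenario(seed=42)
```

Your real sim is obviously more complicated, but the test at the bottom is the one that has to keep passing.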
4. Running simulations locally
If your simulation only works on a specific engineer's machine, it is not a test. It is a demo.
Local sim setups accumulate hidden dependencies fast. A particular GPU driver, a specific version of a package installed manually six months ago, an environment variable set in someone's bash profile that nobody remembers. The simulation works on their machine and fails mysteriously everywhere else.
Beyond the portability problem, local sims can't scale. If you want to run a suite of scenarios on every pull request, you need compute that's available on demand, not a workstation that's in use for something else. Cloud-based sim infrastructure, triggered automatically by your version control workflow, is what makes continuous simulation testing actually practical rather than aspirational.
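For a sense of what "triggered automatically by your version control workflow" means, here's a minimal GitHub Actions sketch that runs the sim suite on every pull request inside a pinned container. The image name and test entry point are placeholders for whatever your team actually uses.

```yaml
# .github/workflows/sim-tests.yml - run the sim suite on every pull request.
name: simulation-tests
on: [pull_request]

jobs:
  sim:
    runs-on: ubuntu-latest
    # The pinned sim image from your container registry (placeholder name).
    container:
      image: ghcr.io/your-org/robot-sim:pinned
    steps:
      - uses: actions/checkout@v4
      # Placeholder entry point: runs the headless scenario suite and
      # exits non-zero if any scenario fails.
      - run: ./scripts/run_sim_suite.sh
```

The specifics of the CI provider matter much less than the trigger: the suite runs because someone opened a pull request, not because someone remembered to run it.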
5. Keeping simulation results siloed
The last mistake is the most avoidable. An engineer runs a sim, it produces some output, they look at it, make a decision, and move on. Nobody else sees the results. Nothing is logged in a shareable way. The knowledge lives in one person's head.
This breaks down in a few ways. When something fails in the field, you have no historical baseline to compare against. When a new engineer joins and wants to understand how the system has been behaving, there's nothing to look at. When a manager or technical lead wants to understand the state of the software, there's no evidence to show them.
Good simulation CI produces persistent, shareable output by default. Logs tagged with the commit, the environment, and the scenario. Pass/fail trends over time. Dashboards that anyone on the team can open. The simulation run shouldn't just validate the code, it should produce institutional knowledge.
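At the code level, "persistent and shareable by default" can be as simple as every run writing a record tagged with the commit, environment, and scenario to wherever your team keeps results. A small sketch, with placeholder field names and output path:

```python
# record_sim_result.py - write a shareable, commit-tagged record for every sim run.
import json
import os
import subprocess
from datetime import datetime, timezone


def record_result(scenario: str, passed: bool, metrics: dict, out_dir: str = "sim_results"):
    """Append a result record tagged with commit, environment, and scenario."""
    commit = subprocess.check_output(["git", "rev-parse", "HEAD"], text=True).strip()
    record = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "commit": commit,
        # Placeholder: however you identify the pinned sim environment,
        # e.g. the container image tag the run executed in.
        "environment": os.environ.get("SIM_IMAGE", "unknown"),
        "scenario": scenario,
        "passed": passed,
        "metrics": metrics,
    }
    os.makedirs(out_dir, exist_ok=True)
    path = os.path.join(out_dir, f"{commit[:12]}_{scenario}.json")
    with open(path, "w") as f:
        json.dump(record, f, indent=2)
    return path


if __name__ == "__main__":
    # Example usage with made-up metrics; in CI this output feeds a dashboard.
    record_result("obstacle_stop", passed=True, metrics={"max_tracking_error_m": 0.04})
```

Records like these are what a trend chart or dashboard gets built from; without them, every conclusion about the system's behaviour lives in somebody's memory.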
The pattern underneath all five
What connects these mistakes is that they all treat simulation as an individual activity rather than shared team infrastructure. One engineer running a careful simulation is better than nothing. But it doesn't compound. It doesn't get faster over time, it doesn't catch regressions automatically, and it doesn't produce the kind of shared visibility that lets a team move quickly with confidence.
The teams I've seen move fastest on robotics aren't necessarily the ones running the most sophisticated simulations. They're the ones who made simulation a boring, automatic, always-on part of how they ship code. That shift is harder than it sounds, but it's the one that actually changes your velocity.