The Right Idea: Testing in Production to Build Greater Resiliency

Shift-Left is a well-established practice in the DevOps community. In essence, it’s about moving key technical practices that are part of software delivery closer to developers when they are writing their features. Behaviour-Driven Development (BDD) is a common example, as this practice reduces common wasteful activities in software development early on. Similarly, integrating security and code quality policies into a Continuous Integration process provides fast feedback before integrating changes into a common codebase. These are well-understood examples of where Shift-Left can deliver real value.

However, it is important not to neglect the later stages of the software delivery process that are closer to Production, where a focus on resilience and availability is required. While Ops teams are inherently cautious about tampering with live production systems, research increasingly demonstrates that taking processes that typically happen before application release and moving them into production offers invaluable learning that can feed back into more resilient systems. This is where Shift-Right comes in.

There’s No Place Like Production

Operations teams are justly apprehensive when it comes to testing in production. A large part of their job is to ensure business continuity, which relies on systems being fully operational. A natural fear of introducing change in production is the increased likelihood of failure. It seems counterintuitive, then, that a Shift-Right might be desirable, but there is one key factor to consider: the assumption should always be that production will fail at some point.

The natural result of the fear of shifting right is a tendency to simulate production environments and assess the stability of a change prior to production release. Staging delivery in this manner offers valuable insights into how systems are likely to behave, but there is no substitute for the live production environment when it comes to learning and engineering for reliability. Simulating size, scale, user access, and volatility is a costly task. This is especially true when it comes to cloud-based development, where a firm does not own the infrastructure on which its code is running, meaning there is a layer of volatility that is entirely out of its control.

Increasingly, firms are beginning to realize that deploying incremental changes in live production, and monitoring how systems behave when the code is released, is the only way to assure scale with real workloads.

Faults Are Friends

Shift-Right involves intentionally stressing the resiliency and reliability of production and creating a feedback loop to subsequently engineer for resilience. As part of a progressive roll-out of a change in a production environment, systems resilience can be tested by employing chaos engineering. This involves intentionally breaking parts of your service and using telemetry to observe behavior. The aim of the exercise is either to prove resilience or to identify improvements that can be made. These exercises offer valuable data and telemetry that can be fed back into the development process and, with the help of machine learning, used to validate the resilience mechanisms that are in place.

Assuming production will fail at some point, it makes sense that resilience is tested in real time. One way to do this is by injecting faults into production and overloading subsystems to verify that fall-backs are properly tuned and triggered. Using monitoring tools, information gathered from these exercises can be used to build further resilience into the production environment as well as the release process. An added benefit of improving resilience on both sides is that it makes successful integration on a continual basis more likely.
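To make the idea of fault injection concrete, here is a minimal sketch in Python. It is an illustration rather than a recommended tool: the wrapper with_fault_injection, the InjectedFault exception and the two pricing functions are all hypothetical names invented for this example. The wrapper deliberately fails a configurable share of calls so that the fallback path is exercised in the same way a real failure would exercise it.

```python
import random


class InjectedFault(Exception):
    """Raised in place of a real dependency failure during an experiment."""


def with_fault_injection(call, fallback, failure_rate, rng=random.random):
    """Wrap `call` so that a fraction of invocations fail on purpose.

    `rng` must return a float in [0, 1). It is a parameter so that an
    experiment can be made repeatable with a seeded generator.
    """
    def wrapped(*args, **kwargs):
        try:
            if rng() < failure_rate:
                raise InjectedFault("simulated dependency failure")
            return call(*args, **kwargs)
        except Exception:
            # Real and injected failures take the same path, so the
            # experiment exercises the fallback used in production.
            return fallback(*args, **kwargs)
    return wrapped


def fetch_price(symbol):
    # Hypothetical primary dependency.
    return {"symbol": symbol, "source": "live"}


def cached_price(symbol):
    # Hypothetical fallback, for example a last-known-good cache.
    return {"symbol": symbol, "source": "cache"}


# Fail roughly 30% of calls, using a seeded generator for repeatability.
experiment = with_fault_injection(
    fetch_price, cached_price, 0.3, random.Random(42).random
)
results = [experiment("ABC")["source"] for _ in range(1000)]
fallback_share = results.count("cache") / len(results)
```

In a real experiment the share of calls served by the fallback, along with latency and error counts, would be read from telemetry rather than from a local list.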

Fault injections are only useful if you can recover from them and learn from the process of remediation. The primary purpose is to avoid the failure of a subsystem in production cascading into a more catastrophic system-wide failure that impacts customers.

To safeguard against this scenario, circuit breakers can be employed. As the name suggests, these allow for the setting of parameters so that systems fail when you want them to. Here, there is a level of nuance required when implementing a circuit breaker. You don’t want to trigger a fallback too soon, as this may degrade the performance of the system unnecessarily, owing to the fallback being slower than the system you want running. Similarly, breaking too late means the system will become encumbered, leading to difficulties for users.
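The tuning trade-off described here can be sketched as a small circuit breaker in Python. This is a simplified illustration, not a production library: the class name CircuitBreaker and the parameters failure_threshold and reset_timeout are assumptions made for this example. A low threshold trips to the fallback early, a high one keeps calling a struggling dependency for longer.

```python
import time


class CircuitBreaker:
    """Minimal circuit breaker: closed, open, then a half-open trial call."""

    def __init__(self, failure_threshold=3, reset_timeout=30.0,
                 clock=time.monotonic):
        self.failure_threshold = failure_threshold  # failures before opening
        self.reset_timeout = reset_timeout          # seconds to stay open
        self.clock = clock                          # injectable for testing
        self.failures = 0
        self.opened_at = None                       # None means closed

    def call(self, primary, fallback):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_timeout:
                # Open: skip the primary entirely and protect it from load.
                return fallback()
            # Timeout elapsed: half-open, let one trial call through.
        try:
            result = primary()
        except Exception:
            self.failures += 1
            half_open_failed = self.opened_at is not None
            if half_open_failed or self.failures >= self.failure_threshold:
                # Open the circuit, or re-open it after a failed trial.
                self.opened_at = self.clock()
            return fallback()
        # Success: close the circuit and forget earlier failures.
        self.failures = 0
        self.opened_at = None
        return result
```

Usage would look like breaker.call(lambda: service.get(order_id), lambda: cached_order(order_id)), where both callables are placeholders for whatever primary and fallback paths a system has. Fault-injection exercises are a natural way to find sensible values for the two parameters.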

Friends With Faults

When rolling out changes into production, patterns and practices can be implemented that allow them to be introduced gradually. One effective way to do this is the canary release, a technique that reduces the risk of introducing a new software version in production by rolling the change out to a small subgroup of targeted users before rolling it out to the entire population. This allows a new software version to be verified before it is exposed to everyone, and rolled back if an error is detected. A canary group can be made up of internal users, or even a set of early adopters with whom a certain level of trust has been built.
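One common way to select a canary group, sketched below in Python, is to hash each user ID into a stable bucket and send only the lowest buckets to the new version. The function names, the salt value and the trusted_users parameter are assumptions made for this example, not part of any particular deployment tool.

```python
import hashlib


def in_canary(user_id, rollout_percent, salt="example-release"):
    """Deterministically decide whether a user falls in the canary group.

    The same user always lands in the same bucket for a given salt, so
    raising rollout_percent only ever adds users to the canary group.
    """
    digest = hashlib.sha256(f"{salt}:{user_id}".encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:4], "big") % 100  # bucket in 0..99
    return bucket < rollout_percent


def pick_version(user_id, rollout_percent, trusted_users=frozenset()):
    """Route internal users or early adopters to the canary first."""
    if user_id in trusted_users or in_canary(user_id, rollout_percent):
        return "canary"
    return "stable"
```

Setting rollout_percent back to 0 acts as the rollback for everyone outside the trusted group, and 100 completes the release.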

Another option is to use feature flags. This approach allows for trunk-based development and experimentation with new features. Instead of pushing a new feature through to the entire user base, the new feature can be offered as an opt-in via a private preview, for example, and exposed using a graphical feature flag or toggle that can be activated and deactivated at will.
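As a rough illustration of the opt-in preview described above, the following Python sketch keeps flags in memory with three states. The class names and the state names "off", "preview" and "on" are assumptions for this example. A real system would typically store flags in a shared configuration service so they can be toggled without a deployment.

```python
from dataclasses import dataclass, field


@dataclass
class Flag:
    state: str = "off"                       # "off", "preview" or "on"
    opted_in: set = field(default_factory=set)


class FeatureFlags:
    """In-memory feature flags with an opt-in private preview state."""

    STATES = ("off", "preview", "on")

    def __init__(self):
        self._flags = {}

    def set_state(self, name, state):
        if state not in self.STATES:
            raise ValueError(f"unknown state: {state!r}")
        self._flags.setdefault(name, Flag()).state = state

    def opt_in(self, name, user_id):
        self._flags.setdefault(name, Flag()).opted_in.add(user_id)

    def is_enabled(self, name, user_id):
        flag = self._flags.get(name)
        if flag is None or flag.state == "off":
            return False   # unknown or deactivated flags are always off
        if flag.state == "on":
            return True    # fully released to everyone
        return user_id in flag.opted_in  # preview: opted-in users only
```

Because the unfinished code path sits behind is_enabled, it can be merged to trunk while switched off, and setting the state back to "off" deactivates it for everyone, including users who opted in.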

Telemetry: What’s Going On?

Shift-Right practices are entirely useless if what’s happening in Production is not monitored. No human is capable of simultaneously assessing, recording and reporting on how systems behave before, during and after a forced breakage or during a deployment. Monitoring for anomalies, failures, exceptions and security events, as well as how your systems are performing against key metrics, is a herculean task. This is where the other key aspect of Shift-Right comes in: the use of telemetry.

As with any DevOps practice, automation tools should be employed so that performance and health can be continually monitored. Once a wealth of information has been accumulated through telemetry, extensive knowledge of customer workloads becomes data that can inform decisions early in the development process.
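A very small example of automated health monitoring is a sliding-window error rate, sketched here in Python. The class name ErrorRateMonitor and its parameters window, threshold and min_samples are assumptions for illustration. Real telemetry platforms offer far richer signals, but the principle of comparing a live metric against an agreed limit is the same.

```python
from collections import deque


class ErrorRateMonitor:
    """Track the error rate over the most recent `window` requests."""

    def __init__(self, window=100, threshold=0.05, min_samples=20):
        self.outcomes = deque(maxlen=window)  # True = success, False = error
        self.threshold = threshold            # highest acceptable error rate
        self.min_samples = min_samples        # avoid alerting on too little data

    def record(self, ok):
        self.outcomes.append(bool(ok))

    def error_rate(self):
        if not self.outcomes:
            return 0.0
        return self.outcomes.count(False) / len(self.outcomes)

    def unhealthy(self):
        enough_data = len(self.outcomes) >= self.min_samples
        return enough_data and self.error_rate() > self.threshold
```

A check like unhealthy() could be what halts a canary roll-out or ends a fault-injection experiment early, closing the loop between telemetry and the release process.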

Developers want feedback on the viability of their code as quickly as possible – enabling this is the key to effectively shifting left. At the other end of the software development pipeline, data gathered from tests that stress production can give developers a clearer picture of how systems will cope with a release, allowing them to build further resiliency into their code changes. In short, Shift-Right practices augment Shift-Left.

This article previously appeared on DZone.

Harbinder Kang is a Global Head of Developer Operations at Finastra. He is passionate about continuous improvement in the software delivery cycle. He has hands-on experience managing multi-site agile teams developing financial software with a DevOps mind-set.