This guide is part of a course on How To Master DORA Metrics. You can find the intro to the course as well as links to other modules here.
Key takeaways from this module include:
- Change failure rate tracks the rate at which your newly deployed code causes production-related failures.
- Monitoring the change failure rate is essential because it can impact business success. It represents your product’s reliability and can impact team productivity and customer trust.
- Rushed deployments and a lack of thorough quality assurance testing can drive up your change failure rate. Analyzing changes to identify patterns, implementing real-time monitoring and alerting tooling, and using automation are strategies for overcoming these challenges.
Change failure rate tracks the rate at which newly deployed code changes result in production-phase failures. You calculate this DevOps Research and Assessment (DORA) metric by dividing the number of your team’s production incidents by the number of production deployments.
Note that this metric doesn’t include test failures caught earlier in the integration stage of your continuous integration and continuous delivery (CI/CD) pipeline or production failures not associated with code changes, such as infrastructure issues or external service outages.
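As a quick illustration of the calculation above, here’s a minimal sketch in Python (the incident and deployment counts are hypothetical):

```python
def change_failure_rate(failed_deployments: int, total_deployments: int) -> float:
    """Return the change failure rate as a percentage of deployments.

    Only production incidents caused by code changes count as failures;
    infrastructure issues and pre-production test failures are excluded.
    """
    if total_deployments == 0:
        return 0.0
    return failed_deployments / total_deployments * 100

# Example: 4 deployments caused production incidents out of 50 total.
print(f"{change_failure_rate(4, 50):.1f}%")  # 8.0%
```

In practice, tools like LinearB derive these counts automatically from your version control and incident data rather than requiring manual bookkeeping.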
Your change failure rate is a crucial metric to monitor and maintain, as it speaks to your product’s reliability and directly impacts customer trust. It also strongly affects the productivity and morale of your development teams, since it indicates how often their workflows are likely to be disrupted by unplanned incident mitigation work.
How change failure rate impacts business
Your product team’s change failure rate can impact the success of your business both internally and externally, and it deserves close attention and remedial action if it underperforms against the industry benchmarks for high-performing teams.
A lack of predictability in product delivery erodes customer trust in your product. If your software is unreliable, customers may seek more reliable alternatives, and in the long term your product’s reputation will suffer if you don’t act to improve your change failure rate.
Moreover, a high change failure rate can drastically reduce your development team’s productivity as they work to manage rollbacks and hotfixes that interrupt their daily workflows, creating stressful conditions of context switching and frustration around unplanned work.
Beyond these impacts on team morale, the time spent fixing and redeploying code failures creates a considerable financial cost for your business.
Common causes of a high change failure rate and how to reduce it
Software development teams may encounter several challenges that lead to a higher change failure rate metric.
Insufficient testing or low test coverage
Perhaps the most significant hurdles are a lack of adequate test coverage and a failure to fully automate your test suite, making it harder to ensure that new code changes don’t break existing functionality. Shallow or rushed analysis in the code review process can be another source of high change failure rates.
To overcome these challenges, implement robust testing frameworks and expand test coverage. Prioritize automated testing in your development workflow by providing allotments in your team’s project timelines to write adequate tests for each new code change. You should also include robust testing frameworks in the automated integration phase of your CI/CD pipelines.
Robust testing practices give developers the confidence to push new code changes to production, increasing your team’s long-term productivity and ability to deliver new features on schedule. It’s well worth investing time upfront in the development process. You can use gitStream to apply a missing-tests label to any pull request that doesn’t update existing tests or create tests for new files, ensuring your team follows best practices during testing.
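As a sketch of what this looks like in practice, here is a minimal regression test using Python’s built-in unittest framework; the apply_discount helper and the bug it guards against are hypothetical examples, not code from any real product:

```python
import unittest

# Hypothetical pricing helper that once shipped a rounding bug to production.
def apply_discount(price_cents: int, percent: int) -> int:
    """Return the discounted price in cents, rounding down."""
    return price_cents * (100 - percent) // 100

class ApplyDiscountRegressionTest(unittest.TestCase):
    def test_zero_discount_is_identity(self):
        self.assertEqual(apply_discount(1999, 0), 1999)

    def test_rounding_never_overcharges(self):
        # Regression guard: 10% off 1005 cents must be 904, not 905.
        self.assertEqual(apply_discount(1005, 10), 904)
```

Running tests like these automatically (for example, via `python -m unittest` in your CI pipeline) keeps past failures from recurring without relying on reviewers to remember them.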
A lack of QA testing
A lack of manual quality assurance (QA) testing by your developers or product team to supplement your automated tests can also impact this metric. Skipping manual QA testing increases the likelihood of undetected bugs and issues in the software. Without this crucial step, faulty code can slip through to production, leading to a higher change failure rate.
Manual QA testing provides a human perspective that automated tests might miss, ensuring a more comprehensive evaluation of the software’s functionality, usability, and overall quality.
To protect your change failure rate, perform rigorous code quality checks through both peer reviews and automated checks. All new pull requests should receive adequate review and QA testing before merging and deployment. gitStream is uniquely suited to refining your team’s code review process, as it provides insight into the number of code comments and the depth of review per pull request.
It also ensures that pull requests are routed to the proper code experts for review and that more complex code changes receive the attention they need while smaller and safer changes can be moved up the queue and merged.
Rushed deployments

Rushed production deployments can compound these issues and inflate your change failure rate. This haste typically stems from inadequate validation processes, tight deployment timelines, and the hurdles outlined above.
Instead, perform gradual rollouts and feature flagging to catch issues early. Larger and highly complex code changes are more difficult to debug and fix once deployed. Gradual rollouts of smaller changes result in more manageable resolutions if a deployment encounters problems.
Gradual code changes also simplify implementing feature flagging, reducing any impacts to a smaller subset of users.
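A percentage-based rollout can be sketched as follows; the feature name and user IDs are hypothetical, and a real deployment would typically use a dedicated feature-flag service rather than hand-rolled code:

```python
import hashlib

def is_enabled(feature: str, user_id: str, rollout_percent: int) -> bool:
    """Deterministically bucket a user into [0, 100) and compare against
    the rollout percentage, so each user's flag state stays stable as
    the rollout expands from, say, 5% to 25% to 100%."""
    digest = hashlib.sha256(f"{feature}:{user_id}".encode()).hexdigest()
    bucket = int(digest, 16) % 100
    return bucket < rollout_percent

# Only users whose bucket falls under the current percentage see the change.
enabled_users = [u for u in ("alice", "bob", "carol")
                 if is_enabled("new-checkout", u, 25)]
```

Because bucketing is deterministic, raising the percentage only adds users; nobody flips back and forth between the old and new behavior mid-rollout.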
Actions to reduce change failure rate
Below are some actions you can take immediately to reduce your team’s change failure rate.
Analyze changes for patterns
Analyze recent failed changes to pinpoint recurring issues. Identifying patterns in past failures helps address root causes, allowing the team to implement targeted improvements and prevent similar issues from recurring.
You can pair these development-related metrics with your incident monitoring and alerting platforms through automation and visualizations to reveal targeted improvement areas and enable timely issue detection. For example, a correlation between the length and complexity of pull requests and a higher rate of deployment failures could indicate your team’s need to work on smaller, more focused tasks that result in less complex code base changes.
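For example, the pull-request-size correlation described above could be checked with a quick script like this (the deployment records are fabricated for illustration):

```python
from statistics import mean

# Hypothetical deployment records: lines changed in the PR and whether
# the deployment caused a production incident.
deployments = [
    {"lines_changed": 40, "failed": False},
    {"lines_changed": 900, "failed": True},
    {"lines_changed": 120, "failed": False},
    {"lines_changed": 1500, "failed": True},
    {"lines_changed": 60, "failed": False},
]

failed = [d["lines_changed"] for d in deployments if d["failed"]]
succeeded = [d["lines_changed"] for d in deployments if not d["failed"]]

# A large gap between the two averages suggests oversized changes are a
# recurring failure pattern worth breaking into smaller, focused tasks.
print(f"avg size of failed deploys:    {mean(failed):.0f} lines")
print(f"avg size of succeeded deploys: {mean(succeeded):.0f} lines")
```

The same approach extends to other candidate patterns, such as failure rate by code area, time of day, or reviewer count.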
Implement real-time monitoring and alerting
Set up alerts and monitoring to catch failures promptly. Swift detection through proactive monitoring enables quick response, minimizing the impact of failures and reducing the likelihood of cascading issues.
Real-time monitoring and alerting aids in swift failure detection and provides invaluable insights into system health. By instantly flagging anomalies, teams can preemptively address issues, fostering a resilient and adaptive development environment that significantly improves the overall change failure rate.
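A minimal sliding-window alert on the change failure rate itself might look like this sketch (the window size and 15% threshold are illustrative choices, not values mandated by DORA):

```python
from collections import deque

class FailureRateAlert:
    """Track deployment outcomes over a sliding window and flag when the
    change failure rate crosses an alert threshold."""

    def __init__(self, window: int = 20, threshold: float = 0.15):
        self.outcomes = deque(maxlen=window)  # True = failed deployment
        self.threshold = threshold

    def record(self, failed: bool) -> bool:
        """Record one deployment; return True if an alert should fire."""
        self.outcomes.append(failed)
        rate = sum(self.outcomes) / len(self.outcomes)
        return rate > self.threshold

monitor = FailureRateAlert(window=10, threshold=0.15)
```

In a real setup, the alert would feed your incident tooling rather than a return value, but the principle is the same: surface the trend as soon as it degrades, not at the end of the quarter.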
Foster a culture of collective accountability
Foster a culture of “blameless postmortems” to learn from failures without penalizing individuals. Encouraging open discussions about failures without assigning blame creates a learning environment and promotes a proactive approach to preventing future errors.
As with all DORA metrics, it’s essential to focus on productivity and areas for improvement at the team level rather than pinpointing individual contributors. This is why LinearB only provides aggregated team and organization-level data and doesn’t collect metrics on individual team members.
Problems in the software delivery cycle result from process-level failures, misdirected organizational culture, and misallocated skills and resources across teams, so they should be assessed and addressed at that level. Blameless postmortems help your team learn from failures without penalizing individuals, which is crucial for addressing change failure rates and maintaining team morale.
Automate code review workflows

Implement workflow automations for code reviews to detect issues early. Automated code reviews streamline the identification of potential issues, ensuring problems are caught and addressed early and reducing the risk of failure during deployment.
Using gitStream’s programmable workflows, you can assign code experts, assign reviewers based on the area of the code base the change affects, and notify your security team of changes made to sensitive files. You can also enforce protective measures to reduce the risk of failure, such as prohibiting changes to deprecated components and off-limits areas of the code base.
These automations complement real-time monitoring and alerting: together they provide continuous visibility into system health, minimize downtime through swift failure detection, and sharpen your team’s ability to address issues promptly, ultimately reducing your overall change failure rate.
Pros of lower change failure rates
The advantages of maintaining a low change failure rate include:
- Increased product stability, leading to more satisfied users.
- A streamlined development process with fewer disruptions.
- Higher morale and confidence in deploying new code changes for your development teams.
- Lower risk of production incidents disrupting your developers’ workflows, simplifying project timeline estimates and giving stakeholders more accurate product delivery schedules.
Measuring change failure rate
Once you connect your project management and version control systems to your LinearB account, you can track your team’s change failure rate metrics using the DORA metrics dashboard.
With visibility into your team’s change failure rate over time, you can take action on the relevant areas using gitStream’s array of workflow automations for your code pickup, review, and deployment processes.
To use LinearB to monitor your team’s Change Failure Rate, refer to the documentation for our DORA metrics dashboard.