Evolution of CI/CD with SRE

This article was written for the Continuous Delivery Foundation in my role as CDF Ambassador, along with my fellow ambassador Garima Bajpai. The original article can be found on the CDF Blog.

In the past decade, we have seen exponential growth and transformation in software development, cloud technologies, and the adoption of DevOps culture, all of which supported the advancement of the Continuous Delivery (CD) ecosystem. Alongside this growth, we also witnessed focused advancement of the Site Reliability Engineering (SRE) perspective. In this blog, we present a broader outlook on the evolution of CD with SRE through the insights presented by our ambassadors.

To fully realize the potential of CD at scale, the integration of SRE principles is essential. Balancing investment in tools and upskilling for reliability vis-a-vis rapid innovation in CD would be an optimal operating model for the digital economy. However, it is important to note that as SRE comes of age, it faces scalability, growth, and complexity challenges.

Advanced resiliency needs, next-generation security threats, and exponential data integration into software products indicate the need for evolution in the SRE approach hand-in-hand with CD. The evolution of SRE with the following key features can be seen as a critical success factor for unleashing the potential of the CD Ecosystem.

Better Reliability Posture and Overall Stability

How does SRE tie to CD? Some of the core tenets of SRE are change management, incident management, and observability. The central theme around which incident and change management revolves is “change”. The CD pipeline is the vehicle of change in your application infrastructure. Having a comprehensive CI/CD solution that covers all changes to production will provide much-needed control for SREs to understand and resolve incidents faster. It also enables them to implement change management processes as well as programmatically verify that the process is adhered to. This in turn translates to better reliability posture and overall stability.

Incident Management

One of the key tenets of SRE is effective incident management. More often than not, incidents are associated with changes in production. These could be code, configuration, or infrastructure changes. Change awareness, the knowledge of the recent changes that were pushed to the production application infrastructure, is vital in resolving an incident within the shortest possible time. These resolutions typically involve rolling back (or forward) the problematic changes to a known good state. The Last Known Good (LKG) state is one of the primary strategies used in determining the stable state to move to. This process is closely tied to one of the most important metrics SREs track, the Mean Time to Resolve (MTTR) of an incident.
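As a rough illustration of the two ideas above, the sketch below picks a Last Known Good release from a deployment history and computes MTTR over a set of incidents. The `Deployment` and `Incident` record types, their fields, and the `healthy` flag are assumptions made for this example; a real pipeline would source this data from its own deployment and incident tooling.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional, Sequence


@dataclass(frozen=True)
class Deployment:
    """Hypothetical record of one production release."""
    version: str
    deployed_at: datetime
    healthy: bool  # assumed flag: release was verified as stable


@dataclass(frozen=True)
class Incident:
    """Hypothetical incident record with start and resolution times."""
    started_at: datetime
    resolved_at: datetime


def last_known_good(
    deployments: Sequence[Deployment], before: datetime
) -> Optional[Deployment]:
    """Return the most recent healthy deployment made before `before`."""
    candidates = [
        d for d in deployments if d.healthy and d.deployed_at < before
    ]
    return max(candidates, key=lambda d: d.deployed_at, default=None)


def mean_time_to_resolve(incidents: Sequence[Incident]) -> timedelta:
    """Average of (resolved_at - started_at) across all incidents."""
    if not incidents:
        raise ValueError("at least one incident is required")
    total = sum(
        (i.resolved_at - i.started_at for i in incidents), timedelta()
    )
    return total / len(incidents)
```

Calling `last_known_good(history, incident.started_at)` gives a rollback target for that incident, or `None` if no healthy release precedes it. The less time responders spend identifying that target, the lower the MTTR tends to be.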

Change Management

Another aspect of SRE is the discipline of change management itself. SRE engagement models greatly vary based on the cultural and organizational aspects of companies, and this applies to change management as well. A model that works for one organization won’t necessarily fit well in another organization. But there are always some underlying principles that can be applied across teams and organizations. The older model of change approval boards and central control has given way to the peer review and approval model. This is often supplemented with automated software and security testing as part of the continuous integration pipeline so that issues are caught and resolved early on.
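A peer review and approval model can also be checked programmatically before a change is allowed to proceed. The following is a minimal sketch of such a check, not a description of any particular CI/CD product: the `ChangeRequest` type, the check names in `REQUIRED_CHECKS`, and the one-approval minimum are all hypothetical choices for illustration.

```python
from dataclasses import dataclass, field

# Hypothetical names of automated checks a change must pass.
REQUIRED_CHECKS = ("unit-tests", "security-scan")


@dataclass
class ChangeRequest:
    """Hypothetical summary of a change awaiting deployment."""
    author: str
    approvers: set[str] = field(default_factory=set)
    check_results: dict[str, bool] = field(default_factory=dict)


def policy_violations(
    change: ChangeRequest,
    required_checks: tuple[str, ...] = REQUIRED_CHECKS,
    min_approvals: int = 1,
) -> list[str]:
    """Return a list of policy violations; an empty list means compliant."""
    violations = []

    # Self-approval does not count as peer review.
    independent = change.approvers - {change.author}
    if len(independent) < min_approvals:
        violations.append(
            f"needs {min_approvals} approval(s) from someone other than the author"
        )

    # A missing check result is treated the same as a failed check.
    for check in required_checks:
        if not change.check_results.get(check, False):
            violations.append(f"required check did not pass: {check}")

    return violations
```

A pipeline step could call `policy_violations` and fail the run when the list is non-empty, which gives both enforcement and an audit trail of why a change was blocked.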

Observability

SREs are also responsible for monitoring, or rather observability, of the application infrastructure. While this predominantly covers observability of the production infrastructure, as reliability practices have evolved, observability has also extended to the measurement of engineering excellence. This is often achieved with reliability scorecards that are tied to various aspects of engineering and application delivery. Similar to how golden signals provide observability into production infrastructure, DORA metrics provide observability into engineering excellence. To put it in CI/CD parlance, even observability is shifting, or rather expanding, to the left.
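To make the DORA comparison concrete, here is a small sketch that derives three of the four DORA metrics (deployment frequency, lead time for changes, and change failure rate) from a list of deployment records. The `DeployRecord` type and its fields are assumptions for this example. The fourth metric, time to restore service, is left out because it is computed from incident data rather than deployment data.

```python
from dataclasses import dataclass
from datetime import datetime
from statistics import median
from typing import Sequence


@dataclass(frozen=True)
class DeployRecord:
    """Hypothetical record linking a commit to its production deployment."""
    committed_at: datetime
    deployed_at: datetime
    caused_failure: bool  # assumed flag: deployment led to a production failure


def dora_summary(
    records: Sequence[DeployRecord], window_days: int
) -> dict[str, float]:
    """Summarize deployment records observed over `window_days` days."""
    if not records or window_days <= 0:
        raise ValueError("need at least one record and a positive window")

    # Lead time for changes: commit to production, in hours.
    lead_hours = [
        (r.deployed_at - r.committed_at).total_seconds() / 3600
        for r in records
    ]
    failures = sum(1 for r in records if r.caused_failure)

    return {
        "deployments_per_day": len(records) / window_days,
        "median_lead_time_hours": median(lead_hours),
        "change_failure_rate": failures / len(records),
    }
```

Tracked over time on a scorecard, these numbers play the same role for the delivery process that golden signals play for a running service.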

Data-Driven CD: Metrics and SLO

A lot has been done in the past as SRE principles became mainstream. However, most SRE practices still fall short of shifting left quickly, as highlighted by Dynatrace’s State of SRE Report: 2022 Edition. Early integration of SRE principles and practices into CI/CD, with a data-centric, metric-based approach, could be the next step in the evolution of CI/CD with SRE. This calls for a unified view of SLOs right from the inception stages of Continuous Delivery; some guidance in this direction could be taken from DevOps Research and Assessment (DORA).
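One way an SLO could feed into a delivery pipeline early is as an error-budget gate: if a service has already used most of its allowed failures for the period, risky deployments are paused. The sketch below assumes a simple request-based availability SLO. The function names and the 10% `min_remaining` threshold are illustrative choices, not values taken from the report or from DORA.

```python
def error_budget_remaining(
    slo_target: float, total_requests: int, failed_requests: int
) -> float:
    """Fraction of the error budget still unspent (negative if overspent).

    slo_target is the availability objective as a fraction, e.g. 0.999.
    """
    if not 0 < slo_target < 1:
        raise ValueError("slo_target must be strictly between 0 and 1")
    if total_requests <= 0:
        raise ValueError("total_requests must be positive")

    # The budget is the number of failures the SLO tolerates in this period.
    allowed_failures = (1 - slo_target) * total_requests
    return 1 - failed_requests / allowed_failures


def may_deploy(
    slo_target: float,
    total_requests: int,
    failed_requests: int,
    min_remaining: float = 0.1,  # hypothetical threshold
) -> bool:
    """Allow a deployment only while enough error budget remains."""
    remaining = error_budget_remaining(
        slo_target, total_requests, failed_requests
    )
    return remaining >= min_remaining
```

For example, with a 99.9% target and one million requests, the budget is roughly 1,000 failed requests. At 500 failures about half the budget remains and the gate stays open, while at 950 failures it closes.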

Advanced Reliability Engineering and CD: Platform, Tools & Application

As the adoption of SRE principles scales, it is evident that the reliability engineering space remains fragmented and heavily focused on monitoring, visualization, and communication. To tailor the SRE principles to the organizational needs, SREs often have to take a hybrid approach with automation, monitoring & AIOps tools co-existing in the ecosystem. In the future with the evolution of CD, SRE tools & applications would not only need consolidation but a more standardized approach to scale. CD integration with SRE will go beyond the current hybrid, fragmented tool-based integration to a more platform-oriented approach, paving the way for a more proactive, insightful, and action-oriented integration of CD & SRE.

Emerging Technology & its Integration with CD

As CD integrates emerging technologies, for example AI-based features, more and more SRE tools and applications will have to move in the same direction with data observability (ML observability, for example). Chaos engineering is another important practice. When integrated through standardized interfaces and core components, it can mature into a framework for not only experimenting but also evaluating and prioritizing resilience at every stage of continuous delivery.

Net-Zero Commitment – SRE Can Lead the Way

As CD makes its way into more and more industry segments, it is evident that we need to get more serious about our carbon footprint and the steps to mitigate and reduce carbon impact. SREs have been exploring cost reduction and right-sizing to reduce the total cost of ownership (TCO), bringing down the carbon footprint as a side effect. By observing and managing workloads and on-demand features for more resource-centric digital applications and products, SREs can take a more conscious approach toward carbon footprint reduction. With this, SRE principles and practices will lead the way toward a carbon-aware and carbon-optimized CD ecosystem.

Some of these thoughts and practices are traditionally considered to be part of the DevOps domain. But in any reasonably sized organization, SRE and DevOps practices are often intertwined, driving towards the common goal of achieving production reliability and stability through engineering excellence. These cross-functional practices are centrally pivoted on continuous integration and delivery.
