<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0"><channel><title><![CDATA[Safeer C M]]></title><description><![CDATA[Safeer C M]]></description><link>https://safeer.sh</link><generator>RSS for Node</generator><lastBuildDate>Tue, 21 Apr 2026 01:15:17 GMT</lastBuildDate><atom:link href="https://safeer.sh/rss.xml" rel="self" type="application/rss+xml"/><language><![CDATA[en]]></language><ttl>60</ttl><item><title><![CDATA[Testinfra: Practical, Pytest-Friendly Infrastructure Testing]]></title><description><![CDATA[Modern infrastructure management requires consistency in the state of servers, and it's as crucial as the consistency of the code that runs on them.  This is where the principles of  Test-Driven Infrastructure (TDI) come into play, and Testinfra stan...]]></description><link>https://safeer.sh/testinfra-practical-pytest-friendly-infrastructure-testing</link><guid isPermaLink="true">https://safeer.sh/testinfra-practical-pytest-friendly-infrastructure-testing</guid><category><![CDATA[testinfra]]></category><category><![CDATA[Python]]></category><category><![CDATA[pytest]]></category><dc:creator><![CDATA[Safeer C M]]></dc:creator><pubDate>Sun, 19 Jan 2025 18:30:00 GMT</pubDate><content:encoded><![CDATA[<p>Modern infrastructure management requires consistency in the state of servers, and it's as crucial as the consistency of the code that runs on them.  This is where the principles of  <strong>Test-Driven Infrastructure (TDI)</strong> come into play, and <strong>Testinfra</strong> stands out as a powerful and elegant tool for implementing this practice. Testinfra is a Python framework that allows you to write unit tests for your infrastructure, verifying that your servers are configured exactly as you expect.</p>
<p>If you have used pytest to test your Python code, Testinfra will feel right at home.  Testinfra is a <strong>pytest</strong> plugin, which means it leverages the power and flexibility of this popular Python testing framework. It allows you to write clean, readable tests to verify the state of various server components, such as packages, services, files, network sockets, and more. Think of it as unit testing for your servers. Instead of testing a function or a class, you're testing the state of your infrastructure.</p>
<h1 id="heading-advantages-of-testinfra">Advantages of Testinfra</h1>
<p>These are some of the benefits Testinfra offers:</p>
<ul>
<li><p><strong>Idempotency and Consistency</strong>: By writing tests for your infrastructure, you can ensure that your configuration management tools (like Ansible, Salt, Puppet, or Chef) are working correctly and that your servers are in a consistent and predictable state.</p>
</li>
<li><p><strong>Early Detection of Errors</strong>: Testinfra helps you catch configuration drifts and errors early in the development cycle, long before they can cause problems in production.</p>
</li>
<li><p><strong>Improved Collaboration</strong>: Tests serve as a form of documentation, clearly defining the expected state of your infrastructure and making it easier for teams to collaborate on managing and maintaining servers.</p>
</li>
<li><p><strong>Auditability and compliance:</strong> By maintaining reproducible tests that are created and maintained centrally, Testinfra enables infrastructure and security engineers to verify that production infrastructure is in its expected state and is not exposed to the reliability and security incidents that stem from manual changes.</p>
</li>
</ul>
<h1 id="heading-modeling-infrastructure-tests">Modeling infrastructure tests</h1>
<p>Fixtures provide a consistent context and reliable environment for tests.  In pytest, fixtures are part of the arrange phase; to learn more about the different phases of a pytest test, read the <a target="_blank" href="https://docs.pytest.org/en/stable/explanation/anatomy.html#test-anatomy">“Anatomy of a test”</a>.  The <strong>host fixture</strong> is the central element of Testinfra. It represents the system under test and provides access to all the different modules that you can use to inspect the server's state.</p>
<p>Each capability is a <strong>module</strong> exposed as a method on the host.  Here are some of the most commonly used modules available through the host fixture:</p>
<ul>
<li><p><code>host.package(name)</code>: To check the status of a package.</p>
</li>
<li><p><code>host.service(name)</code>: To check the status of a service.</p>
</li>
<li><p><code>host.file(path)</code>: To inspect files and directories.</p>
</li>
<li><p><code>host.socket(uri)</code>: To check for listening sockets.</p>
</li>
<li><p><code>host.user(name)</code>: To get information about a user.</p>
</li>
<li><p><code>host.group(name)</code>: To get information about a group.</p>
</li>
<li><p><code>host.interface(name)</code>: To inspect network interfaces.</p>
</li>
<li><p><code>host.process</code>: To find and inspect running processes.</p>
</li>
<li><p><code>host.run(command)</code>: To run a command and inspect its output.</p>
</li>
</ul>
<p>You declare the host as a test function argument and then call the module you need. For example, to check whether the host is listening on port 22 (for SSH), write the test as follows:</p>
<pre><code class="lang-python"><span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">test_ssh_port</span>(<span class="hljs-params">host</span>):</span>
    <span class="hljs-keyword">assert</span> host.socket(<span class="hljs-string">"tcp://0.0.0.0:22"</span>).is_listening
</code></pre>
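<p>Running ad-hoc commands works the same way. Here is a minimal sketch using <code>host.run</code> and <code>host.check_output</code>; the exact assertions are illustrative and assume a Linux target:</p>
<pre><code class="lang-python">def test_kernel_is_linux(host):
    # host.run executes a command and exposes rc, stdout, and stderr
    cmd = host.run("uname -s")
    assert cmd.rc == 0
    assert cmd.stdout.strip() == "Linux"

def test_uptime_output(host):
    # check_output asserts the command succeeds and returns its stdout
    assert "load average" in host.check_output("uptime")
</code></pre>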
<h1 id="heading-getting-started">Getting started</h1>
<p>Start by installing the <code>pytest-testinfra</code> package:</p>
<pre><code class="lang-bash">pip install pytest-testinfra
</code></pre>
<p>Let us write a few tests to validate the localhost on which Testinfra is installed.</p>
<p>Add the following tests to the file <code>local_infra.py</code>.  A description of each test is included as a comment.</p>
<pre><code class="lang-python">
<span class="hljs-comment"># Test if the passwd file exists and  </span>
<span class="hljs-comment"># has the right user/group ownership and file permissions</span>
<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">test_passwd_file</span>(<span class="hljs-params">host</span>):</span>   
    f = host.file(<span class="hljs-string">"/etc/passwd"</span>)   
    <span class="hljs-keyword">assert</span> f.exists <span class="hljs-keyword">and</span> f.user == <span class="hljs-string">"root"</span> <span class="hljs-keyword">and</span> f.group == <span class="hljs-string">"root"</span> <span class="hljs-keyword">and</span> f.mode == <span class="hljs-number">0o644</span>

<span class="hljs-comment"># Check if the package openresty ( an nginx variant with lua ) is installed </span>
<span class="hljs-comment"># and the package version is 1.2*</span>
<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">test_openresty_installed</span>(<span class="hljs-params">host</span>):</span>   
    pkg = host.package(<span class="hljs-string">"openresty"</span>)   
    <span class="hljs-keyword">assert</span> pkg.is_installed
    <span class="hljs-keyword">assert</span> pkg.version.startswith(<span class="hljs-string">"1.2"</span>)
 <span class="hljs-comment"># Ensure openresty is running as a service and is enabled</span>
<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">test_openresty_service</span>(<span class="hljs-params">host</span>):</span>   
    s = host.service(<span class="hljs-string">"openresty"</span>)
    <span class="hljs-keyword">assert</span> s.is_running
    <span class="hljs-keyword">assert</span> s.is_enabled
</code></pre>
<p>Now, let us run these tests:</p>
<pre><code class="lang-bash">
pytest -v local_infra.py  
===================================================== <span class="hljs-built_in">test</span> session starts ========================================
&lt;OUTPUT TRUNCATED&gt;collected 3 items                                                                                                                                                                            
local_infra.py::test_passwd_file[<span class="hljs-built_in">local</span>] PASSED                                                                                             [ 33%]
local_infra.py::test_openresty_installed[<span class="hljs-built_in">local</span>] PASSED                                                                                     [ 66%]
local_infra.py::test_openresty_service[<span class="hljs-built_in">local</span>] PASSED                                                                                       [100%]
===================================================== 3 passed <span class="hljs-keyword">in</span> 0.13s =========================================
</code></pre>
<h1 id="heading-concept-of-backends">Concept of backends</h1>
<p>Testinfra supports a rich set of <strong>connection backends</strong>. By default, all tests run <strong>locally</strong>, but you can target remote hosts or containers with <code>--hosts=&lt;backend specification&gt;</code>; example invocations follow the list below.</p>
<p>Supported backends are:</p>
<ul>
<li><p>SSH</p>
</li>
<li><p>Paramiko - Python implementation of the SSHv2 protocol</p>
</li>
<li><p>Ansible</p>
</li>
<li><p>Docker</p>
</li>
<li><p>Podman</p>
</li>
<li><p>Kubernetes</p>
</li>
<li><p>Openshift</p>
</li>
<li><p>Salt</p>
</li>
<li><p>WinRM</p>
</li>
<li><p>LXC/LXD</p>
</li>
</ul>
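<p>For illustration, here are a few example <code>--hosts</code> invocations for common backends (the user names, host names, container name, and inventory path below are placeholders):</p>
<pre><code class="lang-bash"># SSH and Paramiko backends
pytest --hosts='ssh://admin@web01.example.com' test_backend.py
pytest --hosts='paramiko://admin@web01.example.com' test_backend.py

# A running Docker container, by name or ID
pytest --hosts='docker://my-nginx' test_backend.py

# All hosts from an Ansible inventory
pytest --ansible-inventory=hosts.ini --hosts='ansible://all' test_backend.py
</code></pre>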
<p>Let's use SSH to connect to a remote host and do some validation.</p>
<p>We will log in to the remote host 206.189.137.55 and check whether root login over SSH is disabled.  Passwordless SSH is already set up for the host. Create a test file <code>test_backend.py</code> and add the following:</p>
<pre><code class="lang-python"><span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">test_ssh_no_root_login</span>(<span class="hljs-params">host</span>):</span>
   sshd_config = host.file(<span class="hljs-string">"/etc/ssh/sshd_config"</span>)
   <span class="hljs-keyword">assert</span> sshd_config.contains(<span class="hljs-string">"^PermitRootLogin no"</span>)
</code></pre>
<p>Run the test</p>
<pre><code class="lang-bash">pytest -vv --hosts=<span class="hljs-string">"ssh://206.189.137.55"</span>  test_backend.py  

========================================== <span class="hljs-built_in">test</span> session starts ===========================
&lt;OUTPUT TRUNCATED&gt;
collected 1 item

test_backend.py::test_ssh_no_root_login[ssh://206.189.137.55] PASSED  

============================================ 1 passed <span class="hljs-keyword">in</span> 1.15s ===========================
</code></pre>
<p>For most backends other than SSH, you will have to install the corresponding optional dependencies, as follows:</p>
<pre><code class="lang-bash">pip install <span class="hljs-string">'pytest-testinfra[ansible,salt]'</span>
</code></pre>
<p>You can use pytest parametrization to pass parameters to Testinfra tests:</p>
<pre><code class="lang-python"><span class="hljs-keyword">import</span> pytest
<span class="hljs-meta">@pytest.mark.parametrize("name", ["curl", "git"])</span>
<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">test_utilities_installed</span>(<span class="hljs-params">host, name</span>):</span>
    <span class="hljs-keyword">assert</span> host.package(name).is_installed
</code></pre>
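<p>Beyond the pytest plugin, Testinfra can also be used as a plain Python library, which is handy in scripts and other tools. A minimal sketch (the SSH target below is a placeholder):</p>
<pre><code class="lang-python">import testinfra

# Build a host object directly from a backend specification
host = testinfra.get_host("ssh://admin@web01.example.com")

print(host.package("openssl").version)
print(host.check_output("uptime"))
</code></pre>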
<h1 id="heading-conclusion">Conclusion</h1>
<p>Testinfra belongs in your platform toolbox.  It gives you fast, reliable feedback that a box (or image/pod) is configured the way you <em>think</em> it is—before a rollout, after a change, and during drift audits. It scales from “does nginx listen on 80” to “is our org-wide security baseline present, enabled, and locked down,” and because it’s pytest, it integrates well into your workflow and CI with minimal effort.</p>
]]></content:encoded></item><item><title><![CDATA[Generating CycloneDX software bill of materials with Anchore Syft]]></title><description><![CDATA[Software bill of materials aka SBOM is a critical tool in protecting software and software supply chain from potential security vulnerabilities and supply chain attacks. An SBOM includes all software components that are used to make a final software ...]]></description><link>https://safeer.sh/generating-cyclonedx-software-bill-of-materials-with-anchore-syft</link><guid isPermaLink="true">https://safeer.sh/generating-cyclonedx-software-bill-of-materials-with-anchore-syft</guid><category><![CDATA[#software bill of materials]]></category><category><![CDATA[Security]]></category><dc:creator><![CDATA[Safeer C M]]></dc:creator><pubDate>Thu, 17 Oct 2024 18:30:00 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1738599973380/e8bfd851-6552-4b18-b8a9-bf65c10e7df2.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Software bill of materials aka SBOM is a critical tool in protecting software and software supply chain from potential security vulnerabilities and supply chain attacks. An SBOM includes all software components that are used to make a final software product. An SBOM is a document that describes details of the software version, patches, dependencies, vulnerabilities, licenses, etc. There are different formats to document SBOMs, and the industry standard formats right now are CycloneDX and SPDX. CycloneDX is from the OWASP foundation and is more focused on vulnerability and security. SPDX is from the Linux Foundation and is more focused on licensing and compliance.</p>
<p>In this article, we will look into the basics of the CycloneDX format by generating an SBOM file and going through various entries in the document.</p>
<h2 id="heading-the-cyclonedx-format">The CycloneDX Format</h2>
<p>The CycloneDX SBOM specification has a comprehensive object model that can capture software components, services, interdependencies, and relationships across all inventory types including software, hardware, and other digital assets. The object model can support detailed metadata of all components and services, life cycle stages, etc. The framework is also very extensible and modular and can represent a wide range of supply chain entities and metadata.</p>
<p>Let us look at some of the high-level components of the CycloneDX specification:</p>
<ul>
<li><p>Top Level BOM Metadata - Contains basic information that describes the BOM itself, the tooling used, etc</p>
</li>
<li><p>Components - Components could be software, hardware, ML models, source code, configuration, etc. Includes the first-party and third-party components involved along with the metadata of each component</p>
</li>
<li><p>Services - External APIs that the software may call</p>
</li>
<li><p>Dependencies - Represent the graphs of the dependency of components on other components. It can represent direct and transitive dependencies</p>
</li>
</ul>
<p>There are several other top-level entities in the specification - an overview is available at <a target="_blank" href="https://cyclonedx.org/specification/overview/">https://cyclonedx.org/specification/overview/</a></p>
<p>The following visual representation will help you get a better idea of the object model:<br /><a target="_blank" href="https://cyclonedx.org/images/CycloneDX-Object-Model-Swimlane.svg">https://cyclonedx.org/images/CycloneDX-Object-Model-Swimlane.svg</a></p>
<h2 id="heading-anchore-syft-for-sbom-generation">Anchore Syft for SBOM generation</h2>
<p>Syft is a comprehensive open-source CLI tool and library for generating Software Bills of Materials (SBOMs) for container images and filesystems. Syft development is sponsored by the security company Anchore.</p>
<p>Some of the notable features of syft are:</p>
<ul>
<li><p>Generates SBOMs for container images, filesystems, archives, and more to discover packages and libraries</p>
</li>
<li><p>Supports OCI, Docker, and Singularity image formats</p>
</li>
<li><p>Linux distribution identification</p>
</li>
<li><p>SBOM signing/attestation with in-toto specification</p>
</li>
<li><p>Supports multiple SBOM formats - CycloneDX, SPDX, and Syft's own native format.</p>
</li>
</ul>
<h2 id="heading-generating-sboms-with-syft">Generating SBOMs with Syft</h2>
<p>We will demonstrate Syft by generating an SBOM for a container image. To begin with, install the syft binary by following the instructions at <a target="_blank" href="https://github.com/anchore/syft/">https://github.com/anchore/syft/</a>. This will install the CLI “syft” on your OS.</p>
<p>The basic usage syntax is as follows:</p>
<p><code>syft [SOURCE] [FLAGS]...</code></p>
<p>Syft can create SBOMs from a variety of sources, including different formats of container images, different container daemons, filesystems and directories, etc. It supports a number of flags; one of the most important is “-o”, which specifies the output format and file.</p>
<p>Let us start by creating a CycloneDX-formatted SBOM from the alpine Docker image in the Docker registry.</p>
<pre><code class="lang-plaintext">$syft alpine -o cyclonedx-json=alpine.cyclone.json
</code></pre>
<p>This will generate the SBOM file from the official docker repo at “<a target="_blank" href="http://docker.io/library/alpine">docker.io/library/alpine</a>” in the CycloneDX format and save it into the file alpine.cyclone.json. We will use the “jq” CLI to inspect the file and look at some of the important elements in the SBOM file.</p>
<p>Let us start by looking at the top-level elements of the file</p>
<pre><code class="lang-plaintext">$cat alpine.cyclone.json|jq '.|keys'

[
  "$schema",
  "bomFormat",
  "components",
  "dependencies",
  "metadata",
  "serialNumber",
  "specVersion",
  "version"
]
</code></pre>
<p>Looking at the important BOM-specific metadata</p>
<pre><code class="lang-plaintext">$cat alpine.cyclone.json |jq '{"$schema",bomFormat,serialNumber,specVersion,version}'
{
  "$schema": "http://cyclonedx.org/schema/bom-1.6.schema.json",
  "bomFormat": "CycloneDX",
  "serialNumber": "urn:uuid:83201860-d58d-4a8e-8acb-59431eb61ce9",
  "specVersion": "1.6",
  "version": 1
}
</code></pre>
<p>The SBOM file is generated with the latest CycloneDX schema - version 1.6. Every SBOM will also have a serial number and a version. When you update the SBOM for the same source, the version should be incremented and the serialNumber should remain the same.</p>
<p>The SBOM metadata:</p>
<pre><code class="lang-plaintext">$cat alpine.cyclone.json|jq '.metadata' 
{
  "timestamp": "2024-12-01T01:10:11+05:30",
  "tools": {
    "components": [
      {
        "type": "application",
        "author": "anchore",
        "name": "syft",
        "version": "1.17.0"
      }
    ]
  },
  "component": {
    "bom-ref": "c48d0ee842be961f",
    "type": "container",
    "name": "alpine",
    "version": "sha256:37224ec0ba64192fa71cf0cd764a375e6204b58af5274f9d3b2984f9d5516cbb"
  }
}
</code></pre>
<p>The metadata describes the tools used to create the SBOM as well as the source object (the component in the metadata). As you can see, the type of the component is “container”, its name is “alpine”, and its version, in this case, is the sha256 hash of the image.</p>
<p>Let us look at the (software) components in the BOM now.</p>
<pre><code class="lang-plaintext">$cat alpine.cyclone.json |jq -r '.components|keys[] as $k|[$k+1, .[$k].name, .[$k].type]|@tsv'|column -t -o " | "
1  | alpine-baselayout      | library
2  | alpine-baselayout-data | library
3  | alpine-keys            | library
4  | apk-tools              | library
5  | busybox                | library
6  | busybox-binsh          | library
7  | ca-certificates-bundle | library
8  | libcrypto3             | library
9  | libssl3                | library
10 | musl                   | library
11 | musl-utils             | library
12 | scanelf                | library
13 | ssl_client             | library
14 | zlib                   | library
15 | alpine                 | operating-system
</code></pre>
<p>Since Alpine is a lightweight container, it only has 15 components, out of which one is the operating system Alpine itself and the rest are the Alpine packages.</p>
<p>Now let us take a closer look at a single package; we will choose the ssl_client package.</p>
<pre><code class="lang-plaintext">$cat alpine.cyclone.json |jq '.components[12]|del(.properties)'
{
  "bom-ref": "pkg:apk/alpine/ssl_client@1.36.1-r29?arch=x86_64&amp;distro=alpine-3.20.3&amp;package-id=c5128491237ee638&amp;upstream=busybox",
  "type": "library",
  "publisher": "Sören Tempel &lt;soeren+alpine@soeren-tempel.net&gt;",
  "name": "ssl_client",
  "version": "1.36.1-r29",
  "description": "EXternal ssl_client for busybox wget",
  "licenses": [
    {
      "license": {
        "id": "GPL-2.0-only"
      }
    }
  ],
  "cpe": "cpe:2.3:a:ssl-client:ssl-client:1.36.1-r29:*:*:*:*:*:*:*",
  "purl": "pkg:apk/alpine/ssl_client@1.36.1-r29?arch=x86_64&amp;distro=alpine-3.20.3&amp;upstream=busybox",
  "externalReferences": [
    {
      "url": "https://busybox.net/",
      "type": "distribution"
    }
  ]
}
</code></pre>
<p>One thing to note in particular is the “bom-ref” key, which is a unique identifier for an element within the BOM document. This ID can be used to refer to the component anywhere within the SBOM document. Most of the other keys are self-explanatory. The cpe and purl keys follow common standards (CPE and Package URL) for uniquely identifying a package.</p>
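<p>Since every bom-ref is unique within the document, you can resolve a reference back to its component. As a small illustrative query, here is how jq can look up a component by the ssl_client bom-ref shown above:</p>
<pre><code class="lang-plaintext">$cat alpine.cyclone.json | jq --arg ref "pkg:apk/alpine/ssl_client@1.36.1-r29?arch=x86_64&amp;distro=alpine-3.20.3&amp;package-id=c5128491237ee638&amp;upstream=busybox" '.components[] | select(."bom-ref" == $ref) | .name'
"ssl_client"
</code></pre>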
<p>Let us also look at a few properties of the package.</p>
<pre><code class="lang-plaintext">$cat alpine.cyclone.json |jq '.components[12].properties'
[
  {
    "name": "syft:package:foundBy",
    "value": "apk-db-cataloger"
  },
  {
    "name": "syft:package:type",
    "value": "apk"
  },
  {
    "name": "syft:package:metadataType",
    "value": "apk-db-entry"
  },
  {
    "name": "syft:location:0:layerID",
    "value": "sha256:75654b8eeebd3beae97271a102f57cdeb794cc91e442648544963a7e951e9558"
  },
  {
    "name": "syft:location:0:path",
    "value": "/lib/apk/db/installed"
  },
  {
    "name": "syft:metadata:installedSize",
    "value": "28672"
  },
  {
    "name": "syft:metadata:originPackage",
    "value": "busybox"
  },
  {
    "name": "syft:metadata:provides:0",
    "value": "cmd:ssl_client=1.36.1-r29"
  },
  {
    "name": "syft:metadata:pullChecksum",
    "value": "Q1fihnCSoO3udDb3DkQwtrfd42MJQ="
  },
  {
    "name": "syft:metadata:pullDependencies:0",
    "value": "so:libc.musl-x86_64.so.1"
  },
  {
    "name": "syft:metadata:pullDependencies:1",
    "value": "so:libcrypto.so.3"
  },
  {
    "name": "syft:metadata:pullDependencies:2",
    "value": "so:libssl.so.3"
  },
  {
    "name": "syft:metadata:size",
    "value": "4693"
  }
]
</code></pre>
<p>Please note that I have removed a few entries to shorten the list; the keys are mostly self-explanatory.</p>
<p>One of the key elements in SBOMs is the license of the component. It is common for one component to have multiple licenses, because parts of the software might be built from different software dependencies - each of which might carry a different license that specifies how it can be used in another product. Most of the Alpine packages in the SBOM had simpler licenses, so to demonstrate this point, I generated an SBOM from the Ubuntu image. Let us look at the bash package, which has multiple licenses.</p>
<pre><code class="lang-plaintext">$cat ubuntu.cyclone.json|jq '.components[]|select(.name=="bash")|.licenses'
[
  {
    "license": {
      "id": "BSD-4-Clause-UC"
    }
  },
  {
    "license": {
      "id": "GFDL-1.3-only"
    }
  },
  {
    "license": {
      "id": "GPL-2.0-only"
    }
  },
  {
    "license": {
      "id": "GPL-2.0-or-later"
    }
  },
  {
    "license": {
      "id": "GPL-3.0-only"
    }
  },
  {
    "license": {
      "id": "GPL-3.0-or-later"
    }
  },
  {
    "license": {
      "id": "Latex2e"
    }
  },
  {
    "license": {
      "name": "GFDL-NIV-1.3"
    }
  },
  {
    "license": {
      "name": "MIT-like"
    }
  },
  {
    "license": {
      "name": "permissive"
    }
  }
]
</code></pre>
<p>As you can see, there are multiple licenses associated with bash.</p>
<p>Next, we will look at one of the most important parts of the SBOM - software dependencies. We will examine the dependencies of the ssl_client package.</p>
<pre><code class="lang-plaintext">$cat alpine.cyclone.json |jq '.dependencies[]|select(.ref=="pkg:apk/alpine/ssl_client@1.36.1-r29?arch=x86_64&amp;distro=alpine-3.20.3&amp;package-id=c5128491237ee638&amp;upstream=busybox")'
{
  "ref": "pkg:apk/alpine/ssl_client@1.36.1-r29?arch=x86_64&amp;distro=alpine-3.20.3&amp;package-id=c5128491237ee638&amp;upstream=busybox",
  "dependsOn": [
    "pkg:apk/alpine/libcrypto3@3.3.2-r0?arch=x86_64&amp;distro=alpine-3.20.3&amp;package-id=0bd67c24de5c4187&amp;upstream=openssl",
    "pkg:apk/alpine/libssl3@3.3.2-r0?arch=x86_64&amp;distro=alpine-3.20.3&amp;package-id=409f5b93e7b861be&amp;upstream=openssl",
    "pkg:apk/alpine/musl@1.2.5-r0?arch=x86_64&amp;distro=alpine-3.20.3&amp;package-id=3ea0974d202d0c73"
  ]
}
</code></pre>
<p>As you can see in the dependency entry for the ssl_client package, all dependencies are referenced using their “bom-ref” IDs. This helps in building a dependency graph of all packages within the SBOM.</p>
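<p>To work with these references programmatically, you can load the SBOM and build an adjacency map. A minimal Python sketch (assuming the alpine.cyclone.json file generated earlier; the key names follow the output above):</p>
<pre><code class="lang-python">import json

# Load the CycloneDX SBOM generated earlier
with open("alpine.cyclone.json") as f:
    bom = json.load(f)

# Map each bom-ref to a readable component name
names = {c["bom-ref"]: c["name"] for c in bom.get("components", [])}

# Build the dependency graph: bom-ref -&gt; list of bom-refs it depends on
graph = {d["ref"]: d.get("dependsOn", []) for d in bom.get("dependencies", [])}

for ref, deps in graph.items():
    print(names.get(ref, ref), "-&gt;", [names.get(d, d) for d in deps])
</code></pre>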
<p>As you can see, an SBOM provides a lot of valuable information about your application/containers that helps you with security audits, supply chain security, license compliances, etc.</p>
<p>Syft is just one tool that can help you generate an SBOM. There are more such tools in the SBOM ecosystem. A selected list of tools can be found at the CycloneDX tool center - <a target="_blank" href="https://cyclonedx.org/tool-center/">https://cyclonedx.org/tool-center/</a></p>
]]></content:encoded></item><item><title><![CDATA[An introduction to software bill of materials]]></title><description><![CDATA[Software systems have become more complex, and organizations increasingly rely on third-party components, open-source libraries, and external dependencies to build their applications. While this approach accelerates development, it also introduces ri...]]></description><link>https://safeer.sh/an-introduction-to-software-bill-of-materials</link><guid isPermaLink="true">https://safeer.sh/an-introduction-to-software-bill-of-materials</guid><category><![CDATA[software-supply-chain-security]]></category><category><![CDATA[#software bill of materials]]></category><category><![CDATA[Security]]></category><dc:creator><![CDATA[Safeer C M]]></dc:creator><pubDate>Tue, 08 Oct 2024 18:30:00 GMT</pubDate><content:encoded><![CDATA[<p>Software systems have become more complex, and organizations increasingly rely on third-party components, open-source libraries, and external dependencies to build their applications. While this approach accelerates development, it also introduces risks related to security vulnerabilities, licensing issues, and supply chain attacks. With the complex web of software dependencies and the attack surfaces it opens up, security and compliance have become ever more important. Enter the Software Bill of Materials (SBOM), a critical tool for managing these risks and ensuring the integrity of software products.</p>
<h1 id="heading-what-is-the-software-bill-of-materials">What is the Software Bill of Materials?</h1>
<p>Bill of materials is a common term in many other industries. A bill of materials (BOM) is an extensive list of the raw materials, components, and instructions required to construct, manufacture, or repair a product or service. There are manufacturing BOMs, engineering BOMs, etc. In software, an SBOM, or software bill of materials, includes all software components used to make a final software product. The SBOM will include details of software versions, patches, dependencies, vulnerabilities, licenses, etc. The SBOM has emerged as a critical element of software security and software supply chain risk management.</p>
<h1 id="heading-why-is-sbom-important">Why is SBOM important?</h1>
<p>There have been several high-profile attacks on the software supply chain over the past decade. Attacks like Codecov, Kaseya, SolarWinds, and most recently this year's XZ Utils backdoor all point to the need for a better understanding of software product dependencies and of the security posture of the entire chain of software dependencies. Identifying such vulnerabilities and mitigating them has a certain cost associated with it, and with more than 80% of software containing open-source dependencies, this becomes a very critical problem to solve.</p>
<p>It is not just the software industry; governments have also been taking note of supply chain security issues and their impact on national security and the economy in general. A pivotal moment in this domain was US Executive Order 14028 on Improving the Nation’s Cybersecurity. The order outlines how government agencies and vendors should approach protecting their software and the software supply chain, one of the major recommendations being the use of the Software Bill of Materials for greater transparency. Governments across the globe are also enacting similar legislation or guidelines, like the EU Parliament’s NIS2 Directive of 2023 or the CERT-In (Computer Emergency Response Team of India) guidelines on SBOMs (2024).</p>
<h1 id="heading-key-features-and-benefits-of-sbom">Key Features and Benefits of SBOM</h1>
<ul>
<li><p>Component Transparency: SBOM offers complete transparency into the components used in a software application, which helps identify potential security vulnerabilities or licensing issues related to third-party dependencies.</p>
</li>
<li><p>Vulnerability Management: By knowing all the components and their versions, organizations can quickly identify if any part of the software is affected by known vulnerabilities and take appropriate remediation measures. Think of the Log4Shell vulnerability: any organization that had SBOMs would have been able to protect itself quickly, because it would know which components of its software infrastructure were impacted.</p>
</li>
<li><p>Compliance and Risk Assessment: SBOM aids in complying with industry regulations and standards that mandate transparency and disclosure of third-party components. It also helps in evaluating the potential security and legal risks associated with the software.</p>
</li>
<li><p>Supply Chain Security: SBOM enables organizations to better understand and manage their software supply chain, reducing the risk of supply chain attacks and ensuring that components are sourced from trusted vendors.</p>
</li>
</ul>
<h1 id="heading-sbom-formats">SBOM Formats</h1>
<p>SBOM is a document describing the chain of dependencies, their versions, licenses, vulnerabilities, etc. Organizations use and produce a large number of software products and the only way to generate and maintain SBOMs at scale is if the process is automated. Given that multiple parties will be involved in the generation and maintenance of software dependencies, it works best when everybody uses a common standard to represent SBOMs. Having a standard format for SBOM helps with interoperability, automation, and adoption of industry best practices.</p>
<h2 id="heading-spdx">SPDX</h2>
<p>SPDX stands for Software Package Data Exchange and was developed by the Linux Foundation's SPDX Workgroup. It was launched in 2010 to address software supply chain challenges, with a primary focus on licensing and other compliance concerns. The current version of the SPDX specification is 3.0.1 (<a target="_blank" href="https://spdx.github.io/spdx-spec/v3.0.1/">https://spdx.github.io/spdx-spec/v3.0.1/</a>). SPDX is a very comprehensive specification and has been around for a very long time. It is already being used by many industries and products, including automotive and healthcare.</p>
<h2 id="heading-cyclonedx">CycloneDX</h2>
<p>CycloneDX is a standard that provides advanced supply chain capabilities for cyber risk reduction. Predominantly focused on vulnerability and security, CycloneDX was created by the OWASP Foundation. The latest specification version is 1.6, published in 2024 (<a target="_blank" href="https://cyclonedx.org/specification/overview/">https://cyclonedx.org/specification/overview/</a>). It is estimated to be in use in over 100,000 organizations.</p>
<h1 id="heading-sbom-classifications">SBOM Classifications</h1>
<p>SBOMs can be generated at various stages of the SDLC and have different classifications and purposes. CISA, the US cybersecurity agency, classifies SBOMs into six categories, as follows:</p>
<ul>
<li><p>Design - SBOM of software that is in the planning/design phase.  Derived from RFC/Design Docs.</p>
</li>
<li><p>Source - Created from the development environment and source files.  Usually generated from software composition analysis</p>
</li>
<li><p>Build - SBOM for a release artifact being built in a build environment.</p>
</li>
<li><p>Analyzed - SBOM from analysis of the build artifact with external tooling</p>
</li>
<li><p>Deployed - An SBOM providing the inventory of software present on a production system.</p>
</li>
<li><p>Runtime - An SBOM generated from the running software, its external dependencies, and runtime-loaded components</p>
</li>
</ul>
<p>There are other types of classifications as well, like SaaS BOM, ML BOMs, Hardware BOMs, etc.</p>
<h1 id="heading-sbom-adoption-and-devsecops">SBOM Adoption and DevSecOps</h1>
<p>SBOM is a critical tool in the arsenal of DevSecOps teams. So, with this knowledge of SBOMs, what should DevSecOps teams do to adopt them?</p>
<p>SBOM is a new topic for most engineering organizations, so understand it first, then assess and document the use cases within the organization. Evangelize if necessary. Then evaluate tools; there are open-source as well as vendor tools, many of which are SaaS-based. An assorted list of tools can be found at the CycloneDX tool center - <a target="_blank" href="https://cyclonedx.org/tool-center/">https://cyclonedx.org/tool-center/</a></p>
<p>Once you pick the tools, integrate them into your SDLC through CI/CD and automate SBOM generation; a sketch of such a step follows. There are tools that will help you store the SBOM data and analyze it. Also, remember that this is not a one-time task - it is part of your SDLC. Once you have the SBOM data collected, continuously analyze it, create reports, and share them with the right stakeholders. Also, set up alerts and reports on vulnerabilities, licenses, compliance, etc.</p>
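<p>As an illustration, a CI step for SBOM generation might look like the following sketch, which uses the open-source syft CLI; the image name and archive bucket are placeholders:</p>
<pre><code class="lang-bash"># Generate a CycloneDX SBOM for the image that was just built
syft registry.example.com/myapp:"$GIT_SHA" -o cyclonedx-json=sbom.cyclonedx.json

# Archive the SBOM alongside other build artifacts for later analysis and alerting
aws s3 cp sbom.cyclonedx.json s3://example-sbom-archive/myapp/"$GIT_SHA".json
</code></pre>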
<hr />
<p>This article should give you a fundamental understanding of SBOMs and their uses. In a subsequent article, we will see how to generate SBOMs using open-source tools and examine the generated SBOM document.</p>
]]></content:encoded></item><item><title><![CDATA[Streamline your Network Topology with  AWS Transit Gateways]]></title><description><![CDATA[AWS Network offerings started with the simple concept of VPC - a Virtual Private Cloud that isolates your workloads at a network level. Subnets, security groups, and network ACLS provided isolation and security. Over time, more services were added - ...]]></description><link>https://safeer.sh/streamline-your-network-topology-with-aws-transit-gateways</link><guid isPermaLink="true">https://safeer.sh/streamline-your-network-topology-with-aws-transit-gateways</guid><category><![CDATA[AWS]]></category><category><![CDATA[AWS Transit Gateway]]></category><category><![CDATA[networking]]></category><category><![CDATA[Cloud Networking]]></category><dc:creator><![CDATA[Safeer C M]]></dc:creator><pubDate>Sun, 06 Oct 2024 18:30:00 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/stock/unsplash/QIbyuC1W0hU/upload/cdc5529dc32d96ab6387a58316f9ce9c.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>AWS Network offerings started with the simple concept of VPC - a Virtual Private Cloud that isolates your workloads at a network level. Subnets, security groups, and network ACLS provided isolation and security. Over time, more services were added - VPN Gateways, Direct Connect, VPC Peering, Cloud WAN, and much more. As you increase the cloud footprint across VPCs, and geographies and/or adopt hybrid or multi-cloud architecture, managing the network architecture becomes overly complex. This is where VPC Transit Gateway comes to the rescue.</p>
<h2 id="heading-what-is-an-aws-transit-gateway">What is an AWS Transit Gateway?</h2>
<p>AWS Transit Gateway is a service that is aimed at simplifying AWS network topologies. It functions as a managed transit hub that uses a hub-and-spoke model: VPCs, on-premises networks, data centers, or SD-WAN solutions can connect to the Transit Gateway. Instead of potentially managing dozens or hundreds of peering connections, tunnels, etc., engineers can now attach the relevant network entities and manage routing in a centralized place. AWS Transit Gateway provides a scalable, secure, and streamlined networking solution by eliminating the complexity of many point-to-point connections.</p>
<h2 id="heading-key-components-and-architecture">Key Components and Architecture</h2>
<p>The key component is the Transit Gateway itself, which acts as the central hub and a virtual router for the VPCs and on-premise networks. In addition, the following components are part of the Transit Gateway Architecture.</p>
<ul>
<li><p>Attachments - Transit Gateway Attachments are connections that facilitate attaching different network entities like VPCs or VPNs to the transit gateway. Commonly supported attachments are:</p>
<ul>
<li><p>VPCs</p>
</li>
<li><p>VPN Connections</p>
</li>
<li><p>Direct Connect Gateway</p>
</li>
<li><p>SD-WAN/Third-party Network Appliances</p>
</li>
<li><p>Another Transit Gateway</p>
</li>
</ul>
</li>
<li><p>Route Tables - A transit gateway comes with a default route table but can optionally have additional route tables. Like regular route tables, these determine the next hop based on the destination IP address. Route tables can hold static (manually configured) or dynamic (routing protocol-based) routes, and the next hop in each route will be a Transit Gateway Attachment.</p>
</li>
<li><p>Routing Table Association - An attachment will be associated with exactly one route table. A route table within Transit Gateway might or might not be associated with an attachment</p>
</li>
<li><p>Route Propagation - Attachments can “advertise” routes to a Transit Gateway route table. If you enable route propagation for a VPC attachment, the subnet CIDRs of that VPC are automatically advertised to the associated Transit Gateway route table. Similarly, for on-premises connections, the CIDRs of on-premises networks can be automatically propagated to the Transit Gateway.  </p>
</li>
</ul>
<p>Transit Gateway is a regional resource, but it can have associations across regions or accounts. It can also peer with other Transit Gateways to segment or simplify network topologies. The following diagram provides a view of how Transit Gateway fits into AWS network architecture.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1738073011792/79f12496-b037-4927-ac53-01e54fa3779c.png" alt class="image--center mx-auto" /></p>
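<p>To make these components concrete, here is a hedged AWS CLI sketch that creates a Transit Gateway, attaches a VPC, and associates the attachment with a route table (all resource IDs are placeholders):</p>
<pre><code class="lang-bash"># Create the Transit Gateway (the central hub)
aws ec2 create-transit-gateway --description "core-network-hub"

# Attach a VPC (one subnet per Availability Zone to be served)
aws ec2 create-transit-gateway-vpc-attachment \
    --transit-gateway-id tgw-0123456789abcdef0 \
    --vpc-id vpc-0123456789abcdef0 \
    --subnet-ids subnet-0123456789abcdef0

# Associate the attachment with a Transit Gateway route table
aws ec2 associate-transit-gateway-route-table \
    --transit-gateway-route-table-id tgw-rtb-0123456789abcdef0 \
    --transit-gateway-attachment-id tgw-attach-0123456789abcdef0
</code></pre>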
<h2 id="heading-advantages-of-using-aws-transit-gateway">Advantages of Using AWS Transit Gateway</h2>
<ul>
<li><p>Centralized and Simplified Network Management - The biggest USPO of Transit Gateway is that it simplifies network management. Rather than handling a large number of point-to-point connections, TGW helps you centralize these connections with a hub and spoke architecture</p>
</li>
<li><p>Scalability and High Availability - TGW is a fully managed service and scales to handle your traffic automatically. There is no need to manually provision any network components or resize the SKUs as your network traffic needs evolve. The data plane traffic is distributed across multiple availability zones for high availability.</p>
</li>
<li><p>Consistent Performance - Transit Gateway traffic flows through the AWS backbone and that provides consistent performance for inter-VPC and hybrid connectivity. When you peer multiple Transit Gateways across Regions, traffic stays on AWS’s global private network, avoiding the unpredictable nature of the public internet.</p>
</li>
<li><p>Centralized Security and Monitoring - Since the transit gateway consolidates network traffic across your cloud infrastructure, it is easy to apply managed policies and network ACLs on it. As of September 2024, security group referencing is also enabled on Transit Gateways. The centralized nature also makes it easy to have network analysis, packet inspection, firewall capabilities, etc to be implemented in a centrally dedicated infrastructure. The VPC Flow Logs for TGW can be used for packet analysis across the AWS network infrastructure connected to it.</p>
</li>
<li><p>Flexible and Granular Routing Policies - The Transit gateway enables you to enforce granular control over routing. The ability to associate specific routing tables on individual attachments makes it easy to control routing. Transitive routing which was not possible between three or more VPCs is now possible with TGW. It is also possible to segment the network by controlling which parts of the network should and should not communicate with each other.</p>
</li>
<li><p>Cost-Effectiveness Over Large-Scale Deployments - While there is a cost associated with AWS Transit Gateway attachments and data processing, the service can be more cost-effective compared to managing a large mesh of VPC peering connections or multiple VPNs. The fewer network endpoints you have to configure, monitor, and secure, the lower your total administrative overhead. Many organizations also see cost savings by consolidating connectivity through Transit Gateway, particularly when egress or data transfer charges in multi-VPC environments are taken into account.</p>
</li>
</ul>
<h2 id="heading-common-use-cases-of-transit-gateways">Common use cases of Transit Gateways</h2>
<p>TGWs are used to centralize networking and ease the management burden. Let us look at some of the ways customers can make use of TGW in their infrastructure.</p>
<ul>
<li><p>Interconnecting Multiple VPCs - A fundamental use case for AWS Transit Gateway is <strong>interconnecting multiple VPCs</strong>. In a complex cloud infrastructure, organizations will maintain 100s of VPCs. The regular VPC peering is one-to-one in nature and handling a large number of VPCs would require a large mesh of peer connections. TGW simplifies this mesh by providing a Hub-and-Spoke architecture for managing network infrastructure.</p>
</li>
<li><p>Hybrid Connectivity with On-Premises Data Centers - AWS Transit Gateway offers a cohesive way to connect on-premises data centers to AWS. Enterprises can use Site-to-Site VPN for smaller or less latency-sensitive workloads or opt for Direct Connect for more demanding, data-intensive applications. Once the on-premises networks are attached to the Transit Gateway, each VPC that is also attached can communicate as needed, all managed from a single routing domain</p>
</li>
<li><p>Multi-Region High-Availability and Disaster Recovery - For High Availability and business continuity plans, businesses invest in multi-region infrastructure. Reliable cross-connectivity between infrastructure in different regions is critical for such plans. Transit gateways can act as the central hub that facilitates the traffic routing between these regions.</p>
</li>
<li><p>Shared Services with central control and security - Create a hub of shared services/offerings that can be accessed from customer networks by integrating with the transit gateway. This will allow businesses to offer private/dedicated services to customers while maintaining control and ensuring security. A SaaS provider with tenant VPCs for each customer would be a good example of such a case.</p>
</li>
<li><p>SD-WAN Integration - A lot of organizations use traditional SD-WAN services from telecom/network vendors. Utilizing a transit gateway, this kind of infrastructure can be migrated to integrate with AWS, allowing for better management and security.</p>
</li>
</ul>
<h2 id="heading-additional-networking-features-of-transit-gateways">Additional Networking Features of Transit Gateways</h2>
<p>Some of the additional features of the transit gateways that are worth mentioning are:</p>
<ul>
<li><p>Larger MTU - The maximum transmission unit is the size of the largest packet that can go through a network. A larger MTU means better bandwidth and faster communication. While a lot of traditional networks have an MTU of 1500 bytes, TGW offers 8500 bytes MTU for many of its attachments.</p>
</li>
<li><p>ECMP - Equal cost multipathing is a network-level load balancing method that allows sending traffic to the same destination through multiple routes. The transit gateway supports ECMP for different types of attachments.</p>
</li>
<li><p>Multicast support - Multicast is a network protocol that allows sending the same message to a selected set of network destinations. The transit gateway supports multicast and provides options to configure multicast groups and destinations.</p>
</li>
</ul>
<p>Transit gateways are a game changer in simplifying cloud network architecture, eliminating the need for complex mesh topologies and legacy hub-and-spoke networks. They are the key to maintaining a secure and efficient network topology with simplified management, which becomes critical as your cloud footprint and complexity grow.</p>
]]></content:encoded></item><item><title><![CDATA[Systemd Timers - a better alternative for crontab?]]></title><description><![CDATA[Cron is the ubiquitous job scheduler in the Linux world.  Anyone who wants to run a periodic automation or a one-off script at a predetermined time interval has used cron.   Plenty of housekeeping tasks run as scheduled jobs by default on all Linux s...]]></description><link>https://safeer.sh/systemd-timers-a-better-alternative-for-crontab</link><guid isPermaLink="true">https://safeer.sh/systemd-timers-a-better-alternative-for-crontab</guid><category><![CDATA[systemd]]></category><category><![CDATA[cronjob]]></category><category><![CDATA[Linux]]></category><dc:creator><![CDATA[Safeer C M]]></dc:creator><pubDate>Mon, 03 Apr 2023 06:30:00 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1741460471657/29ecbdec-2095-4c92-8489-a86438181b68.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Cron is the ubiquitous job scheduler in the Linux world.  Anyone who wants to run a periodic automation or a one-off script at a predetermined time interval has used cron.   Plenty of housekeeping tasks run as scheduled jobs by default on all Linux systems.  The files under /etc/cron.* will give you an idea on these cron jobs.</p>
<p>For a very long time crontab has been the undisputed choice for scheduling jobs on Linux systems.  But with the wide adoption of systemd as the system and service manager, it has given us a new option - systemd timers.</p>
<p>Most casual Linux users consider systemd a modern replacement for the SysV init system.  While that is partially true, systemd is much more than that.  To quote the systemd website:</p>
<p><strong><em>systemd provides aggressive parallelization capabilities, uses socket and D-Bus activation for starting services, offers on-demand starting of daemons, keeps track of processes using Linux control groups, maintains mount and automount points, and implements an elaborate transactional dependency-based service control logic.</em></strong> </p>
<p>While a discussion on the systemd capabilities is outside the scope of this article, the following points will serve as a refresher to better understand this article:</p>
<ul>
<li><p>Systemd owns PID1 and is started by the Linux kernel</p>
</li>
<li><p>All other processes in the system are descendants of systemd</p>
</li>
<li><p>It is responsible for filesystem initialization </p>
</li>
<li><p>The fundamental building block of the systemd ecosystem is the concept of units</p>
</li>
<li><p>A unit is a system resource that systemd knows how to manage</p>
</li>
<li><p>There are different types of units, including services, devices, mountpoints, sockets, etc</p>
</li>
<li><p>The definition and configuration of a unit is stored in unit files under different directories managed by systemd</p>
</li>
<li><p>There are system level units and user level units</p>
</li>
</ul>
<p>With that background set, let's get back to running crons with systemd timers.</p>
<h2 id="heading-systemd-timers">Systemd Timers</h2>
<p>Systemd provides an alternative to crons in the form of systemd timers.  Timers are a type of unit file that defines how a job or a service can be run on calendar events or monotonic events.  Calendar events are similar to cron time and date fields: they are set based on calendar time and can be recurring as well.  Monotonic events are configured as a time delta from the occurrence of an event, say boot time.</p>
<ul>
<li><p>A systemd timer unit file will have the extension .timer</p>
</li>
<li><p>For every systemd timer unit file, there will be another unit file with the same name and a .service extension</p>
</li>
<li><p>The timer unit defines the “when to run”, whereas the service unit defines the “what to run”</p>
</li>
</ul>
<p>Let's use an example to demonstrate how timers work.</p>
<p>Consider the scenario of taking a MySQL backup every three hours.   A crontab entry would look like this:</p>
<pre><code class="lang-bash">0 */3 * * *  /usr/<span class="hljs-built_in">local</span>/bin/mysql-backup.sh
</code></pre>
<p>And the script would be</p>
<pre><code class="lang-bash">cat /usr/<span class="hljs-built_in">local</span>/bin/mysql-backup.sh

<span class="hljs-comment">#!/bin/bash</span>
/usr/bin/mysqldump inventory  &gt; /opt/db-backup/inventory-db-backup_`date +%H-%d-%m-%Y`.sql
</code></pre>
<p>Authentication is taken care of by /root/.my.cnf:</p>
<pre><code class="lang-bash">[client]
host=localhost
user=MYSQL_BACKUP_USER
password=MYSQL_BACKUP_PASSWORD
</code></pre>
<p>Now let us see how to move this cron to a systemd timer.</p>
<p>First we need to translate our backup script into a systemd service.</p>
<pre><code class="lang-bash">cat /etc/systemd/system/mysql-backup.service

[Unit]
Description=<span class="hljs-string">"MySQL Backup Service"</span>
[Service]
ExecStart=/usr/<span class="hljs-built_in">local</span>/bin/mysql-backup.sh
</code></pre>
<p>Now we need to create the timer unit</p>
<pre><code class="lang-bash">
cat /etc/systemd/system/mysql-backup.timer
[Unit]
Description=<span class="hljs-string">"Run mysql-backup.service every 3 hours"</span>

[Timer]
OnCalendar=*-*-* 00/3:00:00
Unit=mysql-backup.service

[Install]
WantedBy=multi-user.target
</code></pre>
<p>Most parts of these unit files are self-explanatory and are no different from normal unit files.  The timer section describes which service to run (mysql-backup.service) and one or more time options.  In this particular case, we give a specific calendar schedule that executes the service once every three hours.  We will look at a few more timer options later.</p>
<p>Now, verify the files are syntactically correct.</p>
<pre><code class="lang-bash">sudo systemd-analyze verify /etc/systemd/system/mysql-backup.*
</code></pre>
<p>If this command doesn't return any output, then we are good to proceed.</p>
<p>Reload systemd to update the system about new unit files.</p>
<pre><code class="lang-bash">sudo systemctl daemon-reload
</code></pre>
<p>Enable and start the timer unit.</p>
<pre><code class="lang-bash">sudo systemctl <span class="hljs-built_in">enable</span> --now mysql-backup.timer
</code></pre>
<p>The timer is active now.</p>
<pre><code class="lang-bash">sudo systemctl status mysql-backup.timer
● mysql-backup.timer - <span class="hljs-string">"Run mysql-backup.service every 3 hours
     Loaded: loaded (/etc/systemd/system/mysql-backup.timer; disabled; vendor preset: enabled)
     Active: active (waiting) since Sat 2023-03-08 21:07:23 IST; 3s ago
    Trigger: Sun 2023-03-09 00:00:00 IST; 2h 52min left
   Triggers: ● mysql-backup.service</span>
</code></pre>
<p>As you can see, the timer is enabled and the next run time is also listed.</p>
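<p>You can also see the last and next trigger times of every active timer at a glance with list-timers:</p>
<pre><code class="lang-bash"># Drop the unit name argument to list all active timers
sudo systemctl list-timers mysql-backup.timer
</code></pre>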
<p>Now let us look at a few ways the timers can be configured.  As mentioned in the beginning, there are two categories of timers - real-time (calendar-based) and monotonic (event-based).</p>
<h2 id="heading-monotonic-timers">Monotonic timers</h2>
<p>Monotonic timers are triggered after a specific time has elapsed since an event, like boot time.  There are different options to configure monotonic timers, some of which are given below.</p>
<ul>
<li><p>OnBootSec: time after the machine boots up</p>
</li>
<li><p>OnActiveSec: time after the timer unit is activated</p>
</li>
<li><p>OnUnitActiveSec: time after the service unit was last activated</p>
</li>
<li><p>OnUnitInactiveSec: time after the service unit was last deactivated</p>
</li>
<li><p>OnStartupSec: time after the service manager is started</p>
</li>
</ul>
<p>There are various formats in which you can provide values to the monotonic timer options; some examples are below, followed by a sketch of a timer unit that uses them.</p>
<ul>
<li><p>5hours</p>
</li>
<li><p>34minutes</p>
</li>
<li><p>5hours 34minutes</p>
</li>
<li><p>1y 3month</p>
</li>
</ul>
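<p>As a sketch, a monotonic timer that first fires 15 minutes after boot and then every 6 hours thereafter could look like this (the cleanup unit name is hypothetical):</p>
<pre><code class="lang-bash">[Unit]
Description="Run cleanup.service 15 minutes after boot, then every 6 hours"

[Timer]
# Fires 15 minutes after the machine boots
OnBootSec=15min
# ...and then 6 hours after each activation of the service unit
OnUnitActiveSec=6h
Unit=cleanup.service

[Install]
WantedBy=multi-user.target
</code></pre>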
<h2 id="heading-real-time-timer">Real-time timer</h2>
<p>Triggered by calendar events, the real-time timers have only one option: “OnCalendar”.  This is the option that most closely resembles crontab timers, so let's have a quick comparison.</p>
<pre><code class="lang-bash">Crontab Format:   minute hours day-of-the-month month day-of-the-week
</code></pre>
<p>This is a five-part format that can use *, absolute values, ranges, and lists for each part.</p>
<pre><code class="lang-bash">Systemd Timer Format:  Day Of the week      Year-Month-Date      Hour:Minute:Second
</code></pre>
<p>This three-part format works as follows:</p>
<ul>
<li><p>The first part is the day of the week, with three-character values from Mon to Sun</p>
</li>
<li><p>The second part is Year-Month-Date in numeric format, with a four-digit year and two-digit month and day</p>
</li>
<li><p>The third part is the time - two digits each for hour, minute, and second</p>
</li>
<li><p>A continuous range of values can be indicated with two dots.  Eg: Mon..Wed means Mon, Tue, Wed</p>
</li>
<li><p>A list of values can be indicated with a comma.  Eg: Sat,Sun</p>
</li>
<li><p>An asterisk can be used as a wildcard to match all valid values.  Eg: “*” in the first part means all weekdays - Mon..Sun</p>
</li>
<li><p>“/” can be used as a repetition operator</p>
</li>
<li><p>Default values can be skipped and the syntax can be shortened in specific ways</p>
</li>
<li><p>Shorthands like minutely, hourly, yearly, etc., can be used as a timer</p>
</li>
</ul>
<p>The systemd time manpage covers the time formats in detail.  Refer - <a target="_blank" href="https://man.archlinux.org/man/systemd.time.7">https://man.archlinux.org/man/systemd.time.7</a></p>
<p>A few values for the OnCalendar option are:</p>
<ul>
<li><p>Sat,Sun *-*-* 23:00 - Every Saturday and Sunday at 23:00 (11 PM)</p>
</li>
<li><p>2023-02-14 00:00:01 - 1 second into 14 Feb 2023</p>
</li>
<li><p>*-*-* 00/3:00:00 - Every three hours</p>
</li>
<li><p>23:40/5 - Every day, every 5 minutes starting at 11:40 PM</p>
</li>
</ul>
<p>The variety of values in the OnCalendar option can be confusing.  To ensure we provide the right values, systemd provides a command-line option.  Let us validate the last value from this list as follows:</p>
<pre><code class="lang-bash">sudo systemd-analyze calendar --iterations=2 <span class="hljs-string">"23:40/5"</span>
  Original form: 23:40/5
Normalized form: *-*-* 23:40/5:00
    Next elapse: Sat 2023-03-08 23:40:00 IST
       (<span class="hljs-keyword">in</span> UTC): Sat 2023-03-08 18:10:00 UTC
       From now: 17min left
       Iter. <span class="hljs-comment">#2: Sat 2023-03-08 23:45:00 IST</span>
       (<span class="hljs-keyword">in</span> UTC): Sat 2023-03-08 18:15:00 UTC
       From now: 22min left
</code></pre>
<p>As you can see, the command interprets the shorthand and expands it into the normalized form.  It also shows when the next few occurrences of the schedule will happen (controlled with --iterations) and how far each one is from the current time.</p>
<h2 id="heading-conclusion">Conclusion</h2>
<p>As you can see, systemd timers are very versatile.  Crontab provides a simple and efficient form of scheduling and is ubiquitously available on all Linux operating systems, unlike systemd, which is not guaranteed to be available on all variants of Linux.  Having said that, systemd is available on the most popular distributions, and these are some of the advantages of timers:</p>
<ul>
<li><p>Because jobs are plain service units, they can be tested and run independently at any time, without the timer</p>
</li>
<li><p>A job can be configured to depend on another systemd unit. For example, run the MySQL backup unit only if the mysqld service is running ( see the sketch after this list ).</p>
</li>
<li><p>Jobs can be resource-controlled with cgroups and slices</p>
</li>
<li><p>Easy debugging with journalctl</p>
</li>
<li><p>Time formats allow more control.  Years and seconds are supported.</p>
</li>
</ul>
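<p>As a sketch of that dependency example, with hypothetical unit names and script path:</p>
<pre><code class="lang-bash"># mysql-backup.service ( hypothetical )
[Unit]
Description=MySQL backup job
# Requisite fails this unit immediately if mysqld.service is not active,
# unlike Requires, which would try to start mysqld itself
Requisite=mysqld.service
After=mysqld.service

[Service]
Type=oneshot
ExecStart=/usr/local/bin/mysql-backup.sh
</code></pre>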
<p>Do experiment with systemd timers. Most of the options available for customizing unit files can also be used to fine-tune your timers and services. If you run scheduled jobs on production systems, systemd timers make more sense than cron and are easier to automate with configuration management tools.</p>
]]></content:encoded></item><item><title><![CDATA[Evolution of CI/CD with SRE]]></title><description><![CDATA[💡
This article was written for the Continuous Delivery Foundation in my role as CDF Ambassador, along with my fellow ambassador Garima Bajpai. The original article can be found on the CDF Blog


In the past decade, we experienced exponential growth ...]]></description><link>https://safeer.sh/evolution-of-cicd-with-sre-a-future-perspective</link><guid isPermaLink="true">https://safeer.sh/evolution-of-cicd-with-sre-a-future-perspective</guid><category><![CDATA[cdf]]></category><category><![CDATA[SRE]]></category><category><![CDATA[ci-cd]]></category><category><![CDATA[Devops]]></category><dc:creator><![CDATA[Safeer C M]]></dc:creator><pubDate>Tue, 28 Feb 2023 18:30:00 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1711186431947/7c483e1a-823e-406c-9761-6a9c0b59b927.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<div data-node-type="callout">
<div data-node-type="callout-emoji">💡</div>
<div data-node-type="callout-text">This article was written for the <a target="_blank" href="https://cd.foundation">Continuous Delivery Foundation</a> in my role as <a target="_blank" href="https://cd.foundation/ambassadors/">CDF Ambassador</a>, along with my fellow ambassador <a target="_blank" href="https://www.linkedin.com/in/garimabajpai/">Garima Bajpai</a>. The original article can be found on the <a target="_blank" href="https://cd.foundation/blog/2023/03/01/evolution-of-ci-cd-with-sre-a-future-perspective/">CDF Blog</a></div>
</div>

<p>In the past decade, we experienced exponential growth and transformation in software development, cloud technologies, and the adoption of DevOps culture, which supported the advancement of the Continuous Delivery (CD) ecosystem. Alongside this growth, we also witnessed focused advancement of the Site Reliability Engineering (SRE) perspective. In this blog, we present a broader outlook on the evolution of CD with SRE through the insights presented by our ambassadors.</p>
<p>To fully realize the potential of CD at scale, the integration of SRE principles is essential. Balancing investment in tools and upskilling for reliability vis-a-vis rapid innovation in CD would be an optimal operating model for the digital economy. However, it is important to note that as SRE comes of age, it faces scalability, growth, and complexity challenges.</p>
<p>Advanced resiliency needs, next-generation security threats, and exponential data integration into software products indicate the need for evolution in the SRE approach hand-in-hand with CD. The evolution of SRE with the following key features can be seen as a critical success factor for unleashing the potential of the CD Ecosystem.</p>
<h2 id="heading-better-reliability-posture-and-overall-stability">Better Reliability Posture and Overall Stability</h2>
<p>How does SRE tie to CD? Some of the core tenets of SRE are change management, incident management, and observability. The central theme around which incident and change management revolves is “change”. The CD pipeline is the vehicle of change in your application infrastructure. Having a comprehensive CI/CD solution that covers all changes to production will provide much-needed control for SREs to understand and resolve incidents faster. It also enables them to implement change management processes as well as programmatically verify that the process is adhered to. This in turn translates to better reliability posture and overall stability.</p>
<h3 id="heading-incident-management"><strong>Incident Management</strong></h3>
<p>One of the key tenets of SRE is effective incident management. And more often than not, incidents are associated with changes in production. These could be code, configuration, or infrastructure changes. Change awareness, the knowledge of recent changes pushed to the production application infrastructure, is vital in resolving an incident within the shortest possible time. Resolutions often involve rolling back (or forward) the problematic changes to a known good state; the Last Known Good (LKG) state is one of the primary strategies used in determining the stable state to move to. This process is closely tied to one of the most important metrics SREs track: the Mean Time to Resolve (MTTR) of an incident.</p>
<h3 id="heading-change-management"><strong>Change Management</strong></h3>
<p>Another aspect of SRE is the discipline of change management itself. SRE engagement models greatly vary based on the cultural and organizational aspects of companies, and this applies to change management as well. A model that works for one organization won’t necessarily fit well in another organization. But there are always some underlying principles that can be applied across teams and organizations. The older model of change approval boards and central control has given way to the peer review and approval model. This is often supplemented with automated software and security testing as part of the continuous integration pipeline so that issues are caught and resolved early on.</p>
<h3 id="heading-observability">Observability</h3>
<p>SREs are also responsible for monitoring, or rather observability, of the application infrastructure. While this predominantly covers observability of the production infrastructure, as reliability practices have evolved, observability has also extended to the measurement of engineering excellence. This is often achieved with reliability scorecards tied to various aspects of engineering and application delivery. Similar to how golden signals provide observability into production infrastructure, DORA metrics provide observability into engineering excellence. To put it in CI/CD parlance, even observability is shifting, or rather expanding, to the left.</p>
<h2 id="heading-data-driven-cd-metrics-and-slo">Data-Driven CD: Metrics and SLO</h2>
<p>A lot has been done as SRE principles have gone mainstream; however, most SRE practices are still slow to "shift left", as highlighted by Dynatrace's State of SRE Report: 2022 Edition. Early integration of SRE principles and practices into CI/CD, with a data-centric, metric-based approach, could be the next step in the evolution of CI/CD with SRE. This starts with a unified view of SLOs right from the inception stages of Continuous Delivery; some guidance in this direction can be taken from DevOps Research and Assessment (DORA).</p>
<h2 id="heading-advanced-reliability-engineering-and-cd-platform-tools-amp-application">Advanced Reliability Engineering and CD: Platform, Tools &amp; Application</h2>
<p>As the adoption of SRE principles scales, it is evident that the reliability engineering space remains fragmented and heavily focused on monitoring, visualization, and communication.  To tailor the SRE principles to the organizational needs, SREs often have to take a hybrid approach with automation, monitoring &amp; AIOps tools co-existing in the ecosystem. In the future with the evolution of CD, SRE tools &amp; applications would not only need consolidation but a more standardized approach to scale. CD integration with SRE will go beyond the current hybrid, fragmented tool-based integration to a more platform-oriented approach, paving the way for a more proactive, insightful, and action-oriented integration of CD &amp; SRE.</p>
<h2 id="heading-emerging-technology-amp-its-integration-with-cd">Emerging Technology &amp; its Integration with CD</h2>
<p>As CD integrates emerging technologies, for example AI-based features, more and more SRE tools and applications will have to move in the same direction with data observability (ML observability, for example). Chaos engineering is another important practice; when integrated through standardized interfaces and core components, it can mature into a framework not only for experimenting but also for evaluating and prioritizing resilience at every stage of continuous delivery.</p>
<h2 id="heading-net-zero-commitment-sre-can-lead-the-way">Net-Zero Commitment – SRE Can Lead the Way</h2>
<p>As CD makes its way into more and more industry segments, it is evident we need to get more serious about the carbon footprint and the steps to mitigate and reduce carbon impact. SREs have long pursued cost reduction and right-sizing to reduce TCO, bringing down the carbon footprint as a side effect. By observing and managing workloads and the on-demand capacity of resource-hungry digital applications, products, and features, SREs can take a more conscious approach toward carbon footprint reduction. With this, SRE principles and practices will lead the way toward a carbon-aware and carbon-optimized CD ecosystem.</p>
<p>Some of these thoughts and practices are traditionally considered to be part of the DevOps domain. But in any reasonably sized organization, SRE and DevOps practices are often intertwined, driving towards the common goal of achieving production reliability and stability through engineering excellence. These cross-functional practices are centrally pivoted on continuous integration and delivery.</p>
]]></content:encoded></item><item><title><![CDATA[Golden Signals - Monitoring from first principles]]></title><description><![CDATA[💡
The article was originally published in the Squadcast blog as Golden Signals - Monitoring from First Principles


Monitoring is the cornerstone of operating any software system or application effectively. The more visibility you have into the soft...]]></description><link>https://safeer.sh/golden-signals-monitoring-from-first-principles</link><guid isPermaLink="true">https://safeer.sh/golden-signals-monitoring-from-first-principles</guid><category><![CDATA[goldensignals]]></category><category><![CDATA[observability]]></category><category><![CDATA[monitoring]]></category><dc:creator><![CDATA[Safeer C M]]></dc:creator><pubDate>Tue, 19 Oct 2021 11:54:42 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1711185389506/8227b2ed-f03d-4261-8fda-2377a1b65df0.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<div data-node-type="callout">
<div data-node-type="callout-emoji">💡</div>
<div data-node-type="callout-text">The article was originally published in the <a target="_blank" href="https://www.squadcast.com/blog">Squadcast blog</a> as <a target="_blank" href="https://www.squadcast.com/blog/golden-signals-monitoring-from-first-principles">Golden Signals - Monitoring from First Principles</a></div>
</div>

<p>Monitoring is the cornerstone of operating any software system or application effectively. The more visibility you have into the software and hardware systems, the better you are at serving your customers. It tells you whether you are on the right track and, if not, by how much you are missing the mark.</p>
<p>So what should we expect from a monitoring system? Most of the monitoring concepts that apply to information systems apply to other projects and systems as well. Any monitoring system should be able to collect information about the system under monitoring, analyze and/or process it, and then share the derived data in a way that makes sense for the operators and consumers of the system.</p>
<p>The meaningful information that we are trying to gather from the system is called signals. The focus should always be to gather signals relevant to the system. But just like any radio communication technology that we are drawing this terminology from, noise will interfere with signals. Noise being the unwanted and often irrelevant information that is gathered as a side effect of monitoring.</p>
<p>Traditional monitoring has been built around active and passive checks and the use of near-real-time metrics. The good old Nagios and RRDTools worked this way. Monitoring gradually matured to favor metrics-based monitoring, and that gave rise to popular platforms like Prometheus and Grafana.</p>
<p>Centralized log analysis and deriving metrics from logs became mainstream - the ELK stack was at the forefront of this change. But the focus is now shifting to traces, and the term monitoring is being replaced by observability. Beyond this, we also have the APM (Application Performance Monitoring) and synthetic monitoring vendors offering various levels of observability and control.</p>
<p>All these platforms provide you with the tools to monitor anything, but they don’t tell you what to monitor. So how do we choose the relevant metrics from all this clutter and confusion? The crowded landscape of monitoring and observability makes the job harder, not to mention the efforts needed to identify the right metrics and separate noise from the signal. When things get complicated, one way to find a solution is to reason from first principles. We need to deconstruct the problem and identify the fundamentals and build on that. In this specific context, that would be to identify what is the absolute minimum that we need to monitor and then build a strategy on that. So on that note, let’s understand the popular strategy used to choose the right metrics.</p>
<h1 id="heading-sre-golden-signals"><strong>SRE Golden Signals</strong></h1>
<p>SRE Golden Signals were first introduced in the Google SRE book, which defines them as the minimum set of metrics required to monitor any service. This model is about thinking of metrics from first principles and serves as a foundation for building monitoring around applications. The strategy is simple: for any system, monitor at least these four metrics - Latency, Traffic, Errors, and Saturation.</p>
<h2 id="heading-latency"><strong>Latency</strong></h2>
<p>Latency is the time taken to serve a request. While the definition seems simple enough, latency has to be measured from the perspective of either the client or the server application. For an application that serves a web request, the latency it can measure is the time delta between the moment the application receives the first byte of the request and the moment the last byte of the response leaves the application. This includes the time the application took to process the request and build the response, and everything in between - disk seek latencies, downstream database queries, time spent in the CPU queue, etc. Things get a little more complicated when measuring latency from the client's perspective, because now the network between the client and server also influences the latency. The client could be of two types: the first is another upstream service within your infrastructure; the second, and more complex, is real users sitting somewhere on the internet, with no way of ensuring an always-stable network between them and the server. For the first kind, you are in control and can measure the latencies from the upstream application. For internet users, employ synthetic monitoring or Real User Monitoring (RUM) to get an approximation of latencies. These measurements get more complicated still when there is an array of firewalls, load balancers, and reverse proxies between the client and the server.</p>
<p>There are certain things to keep in mind when measuring latencies. The first is to identify and segregate good latency from bad latency, i.e., the latency endured by a successful request versus a failed one. Quoting from the SRE Book, an HTTP 500 error latency should be measured as bad latency and should not be allowed to pollute the HTTP 200 latencies - doing so could cause an error in judgment when planning to improve your request latencies.</p>
<p>Another important matter is the choice of the metric type for latency. An average or a simple rate is not a good choice for latency metrics, as a large latency outlier can get averaged out and blindside you. This outlier, otherwise called the "tail", can be caught if latency is measured in buckets of requests. Pick a reasonable number of latency buckets and count the number of requests per bucket. This allows the buckets to be plotted as histograms and flushes out the outliers as percentiles or quartiles.</p>
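<p>As a concrete illustration, here is a hedged sketch of how the tail can be surfaced from latency buckets using Prometheus. The metric name <code>http_request_duration_seconds_bucket</code> is a common convention and an assumption here, not a given:</p>
<pre><code class="lang-bash"># PromQL sketch: 99th percentile ( tail ) latency over the last 5 minutes,
# computed from histogram buckets instead of an average
histogram_quantile(0.99, sum by (le) (rate(http_request_duration_seconds_bucket[5m])))
</code></pre>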
<h2 id="heading-traffic"><strong>Traffic</strong></h2>
<p>Traffic refers to the demand placed on your system by its clients. The exact metric varies based on what the system serves, and there could be more than one traffic metric for a system. For most web applications, this could be the number of requests served in a specific time frame. For a streaming service like YouTube, it can be the amount of video content served. For a database, it would be the number of queries served, and for a cache, it could be the number of cache misses and hits.</p>
<p>A traffic metric can be further broken down based on the nature of the requests. For a web request, this could be the HTTP code, the HTTP method, or even the type of content served. For a video streaming service, content downloads at various resolutions could be categorized; for YouTube, the amount and size of video uploads are traffic metrics as well. Traffic can also be categorized based on geographies or other common characteristics. One way to measure a traffic metric is to record traffic as a monotonically increasing value - usually of the metric type "counter" - and then calculate the rate of this metric over a defined interval, say 5 minutes.</p>
<h2 id="heading-errors"><strong>Errors</strong></h2>
<p>Errors are measured by counting the number of errors from the application and then calculating the rate of errors over a time interval. Errors per second is a common metric for most web applications. For example, errors could be 5xx server-side errors, 4xx client-side errors, or 2xx responses carrying an application-level error - wrong content, no data found, etc. This also uses a counter-type metric with a rate calculated over a defined interval.</p>
<p>An important decision to make here is what we consider an error. It might look like errors are always obvious - 5xx responses, database access errors, and so on. But there is another kind of error, defined by our business logic or system design. For example, serving wrong content for a perfectly valid customer request would still be an HTTP 200 response, but per your business logic and the contract with the customer, it is an error. Or consider a downstream service that ultimately returns a response to an upstream server, but only after the upstream's latency threshold has expired. While the upstream would consider this an error - as it should - the downstream may not be aware that it breached an SLO (which is subject to change and may not be part of the downstream application's design) and would count it as a successful request, unless the necessary contract is added to the code itself.</p>
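<p>Both traffic and errors are typically modeled the same way: a monotonically increasing counter, with a rate computed over a defined interval. A hedged PromQL sketch, assuming a conventional <code>http_requests_total</code> counter labeled with the HTTP status code:</p>
<pre><code class="lang-bash"># traffic: requests per second over the last 5 minutes, per status code
sum by (code) (rate(http_requests_total[5m]))

# errors: share of 5xx responses among all requests
sum(rate(http_requests_total{code=~"5.."}[5m])) / sum(rate(http_requests_total[5m]))
</code></pre>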
<h2 id="heading-saturation"><strong>Saturation</strong></h2>
<p>Saturation is a sign of how used, or "full", the system is. 100% utilization of a resource might sound ideal in theory, but a system nearing full utilization of its resources can suffer performance degradation. The tail latencies we discussed earlier could be the side effect of a resource constraint at the application or system level. Saturation can happen to any resource the application needs: system resources like memory, CPU, or I/O; open file counts hitting the limit set by the operating system; or disk and network queues filling up. At the application level, request queues may fill up, the number of database connections may hit its maximum, or threads may contend for a shared resource in memory.</p>
<p>Saturation is usually measured as a "gauge" metric type, which can go up or down, usually within a defined upper and lower bound. While not a saturation metric itself, the 99th percentile request latency (or other metrics on outliers) of your service can act as an early warning signal. Saturation can have a ripple effect in a multi-tiered system, where your upstream waits on the downstream service's response indefinitely or eventually times out, causing additional requests to queue up and resulting in resource starvation.</p>
<p><em>While the Golden Signals covered in this blog are metrics-driven and a good starting point for detecting that something is going wrong, they are not the only things to consider. There are various other metrics that are not necessary to track on a daily basis but are certainly important to investigate when an incident takes place. We will be covering this in Part 2 of this blog series.</em></p>
<p><em>Irrespective of your strategy, understanding why a system exists, what its services are, and the business use cases it serves is vital. This will lead you to identify the critical paths in your business logic and help you model the metrics collection system based on that.</em></p>
]]></content:encoded></item><item><title><![CDATA[Trust is a vulnerability — The Zero Trust Security Model]]></title><description><![CDATA[Zero trust security is an improved security model that introduced a shift in how we traditionally thought of securing our data and resources.
The traditional approach to network security

The traditional approach to securing an infrastructure is to u...]]></description><link>https://safeer.sh/trust-is-a-vulnerability-the-zero-trust-security-model</link><guid isPermaLink="true">https://safeer.sh/trust-is-a-vulnerability-the-zero-trust-security-model</guid><category><![CDATA[Security]]></category><category><![CDATA[zerotrust]]></category><category><![CDATA[network security]]></category><category><![CDATA[vpn]]></category><dc:creator><![CDATA[Safeer C M]]></dc:creator><pubDate>Mon, 14 Jun 2021 11:01:03 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1711186097703/8551e799-a762-40ec-b4ef-6b57806bd71e.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Zero trust security is an improved security model that introduced a shift in how we traditionally thought of securing our data and resources.</p>
<h3 id="heading-the-traditional-approach-to-network-security">The traditional approach to network security</h3>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1711176490065/d631d4d3-aee9-4680-a723-6a4a1337fcfb.webp" alt class="image--center mx-auto" /></p>
<p>The traditional approach to securing an infrastructure is to use network perimeter-based protection - otherwise known as the castle-and-moat approach. The corporate firewall becomes the moat that encircles and protects the network castle: all devices and users behind the firewall perimeter are trusted, while the rest of the world is considered untrusted. Parties get access to the inside by being physically located there, by being in a connected corporate office or data centre network, or by connecting via VPN. There can also be different zones of trust within the trusted environment with more restrictions ( for example, a DMZ ). Some of the tools employed to secure the trusted perimeter are firewalls, VPNs, ACLs, IDS, IPS, etc.</p>
<p>The inherent weakness in this approach is the de facto classification of inside devices and users as trusted. Devices and users can be compromised, and once compromised, they can be used to launch attacks that exploit the implicit trust conferred upon them. This problem is further aggravated by the growing adoption of SaaS/IaaS cloud services, more remote users, and bring-your-own-device (BYOD) policies. The perimeter of the trusted network gets extended further in such a hybrid environment, and it has become hard to bring all these elements under a few trusted perimeters and classify traffic as trusted or untrusted.</p>
<h3 id="heading-zero-trust-approach-to-network-security">Zero trust approach to network security</h3>
<p>The cornerstone of zero-trust security is the elimination of the primary weakness in traditional security, i.e., implicit trust. Zero trust doesn't use network location - whether a user or device is inside the perimeter - to decide on authorization or access. It instead treats all requests as hostile and enforces security for every request made.</p>
<p>The approach follows three principles:</p>
<ol>
<li><p>Verify explicitly: Authenticate and authorize devices and users based on multiple criteria set by the organization’s policy. This could include but is not limited to the health of the device, the identity of the user, the classification of target service or data to be accessed, suspicious activities originating from the device or by the user, etc.</p>
</li>
<li><p>Use least privilege access — Restricting a user or device to the minimum permission required to perform a certain action minimizes the attack surface. It prevents the chances for lateral movements ( getting access to one location and then moving on to adjacent services and systems ).</p>
</li>
<li><p>Assume Breach — Assume every access/operation is hostile and then implement processes to identify and approve legitimate access.</p>
</li>
</ol>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1711175726590/25789a9a-65a7-440e-8b50-67170cec397c.png" alt /></p>
<p>The zero trust model is not a single technology or software implementation. It is rather a framework that can be implemented by incorporating several security technologies that exist and being used already in silos or some combinations.</p>
<p>Fundamentally, zero trust is a system that allows access from a subject to a resource while ensuring authentication, authorization, and transport security for every request between them.</p>
<p>The subject could be</p>
<ul>
<li><p>User/Identity</p>
</li>
<li><p>Device/Endpoints</p>
</li>
<li><p>Applications</p>
</li>
<li><p>Systems</p>
</li>
</ul>
<p>The resource/object could be one or a combination of the following</p>
<ul>
<li><p>Data</p>
</li>
<li><p>Applications/APIs</p>
</li>
<li><p>Infrastructure/Systems</p>
</li>
</ul>
<p>The workhorse of zero-trust is the policy decision and enforcement infrastructure. This infrastructure is spread across the control and data planes of the zero-trust ecosystem.</p>
<p>The control plane hosts two components of the policy infrastructure:</p>
<ol>
<li><p>Trust evaluation/policy engine — Responsible for granting access to a request from a subject to a resource. This component depends on input from multiple systems to make its decision. This includes enterprise access policy and input from various information and security systems like SIEM ( Security information and event management ), compliance services, threat intelligence services, etc.</p>
</li>
<li><p>Policy administrator/Access control engine — This component is responsible for authentication, authorization, and access control for every request. It is also responsible for the continued evaluation of trust while a trusted subject is accessing a resource ( and revoking the access if necessary ). This service will make use of identity management services, the policy engine, real-time threat analysis services, etc.</p>
</li>
</ol>
<p>A data plane usually has only one component.</p>
<ol>
<li>(Trusted) Proxy — This proxy is also known as the policy enforcement point. It is responsible for enabling, monitoring, and terminating connections between subject and resource. It intercepts the traffic from the subject, executes the decision from the policy engine and administrator, and ensures the adaptive access control is enforced. It is responsible for encrypting all traffic.</li>
</ol>
<p>The following diagram represents high-level elements of the zero-trust ecosystem.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1711175728691/bce499ff-b439-4ef3-84c2-173b9b5da287.png" alt /></p>
<p>Image adapted from NIST SP 800–207</p>
<p>Zero trust is a reference model that can be implemented in different ways as long as the underlying principles are maintained. Hence you will find different cloud and open-source products that fall under the zero-trust security model. Irrespective of the product or service you choose, zero trust is an iterative process and requires a lot of work beyond adopting a service. Some of the steps include:</p>
<ol>
<li><p>Identify the data stores</p>
</li>
<li><p>Classify the data according to sensitivity</p>
</li>
<li><p>Identify all the roles and the level of access to this data required for various use cases</p>
</li>
<li><p>Ensure all users are brought under one or more of these roles</p>
</li>
<li><p>Map out all transaction flows/communications to these systems and build policies that map those flows to the role and level of access required</p>
</li>
<li><p>Use the above information along with company policies, compliance requirements, and security practices to create enterprise access policies.</p>
</li>
<li><p>Once these are sorted out, work out which solutions you want to adopt and redesign your security infrastructure accordingly.</p>
</li>
</ol>
]]></content:encoded></item><item><title><![CDATA[Chaos in the network — using ToxiProxy for network chaos engineering]]></title><description><![CDATA[Chaos engineering is the discipline of experimenting on a software system in production to build confidence in the system’s capability to withstand turbulent and unexpected conditions.

If you are new to Chaos Engineering, go through this introductio...]]></description><link>https://safeer.sh/chaos-in-the-network-using-toxiproxy-for-network-chaos-engineering</link><guid isPermaLink="true">https://safeer.sh/chaos-in-the-network-using-toxiproxy-for-network-chaos-engineering</guid><category><![CDATA[toxiproxy]]></category><category><![CDATA[Chaos Engineering]]></category><category><![CDATA[networking]]></category><category><![CDATA[chaos]]></category><dc:creator><![CDATA[Safeer C M]]></dc:creator><pubDate>Wed, 05 May 2021 05:11:27 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1711185935381/b4141ce7-f72c-4ee2-b730-14c63a054451.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<blockquote>
<p><em>Chaos engineering is the discipline of experimenting on a software system in production to build confidence in the system’s capability to withstand turbulent and unexpected conditions.</em></p>
</blockquote>
<p>If you are new to Chaos Engineering, go through this introduction first:</p>
<p><a target="_blank" href="https://safeer.sh/engineering-chaos-a-gentle-introduction">Engineering Chaos: A Gentle Introduction</a></p>
<p>In production outages, a lot of blame is attributed to the network — sometimes with reason and evidence but countless other times because there is no other visible culprit to blame.</p>
<p>To increase the resilience against network failures and degradation, we need to run our chaos experiments on the network. But this is not always easy — if your application is in a data center, the chances of getting your hands on the network infrastructure to introduce chaos are close to zero, and with good reason. If the application is hosted in the cloud, the network layer is mostly abstracted out from you.</p>
<p>What, in this situation, would be the right way to introduce some network chaos? Given that we can't manipulate the networking infrastructure itself, the next best thing we can do is redirect the traffic to a system that we control and have that system forward the traffic to the original destination. This can be achieved in different ways - manipulating routing, modifying DNS records, using forward proxies, or transparently intercepting network packets with tools like iptables or eBPF.</p>
<p>In this article, we are going to examine one such tool - <a target="_blank" href="https://github.com/Shopify/toxiproxy">Toxiproxy</a>. Toxiproxy is a framework and TCP proxy that can simulate poor network conditions; it was developed by Shopify to test the resilience of their web stack. It can intercept and forward TCP communication, is highly performant, and is easy to configure. Any traffic flow that needs to be tested against network degradation can be sent through Toxiproxy and subjected to various experiments before being forwarded to its intended destination.</p>
<p>Toxiproxy has two components:</p>
<ol>
<li><p>The control plane - the API used to manage the proxy configuration. It can be driven by hitting the API directly, via toxiproxy-cli, or through the various client libraries</p>
</li>
<li><p>The data plane — the proxies that are created on demand to proxy different services</p>
</li>
</ol>
<p>The Toxiproxy ecosystem is as given below</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1711175740630/91e02816-9f77-431a-b89c-bb85f574c8fb.jpeg" alt /></p>
<p>OK, so we have installed and started toxiproxy, but how exactly does it simulate poor network conditions, and what are those conditions?</p>
<p>To proxy the traffic to any given downstream service, a corresponding proxy has to be created within toxiproxy with a source port of our choosing ( through which we will proxy to the destination ) and the port of the specific downstream/destination service.</p>
<p>For example, when you want to proxy traffic to a remote MySQL server running on the default port 3306, you create a proxy with a source port of your choice ( say 4306 ) and the MySQL server's host and port ( mysql-host:3306 ) as the destination. Your application then configures its MySQL client to talk to the Toxiproxy host on port 4306.</p>
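<p>Using the CLI ( shown in the demo below ), that hypothetical MySQL setup would look something like this; the hostname is a placeholder:</p>
<pre><code class="lang-bash"># mysql-host is a hypothetical downstream; the app now connects to localhost:4306
toxiproxy-cli create mysql --listen localhost:4306 --upstream mysql-host:3306
</code></pre>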
<p>Once the proxy is ready, it's time to introduce the fault ( anomaly/poor condition ). In Toxiproxy, these conditions are called toxics ( hence the name ). Each toxic has its own parameters/attributes.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1711178252236/ec2bdc8c-6f0d-4d46-a8bb-9d2c533279cf.webp" alt class="image--center mx-auto" /></p>
<p>While the toxics and the attributes are mostly self-explanatory, more details about toxics and attributes can be found <a target="_blank" href="https://github.com/Shopify/toxiproxy#toxics">here</a></p>
<h2 id="heading-setting-up-and-testing-a-proxy">Setting up and testing a proxy</h2>
<p>Toxiproxy installation is quite easy ( it is a single binary each for the server and the CLI ). Instructions can be found <a target="_blank" href="https://github.com/Shopify/toxiproxy#1-installing-toxiproxy">here</a>. Once installed, run the server on the default port, 8474 ( or an alternate port of your choosing, in which case you should use the "--host" option with the CLI ). This is the port on which the control plane API is available.</p>
<p>Once the installation is done, we can start setting up proxies.</p>
<p>First, start Toxiproxy by running the server binary: <code>toxiproxy-server</code>. Run without any arguments, it listens on interface 127.0.0.1, port 8474. Toxiproxy keeps all modifications in memory, but a config file in JSON format with predefined proxies can be provided as a command-line argument. Once the server is started, it can be managed using the <code>toxiproxy-cli</code> binary or the <a target="_blank" href="https://github.com/Shopify/toxiproxy#clients">client libraries</a> in different languages. The process is outlined below.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1711175742977/5fdd2540-711f-4b18-bb09-d86019dc17e1.png" alt /></p>
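<p>Before the practical example, here is a minimal sketch of such a JSON config, pre-defining the proxy we will create below; consult the Toxiproxy README for the authoritative schema:</p>
<pre><code class="lang-bash"># seed the server with one pre-defined proxy from a JSON file
cat &gt; toxiproxy.json &lt;&lt;'EOF'
[
  {
    "name": "ipify",
    "listen": "127.0.0.1:8443",
    "upstream": "geo.ipify.org:443",
    "enabled": true
  }
]
EOF
toxiproxy-server -config toxiproxy.json
</code></pre>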
<p>Now let us try a practical example</p>
<ul>
<li><p>toxiproxy is running on the default port on my laptop</p>
</li>
<li><p>the downstream we will proxy to is the Geo-location API of ipify.org. Specifically, the endpoint <a target="_blank" href="https://geo.ipify.org/api/v1">https://geo.ipify.org/api/v1</a> which returns the public IP and Geolocation of the caller</p>
</li>
<li><p>Let's configure the proxy with:</p>
<ul>
<li><p>Unique proxy name: ipify</p>
</li>
<li><p>Downstream server: geo.ipify.org</p>
</li>
<li><p>Downstream port: 443 ( SSL )</p>
</li>
<li><p>Proxy port: 8443</p>
</li>
</ul>
</li>
<li><p>Toxic to inject:</p>
<ul>
<li><p>type: latency</p>
</li>
<li><p>attribute: latency</p>
</li>
<li><p>value: 1500 ( milliseconds )</p>
</li>
<li><p>name: latency_1500</p>
</li>
</ul>
</li>
</ul>
<p>Let us create the proxy using the <code>toxiproxy-cli</code></p>
<pre><code class="lang-bash">toxiproxy-cli create ipify --listen localhost:8443 --upstream geo.ipify.org:443
</code></pre>
<p>Add the latency toxic:</p>
<pre><code class="lang-bash">toxiproxy-cli toxic add --toxicName latency_1500 --type latency --attribute latency=1500 ipify
</code></pre>
<p>Let's list the proxies and then inspect the ipify proxy</p>
<pre><code class="lang-bash">toxiproxy-cli list

Name      Listen           Upstream            Enabled   Toxics
====================================================================
ipify     127.0.0.1:8443   geo.ipify.org:443   enabled   1
</code></pre>
<pre><code class="lang-bash">toxiproxy-cli inspect ipify

Name: ipify   Listen: 127.0.0.1:8443   Upstream: geo.ipify.org:443
====================================================================
Upstream toxics:
Proxy has no Upstream toxics enabled.

Downstream toxics:
latency_1500: type=latency stream=downstream toxicity=1.00 attributes=[ jitter=0 latency=1500 ]
</code></pre>
<p>Let's hit the geo.ipify.org API directly and get my public IP and geolocation using curl. We will also print the total time taken for the request, and filter the JSON output using jq to pick up only the country of the public IP. Please note that I have already saved my API key in the shell variable <code>IPIFY_APIKEY</code>.</p>
<pre><code class="lang-bash">curl -s -w "%{stderr}Total Time: %{time_total}\nCountry from public IP: " "https://geo.ipify.org/api/v1?apiKey=${IPIFY_APIKEY}" | jq .location.country

Total Time: 4.108721
Country from public IP: "IN"
</code></pre>
<p>The vanilla request, without any proxy, took approximately 4,100 milliseconds ( about 4 seconds ).</p>
<p>Now let us send the traffic via the proxy we created earlier. Note that we need to pass the Host header and disable SSL certificate verification.</p>
<pre><code class="lang-bash">curl -k -s -w "%{stderr}Total Time: %{time_total}\nCountry from public IP: " -H "Host: geo.ipify.org" "https://localhost:8443/api/v1?apiKey=${IPIFY_APIKEY}" | jq .location.country

Total Time: 5.727683
Country from public IP: "IN"
</code></pre>
<p>As you can see, the request now took about 5,700 milliseconds, roughly 1,500 milliseconds more than before, which matches the latency toxic we added. You can experiment with various toxics like this to chaos-test your app against different network conditions.</p>
<blockquote>
<p>Note: The ipify API response time varies greatly when it is under load ( and I am using the free version of their API ). Try to experiment with either a performant public API or an internally hosted service for consistent results.</p>
</blockquote>
]]></content:encoded></item><item><title><![CDATA[Engineering Chaos: A gentle introduction]]></title><description><![CDATA[With the wide adoption of microservice architecture, it has become increasingly complex to predict and protect against production outages arising from infrastructure and application failures. Oftentimes, underlying reliability and performance issues ...]]></description><link>https://safeer.sh/engineering-chaos-a-gentle-introduction</link><guid isPermaLink="true">https://safeer.sh/engineering-chaos-a-gentle-introduction</guid><category><![CDATA[chaos]]></category><category><![CDATA[Chaos Engineering]]></category><dc:creator><![CDATA[Safeer C M]]></dc:creator><pubDate>Sat, 01 May 2021 17:12:39 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1711175736707/b1a35e52-a62f-4f7e-8927-39e16f6137e1.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>With the wide adoption of microservice architecture, it has become increasingly complex to predict and protect against production outages arising from infrastructure and application failures. Oftentimes, underlying reliability and performance issues are brought out by these outages. The engineers maintaining these applications and infrastructures often speculate about common issues that could occur and try to proactively fix them based on their speculation. While this works to a point, the shortcoming is that the speculation often fails to level up with the real impact as it occurs in production.</p>
<p>So how do we better prepare for such issues? Making fixes based on the speculation and waiting for an outage to occur to confirm our hypothesis is neither scalable nor practical. We need to attack this problem head-on. That is where chaos engineering comes into the picture.</p>
<blockquote>
<p><em>Chaos engineering is the discipline of experimenting on a software system in production to build confidence in the system’s capability to withstand turbulent and unexpected conditions</em></p>
</blockquote>
<p>In other words, chaos engineering looks for evidence of weakness in a production system.</p>
<h3 id="heading-the-evolution-of-chaos-engineering">The Evolution of Chaos Engineering</h3>
<p>The term chaos engineering rose in popularity shortly after Netflix published a blog in 2010 about their journey of migrating their infrastructure to the AWS cloud. In the article, they talked about Chaos Monkey, a tool they used to randomly kill EC2 instances. This was to simulate production-outage-like scenarios and observe how their infrastructure coped with such failures. This trend of experimenting by introducing failures into production infrastructure quickly caught on; companies started adopting the principles, and chaos engineering soon evolved into its own discipline.</p>
<p>While Netflix might have popularized chaos engineering, breaking infrastructure to test resiliency has its roots in Amazon, where Jesse Robbins, popularly known within Amazon as the <em>Master of Disaster</em>, introduced Gamedays. A Gameday is an exercise where failures are introduced into production to observe how systems and people respond; based on the observations, the systems are fixed/rebuilt/re-architected and processes improved. This greatly helped Amazon expose weaknesses in its infrastructure and fix them.</p>
<h3 id="heading-chaos-experiments">Chaos Experiments</h3>
<blockquote>
<p>Practical chaos engineering at its heart is the process of defining, conducting, and learning from chaos experiments.</p>
</blockquote>
<p>Before we start with experiments, we need to identify a <strong>target system</strong> for which we will test the resilience. This system could be an infrastructure component or an application. Once you identify such a target, map out its <strong>dependencies/downstream</strong> — databases, external APIs, hardware, cloud/data center infrastructure, etc. The aim is to identify the impact on the target — if any — when an anomaly/fault is injected into one of its dependencies.</p>
<p>Once we have decided on the target and the dependency to which fault should be injected — define a <strong>steady-state</strong> for this system. A steady state of a system is the desirable state in which the target can serve its purpose optimally.</p>
<p>Following this, form a <strong>hypothesis</strong> about the resilience of the system. The hypothesis defines the impact that the target would suffer when the fault is injected into the dependency, causing the target to violate its steady state. As part of this, the <strong>nature and severity of fault</strong> to be injected — the real-world failure scenario that needs to be simulated — should be defined as well.</p>
<p>After we decide on these factors, the <strong>fault injection</strong> can begin. In an ideal world, the fault should be injected into the production system, though it doesn't hurt to first try this out in a pre-production environment. In a reasonably well-built system, the target will withstand the fault for a while, then start failing when it crosses some threshold or as the impact of the fault aggravates. If the impact of the fault is not severe, or the target is designed to withstand the fault, the experiment and the severity of the fault need to be redefined. In either case, the experiment should define a <strong>stop condition</strong> - the point at which an experiment is stopped, either after encountering an error/breakage in the target system or after a defined period without any errors or violations of the steady state.</p>
<p>During the experiment, record the behavior of the target system — what was the pattern of the standard metrics, what events and logs it generated, etc. It is also important to watch upstream services that make use of the target service. If the experiment was stopped after encountering an error this would provide valuable insight into the resiliency and should be used to improve the resilience of the target system as well as upstream services.</p>
<p>If the target didn’t encounter any errors and the experiment finished without incidents after the defined time, the fault and its impact need to be redefined. This would entail defining more aggressive faults which would result in increased impact on the target system. This process is known as increasing the blast radius.</p>
<blockquote>
<p>Rinse and repeat the process until all weaknesses are eliminated.</p>
</blockquote>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1711175734583/e187b407-1b04-43a1-bdf2-e491ba4e820e.png" alt /></p>
<h3 id="heading-an-example-to-tie-it-all-together">An example to tie it all together</h3>
<p>Suppose your company is in fleet management and last-mile delivery. There are several microservices in your application infrastructure: front-end and back-end services that handle inventory, vehicles, drivers, shifts, route planning, etc.</p>
<blockquote>
<p>Note: this is a very trimmed down and hypothetical design and may not cover all possibilities</p>
</blockquote>
<p>Let’s follow the process outlined in the previous section.</p>
<ul>
<li><p>Selection of <strong>target</strong> service — <strong>route planning</strong> microservice.</p>
</li>
<li><p>The route planner uses Google Maps API, an in-memory cache like Redis, and a MySQL back-end among other things. Route planner exposes a REST API that is used by the upstream front-end services that display the optimal route the drivers should take.</p>
</li>
<li><p><strong>Downstream dependency</strong> service — we pick <strong>Google Maps API</strong> as the service to which fault should be injected.</p>
</li>
<li><p>The <strong>steady-state</strong>: when operating in optimal condition, the route planner will serve <strong>99% of requests under 100ms</strong> latency ( P99 latency ). The service can tolerate a P99 latency of up to 200ms</p>
</li>
<li><p><strong>Fault</strong> and severity - a <strong>latency of 50ms</strong> when accessing the Google Maps API, for each API call</p>
</li>
<li><p><strong>Hypothesis</strong>: Upstream front-end servers have a timeout of 200ms for getting a response from the route planner API. Even after accounting for a 50ms latency to Google Maps API, the route planner will still return a result within 150ms which is well within the expectation of the upstreams. Expect the <strong>P99 latency to be at 150ms</strong> and no significant increase in <strong>4xx or 5xx</strong> errors for the route planner. Don’t expect the upstream services using this API to have any issues.</p>
</li>
<li><p><strong>Stop condition</strong>, whichever comes first:</p>
<ul>
<li><p>A 10-minute run without any impact</p>
</li>
<li><p>P99 latency crosses 200ms</p>
</li>
<li><p>5xx errors rise above 2% of total requests</p>
</li>
<li><p>Failure reports from upstream services or customers</p>
</li>
</ul>
</li>
<li><p>Experiment start and <strong>fault injection</strong>: In practice, this could be achieved in many ways, one popular option being the use of an <strong>intermediary proxy</strong> to control the latency of the outgoing traffic ( see the sketch after this list ). For example, <a target="_blank" href="https://medium.com/devopsiraptor/chaos-in-the-network-using-toxiproxy-for-network-chaos-engineering-13fb0ae2deea"><strong>Toxiproxy</strong></a> can be configured to send outbound traffic with a latency of 50ms ( and proxy the traffic to the Google Maps API ). For this to work, the route planner application should be <strong>redeployed</strong> with the Toxiproxy endpoint in place of the Google Maps API URL. If you are on Kubernetes, you can also use <a target="_blank" href="https://chaos-mesh.org/"><strong>Chaos Mesh</strong></a> to introduce chaos.</p>
</li>
<li><p><strong>Record</strong> the fault injection and its impact:</p>
</li>
<li><p>Scenario 1: No issues; P99 latency was close to, but less than, 150ms. No visible change in the number of 5xx/4xx errors.</p>
<ul>
<li><p>Stop the experiment. Increase the blast radius by raising the latency from 50ms to 75ms, and repeat the experiment.</p>
</li>
</ul>
</li>
<li><p>Scenario 2: P99 latency is between 150ms and 200ms, and 5xx errors for the front end spike to 3%. Users report seeing blank pages instead of route plans.</p>
<ul>
<li><p>Stop the experiment and investigate the issue.</p>
</li>
<li><p>You find that certain routes needed 3 Google Maps API calls instead of 1. The route planner returned the right result within 250ms ( 100 + 50 * 3 ), but the API request from the front-end server requesting the route data had already timed out at 200ms, causing the front-end servers to show a blank page instead of a meaningful message. Since only a small percentage of requests had this issue, their 200+ ms latency was averaged out by the large number of 150ms requests.</p>
</li>
<li><p>Fix the code, and see whether the 3 calls can be run in parallel instead of sequentially. Modify the front-end code to gracefully handle the timeout and show a customer-friendly message.</p>
</li>
</ul>
</li>
</ul>
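<p>To make the fault-injection step above concrete, here is roughly what the Toxiproxy setup could look like, reusing the CLI syntax from the Toxiproxy post linked earlier. The proxy name, listen port, and upstream host are hypothetical:</p>
<pre><code class="lang-bash"># hypothetical proxy fronting the Google Maps API for the route planner
toxiproxy-cli create gmaps --listen localhost:8443 --upstream maps.googleapis.com:443
# inject 50ms of latency into every call that flows through the proxy
toxiproxy-cli toxic add --toxicName maps_latency_50 --type latency --attribute latency=50 gmaps
</code></pre>
<p>The route planner would then be redeployed to call the proxy endpoint instead of the Maps API directly, as described above.</p>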
<blockquote>
<p>Once the issues are fixed, repeat the experiments.</p>
</blockquote>
<h3 id="heading-best-practices-and-checklists">Best practices and checklists</h3>
<ul>
<li><p>If there are known <strong>single points of failure</strong> or other resiliency issues in the infrastructure/application, fix them before attempting to run chaos experiments in production</p>
</li>
<li><p>Make sure all the vital metrics for the target and dependent systems are being monitored. A <strong>robust monitoring</strong> system is essential for running chaos experiments</p>
</li>
<li><p>While it is natural to focus primarily on the <strong>target service and the downstream</strong> dependency, it is vital to <strong>monitor the upstream</strong> services and the customers of the target service</p>
</li>
<li><p>Use <strong>learnings</strong> from previous <strong>outages and postmortems</strong> to create better hypotheses, as well as to fix known problems proactively</p>
</li>
<li><p>Use the learnings from chaos experiments not only to fix software systems but also to fix any <strong>gaps in the process</strong> - on-call response, runbooks, etc.</p>
</li>
<li><p>Once you are confident about the experiments, <strong>automate</strong> them fully and let them run periodically in production.</p>
</li>
<li><p>To build confidence to test in production, chaos experiments can be incorporated into the <strong>testing process</strong> for pre-production build pipelines</p>
</li>
</ul>
]]></content:encoded></item></channel></rss>