Monitoring and Observability Resources for Engineers

DZone's Featured Monitoring and Observability Resources

8 Ways Mobile Observability Is the Opposite of Backend Observability

By Fredric Newberg

Using a hammer to pound a screw into the wall will work, but it's not really a great way to get the job done and probably will lead you with damage you wish hadn't occurred. Similarly, using monitoring tools initially designed for the observability of backend applications to monitor your mobile applications will leave you wishing you had reached for the screwdriver instead of the proverbial hammer. Often the observability challenges for mobile applications are pretty much the opposite of what they are for backend monitoring. Let's take a look at 8 examples where that is the case. To make this more concrete, we will use a typical e-commerce mobile application and the backend application that handles its requests as an example to illustrate these differences. However, the comparisons in these examples are broadly applicable to other types of mobile applications and backend systems that you are running. Duration of Interactions For high-traffic services in your backend application, you are looking to have requests that take milliseconds to run on average, and you want to scale to handle thousands of requests per second. You don't maintain a state between requests, and it's uncommon for prior requests to cause bugs in the current request. Some examples of service calls here include: Getting a list of specific products. Completing a purchase. Fetching a list of alternative products for a given product. The data needed to troubleshoot issues here is likely within the request or the supporting infrastructure. You can trace the individual calls and connected service calls, and then inspect them to look for failure points. However, in your e-commerce mobile app, a single session lasts from multiple seconds to minutes, or even hours. If you want to understand why purchases are failing, the problem could stem from many application and device factors: Did the user background the app between adding items to the cart and attempting to complete the purchase? Certain data might be lost during such app state transitions. Did the app run out of memory in a product list view due to excessive loading and retention of product images not appropriately scaled for the device? Did the app not complete the payment processing in a timely manner, so the user force quit the app? Did the device lose network connectivity and fail to gracefully recover? Did the app crash during the purchase flow, and the user decided to purchase elsewhere? Since mobile is such a dynamic environment, tracking down the root cause of a drop in purchases could have many root causes that fall outside of the actual service calls. The span over which errors can be introduced is far greater than in backend interactions. How you visualize and interpret data becomes very different when your expectation is that issues can evolve over minutes and not over milliseconds. Session Complexity When you envision what a complete session is for your backend application, it frequently boils down to responding to a request from the client. The external variables at play are mostly your infrastructure’s health and capacity. In your e-commerce mobile app, a complete user experience can span multiple sessions across varying lengths of time. The user could launch, background, and then launch the app again over multiple days to complete a single purchase. Key functionality can also take place while the app is backgrounded, such as sending push notifications and pulling fresh products and deals so that the user is always getting the most up-to-date data whenever they launch the app. Some challenges when troubleshooting apps with complex interactions include: Stitching together multiple app sessions to get the complete user experience context. Understanding how app performance is impacted by different launch conditions like cold starts versus reused app processes. Tracking problems with failed or outdated app states that were loaded far earlier than when the resulting error happened. App sessions also cannot be easily modeled as a series of traces, so there are data and visualization challenges when dealing with longer, more complex experiences. Uncontrolled Devices You control the infrastructure that your backend applications run on. As such, it would be a rookie mistake for a DevOps team to, for example, not be aware of servers that are about to run out of disk space, and most people would forgive the backend monitoring agent for not working as expected if a server ran out of disk space. That is not the case for your e-commerce mobile app that runs on devices that you have no control over. People buy devices with the least amount of storage they think they can get away with and promptly fill them up with apps and media. You have to build resilient SDKs that can gracefully handle these situations and still report as complete a picture as possible. You have to find the right balance between retaining relevant information on the device – you may not have network connectivity to send it right away – and making the lack of disk space worse by excessively adding to the data stored on the device. Heterogeneous Devices Not only do you have no control over devices, but also they are far from homogeneous. In a backend environment, you are likely to have a small set of different machine types. For an Android app, you will have it run on tens of thousands of device models, running a variety of OS versions, so you end up with more complicated variables when analyzing the collected data. Cardinality for certain dimensions will grow in ways that just would not be seen in backend applications. Some examples of device-specific issues include: Your developers and QA team have modern devices for testing, which can handle the size of the product images in a list view. However, many customer devices have less RAM and end up with an out-of-memory crash. A manufacturer introduced a bug in their custom Android version, so customers encounter a crash that only affects your app on specific OS version/manufacturer combinations. The UI stutters on some devices because they have old CPU and GPU chipsets that cannot handle the complexity of your application. With so many combinations of device variables, your engineering team needs deep insights into affected user segments to avoid costly issue investigations. Otherwise, they will spend time looking for root causes in code when seeing the holistic picture of impact would streamline their resolution efforts. Network Connectivity Your e-commerce backend application operates with the explicit assumption of constant connectivity. Failures frequently are a capacity problem, which can be alleviated by sizing your infrastructure to handle traffic spikes. Outright losing connectivity occurs mostly during cloud provider outages, which are exceedingly rare. However, constant network connectivity in mobile is never guaranteed. Maybe your app has a great network connection when it starts, then completely drops the network connection, and then gets it back but experiences significant lag and packet loss. Your mobile app observability solution needs to provide insight into how your app deals with these challenging conditions. Some examples of network connectivity issues include: The app cannot launch without connectivity because the download of critical data is required to enter the main application flow. The device loses connection as the user tries to make a purchase, but the user is not greeted with a prompt about the issue. To the user, it still looks like the app is attempting to complete the purchase. They get frustrated and force quit because they don’t know the source of the issue. The app does not effectively cache images, so customers in locales where bandwidth is a scarce resource stop using your application. Since problems can occur during connectivity switches, you need visibility into entire user experiences to track down problems. A great example is content refreshes or data synchronizations that are scheduled as background tasks. Understanding where failures happen under specific network conditions allows your engineering team to understand the root cause of network-related problems and introduce fixes that gracefully handle the loss of network connectivity. Data Delays Many backend observability tools will only accept data that is delayed by minutes or at most a few hours. This works fine for backend applications since the expectation is for the servers to not lose connectivity. The opposite is true in mobile, where the expectation is for connectivity to be lost intermittently and for a significant percentage of data to be delayed. As an example, your engineering team notices a spike in crashes, then launches an investigation and puts out a fix in a new version. You notice the crash rate go down, and everyone is happy. However, users on the previous version that crashed, who were too frustrated to immediately relaunch your app after it crashed, have decided to give it another go a day or two later. They launch the app, which sends a crash report from the device. If your observability tool marks those crashes as having just occurred, you might think the issue is still ongoing, even though you released a fix for it. Ecosystem Limitations When you build a backend application, you get to choose the environment that it runs in. The limitations on what you monitor and how you monitor it are largely dictated by the overhead it introduces and the time it takes to implement it. On mobile, you are operating in ecosystems defined by the device manufacturers or maintainers of the ecosystem, and there are restrictions that you need to find creative solutions to in order to get the data that you need. Certain metrics are forced upon you, such as the crash and Application Not Responding (ANR) rates that, in Android, impact your ranking and discoverability on the Google Play Store. The tricky part here is that the ecosystems have the ability to collect data from a system perspective, while you only have the ability to collect data from the perspective of your application. That means you have to get pretty inventive to find ways to collect the data that helps you solve certain problems, such as ANRs on Android. To provide a bit more color here, ANRs occur when an Android app has a prolonged app freeze that causes the OS to display a prompt that asks the user if they want to terminate your app. Effectively, the app freezes for so long that the user is forced to crash their current app session. From a data collection perspective, the Google Play Console treats ANRs exactly like a crash, capturing a single stack trace at the end of the session. However, app freezes are not deterministic and can stem from endless combinations of initial conditions that led to the main thread being blocked, including: Third-party SDKs (like ad vendors) conflicting with each other Loading heavy resources like large images or unoptimized ads Data synchronizations hitting slow backend service calls Heavy animations or UI work Slow responses to system broadcasts With so many variables at play, your best bet is to capture data as soon as the app freezes and then examine these code patterns across your users to find the most common causes. Backend observability solutions are simply not built for these types of nuanced mobile data capture. Deploying New Code If you discover an issue in your backend application, code can be consistently redeployed with all instances running new code. That means, if you spot an issue that’s preventing the system from completing purchases, the biggest delay is in tracking down the root cause and writing the code to fix it. In mobile, you can’t control when people upgrade their app version. There will be a long tail of old versions out in the wild. It is not unusual for a large, established application to have over a hundred different versions used in a single day. As such, it’s vital that you minimize the number of users who download bad app versions. Slow rollouts and real-time visibility into user experiences can help you proactively address issues before they become widespread. Your mobile observability solution should surface signals that allow for early issue detection for every type of broken experience, including: Performance issues like slow startups or purchase flows Stability issues like crash, error, or ANR spikes User frustration issues like abandons and force quits Device resource issues like excessive memory, CPU, and battery consumption Network issues like failing first and third-party endpoints Mobile is so complex that engineering teams frequently must add logs and release new versions to build enough context to uncover root causes. This approach is riddled with guesswork, resulting in additional releases – some of them which will introduce, rather than solve, problems – out in the wild. You want your mobile observability solution to provide complete visibility so that your engineering team can get to solutions faster and without sacrificing feature velocity. Closing Thoughts At first glance, the challenges of achieving observability in a mobile application may not seem all that different than doing so for a backend application – collect some data, store it in a database, visualize it in a dashboard – but, on closer inspection, the nuances of each of those steps is quite different for the two domains. Trying to use the same tool for monitoring your mobile application as you do for your backend application is better than having no visibility, but it will ultimately deprive you of the full clarity of what is happening in your mobile app. Your developers will take longer to figure out how to solve problems that exist if they can even detect that they are occurring. If you rely on a backend observability approach for your mobile applications, there are mobile-first approaches that can eliminate toil and guesswork while integrating across your existing tech stack for full stack visibility. In addition, given the different challenges in collecting data for mobile apps versus backend systems, open-source communities, and governing groups are actively working on what mobile telemetry standards should be in order to power the future of mobile observability. More

Debugging Using JMX Revisited

By Shai Almog

CORE

Debugging effectively requires a nuanced approach, similar to using tongs that tightly grip the problem from both sides. While low-level tools have their place in system-level service debugging, today's focus shifts towards a more sophisticated segment of the development stack: advanced management tools. Understanding these tools is crucial for developers, as it bridges the gap between code creation and operational deployment, enhancing both efficiency and effectiveness in managing applications across extensive infrastructures. The Need for Advanced Management Tools in Development Development and DevOps teams utilize an array of tools, often perceived as complex or alien by developers. These tools, designed for scalability, enable the management of thousands of servers simultaneously. Such capabilities, although not always necessary for smaller scales, offer significant advantages in application management. Advanced management tools facilitate the navigation and control over multiple machines, making them indispensable for developers seeking to optimize application performance and reliability. Introduction to JMX (Java Management Extensions) One of the pivotal standards in application management is Java Management Extensions (JMX), which Java introduced to simplify the interaction with and management of applications. JMX allows both applications and the Java Development Kit (JDK) itself to expose critical information and functionalities, enabling external tools to manipulate these elements dynamically. Although activating JMX falls outside this discussion, its significance cannot be overstated, with ample resources available for those interested in its implementation. Setting up JMX JMX isn't enabled by default, to enable it we need the following steps: 1. Modify the JVM Startup Parameters To enable JMX on a Java application, you must adjust the Java Virtual Machine (JVM) startup parameters. This involves adding specific flags to your application's startup command. The essential flags for enabling JMX are: -Dcom.sun.management.jmxremote: This flag activates the JMX remote management and monitoring. -Dcom.sun.management.jmxremote.port=<PORT>: Replace <PORT> with a specific port number where the JMX remote connection will listen. -Dcom.sun.management.jmxremote.ssl=false: This flag disables SSL for JMX connections. For development environments, SSL might be disabled for simplicity, but for production environments, consider enabling SSL for security. -Dcom.sun.management.jmxremote.authenticate=false: This flag disables authentication. Similar to SSL, authentication may be disabled in development but should be enabled in production to ensure secure access. 2. Restart Your Application With the JVM parameters set, restart your application. This will apply the new startup parameters, activating JMX. 3. Verify JMX Connectivity After restarting your application, you can verify that JMX is enabled by connecting to it using a JMX client such as JConsole, VisualVM, or a custom management application. Use the port number specified in the startup parameters to establish the connection. JMX Security Considerations While enabling JMX provides powerful management capabilities, it's crucial to consider security implications, especially when JMX is exposed over a network. When deploying applications in production, always enable SSL and authentication to protect against unauthorized access. Additionally, consider firewall rules and network policies to restrict JMX access to trusted clients. Understanding MBeans Central to JMX are Management Beans (MBeans), which serve as the control points within an application. These beans enable developers to publish specific functionalities for runtime monitoring and configuration. The ability to export application metrics to dashboards through MBeans is particularly valuable, facilitating real-time decision-making based on accurate, up-to-date information. Furthermore, operations such as user management can be exposed through MBeans, enhancing administrative capabilities. Spring and Management Beans Spring Framework's Actuator module exemplifies the integration of management capabilities within development, offering extensive metrics and operational details. This integration propels applications to "production-ready" status, allowing developers to monitor and manage applications with unprecedented depth and efficiency. Tooling for JMX Management While JMX can be accessed through various web interfaces and administrative tools, command-line tooling offers a direct, efficient method for interacting with JMX-enabled applications on production servers. Tools like JMXTerm complement visual tools by providing a streamlined interface for rapid insights, especially in environments unfamiliar to the developer. Getting Started With JMXTerm JMXTerm is a powerful utility for managing JMX without the need for graphical visualization, ideal for quick diagnostics or high-level server insights. After enabling JMX on the JVM and setting up the necessary configurations, developers can connect to servers, explore different JMX domains, and manipulate MBeans directly from the command line. We can accomplish all of the following via visual tools and sometimes using a web interface. Normally, that's the approach I use. However, as a learning tool I think JMXTerm is fantastic since it exposes things in a way that's consistent and verbose. If we can understand JMXTerm the GUI version would be a walk in the park. . . We can launch JMXTerm using the command line, in my case I used the following command: java -jar ~/Downloads/jmxterm-1.0.2-uber.jar --url localhost:30002 Once the connection is made we can issue commands to JMX and retrieve information about the JVM or the application; e.g., I can list the domains which you can think of as similar to "packages" or "modules" a way to organize the various beans: $>domains #following domains are available JMImplementation com.sun.management java.lang java.nio java.util.logging javax.cache jdk.management.jfr I can select a specific domain and thus perform future operations within said domain: $>domain java.util.logging #domain is set to java.util.logging Once inside the domain, I can select a specific bean and perform operations on it. For this I need to first list the beans in the domain; in this case there's only the logging bean. I can then select that bean using the bean command: $>beans #domain = java.util.logging: java.util.logging:type=Logging $>bean java.util.logging:type=Logging #bean is set to java.util.logging:type=Logging I can perform many operations on beans. Perhaps the most useful is the info command which lets me query a bean. Notice that a bean can have attributes, think of them like object fields and operations which you can think of as methods. There are also notifications which you can think of as events: $>info #mbean = java.util.logging:type=Logging #class name = sun.management.ManagementFactoryHelper$PlatformLoggingImpl # attributes %0 - LoggerNames ([Ljava.lang.String;, r) %1 - ObjectName (javax.management.ObjectName, r) # operations %0 - java.lang.String getLoggerLevel(java.lang.String p0) %1 - java.lang.String getParentLoggerName(java.lang.String p0) %2 - void setLoggerLevel(java.lang.String p0,java.lang.String p1) #there's no notifications I can run operations and pass various commands; e.g., I can get the logger level, set it and then check that the logger level was indeed updated: $>run getLoggerLevel "org.apache.tomcat.websocket.WsWebSocketContainer" #calling operation getLoggerLevel of mbean java.util.logging:type=Logging with params [org.apache.tomcat.websocket.WsWebSocketContainer] #operation returns: $>run setLoggerLevel org.apache.tomcat.websocket.WsWebSocketContainer INFO #calling operation setLoggerLevel of mbean java.util.logging:type=Logging with params [org.apache.tomcat.websocket.WsWebSocketContainer, INFO] #operation returns: null $>run getLoggerLevel "org.apache.tomcat.websocket.WsWebSocketContainer" #calling operation getLoggerLevel of mbean java.util.logging:type=Logging with params [org.apache.tomcat.websocket.WsWebSocketContainer] #operation returns: INFO This is just the tip of the iceberg. We can get many things such as Spring settings, internal VM information, etc. In this example I can query VM information directly from the console: $>domain com.sun.management #domain is set to com.sun.management $>beans #domain = com.sun.management: com.sun.management:type=DiagnosticCommand com.sun.management:type=HotSpotDiagnostic $>bean com.sun.management:type=HotSpotDiagnostic #bean is set to com.sun.management:type=HotSpotDiagnostic $>info #mbean = com.sun.management:type=HotSpotDiagnostic #class name = com.sun.management.internal.HotSpotDiagnostic # attributes %0 - DiagnosticOptions ([Ljavax.management.openmbean.CompositeData;, r) %1 - ObjectName (javax.management.ObjectName, r) # operations %0 - void dumpHeap(java.lang.String p0,boolean p1) %1 - javax.management.openmbean.CompositeData getVMOption(java.lang.String p0) %2 - void setVMOption(java.lang.String p0,java.lang.String p1) #there's no notifications Leveraging JMX in Debugging and Management JMX stands out as a robust tool for wiring management consoles, allowing developers to expose critical settings and metrics for their projects. Beyond its conventional use, JMX can be leveraged as part of the debugging process, serving as a pseudo-interface for triggering debugging scenarios or observing debugging sessions within the management UI. This approach not only simplifies the management of server applications but also enhances the developer's ability to diagnose and resolve issues efficiently. Exposing MBeans in Spring Boot Up until now we discussed the process of working with beans that are a part of the JVM or Spring. But what about our own application logic? We can expose our own applications internal state so we (and our SREs) can review these in production and staging. Instead of building a custom control panel or logging everything, we can just expose the data. If a flag is problematic we can change it in production, if you want to query a specific state it too can be exposed. Spring Boot simplifies the management and monitoring of applications through its comprehensive support for JMX. By leveraging Spring's infrastructure, we can easily expose their application's beans as JMX Managed Beans (MBeans), making them accessible for monitoring and management via JMX clients. Understanding Spring Boot JMX Support Spring Boot automatically configures JMX for you and exposes any beans annotated with @ManagedResource as JMX MBeans. This feature, combined with Spring Boot’s Actuator, provides a rich set of management endpoints, covering various aspects of the application, from metrics to thread dumps. Expose an MBean in Spring Boot To expose a bean we need to take the following steps: 1. Define a Management Interface Create an interface that defines the operations and attributes you wish to expose via JMX. This interface should be annotated with JMX annotations such as @ManagedOperation for methods and @ManagedAttribute for fields or getter/setter methods. 2. Implement the MBean Implement the interface in a class that performs the actual logic for the operations and attributes defined. This class represents your MBean and can be a regular Spring-managed bean. 3. Annotate the Bean With @ManagedResource Annotate your MBean implementation class with @ManagedResource to indicate that it should be exposed as an MBean. You can specify the object name for the MBean in this annotation, which is how it will be identified in JMX clients. 4. Enable JMX in Spring Boot Ensure that JMX is enabled in your Spring Boot application. This is usually the default behavior, but you can explicitly enable it by setting spring.jmx.enabled=true in your application.properties or application.yml file. 5. Access the MBean via a JMX Client Once your application is running, you can access the exposed MBean through any standard JMX client, such as JConsole, VisualVM, or a custom client. Connect to the Spring Boot application's JMX domain, and you'll find the MBean you exposed, ready for interaction. Example: Exposing a Simple Configuration MBean // Define a management interface public interface ConfigurationMBean { @ManagedAttribute String getApplicationName(); @ManagedOperation void updateApplicationName(String name); } // Implement the MBean @Component @ManagedResource(objectName = "com.example:type=Configuration") public class Configuration implements ConfigurationMBean { private String applicationName = "MyApp"; @Override public String getApplicationName() { return applicationName; } @Override public void updateApplicationName(String name) { this.applicationName = name; } } In this example, the Configuration class is annotated with @ManagedResource, making it available as an MBean with operations and attributes accessible via JMX clients. Exposing MBeans in Spring Boot is a powerful feature that enhances the management and monitoring capabilities of applications. By following the steps outlined above, developers can provide external tools with dynamic access to application internals, offering a window into the runtime behavior and allowing for adjustments on the fly. This not only aids in debugging and performance tuning but also aligns with best practices for building manageable, robust applications. Video Final Word Advanced management tools, particularly JMX and its integration with frameworks like Spring, offer developers powerful capabilities for application monitoring, configuration, and debugging. By understanding and utilizing these tools, developers can achieve a deeper level of control over their applications, enhancing both performance and reliability. Whether through graphical interfaces or command-line utilities like JMXTerm, the dynamic manipulation and monitoring of applications in runtime environments open new avenues for effective software development and management. As the bridge between development and operations continues to narrow, mastering these advanced tools becomes essential for any developer looking to excel in today's fast-paced technological landscape. More

Telemetry Pipelines Workshop: Introduction To Fluent Bit

By Eric D. Schabell

CORE

The State of Observability 2024: Navigating Complexity With AI-Driven Insights

By Tom Smith

CORE

Monitoring and Observability in DevOps: How Top Digital Products Are Leveraging It

By Pritesh Patel

Harnessing the Power of Observability in Kubernetes With OpenTelemetry

In today's dynamic and complex cloud environments, observability has become a cornerstone for maintaining the reliability, performance, and security of applications. Kubernetes, the de facto standard for container orchestration, hosts a plethora of applications, making the need for an efficient and scalable observability framework paramount. This article delves into how OpenTelemetry, an open-source observability framework, can be seamlessly integrated into a Kubernetes (K8s) cluster managed by KIND (Kubernetes IN Docker), and how tools like Loki, Tempo, and the kube-prometheus-stack can enhance your observability strategy. We'll explore this setup through the lens of a practical example, utilizing custom values from a specific GitHub repository. The Observability Landscape in Kubernetes Before diving into the integration, let's understand the components at play: KIND offers a straightforward way to run K8s clusters within Docker containers, ideal for development and testing. Loki is a horizontally scalable, highly available, multi-tenant log aggregation system inspired by Prometheus. Tempo is a high-volume, minimal-dependency trace aggregator, providing a robust way to store and query distributed traces. kube-prometheus-stack bundles Prometheus together with Grafana and other tools to provide a comprehensive monitoring solution out-of-the-box. OpenTelemetry Operator simplifies the deployment and management of OpenTelemetry collectors in K8s environments. Promtail is responsible for gathering logs and sending them to Loki. Integrating these components within a K8s cluster orchestrated by KIND not only streamlines the observability but also leverages the strengths of each tool, creating a cohesive and powerful monitoring solution. Setting up Your Kubernetes Cluster With KIND Firstly, ensure you have KIND installed on your machine. If not, you can easily install it using the following command: Shell curl -Lo ./kind https://kind.sigs.k8s.io/dl/v0.11.1/kind-$(uname)-amd64 chmod +x ./kind mv ./kind /usr/local/bin/kind Once KIND is installed, you can create a cluster by running: Shell kind create cluster --config kind-config.yaml kubectl create ns observability kubectl config set-context --current --namespace observability kind-config.yaml should be tailored to your specific requirements. It's important to ensure your cluster has the necessary resources (CPU, memory) to support the observability tools you plan to deploy. Deploying Observability Tools With HELM HELM, the package manager for Kubernetes, simplifies the deployment of applications. Here's how you can install Loki, Tempo, and the kube-prometheus-stack using HELM: Add the necessary HELM repositories: helm repo add grafana https://grafana.github.io/helm-charts helm repo add prometheus-community https://prometheus-community.github.io/helm-charts helm repo update Install Loki, Tempo, and kube-prometheus-stack: For each tool, we'll use a custom values file available in the provided GitHub repository. This ensures a tailored setup aligned with specific monitoring and tracing needs. Loki: helm upgrade --install loki grafana/loki --values https://raw.githubusercontent.com/brainupgrade-in/kubernetes/main/observability/opentelemetry/01-loki-values.yaml Tempo: helm install tempo grafana/tempo --values https://raw.githubusercontent.com/brainupgrade-in/kubernetes/main/observability/opentelemetry/02-tempo-values.yaml kube-prometheus-stack: helm install kube-prometheus-stack prometheus-community/kube-prometheus-stack --values https://raw.githubusercontent.com/brainupgrade-in/kubernetes/main/observability/opentelemetry/03-grafana-helm-values.yaml Install OpenTelemetry Operator and Promtail: The OpenTelemetry Operator and Promtail can also be installed via HELM, further streamlining the setup process. OpenTelemetry Operator: helm install opentelemetry-operator open-telemetry/opentelemetry-operator Promtail: helm install promtail grafana/promtail --set "loki.serviceName=loki.observability.svc.cluster.local" Configuring OpenTelemetry for Optimal Observability Once the OpenTelemetry Operator is installed, you'll need to configure it to collect metrics, logs, and traces from your applications. OpenTelemetry provides a unified way to send observability data to various backends like Loki for logs, Prometheus for metrics, and Tempo for traces. A sample OpenTelemetry Collector configuration might look like this: YAML apiVersion: opentelemetry.io/v1alpha1 kind: OpenTelemetryCollector metadata: name: otel namespace: observability spec: config: | receivers: filelog: include: ["/var/log/containers/*.log"] otlp: protocols: grpc: endpoint: 0.0.0.0:4317 http: endpoint: 0.0.0.0:4318 processors: memory_limiter: check_interval: 1s limit_percentage: 75 spike_limit_percentage: 15 batch: send_batch_size: 1000 timeout: 10s exporters: # NOTE: Prior to v0.86.0 use `logging` instead of `debug`. debug: prometheusremotewrite: endpoint: "http://prometheus-kube-prometheus-prometheus.observability:9090/api/v1/write" loki: endpoint: "http://loki.observability:3100/loki/api/v1/push" otlp: endpoint: http://tempo.observability.svc.cluster.local:4317 retry_on_failure: enabled: true tls: insecure: true service: pipelines: traces: receivers: [otlp] processors: [memory_limiter, batch] exporters: [debug,otlp] metrics: receivers: [otlp] processors: [memory_limiter, batch] exporters: [debug,prometheusremotewrite] logs: receivers: [otlp] processors: [memory_limiter, batch] exporters: [debug,loki] mode: daemonset This configuration sets up the collector to receive data via the OTLP protocol, process it in batches, and export it to the appropriate backends. To enable auto-instrumentation for java apps, you can define the following. YAML apiVersion: opentelemetry.io/v1alpha1 kind: Instrumentation metadata: name: java-instrumentation namespace: observability spec: exporter: endpoint: http://otel-collector.observability:4317 propagators: - tracecontext - baggage sampler: type: always_on argument: "1" java: env: - name: OTEL_EXPORTER_OTLP_ENDPOINT value: http://otel-collector.observability:4317 Leveraging Observability Data for Insights With the observability tools in place, you can now leverage the collected data to gain actionable insights into your application's performance, reliability, and security. Grafana can be used to visualize metrics and logs, while Tempo allows you to trace distributed transactions across microservices. Visualizing Data With Grafana Grafana offers a powerful platform for creating dashboards that visualize the metrics and logs collected by Prometheus and Loki, respectively. You can create custom dashboards or import existing ones tailored to Kubernetes monitoring. Tracing With Tempo Tempo, integrated with OpenTelemetry, provides a detailed view of traces across microservices, helping you pinpoint the root cause of issues and optimize performance. Illustrating Observability With a Weather Application Example To bring the concepts of observability to life, let's walk through a practical example using a simple weather application deployed in our Kubernetes cluster. This application, structured around microservices, showcases how OpenTelemetry can be utilized to gather crucial metrics, logs, and traces. The configuration for this demonstration is based on a sample Kubernetes deployment found here. Deploying the Weather Application Our weather application is a microservice that fetches weather data. It's a perfect candidate to illustrate how OpenTelemetry captures and forwards telemetry data to our observability stack. Here's a partial snippet of the deployment configuration. Full YAML is found here. YAML apiVersion: apps/v1 kind: Deployment metadata: labels: app: weather tier: front name: weather-front spec: replicas: 1 selector: matchLabels: app: weather tier: front template: metadata: labels: app: weather tier: front app.kubernetes.io/name: weather-front annotations: prometheus.io/scrape: "true" prometheus.io/port: "8888" prometheus.io/path: /actuator/prometheus instrumentation.opentelemetry.io/inject-java: "true" # sidecar.opentelemetry.io/inject: 'true' instrumentation.opentelemetry.io/container-names: "weather-front" spec: containers: - image: brainupgrade/weather:metrics imagePullPolicy: Always name: weather-front resources: limits: cpu: 1000m memory: 2048Mi requests: cpu: 100m memory: 1500Mi env: - name: APP_NAME valueFrom: fieldRef: apiVersion: v1 fieldPath: metadata.labels['app.kubernetes.io/name'] - name: NAMESPACE valueFrom: fieldRef: apiVersion: v1 fieldPath: metadata.namespace - name: OTEL_SERVICE_NAME value: $(NAMESPACE)-$(APP_NAME) - name: spring.application.name value: $(NAMESPACE)-$(APP_NAME) - name: spring.datasource.url valueFrom: configMapKeyRef: name: app-config key: spring.datasource.url - name: spring.datasource.username valueFrom: secretKeyRef: name: app-secret key: spring.datasource.username - name: spring.datasource.password valueFrom: secretKeyRef: name: app-secret key: spring.datasource.password - name: weatherServiceURL valueFrom: configMapKeyRef: name: app-config key: weatherServiceURL - name: management.endpoints.web.exposure.include value: "*" - name: management.server.port value: "8888" - name: management.metrics.web.server.request.autotime.enabled value: "true" - name: management.metrics.tags.application value: $(NAMESPACE)-$(APP_NAME) - name: otel.instrumentation.log4j.capture-logs value: "true" - name: otel.logs.exporter value: "otlp" ports: - containerPort: 8080 This deployment configures the weather service with OpenTelemetry's OTLP (OpenTelemetry Protocol) exporter, directing telemetry data to our OpenTelemetry Collector. It also labels the service for clear identification within our telemetry data. Visualizing Observability Data Once deployed, the weather service starts sending metrics, logs, and traces to our observability tools. Here's how you can leverage this data. Trace the request across services using Tempo datasource Metrics: Prometheus, part of the kube-prometheus-stack, collects metrics on the number of requests, response times, and error rates. These metrics can be visualized in Grafana to monitor the health and performance of the weather service. For example, grafana dashboard (ID 17175) can be used to view Observability for Spring boot apps Logs: Logs generated by the weather service are collected by Promtail and stored in Loki. Grafana can query these logs, allowing you to search and visualize operational data. This is invaluable for debugging issues, such as understanding the cause of an unexpected spike in error rates. Traces: Traces captured by OpenTelemetry and stored in Tempo provide insight into the request flow through the weather service. This is crucial for identifying bottlenecks or failures in the service's operations. Gaining Insights With the weather application up and running, and observability data flowing, we can start to gain actionable insights: Performance optimization: By analyzing response times and error rates, we can identify slow endpoints or errors in the weather service, directing our optimization efforts more effectively. Troubleshooting: Logs and traces help us troubleshoot issues by providing context around errors or unexpected behavior, reducing the time to resolution. Scalability decisions: Metrics on request volumes and resource utilization guide decisions on when to scale the service to handle load more efficiently. This weather service example underscores the power of OpenTelemetry in a Kubernetes environment, offering a window into the operational aspects of applications. By integrating observability into the development and deployment pipeline, teams can ensure their applications are performant, reliable, and scalable. This practical example of a weather application illustrates the tangible benefits of implementing a comprehensive observability strategy with OpenTelemetry. It showcases how seamlessly metrics, logs, and traces can be collected, analyzed, and visualized, providing developers and operators with the insights needed to maintain and improve complex cloud-native applications. Conclusion Integrating OpenTelemetry with Kubernetes using tools like Loki, Tempo, and the kube-prometheus-stack offers a robust solution for observability. This setup not only simplifies the deployment and management of these tools but also provides a comprehensive view of your application's health, performance, and security. With the actionable insights gained from this observability stack, teams can proactively address issues, improve system reliability, and enhance the user experience. Remember, the key to successful observability lies in the strategic implementation and continuous refinement of your monitoring setup. Happy observability!

By Rajesh Gheware

Cloud Application Monitoring: Top 5 Metrics to Ensure Optimal Performance

Key Highlights Monitoring the health of cloud applications is crucial for ensuring optimal performance and user experience. Response time, error rate, traffic, resource utilization, and user satisfaction are the top metrics to monitor for cloud application health. These metrics provide insights into the performance, efficiency, and user experience of cloud applications. Cloud monitoring tools and techniques, such as real-time monitoring tools, log analysis, and AI-based predictive monitoring, can help in effective cloud application monitoring. Best practices for cloud application health monitoring include establishing KPIs, regularly reviewing and adjusting thresholds, fostering a culture of continuous improvement, and leveraging community knowledge and resources. Introduction to Cloud Application Monitoring Cloud applications have become an integral part of modern business operations. With the rapid adoption of cloud computing, organizations are leveraging cloud services to build and deploy scalable and flexible applications. However, ensuring the health and performance of these cloud applications is essential for delivering a seamless user experience and achieving business objectives. Monitoring the health of cloud applications involves tracking various performance metrics to identify any issues and take proactive measures to maintain optimal performance. Cloud application monitoring involves monitoring response time, error rate, traffic, and resource utilization. These metrics provide insights into the performance, efficiency, and user experience of cloud applications. In this blog, we will explore the top 5 metrics to monitor for cloud application health and discuss the importance of each metric in ensuring the optimal performance of cloud applications. We will also dive deeper into the understanding of cloud application metrics, the tools and techniques for effective cloud application monitoring, and the best practices for monitoring the health of cloud applications. By monitoring these metrics and following best practices, your organization can proactively detect and resolve issues, optimize resource utilization, and continuously improve the performance and user experience of your cloud applications. Understanding the Importance of Monitoring Cloud Applications Health Cloud application monitoring involves proactively tracking various key metrics to identify and address potential issues before they significantly impact user experience or business operations. Here's a deeper dive into why proactive monitoring is crucial: What Is the Significance of Proactive Monitoring? Reactive approaches, where you wait for problems to manifest before taking action, are risky. By the time issues become apparent, they might have already caused downtime, data loss, or frustrated users. Proactive cloud application monitoring allows you to: Identify performance bottlenecks: Before issues snowball, proactive monitoring helps pinpoint areas where your application is sluggish or inefficient. This enables you to optimize resources and improve overall performance. Prevent downtime: By identifying potential problems early on, you can take corrective actions to prevent outages entirely. This ensures uninterrupted service delivery and a positive user experience. Enhance scalability: Monitoring resource utilization helps you understand your application's scaling needs. By proactively scaling resources up or down, you can cater to fluctuating traffic demands without compromising performance. Reduce costs: Proactive monitoring helps prevent costly downtime and resource wastage. By optimizing resource allocation and identifying areas for cost savings, you can ensure a more cost-effective cloud environment. The Impact of Cloud Observability on Our Overall Performance The health of your cloud applications directly impacts your overall business performance. Here's how: User experience: Slow loading times, frequent errors, or unexpected crashes can significantly impact user experience. Proactive monitoring ensures smooth application functioning, leading to satisfied and engaged users. Employee productivity: When applications are slow or unavailable, employee productivity suffers. Monitoring helps maintain application health, allowing employees to focus on their tasks without disruptions. Brand reputation: Downtime or performance issues can damage your brand reputation. Proactive monitoring helps maintain application availability and performance, fostering trust and confidence in your brand. Revenue generation: Application downtime translates to lost revenue opportunities. Proactive monitoring safeguards against downtime and ensures your applications are always up and running, ready to serve customers. By effectively monitoring your cloud applications, you gain valuable insights and control, allowing you to optimize performance, ensure business continuity, and achieve your overall business goals. Diving into the Top 5 Metrics for Cloud Application Health Now that we understand the importance of monitoring cloud applications, let's explore the top five critical metrics you should track: 1. Response Time Response time is a critical metric that directly impacts user experience and satisfaction. It measures the duration between a user request and the corresponding response from the application. By monitoring response time, your organization can identify performance bottlenecks, such as network latency, inefficient code execution, or resource constraints. Best practices: Aim for sub-second response times for optimal user experience. Consider implementing caching mechanisms and optimizing backend processes to reduce response times. Impact on performance: Slow response times can lead to frustrated users who may abandon tasks or switch to a competitor. Dashboard interpretation: Track response times over time and identify any sudden spikes or increases. Investigate the cause of slowdowns and take corrective actions. 2. Error Rate Error rates quantify the frequency of errors encountered during application operation, such as HTTP errors, database query failures, or application-specific errors. A healthy application should have a minimal error rate. High error rates can indicate software bugs, compatibility issues, or infrastructure problems that undermine application reliability and functionality. Best practices: Strive for a low error rate, ideally below 1%. Implement robust error-handling mechanisms and conduct regular code reviews to minimize errors. Impact on performance: High error rates can hinder application functionality and prevent users from completing tasks. They can also damage user trust and confidence. Dashboard interpretation: Monitor the types of errors occurring and their frequency. Analyze error logs to identify the root cause and implement bug fixes. Image source: ServerGuy 3. Requests Per Minute (RPM) RPM measures the rate at which the application handles incoming requests. Monitoring RPM metrics allows you to gauge application scalability, identify peak usage periods, and allocate resources accordingly. By scaling infrastructure in response to changes in request volume, you can maintain optimal performance and ensure a seamless user experience during periods of high demand. Best practices: Analyze historical data to predict peak traffic periods and proactively scale resources to handle increased load. Impact on performance: A sudden surge in RPM can overwhelm the application, leading to slowdowns or crashes. Conversely, low RPM might indicate underutilization of resources. Dashboard interpretation: Track RPM alongside response times. Identify any correlations between high RPM and increased response times. This can indicate potential bottlenecks that need optimization. 4. CPU Utilization CPU utilization refers to the percentage of processing power your application is using. Monitoring CPU utilization helps ensure efficient resource allocation and prevents performance bottlenecks. Best practices: Aim for a CPU utilization rate between 30% and 70%. This leaves headroom for handling traffic spikes while avoiding resource waste. Utilize auto-scaling features offered by cloud providers to scale CPU resources dynamically based on demand. Impact on performance: High CPU utilization can lead to sluggish application performance and timeouts. Conversely, very low utilization indicates underutilized resources and potential cost inefficiencies. Dashboard interpretation: Monitor CPU utilization alongside other metrics like response time and RPM. Identify instances where high CPU usage coincides with performance degradation. This might indicate inefficient application processes that require optimization. 5. Memory Utilization Memory utilization refers to the percentage of available memory your application is using. Monitoring memory usage helps prevent memory leaks and ensures efficient application execution. Best practices: Aim for a memory utilization rate between 20% and 80%. This provides sufficient memory for smooth operation while avoiding overallocation. Consider code optimization techniques and memory leak detection tools to prevent memory-related issues. Impact on performance: Memory leaks or insufficient memory can lead to application crashes, slowdowns, and unexpected errors. Dashboard interpretation: Track memory utilization alongside CPU usage. Identify situations where both reach high levels simultaneously. This might indicate an application memory leak that requires investigation and patching. Using Dashboards for Effective Monitoring and Visibility Cloud monitoring tools provide dashboards that visually represent these key metrics. By creating custom dashboards, you can tailor the information to your specific needs and gain actionable insights. Here are some tips for using dashboards effectively: Combine metrics: Don't view metrics in isolation. Combine related metrics like response time and RPM on the same dashboard to identify correlations and pinpoint bottlenecks. Set thresholds: Configure alerts for critical metrics that exceed predefined thresholds. This allows for proactive intervention before issues escalate. Track trends: Monitor metrics over time to identify trends and predict potential problems. Look for sudden spikes or dips that might indicate underlying issues. Correlate events: Investigate incidents by correlating application logs with changes in metrics. This helps identify the root cause of performance issues. Conclusion By following these best practices and leveraging the power of cloud application monitoring tools, you can gain a comprehensive understanding of your application's health. Effective cloud application monitoring is essential for organizations seeking to optimize performance, reliability, and security in the cloud. By prioritizing key metrics such as response time, availability, CPU utilization, memory utilization, and requests per minute, your team can proactively identify and address issues, optimize resources, and enhance user experience. With comprehensive monitoring practices in place, you can unlock the full potential of cloud computing and drive business success for your company.

By Marija Naumovska

CORE

Performance Optimization in Agile IoT Cloud Applications: Leveraging Grafana and Similar Tools

In today's era of Agile development and the Internet of Things (IoT), optimizing performance for applications running on cloud platforms is not just a nice-to-have; it's a necessity. Agile IoT projects are characterized by rapid development cycles and frequent updates, making robust performance optimization strategies essential for ensuring efficiency and effectiveness. This article will delve into the techniques and tools for performance optimization in Agile IoT cloud applications, with a special focus on Grafana and similar platforms. Need for Performance Optimization in Agile IoT Agile IoT cloud applications often handle large volumes of data and require real-time processing. Performance issues in such applications can lead to delayed responses, a poor user experience, and ultimately, a failure to meet business objectives. Therefore, continuous monitoring and optimization are vital components of the development lifecycle. Techniques for Performance Optimization 1. Efficient Code Practices Writing clean and efficient code is fundamental to optimizing performance. Techniques like code refactoring and optimization play a significant role in enhancing application performance. For example, identifying and removing redundant code, optimizing database queries, and reducing unnecessary loops can lead to significant improvements in performance. 2. Load Balancing and Scalability Implementing load balancing and ensuring that the application can scale effectively during high-demand periods is key to maintaining optimal performance. Load balancing distributes incoming traffic across multiple servers, preventing any single server from becoming a bottleneck. This approach ensures that the application remains responsive even during traffic spikes. 3. Caching Strategies Effective caching is essential for IoT applications dealing with frequent data retrieval. Caching involves storing frequently accessed data in memory, reducing the load on the backend systems, and speeding up response times. Implementing caching mechanisms, such as in-memory caches or content delivery networks (CDNs), can greatly improve the overall performance of IoT applications. Tools for Monitoring and Optimization In the realm of performance optimization for Agile IoT cloud applications, having the right tools at your disposal is paramount. These tools serve as the eyes and ears of your development and operations teams, providing invaluable insights and real-time data to keep your applications running smoothly. One such cornerstone tool in this journey is Grafana, an open-source platform that empowers you with real-time dashboards and alerting capabilities. But Grafana doesn't stand alone; it collaborates seamlessly with other tools like Prometheus, New Relic, and AWS CloudWatch to offer a comprehensive toolkit for monitoring and optimizing the performance of your IoT applications. Let's explore these tools in detail and understand how they can elevate your Agile IoT development game. Grafana Grafana stands out as a primary tool for performance monitoring. It's an open-source platform for time-series analytics that provides real-time visualizations of operational data. Grafana's dashboards are highly customizable, allowing teams to monitor key performance indicators (KPIs) specific to their IoT applications. Here are some of its key features: Real-time dashboards: Grafana's real-time dashboards empower development and operations teams to track essential metrics in real-time. This includes monitoring CPU usage, memory consumption, network bandwidth, and other critical performance indicators. The ability to view these metrics in real-time is invaluable for identifying and addressing performance bottlenecks as they occur. This proactive approach to monitoring ensures that issues are dealt with promptly, reducing the risk of service disruptions and poor user experiences. Alerts: One of Grafana's standout features is its alerting system. Users can configure alerts based on specific performance metrics and thresholds. When these metrics cross predefined thresholds or exhibit anomalies, Grafana sends notifications to the designated parties. This proactive alerting mechanism ensures that potential issues are brought to the team's attention immediately, allowing for rapid response and mitigation. Whether it's a sudden spike in resource utilization or a deviation from expected behavior, Grafana's alerts keep the team informed and ready to take action. Integration: Grafana's strength lies in its ability to seamlessly integrate with a wide range of data sources. This includes popular tools and databases such as Prometheus, InfluxDB, AWS CloudWatch, and many others. This integration capability makes Grafana a versatile tool for monitoring various aspects of IoT applications. By connecting to these data sources, Grafana can pull in data, perform real-time analysis, and present the information in customizable dashboards. This flexibility allows development teams to tailor their monitoring to the specific needs of their IoT applications, ensuring that they can capture and visualize the most relevant data for performance optimization. Complementary Tools Prometheus: Prometheus is a powerful monitoring tool often used in conjunction with Grafana. It specializes in recording real-time metrics in a time-series database, which is essential for analyzing the performance of IoT applications over time. Prometheus collects data from various sources and allows you to query and visualize this data using Grafana, providing a comprehensive view of application performance. New Relic: New Relic provides in-depth application performance insights, offering real-time analytics and detailed performance data. It's particularly useful for detecting and diagnosing complex application performance issues. New Relic's extensive monitoring capabilities can help IoT development teams identify and address performance bottlenecks quickly. AWS CloudWatch: For applications hosted on AWS, CloudWatch offers native integration, providing insights into application performance and operational health. CloudWatch provides a range of monitoring and alerting capabilities, making it a valuable tool for ensuring the reliability and performance of IoT applications deployed on the AWS platform. Implementing Performance Optimization in Agile IoT Projects To successfully optimize performance in Agile IoT projects, consider the following best practices: Integrate Tools Early Incorporate tools like Grafana during the early stages of development to continuously monitor and optimize performance. Early integration ensures that performance considerations are ingrained in the project's DNA, making it easier to identify and address issues as they arise. Adopt a Proactive Approach Use real-time data and alerts to proactively address performance issues before they escalate. By setting up alerts for critical performance metrics, you can respond swiftly to anomalies and prevent them from negatively impacting user experiences. Iterative Optimization In line with Agile methodologies, performance optimization should be iterative. Regularly review and adjust strategies based on performance data. Continuously gather feedback from monitoring tools and make data-driven decisions to refine your application's performance over time. Collaborative Analysis Encourage cross-functional teams, including developers, operations, and quality assurance (QA) personnel, to collaboratively analyze performance data and implement improvements. Collaboration ensures that performance optimization is not siloed but integrated into every aspect of the development process. Conclusion Performance optimization in Agile IoT cloud applications is a dynamic and ongoing process. Tools like Grafana, Prometheus, and New Relic play pivotal roles in monitoring and improving the efficiency of these systems. By integrating these tools into the Agile development lifecycle, teams can ensure that their IoT applications not only meet but exceed performance expectations, thereby delivering seamless and effective user experiences. As the IoT landscape continues to grow, the importance of performance optimization in this domain cannot be overstated, making it a key factor for success in Agile IoT cloud application development. Embracing these techniques and tools will not only enhance the performance of your IoT applications but also contribute to the overall success of your projects in this ever-evolving digital age.

By Deep Manishkumar Dave

Achieving High Availability in CI/CD With Observability

Editor's Note: The following is an article written for and published in DZone's 2024 Trend Report, The Modern DevOps Lifecycle: Shifting CI/CD and Application Architectures. Forbes estimates that cloud budgets will break all previous records as businesses will spend over $1 trillion on cloud computing infrastructure in 2024. Since most application releases depend on cloud infrastructure, having good continuous integration and continuous delivery (CI/CD) pipelines and end-to-end observability becomes essential for ensuring highly available systems. By integrating observability tools in CI/CD pipelines, organizations can increase deployment frequency, minimize risks, and build highly available systems. Complementing these practices is site reliability engineering (SRE), a discipline ensuring system reliability, performance, and scalability. This article will help you understand the key concepts of observability and how to integrate observability in CI/CD for creating highly available systems. Observability and High Availability in SRE Observability refers to offering real-time insights into application performance, whereas high availability means ensuring systems remain operational by minimizing downtime. Understanding how the system behaves, performs, and responds to various conditions is central to achieving high availability. Observability equips SRE teams with the necessary tools to gain insights into a system's performance. Figure 1. Observability in the DevOps workflow Components of Observability Observability involves three essential components: Metrics – measurable data on various aspects of system performance and user experience Logs – detailed event information for post-incident reviews Traces – end-to-end visibility in complex architectures to help you understand requests across services Together, they comprehensively picture the system's behavior, performance, and interactions. This observability data can then be analyzed by SRE teams to make data-driven decisions and swiftly resolve issues to make their system highly available. The Role of Observability in High Availability Businesses have to ensure that their development and SRE teams are skilled at predicting and resolving system failures, unexpected traffic spikes, network issues, and software bugs to provide a smooth experience to their users. Observability is vital in assessing high availability by continuously monitoring specific metrics that are crucial for system health, such as latency, error rates, throughput, saturation, and more, therefore providing a real-time health check. Deviations from normal behavior trigger alerts, allowing SRE teams to proactively address potential issues before they impact availability. How Observability Helps SRE Teams Each observability component contributes unique insights into different facets of system performance. These components empower SRE teams to proactively monitor, diagnose, and optimize system behavior. Some use cases of metrics, logs, and traces for SRE teams are post-incident reviews, identification of system weaknesses, capacity planning, and performance optimization. Post-Incident Reviews Observability tools allow SRE teams to look at past data to analyze and understand system behavior during incidents, anomalies, or outages. Detailed logs, metrics, and traces provide a timeline of events that help identify the root causes of issues. Identification of System Weaknesses Observability data aids in pinpointing system weaknesses by providing insights into how the system behaves under various conditions. By analyzing metrics, logs, and traces, SRE teams can identify patterns or anomalies that may indicate vulnerabilities, performance bottlenecks, or areas prone to failures. Capacity Planning and Performance Optimization By collecting and analyzing metrics related to resource utilization, response times, and system throughput, SRE teams can make informed decisions about capacity requirements. This proactive approach ensures that systems are adequately scaled to handle expected workloads and their performance is optimized to meet user demands. In short, resources can be easily scaled down during non-peak hours or scaled up when demands surge. SRE Best Practices for Reliability At its core, SRE practices aim to create scalable and highly reliable software systems using two key principles that guide SRE teams: SRE golden signals and service-level objectives (SLOs). Understanding SRE Golden Signals The SRE golden signals are a set of critical metrics that provide a holistic view of a system's health and performance. The four primary golden signals are: Latency – Time taken for a system to respond to a request. High latency negatively impacts user experience. Traffic – Volume of requests a system is handling. Monitoring helps anticipate and respond to changing demands. Errors – Elevated error rates can indicate software bugs, infrastructure problems, or other issues that may impact reliability. Saturation – Utilization of system resources such as CPU, memory, or disk. It helps identify potential bottlenecks and ensures the system has sufficient resources to handle the load. Setting Effective SLOs SLOs define the target levels of reliability or performance that a service aims to achieve. They are typically expressed as a percentage over a specific time period. SRE teams use SLOs to set clear expectations for a system’s behavior, availability, and reliability. They continuously monitor the SRE golden signals to assess whether the system meets its SLOs. If the system falls below the defined SLOs, it triggers a reassessment of the service's architecture, capacity, or other aspects to improve availability. Businesses can use observability tools to set up alerts based on predetermined thresholds for key metrics. Defining Mitigation Strategies Automating repetitive tasks, such as configuration management, deployments, and scaling, reduces the risk of human error and improves system reliability. Introducing redundancy in critical components ensures that a failure in one area doesn't lead to a system-wide outage. This could involve redundant servers, data centers, or even cloud providers. Additionally, implementing rollback mechanisms for deployments allows SRE teams to quickly revert to a stable state in the event of issues introduced by new releases. CI/CD Pipelines for Zero Downtime Achieving zero downtime through effective CI/CD pipelines enables services to provide users with continuous access to the latest release. Let’s look at some of the key strategies employed to ensure zero downtime. Strategies for Designing Pipelines to Ensure Zero Downtime Some strategies for minimizing disruptions and maximizing user experience include blue-green deployments, canary releases, and feature toggles. Let’s look at them in more detail. Figure 2. Strategies for designing pipelines to ensure zero downtime Blue-Green Deployments Blue-green deployments involve maintaining two identical environments (blue and green), where only one actively serves production traffic at a time. When deploying updates, traffic is seamlessly switched from the current (blue) environment to the new (green) one. This approach ensures minimal downtime as the transition is instantaneous, allowing quick rollback in case issues arise. Canary Releases Canary releases involve deploying updates to a small subset of users before rolling them out to everyone. This gradual and controlled approach allows teams to monitor for potential issues in a real-world environment with reduced impact. The deployment is released to a wider audience if the canary group experiences no significant issues. Feature Toggles Feature toggles, or feature flags, enable developers to control the visibility of new features in production independently of other features. By toggling features on or off, teams can release code to production but activate or deactivate specific functionalities dynamically without deploying new code. This approach provides flexibility, allowing features to be gradually rolled out or rolled back without redeploying the entire application. Best Practices in CI/CD for Ensuring High Availability Successfully implementing CI/CD pipelines for high availability often requires a good deal of consideration and lots of trial and error. While there are many implementations, adhering to best practices can help you avoid common problems and improve your pipeline faster. Some industry best practices you can implement in your CI/CD pipeline to ensure zero downtime are automated testing, artifact versioning, and Infrastructure as Code (IaC). Automated Testing You can use comprehensive test suites — including unit tests, integration tests, and end-to-end tests — to identify potential issues early in the development process. Automated testing during integration provides confidence in the reliability of code changes, reducing the likelihood of introducing critical bugs during deployments. Artifact Versioning By assigning unique versions to artifacts, such as compiled binaries or deployable packages, teams can systematically track changes over time. This practice enables precise identification of specific code iterations, thus simplifying debugging, troubleshooting, and rollback processes. Versioning artifacts ensures traceability and facilitates rollback to previous versions in the case of issues during deployment. Infrastructure as Code Utilize Infrastructure as Code to define and manage infrastructure configurations, using tools such as OpenTofu, Ansible, Pulumi, Terraform, etc. IaC ensures consistency between development, testing, and production environments, reducing the risk of deployment-related issues. Integrating Observability Into CI/CD Pipelines Observing key metrics such as build success rates, deployment durations, and resource utilization during CI/CD provides visibility into the health and efficiency of the CI/CD pipeline. Observability can be implemented during continuous integration (CI) and continuous deployment (CD) as well as post-deployment. Observability in Continuous Integration Observability tools capture key metrics during the CI process, such as build success rates, test coverage, and code quality. These metrics provide immediate feedback on the health of the codebase. Logging enables the recording of events and activities during the CI process. Logs help developers and CI/CD administrators troubleshoot issues and understand the execution flow. Tracing tools provide insights into the execution path of CI tasks, allowing teams to identify bottlenecks or areas for optimization. Observability in Continuous Deployment Observability platforms monitor the CD pipeline in real time, tracking deployment success rates, deployment durations, and resource utilization. Observability tools integrate with deployment tools to capture data before, during, and after deployment. Alerts based on predefined thresholds or anomalies in CD metrics notify teams of potential issues, enabling quick intervention and minimizing the risk of deploying faulty code. Post-Deployment Observability Application performance monitoring tools provide insights into the performance of deployed applications, including response times, error rates, and transaction traces. This information is crucial for identifying and resolving issues introduced during and after deployment. Observability platforms with error-tracking capabilities help pinpoint and prioritize software bugs or issues arising from the deployed code. Aggregating logs from post-deployment environments allows for a comprehensive view of system behavior and facilitates troubleshooting and debugging. Conclusion The symbiotic relationship between observability and high availability is integral to meeting the demands of agile, user-centric development environments. With real-time monitoring, alerting, and post-deployment insights, observability plays a major role in achieving and maintaining high availability. Cloud providers are now leveraging drag-and-drop interfaces and natural language tools to eliminate the need for advanced technical skills for deployment and management of cloud infrastructure. Hence, it is easier than ever to create highly available systems by combining the powers of CI/CD and observability. Resources: Continuous Integration Patterns and Anti-Patterns by Nicolas Giron and Hicham Bouissoumer, DZone Refcard Continuous Delivery Patterns and Anti-Patterns by Nicolas Giron and Hicham Bouissoumer, DZone Refcard "The 10 Biggest Cloud Computing Trends In 2024 Everyone Must Be Ready For Now" by Bernard Marr, Forbes This is an excerpt from DZone's 2024 Trend Report,The Modern DevOps Lifecycle: Shifting CI/CD and Application Architectures.For more: Read the Report

By Lipsa Das

CORE

Mastering Prometheus: Unlocking Actionable Insights and Enhanced Monitoring in Kubernetes Environments

In the dynamic world of cloud-native technologies, monitoring and observability have become indispensable. Kubernetes, the de-facto orchestration platform, offers scalability and agility. However, managing its health and performance efficiently necessitates a robust monitoring solution. Prometheus, a powerful open-source monitoring system, emerges as a perfect fit for this role, especially when integrated with Kubernetes. This guide outlines a strategic approach to deploying Prometheus in a Kubernetes cluster, leveraging helm for installation, setting up an ingress nginx controller with metrics scraping enabled, and configuring Prometheus alerts to monitor and act upon specific incidents, such as detecting ingress URLs that return 500 errors. Prometheus Prometheus excels at providing actionable insights into the health and performance of applications and infrastructure. By collecting and analyzing metrics in real-time, it enables teams to proactively identify and resolve issues before they impact users. For instance, Prometheus can be configured to monitor system resources like CPU, memory usage, and response times, alerting teams to anomalies or thresholds breaches through its powerful alerting rules engine, Alertmanager. Utilizing PromQL, Prometheus's query language, teams can dive deep into their metrics, uncovering patterns and trends that guide optimization efforts. For example, tracking the rate of HTTP errors or response times can highlight inefficiencies or stability issues within an application, prompting immediate action. Additionally, by integrating Prometheus with visualization tools like Grafana, teams can create dashboards that offer at-a-glance insights into system health, facilitating quick decision-making. Through these capabilities, Prometheus not only monitors systems but also empowers teams with the data-driven insights needed to enhance performance and reliability. Prerequisites Docker and KIND: A Kubernetes cluster set-up utility (Kubernetes IN Docker.) Helm, a package manager for Kubernetes, installed. Basic understanding of Kubernetes and Prometheus concepts. 1. Setting Up Your Kubernetes Cluster With Kind Kind allows you to run Kubernetes clusters in Docker containers. It's an excellent tool for development and testing. Ensure you have Docker and Kind installed on your machine. To create a new cluster: kind create cluster --name prometheus-demo Verify your cluster is up and running: kubectl cluster-info --context kind-prometheus-demo 2. Installing Prometheus Using Helm Helm simplifies the deployment and management of applications on Kubernetes. We'll use it to install Prometheus: Add the Prometheus community Helm chart repository: helm repo add prometheus-community https://prometheus-community.github.io/helm-charts helm repo update Install Prometheus: helm install prometheus prometheus-community/kube-prometheus-stack --namespace monitoring --create-namespace helm upgrade prometheus prometheus-community/kube-prometheus-stack \ --namespace monitoring \ --set prometheus.prometheusSpec.podMonitorSelectorNilUsesHelmValues=false \ --set prometheus.prometheusSpec.serviceMonitorSelectorNilUsesHelmValues=false This command deploys Prometheus along with Alertmanager, Grafana, and several Kubernetes exporters to gather metrics. Also, customize your installation to scan for service monitors in all the namespaces. 3. Setting Up Ingress Nginx Controller and Enabling Metrics Scraping Ingress controllers play a crucial role in managing access to services in a Kubernetes environment. We'll install the Nginx Ingress Controller using Helm and enable Prometheus metrics scraping: Add the ingress-nginx repository: helm repo add ingress-nginx https://kubernetes.github.io/ingress-nginx helm repo update Install the ingress-nginx chart: helm upgrade --install ingress-nginx ingress-nginx/ingress-nginx \ --namespace ingress-nginx --create-namespace \ --set controller.metrics.enabled=true \ --set controller.metrics.serviceMonitor.enabled=true \ --set controller.metrics.serviceMonitor.additionalLabels.release="prometheus" This command installs the Nginx Ingress Controller and enables Prometheus to scrape metrics from it, essential for monitoring the performance and health of your ingress resources. 4. Monitoring and Alerting for Ingress URLs Returning 500 Errors Prometheus's real power shines in its ability to not only monitor your stack but also provide actionable insights through alerting. Let's configure an alert to detect when ingress URLs return 500 errors. Define an alert rule in Prometheus: Create a new file called custom-alerts.yaml and define an alert rule to monitor for 500 errors: apiVersion: monitoring.coreos.com/v1 kind: PrometheusRule metadata: name: ingress-500-errors namespace: monitoring labels: prometheus: kube-prometheus spec: groups: - name: http-errors rules: - alert: HighHTTPErrorRate expr: | sum (rate(nginx_ingress_controller_requests{status=~"5.."}[1m])) > 0.1 OR absent(sum (rate(nginx_ingress_controller_requests{status=~"5.."}[1m]))) for: 1m labels: severity: critical annotations: summary: High HTTP Error Rate description: "This alert fires when the rate of HTTP 500 responses from the Ingress exceeds 0.1 per second over the last 5 minutes." Apply the alert rule to Prometheus: You'll need to configure Prometheus to load this alert rule. If you're using the Helm chart, you can customize the values.yaml file or create a ConfigMap to include your custom alert rules. Verify the alert is working: Trigger a condition that causes a 500 error and observe Prometheus firing the alert. For example, launch the following application: kubectl create deploy hello --image brainupgrade/hello:1.0 kubectl expose deploy hello --port 80 --target-port 8080 kubectl create ingress hello --rule="hello.internal.brainupgrade.in/=hello:80" --class nginx Access the application using the below command: curl -H "Host: hello.internal.brainupgrade.in" 172.18.0.3:31080 Wherein: 172.18.0.3 is the IP of the KIND cluster node. 31080 is the node port of the ingress controller service. This could be different in your case. Bring down the hello service pods using the following command: kubectl scale --replicas 0 deploy hello You can view active alerts in the Prometheus UI (localhost:9999) by running the following command. kubectl port-forward -n monitoring svc/prometheus-operated 9999:9090 And you will see the alert being fired. See the following snapshot: Error alert on Prometheus UI. You can also configure Alertmanager to send notifications through various channels (email, Slack, etc.). Conclusion Integrating Prometheus with Kubernetes via Helm provides a powerful, flexible monitoring solution that's vital for maintaining the health and performance of your cloud-native applications. By setting up ingress monitoring and configuring alerts for specific error conditions, you can ensure your infrastructure not only remains operational but also proactively managed. Remember, the key to effective monitoring is not just collecting metrics but deriving actionable insights that lead to improved reliability and performance.

By Rajesh Gheware

O11y Guide, Cloud-Native Observability Pitfalls: Ignoring Existing Landscape

Are you looking at your organization's efforts to enter or expand into the cloud-native landscape and feeling a bit daunted by the vast expanse of information surrounding cloud-native observability? When you're moving so fast with agile practices across your DevOps, SREs, and platform engineering teams, it's no wonder this can seem a bit confusing. Unfortunately, the choices being made have a great impact on both your business, your budgets, and the ultimate success of your cloud-native initiatives that hasty decisions upfront lead to big headaches very quickly down the road. In the previous article, we looked at the problem of underestimating cardinality in our cloud-native observability solutions. Now it's time to move on to another common mistake organizations make, that of ignoring our existing landscape. By sharing common pitfalls in this series, the hope is that we can learn from them. This article could also have been titled, "Underestimating Our Existing Landscape." When we start planning to integrate our application landscape into our observability solution, we often end up with large discrepancies between planning and outcomes. They Can't Hurt Me The truth is we have a lot of applications out there in our architecture. The strange thing is during the decision-making process around cloud native observability and scoping solutions, they often are forgotten. Well, not necessarily forgotten, but certainly underestimated. The cost that they bring is in the hidden story around instrumentation. We have auto-instrumentation that suggests it's quick and easy, but often does not bring the exactly needed insights. On top of that, auto-instrumentation generates extra data from metrics and tracing activities that we are often not that interested in. Manual instrumentation is the real cost to provide our exact insights and the data we want to watch from our application landscape. This is what often results in unexpected or incorrectly scoped work (a.k.a., costs) with it as we change, test, and deploy new versions of existing applications. We want to stay with open source and open standards in our architecture, so we are going to end up in the cloud native standards found within the Cloud Native Computing Foundation. With that in mind, we can take a closer look at two technologies for our cloud-native observability solution: one for metrics and one for traces. Instrumenting Metrics Widely adopted and accepted standards for metrics can be found in the Prometheus project, including time-series storage, communication protocols to scrape (pull) data from targets, and PromQL, the query language for visualizing the data. Below you see an outline of the architecture used by Prometheus to collect metrics data. There are client libraries, exporters, and standards in communication to detect services across various cloud-native technologies. They make it look extremely low effort to ensure we can start collecting meaningful data in the form of standardized metrics from your applications, devices, and services. The reality is that we need to look much closer at scoping the efforts required to instrument our applications. Below you see an example of what is necessary to (either auto or manually) instrument a Java application. The process is the same for either method. While some of the data can be automatically gathered, that's just generic Java information for your applications and services. Manual instrumentation is the cost you can't forget, where you need to make code changes and redeploy. While it's nice to discuss manual instrumentation in the abstract sense, nothing beats getting hands-on with a real coding example. To that end, we can dive into what it takes to both auto and manually instrument a simple Java application in this workshop lab. Below you see a small example of the code you will apply to your example application in one of the workshop exercises to create a gauge metric: Java // Start thread and apply values to metrics. Thread bgThread = new Thread(() -> { while (true) { try { counter.labelValues("ok").inc(); counter.labelValues("ok").inc(); counter.labelValues("error").inc(); gauge.labelValues("value").set(rand(-5, 10)); TimeUnit.SECONDS.sleep(1); } catch (InterruptedException e) { e.printStackTrace(); } } }); bgThread.start(); Be sure to explore the free online workshop and get hands-on experience with what instrumentation for your Java applications entails. Instrumenting Traces In the case of tracing, a widely adopted and accepted standard is the OpenTelemetry (OTel) project, which is used to instrument and collect telemetry data through a push mechanism to an agent installed on the host. Below you see an outline of the architecture used by OTel to collect telemetry data: Whether we choose automatic or manual instrumentation, we have the same issues as previously discussed above. Our applications and services all require some form of cost to instrument our applications and we can't forget that when scoping our observability solutions. The telemetry data is pushed to an agent, known as the OTel Collector, which is installed on the application's host platform. It uses a widely accepted open standard to communicate known as the OpenTelemetry Protocol (OTLP). Note that OTel does not have a backend component, instead choosing to leverage other technologies for the backend and the collector sends all processed telemetry data onwards to that configured backend. Again, it's nice to discuss manual instrumentation in the abstract sense, but nothing beats getting hands-on with a real coding example. To that end, we can dive into what it takes to programmatically instrument a simple application using OTel in this workshop lab. Below, you see a small example of the code that you will apply to your example application in one of the workshop exercises to collect OTel telemetry data, and later in the workshop, view in the Jaeger UI: Python ... from opentelemetry.trace import get_tracer_provider, set_tracer_provider set_tracer_provider(TracerProvider()) get_tracer_provider().add_span_processor( BatchSpanProcessor(ConsoleSpanExporter()) ) instrumentor = FlaskInstrumentor() app = Flask(__name__) instrumentor.instrument_app(app) ... Be sure to explore the free online workshop and get hands-on yourself to experience how much effort it is to instrument your applications using OTel. The road to cloud-native success has many pitfalls. Understanding how to avoid the pillars and focusing instead on solutions for the phases of observability will save much wasted time and energy. Coming Up Next Another pitfall organizations struggle with in cloud native observability is the protocol jungle. In the next article in this series, I'll share why this is a pitfall and how we can avoid it wreaking havoc on our cloud-native observability efforts.

By Eric D. Schabell

CORE

The Cost Crisis in Observability Tooling

The cost of services is on everybody’s mind right now, with interest rates rising, economic growth slowing, and organizational budgets increasingly feeling the pinch. But I hear a special edge in people’s voices when it comes to their observability bill, and I don’t think it’s just about the cost of goods sold. I think it’s because people are beginning to correctly intuit that the value they get out of their tooling has become radically decoupled from the price they are paying. In the happiest cases, the price you pay for your tools is “merely” rising at a rate several times faster than the value you get out of them. But that’s actually the best-case scenario. For an alarming number of people, the value they get actually decreases as their bill goes up. Observability 1.0 and the Cost Multiplier Effect Are you familiar with this chestnut? “Observability has three pillars: metrics, logs, and traces.” This isn’t exactly true, but it’s definitely true of a particular generation of tools—one might even say it's definitionally true of a particular generation of tools. Let’s call it “observability 1.0.” From an evolutionary perspective, you can see how we got here. Everybody has logs… so we spin up a service for log aggregation. But logs are expensive, and everybody wants dashboards… so we buy a metrics tool. Software engineers want to instrument their applications… so we buy an APM tool. We start unbundling the monolith into microservices, and pretty soon, we can’t understand anything without traces… so we buy a tracing tool. The front-end engineers point out that they need sessions and browser data… so we buy a RUM tool. On and on it goes. Logs, metrics, traces, APM, RUM. You’re now paying to store telemetry five different ways, in five different places, for every single request. And a 5x multiplier is on the modest side of the spectrum, given how many companies pay for multiple overlapping tools in the same category. You may also be collecting: Profiling data Product analytics Business intelligence data Database monitoring/query profiling tools Mobile app telemetry Behavioral analytics Crash reporting Language-specific profiling data Stack traces CloudWatch or hosting provider metrics …and so on. So, how many times are you paying to store data about your user requests? What’s your multiplier? (If you have one consolidated vendor bill, this may require looking at your itemized bill.) There are many types of tools, each gathering slightly different data for a slightly different use case, but underneath the hood, there are really only three basic data types: metric, unstructured logs, and structured logs. Each of these has its own distinctive trade-offs when it comes to how much they cost and how much value you can get out of them. Metrics Metrics are the great-granddaddy of telemetry formats: tiny, fast, and cheap. A “metric” consists of a single number, often with tags appended. All of the contexts of the request get discarded at write time; each individual metric is emitted separately. This means you can never correlate one metric with another from the same request, or select all the metrics for a given request ID, user, or app ID, or ask arbitrary new questions about your metrics data. Metrics-based tools include vendors like Datadog and open-source projects like Prometheus. RUM tools are built on top of metrics to understand browser user sessions; APM tools are built on top of metrics to understand application performance. When you set up a metrics tool, it generally comes prepopulated with a bunch of basic metrics, but the useful ones are typically the custom metrics you emit from your application. Your metrics bill is usually dominated by the cost of these custom metrics. At a minimum, your bill goes up linearly with the number of custom metrics you create. This is unfortunate because to restrain your bill from unbounded growth, you have to regularly audit your metrics, do your best to guess which ones are going to be useful in the future and prune any you think you can afford to go without. Even in the hands of experts, these tools require significant oversight. Linear cost growth is the goal, but it’s rarely achieved. The cost of each metric varies wildly depending on how you construct it, what the values are, how often it gets hit, etc. I’ve seen a single custom metric cost $30k per month. You probably have dozens of custom metrics per service, and it’s almost impossible to tell how much each of them costs you. Metrics bills tend to be incredibly opaque (possibly by design). Nobody can understand their software or their systems with a metrics tool alone because the metric is extremely limited in what it can do. No context, no cardinality, no strings… only basic static dashboards. For richer data, we must turn to logs. Unstructured Logs You can understand much more about your code with logs than you can with metrics. Logs are typically emitted multiple times throughout the execution of the request, with one or a small number of nouns per log line plus the request ID. Unstructured logs are still the default, although this is slowly changing. The cost of unstructured logs is driven by a few things: Write amplification: If you want to capture lots of rich context about the request, you need to emit a lot of log lines. If you are printing out just 10 log lines per request, per service, and you have half a dozen services, that’s 60 log events for every request. Noisiness: It’s extremely easy to accidentally blow up your log footprint yet add no value—e.g., by putting a print statement inside a loop instead of outside the loop. Here, the usefulness of the data goes down as the bill shoots up. Constraints on physical resources Due to the write amplification of log lines per request, it’s often physically impossible to log everything you want to log for all requests or all users—it would saturate your NIC or disk. Therefore, people tend to use blunt instruments like these to blindly slash the log volume: Log levels Consistent hashes Dumb sample rates When you emit multiple log lines per request, you end up duplicating a lot of raw data; sometimes, over half the bits are consumed by request ID, process ID, and timestamp. This can be quite meaningful from a cost perspective. All of these factors can be annoying. But the worst thing about unstructured logs is that the only thing you can do to query them is a full-text search. The more data you have, the slower it becomes to search that data, and there’s not much you can do about it. Searching your logs over any meaningful length of time can take minutes or even hours, which means experimenting and looking around for unknown unknowns is prohibitively time-consuming. You have to know what to look for in order to find it. Once again, as your logging bill goes up, the value goes down. Structured Logs Structured logs are gaining adoption across the industry, especially as OpenTelemetry picks up steam. The nice thing about structured logs is that you can actually do things with the data other than slow, dumb string searches. If you’ve structured your data properly, you can perform calculations! Compute percentiles! Generate heatmaps! Tools built on structured logs are so clearly the future. But just taking your existing logs and adding structure isn’t quite good enough. If all you do is stuff your existing log lines into key/value pairs, the problems of amplification, noisiness, and physical constraints remain unchanged—you can just search more efficiently and do some math with your data. There are a number of things you can and should do to your structured logs in order to use them more effectively and efficiently. In order of achievability: Instrument your code using the principles of canonical logs, which collect all the vital characteristics of a request into one wide, dense event. It is difficult to overstate the value of doing this for reasons of usefulness and usability as well as cost control. Add trace IDs and span IDs so you can trace your code using the same events instead of having to use an entirely separate tool. Feed your data into a columnar storage engine so you don’t have to predefine a schema or indexes to decide which dimensions of the future you can search or compute based on. Use a storage engine that supports high cardinality with an explorable interface. If you go far enough down this path of enriching your structured events, instrumenting your code with the right data, and displaying it in real-time, you will reach an entirely different set of capabilities, with a cost model so distinct it can only be described as “observability 2.0.” More on that in a second. Ballooning Costs Are Baked Into Observability 1.0 To recap, high costs are baked into the observability 1.0 model. Every pillar has a price. You have to collect and store your data—and pay to store it—again and again and again for every single use case. Depending on how many tools you use, your observability bill may be growing at a rate 3x faster than your traffic is growing, or 5x, or 10x, or even more. It gets worse. As your costs go up, the value you get out of your tools goes down. Your logs get slower and slower to search. You have to know what you’re searching for in order to find it. You have to use a blunt force sampling technique to keep the log volume from blowing up. Any time you want to be able to ask a new question, you first have to commit to a new code and deploy it. You have to guess which custom metrics you’ll need and which fields to index in advance. As the volume goes up, your ability to find a needle in the haystack—any unknown-unknowns—goes down commensurately. And nothing connects any of these tools. You cannot correlate a spike in your metrics dashboard with the same requests in your logs, nor can you trace one of the errors. It’s impossible. If your APM and metrics tools report different error rates, you have no way of resolving this confusion. The only thing connecting any of these tools is the intuition and straight-up guesses made by your most senior engineers. This means that the cognitive costs are immense, and your bus factor risks are very real. The most important connective data in your system—connecting metrics with logs and logs with traces—exists only in the heads of a few people. At the same time, the engineering overhead required to manage all these tools (and their bills) rises inexorably. With metrics, an engineer needs to spend time auditing your metrics, tracking people down to fix poorly constructed metrics, and reaping those that are too expensive or don’t get used. With logs, an engineer needs to spend time monitoring the log volume, watching for spammy or duplicate log lines, pruning or consolidating them, and choosing and maintaining indexes. But all this time spent wrangling observability 1.0 data types isn’t even the costliest part. The most expensive part is the unseen costs inflicted on your engineering organization as development slows down and tech debt piles up due to low visibility and, thus, low confidence. Is there an alternative? Yes. The Cost Model of Observability 2.0 Is Very Different Observability 2.0 has no three pillars; it has a single source of truth. Observability 2.0 tools are built on top of arbitrarily wide structured log events, also known as spans. From these wide, context-rich structured log events, you can derive the other data types (metrics, logs, or traces). Since there is only one data source, you can correlate and cross-correlate to your heart’s content. You can switch fluidly back and forth between slicing and dicing, breaking down or grouping by events, and viewing them as a trace waterfall. You don’t have to worry about cardinality or key space limitations. You also effectively get infinite custom metrics since you can append as many as you want to the same events. Not only does your cost not go up linearly as you add more custom metrics, but your telemetry just gets richer and more valuable the more key-value pairs you add! Nor are you limited to numbers; you can add any and all types of data, including valuable high-cardinality fields like “App Id” or “Full Name.” Observability 2.0 has its own amplification factor to consider. As you instrument your code with more spans per request, the number of events you have to send (and pay for) goes up. However, you have some very powerful tools for dealing with this: you can perform dynamic head-based sampling or even tail-based sampling, where you decide whether or not to keep the event after it’s finished, allowing you to capture 100% of slow requests and other outliers. Engineering Time Is Your Most Precious Resource But the biggest difference between observability 1.0 and 2.0 won’t show up on any invoice. The difference shows up in your engineering team’s ability to move quickly and with confidence. Modern software engineering is all about hooking up fast feedback loops. Observability 2.0 tooling is what unlocks the kind of fine-grained, exploratory experience you need in order to accelerate those feedback loops. Where observability 1.0 is about MTTR, MTTD, reliability, and operating software, observability 2.0 is what underpins the entire software development lifecycle, setting the bar for how swiftly you can build and ship software, find problems, and iterate on them. Observability 2.0 is about being in conversation with your code, understanding each user’s experience, and building the right things. Observability 2.0 isn’t exactly cheap either, although it is often less expensive. But the key difference between o11y 1.0 and o11y 2.0 has never been that either is cheap; it’s that with observability 2.0 when your bill goes up, the value you derive from your telemetry goes up too. You pay more money; you get more out of your tools than you should. Note: Earlier, I said, “Nothing connects any of these tools.” If you are using a single unified vendor for your metrics, logging, APM, RUM, and tracing tools, this is not strictly true. Vendors like New Relic or Datadog now let you define certain links between your traces and metrics, which allows you to correlate between data types in a few limited, predefined ways. This is better than nothing! But it’s very different from the kind of fluid, open-ended correlation capabilities that we describe with o11y 2.0. With o11y 2.0, you can slice and dice, break down, and group by your complex data sets, then grab a trace that matches any specific set of criteria at any level of granularity. With o11y 1.0, you can define a metric up front, then grab a random exemplar of that metric, and that’s it. All the limitations of metrics still apply; you can’t correlate any metric with any other metric from that request, app, user, etc, and you certainly can’t trace arbitrary criteria. But it’s not nothing.

By Charity Majors

Logging and Monitoring in Microsoft Azure

Microsoft Azure is a major cloud computing platform that provides a comprehensive set of services for developing, deploying, and managing applications and infrastructure. Effective logging and monitoring are critical for ensuring the performance, security, and cost-effectiveness of your Azure cloud services. In this post, we will look at the significance of logging and monitoring in Azure, as well as numerous alternatives and best practices for logging and monitoring, as well as popular Azure services and tools that may assist you in achieving these goals. The Importance of Logging and Monitoring in Azure Before diving into the technical aspects of logging and monitoring in Azure, it’s crucial to understand why these activities are vital in a cloud-based environment. 1. Troubleshooting Azure environments can be complex, with numerous services, resources, and dependencies. When issues arise, you need the ability to identify and resolve them quickly. Logging and monitoring provide the visibility required to pinpoint problems, whether it’s a misconfigured resource, performance bottlenecks, or network connectivity issues. 2. Performance Optimization To ensure that your applications run efficiently in Azure, you need insights into resource utilization, response times, and other performance metrics. Monitoring tools help you fine-tune your infrastructure, optimizing resource allocation and preventing performance degradation. 3. Security and Compliance Security is a top priority in Azure. Logging and monitoring are essential for detecting and responding to security threats and vulnerabilities. Azure environments are frequently targeted by cyberattacks, making it critical to maintain visibility into security-related events. 4. Cost Management Azure usage costs can escalate quickly if resources are not appropriately managed. Effective monitoring can help you track resource utilization and costs, enabling you to make informed decisions about scaling and optimizing your infrastructure. Logging in Azure Logging in Azure involves capturing and managing logs generated by Azure services, applications, and resources. Azure provides various services and options for collecting and storing logs, each with its own characteristics and use cases. Let’s explore some of the key options for logging in Azure. 1. Azure Monitor Logs Azure Monitor Logs is a centralized log management service that allows you to collect and store logs from various Azure services, applications, and infrastructure. It provides advanced features for searching, analyzing, and monitoring log data. Azure Monitor Logs also supports custom log queries and alerting, making it a comprehensive logging solution. 2. Azure Activity Logs Azure Activity Logs capture all administrative activity within your Azure subscription. They provide a detailed audit trail of actions taken on your Azure resources, making them crucial for auditing and compliance requirements. Activity Logs can be accessed and analyzed through Azure Monitor Logs. 3. Azure Application Insights Azure Application Insights is a service that provides detailed application performance and usage telemetry. It collects data about application requests, dependencies, exceptions, and custom events. Application Insights is ideal for monitoring web applications and microservices. 4. Azure Network Watcher Azure Network Watcher is a network performance monitoring and diagnostic service. It captures network traffic data, monitors connectivity, and helps troubleshoot network issues. Network Watcher is useful for monitoring and optimizing network performance. 5. Azure Security Center Azure Security Center provides threat protection across Azure resources. It collects and analyzes security data and logs from Azure services and infrastructure, helping you identify and mitigate security threats. 6. Azure Functions Logs If you use Azure Functions for serverless computing, these functions automatically generate logs for each execution. You can access these logs through Azure Monitor Logs to track the performance and behavior of your serverless functions. Best Practices for Logging in Azure To ensure effective logging in Azure, follow these best practices: 1. Centralized Log Management Use a centralized log management solution like Azure Monitor Logs to aggregate logs from various Azure services and applications. Centralized logging simplifies log analysis and monitoring. 2. Set up Log Retention Policies Establish log retention policies to manage log storage effectively. Determine how long logs should be retained based on compliance and business requirements. Configure automatic log deletion or archiving. 3. Implement Security Measures Protect your log data by applying appropriate access controls and encryption. Ensure that only authorized users and services can access and modify log data. Encrypt sensitive log data at rest and in transit. 4. Create Log Hierarchies Organize logs into hierarchies or groups based on the Azure service, application, or resource generating the logs. This structuring simplifies log management and search. 5. Define Log Sources Clearly define the sources of logs and the format in which they are generated. This information is crucial for setting up effective log analysis and monitoring. 6. Monitor and Alert on Logs Use Azure Monitor Logs features to monitor log data for specific events or patterns. Configure alerts to trigger notifications when predefined conditions are met, such as errors or security breaches. 7. Regularly Review and Analyze Logs Frequently review log data to identify anomalies, errors, and potential security threats. Automated log analysis tools can help in this process, flagging issues and trends for further investigation. Monitoring in Azure Monitoring in Azure involves collecting and analyzing performance metrics, resource utilization, and other data to ensure the efficient operation of your Azure environment. Azure offers a range of services and tools for monitoring that can help you gain insights into your infrastructure’s health and performance. 1. Azure Monitor Azure Monitor is the primary service for monitoring Azure resources and applications. It collects and stores metrics, sets alarms, and provides insights into resource utilization, application performance, and system behavior. 2. Azure Metrics Azure Metrics provide a wealth of information about your Azure resources and services. These metrics can be used to track performance, monitor resource usage, and trigger alarms when specific conditions are met. 3. Azure Application Insights Azure Application Insights provides detailed application performance and usage telemetry. It helps you monitor application performance, detect anomalies, and gain insights into application behavior. 4. Azure Security Center Azure Security Center provides threat protection across Azure resources. It collects and analyzes security data and logs from Azure services and infrastructure, helping you identify and mitigate security threats. 5. Azure Automation Azure Automation offers a range of features for monitoring and managing resources in Azure. It can be used to create runbooks that automate tasks and remediation based on monitoring data. 6. Azure Monitor for Containers Azure Monitor for Containers provides monitoring and diagnostics capabilities for containers in Azure Kubernetes Service (AKS) and Azure Container Instances. It captures performance and health data from containerized applications. Best Practices for Monitoring in Azure To ensure effective monitoring in Azure, follow these best practices: 1. Define Monitoring Objectives Clearly define what you want to achieve with monitoring. Determine the key metrics and alerts that are critical to your applications’ performance, security, and cost management. 2. Collect Relevant Metrics Collect metrics that are relevant to your applications, including resource usage, application-specific metrics, and business-related KPIs. Avoid collecting excessive data that can lead to information overload. 3. Set up Alarms Configure alarms in Azure Monitor to trigger notifications when specific conditions are met. Alarms should be actionable and not generate unnecessary alerts. 4. Automate Remediation Implement automated remediation actions based on alarms and events. For example, you can use Azure Logic Apps to automatically scale resources, shut down compromised instances, or trigger other responses. 5. Use Visualization and Dashboards Create interactive dashboards to visualize your metrics and performance data. Dashboards provide a real-time, at-a-glance view of your Azure environment’s health. They are especially useful during incidents and investigations. 6. Regularly Review and Analyze Data Frequently review and analyze the data collected by Azure monitoring services. This practice helps you identify performance issues, security breaches, and areas for optimization. 7. Involve All Stakeholders Collaborate with all relevant stakeholders, including developers, operators, and business teams, to define monitoring requirements and objectives. This ensures that monitoring aligns with the overall business goals. Conclusion Logging and monitoring are critical components of efficiently managing an Azure system. They give the visibility and information required to solve issues, optimize performance, and keep your cloud-based infrastructure secure. You can keep your Azure environment strong, resilient, and cost-effective by following best practices and employing the correct tools and services. Remember that logging and monitoring are dynamic procedures that should change in tandem with your apps and infrastructure. Review and update your logging and monitoring techniques on a regular basis to adapt to changing requirements and keep ahead of possible problems. Your Azure installation can function smoothly and give the performance and dependability your users demand with the correct strategy.

By Aditya Bhuyan

Revolutionizing Observability: How AI-Driven Observability Unlocks a New Era of Efficiency

Observability is the ability to measure the state of a service or software system with the help of tools such as logs, metrics, and traces. It is a crucial aspect of distributed systems, as it allows stakeholders such as Software Engineers, Site Reliability Engineers, and Product Managers to troubleshoot issues with their service, monitor performance, and gain insights into the software system's behavior. It also helps to bring visibility into important Product decisions such as monitoring the adoption rate of a new feature, analyzing user feedback, and identifying and fixing any performance issues to ensure a stable and delightful customer experience. In this article, we will discuss the importance of observability in distributed systems, the different tools used for monitoring, and the future of observability and Generative AI. Importance of Observability in Distributed Systems Distributed systems are a type of software architecture that involves multiple services and servers working together to achieve a common goal. Some examples of distributed applications include: Streaming services: Streaming services like Netflix and Spotify use distributed systems to handle large volumes of data and ensure smooth playback for users. Rideshare applications: Rideshare applications like Uber and Lyft rely on distributed systems to match drivers with passengers, track vehicle locations, and process payments. Distributed systems have several advantages, such as: Availability: If one server or pod on the network goes down, another can be spun up and pick up the work, thus ensuring high availability. Scalability: Distributed systems can scale out to accommodate increased load by adding more servers, making it easier to scale quickly, handle more users, or process more data. Maintainability: Distributed systems are more maintainable than centralized systems, as individual servers can be updated or replaced without affecting the overall system. However, distributed systems also come with disadvantages, such as increased complexity of management and the need for a deep understanding of the system's components. Observability helps to address these challenges. Troubleshooting Observability allows Engineers to diagnose issues in distributed systems more effectively by providing insightful information on system performance and behavior. Let’s take an example: when users of a video streaming service experience unexpected buffering, observability tools can help engineers quickly identify if the cause is a server overload, a network bottleneck, or a bad deployment, enabling a swift resolution to keep binge-watchers happily streaming. Preventive Measures By identifying potential problems before they occur, observability helps to prevent failures and improve system reliability. For example, if our video streaming service's metrics show a spike in CPU usage, engineers can identify the cause as a memory leak in a specific microservice. By addressing this issue proactively, they can prevent the service from crashing and ensure a smooth streaming experience for users. Business Insights Observability patterns for distributed systems provide valuable information for business decision-making. In the case of our video streaming service, observability tools can reveal user engagement patterns, such as peak viewing times, which can inform server scaling strategies to handle high traffic during new episode releases, thereby enhancing user satisfaction and reducing churn. The Three Pillars of Observability Logs, metrics, and traces are often known as the three pillars of observability. These powerful tools, if understood well, can unlock the ability to build better systems. 1. Logs Event logs are immutable, timestamped records of discrete events that happened over time. They provide information on system activity and timestamps. Let’s go back to our example of a video streaming service. Every time a user watches a video, an event log is created. This log contains details like the user ID, video ID, playback start time, timestamp of the event, and any errors encountered during streaming. If there are errors observed during video playback, engineers can look at these logs to understand what happened during that specific viewing session. 2. Metrics Metrics are quantitative data points that measure various aspects of system performance and product usage. Metrics such as CPU usage, memory usage, and network bandwidth of the servers delivering the video content are constantly monitored. Alerts can be configured on metric thresholds. If there's a sudden spike in page load latency, an alert would go off indicating there’s a problem that needs to be addressed to prevent a downgraded customer experience. 3. Traces Traces provide a detailed view of the path that a request takes through a distributed system. For a video streaming service, a trace could show the journey of a user's request from the moment they log in to the platform and hit play to the point where the video begins streaming. This trace would include all the microservices involved, such as authentication, content delivery, and data storage. If there's a delay in video start time, tracing can help pinpoint exactly where in the process the delay is occurring. Some popular examples of observability tools include DataDog, New Relic, and Splunk and open-source alternatives such as Prometheus and Grafana, which offer robust capabilities. Additionally, several tech companies build internal observability platforms by leveraging the flexibility and power of open-source tools like Prometheus and Grafana. Future of Observability and Generative AI As we look towards the future of observability in distributed systems, the applications of artificial intelligence (AI), and specifically generative AI, introduce innovative solutions that potentially simplify the lives of engineers, helping them focus on critical problems. Automated Pattern Recognition Generative AI shines in analyzing vast datasets and automatically recognizing abnormal patterns within them. This capability could save on-call engineers a lot of time as it can quickly identify issues, allowing them to focus on resolving problems rather than searching for the needle in the haystack. Cognitive Incident Response AI-powered systems can offer cognitive incident response by understanding the context of errors and suggesting diagnosis for the error based on past incidents. This capability allows for more intelligent alerting, alerting teams only for new and critical incidents and letting the observability tool take care of known issues. Enhanced Observability With AI Chatbot Picture a scenario where engineers on your team can simply ask for the data they need in everyday language, and AI-powered observability tools do the heavy lifting. These tools can sift through logs, metrics, and traces to deliver the answers you're looking for. For example, with Coralogix's Query Assistant, users can ask questions like "What metrics are available for each Redis instance?" and the system will not only understand the query but also present the information in an easy-to-digest dashboard or visualization. This level of interaction simplifies the debugging process for both engineers and those less familiar with complex query languages, making data exploration easier. Given the rapid advancements in the field of Artificial Intelligence and its integration into Observability tools, I’m super excited for what’s to come in the future. The future of observability, enriched by AI, promises not only a single source of truth for complex systems but also a smarter and more intuitive way for Engineers and other stakeholders to engage with data, driving better business outcomes and enabling a focus on creativity and critical incidents over routine tasks.

By Deepita Pai

Effective Log Data Analysis With Amazon CloudWatch: Harnessing Machine Learning

In today's cloud computing world, all types of logging data are extremely valuable. Logs can include a wide variety of data, including system events, transaction data, user activities, web browser logs, errors, and performance metrics. Managing logs efficiently is extremely important for organizations, but dealing with large volumes of data makes it challenging to detect anomalies and unusual patterns or predict potential issues before they become critical. Efficient log management strategies, such as implementing structured logging, using log aggregation tools, and applying machine learning for log analysis, are crucial for handling this data effectively. One of the latest advancements in effectively analyzing a large amount of logging data is Machine Learning (ML) powered analytics provided by Amazon CloudWatch. It is a brand new capability of CloudWatch. This innovative service is transforming the way organizations handle their log data. It offers a faster, more insightful, and automated log data analysis. This article specifically explores utilizing the machine learning-powered analytics of CloudWatch to overcome the challenges of effectively identifying hidden issues within the log data. Before deep diving into some of these features, let's have a quick refresher about Amazon CloudWatch. What Is Amazon CloudWatch? It is an AWS-native monitoring and observability service that offers a whole suite of capabilities: Monitoring: Tracks performance and operational health. Data collection: Gathers logs, metrics, and events, providing a comprehensive view of AWS resources. Unified operational view: Provides insights into applications running on AWS and on-premises servers. Challenges With Logs Data Analysis Volume of Data There's too much log data. In this modern era, applications emit a tremendous amount of log events. Log data can grow so rapidly that developers often find it difficult to identify issues within it; it is like finding a needle in a haystack. Change Identification Another common problem we have often seen is the fundamental problem of log analysis that goes back as long as logs have been around, identifying what has changed in your logs. Proactive Detection Proactive detection is another common challenge. It's great if you can utilize logs to dive in when an application's having an issue, find the root cause of that application issue, and fix it. But how do you know when those issues are occurring? How do you proactively detect them? Of course, you can implement metrics, alarms, etc., for the issues you know about. But there's always the problem of unknowns. So, we're often instrumenting observability and monitoring for past issues. Now, let's dive deep into the machine learning capabilities from CloudWatch that will help you overcome the challenges we have just discussed. Machine Learning Capabilities From CloudWatch Pattern Analysis Imagine you are troubleshooting a real-time distributed application accessed by millions of customers globally and generating a significant amount of application logs. Analyzing tens of thousands of log events manually is challenging, and it can take forever to find the root cause. That is where the new AWS CloudWatch machine learning-based capability can quickly help by grouping log events into patterns within the Logs Insight page of CloudWatch. It is much easier to identify through a limited number of patterns and quickly filter the ones that might be interesting or relevant based on the issue you are trying to troubleshoot. It also allows you to expand the specific pattern to look for the relevant events along with related patterns that might be pertinent. In simple words, Pattern Analysis is the automated grouping and categorization of your log events. Comparison Analysis How can we elevate pattern analysis to the next level? Now that we've seen how pattern analysis works let's see how we can extend this feature to perform comparison analysis. "Comparison Analysis" aims to solve the second challenge of identifying the log changes. Comparison analysis lets you effectively profile your logs using patterns from one time period and then compare them to the patterns extracted for another period and analyze the differences. This will help us answer this fundamental question of what changed to my logs. You can quickly compare your logs while your application's having an issue to a known healthy period. Any changes between two time periods are a strong indicator of the possible root cause of your problem. CloudWatch Logs Anomaly Detection Anomaly detection, in simple terms, is the process of identifying unusual patterns or behaviors in the logs that do not conform to expected norms. To use this feature, we need to first select the LogGroup for the application and enable CloudWatch Logs anomaly detection for it. At that point, CloudWatch will train a machine-learning model on the expected patterns and the volume of each pattern associated with your application. CloudWatch will take five minutes to train the model using logs from your application, and the feature will become active and automatically start servicing these anomalies any time they occur. So things like a brand new error message occurring that wasn't there before, a sudden spike in the volume, or if there's a spike in HTTP 400s are some examples that will result in an anomaly being generated for that. Generate Logs Insight Queries Using Generative AI With this capability, you can give natural language commands to filter log events, and CloudWatch can generate queries using Generative AI. If you are unfamiliar with CloudWatch query language or are from a non-technical background, you can easily use this feature to generate queries and filter logs. It's an iterative process; you need to learn precisely what you want from the first query. So you can update and iterate the query based on the results you see. Let's look at a couple of examples: Natural Language Prompt: "Check API Response Times" Auto-generated query by CloudWatch: In this query: fields @timestamp, @message selects the timestamp and message fields from your logs. | parse @message "Response Time: *" as responseTime parses the @message field to extract the value following the text "Response Time: " and labels it as responseTime. | stats avg(responseTime) calculates the average of the extracted responseTime values. Natural Language Prompt: "Please provide the duration of the ten invocations with the highest latency." Auto-generated query by CloudWatch In this query: fields @timestamp, @message, latency selects the @timestamp, @message, and latency fields from the logs. | stats max(latency) as maxLatency by @message computes the maximum latency value for each unique message. | sort maxLatency desc sorts the results in descending order based on the maximum latency, showing the highest values at the top. | limit 10 restricts the output to the top 10 results with the highest latency values. We can execute these queries in the CloudWatch “Logs Insights” query box to filter the log events from the application logs. These queries extract specific information from the logs, such as identifying errors, monitoring performance metrics, or tracking user activities. The query syntax might vary based on the particular log format and the information you seek. Conclusion CloudWatch's machine learning features offer a robust solution for managing the complexities of log data. These tools make log analysis more efficient and insightful, from automating pattern analysis to enabling anomaly detection. The addition of generative AI for query generation further democratizes access to these powerful insights.

By Rajat Gupta

Monitoring and Observability

DZone's Featured Monitoring and Observability Resources

Top Monitoring and Observability Experts

The Latest Monitoring and Observability Topics