Modern API Management
When assessing prominent topics across DZone — and the software engineering space more broadly — it simply felt incomplete to conduct research on the larger impacts of data and the cloud without talking about such a crucial component of modern software architectures: APIs. Communication is key in an era when applications and data capabilities are growing increasingly complex. Therefore, we set our sights on investigating the emerging ways in which data that would otherwise be isolated can better integrate with and work alongside other app components and across systems. For DZone's 2024 Modern API Management Trend Report, we focused our research specifically on APIs' growing influence across domains, prevalent paradigms and implementation techniques, security strategies, AI, and automation. Alongside observations from our original research, practicing tech professionals from the DZone Community contributed articles addressing key topics in the API space, including automated API generation via no- and low-code tools; communication architecture design among systems, APIs, and microservices; GraphQL vs. REST; and the role of APIs in the modern cloud-native landscape.
What Is a Message Broker? A message broker is an important component of asynchronous distributed systems. It acts as a bridge in the producer-consumer pattern: producers write messages to the broker, and consumers read messages from the broker. The broker handles queuing, routing, and delivery of messages. The diagram below shows how the broker is used in the producer-consumer pattern. This article discusses the popular brokers used today and when to use them. Simple Queue Service (SQS) Simple Queue Service (SQS) is a managed message queue service offered by Amazon Web Services (AWS). AWS fully manages the queue, making SQS an easy solution for passing messages between different components of software running on AWS infrastructure. The section below details what is and is not supported in SQS. Supported Pay for what you use: SQS only charges for the messages read from and written to the queue. There is no recurring or base charge for using SQS. Ease of setup: SQS is a fully managed AWS service, so no infrastructure setup is required to use it. Reading and writing are also simple, either through the REST APIs provided by SQS or through AWS Lambda functions. Support for FIFO queues: Besides regular standard queues, SQS also supports FIFO queues. For applications that need strict ordering of messages, FIFO queues come in handy. Scale: SQS scales elastically with the application, so there is no need to worry about capacity and pre-provisioning. There is no limit to the number of messages per queue, and queues offer nearly unlimited throughput. Queue for failed messages/dead-letter queue: All the messages that can't be processed are sent to a dead-letter queue. SQS takes care of moving messages into the dead-letter queue automatically, based on the retry configuration of the main queue. Not Supported Lack of message broadcast: With its "exactly once" delivery, SQS doesn't have a way for multiple consumers to retrieve the same message. For multiple-consumer use cases, SQS needs to be used along with AWS SNS, with multiple queues subscribed to the same SNS topic. Replay: SQS doesn't have the ability to replay old messages. Replay is sometimes required for debugging and testing. Kinesis Kinesis is another AWS offering. Kinesis streams enable large-scale data ingestion and real-time processing of streaming data. Like SQS, Kinesis is also a fully managed service. Below are details of what is and is not supported in Kinesis. Supported Ease of setup: Kinesis, like SQS, is a fully managed AWS service; no infrastructure setup is required. Message broadcast: Kinesis allows multiple consumers to read the same message from the stream concurrently. AWS integration: Kinesis integrates seamlessly with other AWS services. Replay: Kinesis allows messages to be replayed for up to seven days in the past, giving clients the ability to consume messages at a later time. Real-time analytics: Provides support for ingestion, processing, and analysis of large data streams in real time. Not Supported Strict message ordering: Kinesis supports in-order processing within a shard; however, it provides no ordering guarantee across shards. Lack of dead-letter queue: There is no support for a dead-letter queue out of the box. Every application that consumes the stream has to deal with failures on its own. Auto-scaling: Kinesis streams don't scale dynamically in response to demand.
Streams need to be provisioned ahead of time to meet the anticipated demand of both producers and consumers. Cost: For a large volume of data, pricing can be really high in comparison to other brokers. Kafka Kafka is a distributed event store and stream-processing platform. It is an open-source system developed by the Apache Software Foundation. Kafka is famous for its high throughput and scalability, and it excels in real-time analytics and monitoring. Below are details of what is and is not supported in Kafka. Supported Message broadcast: Kafka allows multiple consumers to read the same message from the stream. Replay: Kafka allows messages to be replayed from a specific point in a topic. The message retention policy decides how far back a message can be replayed. Unlimited message retention: Kafka allows unlimited message retention based on the retention policy configured. Real-time analytics: Provides support for ingestion, processing, and analysis of large data streams in real time. Open source: Kafka is an open project, which has resulted in widespread adoption and community support. It has lots of configuration options available, which gives the opportunity to fine-tune it for the specific use case. Not Supported Automated setup: Since Kafka is open source, developers need to set up the infrastructure and the Kafka cluster themselves. That said, most public cloud providers offer managed Kafka. Simple onboarding: For Kafka clusters that are not run through managed services, understanding the infrastructure can become a daunting task. Apache does provide lots of documentation, but it takes time for new developers to understand. Queue semantics: In the true sense, Kafka is a distributed immutable event log, not a queuing system. It does not inherently support distributing tasks to multiple workers so that each task is processed exactly once. Dynamic partitions: It is difficult to dynamically change the number of partitions in a Kafka topic. This limits the scalability of the system when the workload increases; a large number of partitions needs to be pre-provisioned to support the maximum load. Pulsar Pulsar is an open-source, distributed messaging and streaming platform developed by the Apache Software Foundation. It provides a highly scalable, flexible, and durable foundation for real-time data streaming and processing. Below are details of what is and is not supported in Pulsar. Supported Multi-tenancy: Pulsar supports multi-tenancy as a first-class citizen. It provides access control across data and actions using tenant policies. Seamless geo-replication: Pulsar synchronizes data across multiple regions without any third-party replication tools. Replay: Similar to Kafka, Pulsar allows messages to be replayed from a specific point in a topic. The message retention policy decides how far back a message can be replayed. Unlimited message retention: Similar to Kafka, Pulsar allows unlimited message retention based on the retention policy configured. Flexible models: Pulsar supports both streaming and queuing models. It provides strict message ordering within a partition. Not Supported Automated setup: Similar to Kafka, Pulsar is open source, and developers need to set up the infrastructure themselves. Robust ecosystem: Pulsar is relatively new compared to Kafka and doesn't have community support as large as Kafka's. Out-of-the-box integration: Pulsar lacks out-of-the-box integrations and support compared to Kafka and SQS.
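To make the producer-consumer pattern described above concrete, here is a minimal sketch of a producer and a consumer using SQS with the AWS SDK for Java v2. The queue URL is a placeholder and error handling is omitted; this is illustrative, not production code.
Java
import software.amazon.awssdk.services.sqs.SqsClient;
import software.amazon.awssdk.services.sqs.model.*;

public class SqsExample {
    public static void main(String[] args) {
        String queueUrl = "https://sqs.us-east-1.amazonaws.com/123456789012/demo-queue"; // placeholder
        try (SqsClient sqs = SqsClient.create()) {
            // Producer: write a message to the broker
            sqs.sendMessage(SendMessageRequest.builder()
                    .queueUrl(queueUrl)
                    .messageBody("order-created:42")
                    .build());

            // Consumer: read the message, then delete it so it is not redelivered
            ReceiveMessageResponse response = sqs.receiveMessage(ReceiveMessageRequest.builder()
                    .queueUrl(queueUrl)
                    .maxNumberOfMessages(1)
                    .build());
            for (Message message : response.messages()) {
                System.out.println("Received: " + message.body());
                sqs.deleteMessage(DeleteMessageRequest.builder()
                        .queueUrl(queueUrl)
                        .receiptHandle(message.receiptHandle())
                        .build());
            }
        }
    }
}
For contrast with a fully managed queue, the producer side of a self-managed Kafka setup might look like the sketch below; the broker address and topic name are assumptions for illustration.
Java
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;
import java.util.Properties;

public class KafkaProducerExample {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // assumed broker address
        props.put("key.serializer", StringSerializer.class.getName());
        props.put("value.serializer", StringSerializer.class.getName());

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // Records with the same key land on the same partition, preserving their relative order
            producer.send(new ProducerRecord<>("orders", "order-42", "order-created:42"));
        }
    }
}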
Conclusion Managed services require minimal maintenance effort, but non-managed services need regular, dedicated maintenance capacity. On the flip side, non-managed services provide better flexibility and tuning opportunities than managed services. In the end, choosing the right broker depends on the project's needs. Understanding the strengths and gaps of each broker helps developers make informed decisions.
Go through your code and follow the business logic. Whenever a question or doubt arises, there is potential for improvement. Your Code May Come back to You for Various Reasons The infrastructure, environment, or dependencies have evolved You want to reuse your code or logic in another context You need to introduce someone else or present your work before a wider audience The business requirements have changed Some improvements are needed There is a functional bug; etc. There are two, equally valid approaches here — either you fix the issue(s) with minimal effort and move on to the next task, or you take the chance to revisit what you have done, evaluate and possibly improve it, or even decide it is no longer needed, based on the experience and knowledge you have gained in the meantime. The big difference is that when you re-visit your code, you improve your skills as a side effect of doing your daily job. You may consider this a small investment that will pay for itself by increasing your efficiency in the future. A Few Examples Why did I do all this, where can I find the requirements? Developers often context switch between unrelated tasks — you can save time for onboarding yourself and others by maintaining better comments/documentation. A reference to a ticket could do the job, especially if there are multiple tickets. If possible, keep the requirements together with your code, otherwise try to summarize them. Hmm, this part is inefficient! In many cases this happens due to chasing deadlines, blindly copying code around, or not considering the real amount of data during development. You may find yourself retrieving the same data many times too. Writing efficient code always pays off by saving on iterations to improve performance. When you revisit your code, you may find that there are new and better ways to achieve the same goal. Oh, this is brittle — my assumptions may not hold in the future! "This will never happen" — you have heard it so many times at all levels of competence. No comment is needed here — a good reason why you should avoid writing brittle code is that you may want to reuse it in a different context. It's really hard to make no assumptions, but when you revisit your code, you should do your best to make as few assumptions as possible. Also consider that your code may run in different environments, where defaults and conventions may differ — never rely on things like date and number formats, order or completeness of data, availability of configuration or external services, etc. Oops, it is incomplete — it only covers a subset of the business requirements! You have no one to blame — this is your own code. Don't leave it incomplete, because it will come back to you and that always happens at the worst time possible. I'm lost following my own logic ... You definitely hit technical debt — and technical debt is immortal. As you develop professionally, you start doing things in more standard and widely recognized ways, so they are easier to maintain. It is quite tempting not to touch something that works. However, remember that, even if it works, it is only useable in the present context. Unreadable code is not reusable, not to mention it is hard to maintain. Fighting the technical debt pays by saving time and effort by allowing you to reuse code and logic. Uh, it's so big, it will take too much time to improve and I don't have enough time right now! Yet another type of technical debt. 
In a large and complex piece of code, some parts may appear unreachable in the actual context, making the code even less readable. This could be a problem, but nobody complained so far, so let's wait... Don't trust this line of thinking. The complaints will always come at the worst times. Summary Even when it isn't recognized by management or your peers, the effort of revisiting your own code makes you a better professional, which in turn gives you a better position on the market. Additionally, keeping your code clean and high-quality is satisfying, without the need for someone else's assessment — and being satisfied with your work is a good motivation to keep going. For myself, I would summarize all of the above in a single phrase — don't copy code but revisit it, especially if it's your own. It's like re-entering your new password when you change it — it can help you memorize it better, even if it's easier to copy and paste the same string twice. Nothing stops you from doing all this when developing new code too.
I recently read 6 Ways To Pass Parameters to Spring REST API. Though the title is a bit misleading, as it's unrelated to REST, it does an excellent job listing all ways to send parameters to a Spring application. I want to do the same for Apache APISIX; it's beneficial when you write a custom plugin. General Setup The general setup uses Docker Compose and static configuration. I'll have one plugin per way to pass parameters. YAML services: httpbin: image: kennethreitz/httpbin #1 apisix: image: apache/apisix:3.9.0-debian volumes: - ./apisix/conf/config.yml:/usr/local/apisix/conf/config.yaml:ro - ./apisix/conf/apisix.yml:/usr/local/apisix/conf/apisix.yaml:ro #2 - ./apisix/plugins:/opt/apisix/plugins:ro #3 ports: - "9080:9080" Local httpbin for more reliable results and less outbound network traffic Static configuration file Plugins folder, one file per plugin YAML deployment: role: data_plane role_data_plane: config_provider: yaml #1 apisix: extra_lua_path: /opt/?.lua #2 plugins: - proxy-rewrite #3 - path-variables #4 # ... Set static configuration Use every Lua file under /opt/apisix/plugins as a plugin Regular plugin Custom plugin, one per alternative Path Variables Path variables are a straightforward way to pass data. Their main issue is that they are limited to simple values, e.g., /links/{n}/{offset}. The naive approach is to write the following Lua code: Lua local core = require("apisix.core") function _M.access(_, ctx) local captures, _ = ngx.re.match(ctx.var.uri, '/path/(.*)/(.*)') --1-2 for k, v in pairs(captures) do core.log.warn('Order-Value pair: ', k, '=', v) end end APISIX stores the URI in ctx.var.uri Nginx offers a regular expression API Let's try: Shell curl localhost:9080/path/15/3 The log displays: Plain Text Order-Value pair: 0=/path/15/3 Order-Value pair: 1=15 Order-Value pair: 2=3 I didn't manage errors, though. Alternatively, we can rely on Apache APISIX features: a specific router. The default router, radixtree_host_uri, uses both the host and the URI to match requests. radixtree_uri_with_parameter lets go of the host part but also matches parameters. YAML apisix: extra_lua_path: /opt/?.lua router: http: radixtree_uri_with_parameter We need to update the route: YAML routes: - path-variables - uri: /path/:n/:offset #1 upstream_id: 1 plugins: path-variables: ~ Store n and offset in the context, under ctx.curr_req_matched We keep the plugin just to log the path variables: Lua function _M.access(_, ctx) core.log.warn('n: ', ctx.curr_req_matched.n, ', offset: ', ctx.curr_req_matched.offset) end The result is as expected with the same request as above: Plain Text n: 15, offset: 3 Query Parameters Query parameters are another regular way to pass data. Like path variables, they can only carry simple values, e.g., /?foo=bar. The Lua code doesn't require regular expressions: Lua local core = require("apisix.core") function _M.access(_, _) local args, _ = ngx.req.get_uri_args() for k, v in pairs(args) do core.log.warn('Key-Value pair: ', k, '=', v) end end Let's try: Shell curl localhost:9080/query\?foo=one\&bar=three The log displays: Plain Text Key-Value pair: bar=three Key-Value pair: foo=one Remember that query parameters have no order. Our code contains an issue, though. The ngx.req.get_uri_args() function accepts an optional parameter. Remember that the client can pass a query parameter multiple times with different values, e.g., ?foo=one&foo=two? The optional parameter is the maximum number of arguments parsed from the query string; arguments beyond that limit are silently discarded.
To avoid silently dropping arguments, we should set it to 0, i.e., unbounded. Since every plugin designer must remember it, we can add the result to the context for other plugins down the chain. The updated code looks like this: Lua local core = require("apisix.core") function _M.get_uri_args(ctx) if not ctx then ctx = ngx.ctx.api_ctx end if not ctx.req_uri_args then local args, _ = ngx.req.get_uri_args(0) ctx.req_uri_args = args end return ctx.req_uri_args end function _M.access(_, ctx) for k, v in pairs(ctx.req_uri_args) do core.log.warn('Key-Value pair: ', k, '=', v) end end Request Headers Request headers are another way to pass parameters. While they generally only contain simple values, you can also use them to send structured values, e.g., JSON. Depending on your requirements, APISIX can list all request headers or return a specific one. Here, I get all of them: Lua local core = require("apisix.core") function _M.access(_, _) local headers = core.request.headers() for k, v in pairs(headers) do core.log.warn('Key-Value pair: ', k, '=', v) end end We test with a simple request: Shell curl -H 'foo: 1' -H 'bar: two' localhost:9080/headers And we get more than we expected because curl adds default headers: Plain Text Key-Value pair: user-agent=curl/8.4.0 Key-Value pair: bar=two Key-Value pair: foo=1 Key-Value pair: host=localhost:9080 Key-Value pair: accept=*/* Request Body Setting a request body is the usual way to send structured data, e.g., JSON. Nginx offers a simple API to collect such data. Lua local core = require("apisix.core") function _M.access(_, _) local args = core.request.get_post_args() --1 local body = next(args, nil) --2 core.log.warn('Body: ', body) end Access the body as a regular Lua table A table is necessary in case of multipart payloads, e.g., file uploads. Here, we assume there's a single arg, the content body. It's time to test: Shell curl localhost:9080/body -X POST -d '{ "foo": 1, "bar": { "baz": "two" } }' The result is as expected: JSON Body: { "foo": 1, "bar": { "baz": "two" } } Cookies Last but not least, we can send parameters via cookies. The difference from the previous alternatives is that cookies persist on the client side, and the browser sends them with each request. On the Lua side, we need to know the cookie name instead of listing all query parameters or headers. Lua local core = require("apisix.core") function _M.access(_, ctx) local foo = ctx.var.cookie_foo --1 core.log.warn('Cookie value: ', foo) end The cookie is named foo and is case-insensitive Let's test: Shell curl --cookie "foo=Bar" localhost:9080/cookies The result is correct: Plain Text Cookie value: Bar Summary In this post, we listed five alternatives for passing parameters to the server and explained how to access them in Apache APISIX. Here's the API summary: Alternative Source API Path variable APISIX Router Use the radixtree_uri_with_parameter router Query parameter Nginx ngx.req.get_uri_args(0) Request header APISIX core lib core.request.headers() Request body APISIX core lib core.request.get_post_args() Cookie Method context parameter ctx.var.cookie_ Thanks a lot to Zeping Bai for his review and explanations. The complete source code for this post can be found on GitHub. To Go Further 6 Ways To Pass Parameters to Spring REST API How to Build an Apache APISIX Plugin From 0 to 1?
Here, I'd like to talk you through three Java katas, ranging from the simplest to the most complex. These exercises should help you gain experience working with JDK tools such as javac, java, and jar. By doing them, you'll get a good understanding of what goes on behind the scenes of your favorite IDE or build tools like Maven, Gradle, etc. None of this denies the benefits of an IDE. But to be truly skilled at your craft, understand your essential tools and don’t let them get rusty. - Gail Ollis, "Don’t hIDE Your Tools" Getting Started The source code can be found in the GitHub repository. All commands in the exercises below are executed inside a Docker container to avoid any particularities related to a specific environment. Thus, to get started, clone the repository and run the command below from its java-javac-kata folder: Shell docker run --rm -it --name java_kata -v .:/java-javac-kata --entrypoint /bin/bash maven:3.9.6-amazoncorretto-17-debian Kata 1: "Hello, World!" Warm Up In this kata, we will be dealing with a simple Java application without any third-party dependencies. Let's navigate to the /class-path-part/kata-one-hello-world-warm-up folder and have a look at the directory structure. Within this directory, we can see the Java project structure and two classes in the com.example.kata.one package. Compilation Shell javac -d ./target/classes $(find -name '*.java') The compiled Java classes should appear in the target/classes folder, as shown in the screenshot above. Try using the verbose option to see more details about the compilation process in the console output: Shell javac -verbose -d ./target/classes $(find -name '*.java') With that covered, let's jump into the execution part. Execution Shell java --class-path "./target/classes" com.example.kata.one.Main As a result, you should see Hello World! in your console. Try using the different -verbose:[class|gc|jni] options to get more details on the execution process: Shell java -verbose:class --class-path "./target/classes" com.example.kata.one.Main As an extra step, it's worth trying to remove classes or rename packages to see what happens during both the compilation and execution stages. This will give you a better understanding of which problems result in particular errors. Packaging Building Jar Shell jar --create --file ./target/hello-world-warm-up.jar -C target/classes/ . The built jar is placed in the target folder. Don't forget to use the verbose option as well to see more details: Shell jar --verbose --create --file ./target/hello-world-warm-up.jar -C target/classes/ . You can view the structure of the built jar using the following command: Shell jar -tf ./target/hello-world-warm-up.jar With that, let's proceed to run it: Shell java --class-path "./target/hello-world-warm-up.jar" com.example.kata.one.Main Building Executable Jar To build an executable jar, the main class must be specified: Shell jar --create --file ./target/hello-world-warm-up.jar --main-class=com.example.kata.one.Main -C target/classes/ . It can then be run via the -jar option: Shell java -jar ./target/hello-world-warm-up.jar Kata 2: Third-Party Dependency In this kata, you will follow the same steps as in the previous one. The main difference is that our Hello World! application uses guava-30.1-jre.jar as a third-party dependency. Also, remember to use the verbose option to get more details. So, without further ado, let's get to the /class-path-part/kata-two-third-party-dependency folder and check out the directory's structure.
Compilation Shell javac --class-path "./lib/*" -d ./target/classes/ $(find -name '*.java') The class-path option is used to specify the path to the lib folder where our dependency is stored. Execution Shell java --class-path "./target/classes:./lib/*" com.example.kata.two.Main Packaging Building Jar Shell jar --create --file ./target/third-party-dependency.jar -C target/classes/ . And let us run it: Shell java --class-path "./target/third-party-dependency.jar:./lib/*" com.example.kata.two.Main Building Executable Jar Our first step here is to create a MANIFEST.FM file with the Class-Path specified: Shell echo 'Class-Path: ../lib/guava-30.1-jre.jar' > ./target/MANIFEST.FM Next up, we build a jar with the --manifest option provided: Shell jar --create \ --file ./target/third-party-dependency.jar \ --main-class=com.example.kata.two.Main \ --manifest=./target/MANIFEST.FM \ -C target/classes/ . Finally, we execute it: Shell java -jar ./target/third-party-dependency.jar Building Fat Jar First of all, we need to unpack our guava-30.1-jre.jar into the ./target/classes/ folder (be patient, this can take some time): Shell cp lib/guava-30.1-jre.jar ./target/classes && \ cd ./target/classes && \ jar xf guava-30.1-jre.jar && \ rm ./guava-30.1-jre.jar && \ rm -r ./META-INF && \ cd ../../ With all the necessary classes in the ./target/classes folder, we can build our fat jar (again, be patient as this can take some time): Shell jar --create --file ./target/third-party-dependency-fat.jar --main-class=com.example.kata.two.Main -C target/classes/ . Now, we can run our built jar: Shell java -jar ./target/third-party-dependency-fat.jar Kata 3: Spring Boot Application Conquest In the /class-path-part/kata-three-spring-boot-app-conquest folder, you will find a Maven project for a simple Spring Boot application. The main goal here is to apply everything that we have learned so far to manage all its dependencies and run the application, including its test code. As a starting point, let's run the following command: Shell mvn clean package && \ find ./target/ -mindepth 1 ! -regex '^./target/lib\(/.*\)?' -delete This will leave only the source code and download all necessary dependencies into the ./target/lib folder. Compilation Shell javac --class-path "./target/lib/compile/*" -d ./target/classes/ $(find -P ./src/main/ -name '*.java') Execution Shell java --class-path "./target/classes:./target/lib/compile/*" com.example.kata.three.Main As an extra step for both compilation and execution, you can try specifying all necessary dependencies explicitly in the class-path. This will help you understand that not all artifacts in the ./target/lib/compile folder are needed to do that. Packaging Let's package our compiled code as a jar and try to run it. It won't be a Spring Boot jar because Spring Boot uses a non-standard approach to build fat jars, including its own class loader. See the documentation on The Executable Jar Format for more details. In this exercise, we will package our source code as we did before to demonstrate that everything can work in the same way with Spring Boot, too. Shell jar --create --file ./target/spring-boot-app-conquest.jar -C target/classes/ .
Now, let's run it to verify that it works: Shell java --class-path "./target/spring-boot-app-conquest.jar:./target/lib/compile/*" com.example.kata.three.Main Test Compilation Shell javac --class-path "./target/classes:./target/lib/test/*:./target/lib/compile/*" -d ./target/test-classes/ $(find -P ./src/test/ -name '*.java') Take notice that this time we are searching for source files in the ./src/test/ directory, and both the application source code and test dependencies are added to the class-path. Test Execution To be able to run code via java, we need an entry point (a class with the main method). Traditionally, tests are run via a Maven plugin or by an IDE, which have their own launchers to make this process comfortable for developers. To demonstrate test execution, the junit-platform-console-standalone dependency, which includes the org.junit.platform.console.ConsoleLauncher with the main method, is added to our pom.xml. Its artifact can also be seen in the ./target/lib/test/* folder. Shell java --class-path "./target/classes:./target/test-classes:./target/lib/compile/*:./target/lib/test/*" \ org.junit.platform.console.ConsoleLauncher execute --scan-classpath --disable-ansi-colors Wrapping Up Gail's article, "Don’t hIDE Your Tools" quoted at the very beginning of this article, taken from 97 Things Every Java Programmer Should Know by Kevlin Henney and Trisha Gee, inspired me to start thinking in this direction and eventually led to the creation of this post. Hopefully, by doing these katas and not just reading them, you have developed a better understanding of how the essential JDK tools work.
In any microservice, managing database interactions with precision is crucial for maintaining application performance and reliability. Weird database connection issues often surface during performance testing. Recently, a critical issue surfaced within the repository layer of a Spring microservice application, where improper exception handling led to unexpected failures and service disruptions during performance testing. This article delves into the specifics of the issue and also highlights the pivotal role of the @Transactional annotation, which remedied the issue. Spring microservice applications rely heavily on stable and efficient database interactions, often managed through the Java Persistence API (JPA). Properly managing database connections, particularly preventing connection leaks, is critical to ensuring these interactions do not negatively impact application performance. Issue Background During a recent round of performance testing, a critical issue emerged within one of our essential microservices, which was designated for sending client communications. This service began to experience repeated Gateway time-out errors. The underlying problem was rooted in our database operations at the repository layer. An investigation into these time-out errors revealed that a stored procedure was consistently failing. The failure was triggered by an invalid parameter passed to the procedure, which raised a business exception from the stored procedure. The repository layer did not handle this exception effectively; it simply bubbled up. Below is the source code for the stored procedure call: Java public long createInboxMessage(String notifCode, String acctId, String userId, String s3KeyName, List<Notif> notifList, String attributes, String notifTitle, String notifSubject, String notifPreviewText, String contentType, boolean doNotDelete, boolean isLetter, String groupId) throws EDeliveryException { try { StoredProcedureQuery query = entityManager.createStoredProcedureQuery("p_create_notification"); DbUtility.setParameter(query, "v_notif_code", notifCode); DbUtility.setParameter(query, "v_user_uuid", userId); DbUtility.setNullParameter(query, "v_user_id", Integer.class); DbUtility.setParameter(query, "v_acct_id", acctId); DbUtility.setParameter(query, "v_message_url", s3KeyName); DbUtility.setParameter(query, "v_ecomm_attributes", attributes); DbUtility.setParameter(query, "v_notif_title", notifTitle); DbUtility.setParameter(query, "v_notif_subject", notifSubject); DbUtility.setParameter(query, "v_notif_preview_text", notifPreviewText); DbUtility.setParameter(query, "v_content_type", contentType); DbUtility.setParameter(query, "v_do_not_delete", doNotDelete); DbUtility.setParameter(query, "v_hard_copy_comm", isLetter); DbUtility.setParameter(query, "v_group_id", groupId); DbUtility.setOutParameter(query, "v_notif_id", BigInteger.class); query.execute(); BigInteger notifId = (BigInteger) query.getOutputParameterValue("v_notif_id"); return notifId.longValue(); } catch (PersistenceException ex) { logger.error("DbRepository::createInboxMessage - Error creating notification", ex); throw new EDeliveryException(ex.getMessage(), ex); } } Issue Analysis As illustrated in our scenario, when a stored procedure encountered an error, the resulting exception would propagate upward from the repository layer to the service layer and finally to the controller. This propagation was problematic, causing our API to respond with non-200 HTTP status codes—typically 500 or 400.
Following several such incidents, the service container reached a point where it could no longer handle incoming requests, ultimately resulting in a 502 Gateway Timeout error. This critical state was reflected in our monitoring systems, with Kibana logs indicating the issue: `HikariPool-1 - Connection is not available, request timed out after 30000ms.` The issue was improper exception handling, as exceptions bubbled up through the system layers without being properly managed. This prevented the release of database connections back into the connection pool, leading to the depletion of available connections. Consequently, after exhausting all connections, the container was unable to process new requests, resulting in the error reported in the Kibana logs and a non-200 HTTP error. Resolution To resolve this issue, we could handle the exception gracefully and not let it bubble up further, letting JPA and the Spring context release the connection back to the pool. Another alternative is to use the @Transactional annotation on the method. Below is the same method with the annotation: Java @Transactional public long createInboxMessage(String notifCode, String acctId, String userId, String s3KeyName, List<Notif> notifList, String attributes, String notifTitle, String notifSubject, String notifPreviewText, String contentType, boolean doNotDelete, boolean isLetter, String groupId) throws EDeliveryException { ……… } The implementation of the method below demonstrates an approach to exception handling that prevents exceptions from propagating further up the stack by catching and logging them within the method itself: Java public long createInboxMessage(String notifCode, String acctId, String userId, String s3KeyName, List<Notif> notifList, String attributes, String notifTitle, String notifSubject, String notifPreviewText, String contentType, boolean doNotDelete, boolean isLetter, String loanGroupId) { try { ....... query.execute(); BigInteger notifId = (BigInteger) query.getOutputParameterValue("v_notif_id"); return notifId.longValue(); } catch (PersistenceException ex) { logger.error("DbRepository::createInboxMessage - Error creating notification", ex); } return -1; } With @Transactional The @Transactional annotation in the Spring framework manages transaction boundaries. It begins a transaction when the annotated method starts and commits or rolls it back when the method completes. When an exception occurs, @Transactional ensures that the transaction is rolled back, which helps appropriately release database connections back to the connection pool. Without @Transactional If a repository method that calls a stored procedure is not annotated with @Transactional, Spring does not manage the transaction boundaries for that method. The transaction handling must be manually implemented if the stored procedure throws an exception. If not properly managed, this can result in the database connection not being closed and not being returned to the pool, leading to a connection leak. Best Practices Always use @Transactional when the method's operations should be executed within a transaction scope. This is especially important for operations involving stored procedures that can modify the database state. Ensure exception handling within the method includes proper transaction rollback and closing of any database connections, especially when not using @Transactional. Conclusion Effective transaction management is pivotal in maintaining the health and performance of Spring microservice applications using JPA.
By employing the @Transactional annotation, we can safeguard against connection leaks and ensure that database interactions do not degrade application performance or stability. Adhering to these guidelines can enhance the reliability and efficiency of our Spring Microservices, providing stable and responsive services to the consuming applications or end users.
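One practical safeguard against the pool exhaustion described above is HikariCP's built-in leak detection, which logs a warning with a stack trace whenever a connection is held longer than a configured threshold. The property below is a standard Spring Boot/HikariCP setting; the 30-second value is only an illustrative assumption and should be tuned to your workload.
application.properties
# Warn (with a stack trace) when a connection is borrowed for longer than 30 seconds
spring.datasource.hikari.leak-detection-threshold=30000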
This article is part of a series called “Mastering Object-Oriented Design Patterns.” The collection consists of four articles and aims to provide profound guidance on object-oriented design patterns. The articles address the introduction to design patterns, their origins, and the advantages of their use. In addition, the tutorial series provides full explanations of the common design patterns. Every article starts with real-life analogies, discusses the pros and cons of each pattern, and provides a Java example implementation. Once you find the title, “Mastering Object-Oriented Design Patterns,” you can explore the whole series and master object-oriented design patterns. Once upon a time, there was a new notion called “design patterns” in software engineering. This concept has revolutionized how developers approach complex software design. Design patterns are verified solutions to frequently encountered problems. However, where did this idea originate, and how did it significantly contribute to object-oriented programming? Origin of Design Patterns Design patterns first appeared in architecture, not in software. An architect and design theorist, Christopher Alexander, introduced the idea in his influential work, “A Pattern Language: Towns, Buildings, Construction.” Alexander sought to develop a pattern language to solve recurring spatial and communal problems in cities. These patterns included several details, such as window heights and the organization of green zones within neighborhoods. This set the ground for a design approach focused on reusable solutions to recurring problems. Captivated by Alexander’s concept, a group of four software engineers (Erich Gamma, Richard Helm, Ralph Johnson, and John Vlissides), also known as the Gang of Four (GoF), recognized the potential of using this concept in software development. In 1994, they published the book “Design Patterns: Elements of Reusable Object-Oriented Software,” which translated the pattern language of architecture into the world of object-oriented programming (OOP). This seminal publication presented twenty-three design patterns targeted at addressing typical design issues. It soon became a best-seller and a vital tool in software engineering instruction. Introduction to Design Patterns What Are Design Patterns? Design patterns are not recipes but recommendations and tips for solving typical design problems. They are a pool of bright ideas and experiences of the software development community. These patterns assist the developers in building flexible, low-maintenance, and reusable code. Design patterns provide a common language and methodology for solving design problems, simplifying collaboration among developers and speeding up the development process. Picture this: making software is like assembling a puzzle, except that you keep being handed the same piece. Design patterns are your map indicating how you can fit those pieces every time. Design patterns are helpful techniques for resolving common coding issues. They can be understood as a set of coding challenge cookbooks. Rather than giving you ready-made code snippets, they present ways to solve particular problems in your projects. The purpose of design patterns is to reduce coding complexities, help you solve problems faster, and keep your code as flexible as possible for the future. Design Patterns vs. Algorithms Both algorithms and design patterns provide solutions, but an algorithm is a sequence of steps to reach a goal, just like a cooking recipe.
On the other hand, a design pattern is more of a template, or a blueprint. It provides the layout and major components of the solution but does not specify the building details; consequently, it is flexible in how the solution is implemented in your project. Inside a Design Pattern A design pattern typically includes: Intent: What the pattern does and what it solves. Motivation: The reason and the way it can help. Structure of classes: A schematic indicating how its parts communicate. Code example: Commonly made available in popular programming languages to facilitate comprehension. Some will also address when to use the pattern, how to apply it, and its interaction with other patterns, leaving you with a complete toolset for more innovative coding. Why Use Design Patterns? Design patterns in coding are a kind of secret toolset. They make solving common problems easier, and here’s why embracing design patterns can be a game-changer: Proven and ready-to-use solutions: Imagine owning a treasure chest of brilliant hacks already worked out by professional coders. That’s what design patterns are—several clever, immediately applicable, professional-quality solutions that allow you to solve problems quickly and correctly. Simplifying complexity: Any great software is minimalistic in a sense. Design patterns assist you in splitting large and daunting problems into small and manageable chunks, thus making your code neater and your life simpler. Big picture focus: Design patterns allow you to spend less time on code structure and more time on doing cool stuff. This lets you concentrate more on producing great features rather than struggling with the fundamentals. Common language: Design patterns provide developers with a common language, so when you say, “Let’s use a Singleton here,” everyone gets it. This leads to more efficient work and less confusion. Reusability and maintainability: Design patterns encourage code reuse via inheritance and interfaces, which keeps classes adaptable and systems easy to maintain. This method shortens development cycles and keeps systems robust over time. Improved scalability and flexibility: The MVC pattern, for instance, allows for a more defined separation of the different parts of your code, making your system more flexible and able to grow with little adjustment. Boosted readability and understandability: Properly implemented design patterns increase the readability and understandability of your code, making it easier for other people to understand and contribute without too much explanation. In a nutshell, design patterns are all about making coding more comfortable, efficient, and even entertaining. They enable you to work on extension rather than invention, which allows you to improve the software without reinventing the wheel. Navigating the Tricky Side of Design Patterns Design patterns are secret ingredients that make writing code more accessible and practical. But they are not ideal.
Here are a couple of things to be aware of: Not suitable for every programming language: A design pattern may sometimes be unnecessary in a particular language. For instance, a complex pattern may be redundant if the language has a simple feature that can do the job. It is just like employing a sophisticated instrument when a simple one is sufficient. Being too rigid with patterns: Although design patterns are derived from best practices, strict adherence to them may cause undesirable behavior. It’s similar to sticking to a recipe so rigidly that you never adjust it to your taste. At times, you need to modify a pattern to suit the particular requirements of your project. Overusing patterns: It is pretty simple to lose control and believe that every problem can be addressed through a design pattern. Yet, not all problems need a pattern. It is akin to using a hammer for all tasks when, at times, a screwdriver is sufficient. Adding unnecessary complexity: Design patterns can also introduce complexity to your code. If not handled with care, they can complicate your project. How To Avoid the Pitfalls Despite these troubles, design patterns are still quite helpful. The key is to use them wisely: Choose the appropriate tool for the task: Not all problems need a design pattern. Sometimes, simpler is better. Adapt and customize: Never be afraid to adjust a pattern to make it suit you better. Keep it simple: Do not make your code more complicated by using patterns that are not required. In summary, design patterns are similar to spices in cooking: applied correctly, they can improve your dish (or project). Yet, it’s necessary to employ them in moderation and not let them overpower the food. Types of Design Patterns Design patterns are beneficial methods applied in software design. They facilitate code organization and management during the development and maintenance of applications. Regard them as clever construction techniques and improvements to your software projects. Let’s quickly check out the three main types: Creational Patterns: Building Blocks Creational patterns are equivalent to picking up the right LEGO blocks to begin your model building. Their attention is directed to simplifying the process of creating objects or groups of objects. This way, you can build up the software flexibly and efficiently, as if picking out the LEGO pieces that fit your design. Structural Patterns: Putting It All Together Structural patterns are all about how you assemble your LEGO bricks. They help you arrange the pieces (or objects) into more significant structures, with everything neat and well-arranged. It is akin to following a LEGO manual to guarantee your spaceship or castle will be sturdy and neat. Behavioral Patterns: Making It Work Behavioral patterns are about making your LEGO creation do extraordinary things. For instance, think about making the wings of your LEGO spaceship move. In software, these patterns enable various program components to interact and cooperate, ensuring everything functions as intended. Design patterns can be as simple as idioms that apply only within one programming language or as complicated as architectural patterns that shape the entire application. They are tools in your toolkit, available for a small function as well as for the software’s overall structure. Comprehending these patterns is like learning the tricks of constructing the most incredible LEGO sets.
They make you a software genius; all your coding will seem relaxed and fun! Conclusion Our first module is finally over. It has been a fantastic trip into the principles behind design patterns and how the patterns are leveraged in software engineering. Understanding the concept of design patterns and their role in software engineering has been fascinating. Design patterns are not merely coding shortcuts but crystallized wisdom that provides reusable solutions for typical design issues. They simplify the object-oriented programming process and make it work faster, thus creating cleaner code. On the other hand, they are not without pitfalls. We have pointed out that it is essential to know when and how to use them appropriately. In closing this chapter, we invite you to browse the other parts of the “Mastering Object-Oriented Design Patterns” series. Each part reinforces your comprehension and skill, making you more confident when applying design patterns to your projects. Whether you want to develop your architectural skills, speed up your development process, or improve the quality of your code, this series is here to help you. References Design Patterns: Elements of Reusable Object-Oriented Software Head First Design Patterns
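To make the "code example" component of a pattern write-up concrete, here is a minimal Java sketch of the Singleton pattern mentioned earlier. It is illustrative only, not taken from the series, and the class name is arbitrary.
Java
public final class Configuration {
    // Initialization-on-demand holder: the JVM creates INSTANCE lazily and thread-safely
    private static class Holder {
        private static final Configuration INSTANCE = new Configuration();
    }

    private Configuration() {
        // Private constructor prevents instantiation from outside the class
    }

    public static Configuration getInstance() {
        return Holder.INSTANCE;
    }
}
The nested holder class keeps initialization lazy and thread-safe without explicit locking; an enum-based Singleton is a common alternative.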
Flyway is a popular open-source tool for managing database migrations. It makes it easy to manage and version control the database schema for your application. Flyway supports almost all popular databases, including Oracle, SQL Server, DB2, MySQL, Amazon RDS, Aurora MySQL, MariaDB, PostgreSQL, and more. For the full list of supported databases, you can check the official documentation here. How Flyway Migrations Work Any change to the database is called a migration. Flyway supports two types of migrations: versioned and repeatable migrations. Versioned migrations are the most common type; they are applied exactly once, in the order they appear. Versioned migrations are used for creating, altering, and dropping tables, indexes, or foreign keys. Versioned migration files follow the naming convention [Prefix][Separator][Migration Description][Suffix], for example, V1__add_user_table.sql and V2__alter_user_table.sql. Repeatable migrations, on the other hand, are (re-)applied every time they change. Repeatable migrations are useful for managing views, stored procedures, or bulk reference data updates where the latest version should replace the previous one without considering versioning. Repeatable migrations are always applied last, after all pending versioned migrations have been executed. Repeatable migration files use naming conventions such as R__add_new_table.sql. The migration scripts can be written in either SQL or Java. When we start the application against an empty database, Flyway first creates a schema history table (flyway_schema_history). This table is used to track the state of the database. After the flyway_schema_history table is created, Flyway scans the classpath for the migration files. The migrations are then sorted based on their version number and applied in order. As each migration gets applied, the schema history table is updated accordingly. Integrating Flyway in Spring Boot In this tutorial, we will create a Spring Boot application to deal with MySQL 8 database migrations using Flyway. This example uses Java 17, Spring Boot 3.2.4, and MySQL 8.0.26. For the database operations, we will use Spring Data JPA. Install Flyway Dependencies First, add the following dependencies to your pom.xml or your build.gradle file. The spring-boot-starter-data-jpa dependency is used for using Spring Data Java Persistence API (JPA) with Hibernate. The mysql-connector-j is the official JDBC driver for MySQL databases. It allows your Java application to connect to a MySQL database for operations such as creating, reading, updating, and deleting records. The flyway-core dependency is essential for integrating Flyway into your project, enabling migrations and version control for your database schema. The flyway-mysql dependency adds Flyway support for MySQL databases. It provides MySQL-specific functionality and optimizations for Flyway operations. It's necessary when your application uses Flyway for managing database migrations on a MySQL database.
pom.xml XML <dependencies> <dependency> <groupId>org.springframework.boot</groupId> <artifactId>spring-boot-starter-data-jpa</artifactId> </dependency> <dependency> <groupId>com.mysql</groupId> <artifactId>mysql-connector-j</artifactId> <scope>runtime</scope> </dependency> <dependency> <groupId>org.flywaydb</groupId> <artifactId>flyway-core</artifactId> </dependency> <dependency> <groupId>org.flywaydb</groupId> <artifactId>flyway-mysql</artifactId> </dependency> <!-- Other dependencies--> </dependencies> Configure the Database Connection Now let us provide the database connection properties in your application.properties file. # DB properties spring.datasource.url=jdbc:mysql://localhost:3306/flyway_demo spring.datasource.username=root spring.datasource.password=Passw0rd spring.datasource.driver-class-name=com.mysql.cj.jdbc.Driver #JPA spring.jpa.show-sql=true Create Database Changelog Files Let us now create a couple of database migration schema files inside the resources/db/migrations directory. V1__add_movies_table SQL CREATE TABLE movie ( id bigint NOT NULL AUTO_INCREMENT, title varchar(255) DEFAULT NULL, headline varchar(255) DEFAULT NULL, language varchar(255) DEFAULT NULL, region varchar(255) DEFAULT NULL, thumbnail varchar(255) DEFAULT NULL, rating enum('G','PG','PG13','R','NC17') DEFAULT NULL, PRIMARY KEY (id) ) ENGINE=InnoDB; V2__add_actor_table.sql SQL CREATE TABLE actor ( id bigint NOT NULL AUTO_INCREMENT, first_name varchar(255) DEFAULT NULL, last_name varchar(255) DEFAULT NULL, PRIMARY KEY (id) ) ENGINE=InnoDB; V3__add_movie_actor_relations.sql SQL CREATE TABLE movie_actors ( actors_id bigint NOT NULL, movie_id bigint NOT NULL, PRIMARY KEY (actors_id, movie_id), KEY fk_ref_movie (movie_id), CONSTRAINT fk_ref_movie FOREIGN KEY (movie_id) REFERENCES movie (id), CONSTRAINT fl_ref_actor FOREIGN KEY (actors_id) REFERENCES actor (id) ) ENGINE=InnoDB; R__create_or_replace_movie_view.sql SQL CREATE OR REPLACE VIEW movie_view AS SELECT id, title FROM movie; V4__insert_test_data.sql SQL INSERT INTO movie (title, headline, language, region, thumbnail, rating) VALUES ('Inception', 'A thief who steals corporate secrets through the use of dream-sharing technology.', 'English', 'USA', 'inception.jpg', 'PG13'), ('The Godfather', 'The aging patriarch of an organized crime dynasty transfers control of his clandestine empire to his reluctant son.', 'English', 'USA', 'godfather.jpg', 'R'), ('Parasite', 'A poor family, the Kims, con their way into becoming the servants of a rich family, the Parks. But their easy life gets complicated when their deception is threatened with exposure.', 'Korean', 'South Korea', 'parasite.jpg', 'R'), ('Amélie', 'Amélie is an innocent and naive girl in Paris with her own sense of justice. She decides to help those around her and, along the way, discovers love.', 'French', 'France', 'amelie.jpg', 'R'); -- Inserting data into the 'actor' table INSERT INTO actor (first_name, last_name) VALUES ('Leonardo', 'DiCaprio'), ('Al', 'Pacino'), ('Song', 'Kang-ho'), ('Audrey', 'Tautou'); -- Leonardo DiCaprio in Inception INSERT INTO movie_actors (actors_id, movie_id) VALUES (1, 1); -- Al Pacino in The Godfather INSERT INTO movie_actors (actors_id, movie_id) VALUES (2, 2); -- Song Kang-ho in Parasite INSERT INTO movie_actors (actors_id, movie_id) VALUES (3, 3); -- Audrey Tautou in Amélie INSERT INTO movie_actors (actors_id, movie_id) VALUES (4, 4); These tables are mapped to the following entity classes. 
Movie.java Java @Entity @Data public class Movie { @Id @GeneratedValue(strategy = GenerationType.IDENTITY) private Long id; private String title; private String headline; private String thumbnail; private String language; private String region; @Enumerated(EnumType.STRING) private ContentRating rating; @ManyToMany Set<Actor> actors; } public enum ContentRating { G, PG, PG13, R, NC17 } Actor.java Java @Entity @Data public class Actor { @Id @GeneratedValue(strategy = GenerationType.IDENTITY) Long id; String firstName; String lastName; } Configure Flyway We can control the migration process using the following properties in the application.properties file: application.properties spring.flyway.enabled=true spring.flyway.locations=classpath:db/migrations spring.flyway.baseline-on-migrate=true spring.flyway.validate-on-migrate=true Property Use spring.flyway.enabled=true Enables or disables Flyway's migration functionality for your application. spring.flyway.validate-on-migrate=true When this property is set to true, Flyway will validate the applied migrations against the migration scripts every time it runs a migration. This ensures that the migrations applied to the database match the ones available in the project. If validation fails, Flyway will prevent the migration from running, which helps catch potential problems early. spring.flyway.baseline-on-migrate=true Used when you have an existing database that wasn't managed by Flyway and you want to start using Flyway to manage it. Setting this to true allows Flyway to baseline an existing database, marking it as a baseline and starting to manage subsequent migrations. spring.flyway.locations Specifies the locations of migration scripts within your project. Run the Migrations When you start your Spring Boot application, Flyway will automatically check the db/migrations directory for any new migrations that have not yet been applied to the database and will apply them in version order. ./mvnw spring-boot:run Reverse/Undo Migrations in Flyway Flyway allows you to revert migrations that were applied to the database. However, this feature requires you to have a Flyway Teams (commercial) license. If you're using the community/free version of Flyway, the workaround is to create a new migration changelog file to undo the changes made by the previous migration and apply them. For example, V5__delete_movie_actors_table.sql DROP TABLE movie_actors; Now run the application to apply the V5 migration changelog to your database. Using Flyway Maven Plugin Flyway provides a Maven plugin to manage the migrations from the command line. It provides seven goals. Goal Description flyway:baseline Baselines an existing database, excluding all migrations up to and including baselineVersion. flyway:clean Drops all database objects (tables, views, procedures, triggers, ...) in the configured schemas. The schemas are cleaned in the order specified by the schemas property. flyway:info Retrieves the complete information about the migrations, including applied, pending, and current migrations, with details and status. flyway:migrate Triggers the migration of the configured database to the latest version. flyway:repair Repairs the Flyway schema history table. This will remove any failed migrations on databases without DDL transactions. flyway:undo Undoes the most recently applied versioned migration. Flyway Teams only. flyway:validate Validates applied migrations against resolved ones on the classpath.
This detects accidental changes that may prevent the schema(s) from being recreated exactly.

To integrate the Flyway Maven plugin into your Maven project, add the flyway-maven-plugin to your pom.xml file.

XML
<properties>
    <database.url>jdbc:mysql://localhost:3306/flyway_demo</database.url>
    <database.username>YOUR_DB_USER</database.username>
    <database.password>YOUR_DB_PASSWORD</database.password>
</properties>
<build>
    <plugins>
        <plugin>
            <groupId>org.flywaydb</groupId>
            <artifactId>flyway-maven-plugin</artifactId>
            <version>10.10.0</version>
            <configuration>
                <url>${database.url}</url>
                <user>${database.username}</user>
                <password>${database.password}</password>
            </configuration>
        </plugin>
        <!-- other plugins -->
    </plugins>
</build>

Now you can run the Maven goals, for example:

./mvnw flyway:migrate

Maven allows you to define properties in the project's POM and pass their values from the command line:

./mvnw -Ddatabase.username=root -Ddatabase.password=Passw0rd flyway:migrate
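The SQL-based migrations above cover most needs, but Flyway also supports Java-based migrations for changes that are easier to express in code, such as data backfills. Below is a minimal sketch rather than part of the original example: the class name, the backfilled column, and the package are hypothetical, and the class must live in a package that corresponds to one of the configured migration locations so that Flyway can discover it.

Java
package db.migrations;

import org.flywaydb.core.api.migration.BaseJavaMigration;
import org.flywaydb.core.api.migration.Context;
import java.sql.Statement;

// Hypothetical Java-based migration; Flyway derives the version (6) and
// the description from the class name, just as it does for SQL scripts.
public class V6__backfill_movie_region extends BaseJavaMigration {

    @Override
    public void migrate(Context context) throws Exception {
        // Flyway supplies a JDBC connection that participates in the migration.
        try (Statement statement = context.getConnection().createStatement()) {
            statement.executeUpdate(
                "UPDATE movie SET region = 'Unknown' WHERE region IS NULL");
        }
    }
}

On the next application start, Flyway treats such a class like any other versioned migration and records it in the schema history table.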
Tech teams do their best to develop amazing software products. They spend countless hours coding, testing, and refining every little detail. However, even the most carefully crafted systems may encounter issues along the way. That's where reliability models and metrics come into play. They help us identify potential weak spots, anticipate failures, and build better products. The reliability of a system is a multidimensional concept that encompasses various aspects, including, but not limited to:

Availability: The system is available and accessible to users whenever needed, without excessive downtime or interruptions. It includes considerations for system uptime, fault tolerance, and recovery mechanisms.
Performance: The system should function within acceptable speed and resource usage parameters. It scales efficiently to meet growing demands (increasing loads, users, or data volumes). This ensures a smooth user experience and responsiveness to user actions.
Stability: The software system operates consistently over time and maintains its performance levels without degradation or instability. It avoids unexpected crashes, freezes, or unpredictable behavior.
Robustness: The system can gracefully handle unexpected inputs, invalid user interactions, and adverse conditions without crashing or compromising its functionality. It exhibits resilience to errors and exceptions.
Recoverability: The system can recover from failures, errors, or disruptions and restore normal operation with minimal data loss or impact on users. It includes mechanisms for data backup, recovery, and rollback.
Maintainability: The system should be easy to understand, modify, and fix when necessary. This allows for efficient bug fixes, updates, and future enhancements.

This article starts by analyzing mean time metrics. Basic probability distribution models for reliability are then highlighted with their pros and cons. A distinction between software and hardware failure models follows. Finally, reliability growth models are explored, including a list of factors for how to choose the right model.

Mean Time Metrics
Some of the most commonly tracked metrics in the industry are MTTA (mean time to acknowledge), MTBF (mean time between failures), MTTR (mean time to recovery, repair, respond, or resolve), and MTTF (mean time to failure). They help tech teams understand how often incidents occur and how quickly the team bounces back from those incidents. The acronym MTTR can be misleading. When discussing MTTR, it might seem like a singular metric with a clear definition. However, it actually encompasses four distinct measurements. The 'R' in MTTR can signify repair, recovery, response, or resolution. While these four metrics share similarities, each carries its own significance and subtleties.

Mean Time To Repair: This focuses on the time it takes to fix a failed component.
Mean Time To Recovery: This considers the time to restore full functionality after a failure.
Mean Time To Respond: This emphasizes the initial response time to acknowledge and investigate an incident.
Mean Time To Resolve: This encompasses the entire incident resolution process, including diagnosis, repair, and recovery.

While these metrics overlap, each provides a distinct perspective on how quickly a team resolves incidents. MTTA, or Mean Time To Acknowledge, measures how quickly your team reacts to alerts by tracking the average time from alert trigger to initial investigation. It helps assess both team responsiveness and alert system effectiveness.
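Concretely, each of these mean time metrics is just an average over observed incidents. As a quick illustration (the notation below is mine, not from the original article), for n incidents:

\[
\mathrm{MTTA} = \frac{1}{n}\sum_{i=1}^{n}\left(t_i^{\mathrm{ack}} - t_i^{\mathrm{alert}}\right),
\qquad
\mathrm{MTTR}_{\mathrm{repair}} = \frac{1}{n}\sum_{i=1}^{n}\left(t_i^{\mathrm{repaired}} - t_i^{\mathrm{failed}}\right)
\]

For example, if three incidents were acknowledged 4, 6, and 11 minutes after their alerts fired, MTTA = (4 + 6 + 11) / 3 = 7 minutes.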
MTBF, or Mean Time Between Failures, represents the average time a repairable system operates between unscheduled failures. It considers both the operating time and the repair time. MTBF helps estimate how often a system is likely to experience a failure and require repair. It's valuable for planning maintenance schedules, resource allocation, and predicting system uptime.

For a system that cannot or should not be repaired, MTTF, or Mean Time To Failure, represents the average time that the system operates before experiencing its first failure. Unlike MTBF, it doesn't consider repair times. MTTF is used to estimate the lifespan of products that are not designed to be repaired after failing. This makes MTTF particularly relevant for components or systems where repair is either impossible or not economically viable. It's useful for comparing the reliability of different systems or components and informing design decisions for improved longevity.

An analogy to illustrate the difference between MTBF and MTTF could be a fleet of delivery vans.
MTBF: This would represent the average time between breakdowns for each van, considering both the driving time and the repair time it takes to get the van back on the road.
MTTF: This would represent the average lifespan of each van before it experiences its first breakdown, regardless of whether it's repairable or not.

Key Differentiators
Repairable system: MTBF applies to repairable systems; MTTF applies to non-repairable ones.
Repair time: considered in the MTBF calculation; not considered in the MTTF calculation.
Failure focus: MTBF measures the time between subsequent failures; MTTF measures the time to the first failure.
Application: MTBF supports planning maintenance and resource allocation; MTTF supports assessing inherent system reliability.

The Bigger Picture
MTTR, MTTA, MTTF, and MTBF can also be used together to provide a comprehensive picture of your team's effectiveness and areas for improvement. Mean time to recovery indicates how quickly you get systems operational again. Incorporating mean time to respond allows you to differentiate between team response time and alert system efficiency. Adding mean time to repair further breaks down how much time is spent on repairs versus troubleshooting. Mean time to resolve incorporates the entire incident lifecycle, encompassing the impact beyond downtime. But the story doesn't end there. Mean time between failures reveals your team's success in preventing or reducing future issues. Finally, incorporating mean time to failure provides insights into the overall lifespan and inherent reliability of your product or system.

Probability Distributions for Reliability
The following probability distributions are commonly used in reliability engineering to model the time until the failure of systems or components. They are often employed in reliability analysis to characterize the failure behavior of systems over time.

Exponential Distribution Model
This model assumes a constant failure rate over time. This means that the probability of a component failing is independent of its age or how long it has been operating.
Applications: This model is suitable for analyzing components with random failures, such as memory chips, transistors, or hard drives. It's particularly useful in the early stages of a product's life cycle when failure data might be limited.
Limitations: The constant failure rate assumption might not always hold true. As hardware components age, they might become more susceptible to failures (wear-out failures), which the Exponential Distribution Model wouldn't capture.
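The exponential model also has a compact closed form. The following are standard results, stated here for reference rather than taken from the article: with a constant failure rate \(\lambda\), the reliability function and MTTF are

\[
R(t) = e^{-\lambda t},
\qquad
\mathrm{MTTF} = \int_0^{\infty} R(t)\,dt = \frac{1}{\lambda}
\]

So a component with \(\lambda = 0.001\) failures per hour has an MTTF of 1,000 hours, and its probability of surviving 500 hours of operation is \(R(500) = e^{-0.5} \approx 0.61\).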
Weibull Distribution Model This model offers more flexibility by allowing dynamic failure rates. It can model situations where the probability of failure increases over time at an early stage (infant mortality failures) or at a later stage (wear-out failures). Infant mortality failures: This could represent new components with manufacturing defects that are more likely to fail early on. Wear-out failures: This could represent components like mechanical parts that degrade with use and become more likely to fail as they age. Applications: The Weibull Distribution Model is more versatile than the Exponential Distribution Model. It's a good choice for analyzing a wider range of hardware components with varying failure patterns. Limitations: The Weibull Distribution Model requires more data to determine the shape parameter that defines the failure rate behavior (increasing, decreasing, or constant). Additionally, it might be too complex for situations where a simpler model like the Exponential Distribution would suffice. The Software vs Hardware Distinction The nature of software failures is different from that of hardware failures. Although both software and hardware may experience deterministic as well as random failures, their failures have different root causes, different failure patterns, and different prediction, prevention, and repair mechanisms. Depending on the level of interdependence between software and hardware and how it affects our systems, it may be beneficial to consider the following factors: 1. Root Cause of Failures Hardware: Hardware failures are physical in nature, caused by degradation of components, manufacturing defects, or environmental factors. These failures are often random and unpredictable. Consequently, hardware reliability models focus on physical failure mechanisms like fatigue, corrosion, and material defects. Software: Software failures usually stem from logical errors, code defects, or unforeseen interactions with the environment. These failures may be systematic and can be traced back to specific lines of code or design flaws. Consequently, software reliability models do not account for physical degradation over time. 2. Failure Patterns Hardware: Hardware failures often exhibit time-dependent behavior. Components might be more susceptible to failures early in their lifespan (infant mortality) or later as they wear out. Software: The behavior of software failures in time can be very tricky and usually depends on the evolution of our code, among others. A bug in the code will remain a bug until it's fixed, regardless of how long the software has been running. 3. Failure Prediction, Prevention, Repairs Hardware: Hardware reliability models that use MTBF often focus on predicting average times between failures and planning preventive maintenance schedules. Such models analyze historical failure data from identical components. Repairs often involve the physical replacement of components. Software: Software reliability models like Musa-Okumoto and Jelinski-Moranda focus on predicting the number of remaining defects based on testing data. These models consider code complexity and defect discovery rates to guide testing efforts and identify areas with potential bugs. Repair usually involves debugging and patching, not physical replacement. 4. Interdependence and Interaction Failures The level of interdependence between software and hardware varies for different systems, domains, and applications. 
Tight coupling between software and hardware may cause interaction failures. There can be software failures due to hardware and vice versa. Here's a summary of the key differences:

Root cause of failures: hardware models deal with physical degradation, defects, and environmental factors; software models deal with code defects, design flaws, and external dependencies.
Failure patterns: hardware failures are time-dependent (infant mortality, wear-out); software failures are not time-dependent (bugs remain until fixed).
Prediction focus: hardware models predict average times between failures (MTBF, MTTF); software models predict the number of remaining defects.
Prevention strategies: hardware relies on preventive maintenance schedules; software relies on code review, testing, and bug fixes.

By understanding the distinct characteristics of hardware and software failures, we may be able to leverage tailored reliability models, whenever necessary, to gain in-depth knowledge of our system's behavior. This way we can implement targeted strategies for prevention and mitigation in order to build more reliable systems.

Code Complexity
Code complexity assesses how difficult a codebase is to understand and maintain. Higher complexity often correlates with an increased likelihood of hidden bugs. By measuring code complexity, developers can prioritize testing efforts and focus on areas with potentially higher defect density. The following tools can automate the analysis of code structure and identify potential issues like code duplication, long functions, and high cyclomatic complexity:
SonarQube: A comprehensive platform offering code quality analysis, including code complexity metrics
Fortify: Provides static code analysis for security vulnerabilities and code complexity
CppDepend (for C++): Analyzes code dependencies and metrics for C++ codebases
PMD: An open-source tool for identifying common coding flaws and complexity metrics

Defect Density
Defect density illuminates the prevalence of bugs within our code. It's calculated as the number of defects discovered per unit of code, typically lines of code (LOC). A lower defect density signifies a more robust and reliable software product.

Reliability Growth Models
Reliability growth models help development teams estimate the testing effort required to achieve desired reliability levels and ensure a smooth launch of their software. These models predict software reliability improvements as testing progresses, offering insights into the effectiveness of testing strategies and guiding resource allocation. They are mathematical models used to predict and improve the reliability of systems over time by analyzing historical data on defects or failures and their removal. Some models exhibit characteristics of exponential growth, others exhibit power-law growth, and some exhibit both. The distinction is primarily based on the underlying assumptions about how the fault detection rate changes over time in relation to the number of remaining faults. While a detailed analysis of reliability growth models is beyond the scope of this article, I will provide a categorization that may help for further study. Traditional growth models encompass the commonly used and foundational models, while the Bayesian approach represents a distinct methodology. The advanced growth models encompass more complex models that incorporate additional factors or assumptions. Please note that the list is indicative and not exhaustive.
Traditional Growth Models Musa-Okumoto Model It assumes a logarithmic Poisson process for fault detection and removal, where the number of failures observed over time follows a logarithmic function of the number of initial faults. Jelinski-Moranda Model It assumes a constant failure intensity over time and is based on the concept of error seeding. It postulates that software failures occur at a rate proportional to the number of remaining faults in the system. Goel-Okumoto Model It incorporates the assumption that the fault detection rate decreases exponentially as faults are detected and fixed. It also assumes a non-homogeneous Poisson process for fault detection. Non-Homogeneous Poisson Process (NHPP) Models They assume the fault detection rate is time-dependent and follows a non-homogeneous Poisson process. These models allow for more flexibility in capturing variations in the fault detection rate over time. Bayesian Approach Wall and Ferguson Model It combines historical data with expert judgment to update reliability estimates over time. This model considers the impact of both defect discovery and defect correction efforts on reliability growth. Advanced Growth Models Duane Model This model assumes that the cumulative MTBF of a system increases as a power-law function of the cumulative test time. This is known as the Duane postulate and it reflects how quickly the reliability of the system is improving as testing and debugging occur. Coutinho Model Based on the Duane model, it extends to the idea of an instantaneous failure rate. This rate involves the number of defects found and the number of corrective actions made during testing time. This model provides a more dynamic representation of reliability growth. Gooitzen Model It incorporates the concept of imperfect debugging, where not all faults are detected and fixed during testing. This model provides a more realistic representation of the fault detection and removal process by accounting for imperfect debugging. Littlewood Model It acknowledges that as system failures are discovered during testing, the underlying faults causing these failures are repaired. Consequently, the reliability of the system should improve over time. This model also considers the possibility of negative reliability growth when a software repair introduces further errors. Rayleigh Model The Rayleigh probability distribution is a special case of the Weibull distribution. This model considers changes in defect rates over time, especially during the development phase. It provides an estimation of the number of defects that will occur in the future based on the observed data. Choosing the Right Model There's no single "best" reliability growth model. The ideal choice depends on the specific project characteristics and available data. Here are some factors to consider. Specific objectives: Determine the specific objectives and goals of reliability growth analysis. Whether the goal is to optimize testing strategies, allocate resources effectively, or improve overall system reliability, choose a model that aligns with the desired outcomes. Nature of the system: Understand the characteristics of the system being analyzed, including its complexity, components, and failure mechanisms. Certain models may be better suited for specific types of systems, such as software, hardware, or complex systems with multiple subsystems. Development stage: Consider the stage of development the system is in. 
Early-stage development may benefit from simpler models that provide basic insights, while later stages may require more sophisticated models to capture complex reliability growth behaviors. Available data: Assess the availability and quality of data on past failures, fault detection, and removal. Models that require extensive historical data may not be suitable if data is limited or unreliable. Complexity tolerance: Evaluate the complexity tolerance of the stakeholders involved. Some models may require advanced statistical knowledge or computational resources, which may not be feasible or practical for all stakeholders. Assumptions and limitations: Understand the underlying assumptions and limitations of each reliability growth model. Choose a model whose assumptions align with the characteristics of the system and the available data. Predictive capability: Assess the predictive capability of the model in accurately forecasting future reliability levels based on past data. Flexibility and adaptability: Consider the flexibility and adaptability of the model to different growth patterns and scenarios. Models that can accommodate variations in fault detection rates, growth behaviors, and system complexities are more versatile and applicable in diverse contexts. Resource requirements: Evaluate the resource requirements associated with implementing and using the model, including computational resources, time, and expertise. Choose a model that aligns with the available resources and capabilities of the organization. Validation and verification: Verify the validity and reliability of the model through validation against empirical data or comparison with other established models. Models that have been validated and verified against real-world data are more trustworthy and reliable. Regulatory requirements: Consider any regulatory requirements or industry standards that may influence the choice of reliability growth model. Certain industries may have specific guidelines or recommendations for reliability analysis that need to be adhered to. Stakeholder input: Seek input and feedback from relevant stakeholders, including engineers, managers, and domain experts, to ensure that the chosen model meets the needs and expectations of all parties involved. Wrapping Up Throughout this article, we explored a plethora of reliability models and metrics. From the simple elegance of MTTR to the nuanced insights of NHPP models, each instrument offers a unique perspective on system health. The key takeaway? There's no single "rockstar" metric or model that guarantees system reliability. Instead, we should carefully select and combine the right tools for the specific system at hand. By understanding the strengths and limitations of various models and metrics, and aligning them with your system's characteristics, you can create a comprehensive reliability assessment plan. This tailored approach may allow us to identify potential weaknesses and prioritize improvement efforts.
Services, or servers, are software components or processes that execute operations on specified inputs, producing either actions or data depending on their purpose. The party making the request is the client, while the server manages the request process. Typically, communication between client and server occurs over a network, utilizing protocols such as HTTP for REST or gRPC. Services may include a User Interface (UI) or function solely as backend processes. With this background, we can explore the steps and rationale behind developing a scalable service. NOTE: This article does not provide instructions on service or UI development, leaving you the freedom to select the language or tech stack that suits your requirements. Instead, it offers a comprehensive perspective on constructing and expanding a service, reflecting what startups need to do in order to scale a service. Additionally, it's important to recognize that while this approach offers valuable insights into computing concepts, it's not the sole method for designing systems. The Beginning: Version Control Assuming clarity on the presence of a UI and the general purpose of the service, the initial step prior to service development involves implementing a source control/version control system to support the code. This typically entails utilizing tools like Git, Mercurial, or others to back up the code and facilitate collaboration, especially as the number of contributors grows. It's common for startups to begin with Git as their version control system, often leveraging platforms like github.com for hosting Git repositories. An essential element of version control is pull requests, facilitating peer reviews within your team. This process enhances code quality by allowing multiple individuals to review and approve proposed changes before integration. While I won't delve into specifics here, a quick online search will provide ample information on the topic. Developing the Service Once version control is established, the next step involves setting up a repository and initiating service development. This article adopts a language-agnostic approach, as delving into specific languages and optimal tech stacks for every service function would be overly detailed. For conciseness, let's focus on a service that executes functions based on inputs and necessitates backend storage (while remaining neutral on the storage solution, which will be discussed later). As you commence service development, it's crucial to grasp how to run it locally on your laptop or in any developer environment. One should consider this aspect carefully, as local testing plays a pivotal role in efficient development. While crafting the service, ensure that classes, functions, and other components are organized in a modular manner, into separate files as necessary. This organizational approach promotes a structured repository and facilitates comprehensive unit test coverage. Unit tests represent a critical aspect of testing that developers should rigorously prioritize. There should be no compromises in this regard! Countless incidents or production issues could have been averted with the implementation of a few unit tests. Neglecting this practice can potentially incur significant financial costs for a company. I won't delve into the specifics of integrating the gRPC framework, REST packages, or any other communication protocols. You'll have the freedom to explore and implement these as you develop the service. 
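The article deliberately stays language-agnostic, but to make the point about unit tests concrete, here is a minimal sketch in Java with JUnit 5; the class under test and its behavior are invented purely for illustration.

Java
import static org.junit.jupiter.api.Assertions.assertEquals;
import static org.junit.jupiter.api.Assertions.assertThrows;

import org.junit.jupiter.api.Test;

// Hypothetical class under test: a small, pure function with no network or
// database dependencies, so it runs quickly in any developer environment.
class PriceCalculator {
    double applyDiscount(double price, int percent) {
        if (percent < 0 || percent > 100) {
            throw new IllegalArgumentException("percent must be between 0 and 100");
        }
        return price * (100 - percent) / 100.0;
    }
}

class PriceCalculatorTest {

    @Test
    void appliesPercentageDiscount() {
        assertEquals(90.0, new PriceCalculator().applyDiscount(100.0, 10), 0.0001);
    }

    @Test
    void rejectsNegativeDiscounts() {
        assertThrows(IllegalArgumentException.class,
                () -> new PriceCalculator().applyDiscount(100.0, -5));
    }
}

Tests of this shape run on every local build, which is exactly the fast feedback loop the paragraph above argues for.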
Once the service is executable and tested through unit tests and basic manual testing, the next step is to explore how to make it "deployable." Packaging the Service Ensuring the service is "deployable" implies having a method to run the process in a more manageable manner. Let's delve into this concept further. What exactly does this entail? Now that we have a runnable process, who will initiate it initially? Moreover, where will it be executed? Addressing these questions is crucial, and we'll now proceed to provide answers. In my humble opinion, managing your own compute infrastructure might not be the best approach. There are numerous intricacies involved in ensuring that your service is accessible on the Internet. Opting for a cloud service provider (CSP) is a wiser choice, as they handle much of the complexity behind the scenes. For our purposes, any available cloud service provider will suffice. Once a CSP is selected, the next consideration is how to manage the process. We aim to avoid manual intervention every time the service crashes, especially without notification. The solution lies in orchestrating our process through containerization. This involves creating a container image for our process, essentially a filesystem containing all necessary dependencies at the application layer. A "Dockerfile" is used to specify the steps for including the process and dependencies in the container image. Upon completion of the Dockerfile, the docker build cli can be used to generate an image with tags. This image is then stored locally or pushed to a container registry, serving as a repository for container images that can later be pulled onto a compute instance. With these steps outlined, the next question arises: how does containerization orchestrate our process? This will be addressed in the following section on executing a container. Executing the Container After building a container image, the subsequent step is its execution, which in turn initiates the service we've developed. Various container runtimes, such as containerd, podman, and others, are available to facilitate this process. In this context, we utilize the "docker" cli to manage the container, which interacts with containerd in the background. Running a container is straightforward: "docker run" executes the container and consequently, the developed process. You may observe logs in the terminal (if not run as a daemon) or use "docker logs" to inspect service logs if necessary. Additionally, options like "--restart" can be included in the command to automatically restart the container (i.e., the process) in the event of a crash, allowing for customization as required. At this stage, we have our process encapsulated within a container, ready for execution/orchestration as required. While this setup is suitable for local testing, our next step involves exploring how to deploy this on a basic compute instance within our chosen CSP. Deploying the Container Now that we have a container, it's advisable to publish it to a container registry. Numerous container registries are available, managed by CSPs or docker itself. Once the container is published, it becomes easily accessible from any CSP or platform. We can pull the image and run it on a compute instance, such as a Virtual Machine (VM), allocated within the CSP. Starting with this option is typically the most cost-effective and straightforward. 
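To make the packaging and execution steps concrete, here is one possible sketch. The base image, artifact name, registry host, and port are hypothetical and depend entirely on your stack.

Dockerfile
# Hypothetical Dockerfile for a JVM-based service; swap the base image and
# artifact for whatever your build produces.
FROM eclipse-temurin:21-jre
WORKDIR /app
COPY target/my-service.jar my-service.jar
EXPOSE 8080
ENTRYPOINT ["java", "-jar", "my-service.jar"]

Shell
# Build and tag the image, push it to a registry, then run it with automatic restarts.
docker build -t registry.example.com/my-service:1.0.0 .
docker push registry.example.com/my-service:1.0.0
docker run -d --restart unless-stopped -p 8080:8080 registry.example.com/my-service:1.0.0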
While we briefly touch on other forms of compute infrastructure later in this article, deploying on a VM involves pulling a container image and running it, much like we did in our developer environment. Voila! Our service is deployed. However, ensuring accessibility to the world requires careful consideration. While directly exposing the VM's IP to the external world may seem tempting, it poses security risks. Implementing TLS for security is crucial. Instead, a better approach involves using a reverse proxy to route requests to specific services. This ensures security and facilitates the deployment of multiple services on the same VM. To enable internet access to our service, we require a method for inbound traffic to reach our VM. An effective solution involves installing a reverse proxy like Nginx directly on the VM. This can be achieved by pulling the Nginx container image, typically labeled as "nginx:latest". Before launching the container, it's necessary to configure Nginx settings such as servers, locations, and additional configurations. Security measures like TLS can also be implemented for enhanced protection. Once the Nginx configuration is established, it can be exposed to the container through volumes during container execution. This setup allows the reverse proxy to effectively route incoming requests to the container running on the same VM, using a specified port. One notable advantage is the ability to host multiple services within the VM, with routing efficiently managed by the reverse proxy. To finalize the setup, we must expose the VM's IP address and proxy port to the internet, with TLS encryption supported by the reverse proxy. This configuration adjustment can typically be configured through the CSP's settings. NOTE: The examples of solutions provided below may reference GCP as the CSP. This is solely for illustrative purposes and should not be interpreted as a recommendation. The intention is solely to convey concepts effectively. Consider the scenario where managing a single VM manually becomes laborious and lacks scalability. To address this challenge, CSPs offer solutions akin to managed instance groups, comprising multiple VMs configured identically. These groups often come with features like startup scripts, which execute upon VM initialization. All the configurations discussed earlier can be scripted into these startup scripts, simplifying the process of VM launch and enhancing scalability. This setup proves beneficial when multiple VMs are required to handle requests efficiently. Now, the question arises: when dealing with multiple VMs, how do we decide where to route requests? The solution is to employ a load balancer provided by the CSP. This load balancer selects one VM from the pool to handle each request. Additionally, we can streamline the process by implementing general load balancing. To remove individual reverse proxies, we can utilize multiple instance groups for every service needed, accompanied by load balancers for each. The general load balancer can expose its IP with TLS configuration and route setup, ensuring that only service containers run on the VM. It's essential to ensure that VM IPs and ports are accessible solely by the load balancer in the ingress path, a task achievable through configurations provided by the CSP. At this juncture, we have a load balancer securely managing requests, directing them to the specific container service within a VM from a pool of VMs. This setup itself contributes to scaling our service. 
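Before moving on, here is a rough sketch of the Nginx reverse proxy configuration described earlier in this section. The server name, certificate paths, service routes, and ports are hypothetical.

nginx.conf (excerpt)
server {
    listen 443 ssl;
    server_name api.example.com;

    # Hypothetical certificate paths, provisioned separately (e.g., via Let's Encrypt).
    ssl_certificate     /etc/nginx/certs/fullchain.pem;
    ssl_certificate_key /etc/nginx/certs/privkey.pem;

    # Route incoming requests to service containers listening on the same VM.
    location /orders/ {
        proxy_pass http://127.0.0.1:8080/;
    }

    location /catalog/ {
        proxy_pass http://127.0.0.1:8081/;
    }
}

The file can then be mounted into the nginx:latest container as a volume at startup, as described above, so the proxy terminates TLS and routes incoming traffic to the service containers on the VM.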
To further enhance scalability and eliminate the need for continuous VM operation, we can opt for an autoscaler policy. This policy dynamically scales the VM group up or down based on parameters such as CPU, memory, or others provided by the CSP. Now, let's delve into the concept of Infrastructure as Code (IaC), which holds significant importance in efficiently managing CSP components that promote scale. Essentially, IaC involves managing CSP infrastructure components through configuration files, interpreted by an IaC tool (like Terraform) to manage CSP infrastructure accordingly. For more detailed information, refer to the wiki. Datastore We've previously discussed scaling our service, but it's crucial to remember that there's typically a requirement to maintain a state somewhere. This is where databases or datastores play a pivotal role. From experience, handling this aspect can be quite tricky, and I would once again advise against developing a custom solution. CSP solutions are ideally suited for this purpose. CSPs generally handle the complexity associated with managing databases, addressing concepts such as master-slave architecture, replica management, synchronous-asynchronous replication, backups/restores, consistency, and other intricate aspects more effectively. Managing a database can be challenging due to concerns about data loss arising from improper configurations. Each CSP offers different database offerings, and it's essential to consider the specific use cases the service deals with to choose the appropriate offering. For instance, one may need to decide between using a relational database offering versus a NoSQL offering. This article does not delve into these differences. The database should be accessible from the VM group and serve as a central datastore for all instances where the state is shared. It's worth noting that the database or datastore should only be accessible within the VPC, and ideally, only from the VM group. This is crucial to prevent exposing the ingress IP for the database, ensuring security and data integrity. Queues In service design, we often encounter scenarios where certain tasks need to be performed asynchronously. This means that upon receiving a request, part of the processing can be deferred to a later time without blocking the response to the client. One common approach is to utilize databases as queues, where requests are ordered by some identifier. Alternatively, CSP services such as Amazon SQS or GCP pub/sub can be employed for this purpose. Messages published to the queue can then be retrieved for processing by a separate service that listens to the queue. However, we won't delve into the specifics here. Monitoring In addition to the VM-level monitoring typically provided by the CSP, there may be a need for more granular insights through service-level monitoring. For instance, one might require latency metrics for database requests, metrics based on queue interactions, or metrics for service CPU and memory utilization. These metrics should be collected and forwarded to a monitoring solution such as Datadog, Prometheus, or others. These solutions are typically backed by a time-series database (TSDB), allowing users to gain insights into the system's state over a specific period of time. This monitoring setup also facilitates debugging certain types of issues and can trigger alerts or alarms if configured to do so. Alternatively, you can set up your own Prometheus deployment, as it is an open-source solution. 
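As an illustration of the service-level metrics just mentioned, the sketch below uses Micrometer with a Prometheus registry on the JVM; it assumes the micrometer-registry-prometheus dependency, and the metric name is hypothetical. Any other metrics library would serve the same purpose.

Java
import io.micrometer.core.instrument.Timer;
import io.micrometer.prometheus.PrometheusConfig;
import io.micrometer.prometheus.PrometheusMeterRegistry;

public class MetricsExample {

    public static void main(String[] args) {
        // Registry that stores metrics and renders them in the Prometheus text format.
        PrometheusMeterRegistry registry = new PrometheusMeterRegistry(PrometheusConfig.DEFAULT);

        // Hypothetical metric: latency of the database requests made by the service.
        Timer dbTimer = Timer.builder("db.request.latency")
                .description("Latency of database requests")
                .register(registry);

        // Record the duration of a (simulated) database call.
        dbTimer.record(() -> {
            try {
                Thread.sleep(25);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });

        // In a real service this would be served on an HTTP endpoint that
        // Prometheus scrapes; here we simply print the exposition format.
        System.out.println(registry.scrape());
    }
}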
With the aforementioned concepts, it should be feasible to deploy a scalable service. This level of scalability has proven sufficient for numerous startups that I have provided consultation for. Moving forward, we'll explore the utilization of a "container orchestrator" instead of deploying containers in VMs, as described earlier. In this article, we'll use Kubernetes (k8s) as an example to illustrate this transition. Container Orchestration: Enter Kubernetes (K8s) Having implemented the aforementioned design, we can effectively manage numerous requests to our service. Now, our objective is to achieve decoupling to further enhance scalability. This decoupling is crucial because a bug in any service within a VM could lead to the VM crashing, potentially causing the entire ecosystem to fail. Moreover, decoupled services can be scaled independently. For instance, one service may have sufficient scalability and effectively handle requests, while another may struggle with the load. Consider the example of a shopping website where the catalog may receive significantly more visits than the checkout page. Consequently, the scale of read requests may far exceed that of checkouts. In such cases, deploying multiple service containers into Kubernetes (K8s) as distinct services allows for independent scaling. Before delving into specifics, it's worth noting that CSPs offer Kubernetes as a compute platform option, which is essential for scaling to the next level. Kubernetes (K8s) We won't delve into the intricacies of Kubernetes controllers or other aspects in this article. The information provided here will suffice to deploy a service on Kubernetes. Kubernetes (K8s) serves as an abstraction over a cluster of nodes with storage and compute resources. Depending on where the service is scheduled, the node provides the necessary compute and storage capabilities. Having container images is essential for deploying a service on Kubernetes (K8s). Resources in K8s are represented by creating configurations, which can be in YAML or JSON format, and they define specific K8s objects. These objects belong to a particular "namespace" within the K8s cluster. The basic unit of compute within K8s is a "Pod," which can run one or more containers. Therefore, a config for a pod can be created, and the service can then be deployed onto a namespace using the K8s CLI, kubectl. Once the pod is created, your service is essentially running, and you can monitor its state using kubectl with the namespace as a parameter. To deploy multiple pods, a "deployment" is required. Kubernetes (K8s) offers various resources such as deployments, stateful sets, and daemon sets. The K8s documentation provides sufficient explanations for these abstractions, we won't discuss each of them here. A deployment is essentially a resource designed to deploy multiple pods of a similar kind. This is achieved through the "replicas" option in the configuration, and you can also choose an update strategy according to your requirements. Selecting the appropriate update strategy is crucial to ensure there is no downtime during updates. Therefore, in our scenario, we would utilize a deployment for our service that scales to multiple pods. When employing a Deployment to oversee your application, Pods can be dynamically generated and terminated. Consequently, the count and identities of operational and healthy Pods may vary unpredictably. 
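To ground the Deployment concept, a minimal manifest might look like the following; the name, namespace, and image are hypothetical, and it would be applied with kubectl apply -f.

YAML
apiVersion: apps/v1
kind: Deployment
metadata:
  name: demo-service
  namespace: demo
spec:
  replicas: 3                      # number of identical pods to run
  selector:
    matchLabels:
      app: demo-service
  strategy:
    type: RollingUpdate            # update strategy chosen to avoid downtime
  template:
    metadata:
      labels:
        app: demo-service
    spec:
      containers:
        - name: demo-service
          image: registry.example.com/demo-service:1.0.0
          ports:
            - containerPort: 8080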
Kubernetes manages the creation and removal of Pods to sustain the desired state of your cluster, treating Pods as transient resources with no assured reliability or durability. Each Pod is assigned its own IP address, typically managed by network plugins in Kubernetes. As a result, the set of Pods linked with a Deployment can fluctuate over time, presenting a challenge for components within the cluster to consistently locate and communicate with specific Pods. This challenge is mitigated by employing a Service resource. After establishing a service object, the subsequent topic of discussion is Ingress. Ingress is responsible for routing to multiple services within the cluster. It facilitates the exposure of HTTP, HTTPS, or even gRPC routes from outside the cluster to services within it. Traffic routing is managed by rules specified on the Ingress resource, which is supported by a load balancer operating in the background. With all these components deployed, our service has attained a commendable level of scalability. It's worth noting that the concepts discussed prior to entering the Kubernetes realm are mirrored here in a way: we have load balancers, containers, and routes, albeit implemented differently. Additionally, there are other objects such as the Horizontal Pod Autoscaler (HPA) for scaling pods based on memory/CPU utilization, and storage constructs like Persistent Volumes (PV) and Persistent Volume Claims (PVC), which we won't delve into extensively. Feel free to explore these for a deeper understanding.

CI/CD
Lastly, I'd like to address an important aspect of enhancing developer efficiency: Continuous Integration/Continuous Deployment (CI/CD). Continuous Integration (CI) involves running automated tests (such as unit, end-to-end, or integration tests) on any developer pull request or check-in to the version control system, typically before merging. This helps identify regressions and bugs early in the development process. After merging, CI generates images and other artifacts required for service deployment. Tools like Jenkins (Jenkins X), Tekton, GitHub Actions, and others facilitate CI processes. Continuous Deployment (CD) automates the deployment process, staging different environments for deployment, such as development, staging, or production. Usually, the development environment is deployed first, followed by running several end-to-end tests to identify any issues. If everything functions correctly, CD proceeds to deploy to other environments. All the aforementioned tools also support CD functionalities. CI/CD tools significantly improve developer efficiency by reducing manual work. They are essential to ensure developers don't spend hours on manual tasks. Additionally, during manual deployments, it's crucial to ensure no one else is deploying to the same environment simultaneously to avoid conflicts, a concern that can be addressed effectively by our CD framework. There are other aspects, such as dynamic config management, securely storing secrets/passwords, and logging systems, that we won't cover in detail; I would encourage readers to look into the links provided. Thank you for reading!
Editor's Note: The following is an article written for and published in DZone's 2024 Trend Report, Modern API Management: Connecting Data-Driven Architectures Alongside AI, Automation, and Microservices. Microservices-based applications are distributed in nature, and each service can run on a different machine or in a container. However, splitting business logic into smaller units and deploying them in a distributed manner is just the first step. We then must understand the best way to make them communicate with each other. Microservices Communication Challenges Communication between microservices should be robust and efficient. When several small microservices are interacting to complete a single business scenario, it can be a challenge. Here are some of the main challenges arising from microservice-to-microservice communication. Resiliency There may be multiple instances of microservices, and an instance may fail due to several reasons — for example, it may crash or be overwhelmed with too many requests and thus unable to process requests. There are two design patterns that make communication between microservices more resilient: retry and circuit breakers. Retry In a microservices architecture, transient failures are unavoidable due to communication between multiple services within the application, especially on a cloud platform. These failures could occur due to various scenarios such as a momentary connection loss, response time-out, service unavailability, slow network connections, etc. (Shrivastava, Shrivastav 2022). Normally, these errors resolve by themselves by retrying the request either immediately or after a delay, depending on the type of error that occurred. The retry is carried out for a preconfigured number of times until it times out. However, a point of note is that the logical consistency of the operation must be maintained during the request to obtain repeatable responses and avoid potential side effects outside of our expectations. Circuit Breaker In a microservices architecture, as discussed in the previous section, failures can occur due to several reasons and are typically self-resolving. However, this may not always be the case since a situation of varying severity may arise where the errors take longer than estimated to be resolved or may not be resolved at all. The circuit breaker pattern, as the name implies, causes a break in a function operation when the errors reach a certain threshold. Usually, this break also triggers an alert that can be monitored. As opposed to the retry pattern, a circuit breaker prevents an operation that’s likely to result in failure from being performed. This prevents congestion due to failed requests and the escalation of failures downstream. The operation can be continued with the persisting error enabling the efficient use of computing resources. The error does not stall the completion of other operations that are using the same resource, which is inherently limited (Shrivastava, Shrivastav 2022). Distributed Tracing Modern-day microservices-architecture-based applications are made up of distributed systems that are exceedingly complex to design, and monitoring and debugging them becomes even more complicated. Due to the large number of microservices involved in an application that spans multiple development teams, systems, and infrastructures, even a single request involves a complex network of communication. 
While this complex distributed system enables a scalable, efficient, and reliable system, it also makes system observability more challenging to achieve, thereby creating issues with troubleshooting. Distributed tracing helps us overcome this observability challenge by using a request-centric view. As a request is processed by the components of a distributed system, distributed tracing captures the detailed execution of the request and its causally related actions across the system's components (Shkuro 2019). Load Balancing Load balancing is the method used to utilize resources optimally and to ensure smooth operational performance. In order to be efficient and scalable, more than one instance of a service is used, and the incoming requests are distributed across these instances for a smooth process flow. In Kubernetes, load balancing algorithms are implemented in a more effective manner using a service mesh, which is based on recorded metrics such as latency. Service meshes mainly manage the traffic between services on the network, ensuring that inter-service communications are safe and reliable by enabling the services to detect and communicate with each other. The use of a service mesh improves observability and aids in monitoring highly distributed systems. Security Each service must be secured individually, and the communication between services must be secure. In addition, there needs to be a centralized way to manage access controls and authentication across all services. One of the most popular ways for securing microservices is to use API gateways, which act as proxies between the clients and the microservices. API gateways can perform authentication and authorization checks, rate limiting, and traffic management. Service Versioning The deployment of a microservice version update often leads to unexpected issues and breaking errors between the new version of the microservice and other microservices in the system, or even external clients using that microservice. While the team deploying the new version attempts to mitigate and reduce these breaks, multiple versions of the same microservice can be run simultaneously, thereby allowing requests to be routed to the appropriate version of the microservice. This is done using API versioning for API contracts. Communication Patterns Communication between microservices can be designed by using two main patterns: synchronous and asynchronous. In Figure 1, we see a basic overview of these communication patterns along with their respective implementation styles and choices. Figure 1. Synchronous and asynchronous communication with common implementation technologies Synchronous Pattern Synchronous communication between microservices is one-to-one communication. The microservice that generates the request is blocked until a response is received from the other service. This is done using HTTP requests or gRPC — a high-performance remote procedure call (RPC) framework. In synchronous communication, the microservices are tightly coupled, which is advantageous for less distributed architectures where communication happens in real time, thereby reducing the complexity of debugging (Newman 2021). Figure 2. Synchronous communication depicting the request-response model The following table shows a comparison between technologies that are commonly used to implement the synchronous communication pattern. Table 1. REST vs. gRPC vs. 
GraphQL

Architectural principles: REST uses a stateless client-server architecture and relies on URIs and HTTP methods for a layered system with a uniform interface. gRPC uses the client-server method of remote procedure call; methods are directly called by the client and behave like local methods, although they are on the server side. GraphQL uses client-driven architecture principles and relies on queries, mutations, and subscriptions via APIs to request, modify, and update data from/on the server.
HTTP methods: REST uses POST, GET, PUT, and DELETE; gRPC uses custom methods; GraphQL uses POST.
Payload data structure: REST sends and receives JSON- and XML-based payloads; gRPC uses Protocol Buffers-based serialized payloads; GraphQL uses JSON-based payloads.
Request/response caching: natively supported on the client and server side in REST; unsupported by default in gRPC; supported in GraphQL but complex, as all requests share a common endpoint.
Code generation: natively unsupported in REST, which requires third-party tools like Swagger; natively supported in gRPC; natively unsupported in GraphQL, which requires third-party tools like GraphQL Code Generator.

Asynchronous Pattern
In asynchronous communication, as opposed to synchronous, the microservice that initiates the request is not blocked until the response is received. It can proceed with other processes without receiving a response from the microservice it sends the request to. In the case of a more complex distributed microservices architecture, where the services are not tightly coupled, asynchronous message-based communication is more advantageous as it improves scalability and enables continued background operations without affecting critical processes (Newman 2021). Figure 3. Asynchronous communication

Event-Driven Communication
The event-driven communication pattern leverages events to facilitate communication between microservices. Rather than sending a request, microservices generate events without any knowledge of the other microservices' intents. These events can then be used by other microservices as required. The event-driven pattern is asynchronous communication as the microservices listening to these events have their own processes to execute. The principle behind events is entirely different from the request-response model. The microservice emitting the event leaves the recipient fully responsible for handling the event, while the microservice itself has no idea about the consequences of the generated event. This approach enables loose coupling between microservices (Newman 2021). Figure 4. Producers emit events that some consumers subscribe to

Common Data
Communication through common data is asynchronous in nature and is achieved by having a microservice store data at a specific location where another microservice can then access that data. The data's location must be persistent storage, such as data lakes or data warehouses. Although common data is frequently used as a method of communication between microservices, it is often not considered a communication protocol because the coupling between microservices is not always observable when it is used. This communication style finds its best use case in situations that involve large volumes of data, as a common data location prevents redundancy, makes data processing more efficient, and is easily scalable (Newman 2021). Figure 5. An example of communication through common data

Request-Response Communication
The request-response communication model is similar to the synchronous communication that was previously discussed: a microservice sends a request to another microservice and has to await a response.
Along with the previously discussed protocols (HTTP, gRPC, etc.), message queues are used as well. Request-response is implemented as one of the following two methods:

Blocking synchronous – Microservice A opens a network connection and sends a request to Microservice B along this connection. The established connection stays open while Microservice A waits for Microservice B to respond.
Non-blocking asynchronous – Microservice A sends a request to Microservice B, and Microservice B needs to know implicitly where to route the response.

Also, message queues can be used; they provide the added benefit of buffering multiple requests in the queue to await processing. This method is helpful in situations where the rate of requests received exceeds the rate of handling those requests. Rather than trying to handle more requests than its capacity allows, the microservice can take its time generating a response before moving on to handle the next request (Newman 2021). Figure 6. An example of request-response non-blocking asynchronous communication

Conclusion
In recent years, we have observed a paradigm shift from designing large, clunky, monolithic applications that are complex to scale and maintain to using microservices-based architectures that enable the design of distributed applications, ones that can integrate multiple communication patterns and protocols across systems. These complex distributed systems can be developed, deployed, scaled, and maintained independently by different teams with fewer conflicts, resulting in a more robust, reliable, and resilient application. Using the optimal communication pattern and protocol for the exact operation that a microservice must achieve is a crucial task and has a huge impact on the functionality and performance of an application. The aim is to make the communication between microservices as seamless as possible to establish an efficient system. In-depth knowledge regarding the available communication patterns and protocols is an essential aspect of modern-day cloud-based application design, which is not only dynamic but also highly competitive, with multiple contenders providing identical applications and services. Speed, scalability, efficiency, security, and other additional features are often crucial in determining the overall quality of an application, and proper microservices communication is the backbone to achieving those capabilities.

References:
Shrivastava, Saurabh, and Neelanjali Shrivastav. 2022. Solutions Architect's Handbook, 2nd Edition. Packt.
Shkuro, Yuri. 2019. Mastering Distributed Tracing. Packt.
Newman, Sam. 2021. Building Microservices, 2nd Edition. O'Reilly.

This is an excerpt from DZone's 2024 Trend Report, Modern API Management: Connecting Data-Driven Architectures Alongside AI, Automation, and Microservices.