Defining Architecture Capabilities
Some years ago, I sat in on a conference session hosted by Llewellyn Falco. In one timeboxed exercise, the class collaboratively wrote and iterated over a single line of code. Within about 10 minutes we had come up with several dozen ways to express that single line and, when time ran out, we weren’t slowing down; we were speeding up. I left that experience with the conviction that there is a nearly infinite number of ways to write code to implement a set of features. All implementations may provide equivalent functionality, but they don’t necessarily offer the same system capabilities.
In our previous post we discussed how architecture is focused on the capabilities of a system beyond its features and functions. As architects, it’s our job to carefully and mindfully constrain the degrees of freedom on code with the goal of having the desired set of capabilities emerge. But what are the capabilities? In this post we will explore and define various architectural characteristics.
This post is part of a series on Tailor-Made Software Architecture, a set of concepts, tools, models, and practices to improve fit and reduce uncertainty in the field of software architecture. Concepts are introduced sequentially and build upon one another. Think of this series as a serially published leanpub architecture book. If you find these ideas useful and want to dive deeper, join me for Next Level Software Architecture Training in Santa Clara, CA March 4-6th for an immersive, live, interactive, hands-on three-day software architecture masterclass.
In our industry these go by any number of names: non-functional requirements, system quality attributes, architectural capabilities, system capabilities, architectural characteristics, or simply “-ilities” (since so many of them share the -ility suffix).
Architectural Capabilities of Key Interest
Although many such attributes exist, we are going to focus on a subset. Many of these capabilities fall into natural groupings, so we’ll try to compare and contrast capabilities that are conceptually adjacent.
Category: Speed and Scale
In the previous post we talked about the mental shift from thinking about functions to thinking about capabilities. Although most developers generally don’t think in those terms, there is one notable exception: performance. Performance is typically easy to measure, and given two functionally identical pieces of code, it may seem reasonable to conclude that the more performant option is superior. That may be true, or it may not; as always… “it depends.” Remember that every decision an architect makes involves a trade-off. Mark Richards and Neal Ford stated this as their first “Law of Software Architecture”:
“Everything in software architecture is a tradeoff”
The First Law of Software Architecture - Neal Ford, Mark Richards
Growing from developer to architect means not only thinking more in terms of system capabilities, but also thinking about such capabilities in context: the context of the business problems being addressed and the relative importance of any given capability. If performance is important, but evolvability is more important, we must avoid decisions that improve performance but hurt evolvability.
Scalability also falls under this category. Scale is cool and all–and certainly confers bragging rights–but like performance we must be cautious against both premature optimization and optimizing in a way that adversely affects the key system qualities. I’m reminded of the time I met “The Mad Potter” Byron Seeley (note: this link includes a photo gallery and, if you page through it, you will see Byron naked). Byron is an artist who, years ago, set up shop in a ghost town in the middle of nowhere in Wyoming. The ghost town is located on one of the harshest and loneliest highways in the state. When I first met Byron, I asked if he gets a lot of business. His response? “How much is ‘a lot’? I know what ‘enough’ is, and I get enough.” That line of thinking continues to stick with me to this day. When it comes to all the capabilities in this category, our goal shouldn’t be “a lot” but rather it should be “enough.”
Performance is a broad topic and can refer to a variety of things. It can be useful to break these down at times, and likewise sometimes the umbrella of performance is the optimal level of abstraction. Since we’re defining these things, I’ll break performance down into sub-categories.
Network Performance

Almost any application we design and build today will have some kind of network component. This could be a simple client-server architecture (like an Angular app and a monolithic backend) or it could be a complex distributed system. When we talk about network performance we are primarily measuring the latency of network interactions. There are a number of contributing factors, some within our control and some outside of our control.
User Perceived Performance
“Perception is reality to the one in the experience”
- Toba Beta
While we can often obsess over the raw metrics of performance, sometimes perception of performance is enough. We can sometimes mitigate performance trade-offs with decisions that enhance perceived performance. Caching, asynchronous processing, careful user experience design, and more can dramatically improve the perception of performance to the end user.
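One common tactic behind perceived performance is to acknowledge a request immediately and finish the slow work asynchronously. Here is a minimal Python sketch of that idea using an in-process queue; the `submit` function and job names are hypothetical, and a real system would use a durable message broker rather than a daemon thread:

```python
import queue
import threading
import time

jobs: "queue.Queue[str]" = queue.Queue()
results: dict[str, str] = {}

def worker() -> None:
    # Background worker: does the slow part out of the user's critical path.
    while True:
        job_id = jobs.get()
        time.sleep(0.05)          # stand-in for slow processing
        results[job_id] = "done"
        jobs.task_done()

threading.Thread(target=worker, daemon=True).start()

def submit(job_id: str) -> str:
    """Returns instantly; the user perceives near-zero latency."""
    jobs.put(job_id)
    return "accepted"

status = submit("order-42")   # immediate acknowledgement to the user
jobs.join()                   # the real work completes out of band
```

The user experiences an instant response even though the total work took just as long; the trade-off is eventual consistency between the acknowledgement and the result.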
Network Efficiency

Although a number of factors affecting network performance may be outside of our control, it is sometimes useful to think with specificity about network efficiency. As always, there are many dimensions to this characteristic. Often the default thinking is something like “Oh, use protobuf instead of JSON, it’s a more efficient serialization” or “Adopt the Backend-For-Frontends (BFF) pattern, as it allows for smaller, fine-tuned payloads.” Those are both true statements and, although both options are superficially functionally similar, each can have significant architectural ramifications and trade-offs that can only be seen when we think about the distinct capabilities of each. There are other options available to us as well (e.g. caching); remember, the most efficient network request is the one that doesn’t have to happen at all.
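As a small illustration of “the most efficient request is no request,” here is a hedged Python sketch of a time-to-live cache. The `fetch_profile` function and its URL are invented stand-ins for a real network call:

```python
import time
from typing import Callable

def ttl_cache(ttl_seconds: float):
    """Memoize results for a short window so repeated calls skip the network."""
    def decorator(fetch: Callable[[str], str]) -> Callable[[str], str]:
        store: dict[str, tuple[float, str]] = {}
        def wrapper(url: str) -> str:
            hit = store.get(url)
            if hit and time.monotonic() - hit[0] < ttl_seconds:
                return hit[1]                      # cache hit: no request at all
            value = fetch(url)                     # cache miss: pay the cost once
            store[url] = (time.monotonic(), value)
            return value
        return wrapper
    return decorator

calls = 0

@ttl_cache(ttl_seconds=30)
def fetch_profile(url: str) -> str:
    global calls
    calls += 1                                     # counts real "network" trips
    return f"payload-from-{url}"                   # stand-in for a real response

first = fetch_profile("https://example.test/users/1")
second = fetch_profile("https://example.test/users/1")  # served from cache
```

The trade-off, of course, is staleness: choosing the TTL is itself an architectural conversation with the business.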
Compute Efficiency

At a high level, compute efficiency refers to the amount of CPU time or energy needed to produce a given result (and many optimizations may be either-or when it comes to CPU time efficiency vs. power efficiency). Like those characteristics already mentioned, there are potentially a lot of dimensions to this. Sure, we can optimize code or take advantage of various available compiler optimizations. It’s also common to run certain workloads on GPU hardware which is optimized for certain types of compute. But what else might a well-rounded architect consider? Some technologies to explore and be aware of:
- GraalVM is a Java Virtual Machine that supports ahead-of-time compilation of JVM applications, with the goal of bringing the performance of JVM-based languages in line with that of native languages.
- LLVM is a compiler toolchain that supports a number of different optimizations (CPU time, energy, etc.). The architecture of LLVM is very pluggable, meaning it is easy to leverage the toolchain for any number of languages, and likewise it can target a wide and diverse set of processor architectures. LLVM is the linchpin technology that allowed Apple to design their new System-on-Chip (SoC) processor (Apple Silicon) with a highly specialized design that lowers power consumption and reduces wall-clock time for many common operations. With LLVM, leveraging the benefits of this chip was as simple as recompiling existing code.
- WebAssembly is a highly portable compilation target that delivers on the “write-once-run-anywhere” promise (with 25 years of experience and hindsight). This is hugely significant for a number of reasons, but in this context compute efficiency may be achieved by making decisions that enable the system to be flexible about where the compute happens (i.e. on the server, on the client, on the edge, on highly specialized hardware, etc.). WebAssembly enables existing code to be compiled as-is to target WebAssembly, and that single output can run efficiently in any of those areas (WASM isn’t just about the browser; there are already a number of non-browser, general-purpose runtimes). LLVM-powered JIT implementations also mean WASM binaries can run efficiently on a wide variety of hardware architectures, including very specialized hardware.
- ASICs (application-specific integrated circuits) are special-purpose, high-performance hardware that does one thing extremely efficiently. Historically, the lift to write code leveraging such hardware has been significant but, again, technologies like LLVM make it accessible to almost every developer.
Scalability

Scalability is an obvious–and often top-of-mind–capability. Scalability defines how easily the system can grow to consume more resources as needed. Once again, there are many dimensions. How we scale for total number of users might be different from concurrent users, or even simply scaling for data/storage. Consequently, conversations around scalability should revolve around the idea of “what is enough” and which specific resources must be scalable. I find that ‘scalability’ is becoming overloaded to the point of being a buzzword (and thus increasingly meaningless) so, as with any conversation around capabilities, we must qualify, quantify, and verify this capability in context and through conversations with the business.
Unfortunately, many default to the microservices architecture pattern as a knee-jerk reaction to scalability needs. An oft-repeated–yet fallacious–statement in some architecture circles and “conventional wisdom” is that “microservices are scalable, monoliths are not.” Scaling anything involves some complexity. Although microservices allow an almost surgical level of precision about which areas to scale, Facebook (now Meta) has been very successful in scaling their PHP monolith to billions of users.
There are many routes to scalability, each with notable trade-offs.
Elasticity

Closely related to scalability is the concept of elasticity. Sometimes we hear the terms used interchangeably. If we follow the elastic metaphor, the elastic waistband on my shorts can grow as needed (after, say, a particularly large meal), but it also has the capability of returning to a smaller size when such growth is no longer needed. Certain decisions can afford some degree of elasticity to almost any architectural pattern; the question always comes back to “what is enough,” as well as how well any given architecture fits the organization and the problem as a whole.
Category: Agility

If scalability is approaching buzzword-level, “agility” has passed that point with such velocity that a sonic boom follows in its wake. Everyone wants to be “agile” (or, at least, everyone wants to be able to say they are agile… there’s a difference).
So what is “agility?” At its core, it refers to the ability of a system, process, or organization to quickly respond to change in an efficient and effective manner. Several capabilities help get us there:
Evolvability

Evolvability is the ability for a system to gracefully absorb and adopt both business and technical change. This is an attribute that rarely materializes by accident. Building most software systems is like working with concrete: easy to pour, mold, and shape in the beginning, but once it hardens, changes require a jackhammer and can be very disruptive. It’s no wonder that we often would rather rebuild a system from scratch than try to make significant changes (although this almost never makes economic sense).
Numerous decisions from the micro-architecture of code to the macro-architecture of a system can impact Evolvability. A great case-study on evolvability for architects is the World Wide Web. Some brilliant decisions were made that enabled the web to grow and change radically from what was originally envisioned without ever stopping for a rewrite. I don’t know of a single system that can compare with the web in terms of its evolvability. There are many architectural ideas and lessons that can be taken away from this example. Evolvability is therefore another area where architects often need a certain amount of vision to anticipate the potential rate of change of a system and make sure decisions accommodate this.
Extensibility

Although extensibility and evolvability don’t entirely overlap, they are closely adjacent. Extensibility describes how easy it is to use a system in unanticipated ways, or to extend the functionality of the system without breaking or disrupting what is already there. How we think about modularity, interfaces, and abstraction at the system level is usually an important starting point.
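To make the modularity-and-interfaces point concrete, here is one possible Python sketch of a stable interface plus a registry. The `Exporter` protocol and the formats are invented for illustration; the key property is that new behavior is added by registering a new entry, not by editing existing logic:

```python
import json
from typing import Protocol

class Exporter(Protocol):
    """A stable interface: new formats plug in without touching existing code."""
    def export(self, record: dict) -> str: ...

class JsonExporter:
    def export(self, record: dict) -> str:
        return json.dumps(record, sort_keys=True)

class CsvExporter:
    def export(self, record: dict) -> str:
        # Values joined in key order; a toy stand-in for real CSV handling.
        return ",".join(str(v) for _, v in sorted(record.items()))

# Extending the system means adding a registry entry, not editing run_export.
EXPORTERS: dict[str, Exporter] = {"json": JsonExporter(), "csv": CsvExporter()}

def run_export(fmt: str, record: dict) -> str:
    return EXPORTERS[fmt].export(record)

row = run_export("csv", {"id": 1, "name": "ada"})
```

A third-party could register an `XmlExporter` without ever seeing the internals of `run_export`; that is the essence of extension without disruption.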
Composability

A specific example of extensibility is that of composability, or what snarky former tech-blogger Ted Dziuba calls Taco Bell Programming. I should point out that almost everything Ted writes is probably NSFW, containing strong language and spicy takes.
“Every item on the menu at Taco Bell is just a different configuration of roughly eight ingredients. With this simple periodic table of meat and produce, the company pulled down $1.9 billion last year… The more I write code and design systems, the more I understand that many times, you can achieve the desired functionality simply with clever reconfigurations of the basic Unix tool set.”
A centerpiece of the Unix Philosophy is a wide array of small, self-contained, single-purpose tools that can be composed in any number of combinations and configurations to solve a wide variety of problems. The Unix toolset achieves this with a high degree of modularity and a uniform interface defined by the POSIX standard.
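The same philosophy translates directly to code: small, single-purpose functions with a uniform interface (here, iterables of lines) compose like Unix pipes. This sketch is purely illustrative, with filter names mirroring their Unix counterparts:

```python
from itertools import islice
from typing import Iterable, Iterator

# Small single-purpose "filters" with a uniform interface, composable like
# a Unix pipeline: cat logs | grep ERROR | head -2
def grep(lines: Iterable[str], needle: str) -> Iterator[str]:
    return (line for line in lines if needle in line)

def head(lines: Iterable[str], n: int) -> Iterator[str]:
    return islice(lines, n)

logs = ["INFO boot", "ERROR disk", "INFO ok", "ERROR net", "ERROR cpu"]
first_two_errors = list(head(grep(logs, "ERROR"), 2))
```

Because every filter consumes and produces the same shape of data, any new filter immediately composes with all the existing ones, the Taco Bell property in miniature.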
The web adapted this idea for large, distributed systems, using the uniform interface constraint of the REST architectural style. The Resource Abstraction creates a flexible, stable mechanism to build highly composable systems. Peter Rogers at 1060 Research has generalized these ideas and built NetKernel, a fabric for resource-oriented computing and decoupled, composable components that scales linearly and allows real-time definition and reshaping of system architectures. In my estimation, this work is Turing-award level genius.
Also notable are efforts to create a composable data fabric using ideas from the architecture of the web and linked data. The core ideas and motivation are detailed in the data-centric manifesto. As always, there are significant trade-offs to all of these ideas, but if agility (or any subcomponent) is highly aligned with the business drivers, these are ideas worth exploring and adding to your technical breadth.
Testability

Change does not exist in a vacuum, and change necessarily involves risk. If we want to be agile–if we operate in a problem space where the risk of stagnation is greater than the risk of change–we must be able to make changes confidently. How easily can we test and validate changes before they are released? Generally, more modular architectures with clear boundaries and contracts for interactions produce more testable systems. The surface area of the risk becomes smaller, as does the blast radius of a problem.
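A minimal illustration of how a clear contract improves testability: by depending on an interface rather than a concrete service, we can substitute a fake and validate a change in isolation. The `PaymentGateway` example here is hypothetical:

```python
# A clear contract at the boundary lets tests substitute a fake,
# shrinking both the surface area of risk and the blast radius.
class PaymentGateway:
    def charge(self, cents: int) -> bool:
        raise NotImplementedError

class FakeGateway(PaymentGateway):
    """Test double: records calls instead of hitting a real processor."""
    def __init__(self) -> None:
        self.charged: list[int] = []

    def charge(self, cents: int) -> bool:
        self.charged.append(cents)
        return True

def checkout(gateway: PaymentGateway, cents: int) -> str:
    # The logic under test never knows whether the gateway is real or fake.
    return "paid" if gateway.charge(cents) else "declined"

fake = FakeGateway()
outcome = checkout(fake, 1999)
```

The checkout logic can now be exercised thousands of times per second with no network, no credentials, and no risk, which is exactly what makes confident change possible.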
Deployability

Another component of agility is deployability: how easily and confidently we can release changes. Generally, the more granular the architecture pattern and the better defined the interfaces, the easier it is to deploy changes. Patterns like microservices are considered more deployable due to their highly granular nature. Notably, there are other implied decisions within that pattern that act as enablers for deployability.
Category: Integration & Interoperability
There are several distributed architecture patterns, and they are generally fairly popular (even when the promised benefits rarely materialize; there are reasons for this that we will discuss in a later post). While most literature on distributed architectures focuses on taking systems apart, remember that we eventually have to put Humpty Dumpty together again. Moreover, when building new systems, they often need to interact and interoperate with legacy and third-party systems. Finally, enterprises are not static entities; mergers and acquisitions are almost inevitable. Depending on the organization, it may be necessary to design systems that will be able to integrate with yet-unknown systems in the future. There are a few capabilities that support this category of business problem.
Integration

Just as a symphony is incomplete without the synchronized harmony of various instruments, our modern digital solutions are built upon an interconnected ensemble of systems. Integration as a capability measures how readily distinct systems or components can be merged, allowing them to function as one. It’s not just about connecting A to B; it’s about ensuring that A and B communicate effectively, efficiently, and seamlessly. As an analogy, consider trains and rail networks. Separate rail companies may operate independently, but at junctions they rely on meticulously designed intersections to allow trains to switch tracks, combine routes, or operate side by side. Similarly, in the software realm, integration can be a daunting task due to diverse technologies, but it can be valuable to foster a unified (or flexible) solution. Architecture decisions around tools, adherence to standards, and different approaches to APIs or messaging services can affect the amount of friction systems experience when cross-communicating. As always, there are many options, and each brings its own trade-offs. This can be an important area for enterprise architects to continue to build breadth.
Interoperability

Interoperability goes deeper than just connectivity between systems–it’s the ability of the systems to exchange, interpret, and cooperatively use information. To continue to abuse the metaphor, if integration is about joining two ends of a bridge, interoperability is ensuring those ends are built with the same blueprint, materials, rules, and conventions. When designing a distributed, domain-partitioned system we quickly realize that, even in the same organization, different business units define terms and concepts in very different ways. One of the first and most important steps is to do the work to define each business domain’s ubiquitous language (to use the DDD parlance). From there we have options.
We can try to build consensus between upstream and downstream components using the conformist pattern; we can “agree to disagree” and build an anti-corruption layer to translate terms across domains; we can define a shared kernel; or we can think about the problem differently and exchange information (data with context) rather than mere data (e.g. JSON-serialized, decontextualized name/value pairs). Linked data, mentioned earlier, is one of the best options for achieving this. Linked data embraces the non-unique naming assumption and resolves the issues that assumption has historically introduced. None of these approaches are easy, and all involve trade-offs.
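A tiny sketch of the anti-corruption layer idea: a translation function at the boundary keeps each domain’s vocabulary intact. The billing and support models below are invented for illustration:

```python
from dataclasses import dataclass

# Upstream (billing) speaks its own ubiquitous language...
@dataclass
class BillingAccount:
    acct_no: str
    balance_cents: int

# ...and downstream (support) speaks another.
@dataclass
class Customer:
    customer_id: str
    balance_dollars: float

def to_customer(acct: BillingAccount) -> Customer:
    """The anti-corruption layer: translate at the boundary so neither
    domain's model leaks into the other."""
    return Customer(customer_id=acct.acct_no,
                    balance_dollars=acct.balance_cents / 100)

customer = to_customer(BillingAccount(acct_no="A-7", balance_cents=2500))
```

When the billing system changes its model, only the translation function needs to change; the support domain's code and vocabulary remain untouched.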
Category: Feasibility & Manageability
Just like there are a lot of ways to write code to deliver a set of features, there are a lot of ways to design a system architecture. The work to think about system architecture is interesting and challenging, but we can’t escape two realities:
- At some point, we have to build and release the software
- Once it’s out there, we have to be able to understand and keep it running
With that in mind, we’ll talk about the last category of characteristics.
Visibility & Observability
As systems grow, complexity can be a big problem. With a distributed system, we can’t just set a breakpoint or step through the code anymore. While closely related, visibility and observability differ in depth, scope, and application. Visibility refers to the ability to “see” into a system. It could be as simple as having an up-to-date map of microservices, or knowing which components are online/offline, healthy or not. Visibility focuses on what’s happening, but may not explain the “why” of the current state. While visibility provides crucial insights, it may not offer the depth required to understand complex issues, especially in distributed systems where problems might arise due to intricate interactions. An architect might select tools or standards around how the components produce metrics and logs, or standardize how health checks are performed.
Observability is a measure of how well you can understand the system’s internal state based on its external outputs. It not only lets you see what’s happening, but understand why it is happening. If this is an important concern, an architect might prescribe distributed tracing, or logging solutions that can correlate data from various sources to provide a more complete picture. Observability excels in environments where the system is too complex to predict all potential problems beforehand. Instead of trying to foresee every possible issue, teams build systems that can be interrogated for insights when unexpected situations arise.
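One building block of observability is emitting structured, correlated events so a single request can be reconstructed after the fact. A hedged Python sketch follows; real systems would use a tracing or logging stack (e.g. OpenTelemetry), and the event names here are invented:

```python
import json
import uuid

records: list[str] = []  # stand-in for a real log sink

def log(correlation_id: str, event: str, **fields) -> None:
    """Emit one structured log line tagged with a correlation id."""
    records.append(json.dumps({"cid": correlation_id, "event": event, **fields}))

def handle_request() -> str:
    """Every step of one request shares the same correlation id."""
    cid = str(uuid.uuid4())
    log(cid, "request.received", path="/orders")
    log(cid, "db.query", table="orders")
    log(cid, "request.completed", status=200)
    return cid

cid = handle_request()
# Later, the whole request can be reconstructed by filtering on the id,
# even if these lines were interleaved with thousands of other requests.
trace = [json.loads(r) for r in records if json.loads(r)["cid"] == cid]
```

This is the "interrogate the system after the fact" property in miniature: the question (why did this request fail?) need not be anticipated when the logs are written.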
Fault Tolerance

At some point, things will fail. As my friend and fellow speaker Matt Stine once said, “failure is the only option.” Sometimes the systems we build can withstand occasional service disruptions; other times it can literally be the difference between life and death. Most of the systems I have worked on (thankfully) operate somewhere in the middle. We want to avoid small failures cascading into larger ones, and thinking about the characteristic of fault tolerance can be helpful here. Architecture decisions, choice of pattern, how we implement inter-component communication, and how we coordinate and manage distributed transactions all impact the fault tolerance of the system and its components.
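A classic pattern for containing cascading failures is the circuit breaker: after repeated failures, stop calling the sick dependency and fail fast while it recovers. This is a deliberately minimal, illustrative Python sketch; a production system would reach for a hardened library:

```python
import time

class CircuitBreaker:
    """Minimal illustrative breaker: after `threshold` consecutive failures,
    fail fast for `cooldown` seconds instead of hammering a sick dependency."""

    def __init__(self, threshold: int = 3, cooldown: float = 30.0) -> None:
        self.threshold = threshold
        self.cooldown = cooldown
        self.failures = 0
        self.opened_at = None  # timestamp when the circuit opened

    def call(self, fn):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.cooldown:
                raise RuntimeError("circuit open: failing fast")
            self.opened_at = None  # half-open: allow one trial call
        try:
            result = fn()
        except Exception:
            self.failures += 1
            if self.failures >= self.threshold:
                self.opened_at = time.monotonic()
            raise
        self.failures = 0  # any success closes the circuit
        return result

breaker = CircuitBreaker(threshold=2, cooldown=60.0)

def flaky():
    raise ConnectionError("dependency down")  # stand-in for a failing service

for _ in range(2):
    try:
        breaker.call(flaky)
    except ConnectionError:
        pass

# The third call fails fast without touching the dependency at all.
try:
    breaker.call(flaky)
    tripped = False
except RuntimeError:
    tripped = True
```

Failing fast is what stops a slow, dying dependency from tying up threads and dragging healthy components down with it.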
Availability

Availability is a broader look at the concept of fault tolerance and is generally measured as an uptime percentage (e.g. 99.99% uptime). Generally, in discussion with the business, architects will determine how much downtime is acceptable for a system or component and make decisions to maintain that service-level agreement (SLA). There are a lot of paths to ensuring a minimum level of uptime.
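SLA percentages translate into surprisingly small downtime budgets, and the arithmetic is worth internalizing before committing to a number. A quick sketch:

```python
# Translate an SLA uptime percentage into an annual downtime budget.
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600

def downtime_budget_minutes(uptime_percent: float) -> float:
    return MINUTES_PER_YEAR * (1 - uptime_percent / 100)

three_nines = downtime_budget_minutes(99.9)    # roughly 8.8 hours per year
four_nines = downtime_budget_minutes(99.99)    # roughly 53 minutes per year
```

Each additional nine cuts the budget by a factor of ten; the cost and complexity of achieving it tend to grow much faster than that, which is why "how many nines is enough?" is a business conversation, not a technical one.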
Cost

Cost generally refers to the total cost of ownership (TCO) of the system: what it will cost to build the proposed system and what it will cost to run it. Given a fixed amount of money, an architect will inevitably need to make trade-off decisions to keep a lid on cost. This is a constraining reality of every project.
Simplicity

Finally, there is a point where the solution space of a project becomes too hard. Microservices are quite possibly one of the most difficult architecture patterns to pull off. Breaking apart teams is HARD, breaking apart data is HARD, DDD is HARD, reorgs are HARD, and building the infrastructure and tooling to support development, deployment, and management of microservices is HARD. I believe one of the reasons most microservice implementations fail is that organizations massively underestimate the difficulty of this pattern. We have to look at architecture holistically and determine what level of complexity and disruption the organization can withstand. This architectural intuition tends to develop with time and experience. Getting the level of simplicity wrong can make or break a project. It is often advisable to err slightly on the side of simplicity; it is much easier to add complexity than to remove it.
This is just scratching the surface of these architectural capabilities and just a handful of things to consider around some of them. For reference, here is a summary of the capabilities/characteristics and their definitions:
|Capability|Definition|
|---|---|
|Performance|Broad measure of system’s efficiency and speed. Can be broken into various dimensions.|
|Network Performance|Measures latency of network interactions and factors influencing it.|
|User Perceived Performance|The end user’s perception of system performance, which might not align with actual, objective metrics.|
|Network Efficiency|The efficient utilization of network resources, optimizing for factors like payload size and serialization.|
|Compute Efficiency|Amount of CPU time or energy needed to produce a result, optimized for specific hardware or platforms.|
|Scalability|How easily the system can grow and handle increasing workloads or demands.|
|Elasticity|System’s ability to adapt to workload changes by provisioning and de-provisioning resources automatically.|
|Evolvability|Ability for a system to gracefully absorb and adopt both business and technical change.|
|Extensibility|Ease of extending the system’s functionality or using it in unanticipated ways.|
|Composability|System’s ability to combine and configure components in various ways to create different functionalities.|
|Testability|Ease with which a system can be tested to validate changes and ensure they meet expected outcomes.|
|Deployability|Ease and confidence with which system changes can be released and deployed.|
|Integration|Merging distinct systems or components, allowing them to function as one.|
|Interoperability|Systems’ ability to exchange, interpret, and cooperatively use information across different environments.|
|Visibility & Observability|System’s capability to provide insights into its operations, both superficially and in-depth.|
|Fault Tolerance|System’s ability to continue functioning correctly in the event of failures of one or more of its components.|
|Availability|System’s operational performance and uptime, usually measured as a percentage.|
|Cost|Total cost of ownership, both in terms of building and running the system.|
|Simplicity|Ease of understanding, developing, and maintaining the system, keeping complexity at a manageable level.|