A non-artistic visualization of databases, pipelines, etc. Caption reads Navigating the Challenges of CI/CD for Relational Databases

Everyone seems to want agility, and the DevOps movement has provided tools and practices to enable this. While the tools available today are wonderful, there remain unique challenges when it comes to delivering code changes that also include database changes.

I’ve been building database pipelines on and off for over 20 years now. The tools may have changed, but the core principles have not. It would be trivial to write a talk, essay, or blog post that was just a tour of syntax and an entertaining summarization of the documentation. You don’t need me to read the docs only to regurgitate them for you into prose or PowerPoint. Instead, I’m going to take a holistic approach that encompasses mindset, skillset, and toolset. What I hope will make this work stand apart from similar content out there is that I will be writing from the perspective of a (recovering) DBA, sharing vital insights that are more relevant than ever.

The Reluctant DBA

I never wanted to become a DBA; it was never a career goal for me. In fact, in high school I took a career aptitude test, and one of the top five options it spat out was “Database Administrator.” I immediately crossed that off my list because it sounded terrible, yet a mere six years later my official title was Database Administrator and that was my job. Although I am not a DBA anymore, the skills I developed have been my ace-in-the-hole for my entire 25-year career, and they will continue to be for the foreseeable future. What I will teach you in this post will be your ace-in-the-hole as you drive more agility in your teams and organizations.

My first job involved building a data-intensive client-server application. Out of convenience, we placed a lot of the business logic in the database layer. There are, of course, notable trade-offs to this approach, but in this context it was generally worthwhile. The two driving factors were as follows:

First, it was more efficient at that scale. Pushing computation closer to the data improved network efficiency; virtually every domain workflow involved making an atomic change to database state by inserting, updating, and/or deleting multiple records. It just didn’t make sense to pull all that data to the client, perform a bunch of manipulations on it, then push it back when a single call to a stored procedure would accomplish the same thing without the latency and consistency issues.
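To make that concrete, here is a minimal sketch of the kind of stored procedure I’m describing. It’s written in T-SQL against an invented order-fulfillment schema (none of these table or column names come from the actual application); the point is simply that one round trip performs the whole unit of work inside one transaction.

```sql
-- Hypothetical example: fulfilling an order as one atomic unit of work.
-- One call from the client, one transaction in the database.
CREATE PROCEDURE dbo.FulfillOrder @OrderId INT
AS
BEGIN
    SET NOCOUNT ON;
    BEGIN TRANSACTION;

    -- Decrement stock for every line item on the order.
    UPDATE i
    SET i.QuantityOnHand = i.QuantityOnHand - ol.Quantity
    FROM dbo.Inventory AS i
    JOIN dbo.OrderLines AS ol ON ol.ProductId = i.ProductId
    WHERE ol.OrderId = @OrderId;

    -- Mark the order as shipped.
    UPDATE dbo.Orders
    SET Status = 'SHIPPED', ShippedAt = SYSUTCDATETIME()
    WHERE OrderId = @OrderId;

    -- Remove the order from the pending-fulfillment queue.
    DELETE FROM dbo.FulfillmentQueue
    WHERE OrderId = @OrderId;

    COMMIT TRANSACTION;
END;
```

The alternative, pulling the inventory rows, order lines, and queue entries to the client and writing them back one table at a time, means several round trips and a window in which another client can observe half-applied state.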

Second, it eliminated a lot of day-to-day deployment complexity. With a fat client installed on dozens or hundreds of computers, it was generally easier to patch a stored procedure once than to update all of those clients. Client deployments didn’t go away, but they became relatively rare.

The consequence of all this is that I became reasonably skilled with SQL. As a language, SQL is relatively simple; the syntax can be learned in a matter of hours, but to be truly effective you must understand the paradigms as well.
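A small, hypothetical illustration of what I mean by paradigms (the table and column names are invented): the procedural instinct is to loop over rows one at a time, while the relational instinct is to describe the whole set and let the engine do the work. Both of the T-SQL snippets below produce the same result.

```sql
-- Row-by-row, the procedural way (a cursor): verbose and usually slow.
DECLARE @EmployeeId INT;
DECLARE emp_cursor CURSOR FOR
    SELECT EmployeeId FROM dbo.Employees WHERE DepartmentId = 42;
OPEN emp_cursor;
FETCH NEXT FROM emp_cursor INTO @EmployeeId;
WHILE @@FETCH_STATUS = 0
BEGIN
    UPDATE dbo.Employees
    SET Salary = Salary * 1.10
    WHERE EmployeeId = @EmployeeId;
    FETCH NEXT FROM emp_cursor INTO @EmployeeId;
END;
CLOSE emp_cursor;
DEALLOCATE emp_cursor;

-- Set-based, the relational way: one statement, optimized by the engine.
UPDATE dbo.Employees
SET Salary = Salary * 1.10
WHERE DepartmentId = 42;
```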

Because I divided my time almost equally between the code and the database, management and administration just became part of my job. It was a time when the stack was simple enough that one could style oneself as a “full stack developer” with sincerity.

One or two jobs later, I found myself in an organization where my initial model no longer fit. We were still building fat-client applications, but the original developers and current maintainers were much more comfortable in the code space; the database side was simply a necessary evil. In that organization, I was not only the most proficient with the DBMS, but I was also perfectly happy to work in that space. I became the de facto subject matter expert (SME).

A few years later, I found myself moving back to the United States at the worst possible time. There had been a tectonic shift in the development landscape, and the coding skills in the language I had immersed myself in for the past six years had become obsolete. I had a few choices:

  1. Lie on my resume and misrepresent my fluency with the new language.
  2. Join a company with a large legacy codebase to replace the developers who fled to pursue more interesting and modern pastures.
  3. Swallow my pride and press the reset button on my career, accepting a junior developer position.

I didn’t like any of those options. I realized that the only current skill I had was SQL. “It’s too bad I can’t get a job as a database developer,” I thought to myself. It turns out there was a great need for exactly that: many organizations had developers who were great with code but lacked the deep SQL skills to round out the skillset needed to succeed.

The environment I found myself in was radically different. These databases were orders of magnitude larger than those I had previously worked with and operated at a scale and throughput that we would describe today as “web-scale.” Every millisecond mattered and uptime was everything. At the time, that organization was an outlier, but this is increasingly becoming the norm in our industry. That DBA job 20 years ago taught me that, to succeed in this space, it is necessary not only to understand the database language, syntax, and paradigms, but also to possess a sufficient understanding of the internals of the database. It was there that I learned how complex database changes could be, and how to manage that complexity.

This is what I aim to teach you in this two-part series.

DBOps and the Impedance Mismatch

Database and application development have always been at odds, particularly at the intersection of OOP and relational paradigms. The meteoric rise in popularity of ORM tools has succeeded in hiding this problem for a time, but the rise of DevOps and the increasing need for business and product agility are changing the equation. We can no longer paper over the fundamental difference between these two worlds. To effectively practice DevOps, developers must have sufficient working knowledge of both development concerns and operational concerns.

ORMs tell us a comforting lie: that persistence is just an implementation detail that can be safely abstracted away. The abstraction is fine until the fundamental differences between these two worlds rear their ugly heads and bite us in the ass in the form of performance bottlenecks, data consistency issues, or deployment friction. ORMs have been so convincing, however, that there remains a tantalizing fantasy that we are just one more tool or abstraction away from eliminating database change deployment friction.

Difference #1 - Code is stateless, data is stateful

Code is inherently stateless. Containers may be easily started, stopped, and replaced with few side effects. State is generally held in the database. Many database changes, however, are stateful, and they are often a necessary component of code changes. Despite all the advancements in developer tooling over the two decades since I wrote my first database migration pipelines, the real challenges haven’t gotten much easier. The tools that exist today want to tell us another comforting lie: that database changes can be managed as easily as code changes. Without the addition of a skillful hand, these tools are simply a loaded gun that will shoot you in the foot when you least expect it. The consequences range from minor to major, with the major ones being significant downtime or data loss.
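To show what I mean by stateful, here is a hypothetical T-SQL migration (invented schema, not tied to any particular migration tool) that splits a full-name column into first and last names. A faulty code deployment can be rolled back by redeploying the previous artifact; once step 3 below has run, there is no previous artifact to redeploy, only the data you kept or the data you lost.

```sql
-- Step 1: additive change. Safe and easily reversible.
ALTER TABLE dbo.Customers
    ADD FirstName NVARCHAR(100) NULL,
        LastName  NVARCHAR(100) NULL;

-- Step 2: backfill. Stateful: it must touch every existing row,
-- and its cost grows with the size of the table.
UPDATE dbo.Customers
SET FirstName = LEFT(FullName, CHARINDEX(' ', FullName + ' ') - 1),
    LastName  = LTRIM(SUBSTRING(FullName, CHARINDEX(' ', FullName + ' '), 200));

-- Step 3: destructive change. NOT reversible by "deploying the old version";
-- once this runs, the original column exists only in your backups.
-- ALTER TABLE dbo.Customers DROP COLUMN FullName;
```

Part two will get into how to sequence changes like this safely; the point here is simply that the database carries state that a redeploy cannot recreate.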

Data > Code

In my career, I’ve done some dumb things. So many, in fact, that it’s difficult to know where to begin. Some 20 years ago (almost to the day) I started a new job at a company, and on my first day I was given a tour of the server room. Somehow we ended up behind the racks, and I noticed these structures surrounding the power cable on each server. I (incorrectly) deduced that these things prevented the power cable from being unplugged. “Oh, this is cool,” I said, as I clutched the body of the power cable, “they must prevent the cord from getting yanked out.” Before my new manager could say anything or stop me, I tested my (faulty) hypothesis by pulling the power cable right out of the server. The server immediately lost power (I guess I was the very first chaos monkey). It was an app server, and for the next several minutes the apps it hosted were unavailable. It was an inconvenience for the 300 or so employees who were relying on that app’s availability, but within a few minutes everyone continued as though nothing had happened. My boss laughed and it never came up again (except, of course, for the remainder of my tenure, where I was regularly reminded of my stupidity: “Hey, remember when you unplugged a server on your first day?” Good times). In short, of all the dumb things I’ve done in my career, this ranks somewhere near the middle.

Arguably, the worst things I’ve ever done have resulted in data loss… sometimes crippling data loss. A decade later I was working for a company with a fairly mature CI/CD pipeline for code, but database changes were still run manually. I received an email from my boss’s boss asking me to clean up some spurious data for a particular client. I misread the email and started to delete data from the wrong table: the “master” table. There were cascading deletes in place to maintain referential integrity (my boss liked his referential integrity like he liked his restraining orders… enforced), so each row deleted resulted in hundreds of other rows being deleted across dozens of tables. This delete was going to take all night. My SQL-fu was strong, of course, so I whipped up a quick administrative script that would delete all the related records in batches, then ultimately delete the core records. Now my all-night delete would be completed in about an hour or so; I let it loop and churn. About an hour later I got an email asking if I had completed that delete yet, and I replied, “It’s still in progress; I expect it will be done in about 20 minutes.” Seconds later I looked up and my boss’s boss was standing next to me asking “What the fuck are you doing?!” I was quite proud of my delete script, so I showed him. I ended up deleting almost all of the company’s #1 client’s data. Although we had backups, a great deal of data was gone for good; there was no bringing it back.
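For what it’s worth, the batching technique itself is sound; the fatal mistake was the target, not the pattern. A hypothetical sketch of the general shape of such a script, in T-SQL with invented table names (not the script from the story), looks something like this:

```sql
-- Delete child rows in small batches to keep each transaction short, then
-- remove the parent rows. The WHERE clause is the part you triple-check.
DECLARE @ClientId    INT = 123;   -- hypothetical client id
DECLARE @BatchSize   INT = 5000;
DECLARE @RowsDeleted INT = 1;

WHILE @RowsDeleted > 0
BEGIN
    DELETE TOP (@BatchSize) od
    FROM dbo.OrderDetails AS od
    JOIN dbo.Orders AS o ON o.OrderId = od.OrderId
    WHERE o.ClientId = @ClientId;

    SET @RowsDeleted = @@ROWCOUNT;
END;

DELETE FROM dbo.Orders
WHERE ClientId = @ClientId;
```

Batching keeps each transaction short, which is great for the server and terrible for you when the target is wrong, because most of the damage is already committed by the time anyone notices.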

I’ve deployed defects to production and I’ve brought down production apps, but those were never more than an inconvenience; data loss, on the other hand, is almost always a disaster.

Difference #2 - The customer’s data is more valuable than your code.

Embracing the DevOps Mindset

I find DevOps fascinating. Rarely in our industry do we fundamentally rethink how we do things and make difficult changes for the better; generally we just throw more code at the problem. Done right, the practice of DevOps is transformative, but most organizations do just enough to check the box just under “agile” and continue business as usual. You know what I mean by that, right?

For a lot of organizations, some CIO somewhere saw the cover of a magazine that asked in large, intimidating letters “Are you agile?” and this CIO gulped and said “OMG - NO!” The next thing you know, everyone is sent to the two-day training, but the only thing that really changes is that you get your orders standing up now instead of sitting down. The “Agile Transformation” looks something like this:

  1. “So we’re doing sprints now instead of milestones… But we need to spend some time building a backlog. Why don’t we start with a ‘Sprint 0’ to build that backlog.”
  2. “Ok, we’re starting sprint 1 and actually building this thing.”
  3. “It’s the end of sprint one, I’ve finished all my stories, but there’s no time for testing left in the sprint… Do they roll over? What do I do while I’m waiting for QA? I guess I’ll just start the next stories.”
  4. “Ok, I’m getting defects from last sprint… Do I fix those or continue on the stories I’ve committed to? I’ll schedule those in the next sprint, I guess.”
  5. “New Plan, we’ll have a stabilization sprint at the end to address all the defects that come in.”
  6. “We finally cut a release - hand it over to ops while we have a party!”

Renaming milestones to “sprints”, the Requirements and Analysis phase to “Sprint 0”, and the QA phase to a “stabilization sprint” does not an agile team make. Nothing has changed except that the company paid six figures for a worthless certificate.

So now the devs are at the release party and a call comes in after they are all 3-5 beers in. Production is down. The devs are flabbergasted, but shrug and say “it worked on my machine - must be a problem on your side.”

Worked Fine In Dev Ops Problem Now Meme

There’s nothing agile about any of that. There are developers in the industry who weren’t alive when the Agile Manifesto was signed and, for many, all they’ve ever known is the worthless bureaucracy marketed as agile. Agile is not SAFe, agile is not Scrum, nor any of the other frameworks, processes, or anything else the slick salespeople are peddling. Agile is the 12 principles; agile is fast feedback. If a framework helps you follow those ideals, great! But a framework is not a replacement for the core principles of the agile movement.

“Don’t ‘do’ agile, be agile.”

Agile had its genesis in the late 1990s, when folks like Kent Beck began building the first unit testing tools. Getting feedback on the quality and correctness of code in seconds from automated tests, instead of days and weeks from QA, was game-changing. The feedback didn’t stop there; the practice also gave fast feedback on his thought processes and mental models. Beck once said “It makes problem solving so easy, it felt like I was cheating somehow.”

The ideas grew, blossomed, and intermingled with those of other forward-thinking folks. TDD might give us fast feedback on the code and the thought process, but we couldn’t get fast feedback on the actual value of software unless we put it in front of a customer.

“Our highest priority is to satisfy the customer through early and continuous delivery of valuable software.”
-Agile Principle #1

The earlier we get feedback, the faster we can course-correct, and the faster we can deliver valuable software. The side effect is that, as things change, we can respond to market uncertainty and shifting conditions very quickly. TDD was a step in the right direction, but there were many bottlenecks.

Bit by bit, we have chipped away at those bottlenecks. Over the next decade or so, we continuously looked for ways to get feedback faster. Automated unit tests tell me I changed behavior as soon as I make the change; static analysis spots errors, broken windows, and vulnerabilities in near real time; integration tests give me feedback on breaking or broken dependencies; continuous integration gives me daily feedback on clashes and issues with other developers’ work. Build pipeline automation enables this feedback at the team or repository level, which, in turn, begins to enable Continuous Delivery… In theory.

The ongoing work is to evaluate every error and failure, posing the question “Is there any way this failure could have been caught sooner?” Could a prod defect have been caught with an integration test? Could an integration test failure have been caught with a unit test? If the development process is plotted on a horizontal axis of time, where development is on the left and production is on the right, how can we keep shifting that feedback further to the left?

Enter DevOps

DevOps has been a natural extension of the shift-left process. How do we get earlier feedback on deployment and environmental issues without actually taking down production in the process?

Docker — “But it works on my machine…” Then we’ll ship your machine ...

We had virtualization, and later containers: self-contained artifacts that behave identically everywhere. To shift the feedback left, the role of operations had to shift left as well; hence DevOps. Developers became responsible for defining that runtime environment. We got increasingly better tools and built them into our pipelines. If something was going to fail due to environmental issues, we’d get that feedback immediately, during development. We could deploy with much more confidence. Moreover, ops was no longer the bottleneck. Dev teams could deploy at any time without manually coordinating with ops, and quickly get the most valuable feedback of all: customer feedback.

We are now very, very good at deploying code. In my pipelines, I can deploy a code change within a matter of minutes. I have a deployment strategy in place that will automatically roll back a faulty or failed deployment. I can deploy code changes confidently with zero downtime. Stateful database changes remain a large problem. There may be a rich tapestry of tooling for DevOps, but DBOps remains a major challenge to this day, and one we cannot solve by simply throwing more code at the problem; code may be part of the solution, but it is far from a complete solution.

If you intend to actually practice DevOps, you must know about the Ops part to be effective.

Captain Phillips Meme captioned "Look at me, I'm the DBA now"

Difference #3 - YOU are now the “reluctant DBA.”

To Be Continued…

Coming up next, we will look at the “how” of DBOps. Automation is key, but the tools alone won’t save you. To fully adopt this practice, a baseline of knowledge is necessary. Part two will cover “just enough” database internals and popular DBOps tools and strategies, and introduce techniques to manage and avoid disruptive changes.

Stay tuned, and thanks for reading!