An UberService to test them all

One of Elon’s tweets brought back an interesting case I handled a few years back where we squeezed about 15 services into 1.

The issue

Part of building a healthy codebase from day one is making sure that you have good test coverage. In fact, you might become a victim of your own success, which is what happened to us: early on, we built such a friendly and streamlined integration testing (IT) framework that our engineers used it extensively – far more than they should have. We wrote tests at the scope of either unit tests or ITs. This had two effects on the system:

  1. Quality was excellent! – Practically anything deployed to production had full test coverage and very few bugs.
  2. Productivity was coming to a halt – Our friendly IT testing framework was causing everyone to write tests that took too long to run, required most services to be up and running, and caused your laptop to glow in the dark.

Keep in mind that we were an early-stage startup; we didn’t have a dedicated DevOps team to handle developer experience (DevEx). At that point in time, developers were still responsible for doing DevOps work.

We quickly realized that to solve this issue, we had to cut the habit of constantly adding new ITs and move to component testing – tests that require only a single service to be up in order to execute.

We needed to migrate to component tests instead of our IT testing framework. But in the meantime, dev machines were still wheezing from running all those services locally, and we couldn’t delay new services from coming online just because we hadn’t yet made the move to component tests.

Let’s try something

We needed to buy some time. Was it possible to reduce the footprint required for running ITs on dev machines?

Each JVM-based service required a few hundred megabytes of memory. Multiply that by 15 and you see the issue. We quickly ran out of memory on our machines.

What if we could “squish” the services into a single process? This is the experiment I tried out.

Technical background

Let’s get a few technical details out of the way first:

Our services were based on Dropwizard, a framework that bundles a group of technologies that work well together. The web server is an embedded Jetty. Each service is an instance of a Service class that extends io.dropwizard.Application and has a main() function that calls a run() command.
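
For illustration, a single service looked roughly like this – a minimal sketch, with FooService and FooConfiguration as placeholder names:

import io.dropwizard.Application
import io.dropwizard.Configuration
import io.dropwizard.setup.Environment

class FooConfiguration : Configuration()

class FooService : Application<FooConfiguration>() {
  override fun run(configuration: FooConfiguration, environment: Environment) {
    // Register this service's resources, health checks, etc.
  }
}

fun main(args: Array<String>) {
  // Boots the embedded Jetty server, typically invoked as: server foo.yml
  FooService().run(*args)
}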

The solution

What if we could bundle all of these run() commands into a single main() function? Would they run well together?

It turns out that multiple Jetty servers can work well within a single JVM process! All we had to do was a bit of deduplication of ports and resources:

  • We had to make sure that each service listened on a different port (see the config sketch right after this list)
  • Resources, such as upgrade scripts, that sat in the same subdirectory in all services had to move to uniquely named subdirectories
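
Since each Dropwizard service reads its ports from its own config file, the port deduplication was mostly a config concern. A hedged sketch – the file name and port numbers are illustrative:

# foo.yml
server:
  applicationConnectors:
    - type: http
      port: 8100
  adminConnectors:
    - type: http
      port: 8101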

The code ended up looking something like this:

class UberService {
  private val allServices by lazy {
    mapOf(
      "foo" to FooService(),
      "bar" to BarService(),
      ...
    )
  }

  fun startAllServices(servicesToStart: List<String>) {
    // Launch each requested service in its own thread, since run() blocks
    // (the config file naming is illustrative)
    allServices
      .filterKeys { it in servicesToStart }
      .forEach { (name, service) ->
        Thread({ service.run("server", "$name.yml") }, "uber-$name").start()
      }
  }
}

fun main(args: Array<String>) {
  // The list of services that we want to start comes from args
  val include = args.toList()
  UberService().startAllServices(include)
}

Nothing too complex going on here! Since many of the services rarely needed to operate at scale on a developer machine, this single process worked well with a memory footprint of much less than 1GB.
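
Launching the process is then just a matter of listing the services you need – assuming, hypothetically, an uber-service.jar:

java -jar uber-service.jar foo bar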

This idea ended up buying us about 3 years of breathing room. While these days we have other, better-designed solutions for testing our code, such as private Kubernetes namespaces for developers, you can still run UberService on your local machine with an input list of dozens of services and it will work nicely.

So, getting back to Elon’s tweet, maybe there is a way to tell your boss that you eliminated 80% of your microservices 🙂

Keeping dependencies up to date in Gradle toml files

TL;DR – This post shows you how to create a script that checks for available upgrades to third party dependencies in a Gradle libs.versions.toml file

Managing dependencies – What’s missing?

Keeping dependencies up to date is part of managing ongoing technical debt. It’s not the most glamorous task, but neglect it and you will find yourself in trouble sooner or later.

That’s why it’s important to make it as easy as possible for the organization to “do the right thing” and keep those dependencies up to date with the least friction possible. A great tool for this is Dependabot, which automatically creates PRs with suggested upgrades. We use it a lot. Unfortunately, Dependabot has a limitation – currently, it does not support upgrading *.toml files. We won’t know that it’s time to upgrade the libraries listed in the file!

What’s a toml file?

toml is a file format for storing configurations. It’s equivalent to a json, yaml or ini file, and is easily and clearly parseable. A full description of its capabilities can be found on the toml project page.

Toml & Gradle

In Gradle, we can define a libs.versions.toml file and use it as a version catalog that is shared between projects. This is great for aligning the versions of third party dependencies and making sure all projects work with the same third party versions. It prevents all sorts of nasty bugs that arise when different projects depend on different versions of the same library.

Let’s take a look at a sample libs.versions.toml file:

[versions]
aws = "1.12.311"
commonsIo = "2.11.0"
...

[libraries]
awsSdkCore = { module = "com.amazonaws:aws-java-sdk-core", version.ref = "aws" }
awsSdkS3 = { module = "com.amazonaws:aws-java-sdk-s3", version.ref = "aws" }
awsSdkSqs = { module = "com.amazonaws:aws-java-sdk-sqs", version.ref = "aws" }
awsSdkSns = { module = "com.amazonaws:aws-java-sdk-sns", version.ref = "aws" }
commonsIo = { module = "commons-io:commons-io", version.ref = "commonsIo" }
...


As you can see, it’s fairly straightforward and very readable: the versions section defines the version numbers, while the libraries section defines all our dependencies, using the version.ref property to refer to a version. This enables us to define a single version for multiple dependencies that belong to the same “family”.

We now need to add a reference to the libs.versions.toml file in our Gradle build. We do that in the settings.gradle.kts file:

dependencyResolutionManagement {
  versionCatalogs {
    create("externalLibs") {
      from(files("gradle/libs.versions.toml"))
    }
  }
}

Now, we can reference the versions using the externalLibs prefix in our build.gradle.kts file:

dependencies {
    implementation(externalLibs.awsSdkCore)
}

And that’s it!

Creating a script

OK, so back to our problem: we can’t use Dependabot to generate PRs for updating the libs.versions.toml file. So what’s the next best thing?

Reviewing the file for available updates is a grueling task. The project that I work on has over 100 different versions in the toml file. Reviewing them one by one and checking each against the latest version on https://mvnrepository.com/ does not scale. Our dependencies will probably fall into neglect if we take this approach.

If we had a script that parses the toml file and compares each current version to the latest available version, we’d have a tool that shows us exactly which upgrades are available. Moreover, if we commit the script’s output file, the next time we run it we’ll be able to see what’s changed since the last run.

After searching for such a script and not finding one, I decided to write one myself.

What are the steps that we need the script to perform?

  1. Extract a list of all the third parties.
  2. Look up the current version for each of these third parties.
  3. Look up the latest release for each of them.
  4. Compare the current version to the latest version and output the diff to a file.

With a few exceptions, all of the third parties are hosted on https://repo1.maven.org/maven2/, where each one has a maven-metadata.xml file containing a <release> tag that holds the latest GA version. That’s our data source for the latest version.
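
For example, a library’s maven-metadata.xml looks roughly like this (abridged, with illustrative values):

<metadata>
  <groupId>commons-io</groupId>
  <artifactId>commons-io</artifactId>
  <versioning>
    <latest>2.11.0</latest>
    <release>2.11.0</release>
    <versions>...</versions>
    <lastUpdated>20210714000000</lastUpdated>
  </versioning>
</metadata>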

Initially, I started out implementing this script in bash. After re-learning bash for the n-th time (always fun…), this was the result (feel free to skip the details here):

grep -E '^.* ?= ?"[^"]*"' libs.versions.toml | \
grep -v module | \
sed -E 's/([^= ]*) ?= ?"([^"]*)".*/\1 \2/g' | \
awk 'FNR == NR { a[$1] = $2 ; next } { print $1 " " a[$2] }' - <( \
  grep module libs.versions.toml | \
  grep version | \
  sed -E 's/^.*{ ?module ?= ?"([^"]*)", ?version\.ref ?= ?"([^"]*)".*/\1 \2/g' \
) | \
sort | \
awk 'FNR == NR { a[$1] = $2 ; next } $2 != a[$1] { print $1 ": " a[$1] " -> " $2 }' - <( \
  grep module libs.versions.toml | \
  grep version | \
  sed -E 's/.*module ?= ?"([^"]*)".*/\1/g' | \
  tr .: '/' | \
  sed 's/^/https\:\/\/repo\.maven\.apache\.org\/maven2\//g' | \
  sed 's/$/\/maven-metadata.xml/g' | \
  xargs curl -s | \
  egrep '(groupId|artifactId|release)' | \
  xargs -n3 | \
  sed 's/\<groupId\>\(.*\)\<\/groupId\> \<artifactId\>\(.*\)\<\/artifactId\> \<release\>\(.*\)\<\/release\>/\1\:\2\ \3/g' | \
  sort \
) > latest_available_upgrades.txt

This script works, but… as one of my reviewers pointed out in the code review (thanks, Shay!), it is hardly maintainable. Good luck debugging it or just understanding what it does.

Now that the proof of concept was, well… proven, I rewrote it in a friendlier language: Kotlin script. For those of you not familiar with Kotlin script, it’s just plain Kotlin with some added annotations that help compile the script. You can run it from the command line without any prior compilation.

Here’s the same code, but in readable Kotlin script:

#!/usr/bin/env kotlin

@file:Repository("https://repo1.maven.org/maven2/")
@file:DependsOn("org.tomlj:tomlj:1.1.0")
@file:DependsOn("org.apache.httpcomponents.core5:httpcore5:5.1.4")

import org.apache.hc.core5.http.HttpHost
import org.apache.hc.core5.http.impl.bootstrap.HttpRequester
import org.apache.hc.core5.http.impl.bootstrap.RequesterBootstrap
import org.apache.hc.core5.http.io.entity.EntityUtils
import org.apache.hc.core5.http.io.support.ClassicRequestBuilder
import org.apache.hc.core5.http.protocol.HttpCoreContext
import org.apache.hc.core5.util.Timeout
import org.tomlj.Toml
import org.xml.sax.InputSource
import java.io.FileReader
import java.io.StringReader
import java.net.URI
import javax.xml.parsers.DocumentBuilderFactory
import javax.xml.xpath.XPathConstants
import javax.xml.xpath.XPathExpression
import javax.xml.xpath.XPathFactory

val coreContext: HttpCoreContext = HttpCoreContext.create()
val httpRequester: HttpRequester = RequesterBootstrap.bootstrap().create()
val timeout: Timeout = Timeout.ofSeconds(5)

val mavenTarget: HttpHost = HttpHost.create(URI("https://repo1.maven.org"))

val dbf: DocumentBuilderFactory = DocumentBuilderFactory.newInstance()
val xpathFactory: XPathFactory = XPathFactory.newInstance()
val releasePath = xpathFactory.newXPath().compile("//metadata//versioning//release")!!

main()

fun main() {
  val toml = FileReader("./libs.versions.toml").use { file ->
    Toml.parse(file)
  }

  val versions = toml.getTable("versions")!!.toMap()
  val libraries = toml.getTable("libraries")!!

  val (currentVersionRefs, missingVersions) = libraries.keySet().associate {
    val module = libraries.getString("$it.module")!!
    val versionRef = libraries.getString("$it.version.ref")
    module to versionRef
  }.entries.partition { it.value != null }

  val currentVersions = currentVersionRefs.associate { it.key to versions.getValue(it.value!!) }

  val latestVersions = currentVersions.mapValues { (packageName, _) ->
    getLatestVersion(packageName, mavenTarget, httpRequester, timeout, coreContext, dbf, releasePath)
  }

  // Compare current to latest version and print if different
  currentVersions.toSortedMap().forEach { (packageName, currentVersion) ->
    val latestVersion = latestVersions.getValue(packageName)
    if (currentVersion != latestVersion) {
      println("$packageName: $currentVersion -> $latestVersion")
    }
  }

  if (missingVersions.isNotEmpty()) {
    println("\n--- Missing version.ref - Need to check manually ---\n")
    println(missingVersions.map { it.key }.sortedBy { it }.joinToString("\n"))
  }

}

fun getLatestVersion(packageName: String, mavenTarget: HttpHost?, httpRequester: HttpRequester, timeout: Timeout?, coreContext: HttpCoreContext?, dbf: DocumentBuilderFactory, releasePath: XPathExpression): String {
  val (groupId, artifactId) = packageName.split(":")
  val path = "${groupId.replace(".", "/")}/$artifactId"
  val request = ClassicRequestBuilder.get()
      .setHttpHost(mavenTarget).setPath("/maven2/$path/maven-metadata.xml").build()
  val xmlString = httpRequester.execute(mavenTarget, request, timeout, coreContext).use { response ->
    if (response.code != 200) return "Could not find value in Maven Central. Need to check manually."
    EntityUtils.toString(response.entity)
  }
  val db = dbf.newDocumentBuilder()
  val document = db.parse(InputSource(StringReader(xmlString)))
  return releasePath.evaluate(document, XPathConstants.STRING).toString()
}

Notes:

  • Third party libraries that were used: https://github.com/tomlj/tomlj & https://github.com/apache/httpcomponents-core (httpcore5)
  • There are a few cases in the toml’s libraries section where we still manage dependencies via the dependencyManagement section of the build.gradle.kts file and the library’s entry does not have a version.ref property. These are not handled by the script and appear in a separate “Missing version.ref” section.
  • There are a few libraries that are not hosted in Maven Central’s repo but rather in Jitpack’s repo – These are also not handled and a “Could not find value in Maven Central” message is printed instead.

Running this script as

./check_latest_versions.main.kts > latest_available_upgrades.txt

we get the following output:

com.amazonaws:aws-java-sdk-core: 1.12.311 -> 1.12.326
com.amazonaws:aws-java-sdk-s3: 1.12.311 -> 1.12.326
com.amazonaws:aws-java-sdk-sns: 1.12.311 -> 1.12.326
com.amazonaws:aws-java-sdk-sqs: 1.12.311 -> 1.12.326

We can see immediately that our AWS libraries are not up to date. After committing latest_available_upgrades.txt, the next time we run this script we will also know when AWS publishes a new version of the libraries.
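
The day-to-day workflow then becomes something like this:

./check_latest_versions.main.kts > latest_available_upgrades.txt
git diff latest_available_upgrades.txt    # upgrades that became available since the last committed run
git commit -am "Update available upgrades snapshot"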

Summary

As you can see, this script automates the chore of checking for third party updates and reduces it to running a single command. It doesn’t create a PR the way Dependabot does, but it gets us most of the way there by quickly identifying which dependencies require an upgrade.

Getting background tasks to play nicely with deployments

This article discusses the shutdown of background tasks during a rolling deployment.

The backstory

Early on, when our team was still small, we would occasionally run into the following error in production:

java.lang.IllegalStateException: ServiceLocatorImpl(__HK2_Generated_0,0,1879131528) has been shut down 
at org.jvnet.hk2.internal.ServiceLocatorImpl.checkState(ServiceLocatorImpl.java:2288) 
at org.jvnet.hk2.internal.ServiceLocatorImpl.getServiceHandleImpl(ServiceLocatorImpl.java:629) 
at org.jvnet.hk2.internal.ServiceLocatorImpl.getServiceHandle(ServiceLocatorImpl.java:622) 
at org.jvnet.hk2.internal.ServiceLocatorImpl.getServiceHandle(ServiceLocatorImpl.java:640) 
at org.jvnet.hk2.internal.FactoryCreator.getFactoryHandle(FactoryCreator.java:103) 
... 59 common frames omitted 
Wrapped by: org.glassfish.hk2.api.MultiException: A MultiException has 1 exceptions. They are:

We quickly realized that this error was due to a long-running job that was in the middle of its run during a deployment. A service would shut down, the DI framework would kill various components, and the job would end up trying to use a decommissioned instance of a class.

Since the team was still small and everyone in the (single) room was aware of each deployment, solving this was put on the back burner – a known issue that was under our control. We had bigger fish to fry.

Eventually, the combination of team size growth and the move to continuous deployment meant that we could no longer afford to ignore this error. An engineer on shift who is no longer familiar with the entire system cannot be relied on to “just know” that this is an “acceptable” error (I would add several more quotes around “acceptable” if I could). The deploys also reached a pace of dozens every day, so we were much more likely to run into the error.

Let’s dive into some more detail about the issue we faced.

The graceful shutdown problem

During deployment, incoming external connections are handled by our infrastructure: the load balancer drains each instance by routing new connections elsewhere and waiting for existing connections to close.

Scheduled tasks, however, are not triggered externally and are not handled by the infrastructure. They require a custom, application-level solution in order to shut down gracefully.

The setting

The task we’re talking about is a background task, triggered by Quartz and running in Dropwizard, with AWS Elastic Beanstalk (EBS) managing deployments. Our backend code is written in Kotlin, but it is simple enough to understand even if you are not familiar with the language.
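
To ground the discussion, here is a minimal sketch of how such a job might be wired up on startup – the job identity and schedule are illustrative, and SomeJob itself is shown later in this post:

import org.quartz.JobBuilder
import org.quartz.SimpleScheduleBuilder
import org.quartz.TriggerBuilder
import org.quartz.impl.StdSchedulerFactory

// Create and start the Quartz scheduler when the service boots
val scheduler = StdSchedulerFactory.getDefaultScheduler().apply { start() }

// Register a recurring background job
val job = JobBuilder.newJob(SomeJob::class.java).withIdentity("someJob").build()
val trigger = TriggerBuilder.newTrigger()
    .withSchedule(SimpleScheduleBuilder.repeatHourlyForever())
    .build()
scheduler.scheduleJob(job, trigger)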

Rolling deployments

Rolling (ramped) deploys are a deployment method where we take the servers that run a service offline one at a time (or in small batches), deploy the new version, wait for each server to come back online, and move on to the next.

This method allows us to maintain high availability while deploying a new version across our servers.

In AWS’ Elastic Beanstalk (EBS), you can configure a service to perform a rolling deploy on the EC2 instances that run a service. The load balancer will stop routing traffic to an EC2 instance and wait for open connections to close. After all connections have closed (or when we reach a max timeout period), the instance will be restarted.
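
For reference, this is roughly how a rolling deploy is configured in .ebextensions – the options live in the aws:elasticbeanstalk:command namespace, and the values here are illustrative:

option_settings:
  aws:elasticbeanstalk:command:
    DeploymentPolicy: Rolling
    BatchSizeType: Fixed
    BatchSize: 1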

The load balancer does not have any knowledge of what’s going on with our service beyond the open connections and whether the service is responding to health checks.

Shutting down our background tasks

Let’s divide what needs to be done into the following steps:

  1. Prevent new background jobs from launching.
  2. Signal long-running jobs and let them know that we are about to shut down.
  3. Wait for all jobs to complete, up to a maximum waiting period.

Getting the components to talk to each other

The first challenge that we come across is triggering the shutdown sequence. EBS is not aware of the EC2 internals, and certainly not of Quartz jobs running in the JVM. All it does, unless we intervene, is send an immediate shutdown signal.

We need to create a link between EBS and Quartz, to let Quartz know that it needs to shut down. This needs to be done ahead of time, not at the point at which we destroy the instance.

Fortunately, we can use Dropwizard’s admin capabilities for this purpose. Dropwizard enables us to define tasks that are mounted under the admin path by extending the abstract class Task. Let’s look at our shutdown task:

import io.dropwizard.servlets.tasks.Task
import org.quartz.Scheduler
import java.io.PrintWriter
import java.lang.Thread.sleep

class ShutdownTask(private val scheduler: Scheduler) : Task("shutdown") {

  override fun execute(parameters: Map<String, List<String>>, output: PrintWriter) {
    // Stop triggering new jobs before shutdown
    scheduler.standby()

    // Wait until all jobs are done, up to a maximum amount of time.
    // This is in order to prevent the immediate shutdown that may occur if there are no open connections to the server
    val startTime = System.currentTimeMillis()
    while (scheduler.currentlyExecutingJobs.size > 0 && System.currentTimeMillis() - startTime < MAX_WAIT_MS) {
      sleep(1000)
    }
  }
}

Some notes about the code:

  1. The task receives a reference to the Quartz scheduler in its constructor. This allows it to call the standby method in order to stop the launch of new jobs.
  2. We call standby, not shutdown, so that jobs that are running will be able to complete their run and save their state in the Quartz tables. shutdown would close the connection to those tables.
  3. We wait up to MAX_WAIT_MS before continuing. If there are no running jobs, we continue immediately.
  4. EBS does not have a minimum time window during which it stops traffic to the instance. If there are no open connections to the process, EBS will trigger a shutdown immediately. This is why we need to check for running jobs and wait for them, not just call the standby method and move on.

Given a reference to Dropwizard’s environment on startup, we can call

environment.admin().addTask(ShutdownTask(scheduler))

to initialize the task. Once added to the environment, we can trigger the task via a POST call to

http://<server>:<adminport>/tasks/shutdown
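
For example, from the instance itself – assuming Dropwizard’s default admin port of 8081:

curl -X POST http://localhost:8081/tasks/shutdown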

The last thing we need to do is add this call to EBS’ deployment. EBS provides hooks for the deploy lifecycle – pre, enact and post. In our case we will add it to the pre stage. In the .ebextensions directory, we’ll add the following definition in a 00-shutdown.config file:

files:
  "/opt/elasticbeanstalk/hooks/appdeploy/pre/00_shutdown.sh":
    mode: "000755"
    owner: root
    group: root
    content: |
      #! /bin/bash
      echo "Stopping Quartz jobs"

      curl -s -S -X POST http://<server>:<adminport>/tasks/shutdown

      echo "Quartz shutdown completed, moving on with deployment."

So far we have accomplished steps #1 and #3 – preventing new jobs from launching and waiting for running jobs to complete. However, if there are long-running jobs that take more than MAX_WAIT_MS to complete, we will hit the timeout and they will still be terminated unexpectedly.

Signaling to running jobs

We set a timeout of MAX_WAIT_MS to protect our deployment from hanging due to a job that is not about to terminate (we could also add a --max-time limit to the curl command).

Many long-running jobs are batch jobs that process a collection of items. In this step, we would also like to give them a chance to terminate in a controlled fashion: record their state, notify other components of their termination, or perform any other necessary cleanup.

Let’s give these jobs this ability.

We’ll start with a standard Quartz job that has no knowledge of the state of the system:

class SomeJob: Job {
  override fun execute(context: JobExecutionContext) {
    repeat(1000) { runner ->
      println("Iteration $runner - Doing something that takes time")
      sleep(1000)
    }
  }
}

We want to be able to stop the job. Let’s define an interface that will let us do that:

interface StoppableJob {
  fun stopJob()
}

Now, let’s implement the interface:

class SomeStoppableJob: Job, StoppableJob {

  @Volatile
  private var isActive: Boolean = true

  override fun execute(context: JobExecutionContext) {
    repeat(1000) { runner ->
      if (!isActive)  {
        println("Job has been stopped, stopping on iteration $runner")
        return
      }

      println("Iteration $runner - Doing something that takes time")
      sleep(1000)
    }
  }

  override fun stopJob() {
    isActive = false
  }
}

This setup allows us to stop execution in an orderly fashion. Notice that our flag needs to be volatile, so that a change made from the shutdown thread is immediately visible to the thread running the job. Now we can modify ShutdownTask to halt the execution of all running jobs:

class ShutdownTask(private val scheduler: Scheduler) : Task("shutdown") {
 
  override fun execute(parameters: Map<String, List<String>>, output: PrintWriter) {
    // Stop triggering new jobs before shutdown
    scheduler.standby()

    scheduler.currentlyExecutingJobs
        .map { it.jobInstance }
        .filterIsInstance(StoppableJob::class.java)
        .forEach {
          it.stopJob()
        }

    // Wait until all jobs are done up to a maximum of time
    // This is in order to prevent immediate shutdown that may occur if there are no open connections to the server
    val startTime = System.currentTimeMillis()
    while (scheduler.currentlyExecutingJobs.size > 0 && System.currentTimeMillis() - startTime < MAX_WAIT_MS) {
      sleep(1000)
    }
  }
}

Conclusion

Looking at the implementation, we now have the ability to control long-running tasks and have them complete without incident.

It reduces noise in production that would otherwise send engineers scrambling to check whether there is a real problem or just a deployment artefact, and it lets us focus on real issues and improve the quality of our on-call shifts.

Overall, well worth the development effort.

[update] For a discussion on how to implement shutdown flags in Kotlin, see https://galler.dev/kotlin-code-styles-for-stopping-loop-iterations/