DC/OS Agent Node Maintenance

Weston Bassler · ITNEXT · Jun 21, 2018

Providing high availability (HA) of infrastructure is one major aspect of my job that I take much pride in. I care about the availability of infrastructure because it is a reflection of my architecture design and of the knowledge I have worked so hard to obtain as an engineer. If I want developers to use my platform to run their services, they need to know that the underlying infrastructure is going to be available at all times and is not going to be a bottleneck to their uptime SLA(s). Granted, this also requires users to design their services to be HA as well, but how can their services be HA if the underlying architecture is not?

In this post, I want to discuss the way we are currently handling DC/OS Agent Node maintenance. I will explain the steps and tools we use to ensure HA for our users during maintenance and discuss some of the things we have learned over the last couple of years in our environment.

Computers suck and they are stupid. We have to expect failure, and at minimum some maintenance, at some point if we want to keep infrastructure running smoothly and securely. This known fact does not change the requirement to provide HA and high SLAs for services. In today’s world of compute and software, we have more control and more opportunities than ever to design well-architected systems that allow for failover, recovery and high availability. In my opinion, if it can’t allow for those things, then it shouldn’t run.

I want to take a quick second to discuss the underlying architecture and platform behind this post and some of the tools we are using to manage our infrastructure.

The Platform

In case you haven’t guessed it by the title — yes, we are going to be discussing DC/OS.

“DC/OS is a distributed operating system based on the Apache Mesos distributed systems kernel. It enables the management of multiple machines as if they were a single computer. It automates resource management, schedules process placement, facilitates inter-process communication, and simplifies the installation and management of distributed services.” — docs.mesosphere.com

Without going too deep into explaining the DC/OS architecture, just note that this is the major platform that provides us with the frameworks that manage our services. We have also chosen to run this architecture in AWS to take advantage of the things that make cloud computing so powerful and awesome. We should also note that most of our services run in Docker, whose many benefits need no explanation.

One other key piece I should mention here is that we use Kafka on DC/OS very heavily for much of what we do. We deploy multiple Kafka clusters in our environment, each typically representing a different application environment (development, QA, test, etc…). Each Kafka cluster could also be a different version or flavor (such as Confluent Kafka), so we must always ensure that we don’t lose too many brokers from the same cluster at once or we risk data loss.

Tooling

As Infrastructure Engineers, we practice Infrastructure-as-Code methods and apply them everywhere possible. We version control, code and automate basically every aspect of infrastructure management and DC/OS. This allows us to move very quickly and efficiently for an extremely small team of very busy engineers. Some of the tools that make this possible and deserve mentioning:

Packer — We use Packer to build our machine images (AMIs) and apply all our environment specifics to them (Monitoring, Logging, Docker Version, etc…). See my previous post on how we are using Packer to build our DC/OS Agent AMIs if interested in learning more.
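To give a rough idea of the workflow (the template filename below is just a placeholder, not our actual file), building an image boils down to validating and then running a Packer template:

# Validate the template first, then build the AMI it describes
packer validate dcos-agent.json
packer build dcos-agent.json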

Terraform — We use Terraform for creating and managing ALL aspects of our infrastructure (EC2 instances, SGs, roles, etc…). We, of course, incorporate the machine images created above with Packer into this. With Terraform we are able to automate the initial creation of the DC/OS cluster as well as use it for scaling and patching. Much of the magic comes from custom scripts and Terraform’s templating capabilities with user-data.

Ansible — We use Ansible pretty extensively, not only for automating different aspects of our infrastructure, but also for managing it through ad-hoc commands. We incorporate playbooks into building our machine images and ansible-vault for securing sensitive information.
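As a rough illustration of what those ad-hoc commands look like (the inventory group name here is hypothetical), checking the state of the agent service across the fleet is a one-liner:

# Run an ad-hoc command against every agent in the (hypothetical) dcos_agents group
ansible dcos_agents -i inventory -b -a "systemctl is-active dcos-mesos-slave"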

The Process

Now that we have talked about the tools that we use to perform maintenance, I want to give an overview of our process and method.

Since we use Infrastructure-as-Code, AWS and a highly distributed system, we completely replace our DC/OS Agents with fresh ones. Essentially what we do is build a brand new machine image, terminate the current DC/OS Agent and then rebuild/replace the Agent with the newly created AMI. There are, of course, things that we must do prior to terminating the instance (more on this below), but that is, at a high level, how we handle maintenance. This will likely make more sense once you see the steps performed below.

We take advantage of being able to treat our DC/OS Agents as ephemeral instead of as long-lived, forever instances. We treat them as if we know we are most likely going to lose the agent due to FS corruption, a kernel panic, someone fat fingering something in AWS, etc. We don’t waste time troubleshooting a failing node (99% of the time) and we don’t want to worry about what compatibility issues might arise if we do a “yum update -y” and reboot. We replace with fully tested machine images to avoid disasters in the future. Remember what I said before: computers suck and they are stupid. They will fail on you at some point.

As mentioned above, before terminating the instance there are a couple of things we must do to ensure we don’t affect services running on our cluster. Although we have designed our services for failover, we must safely drain our Agent Nodes of the services currently running there. We don’t want to take a chance of causing downtime while we are doing maintenance. DC/OS actually has a method of performing node maintenance where you can deregister an Agent from the Masters until further notice or for a given amount of time. We actually take this a step further, as I will show below. Also, since we run several Kafka clusters, we must replace the brokers and fail them over to other available Agents to avoid data loss.

The Steps

Now that we have a better understanding of the process, let’s go through the steps performed during node maintenance. The steps below represent the steps we go through for each Agent:

  1. Build a new machine image from the latest and greatest AMI provided by Amazon and the vendor. We use CentOS 7. Again, you can see this step in an earlier post of mine. This step is only done once and shared for all Agent nodes.
  2. In your Terraform code, modify the ec2 resource used to manage that particular DC/OS Agent and replace the old ami-id with the new ami-id created in step one. Example from Terraform docs.
resource "aws_instance" "dcos_agent_node" {
monitoring = true
ami = "NEW_AMI_ID_HERE"
instance_type = "${var.instance_types["dcos_agent"]}"
...
...
...
}
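Before applying, one way to sanity check that only the intended Agent will be touched is a plan scoped to that resource (optional, but cheap insurance; the resource name matches the example above):

# Preview changes limited to the agent resource from the example above
terraform plan -target=aws_instance.dcos_agent_node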

3. Prior to terminating the DC/OS Agent in AWS, there are other things we must do. We must drain the node of all its running services to ensure safe failover.

In our environment, we first have to check whether there are any Kafka brokers running on the node and, if so, we must “replace” them. You can do this easily using the DC/OS CLI, and you can also drill down into the Nodes tab of the DC/OS UI to find which Kafka cluster the broker belongs to.

dcos kafka --name=<KAFKA_CLUSTER_NAME> pod replace kafka-0

# For older versions
dcos kafka --name=<KAFKA_CLUSTER_NAME> broker replace 0

This will stop the current broker running on the Node and then redeploy a fresh broker to a different Agent Node and begin syncing data for the specified Kafka cluster.
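If you are ever unsure which broker index actually lives on the node, grepping the Mesos task list for the agent’s private IP is one quick way to find out from the CLI (the IP below is just a placeholder):

# List all Mesos tasks and filter by the agent's private IP
dcos task | grep 10.0.7.21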

Second, we must drain all other services running there and remove the agent from the cluster. I mentioned before that the majority of our services run in Docker. Since our version of Docker currently requires the Docker daemon to run containers, and Marathon requires the daemon to be running to deploy services, we send a kill signal to the docker systemd service and then stop it.

sudo systemctl kill -s SIGUSR1 docker && sudo systemctl stop docker

This immediately kills the docker containers running on the node and Marathon will begin to redeploy them elsewhere in the cluster. Also, since the docker daemon is now stopped on the node, we don’t have to worry about Marathon trying to redeploy them on the current Agent node we are working on!
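If you want to watch the failover as it happens, the Marathon deployment list is one quick place to look in addition to the UI:

# Show in-flight Marathon deployments while tasks fail over to other agents
dcos marathon deployment list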

The last step is to remove the Agent node completely from the DC/OS cluster. Fortunately, Mesosphere has provided us with a command, similar to the one above, that will handle this for us.

sudo systemctl kill -s SIGUSR1 dcos-mesos-slave && sudo systemctl stop dcos-mesos-slave

You will soon, if not instantly, see your agent node completely gone from the cluster in both the DC/OS UI and the Mesos UI.
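If you prefer the CLI over the UIs, listing the registered agents and confirming the node is no longer present works just as well (on newer versions of the CLI this may be “dcos node list”):

# List registered agents; the drained node should no longer appear
dcos node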

One cool thing we have figured out is how to kill and stop the Docker daemon and the dcos-mesos-slave service for our instances behind an autoscaling group with SSM. This allows us to automate the entire process of adding the new instance to the DC/OS cluster and safely draining the Node as it is terminated. More on that in a future post!
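To give a rough idea of what that looks like (the tag key/value and the exact command list here are hypothetical, not our actual setup), SSM Run Command can invoke the same drain commands on a tagged instance:

# Send the drain commands to instances matching a (hypothetical) tag via SSM Run Command
aws ssm send-command \
  --document-name "AWS-RunShellScript" \
  --targets "Key=tag:Role,Values=dcos-agent" \
  --parameters 'commands=["systemctl kill -s SIGUSR1 docker","systemctl stop docker","systemctl kill -s SIGUSR1 dcos-mesos-slave","systemctl stop dcos-mesos-slave"]'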

4. Now we log into AWS and initiate instance termination of that DC/OS Agent.

5. Once the instance is completely terminated, we can now replace/redeploy that instance with our Terraform code above using the new AMI that we built in step 1.

terraform apply

After about 2–3 minutes we will see our DC/OS Agent re-register with the cluster, reappear in the DC/OS UI with a fresh OS, and begin accepting reservation requests. Step 5 takes a bit of coding and some customization on our end to get the node automatically added back correctly, because we run custom settings such as attribute tags and disabling CFS. Perhaps a different post on how we use Terraform to manage our DC/OS cluster would be a better fit at some point. Perhaps one day we will also have this entire process automated.
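For context, the attribute tags and CFS setting mentioned above are the kind of agent overrides that, on DC/OS, typically live in /var/lib/dcos/mesos-slave-common and are read when the agent service starts; the values below are placeholders rather than our real configuration:

# Example agent overrides (placeholder values), written before dcos-mesos-slave starts
echo 'MESOS_ATTRIBUTES=environment:dev;rack:us-east-1a' | sudo tee /var/lib/dcos/mesos-slave-common
echo 'MESOS_CGROUPS_ENABLE_CFS=false' | sudo tee -a /var/lib/dcos/mesos-slave-common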

Getting to the point of being able to provide HA to users while doing node maintenance has taken some trial and error. We are now able to completely upgrade a cluster of about twenty or so DC/OS Agent nodes with zero impact to services in a very short period of time. This is huge because it allows us to do maintenance during the day with essentially nobody knowing what is going on. Users are able to complete their testing with no impact and we don’t have to stay up late to do it during off hours. Hopefully, if your environment is similar, you can take the same steps I have shown above and apply them to your node maintenance to ensure the highest availability.
