Architecting Security & Governance Across your AWS Accounts Part 2: Incident Response on AWS.

Published in

ITNEXT

10 min read · Mar 11, 2019

NOTE: This is Part 2 of a multi-part series. For Part 1, please click here.

This is part two of the "Architecting Security & Governance Across your AWS Accounts" series. In this part, we will navigate the narrow alleys of incident response planning on the AWS cloud and implement fun automated incident response activities using Cloud Custodian.

"It takes 20 years to build a reputation and a few minutes of cyber-incident to ruin it," Stephane Nappo said.

According to NIST, a security incident is “An occurrence that actually or potentially jeopardizes the confidentiality, integrity, or availability of an information system or the information the system processes, stores, or transmits or that constitutes a violation or imminent threat of violation of security policies, security procedures, or acceptable use policies.”

Well, that's a mouthful; let's simplify it. An incident is an unintended degradation of your IT services.

In this part, we will look at the processes, AWS services, and tactics you need to easily navigate the foggy nights of security breaches.

There are plenty of incident response frameworks that explain what you need to implement for an effective incident response team, playbook, and automation around your processes.

Let's start by talking about incident response phases and the attack surface on AWS, then try recruiting AWS Lambda to our incident response team to assist when things go wrong; I hope we can afford Lambda's salary :)

Incident response phases in the Cloud (NIST)

When it comes to responding to security events, the phases in the Cloud are mostly the same as in traditional data centers. The technical implementation, however, has changed drastically for the better. Let's explore the phases.

The Seven Incident response phases of a highly effective runbook

1- The Preparation Phase:

This phase is all about, well, you guessed it, preparations. We need to prepare by getting our threat modeling done, shrinking the attack surface, and taking proactive measures to prevent security incidents from occurring in the first place. We must enable logging, monitoring, and encryption and limit the blast radius.

Prepare by applying proactive measures:

1- Data classification: Identify data sensitivity level, owners, and security requirements.

2- Ownership: All resources should be tagged. It's nice to know who owns a set of compromised resources; the owners can tell you about the sensitivity level of the data stored in those resources or about configuration that might have led to the compromise. Hint: We should always tag our resources!

3- Risk Management: Identify threats, risks, and vulnerabilities, figure out your risk appetite, and then manage your risks and vulnerabilities according to the level of risk you're willing to tolerate.

4- Resilience: Architect highly available, fault-tolerant infrastructure. Hint: Use the AWS Well-Architected Tool and read the Well-Architected Framework whitepaper.

5- Principle of least privilege: Use AWS IAM and resource policies to grant only limited access to those who need to access data or operate on your environment.

6- Test (Game days): Test your incident response plan. You will likely find shortcomings you can address before a real security incident occurs.
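The ownership check in particular lends itself to automation. Here is a minimal sketch, assuming a hypothetical tagging standard (`Owner` and `DataClassification`) and an inventory that has already been fetched into plain dictionaries; neither the tag keys nor the inventory shape come from AWS, so adjust both to your own setup:

```python
# Hypothetical tagging standard; replace with your organization's own keys.
REQUIRED_TAGS = {"Owner", "DataClassification"}


def missing_tags(tags, required=REQUIRED_TAGS):
    """Return the required tag keys that are absent or blank, sorted."""
    return sorted(key for key in required if not tags.get(key, "").strip())


def untagged_resources(inventory):
    """Map each non-compliant resource ID to the tag keys it is missing."""
    report = {}
    for resource_id, tags in inventory.items():
        gaps = missing_tags(tags)
        if gaps:
            report[resource_id] = gaps
    return report


if __name__ == "__main__":
    # Made-up inventory, for illustration only.
    inventory = {
        "i-0aaa": {"Owner": "payments-team", "DataClassification": "restricted"},
        "i-0bbb": {"Owner": "payments-team"},
        "vol-0ccc": {},
    }
    print(untagged_resources(inventory))
```

A report like this is most useful before an incident: the resources it flags are the ones nobody will be able to vouch for when something goes wrong.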

Prepare by Logging everything:

There is no excuse for not enabling CloudTrail and other logging services on all accounts and regions. Without them, you won't have the forensic evidence to tell what happened to your assets in the event of a breach.

When you enable CloudTrail, API activity against the AWS services you touch is recorded in CloudTrail, which can also be integrated with CloudWatch Logs and deliver its log files to a centralized S3 bucket.

CloudWatch Events can be consumed by AWS Lambda for real-time remediation and by SNS for real-time notification. It does not matter whether you use the SDK, the console, or the AWS CLI: all actions will be recorded. Any action you take against AWS resources can be used against you :) I know this one is bad, but I promise not to do it again. However, please remember to enable CloudTrail log file validation.
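To make the logging advice concrete, here is a small sketch of the settings a trail would need. The keys mirror the parameters of boto3's CloudTrail `create_trail` call; the trail and bucket names are placeholders I made up, and the actual API calls are left as comments so the snippet runs without AWS credentials:

```python
def trail_settings(trail_name, bucket_name):
    """Keyword arguments for a multi-region trail with log file validation."""
    return {
        "Name": trail_name,
        "S3BucketName": bucket_name,         # centralized log bucket
        "IsMultiRegionTrail": True,          # record activity in every region
        "IncludeGlobalServiceEvents": True,  # capture global services such as IAM
        "EnableLogFileValidation": True,     # digest files that reveal tampering
    }


if __name__ == "__main__":
    # Placeholder names; the bucket must already exist with a suitable policy.
    settings = trail_settings("org-audit-trail", "example-central-logs")
    print(settings)
    # With boto3 installed and credentials configured, roughly:
    #   import boto3
    #   cloudtrail = boto3.client("cloudtrail")
    #   cloudtrail.create_trail(**settings)
    #   cloudtrail.start_logging(Name=settings["Name"])
```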

Prepare by Limiting the Blast Radius

Use AWS Organizations, VPCs, subnet NACLs, EC2 security groups, and similar controls to limit the blast radius by isolating AWS accounts and resources based on business units, products, etc.

Combined with principles like defense in depth, this approach can provide excellent protection against threats.

Prepare by Encrypting Everything:

Data privacy professionals would tell you, “treat your data as if everyone is looking at it all the time because they might be.”

Encryption is the process of masking data by using an encryption algorithm and an encryption key. If a robust encryption algorithm is used, bad actors won't be able to read your data as plaintext even if they intercept it in transit or access it at rest.

AWS gives you options to encrypt your data, including but not limited to KMS.

AWS KMS and the other services that encrypt your data directly use a method called envelope encryption to balance performance and security: a data key encrypts your data, and a master key held in KMS encrypts the data key.
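The shape of envelope encryption is easier to see in code. The sketch below is a toy: it swaps in a hand-rolled XOR keystream for a real cipher so it runs with the standard library alone, and it must not be used to protect real data. With KMS, the master key never leaves the service, and a call such as `GenerateDataKey` hands you both the plaintext data key and its encrypted copy.

```python
import hashlib
import os


def _keystream_xor(key, data):
    """Toy stream cipher: XOR data with a SHA-256 counter keystream.

    NOT secure (no nonce, no authentication). It only stands in for a real
    cipher such as AES-GCM so the example has no third-party dependencies.
    """
    stream = bytearray()
    counter = 0
    while len(stream) < len(data):
        stream.extend(hashlib.sha256(key + counter.to_bytes(8, "big")).digest())
        counter += 1
    return bytes(a ^ b for a, b in zip(data, stream))


def envelope_encrypt(master_key, plaintext):
    """Encrypt data under a fresh data key, then wrap that key with the master key."""
    data_key = os.urandom(32)  # one data key per object
    ciphertext = _keystream_xor(data_key, plaintext)
    wrapped_key = _keystream_xor(master_key, data_key)
    # Only the wrapped key is stored next to the ciphertext; the plaintext
    # data key is thrown away after use.
    return {"ciphertext": ciphertext, "wrapped_key": wrapped_key}


def envelope_decrypt(master_key, envelope):
    """Unwrap the data key with the master key, then decrypt the data."""
    data_key = _keystream_xor(master_key, envelope["wrapped_key"])
    return _keystream_xor(data_key, envelope["ciphertext"])


if __name__ == "__main__":
    master_key = os.urandom(32)  # stands in for a key that would live in KMS
    envelope = envelope_encrypt(master_key, b"customer records")
    print(envelope_decrypt(master_key, envelope))
```

The performance benefit comes from the split: bulk data is encrypted locally with the data key, and only the small data key makes the round trip to the key service.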

2- The Identification Phase:

Indicators of compromise are many: an AWS GuardDuty high-severity alert that your EC2 instance is making outbound calls to a domain associated with bitcoin mining, for example, or, less desirably, a production service going down.

The identification phase is where we discover that an incident has occurred. The best way to implement the identification phase is by setting up alerts for all security findings that you deem critical, such as AWS root account logins.
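As an illustration of the root-login alert, here is the kind of event pattern a CloudWatch Events rule could use, together with a deliberately simplified matcher so the pattern can be tried against sample events locally. The matcher is my own stand-in and supports only exact "any of these values" matching; the real service understands more operators.

```python
# Event pattern for root console sign-ins delivered through CloudTrail.
ROOT_SIGN_IN_PATTERN = {
    "detail-type": ["AWS Console Sign In via CloudTrail"],
    "detail": {"userIdentity": {"type": ["Root"]}},
}


def matches(pattern, event):
    """Simplified pattern match: every pattern key must be present in the
    event; a list means "any of these values"; nested dicts recurse."""
    for key, expected in pattern.items():
        if key not in event:
            return False
        actual = event[key]
        if isinstance(expected, dict):
            if not isinstance(actual, dict) or not matches(expected, actual):
                return False
        elif actual not in expected:
            return False
    return True


if __name__ == "__main__":
    # Trimmed-down sample event, for illustration only.
    event = {
        "detail-type": "AWS Console Sign In via CloudTrail",
        "detail": {"userIdentity": {"type": "Root"}, "eventName": "ConsoleLogin"},
    }
    print(matches(ROOT_SIGN_IN_PATTERN, event))
```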

Here are some of the questions you need to answer in the event you realize that an incident has occurred:

1- Why: Knowing the intentions behind the breach can help you identify the products and resources in scope and the nature of the attack.

2- What: Identify the lost data and damaged resources and what effort and resources you need to clean up, isolate, and mitigate.

3- How: Determine weaknesses they leveraged to gain unauthorized access to your system.

4- When: Make sure you keep track of all the events and when each one happened.

5- Who: Who is the bad actor?

Relying on humans alone to identify security issues is a bad practice, as machines are better than we are at correlating outliers and anomalies. Automated incident response using AWS services is the way to go, and you can build on it by applying machine learning and security analytics.

3- The Containment Phase:

You've been through all the tedious identification steps. Now what? This phase involves isolating the security threat so it cannot spread or do further damage.

We should have a CloudFormation or a Terraform template with all the resources needed to build an isolated environment for forensic investigation. Some of the actions we need to take are:

1- Move your AWS account to an organizational unit with restricted AWS organization service control policies.

2- Apply deny policies to the affected S3 buckets.

3- Restrict security groups so they only allow ports designated for the investigation.

4- Attach a global DENY * IAM policy to all entities that aren't involved with this phase.
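The global deny from the last step is a very small policy document. A sketch of it is below; the `Sid` is a name I picked for illustration. Because an explicit deny overrides any allow in IAM policy evaluation, attaching this to an entity cuts off its access regardless of what else is attached.

```python
import json


def deny_all_policy():
    """IAM policy document that explicitly denies every action on every resource."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "ContainmentDenyAll",  # illustrative name
                "Effect": "Deny",
                "Action": "*",
                "Resource": "*",
            }
        ],
    }


if __name__ == "__main__":
    print(json.dumps(deny_all_policy(), indent=2))
```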

Automation should be deployed to stop compromised resources, snapshot volumes, disable KMS keys, and change Route 53 record sets.

4- The Investigation Phase:

The investigation phase involves activities that follow the containment phase, including but not limited to forensics and general log analysis. The investigation phase should reveal when the event occurred, what actions the bad actors took to gain access to our system, the side effects of the breach, and the possibility of this breach happening again based on evidence collected so far. The investigations should occur in an isolated environment, as stated before.

Services you can use:

1- VPC Flow Logs.

2- CloudTrail.

3- CloudWatch.

4- Athena for analyzing logs.
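Before reaching for Athena, it can help to see what a timeline built from raw CloudTrail data looks like. The sketch below assumes you have already downloaded and decompressed a CloudTrail log file, which is a JSON document with a top-level `Records` list; the optional access key filter is just one way to narrow the view to a suspected credential.

```python
import json
from datetime import datetime


def build_timeline(log_file_text, access_key_id=None):
    """Return (time, source, event, ip) tuples sorted by eventTime,
    optionally restricted to a single access key."""
    records = json.loads(log_file_text).get("Records", [])
    if access_key_id is not None:
        records = [
            r for r in records
            if r.get("userIdentity", {}).get("accessKeyId") == access_key_id
        ]

    def event_time(record):
        # CloudTrail timestamps are UTC, e.g. 2019-03-11T10:05:00Z
        return datetime.strptime(record["eventTime"], "%Y-%m-%dT%H:%M:%SZ")

    return [
        (r["eventTime"], r.get("eventSource"), r.get("eventName"), r.get("sourceIPAddress"))
        for r in sorted(records, key=event_time)
    ]
```

Usage would look something like `build_timeline(open("trail.json").read(), "AKIA...")`, where both the file name and the key are placeholders.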

5- The Eradication Phase:

This phase involves the careful handling of affected resources. We delete resources if necessary and move healthy and clean resources to more secure environments.

Encrypted data should be undecipherable by attackers, which means we can perform some of the following:

1- Disable or delete KMS keys

2- Delete spilled files from EBS volumes and move the clean data to new encrypted EBS volumes

3- Delete S3 objects that are encrypted with S3 server-side encryption.

4- Delete S3 objects that are encrypted with KMS or customer-provided keys, along with the CMKs used to encrypt them.

If the data wasn't encrypted, your only option is to restore your storage resource from the last known good state that you can guarantee wasn't tampered with.

6- The Recovery Phase:

Now that we have done everything to identify what has happened, and data cleansing has been performed to the best of our abilities, we need to restore our operations to normal.

Some of the actions we can perform at this stage:

1- Restore resources.

2- Restore network connectivity.

3- Use new and improved access control policies and encryption keys.

4- Monitor your environment for any unusual behaviors.

7- The Follow-up Phase:

The follow-up phase, aka post-mortem, is about the lessons learned and the follow-up activities that need to be implemented to avoid new security incidents and make improvements to our incident runbook.

Automation

If you have made it this far, thank you! Let's get our hands dirty and review a use case that should drive home the ideas we've discussed thus far. Ultimately, having a solid plan isn't enough if you don't walk the walk.

For this activity, we need the following:

1- An isolated AWS environment. Please do not use your production environments; this activity can take resources down.

2- On the isolated AWS environment, AWS GuardDuty must be enabled to generate findings that CloudWatch Events and AWS Lambda will consume.

3- An EC2 instance (t2.micro) to generate some security findings; we can open all ports or something along those lines.

4- We will set up Cloud Custodian so it deploys the serverless components needed for this activity.

Note: you must deploy Cloud Custodian to the region where GuardDuty is enabled.

This solution is straightforward: when our Cloud Custodian policy detects a medium/high severity finding generated by AWS GuardDuty, it should take action against the resource in scope.
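To show what "medium/high severity" means in practice, here is a sketch of the decision the automation makes for each incoming event. GuardDuty findings arrive with `source` set to `aws.guardduty` and a numeric `detail.severity`, with medium starting at 4.0. The function names are mine, and Custodian performs this filtering for us through the policy, so this is only to make the logic visible:

```python
MEDIUM_SEVERITY_FLOOR = 4.0  # GuardDuty rates findings from 4.0 upward as medium or higher


def should_remediate(event, floor=MEDIUM_SEVERITY_FLOOR):
    """True when the event is a GuardDuty finding about an EC2 instance
    with severity at or above the floor."""
    if event.get("source") != "aws.guardduty":
        return False
    detail = event.get("detail", {})
    resource = detail.get("resource", {})
    return (
        resource.get("resourceType") == "Instance"
        and float(detail.get("severity", 0)) >= floor
    )


def instance_id(event):
    """Pull the affected instance ID out of a GuardDuty EC2 finding."""
    return event["detail"]["resource"]["instanceDetails"]["instanceId"]


if __name__ == "__main__":
    # Trimmed-down sample finding, for illustration only.
    finding = {
        "source": "aws.guardduty",
        "detail-type": "GuardDuty Finding",
        "detail": {
            "severity": 8,
            "resource": {
                "resourceType": "Instance",
                "instanceDetails": {"instanceId": "i-0123456789abcdef0"},
            },
        },
    }
    if should_remediate(finding):
        print("remediate", instance_id(finding))
```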

Custodian will take the following actions against the compromised EC2 instance:

1- Remove the IAM role attached to the EC2 instance.

2- Stop the EC2 instance.

3- Snapshot the volume for forensics investigations.

Let's start by installing Cloud Custodian and writing our first policy. In your terminal, run the below commands to install CC:

Then, create a file named custodian.yml with the below content:

All this policy does is stop any EC2 instances that have a tag key "Custodian."
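For reference, a policy of that shape can be sketched as plain data. The snippet below builds it in Python and prints JSON; since JSON is also valid YAML, the output can be saved as custodian.yml. The policy name is a placeholder, and the snippet assumes Cloud Custodian is installed (it is published on PyPI as `c7n`).

```python
import json


def first_policy():
    """Stop every EC2 instance that carries a tag with the key "Custodian"."""
    return {
        "policies": [
            {
                "name": "my-first-policy",  # placeholder name
                "resource": "aws.ec2",
                "filters": [{"tag:Custodian": "present"}],
                "actions": ["stop"],
            }
        ]
    }


if __name__ == "__main__":
    print(json.dumps(first_policy(), indent=2))
```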

Running your first policy:

I am assuming you're using profiles to access your AWS accounts via the CLI; if not, you can look up how to authenticate Custodian using API keys or Custodian's assume-role option. If you are using a profile, below is how you can run the policy locally using an AWS CLI profile. Running locally means the policy won't be triggered by an event generated by CloudTrail or by a predefined schedule.
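A local run with a named profile boils down to a single `custodian run` invocation. The sketch below only assembles and prints the command so it is safe to run anywhere; the profile name `sandbox` and the output directory `out` are placeholders.

```python
import shlex


def custodian_run_command(policy_file, profile, output_dir="out", dry_run=False):
    """Build the argv for a local `custodian run` using a named AWS CLI profile."""
    command = ["custodian", "run", "--profile", profile, "--output-dir", output_dir]
    if dry_run:
        command.append("--dryrun")  # report matching resources without acting on them
    command.append(policy_file)
    return command


if __name__ == "__main__":
    # "sandbox" is a hypothetical profile name from your AWS CLI configuration.
    command = custodian_run_command("custodian.yml", "sandbox", dry_run=True)
    print(shlex.join(command))
    # To execute instead of print:
    #   import subprocess
    #   subprocess.run(command, check=True)
```

Starting with a dry run is a cheap way to confirm the filter matches only the instances you expect before any of them gets stopped.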

If successful, you should see output similar to the following on the command line:

Now that you are a Cloud Custodian expert who can automate security on AWS, let's get our actual policy out there!

Follow the steps provided for your first policy: write the policy, save it, deploy it to an isolated AWS environment, and watch the magic. If GuardDuty generates findings, the policy Lambda will consume the generated event and react by taking the above-mentioned actions.
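A policy along these lines might look like the sketch below, again built as plain data and printed as JSON (which is also valid YAML). It uses Custodian's `guard-duty` execution mode and an event filter on `detail.severity`. The policy name and role ARN are placeholders, and the action names (`set-instance-profile` with no profile name to detach the role, `stop`, `snapshot`) are my reading of the EC2 actions, so please check them against the schema of your installed Custodian version before deploying.

```python
import json


def guardduty_policy(role_arn, min_severity=4.0):
    """Custodian policy that reacts to medium/high GuardDuty findings on EC2."""
    return {
        "policies": [
            {
                "name": "ec2-guardduty-remediate",  # placeholder name
                "resource": "aws.ec2",
                # Deployed as a Lambda function triggered by GuardDuty findings.
                "mode": {"type": "guard-duty", "role": role_arn},
                "filters": [
                    {
                        "type": "event",
                        "key": "detail.severity",
                        "op": "gte",
                        "value": min_severity,
                    }
                ],
                "actions": [
                    {"type": "set-instance-profile"},  # assumed: no name detaches the role
                    "stop",
                    {"type": "snapshot"},  # keep the volumes for forensics
                ],
            }
        ]
    }


if __name__ == "__main__":
    # Placeholder ARN; the role needs permissions for the actions above.
    role = "arn:aws:iam::123456789012:role/CustodianGuardDutyRole"
    print(json.dumps(guardduty_policy(role), indent=2))
```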

Alright! That's it for the incident response part!

Thank you for spending your valuable time! If you would like to see more of this, please 👏 so others can see it, and most importantly, share your experience in the comments for the community.