Automated remediation of cloud misconfigurations was a big theme in
2018, and here at DivvyCloud we expect the trend to continue through 2019. One
of the significant challenges customers face is putting automation into action,
instead of just talking about it.
When enterprises evaluate Cloud Security Posture Management (CSPM)
solutions, automated remediation is frequently the end goal. As with any
enterprise system, it is critical to learn, plan, and prototype your automation
capabilities until the power is fully understood. Our challenge is to help
those clients who are starting from a blank slate to take a “crawl, walk, run”
approach. Running aggressive automated
remediation from day 1 risks causing more issues than it solves. A poor
initial implementation will most likely leave your team averse to future
automated remediation efforts and can create lasting organizational opposition
to automation.
Automation can range from basic notification and logging to fully
automated remediation (the most advanced type of automation). You don’t need to
start with 100% automated remediation from day 1. In fact, most organizations
benefit greatly from working their way through the levels of automation to
fully explore what approaches suit their environment best. In this paper, we’ll
examine the different levels of automation so that, by the end, you’ll be able
to choose the appropriate level for your environment.
First, let’s start by reviewing the benefits of automated remediation:
- Time – humans no longer need to react and take action manually. The actions
are performed automatically, allowing humans to work on higher-value tasks.
Especially at enterprise scale, the time savings can be significant.
- Security – vulnerabilities and problems are addressed immediately upon
discovery, preventing bad actors from capitalizing on issues.
- Consistency – every action runs with the exact same workflow, and
organizations can be sure that the prescribed procedures are always being
followed correctly.
- Compliance logging – provides proof of real-time corrections that keep cloud
environments compliant, rather than relying on periodic audits.
Before Starting: Notification
Notification is the foundational building block of automated remediation, and something you will use on an
ongoing basis. Because you’ll configure notifications to send reports of
remediated events, this is a great place to start for testing.
During the initial rollout, using only notifications for the first
tests is critical because it allows you to audit exactly what would be
remediated without making changes. This is a great way to ensure any actions
that would be invoked will do exactly what you want.
The old saying of “measure twice, cut once” applies here. Do not move
out of the notification stage until you’re able to consistently validate that
you are only receiving notifications for the defined resources.
When you’re ready, the steps are:
1. Decide which resource type you want to remediate and what check will trigger the automated remediation (exposed SSH, public buckets, etc.)
2. Perform several dry-runs with notification only, and make sure the reported results are exactly what you expect.
3. If the notifications and resources that were marked as non-compliant are aligned with expectations, then move on to level 1.
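To make the dry-run step concrete, here is a minimal Python sketch of a notification-only pass. It is an illustration under assumptions, not any product's API: the resource dictionaries, the `exposes_ssh` check, and the `run_check` helper are all hypothetical. With `dry_run=True` (the default) the function only reports what would be remediated and never calls the remediation.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List, Optional


@dataclass(frozen=True)
class Finding:
    resource_id: str
    check: str
    action: str  # "would remediate" in a dry run, "remediated" otherwise


def exposes_ssh(group: dict) -> bool:
    """Hypothetical check: any ingress rule opening port 22 to 0.0.0.0/0."""
    return any(
        rule["from_port"] <= 22 <= rule["to_port"] and rule["cidr"] == "0.0.0.0/0"
        for rule in group.get("ingress", [])
    )


def run_check(
    resources: Iterable[dict],
    check: Callable[[dict], bool],
    check_name: str,
    remediate: Optional[Callable[[dict], None]] = None,
    dry_run: bool = True,
) -> List[Finding]:
    """Report non-compliant resources; only change them when dry_run is False."""
    findings = []
    for resource in resources:
        if not check(resource):
            continue
        if dry_run or remediate is None:
            findings.append(Finding(resource["id"], check_name, "would remediate"))
        else:
            remediate(resource)
            findings.append(Finding(resource["id"], check_name, "remediated"))
    return findings


# Made-up inventory, for illustration only.
groups = [
    {"id": "sg-1", "ingress": [{"from_port": 443, "to_port": 443, "cidr": "0.0.0.0/0"}]},
    {"id": "sg-2", "ingress": [{"from_port": 22, "to_port": 22, "cidr": "0.0.0.0/0"}]},
]
for finding in run_check(groups, exposes_ssh, "exposed-ssh"):
    print(finding)
```

Keeping the dry-run flag as the default means a remediation has to be switched on deliberately, which fits the "measure twice, cut once" advice above.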
Tip: After initial testing and the first stage of remediation have been completed, it is important to keep notifications turned on so that the ongoing results of your remediation are logged for audit.
This is critical for several reasons. If things are breaking and being
fixed automatically, how will you know when something bad is happening? Also,
if there’s someone who is making unintentionally incorrect or insecure changes,
they won’t have any way of being notified that they need to change what they’re
doing.
It’s important that notifications don’t become “noise” or “spam” to the
recipients. To that end, notifications should include as much contextual
information as possible.
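As a rough illustration of what "contextual information" might look like, the sketch below builds a notification payload in Python. The field names (`account`, `region`, `owner`, and so on) and the `build_notification` helper are assumptions chosen for the example, not a prescribed schema. The point is that a recipient can tell what was found, where, and who owns it without opening a console.

```python
import json
from datetime import datetime, timezone
from typing import Optional


def build_notification(
    resource: dict,
    check: str,
    action: str,
    detected_at: Optional[datetime] = None,
) -> str:
    """Return a JSON message that says what was found, where, and who owns it."""
    detected_at = detected_at or datetime.now(timezone.utc)
    tags = resource.get("tags", {})
    payload = {
        "check": check,
        "action": action,
        "resource_id": resource["id"],
        "resource_type": resource.get("type", "unknown"),
        "account": resource.get("account", "unknown"),
        "region": resource.get("region", "unknown"),
        "owner": tags.get("owner", "unassigned"),
        "detected_at": detected_at.isoformat(),
    }
    return json.dumps(payload, sort_keys=True)


# Hypothetical resource record, for illustration only.
message = build_notification(
    {"id": "i-0abc", "type": "instance", "region": "us-east-1", "tags": {"owner": "jdoe"}},
    check="exposed-ssh",
    action="notify-only",
)
print(message)
```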
Sample notification, ticketing, and logging targets include:
• ServiceNow
• Post to API
Level 1: Ensure Visibility and Accountability through
The next step in moving towards automated remediation should focus on
locking down your account fundamentals. There are several initial
configurations that every new cloud account should have, and most of them can
be controlled with automation.
Sample automation for AWS can include:
- CloudTrail – ensure there is one trail logging global services
- CloudTrail – ensure that all regions are logging
- CloudTrail – ensure all logs are being aggregated to a central bucket
- IAM – enforce a complex password policy
- S3 buckets – enable versioning
- S3 buckets – enable logging
- S3 buckets – enable server-side encryption
- VPC – ensure all VPCs have VPC Flow Logging enabled
- EC2 – auto-tag instances that are spun up without an owner tag to populate the
value with the username of whoever created it
- EBS – ensure all volumes associated with an instance are tagged
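Two of the fundamentals above, bucket settings and owner tagging, can be sketched in a few lines of Python. This is a simplified illustration that works on plain dictionaries; the record shapes, the `created_by` field, and both helper names are hypothetical. A real implementation would read and write these settings through the cloud provider's API.

```python
from typing import Dict, List

# Settings every bucket is expected to have enabled (taken from the list above).
REQUIRED_BUCKET_SETTINGS = ("versioning", "logging", "encryption")


def missing_bucket_settings(bucket: dict) -> List[str]:
    """List the required settings that are not enabled on a bucket."""
    return [name for name in REQUIRED_BUCKET_SETTINGS if not bucket.get(name, False)]


def with_owner_tag(instance: dict) -> Dict[str, str]:
    """Return the instance's tags, filling a missing owner from its creator."""
    tags = dict(instance.get("tags", {}))  # copy so the input is not modified
    if not tags.get("owner"):
        tags["owner"] = instance["created_by"]
    return tags


# Hypothetical inventory records, for illustration only.
bucket = {"name": "app-logs", "versioning": True, "logging": False}
instance = {"id": "i-0abc", "created_by": "jdoe", "tags": {"env": "dev"}}
print(missing_bucket_settings(bucket))
print(with_owner_tag(instance))
```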
These fundamentals will save valuable time and increase the security of
your environment. None of these configurations should have a negative impact on
your day-to-day operations or users.
Level 2: More Impactful Best Practices
The automation recommendations in levels 1 and 2 line up closely with
the CIS AWS Foundations Benchmark. These are account fundamentals that any
organization can employ to improve the security and overall hygiene of their
cloud accounts.
The difference between level 1 and 2 is that in level 2 the automation
for account fundamentals will take some planning to ensure it won’t have an
impact on your users. Even so, the level 2 automation will be easier to roll
out than the automation that provides remediation in levels 3 and 4.
Examples of AWS housekeeping:
• Remove all unused security groups that start with launch-wizard*
• Delete the default rules on the default security groups
• Ensure that all custom AMIs are set to private
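As a small sketch of the housekeeping selection logic, the Python below picks out unused `launch-wizard*` security groups and custom AMIs that are not private. The dictionary fields (`attached_to`, `public`) and function names are hypothetical stand-ins for whatever your inventory exposes. The functions only select candidates and leave deletion or modification to a separate, deliberate step.

```python
from fnmatch import fnmatchcase
from typing import Iterable, List


def unused_launch_wizard_groups(groups: Iterable[dict]) -> List[str]:
    """IDs of groups named launch-wizard* that nothing is attached to."""
    return [
        group["id"]
        for group in groups
        if fnmatchcase(group["name"], "launch-wizard*") and not group.get("attached_to")
    ]


def public_custom_amis(images: Iterable[dict]) -> List[str]:
    """IDs of custom images that are shared publicly and should be made private."""
    return [image["id"] for image in images if image.get("public", False)]


# Hypothetical inventory records, for illustration only.
groups = [
    {"id": "sg-1", "name": "launch-wizard-1", "attached_to": []},
    {"id": "sg-2", "name": "launch-wizard-2", "attached_to": ["eni-1"]},
    {"id": "sg-3", "name": "web-tier", "attached_to": []},
]
images = [{"id": "ami-1", "public": True}, {"id": "ami-2", "public": False}]
print(unused_launch_wizard_groups(groups))
print(public_custom_amis(images))
```

Separating selection from action makes it easy to run the selection in notification-only mode first.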
Level 3: Governance and Account Hygiene
Things get a bit more free-form in level 3. The goal here is to make
this automation your own and add actions that bring your company the most value
while affecting day-to-day operations as little as possible.
There are several use cases that may be employed, and you’ll have to explore
which of these best fits your environment:
- enforcement (clean up every X hours, or in line with your software
- Notify / kill expensive instances that are spun up (either by cost, family
type, or specs on the hardware)
- control (clean up unused databases, old snapshots, orphaned resources, etc.)
- exposure cleanup (lock down SSH, RDP, etc.)
- non-main regions unused (kill off or alert for assets outside of primary
regions)
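To show how a couple of these governance rules might be expressed, here is a hedged Python sketch covering expensive instances and non-primary regions. The region set, instance families, and cost threshold are placeholder values invented for the example, and `hygiene_action` is a hypothetical helper. Each organization would substitute its own limits.

```python
from typing import Optional

# Placeholder policy values; these are assumptions, not recommendations.
PRIMARY_REGIONS = {"us-east-1", "us-west-2"}
EXPENSIVE_FAMILIES = {"p3", "x1"}
MAX_HOURLY_COST = 2.00  # in whatever currency your cost data uses


def hygiene_action(instance: dict) -> Optional[str]:
    """Return the action a governance rule would take, or None if compliant."""
    family = instance["instance_type"].split(".")[0]
    if instance["region"] not in PRIMARY_REGIONS:
        return "alert: outside primary regions"
    if family in EXPENSIVE_FAMILIES or instance.get("hourly_cost", 0.0) > MAX_HOURLY_COST:
        return "notify: expensive instance"
    return None


# Hypothetical inventory records, for illustration only.
fleet = [
    {"id": "i-1", "instance_type": "t3.micro", "region": "us-east-1", "hourly_cost": 0.01},
    {"id": "i-2", "instance_type": "p3.2xlarge", "region": "us-east-1", "hourly_cost": 3.0},
    {"id": "i-3", "instance_type": "t3.micro", "region": "eu-west-1", "hourly_cost": 0.01},
]
for item in fleet:
    print(item["id"], hygiene_action(item))
```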
Level 4: “Classic” Automated Remediation
It’s not until level 4 that you’ll begin performing the kind of
automated remediation most people think of when discussing the topic. That’s
because, while these actions are the most exciting form of automation and give
you the most control, they can also do the most damage. Additionally,
when you start rolling out these kinds of actions, you need to make sure that the
organization is completely on board. If these types of actions are being run in
a vacuum, you’re going to have a lot of confused people and potentially a lot
of broken systems.
Examples of level 4 automated remediations in AWS:
- old API keys
- newly provisioned SG rules which do not have a description
- instances that aren’t tagged properly, or aren’t running a golden AMI
- Lock down public S3 buckets (zero out ACLs and bucket policy)
- Add in exceptions (by tag, name, etc.)
- Add in
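The old-API-key and exception items in the list above can be combined into one small Python sketch. The 90-day threshold, the record fields, and the helper names `is_exempt` and `keys_to_disable` are assumptions made for illustration. The idea is simply that an exception list is consulted before any destructive action is chosen.

```python
from datetime import datetime, timedelta, timezone
from typing import Dict, FrozenSet, Iterable, List

MAX_KEY_AGE = timedelta(days=90)  # hypothetical threshold


def is_exempt(resource: dict, exempt_tags: Dict[str, str], exempt_names: FrozenSet[str]) -> bool:
    """True if the resource matches an exception by name or by tag value."""
    if resource.get("name") in exempt_names:
        return True
    tags = resource.get("tags", {})
    return any(tags.get(key) == value for key, value in exempt_tags.items())


def keys_to_disable(
    keys: Iterable[dict],
    now: datetime,
    exempt_tags: Dict[str, str],
    exempt_names: FrozenSet[str] = frozenset(),
) -> List[str]:
    """IDs of keys older than MAX_KEY_AGE that are not covered by an exception."""
    return [
        key["id"]
        for key in keys
        if now - key["created_at"] > MAX_KEY_AGE
        and not is_exempt(key, exempt_tags, exempt_names)
    ]


# Hypothetical key records, for illustration only.
now = datetime(2019, 6, 1, tzinfo=timezone.utc)
keys = [
    {"id": "key-old", "name": "ci", "created_at": now - timedelta(days=200), "tags": {}},
    {"id": "key-new", "name": "dev", "created_at": now - timedelta(days=10), "tags": {}},
    {"id": "key-kept", "name": "legacy", "created_at": now - timedelta(days=400),
     "tags": {"remediation": "exempt"}},
]
print(keys_to_disable(keys, now, exempt_tags={"remediation": "exempt"}))
```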
Other Considerations — Timing
It’s important to mention the concept of timing for the actions that
are in levels 3 and 4. It’s appealing to initially roll out actions that have a
lag between notification and remediation built into them, but that might not be
the best approach.
For example, if you wanted to lock down SSH exposure in your
development environment, you might design your remediation to send a Slack
notification that an instance is out of compliance because it has SSH exposed
and that it’ll be terminated in two hours if not fixed. Two hours later, the
instance can be terminated.
It may seem counterintuitive, but this is actually a more
disruptive workflow than if the instance were turned off as soon as the issue
was created. In the first scenario, if a developer spins up a non-compliant
instance and then it goes away after two hours, they will have spent two hours
of work on that instance. The code will have been loaded, the app might be
running, and if it just goes away, that’s developer time that was wasted, and
they’ll probably be upset! Instead, if the instance is torn down as soon as
it’s seen, there wouldn’t have been an opportunity for the developer to waste
any time on the instance. It’ll go away a minute or two after it is created, and
the supporting notifications will give them the context they need to avoid the
same mistake again.
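The trade-off can be made concrete with a little date arithmetic. The Python sketch below compares how much developer time is at risk under immediate teardown versus a two-hour grace period. The one-minute detection lag and the helper names are assumptions for the example.

```python
from datetime import datetime, timedelta


def removal_time(detected_at: datetime, grace: timedelta = timedelta(0)) -> datetime:
    """When the resource is removed: detection time plus any grace period."""
    return detected_at + grace


def time_at_risk(
    created_at: datetime,
    detected_at: datetime,
    grace: timedelta = timedelta(0),
) -> timedelta:
    """How long a developer could have worked on the resource before it is removed."""
    return removal_time(detected_at, grace) - created_at


created = datetime(2019, 1, 15, 9, 0)
detected = created + timedelta(minutes=1)  # assumed detection lag

print("immediate teardown:", time_at_risk(created, detected))
print("two-hour warning:  ", time_at_risk(created, detected, timedelta(hours=2)))
```

Under these assumptions, the grace period raises the work at risk from about a minute to a little over two hours, which is the disruption described above.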
Different scenarios require different timing and you’ll always have to
balance the risk of security exposure with the operational impact it will have
on the organization.
To Sum Up…
Every organization will have a unique journey when implementing
automation. For some, there’s no appetite for full automated remediation and
just using automation for notifications will be enough. In other organizations,
everything will be automated and completely locked down. Whatever level your
organization strives to get to, by working through these 4 levels you can be
successful in gradually rolling out automation to achieve fully automated
remediation and get the most value out of the actions with the least amount of
shock to the organization.
Chris DeRamus, CTO and co-founder, DivvyCloud