Emergency vs Continuous Incident Response

Would you rather be putting out fires or keeping a calm, predictable environment?
Continuous monitoring of a cloud environment identifies vulnerabilities so threats can be dealt with before they become an actual problem. As soon as an infrastructure monitoring tool detects an error or a glitch that could potentially break the system, an alert is created. From that point on, every second lost in remediating the issue increases the chances of a security incident.

Remediation workflows are designed for fast identification and resolution. They typically include alerting, ticket creation, routing to the proper team, and verifying that the proper action was taken to resolve the issue. Time is the biggest factor in these processes, so leveraging automation and executing planned incident response actions becomes key.
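To make the workflow concrete, here is a minimal sketch of the alert, ticket, and routing steps. The routing table, team names, and the `Alert` and `Ticket` types are all hypothetical and stand in for whatever your alerting and ticketing tools provide.

```python
from dataclasses import dataclass
from itertools import count

# Hypothetical routing table: alert category -> owning team.
ROUTES = {"security": "secops", "database": "dba", "network": "netops"}
DEFAULT_TEAM = "on-call"  # fallback when no team owns the category

_ticket_ids = count(1)


@dataclass
class Alert:
    source: str
    category: str
    message: str


@dataclass
class Ticket:
    id: int
    team: str
    alert: Alert
    status: str = "open"


def open_ticket(alert: Alert) -> Ticket:
    """Create a ticket and route it to the team that owns the alert category."""
    team = ROUTES.get(alert.category, DEFAULT_TEAM)
    return Ticket(id=next(_ticket_ids), team=team, alert=alert)


def resolve(ticket: Ticket) -> Ticket:
    """Mark the ticket resolved once the remediation action has been confirmed."""
    ticket.status = "resolved"
    return ticket


ticket = open_ticket(Alert("monitor", "security", "SSH open to the world"))
print(ticket.team, ticket.status)
```

The fallback team matters more than it looks: an alert that nobody owns is the one most likely to sit unresolved.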

Emergency response, on the other hand, applies when an incident has already occurred, so the response is reactive. The damage is already done and containment is the priority; the priorities and required actions are specific to the incident at hand. Assuming that an incident will happen and planning accordingly is the key to maximizing resolution speed. In both cases, fast response is vital.

Fortunately, today’s organizations are aware of the threats that attempt to penetrate their cloud infrastructure and are taking measures to prevent and prepare for what seems to be inevitable. Organizations that employ the best practices below can decrease the time to detection and time to remediation of exploitable vulnerabilities across all AWS services, improve their security hygiene, and lower the information security risk of potential data breaches:

  1. Monitor your cloud infrastructure: Identify the vulnerabilities before the bad guys do.
  2. IR Plan: Have an incident response plan in place. Update it and run drills frequently.
  3. Create actionable alerts: Organizations need to create and implement actionable alerts to maximize resolution speed. Actionable alerts help your team identify who needs to respond and what action needs to be taken.
  4. Automate: Speed up the security workflow, from alerting to ticketing to task assignment and remediation. Automation tools can help combat threats in real time and even enable you to enforce policy as code.
  5. Enable Security-as-Code: Automated policy enforcement through auto-remediation is a huge time saver and can reduce the time to remediation significantly, but it’s important to be selective about the security alerts you choose to act on automatically. These criteria can be helpful to consider as you evaluate which alerts to auto-remediate:
    • A constantly recurring signature or control with a constant solution.
    • A process that provides maximum remediation value for the potential exploitability of the alert generated.
    • Alerts where following a complex, custom remediation process is appropriate. For example, take ESP’s signature for Global SSH: an EC2 security group that allows SSH from the world could wreak major havoc in your AWS environment by exposing your EC2 instances to malicious break-in attempts. Remediating alerts from this signature automatically provides great value, and the fix is relatively easy.
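The Global SSH case can be sketched as a simple check. This is not ESP’s actual signature logic, which isn’t described here; it is an illustration that assumes each group is a dict shaped like an entry in the EC2 DescribeSecurityGroups response (`IpPermissions`, `IpRanges`, `Ipv6Ranges`). The function name and sample groups are made up, and the remediation step itself (revoking the rule) is left out.

```python
SSH_PORT = 22


def allows_ssh_from_world(group: dict) -> bool:
    """Return True if any ingress rule opens port 22 to 0.0.0.0/0 or ::/0."""
    for perm in group.get("IpPermissions", []):
        protocol = perm.get("IpProtocol")
        if protocol == "-1":
            # "-1" means all protocols and all ports.
            covers_ssh = True
        elif protocol == "tcp":
            covers_ssh = perm.get("FromPort", 0) <= SSH_PORT <= perm.get("ToPort", 65535)
        else:
            covers_ssh = False
        if not covers_ssh:
            continue
        open_v4 = any(r.get("CidrIp") == "0.0.0.0/0" for r in perm.get("IpRanges", []))
        open_v6 = any(r.get("CidrIpv6") == "::/0" for r in perm.get("Ipv6Ranges", []))
        if open_v4 or open_v6:
            return True
    return False


# Hypothetical sample data in the assumed response shape.
open_group = {
    "GroupId": "sg-open",
    "IpPermissions": [
        {"IpProtocol": "tcp", "FromPort": 22, "ToPort": 22,
         "IpRanges": [{"CidrIp": "0.0.0.0/0"}]},
    ],
}
office_only = {
    "GroupId": "sg-office",
    "IpPermissions": [
        {"IpProtocol": "tcp", "FromPort": 22, "ToPort": 22,
         "IpRanges": [{"CidrIp": "203.0.113.0/24"}]},
    ],
}

flagged = [g["GroupId"] for g in (open_group, office_only) if allows_ssh_from_world(g)]
print(flagged)
```

A check like this fits the criteria above because the finding recurs, the fix is always the same, and the exposure it closes is significant.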

Incidents, service disruptions, and outages are not limited to security breaches and exploited vulnerabilities. Risk is often introduced when a change is made in production, planned or unplanned. If a production change causes service degradation or a full-blown outage, the effect is not always immediate. It may be several hours or even days before the production change is identified as the root cause.
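Because the effect of a change can lag by hours or days, it helps to keep a change log you can query when an alert fires. The sketch below assumes a hypothetical list of change records with a `deployed_at` timestamp and an arbitrary 48-hour lookback; both are illustrative choices, not recommendations.

```python
from datetime import datetime, timedelta


def suspect_changes(alert_time, changes, lookback=timedelta(hours=48)):
    """Return changes deployed within `lookback` before the alert, newest first."""
    window_start = alert_time - lookback
    recent = [c for c in changes if window_start <= c["deployed_at"] <= alert_time]
    return sorted(recent, key=lambda c: c["deployed_at"], reverse=True)


# Hypothetical change log entries.
changes = [
    {"id": "chg-1", "deployed_at": datetime(2024, 1, 2, 15, 0)},    # 18 hours before the alert
    {"id": "chg-2", "deployed_at": datetime(2023, 12, 20, 9, 0)},   # outside the lookback window
    {"id": "chg-3", "deployed_at": datetime(2024, 1, 3, 10, 0)},    # after the alert fired
]

alert_time = datetime(2024, 1, 3, 9, 0)
suspects = suspect_changes(alert_time, changes)
print([c["id"] for c in suspects])
```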

As mentioned previously, it’s critical to have monitoring and instrumentation in place so you know what a healthy environment looks like and are alerted when it’s not. The holy grail, though, is event correlation. Events and alerts that appear isolated or disparate are typically related, and if your monitoring tools and platforms can identify this, that’s a huge win. It helps reduce customer impact, lower MTTR, and maintain your SLAs. Problem isolation is key here.
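As a rough illustration of the idea, the simplest form of correlation is time proximity: alerts that fire close together are grouped into one candidate incident. Real correlation engines also use service topology and other signals; the five-minute window and the alert fields below are assumptions for the sketch.

```python
from datetime import datetime, timedelta


def correlate(alerts, window=timedelta(minutes=5)):
    """Group alerts that fire within `window` of the previous alert in the group."""
    groups = []
    for alert in sorted(alerts, key=lambda a: a["at"]):
        if groups and alert["at"] - groups[-1][-1]["at"] <= window:
            groups[-1].append(alert)  # close in time: same candidate incident
        else:
            groups.append([alert])    # gap too large: start a new group
    return groups


# Hypothetical alerts from different monitoring tools.
alerts = [
    {"name": "api latency high", "at": datetime(2024, 1, 3, 10, 0)},
    {"name": "db connections exhausted", "at": datetime(2024, 1, 3, 10, 3)},
    {"name": "disk usage warning", "at": datetime(2024, 1, 3, 11, 0)},
]

for group in correlate(alerts):
    print([a["name"] for a in group])
```

Here the latency and database alerts land in one group, which points responders at a single problem to isolate instead of two.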

From a site/service reliability standpoint, there should be no unplanned changes. However, how in-depth an organization wants to go with change management varies: compliance requirements, company culture, and acceptance of process and structure will determine the direction. Having said that, enumerating change management practices is a bit outside the narrative of emergency vs continuous incident response, so I’ll save that for another post.

Contributing Author
Serhat Can is the Technical Evangelist for OpsGenie. Serhat contributed to different parts of OpsGenie as a software engineer and now spreads the word by coding, writing and talking about DevOps. He is still a proud member of the on-call schedules.

About OpsGenie
OpsGenie is the world’s fastest growing platform for alerting and incident management. OpsGenie centralizes the flow of alerts and then delivers them according to customizable schedules and escalation policies so teams can minimize the impact of IT and security incidents.