Security and proper incident response are business-critical concerns, and managing the aftermath of a security breach or cyber attack is no easy task. Time is of the essence when a risk is identified, and it has to be approached in a disciplined manner. Businesses that continuously monitor security and improve their incident response processes have a stronger security posture and are more resilient to security incidents. That being said, security never stops and incidents happen; the big question is whether we can avoid some of them, and mitigate the damage of the rest, by having established processes to deal with them.
With more and more people pushing code and making changes to your AWS environment, how can you be certain that they are all adhering to security best practices and policies? The best cloud practitioners are embedding security experts and automation within product development teams so they can work side by side throughout the development process. This approach provides the guardrails to prevent misconfiguring AWS services and enables DevOps teams to maintain their rapid pace of innovation while security ensures that risks are mitigated.
Testing and monitoring everything that is deployed to production at the speed of continuous development is not possible with the limited resources of most organizations. Yet employing automation, prioritizing tasks, and maintaining continuous insight into your environment can gird your organization against threats.
So, if an incident is unavoidable, what are the best practices for how to respond?
The first step is to identify and prioritize security issues based on their risk level and how badly they could impact the business. Then, the organization needs to map the appropriate incident response processes to those issues. Some IT incidents cause downtime, others can compromise vital organizational data, and there are many different types of incidents that need to be considered.
Security incidents that take an organization offline while security issues are addressed are the most damaging. This is especially the case for financial services or ecommerce companies, where an interruption of the online revenue stream can have disastrous effects. However, most companies survive downtime as long as incidents are managed well and Service Level Agreements (SLAs) are met.
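As a rough illustration of that first step, the sketch below scores issues by likelihood and business impact, then maps each score to a response process. The 1 to 5 scales, the thresholds, the process names, and the sample issues are all assumptions made for this example; the article does not prescribe a scoring scheme.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SecurityIssue:
    name: str
    likelihood: int  # hypothetical scale: 1 (rare) to 5 (almost certain)
    impact: int      # hypothetical scale: 1 (negligible) to 5 (business-critical)

    @property
    def score(self) -> int:
        # Simple risk score: likelihood multiplied by business impact.
        return self.likelihood * self.impact


def response_process(issue: SecurityIssue) -> str:
    """Map a risk score to a response process (thresholds are illustrative)."""
    if issue.score >= 15:
        return "page-on-call"
    if issue.score >= 6:
        return "ticket-next-business-day"
    return "weekly-review"


def prioritize(issues):
    """Return the issues ordered from highest to lowest risk score."""
    return sorted(issues, key=lambda issue: issue.score, reverse=True)


# Made-up sample issues for the example.
issues = [
    SecurityIssue("storage bucket with PII is publicly readable", likelihood=4, impact=5),
    SecurityIssue("unused user account still enabled", likelihood=2, impact=2),
    SecurityIssue("SSH port open to the internet", likelihood=3, impact=4),
]

for issue in prioritize(issues):
    print(issue.score, issue.name, "->", response_process(issue))
```

A multiplicative score is only one option; some teams prefer a lookup matrix so that a rare but catastrophic issue is never ranked below a frequent nuisance.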
Information breaches or leaks also rank high on the priority list, mostly because of the severe repercussions stemming from loss or damage of an organization’s assets and the loss of customer, investor, or stockholder trust. A security incident like this could come in the form of a threat to network, systems, Intellectual Property (IP), and/or Personally Identifiable Information (PII).
A security incident of any kind can lead to service degradation or more downtime and, worst of all, to regulatory and financial penalties and the loss of brand equity and customer trust.
Having an Incident Response Plan is Essential
Incidents happen. This does not change in the cloud. When it comes to security incidents, a mismanaged issue causes more damage the longer it goes unaddressed. A key to eliminating or reducing the impact of these incidents is to have an effective plan and processes to handle issues. Without a well-prepared incident response plan, it is nearly impossible to manage complex incidents affecting multiple services and teams in an already stressful situation.
An incident response plan and the matching configuration in an incident management system won’t do you and your team any favors unless you keep them up to date. The best way to ensure your plan and systems are up to date is to test them regularly in peacetime. Consistent training and chaos simulations help teams stay current and prepared for incidents by building a proactive approach to incident response.
There’s no definitive standard for cloud incident response plans, but we recommend that your organization adhere to these five main points:
Preparation
Preparation is critical because it reduces “what if” moments and helps teams make practiced decisions. Having an on-call schedule with multiple rotations, escalations with correct responders, runbooks, practice sessions, and extensive documentation are all part of this crucial stage.
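To make on-call rotations and escalations concrete, here is a minimal sketch of how a round-robin rotation and a time-based escalation policy might be modeled. The function names, responder names, shift length, and delays are hypothetical and are not taken from any particular incident management product.

```python
from datetime import datetime, timedelta, timezone


def on_call(rotation, start, shift_length, now):
    """Return who is on call at `now` in a simple round-robin rotation."""
    # timedelta // timedelta yields the whole number of shifts elapsed.
    shifts_elapsed = (now - start) // shift_length
    return rotation[shifts_elapsed % len(rotation)]


def escalation_target(policy, minutes_unacknowledged):
    """Return the furthest escalation step whose delay has already elapsed.

    `policy` is a list of (delay_minutes, responder) pairs sorted by delay.
    """
    target = None
    for delay_minutes, responder in policy:
        if minutes_unacknowledged >= delay_minutes:
            target = responder
    return target


# Hypothetical weekly rotation starting on 1 January 2024 (UTC).
rotation = ["alice", "bob", "carol"]
start = datetime(2024, 1, 1, tzinfo=timezone.utc)
now = datetime(2024, 1, 17, tzinfo=timezone.utc)
print(on_call(rotation, start, timedelta(days=7), now))

# Hypothetical policy: escalate after 10 and 30 minutes without acknowledgement.
policy = [(0, "primary on-call"), (10, "secondary on-call"), (30, "team lead")]
print(escalation_target(policy, minutes_unacknowledged=12))
```

Keeping the schedule and policy as plain data like this makes them easy to review and to exercise during practice sessions.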
Detection & Alerting
Detection and alerting focuses on the communication of an abnormality. In this step, monitoring the right metrics and setting the correct thresholds are important to reduce false positives. In the cloud, multiple monitoring solutions are often involved in different parts of the infrastructure, covering network, infrastructure, application, performance, or compliance monitoring. An undesired state can trigger a chain reaction, and a new level of incident management becomes crucial to aggregate, triage, and then alert only on the things that matter.
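The aggregate, triage, and alert flow described above can be sketched roughly as follows. The event shape, the severity levels, and the check names are assumptions for illustration only; real monitoring tools each emit their own formats.

```python
# Hypothetical severity ordering used for triage.
SEVERITY_RANK = {"info": 0, "warning": 1, "critical": 2}


def aggregate(events):
    """Collapse raw monitoring events into one group per (source, check)."""
    groups = {}
    for event in events:
        key = (event["source"], event["check"])
        group = groups.setdefault(
            key,
            {"source": key[0], "check": key[1], "count": 0, "severity": "info"},
        )
        group["count"] += 1
        # Keep the worst severity seen so far for this group.
        if SEVERITY_RANK[event["severity"]] > SEVERITY_RANK[group["severity"]]:
            group["severity"] = event["severity"]
    return list(groups.values())


def triage(groups, min_severity="warning"):
    """Keep groups at or above `min_severity`, worst severity first."""
    threshold = SEVERITY_RANK[min_severity]
    kept = [g for g in groups if SEVERITY_RANK[g["severity"]] >= threshold]
    return sorted(kept, key=lambda g: SEVERITY_RANK[g["severity"]], reverse=True)


# Made-up events from three different monitoring sources.
events = [
    {"source": "network", "check": "latency", "severity": "warning"},
    {"source": "compliance", "check": "s3-public-read", "severity": "warning"},
    {"source": "network", "check": "latency", "severity": "critical"},
    {"source": "application", "check": "deploy-finished", "severity": "info"},
    {"source": "network", "check": "latency", "severity": "warning"},
]

for alert in triage(aggregate(events)):
    print(alert["severity"], alert["source"], alert["check"], "x", alert["count"])
```

Five raw events become two alerts here: the three latency events collapse into one critical alert, and the informational deploy event is dropped.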
Containment
The containment stage is about limiting and preventing any further damage by isolating the affected area. In the case of complex incidents, teams join a war room and work together to stop the bleeding. In this stage, an incident commander often assigns tasks to predefined roles and takes informed actions in the incident command center.
Remediation
Once the incident is under control, it is time to address the problem and figure out how it can be corrected to prevent a similar incident from occurring in the future. A decision-making framework, like Cynefin, can be used to approach the problem depending on the type of incident (simple, complicated, complex, chaotic). Cynefin provides a structured way to approach problems that helps incident responders determine the best course of action based on the nature of the problem itself.
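As a small illustration, the four Cynefin domains mentioned above can be mapped to the response pattern the framework associates with each one. The lookup table and function name below are a sketch for this article, not part of any tool.

```python
# Response pattern the Cynefin framework associates with each domain.
CYNEFIN_APPROACH = {
    "simple": ("sense", "categorize", "respond"),
    "complicated": ("sense", "analyze", "respond"),
    "complex": ("probe", "sense", "respond"),
    "chaotic": ("act", "sense", "respond"),
}


def recommended_approach(domain: str) -> str:
    """Return the ordered steps for a domain, e.g. for use in a runbook header."""
    steps = CYNEFIN_APPROACH[domain.lower()]
    return " -> ".join(steps)


print(recommended_approach("chaotic"))
```

In a chaotic incident the pattern starts with acting to stabilize the situation, which matches the containment-first mindset described earlier.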
Another popular approach is to use chat tools like Slack to enable teams to discuss and assess the incident. Modern ChatOps tools make collaborative investigation and actioning remediation a lot easier with a click of a button or typing a few words into the shared chat channel where everyone has visibility.
Post-Incident Review
Incident response does not end after remediating the issue. Continuous improvement requires learning from mistakes, and the last step of any incident response plan should contribute to this idea. Postmortems or post-incident reviews help teams evaluate the incident and implement new measures to reduce the chances of experiencing a similar incident. An essential rule when writing postmortems is to stay blameless: review the events without pointing fingers, so that the process builds a culture of continuous learning.
How do OpsGenie and Evident.io help?
Evident.io and OpsGenie have joined forces to enable organizations to resolve security incidents quickly and effectively. Using the monitoring capabilities of Evident.io, users can detect issues early and streamline the resolution process by using OpsGenie to create actionable alerts and route them to the appropriate teams.
Serhat Can is the Technical Evangelist for OpsGenie. Serhat contributed to different parts of OpsGenie as a software engineer and now spreads the word by coding, writing and talking about DevOps. He is still a proud member of the on-call schedules.