From a data science perspective

## Why now?

McKinsey estimates that, at the current growth rate, cyberattacks will cause $10.5 trillion in annual damage by 2025. Already today, according to IBM, the average data breach in the US costs $9.5 million.

Cybersecurity is one of the fastest-growing digital industries, and the data science toolkit saves millions from day one. This intro presents the **cybersecurity risk assessment framework** from a data science perspective. Starting with the definition of risk, we will build an intuition for using a Bayesian modelling mindset to address practical challenges.

Of course, once risks are identified, they have to be managed. But to manage risks effectively, they first need to be quantified and prioritized. Without risk assessment, organizations cannot focus on what really matters and, given a limited budget, could leave critical assets unprotected.

By definition, **risk is an expected negative impact**. The word “expected” indicates that we try to predict the future and operate with uncertainty. In mathematical terms it means we have a random variable **Impact**, which takes different possible values when bad events happen (risk scenarios are realized). And the expected value of this random variable is called risk.
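As a toy illustration of risk as an expected value, here is a minimal sketch in Python. All scenario names, probabilities, and impact figures below are hypothetical:

```python
# Risk as the expected value of the Impact random variable:
# risk = sum over scenarios of P(scenario) * impact(scenario).
# All numbers are hypothetical, for illustration only.
scenarios = {
    "phishing breach": (0.10, 2_000_000),  # (annual probability, impact in $)
    "ransomware":      (0.02, 9_000_000),
    "insider leak":    (0.01, 5_000_000),
}

risk = sum(p * impact for p, impact in scenarios.values())
print(f"Annual expected loss: ${risk:,.0f}")  # Annual expected loss: $430,000
```

In practice the scenario probabilities and impacts are exactly what Business Impact Analysis and threat/vulnerability assessments try to estimate.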

An event can be something complex, e.g., a data breach via unsecured API in a cloud infrastructure, but for our purposes it translates into a combination of **threats and vulnerabilities**.

- **Impact** reflects the potential harm when specific vulnerabilities are exposed to specific threats. To estimate the impact in different risk scenarios, organizations conduct a Business Impact Analysis (BIA), which considers business-specific factors, dynamics over time (time to detect and contain a breach), and post-effects.
- **Probability** of a particular impact realization depends on the threats and vulnerabilities corresponding to the event.
- **Threat** is an incoming danger out of your hands, e.g., a cyber attack. Threats cannot be controlled, but they can be monitored and analyzed to better prepare for them in the future.
- **Vulnerability** reflects existing weaknesses in an enterprise, e.g., software bugs. Organizations try to control vulnerabilities through risk mitigation initiatives, e.g., a Zero Trust security model.

**Threat Analysis** and **Vulnerability Assessment** are designated activities conducted at different levels (worldwide, country, organization, network perimeter, application, etc.).

By modelling risks we mean estimating a **probability distribution** of the impact.

According to the definition of expected value, we operate in a universe of all possible events, i.e., **all combinations** and dependencies between different threats and vulnerabilities are possible.

- To appreciate the complexity, imagine your email is hacked. It exposes all accounts using this email for authentication, but each account is secured with different policies, so the vulnerability differs. Moreover, one vulnerability can pull in another. Also, your accounts have different importance, e.g., an online bank account vs a Netflix account.

How do we handle all this real-world complexity? This is the moment when Subject Matter Experts (SMEs) in cybersecurity reveal the truth, or at least heuristics, about the universe of events. For instance, they can tell you that, without losing much accuracy, you can model some threats independently, or that the vulnerability of one asset doesn't influence another asset's vulnerability, etc. In other words, SMEs provide **realistic assumptions** which simplify the estimation of the impact probability distribution, i.e., **P(Impact) = P(Threat & Vulnerability)**. Selected examples of such assumptions are below:

- For a back-of-the-envelope calculation, you could assume a binary impact following a **Bernoulli distribution**, i.e., a non-zero impact with probability p and zero impact with probability 1-p.
- **Independence of events**: P(e1, e2) = P(e1) x P(e2). It is often used as a baseline when we want to estimate the **risk of at least one breach** happening, i.e., P(at least one risk realized: e1 or e2) = 1 - P(no risks realized) = 1 - P(no e1) x P(no e2).
- If we are interested in the number of attacks, assuming that threats are identical and independent (e.g., for DDoS attack modelling), then a **Poisson distribution** suits P(Threat) modelling.
- **Independence of Threat and Vulnerability**, so the joint distribution can be decomposed: P(Threat & Vulnerability) = P(Threat) x P(Vulnerability). E.g., the threat of receiving a malware link by email and having a sticky note with a password on your monitor. A counter-example is when bad guys know you have a sticky note with the password on your monitor: it increases the chances they will try to enter the office building to look at it.
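The baseline assumptions above can be sketched in a few lines of Python. All probabilities and rates below are hypothetical:

```python
import math

# Hypothetical annual breach probabilities for two independent events.
p_e1, p_e2 = 0.05, 0.10

# Risk of at least one breach under the independence assumption:
# P(e1 or e2) = 1 - P(no e1) * P(no e2)
p_at_least_one = 1 - (1 - p_e1) * (1 - p_e2)
print(f"P(at least one breach) = {p_at_least_one:.3f}")  # 0.145

# Poisson model for the number of attacks (e.g., DDoS) with rate lam per day:
# P(N = k) = lam**k * exp(-lam) / k!
lam = 3.0
p_five_attacks = lam**5 * math.exp(-lam) / math.factorial(5)
print(f"P(exactly 5 attacks) = {p_five_attacks:.3f}")
```

Note how the independence assumption turns an intractable joint distribution over all event combinations into simple products of marginal probabilities.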

Once the assumptions about the impact probability distribution have been made, it is time to estimate the distribution from the data. The motivation for the modelling approach comes from the fact that data points are rare by design: we don't want any risks to actually be realized.

- So **Bayesian inference** fits naturally, as it incorporates **priors**, i.e., assumptions about the distribution parameters.
- However, usually the optimal point estimate of a parameter is enough, so **Maximum a posteriori estimation (MAP)** is utilized. Based on MAP, a **loss function** for the optimization problem is formulated, which can be solved using supervised **machine learning** models.
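As a minimal sketch of MAP under the Bernoulli-impact assumption above: with a Beta prior on the breach probability, the posterior mode has a closed form. The prior parameters and observation counts below are hypothetical:

```python
# MAP estimate of a breach probability p under a Beta(alpha, beta) prior
# with n Bernoulli observations containing k breaches.
# The posterior is Beta(alpha + k, beta + n - k); its mode is the MAP estimate:
#   p_MAP = (alpha + k - 1) / (alpha + beta + n - 2)

def map_breach_probability(k, n, alpha=2.0, beta=8.0):
    """Posterior mode; valid when alpha + k > 1 and beta + n - k > 1."""
    return (alpha + k - 1) / (alpha + beta + n - 2)

# Rare events by design: 2 breaches observed over 50 monitored periods.
p_map = map_breach_probability(k=2, n=50)
print(p_map)  # about 0.052, pulled toward the prior by the small sample
```

With little data the prior dominates (which is exactly where SME assumptions enter), and as observations accumulate the estimate converges toward the empirical frequency k/n.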

In order to decide which **risk mitigation strategy** to choose, an organization estimates how particular risk mitigation measures change the vulnerability profile and **reduce the risk**. The question is critical because risk reduction does not come for free. By modelling the residual risk (after measures are applied), risk managers solve an **optimization problem with constraints**: which measures mitigate the risk to a given risk tolerance level so that costs do not exceed the given budget.
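A toy sketch of this constrained optimization, brute-forcing the choice of measures. It simplistically assumes risk reductions are additive, and all measure names, costs, and reduction figures are hypothetical:

```python
from itertools import combinations

# Hypothetical measures: (name, cost in $, expected risk reduction in $).
measures = [
    ("MFA rollout",             50_000, 180_000),
    ("Zero Trust segmentation", 120_000, 250_000),
    ("Patch automation",        30_000,  90_000),
    ("Security training",       20_000,  60_000),
]

BUDGET = 150_000
baseline_risk = 430_000  # hypothetical expected annual loss before mitigation

# Enumerate all subsets within budget and pick the one minimizing residual risk.
best = None
for r in range(len(measures) + 1):
    for combo in combinations(measures, r):
        cost = sum(m[1] for m in combo)
        reduction = sum(m[2] for m in combo)
        if cost <= BUDGET:
            residual = baseline_risk - reduction
            if best is None or residual < best[0]:
                best = (residual, combo, cost)

residual, combo, cost = best
print("Chosen measures:", [m[0] for m in combo])
print(f"Cost: ${cost:,}, residual risk: ${residual:,}")
```

Real portfolios are larger and reductions interact, so in practice this becomes an integer programming or knapsack-style problem rather than exhaustive enumeration.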

Moreover, **game-theoretical aspects** should not be ignored. For instance, if hackers become aware of particular mitigation measures, it might decrease the probability of a threat addressing the vulnerability covered by those measures.

Cybersecurity risk management is vital for a modern organization. Risk assessment is the cornerstone of the risk management process and distinguishes winners from losers. It poses many modelling challenges, which organizations overcome by applying realistic assumptions to simplify calculations. The risk mitigation strategy is the outcome of the risk modelling exercise, incorporating the costs of risk reduction activities.

Due to the fast-changing cybersecurity environment, this process is usually semi-automated and active 24/7 with the help of real-time security risk monitoring systems. Implementation and maintenance of such a system is an important but challenging task. However, only by moving this way can organizations take the risks under control.

The good news is that the majority of risk modelling challenges can be addressed with existing data science and machine learning methods.
