How To Calculate The Probability of Cyber Breach Attack Methods Using Python and Bayesian Statistics
Introduction
Understanding and quantifying risk is paramount in cybersecurity's complex and ever-evolving landscape.
One powerful tool for this is Bayesian statistics, which allows for a more nuanced analysis of risks and threats.
In this article, I share a real-world Python program I created to compute and visualize probability distributions for the Web Application Attack method, comparing an industry benchmark with an organization-specific distribution after applying new controls.
This program provides a practical example of applying Bayesian methods in cybersecurity.
I am committed to equipping cybersecurity professionals with the robust capabilities of quantitative Bayesian statistical methods. By leveraging these mathematical and statistical tools, we can enhance our current risk assessment techniques and present risks in terms that business leaders can understand. Bayesian methods allow us to prioritize cybersecurity risks and communicate them with their potential economic impact, ensuring clarity for business professionals.
You can connect with me on LinkedIn and follow my articles on Medium. Get notified via email every time I publish a new article.
Theoretical Underpinnings
This Python program is grounded in Bayesian statistics, a framework for updating beliefs in light of new evidence. In cybersecurity, Bayesian methods are particularly valuable for incorporating prior knowledge and new data. Here, using Beta distributions to model the likelihood of cyberattacks is a classic example of Bayesian thinking applied to real-world problems.
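To make the update rule concrete, here is a minimal, dependency-free sketch of the Beta-Binomial conjugate update the program relies on, using the same figures that appear later in the article (213 breaches across 4,237 organizations); the function name is mine, chosen for illustration:

```python
# Beta-Binomial conjugate update: a Beta(a, b) prior combined with
# k "hits" out of n trials yields a Beta(a + k, b + n - k) posterior.
def update_beta(alpha_prior, beta_prior, breaches, organizations):
    return alpha_prior + breaches, beta_prior + organizations - breaches

# Non-informative Beta(1, 1) prior updated with the article's figures
a, b = update_beta(1, 1, 213, 4237)
posterior_mean = a / (a + b)  # expected breach probability
print(a, b, round(posterior_mean, 4))  # 214 4025 0.0505
```

The posterior mean of about 5.05% is the "Industry Benchmark" figure discussed throughout the article.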
Practical Implications
For cybersecurity professionals, this program is more than an academic exercise. It offers a tangible method for assessing the effectiveness of security measures. By comparing an industry benchmark to a specific organization’s probability curve, it’s possible to estimate the probability of the organization experiencing a breach resulting from a Web Application Attack, as outlined in the latest Verizon DBIR report.
Real Visualization From The Program
Before we dive into the source code, I want to show you the visualization and its output, and describe how to put the information to use.
This information could be used alone to calculate the industry benchmark and internal probabilities of cyber breaches via Web Application Attacks. Alternatively, it can be used as input into a larger process to calculate a loss exceedance curve that communicates future magnitudes of potential losses in probabilistic terms.
The visualization is a graph of two probability distributions that compare the likelihood of web application attack methods affecting a business in two scenarios: the industry benchmark and after implementing new controls within the organization.
Purpose of the Visualization
The primary purpose of this visualization is to help business stakeholders understand the potential impact of cybersecurity threats on web applications and how implementing new controls can alter this risk landscape. Interpreting these probabilities isn't always as intuitive as many people think.
Business leaders want to know the return on their investments. This application is one part of that solution. The probability of breach must be calculated as the first step, and then, in a future article, I will show you how to calculate a loss exceedance curve.
This visualization is a quantitative risk analysis in the form of probability distributions that can be used for strategic decision-making.
Reading the Visualization
- Curves: There are two curves on the graph. The orange curve represents the ‘Industry Benchmark,’ which is the standard or average risk of web application attacks in the industry. The blue curve represents ‘New Controls,’ indicating the risk probability for an organization implementing new cybersecurity measures.
- Mean and Mode: Both curves have an associated ‘mean’ and ‘mode’ value:
- The mean (average) indicates the central tendency of the data, which is the expected probability of web application attacks.
- The mode represents the most likely value, the highest point on each curve.
- Confidence Interval: Each curve has a dashed vertical line marking its mean and a shaded area showing the 90% confidence interval:
- For the 'Industry Benchmark', the interval runs from 4.51% (P5) to 5.61% (P95); in other words, we are 90% confident the probability falls within this range.
- For 'New Controls', it runs from 4.10% (P5) to 5.05% (P95).
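These statistics can be checked in closed form. The sketch below recomputes the benchmark's mean and mode from Beta(214, 4025), the posterior the program builds, and estimates the 90% interval with a normal approximation; the article's 4.51%/5.61% figures come from exact Beta quantiles, so the approximation lands close but not identical:

```python
import math

# Industry benchmark posterior from the program: Beta(214, 4025)
a, b = 214.0, 4025.0

mean = a / (a + b)            # central tendency of the distribution
mode = (a - 1) / (a + b - 2)  # the peak of the curve
sd = math.sqrt(a * b / ((a + b) ** 2 * (a + b + 1)))

# Normal approximation to the 5th/95th percentiles (z = 1.645)
p5, p95 = mean - 1.645 * sd, mean + 1.645 * sd
print(f"mean={mean:.4f} mode={mode:.4f} p5={p5:.4f} p95={p95:.4f}")
```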
Using the Information
- Comparative Risk Assessment: By comparing the mean of the two curves, a decision maker can assess how much the new controls are expected to reduce the risk of web application attacks. Here, we see a decrease from the industry mean of 5.05% to 4.57% with the new controls. Depending on the magnitudes of loss that are plotted on the loss exceedance curve, this may or may not be a good investment.
- Decision Making: The visualization provides insight into the effectiveness of the new controls. If the blue curve (‘New Controls’) is significantly lower than the orange curve (‘Industry Benchmark’), it suggests that the new controls effectively reduce the probability of an attack.
- Strategic Planning: The confidence interval helps understand the range of likely outcomes. A narrower interval suggests more certainty in the prediction. In this case, business stakeholders can see that the ‘New Controls’ not only lower the mean risk but also tighten the range of probable outcomes (a smaller confidence interval), indicating increased certainty in the effectiveness of these controls.
- Resource Allocation: Understanding which controls can shift the probability distribution from the industry benchmark to the improved state can help businesses decide where to invest cybersecurity resources.
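The comparative assessment reduces to simple arithmetic on the two posterior means quoted above (5.05% benchmark vs. 4.57% with new controls):

```python
benchmark_mean = 0.0505  # industry benchmark probability of breach
controls_mean = 0.0457   # probability after implementing new controls

absolute_reduction = benchmark_mean - controls_mean
relative_reduction = absolute_reduction / benchmark_mean
print(f"absolute: {absolute_reduction:.2%}, relative: {relative_reduction:.1%}")
# absolute: 0.48%, relative: 9.5%
```

Whether a roughly 9.5% relative reduction justifies the cost of the controls depends on the loss magnitudes, which is what the loss exceedance curve is for.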
Visualization Summary
This graph provides a clear visual representation of the expected effectiveness of cybersecurity measures. For business professionals, it’s a tool for understanding the average risk of attack methods and the potential improvements specific controls might bring. It can be a part of the risk management process, aiding in making informed decisions about cybersecurity investments.
Python Code: Overview
The Python program uses several libraries:
- NumPy and Pandas: For data handling.
- Matplotlib: For data visualization.
- TensorFlow Probability: For modeling probability distributions.
The main steps of the program are as follows:
Defining the Posterior Distribution Function:
- This function, compute_posterior, calculates the posterior distribution using a Beta distribution. The Beta distribution is chosen for its suitability in modeling "probabilities of probabilities," a common requirement in cybersecurity risk analysis.
- Inputs: alpha_prior, beta_prior, total_breaches, total_organizations.
- The function returns a TensorFlow Probability distribution object.
Plotting the Distribution:
- The plot_distribution function takes a distribution object and plots it on a specified axis.
- It calculates and displays key statistics: the mean, the mode, and a 90% confidence interval (the 5th and 95th percentiles).
- This visual representation is crucial for understanding the likelihood of different risk levels.
Setting Parameters and Computing Distributions:
- The script computes two distributions: one representing the industry benchmark and another for the organization after implementing new controls.
- The industry benchmark is modeled with a non-informative prior, using total_breaches and total_organizations as inputs.
- The organization-specific distribution uses an informative prior, reflecting the belief that the new controls will reduce breach probability.
Visualization:
- The script creates a plot to compare the two distributions visually.
- This comparison is key to understanding how new controls might impact the organization’s cybersecurity posture compared to the industry norm.
Outputting Statistics:
- Finally, the program prints out key statistics for both distributions, concisely summarizing the findings.
Python Source Code
Import the required libraries:
# Import the required libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import tensorflow_probability as tfp
tfd = tfp.distributions
# Computes the posterior distribution given prior parameters,
# total breaches, and total organizations.
def compute_posterior(alpha_prior, beta_prior, total_breaches, total_organizations):
    return tfd.Beta(alpha_prior + total_breaches,
                    beta_prior + total_organizations - total_breaches)

# Plot a distribution on the provided axes and return its key statistics.
def plot_distribution(distribution, x, label, color, ax):
    mean = distribution.mean().numpy()
    # Mode of a Beta(alpha, beta) distribution: (alpha - 1) / (alpha + beta - 2)
    alpha = distribution.concentration1.numpy()
    beta = distribution.concentration0.numpy()
    mode = (alpha - 1) / (alpha + beta - 2)
    # The 5th and 95th percentiles bound the 90% confidence interval
    p5, p95 = distribution.quantile(0.05).numpy(), distribution.quantile(0.95).numpy()
    ax.fill_between(x, distribution.prob(x), where=(x >= p5) & (x <= p95), color=color, alpha=0.3)
    ax.plot(x, distribution.prob(x), label=f"{label} (Mean: {mean:.4f}, Mode: {mode:.4f})")
    ax.axvline(x=mean, color='red', linestyle='--')
    return mean, mode, p5, p95
# Enter parameters for the non-informative prior (Industry Benchmark)
# The program could also be written to prompt for these inputs interactively.
total_breaches = 213 # number of breaches in the last year in my reference class
total_organizations = 4237 # population size of reference class.
# Compute Non-informative Prior
non_info_prior = compute_posterior(1, 1, total_breaches, total_organizations)
# Informative Prior (new controls implemented, or the belief that our
# controls are stronger than the benchmark)
# 25 represents the alpha (hits)
# 950 represents the beta (misses)
# A Beta(50, 950) prior has a mean of 50/1000 = 5%. Halving alpha to 25
# encodes new data, or the belief that the new controls will reduce the
# chance of a breach by 50%.
info_prior = compute_posterior(25, 950, total_breaches, total_organizations)
# X-axis scale adjustment
# After running the initial program, I adjusted the scale for aesthetic reasons.
x = np.linspace(0.03, 0.07, 1000)  # x-axis shows 3% to 7%
# Plotting the Curves
fig, ax = plt.subplots(figsize=(10, 6))
mean_non_info, mode_non_info, p5_non_info, p95_non_info = plot_distribution(non_info_prior, x, 'Industry Benchmark', 'yellow', ax)
mean_info, mode_info, p5_info, p95_info = plot_distribution(info_prior, x, 'New Controls', 'green', ax)
# Setting the Title and the x and y axis labels
ax.set_title('Probability Distribution for Web App Attack Method')
ax.set_xlabel('Probability')
ax.set_ylabel('Density')
ax.set_ylim(bottom=0)
ax.legend()
plt.show()
# Print statistics for both distributions below the plot
print("Non-Informative Prior (Industry Benchmark):")
print(f"Mean: {mean_non_info:.4f}, Mode: {mode_non_info:.4f}, 90% Confidence Interval: P5 = {p5_non_info:.4f}, P95 = {p95_non_info:.4f}")
print("\nInformative Prior (New Controls):")
print(f"Mean: {mean_info:.4f}, Mode: {mode_info:.4f}, 90% Confidence Interval: P5 = {p5_info:.4f}, P95 = {p95_info:.4f}")
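As a dependency-free sanity check, the headline means and modes the program prints can be reproduced in closed form from the two Beta posteriors it constructs, Beta(1 + 213, 1 + 4237 - 213) and Beta(25 + 213, 950 + 4237 - 213), without TensorFlow Probability; the helper function below is mine:

```python
def beta_mean_mode(alpha, beta):
    """Closed-form mean and mode of a Beta(alpha, beta) distribution."""
    mean = alpha / (alpha + beta)
    mode = (alpha - 1) / (alpha + beta - 2)  # valid for alpha, beta > 1
    return mean, mode

# Industry benchmark: Beta(1 + 213, 1 + 4237 - 213) = Beta(214, 4025)
bench_mean, bench_mode = beta_mean_mode(214, 4025)
# New controls: Beta(25 + 213, 950 + 4237 - 213) = Beta(238, 4974)
ctrl_mean, ctrl_mode = beta_mean_mode(238, 4974)

print(f"Benchmark: mean={bench_mean:.4f}, mode={bench_mode:.4f}")
print(f"Controls:  mean={ctrl_mean:.4f}, mode={ctrl_mode:.4f}")
```

The means of roughly 5.05% and 4.57% match the figures quoted in the visualization discussion.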
Summary & Conclusion
This Python program exemplifies the practical application of Bayesian statistics in cybersecurity. It demonstrates how complex statistical methods can be accessible and useful in addressing real-world challenges. It bridges the gap between theoretical statistics and practical cybersecurity solutions through thoughtful implementation and clear visualization.