How To Calculate The Probability of Cyber Breach Attack Methods Using Python and Bayesian Statistics
Introduction
Understanding and quantifying risk is paramount in cybersecurity's complex and ever-evolving landscape.
One powerful tool for this is Bayesian statistics, which allows for a more nuanced analysis of risks and threats.
In this article, I share a real-world Python program I created to compute and visualize probability distributions for the Web Application Attack method, comparing an industry benchmark with an organization-specific distribution after applying new controls.
This program provides a practical example of applying Bayesian methods in cybersecurity.
I am committed to equipping cybersecurity professionals with the robust capabilities of quantitative Bayesian statistical methods. By leveraging these mathematical and statistical tools, we can enhance our current risk assessment techniques and present risks in terms that business leaders can understand. Bayesian methods allow us to prioritize cybersecurity risks and communicate them with their potential economic impact, ensuring clarity for business professionals.
You can connect with me on LinkedIn and follow my articles on Medium. Get notified via email every time I publish a new article.
Theoretical Underpinnings
This Python program is grounded in Bayesian statistics, a framework for updating beliefs in light of new evidence. In cybersecurity, Bayesian methods are particularly valuable for incorporating prior knowledge and new data. Here, using Beta distributions to model the likelihood of cyberattacks is a classic example of Bayesian thinking applied to real-world problems.
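To make the update rule concrete, here is a minimal, dependency-free sketch of the Beta-Binomial conjugate update the program relies on, using the same figures that appear later in the article (213 breaches across 4,237 organizations); the function name is mine, chosen for illustration:

```python
# Beta-Binomial conjugate update: a Beta(a, b) prior combined with
# k "hits" out of n trials yields a Beta(a + k, b + n - k) posterior.
def update_beta(alpha_prior, beta_prior, breaches, organizations):
    return alpha_prior + breaches, beta_prior + organizations - breaches

# Non-informative Beta(1, 1) prior updated with the article's figures
a, b = update_beta(1, 1, 213, 4237)
posterior_mean = a / (a + b)  # expected breach probability
print(a, b, round(posterior_mean, 4))  # 214 4025 0.0505
```

The posterior mean of about 5.05% is the "Industry Benchmark" figure discussed throughout the article.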
Practical Implications
For cybersecurity professionals, this program is more than an academic exercise. It offers a tangible method for assessing the effectiveness of security measures. By comparing an industry benchmark to a specific organization’s probability curve, it’s possible to estimate the probability of the organization experiencing a breach resulting from a Web Application Attack, as outlined in the latest Verizon DBIR report.
Real Visualization From The Program
Before we dive into the source code, I want to show you the visualization and its output, and describe how to put the information to use.
This information could be used alone to calculate the industry benchmark and internal probabilities of cyber breaches via Web Application Attacks. Alternatively, it can be used as input into a larger process to calculate a loss exceedance curve that communicates future magnitudes of potential losses in probabilistic terms.
The visualization is a graph of two probability distributions that compare the likelihood of web application attack methods affecting a business in two scenarios: the industry benchmark and after implementing new controls within the organization.
Purpose of the Visualization
The primary purpose of this visualization is to help business stakeholders understand the potential impact of cybersecurity threats on web applications and how implementing new controls can alter this risk landscape. Interpreting these probabilities isn't always as intuitive as many people think.
Business leaders want to know the return on their investments. This application is one part of that solution. The probability of breach must be calculated as the first step, and then, in a future article, I will show you how to calculate a loss exceedance curve.
This visualization is a quantitative risk analysis in the form of probability distributions that can be used for strategic decision-making.
Reading the Visualization
- Curves: There are two curves on the graph. The orange curve represents the ‘Industry Benchmark,’ which is the standard or average risk of web application attacks in the industry. The blue curve represents ‘New Controls,’ indicating the risk probability for an organization implementing new cybersecurity measures.
- Mean and Mode: Both curves have an associated ‘mean’ and ‘mode’ value:
- The mean (average) indicates the central tendency of the data, which is the expected probability of web application attacks.
- The mode represents the most likely value, the highest point on each curve.
- Confidence Interval: Each curve has a dashed vertical line marking its mean and a shaded area showing the 90% confidence interval:
- For the 'Industry Benchmark', the interval runs from 4.51% (P5) to 5.61% (P95); in other words, we are 90% confident the probability falls within this range.
- For 'New Controls', it runs from 4.10% (P5) to 5.05% (P95).
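These statistics can be checked in closed form. The sketch below recomputes the benchmark's mean and mode from Beta(214, 4025), the posterior the program builds, and estimates the 90% interval with a normal approximation; the article's 4.51%/5.61% figures come from exact Beta quantiles, so the approximation lands close but not identical:

```python
import math

# Industry benchmark posterior from the program: Beta(214, 4025)
a, b = 214.0, 4025.0

mean = a / (a + b)            # central tendency of the distribution
mode = (a - 1) / (a + b - 2)  # the peak of the curve
sd = math.sqrt(a * b / ((a + b) ** 2 * (a + b + 1)))

# Normal approximation to the 5th/95th percentiles (z = 1.645)
p5, p95 = mean - 1.645 * sd, mean + 1.645 * sd
print(f"mean={mean:.4f} mode={mode:.4f} p5={p5:.4f} p95={p95:.4f}")
```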
Using the Information
- Comparative Risk Assessment: By comparing the mean of the two curves, a decision maker can assess how much the new controls are expected to reduce the risk of web application attacks. Here, we see a decrease from the industry mean of 5.05% to 4.57% with the new controls. Depending on the magnitudes of loss that are plotted on the loss exceedance curve, this may or may not be a good investment.
- Decision Making: The visualization provides insight into the effectiveness of the new controls. If the blue curve (‘New Controls’) is significantly lower than the orange curve (‘Industry Benchmark’), it suggests that the new controls effectively reduce the probability of an attack.
- Strategic Planning: The confidence interval helps understand the range of likely outcomes. A narrower interval suggests more certainty in the prediction. In this case, business stakeholders can see that the ‘New Controls’ not only lower the mean risk but also tighten the range of probable outcomes (a smaller confidence interval), indicating increased certainty in the effectiveness of these controls.
- Resource Allocation: Understanding which controls can shift the probability distribution from the industry benchmark to the improved state can help businesses decide where to invest cybersecurity resources.
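The comparative assessment reduces to simple arithmetic on the two posterior means quoted above (5.05% benchmark vs. 4.57% with new controls):

```python
benchmark_mean = 0.0505  # industry benchmark probability of breach
controls_mean = 0.0457   # probability after implementing new controls

absolute_reduction = benchmark_mean - controls_mean
relative_reduction = absolute_reduction / benchmark_mean
print(f"absolute: {absolute_reduction:.2%}, relative: {relative_reduction:.1%}")
# absolute: 0.48%, relative: 9.5%
```

Whether a roughly 9.5% relative reduction justifies the cost of the controls depends on the loss magnitudes, which is what the loss exceedance curve is for.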
Visualization Summary
This graph provides a clear visual representation of the expected effectiveness of cybersecurity measures. For business professionals, it’s a tool for understanding the average risk of attack methods and the potential improvements specific controls might bring. It can be a part of the risk management process, aiding in making informed decisions about cybersecurity investments.
Python Code: Overview
The Python program uses several libraries:
- NumPy and Pandas: For data handling.
- Matplotlib: For data visualization.
- TensorFlow Probability: For modeling probability distributions.
The main steps of the program are as follows:
Defining the Posterior Distribution Function:
- This function, compute_posterior, calculates the posterior distribution using a Beta distribution. The Beta distribution is chosen for its suitability in modeling "probabilities of probabilities," a common requirement in cybersecurity risk analysis.
- Inputs: alpha_prior, beta_prior, total_breaches, total_organizations.
- The function returns a TensorFlow Probability distribution object.
Plotting the Distribution:
- The plot_distribution function takes a distribution object and plots it on a specified axis.
- It calculates and displays key statistics: the mean, the mode, and a 90% confidence interval (the 5th and 95th percentiles).
- This visual representation is crucial for understanding the likelihood of different risk levels.
Setting Parameters and Computing Distributions:
- The script computes two distributions: one representing the industry benchmark and another for the organization after implementing new controls.
- The industry benchmark is modeled with a non-informative prior, using total_breaches and total_organizations as inputs.
- The organization-specific distribution uses an informative prior, reflecting the belief that the new controls will reduce breach probability.
Visualization:
- The script creates a plot to compare the two distributions visually.
- This comparison is key to understanding how new controls might impact the organization’s cybersecurity posture compared to the industry norm.
Outputting Statistics:
- Finally, the program prints out key statistics for both distributions, concisely summarizing the findings.
Python Source Code
Import the required libraries:
# Import the required libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import tensorflow_probability as tfp
tfd = tfp.distributions
# Computes the posterior distribution given prior parameters,
# total breaches, and total organizations.
def compute_posterior(alpha_prior, beta_prior, total_breaches, total_organizations):
    return tfd.Beta(alpha_prior + total_breaches,
                    beta_prior + total_organizations - total_breaches)

# Plot a distribution on the provided axes and return its key statistics.
def plot_distribution(distribution, x, label, color, ax):
    mean = distribution.mean().numpy()
    # Mode of a Beta(alpha, beta) distribution: (alpha - 1) / (alpha + beta - 2)
    alpha = distribution.concentration1.numpy()
    beta = distribution.concentration0.numpy()
    mode = (alpha - 1) / (alpha + beta - 2)
    # The 5th and 95th percentiles bound the 90% confidence interval
    p5, p95 = distribution.quantile(0.05).numpy(), distribution.quantile(0.95).numpy()
    ax.fill_between(x, distribution.prob(x), where=(x >= p5) & (x <= p95), color=color, alpha=0.3)
    ax.plot(x, distribution.prob(x), label=f"{label} (Mean: {mean:.4f}, Mode: {mode:.4f})")
    ax.axvline(x=mean, color='red', linestyle='--')
    return mean, mode, p5, p95
# Enter parameters for the non-informative prior (Industry Benchmark)
# The program could also be written to prompt for these inputs interactively.
total_breaches = 213 # number of breaches in the last year in my reference class
total_organizations = 4237 # population size of reference class.
# Compute Non-informative Prior
non_info_prior = compute_posterior(1, 1, total_breaches, total_organizations)
# Informative Prior (new controls implemented, or the belief that our
# controls are stronger than the benchmark)
# 25 represents the alpha (hits)
# 950 represents the beta (misses)
# A Beta(50, 950) prior has a mean of 50/1000 = 5%. Halving alpha to 25
# encodes new data, or the belief that the new controls will reduce the
# chance of a breach by 50%.
info_prior = compute_posterior(25, 950, total_breaches, total_organizations)
# X-axis scale adjustment
# After running the initial program, I adjusted the scale for aesthetic reasons.
x = np.linspace(0.03, 0.07, 1000)  # x-axis shows 3% to 7%
# Plotting the Curves
fig, ax = plt.subplots(figsize=(10, 6))
mean_non_info, mode_non_info, p5_non_info, p95_non_info = plot_distribution(non_info_prior, x, 'Industry Benchmark', 'yellow', ax)
mean_info, mode_info, p5_info, p95_info = plot_distribution(info_prior, x, 'New Controls', 'green', ax)
# Setting the Title and the x and y axis labels
ax.set_title('Probability Distribution for Web App Attack Method')
ax.set_xlabel('Probability')
ax.set_ylabel('Density')
ax.set_ylim(bottom=0)
ax.legend()
plt.show()
# Print statistics for both distributions below the plot
print("Non-Informative Prior (Industry Benchmark):")
print(f"Mean: {mean_non_info:.4f}, Mode: {mode_non_info:.4f}, 90% Confidence Interval: P5 = {p5_non_info:.4f}, P95 = {p95_non_info:.4f}")
print("\nInformative Prior (New Controls):")
print(f"Mean: {mean_info:.4f}, Mode: {mode_info:.4f}, 90% Confidence Interval: P5 = {p5_info:.4f}, P95 = {p95_info:.4f}")
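As a dependency-free sanity check, the headline means and modes the program prints can be reproduced in closed form from the two Beta posteriors it constructs, Beta(1 + 213, 1 + 4237 - 213) and Beta(25 + 213, 950 + 4237 - 213), without TensorFlow Probability; the helper function below is mine:

```python
def beta_mean_mode(alpha, beta):
    """Closed-form mean and mode of a Beta(alpha, beta) distribution."""
    mean = alpha / (alpha + beta)
    mode = (alpha - 1) / (alpha + beta - 2)  # valid for alpha, beta > 1
    return mean, mode

# Industry benchmark: Beta(1 + 213, 1 + 4237 - 213) = Beta(214, 4025)
bench_mean, bench_mode = beta_mean_mode(214, 4025)
# New controls: Beta(25 + 213, 950 + 4237 - 213) = Beta(238, 4974)
ctrl_mean, ctrl_mode = beta_mean_mode(238, 4974)

print(f"Benchmark: mean={bench_mean:.4f}, mode={bench_mode:.4f}")
print(f"Controls:  mean={ctrl_mean:.4f}, mode={ctrl_mode:.4f}")
```

The means of roughly 5.05% and 4.57% match the figures quoted in the visualization discussion.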
Summary & Conclusion
This Python program exemplifies the practical application of Bayesian statistics in cybersecurity. It demonstrates how complex statistical methods can be accessible and useful in addressing real-world challenges. It bridges the gap between theoretical statistics and practical cybersecurity solutions through thoughtful implementation and clear visualization.