Analyzing Cybersecurity Risks: Estimating Phishing Attack Probabilities with Bayesian Statistics in Python Using Beta Distribution

Tim Layton
11 min read · Jun 26, 2024

In today’s digital landscape, cybersecurity threats are a significant concern for businesses of all sizes. Phishing attacks, in which malicious actors attempt to deceive employees into revealing sensitive information or clicking on harmful links, are particularly prevalent.

The latest Verizon Data Breach Investigations Report (DBIR) highlights the alarming effectiveness of phishing attacks across all industries. The report underscores that phishing remains one of the most prevalent and successful methods for compromising organizational security. This finding emphasizes that phishing is a highly likely scenario for breaches, affecting companies regardless of size or sector. Despite advancements in security technology, the human element remains a critical vulnerability. Securing the human aspect of cybersecurity is an ongoing challenge, as no amount of technology alone can fully address this issue. However, by performing analyses like the one shared in this article, organizations can better understand their specific risks and invest in strategies that inform and educate their users. Empowering employees with knowledge and awareness can significantly mitigate the risk of phishing attacks and contribute to a more secure organizational environment.

You can connect with me on LinkedIn and join my professional network.

To mitigate these risks, it’s crucial to understand and quantify the likelihood of such attacks succeeding and leading to a cyber breach.

In this article, I walk you through a Python-based approach to estimating and visualizing these probabilities using Bayesian statistical methods. I explain why Python and specific libraries were chosen, provide a detailed step-by-step explanation of the program, and discuss why Bayesian methods, such as the Beta distribution, are well-suited for this type of analysis compared to traditional machine learning approaches.

Why Python and Specific Libraries?

Python

Python was chosen for this project due to its simplicity, readability, and extensive ecosystem of libraries that make data analysis and statistical computing straightforward. Python’s popularity in the data science and cybersecurity communities also means a wealth of resources and support is available.

NumPy

NumPy is a fundamental library for numerical computing in Python. It offers efficient array handling and a wide range of mathematical functions, making it ideal for performing the numerical operations required in this analysis.

Matplotlib

Matplotlib is a versatile plotting library that allows us to create detailed and customizable visualizations. In this project, we use Matplotlib to visualize the probability distributions, making the results easy to interpret and understand.

SciPy

SciPy builds on NumPy and provides additional functionality for scientific computing, including a comprehensive set of statistical functions. We use SciPy’s stats module to work with the Beta distribution, central to our Bayesian analysis.

Scenario Overview

The program estimates two key probabilities:

  1. The likelihood of an employee clicking on a phishing email.
  2. The likelihood that a click on a phishing email will result in a cyber breach.

In this scenario, I aimed to make the content both realistic and practical for my readers. Most large organizations conduct internal phishing campaigns, but often, they only scratch the surface of what this data can reveal. There is a significant opportunity, as highlighted in this article, to leverage this internal phishing campaign data in conjunction with industry breach data. By doing so, you can estimate the probability of a cyber breach event based on your users’ propensity to click on phishing email links. This approach not only provides deeper insights into your organization’s specific risks but also helps in formulating more effective strategies to mitigate these threats. This underscores the value of using a Bayesian method to reliably estimate the probability of a cyber breach, even if your organization has not yet experienced this widespread threat.

In this scenario, we combine prior knowledge (e.g., effectiveness of phishing training and industry benchmarks) with observed data to provide a probabilistic assessment of these risks. This Bayesian approach allows us to update our beliefs as new data becomes available, providing a flexible and robust framework for risk assessment.

Why a Bayesian Method Is a Better Choice Than a Machine Learning Approach

Integration of Prior Knowledge:

  • Bayesian Method: Allows the incorporation of prior knowledge and expert opinion directly into the analysis. For example, the effectiveness of phishing training or industry benchmarks can be included as prior distributions, providing a more informed starting point for the analysis.
  • Machine Learning: Typically relies solely on the available data for training. It does not inherently incorporate prior beliefs or external knowledge, which might lead to less informed initial models, especially when data is sparse.

Continuous Updating:

  • Bayesian Method: Provides a natural framework for updating probabilities as new data becomes available. This method is particularly useful in dynamic environments where new information can continuously refine risk assessments.
  • Machine Learning: Requires retraining the model with new data, which can be resource-intensive and may not integrate new information as seamlessly as Bayesian updating.
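To make the updating idea concrete, here is a minimal sketch in plain Python. The campaign numbers and the `update` helper are hypothetical, chosen only for illustration. It shows that updating after each campaign gives the same posterior as a single update on the pooled totals.

```python
def update(alpha, beta, clicks, recipients):
    """Conjugate Beta update: clicks add to alpha, non-clicks add to beta."""
    return alpha + clicks, beta + (recipients - clicks)

# Hypothetical campaign results as (clicks, recipients); the numbers are made up.
campaigns = [(12, 200), (9, 200), (5, 200)]

# Update one campaign at a time, starting from a uniform Beta(1, 1) prior.
alpha, beta = 1, 1
for clicks, recipients in campaigns:
    alpha, beta = update(alpha, beta, clicks, recipients)
    mean = alpha / (alpha + beta)
    print(f"Beta({alpha}, {beta}), posterior mean {mean:.3f}")
sequential = (alpha, beta)

# A single update with the pooled totals lands on the same posterior.
total_clicks = sum(c for c, _ in campaigns)
total_recipients = sum(n for _, n in campaigns)
batch = update(1, 1, total_clicks, total_recipients)
print(sequential == batch)
```

Because the update is plain addition, nothing has to be retrained when a new campaign finishes.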

Handling Sparse Data:

  • Bayesian Method: This method is particularly effective in situations where data is limited. By leveraging prior distributions, Bayesian methods can still produce meaningful results even with sparse data, making them suitable for rare events like cyber breaches.
  • Machine Learning: Often requires large amounts of data to perform well. Machine learning models may struggle to provide accurate predictions in scenarios with limited data.
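A small sketch of the sparse-data point, assuming a 5% industry benchmark and 20 observed clicks with no breaches (both numbers are made up). The prior uses the same 100 pseudo-observation weighting as the program later in this article.

```python
industry_rate = 0.05        # assumed benchmark, for illustration only
clicks, breaches = 20, 0    # hypothetical sparse observations

# The raw frequency says breaches never happen, which is not credible.
naive_estimate = breaches / clicks

# Prior built from the benchmark, weighted as 100 pseudo-observations.
alpha_prior = 1 + industry_rate * 100
beta_prior = 1 + (1 - industry_rate) * 100

# Conjugate update with the observed data.
alpha_post = alpha_prior + breaches
beta_post = beta_prior + (clicks - breaches)
posterior_mean = alpha_post / (alpha_post + beta_post)

print(f"naive estimate: {naive_estimate:.3f}")
print(f"posterior mean: {posterior_mean:.3f}")
```

The posterior mean stays above zero and moves only slightly away from the benchmark, which is the behavior you want when 20 observations are all you have.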

Probabilistic Interpretation:

  • Bayesian Method: This method produces full probability distributions for the parameters of interest, allowing for a more comprehensive understanding of uncertainty and risk. This probabilistic interpretation is valuable for decision-makers who must understand the range of possible outcomes and how likely each one is.
  • Machine Learning: Typically provides point estimates or classifications without directly quantifying uncertainty. While some machine learning models can estimate uncertainty, it is not as inherent or straightforward as in Bayesian methods.
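As an illustration, the sketch below summarizes a posterior with a 95% credible interval instead of a single number. It assumes a Beta(77, 1530) posterior, which is what the program in this article produces for 75 clicks among 1,600 employees with the "trained" prior. `interval` is SciPy's equal-tailed interval method.

```python
from scipy.stats import beta as beta_dist

# Assumed posterior: Beta(2, 5) prior updated with 75 clicks out of 1,600.
alpha_post, beta_post = 77, 1530

# A single-number summary: the posterior mean.
point_estimate = alpha_post / (alpha_post + beta_post)

# Equal-tailed 95% credible interval for the click rate.
low, high = beta_dist.interval(0.95, alpha_post, beta_post)

print(f"posterior mean: {point_estimate:.4f}")
print(f"95% credible interval: {low:.4f} to {high:.4f}")
```

Reporting the interval alongside the point estimate tells a decision-maker how much the estimate could reasonably move.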

Flexibility and Robustness:

  • Bayesian Method: This method offers a flexible and robust framework for combining different sources of information and adapting to new data. This flexibility is crucial for complex, evolving risk assessments in cybersecurity.
  • Machine Learning: While powerful for pattern recognition and prediction, it may lack the same flexibility in integrating diverse sources of information and adapting to new, sparse, or evolving data.

Using a Bayesian method for estimating the probability of a cyber breach allows for integrating prior knowledge, continuous updating with new data, effective handling of sparse data, and a comprehensive probabilistic interpretation. These advantages make Bayesian methods better than traditional machine learning approaches for dynamic and uncertain environments like cybersecurity risk assessment.

Step-by-Step Explanation of the Program

Step 1: Import Libraries

import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import beta as beta_dist

We import the necessary libraries: NumPy for numerical operations, Matplotlib for plotting, and SciPy for statistical functions, specifically the Beta distribution.

Step 2: User Input

def get_user_input():
    total_employees = int(input("Enter the total number of employees: "))
    clicked_employees = int(input("Enter the number of employees who clicked on the phishing email: "))
    training = input("Have employees received phishing training in the last 6 months? (yes/no): ").strip().lower()
    industry_breach_rate = float(input("Enter the industry benchmark breach probability (as a decimal, e.g., 0.05 for 5%): "))
    return total_employees, clicked_employees, training, industry_breach_rate

This function prompts the user to input:

  • The total number of employees.
  • The number of employees who clicked on a phishing email.
  • Whether employees received phishing training in the last 6 months.
  • The industry benchmark breach probability (for example, taken from the DBIR).

Step 3: Calculate Beta Parameters

def calculate_beta_params(events, trials, alpha_prior, beta_prior):
    alpha_post = alpha_prior + events
    beta_post = beta_prior + (trials - events)
    return alpha_post, beta_post

This function calculates the posterior parameters for the Beta distribution, updating the prior parameters with observed data (events and trials).
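The function above trusts its inputs. The sketch below repeats it and adds a hypothetical wrapper, `checked_beta_params` (not part of the original program), that rejects inputs that would produce invalid Beta parameters, such as more clicks than employees.

```python
def calculate_beta_params(events, trials, alpha_prior, beta_prior):
    alpha_post = alpha_prior + events
    beta_post = beta_prior + (trials - events)
    return alpha_post, beta_post

def checked_beta_params(events, trials, alpha_prior, beta_prior):
    """Hypothetical wrapper: validate inputs before the conjugate update."""
    if trials < 0 or not 0 <= events <= trials:
        raise ValueError("need 0 <= events <= trials")
    if alpha_prior <= 0 or beta_prior <= 0:
        raise ValueError("prior parameters must be positive")
    return calculate_beta_params(events, trials, alpha_prior, beta_prior)

# Uniform Beta(1, 1) prior updated with 3 events in 10 trials.
alpha_post, beta_post = checked_beta_params(3, 10, 1, 1)
print(alpha_post, beta_post)  # 4 8
```

With 3 events in 10 trials, the uniform prior becomes Beta(4, 8): one pseudo-success plus three observed, and one pseudo-failure plus seven observed.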

Step 4: Plot Beta Distribution

def plot_beta_distribution(alpha, beta, title):
    # Mode of the Beta distribution; fall back to 0 when it is not defined by this formula.
    mode = (alpha - 1) / (alpha + beta - 2) if alpha > 1 and beta > 1 else 0
    # Zoom the x-axis to a window of +/- 0.05 around the mode, clipped to [0, 1].
    x_min = max(0, mode - 0.05)
    x_max = min(1, mode + 0.05)
    x = np.linspace(x_min, x_max, 100)
    y = beta_dist.pdf(x, alpha, beta)
    plt.figure(figsize=(10, 6))
    plt.plot(x, y, 'b-', label=f'Beta distribution with α={alpha} and β={beta}')
    plt.fill_between(x, y, alpha=0.2, color='blue')
    # Mark and label the most probable value.
    plt.axvline(mode, color='red', linestyle='--')
    plt.text(mode, max(y), f'{mode*100:.2f}%', color='red', ha='center', va='bottom')
    plt.title(title)
    plt.xlabel('Probability')
    plt.ylabel('Density')
    plt.xlim(x_min, x_max)
    plt.legend()
    plt.show()

This function plots the Beta distribution using the given alpha and beta parameters, dynamically adjusting the x-axis range around the mode for better readability. It also annotates the plot with the most probable value (mode).
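To see how the x-axis window behaves without opening a plot, the mode and window logic can be pulled into a small function. `mode_and_window` is a hypothetical helper written for this illustration, and the example parameters are arbitrary.

```python
def mode_and_window(alpha, beta, half_width=0.05):
    """Return (mode, x_min, x_max) using the same rules as the plotting function."""
    if alpha > 1 and beta > 1:
        mode = (alpha - 1) / (alpha + beta - 2)
    else:
        mode = 0  # same fallback as the plotting function
    x_min = max(0, mode - half_width)
    x_max = min(1, mode + half_width)
    return mode, x_min, x_max

# Mode in the interior: the window is symmetric around it.
print(mode_and_window(4, 8))
# Mode near 1: the upper edge is clipped to 1.
print(mode_and_window(50, 2))
# Uniform prior: the fallback puts the window at the left edge.
print(mode_and_window(1, 1))
```

One consequence worth knowing: when alpha or beta is 1 or less, the fallback mode of 0 means the plot only shows the range from 0 to 0.05.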

Step 5: Main Function

def main():
    total_employees, clicked_employees, training, industry_breach_rate = get_user_input()

    # Prior for the click rate: Beta(2, 5) if employees were trained recently, otherwise uniform Beta(1, 1).
    if training == 'yes':
        alpha_prior_click = 2
        beta_prior_click = 5
    else:
        alpha_prior_click = 1
        beta_prior_click = 1

    alpha_post_click, beta_post_click = calculate_beta_params(clicked_employees, total_employees, alpha_prior_click, beta_prior_click)
    plot_beta_distribution(alpha_post_click, beta_post_click, 'Posterior Probability Distribution of Phishing Clicks')

    # Prior for the breach rate per click: the industry benchmark weighted as 100 pseudo-observations.
    alpha_prior_breach = 1 + industry_breach_rate * 100
    beta_prior_breach = 1 + (1 - industry_breach_rate) * 100

    # No breaches observed so far, so events = 0 and every click counts as one trial.
    alpha_post_breach, beta_post_breach = calculate_beta_params(0, clicked_employees, alpha_prior_breach, beta_prior_breach)
    plot_beta_distribution(alpha_post_breach, beta_post_breach, 'Posterior Probability Distribution of Clicks Leading to a Breach')

if __name__ == "__main__":
    main()

The main function orchestrates the entire process:

  • It gathers user inputs.
  • Sets prior parameters for the probability of clicking on a phishing email based on whether training was received.
  • Calculates and plots the posterior distribution for phishing clicks.
  • Sets prior parameters for the probability of a click leading to a breach using the industry benchmark.
  • Calculates and plots the posterior distribution for breaches.
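For scripting or testing, the same steps can be run without prompts or plots. The sketch below is one way to do that. `posterior_params` is a hypothetical name, and the 5% benchmark in the example call is an assumed value, not a figure from the DBIR.

```python
def posterior_params(total_employees, clicked_employees, trained, industry_breach_rate):
    """Return ((alpha, beta) for clicks, (alpha, beta) for breaches per click)."""
    # Click-rate prior: Beta(2, 5) if trained, uniform Beta(1, 1) otherwise.
    a_click, b_click = (2, 5) if trained else (1, 1)
    click_post = (a_click + clicked_employees,
                  b_click + (total_employees - clicked_employees))

    # Breach prior: the benchmark weighted as 100 pseudo-observations.
    a_breach = 1 + industry_breach_rate * 100
    b_breach = 1 + (1 - industry_breach_rate) * 100
    # Zero breaches observed; each click counts as one trial.
    breach_post = (a_breach + 0, b_breach + clicked_employees)
    return click_post, breach_post

click_post, breach_post = posterior_params(1600, 75, True, 0.05)
print("clicks:", click_post)
print("breaches per click:", breach_post)
```

Note that the breach posterior can only move downward from the benchmark as long as the organization keeps reporting zero breaches, because every click is recorded as a non-breach trial.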

Sample Visualizations From the Python Program

I have included the plots generated by the Python program to illustrate how straightforward it is to understand the information produced by this simple yet highly effective approach. This method is more intuitive and offers greater flexibility than tools like Excel. The visualizations clearly convey the probabilistic assessments, making complex data easy to interpret and actionable for decision-makers.

In this first visualization, we show the posterior distribution for a sample organization with 1,600 employees, 75 of whom clicked on the internal phishing campaign email. Based on that campaign, the most probable click rate (the mode of the posterior) is 4.74%. That is the best single estimate of the probability that a user in your community will click on a phishing email from a threat actor.
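The 4.74% figure can be reproduced by hand. It matches the posterior mode under the Beta(2, 5) "trained" prior from the main function, which suggests (this is an inference, not something the article states) that the sample run answered "yes" to the training question. The uniform prior gives a slightly different value.

```python
def posterior_mode(alpha_prior, beta_prior, clicks, employees):
    """Mode of the Beta posterior after a conjugate update (needs alpha, beta > 1)."""
    alpha_post = alpha_prior + clicks
    beta_post = beta_prior + (employees - clicks)
    return (alpha_post - 1) / (alpha_post + beta_post - 2)

trained = posterior_mode(2, 5, 75, 1600)     # Beta(2, 5) prior
untrained = posterior_mode(1, 1, 75, 1600)   # uniform Beta(1, 1) prior

print(f"trained prior:   {trained * 100:.2f}%")    # 4.74%
print(f"untrained prior: {untrained * 100:.2f}%")  # 4.69%
```

With 1,600 observations the choice of prior shifts the estimate by only a few hundredths of a percentage point. The data dominates.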

This is useful information by itself, but we can extend it by looking at the breach data in the latest DBIR report for your industry since your organization has not yet fallen victim to a phishing attack.

In this analysis, we have combined internal phishing campaign data with industry cyber breach data from the Verizon DBIR report to compute the probability of your organization falling victim to a phishing attack. This approach leverages Bayesian statistics, specifically the Beta distribution, to provide a clear and defensible method for estimating the probability of a cyber breach.

Bayesian methods allow us to incorporate prior knowledge and expert opinions into our analysis directly. For instance, the effectiveness of phishing training and industry benchmarks, such as those from the Verizon DBIR report, can be included as prior distributions. This makes the analysis more informed and grounded in real-world data and expert insights. One of the core strengths of Bayesian statistics is the ability to update our beliefs as new data becomes available. This is particularly useful in cybersecurity, where the threat landscape constantly evolves. The Bayesian approach ensures that risk assessments remain current and relevant by continuously integrating new phishing campaign results.

Cyber breaches, especially those resulting from phishing attacks, can be relatively rare events. Bayesian methods are well-suited for situations with limited data because they can leverage prior distributions to produce meaningful estimates even when the observed data is sparse. Additionally, Bayesian analysis provides full probability distributions for the parameters of interest, allowing us to understand the range of possible outcomes and their likelihoods. This probabilistic interpretation is more informative than single-point estimates, giving decision-makers a comprehensive view of the risks.
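One way to use the full distribution is to ask how likely it is that the true rate exceeds a tolerance. The sketch below assumes a Beta(6, 171) breach posterior, which results from a 5% benchmark and 75 clicks with no breaches. The 5% benchmark and the 5% tolerance are both illustrative assumptions.

```python
from scipy.stats import beta as beta_dist

# Assumed breach-per-click posterior: Beta(6, 96) prior plus 75 non-breach clicks.
alpha_post, beta_post = 6, 171
tolerance = 0.05  # hypothetical risk tolerance

posterior_mean = beta_dist.mean(alpha_post, beta_post)
# Survival function: posterior probability that the rate is above the tolerance.
prob_above = beta_dist.sf(tolerance, alpha_post, beta_post)

print(f"posterior mean: {posterior_mean:.4f}")
print(f"P(rate > {tolerance:.0%}): {prob_above:.3f}")
```

A statement like "there is an X% chance the rate exceeds our tolerance" is often easier to act on than a point estimate alone.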

The Beta distribution is ideal for modeling probabilities because it is defined on the interval [0, 1], which aligns perfectly with the probability of a cyber breach occurring. The shape of the Beta distribution is determined by two parameters, α (alpha) and β (beta), which makes it highly flexible. By adjusting these parameters, we can model various scenarios and incorporate various degrees of belief and uncertainty.

The Beta distribution is also the conjugate prior for the Bernoulli and binomial distributions, meaning that the posterior distribution is also a Beta distribution when combined with binomial data. This property simplifies the mathematical computations and makes the Bayesian updating process straightforward. Furthermore, the alpha and beta parameters can be interpreted as counts: starting from a uniform Beta(1, 1) prior, alpha is the number of successes plus one and beta is the number of failures plus one. This makes it easy to incorporate prior knowledge and understand the impact of observed data on the posterior distribution.
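The conjugacy property can be checked in a few lines. In the sketch below, the symbols p (the unknown probability), k (observed events), and n (trials) are introduced for illustration; k and n correspond to `events` and `trials` in calculate_beta_params.

```latex
\begin{aligned}
\text{Prior:}\quad & f(p) \propto p^{\alpha-1}(1-p)^{\beta-1} \\
\text{Likelihood:}\quad & P(k \mid p) \propto p^{k}(1-p)^{n-k} \\
\text{Posterior:}\quad & f(p \mid k) \propto p^{\alpha+k-1}(1-p)^{\beta+n-k-1}
\end{aligned}
```

The last line is the kernel of a Beta(α + k, β + n − k) distribution, which is exactly the addition performed by the program: events are added to alpha and non-events to beta.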

Using Bayesian statistics and the Beta distribution, this analysis provides a robust, flexible, and intuitive method for estimating the probability of a cyber breach due to phishing attacks. Combining internal phishing campaign data with industry breach data from the Verizon DBIR report ensures that the risk assessments are grounded in empirical data and continuously updated with new information. This approach offers a significant advantage over traditional methods, providing clear and actionable insights for improving your organization’s cybersecurity posture. The plots generated by the Python program further enhance understanding, making complex probabilistic assessments accessible and easy to interpret.

What About My Beloved Risk Matrix?

The methods used in this scenario, leveraging Bayesian statistics and the Beta distribution, provide a more robust and flexible risk assessment framework than the traditional risk matrix approach commonly used in many organizations.

While risk matrices categorize risks based on qualitative scales and often rely on subjective judgments, they can oversimplify the complexities and interdependencies of real-world threats.

In contrast, the Bayesian approach incorporates prior knowledge and continuously updates with new data, offering a probabilistic assessment that quantifies uncertainty and provides a more nuanced understanding of risks. This allows for more precise and data-driven decision-making. Risk matrices, on the other hand, can lead to misinformed prioritization and to underestimating certain risks because of their inherent limitations in handling sparse and evolving data.

Dr. Tony Cox and Doug Hubbard have highlighted several critical issues with using risk matrices in risk assessment. One major concern is that risk matrices often lack precision, as they tend to categorize risks into broad qualitative categories that can obscure significant differences in risk levels. They also point out that risk matrices can lead to inconsistent and arbitrary risk rankings due to their reliance on subjective judgments and ordinal scales that do not accurately reflect the underlying probabilities or consequences. Additionally, risk matrices can fail to capture the complexities and interdependencies of risks, leading to oversimplified and potentially misleading assessments. These limitations underscore the need for more rigorous and quantitative approaches, such as Bayesian analysis, which can provide a clearer and more accurate picture of risk.

Why Use Beta Distribution?

  • Modeling Probabilities: The Beta distribution is ideal for modeling probabilities because it is defined on the interval [0, 1].
  • Parameter Interpretation: The parameters alpha and beta can be interpreted as prior successes and failures, making it intuitive to incorporate prior knowledge.
  • Flexibility: By adjusting alpha and beta, the Beta distribution can take various shapes, representing different degrees of belief and uncertainty.
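To illustrate how alpha and beta encode different degrees of belief, the sketch below compares the three priors that appear in this article's program. The Beta(6, 96) entry assumes a 5% industry benchmark, which is an illustrative value.

```python
from scipy.stats import beta as beta_dist

# (alpha, beta) pairs; the labels are descriptive names chosen for this example.
priors = {
    "uniform Beta(1, 1)": (1, 1),       # no training: every rate equally plausible
    "trained Beta(2, 5)": (2, 5),       # training: mass shifted toward lower rates
    "benchmark Beta(6, 96)": (6, 96),   # 5% benchmark as 100 pseudo-observations
}

summary = {}
for name, (a, b) in priors.items():
    summary[name] = (beta_dist.mean(a, b), beta_dist.std(a, b))
    mean, std = summary[name]
    print(f"{name}: mean {mean:.3f}, std {std:.3f}")
```

Larger alpha + beta means a narrower distribution, so the benchmark prior expresses far more certainty than the two click-rate priors.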

Why Not a Machine Learning Approach?

  • Data Requirements: Machine learning models typically require large amounts of data to train effectively. Data can be sparse in cybersecurity scenarios, especially with rare events like breaches.
  • Interpretability: Bayesian methods provide clear and interpretable probabilistic results, which decision-makers usually find easier to understand than the frequently opaque outputs of machine learning models.
  • Prior Knowledge: Bayesian methods allow for the incorporation of prior knowledge, which is particularly valuable in fields where expert opinions and industry benchmarks are available and useful.

Conclusion

I hope the approach and simple Python program I shared in this article demonstrate how practical and useful it is to analyze cybersecurity risk with these methods.

The methods I shared today leverage the strengths of Python and its libraries to perform a Bayesian analysis of phishing attack probabilities. Combining prior knowledge with observed data provides a flexible and robust framework for estimating and visualizing cybersecurity risks. The Beta distribution is particularly well-suited for this type of probabilistic modeling, making the results intuitive and informative for decision-making.
