Overview of Pgmpy Python Library for Cybersecurity Risk Analysis Using Bayesian Networks

Apr 5, 2024

In the new cloud-computing era, marked by its complex and evolving threat landscape, Bayesian Networks and Python are pivotal tools in reshaping cybersecurity risk analysis.

The intricacy of cloud infrastructure, with its intertwined services and data flows, necessitates a move beyond traditional risk matrices to a model that can adeptly handle complex dependencies and predict emerging threats.

With their capacity for probabilistic reasoning, continuous learning from historical data, and dynamic adaptation to new threats, Bayesian Networks provide a robust framework for understanding and mitigating risks in real time.

With its powerful data processing and analytical libraries, Python stands as the essential computational backbone, enabling the detailed and efficient execution of Bayesian models. Together, they form a formidable duo, promising a future where cybersecurity risk analysis is not only reactive but also predictive and nuanced, tailored to the complexities of the cloud computing environment.

This article is part of a series: Data-Driven Decisions: Exploring Cybersecurity Risk Analysis using Python and Bayesian Statistics.

You can connect with me on LinkedIn and join my professional network.

In this article, I provide a high-level overview of why the pgmpy library is a great fit for developing Bayesian Networks for cybersecurity risk analysis in Python. I also share a simple Python program that illustrates how to build one with pgmpy.

Pgmpy Python Library

pgmpy is an open-source Python library for building Probabilistic Graphical Models (PGMs), including Bayesian Networks, learning them from data, and running inference on them.

It provides tools for structure learning (discovering the network structure from data), parameter learning (estimating the conditional probabilities that quantify the relationships between variables), and inference (computing the probabilities of interest from the model, given evidence).

In a future article, I will explain PGMs and Directed Acyclic Graphs (DAGs) in detail and provide real-world use cases and the Python code to make it all come to life, but that is beyond the scope of this article.

Key Features of pgmpy:

Model Creation: Users can manually define the structure of Bayesian Networks, specifying nodes and edges to represent variables and their conditional dependencies.
Learning Algorithms: pgmpy supports both structure and parameter learning from data, with constraint-based and score-based algorithms for structure learning, and maximum likelihood and Bayesian estimation for parameter learning.
Inference Engines: To compute the probabilities of interest, different inference methods are offered, including exact inference (like Variable Elimination) and approximate inference (like Monte Carlo methods).
Extensibility: The library is designed to be easily extendable for new algorithms and models.
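
To make "parameter learning" concrete, here is a minimal sketch of maximum-likelihood estimation done by hand in plain Python rather than through pgmpy's estimators. The audit records and the variable names are hypothetical, invented only for illustration. The idea is simply to count how often each child state occurs for each parent state.

```python
from collections import Counter

# Hypothetical observations: (security_measures, system_vulnerability) per audit.
# These records are invented for illustration only.
records = [
    ("strong", "low"), ("strong", "low"), ("strong", "high"), ("strong", "low"),
    ("weak", "high"), ("weak", "high"), ("weak", "low"), ("weak", "high"),
]


def estimate_cpt(rows):
    """Maximum-likelihood estimate of P(vulnerability | security) from counts."""
    pair_counts = Counter(rows)
    parent_counts = Counter(security for security, _ in rows)
    return {
        (vulnerability, security): count / parent_counts[security]
        for (security, vulnerability), count in pair_counts.items()
    }


cpt = estimate_cpt(records)
for (vulnerability, security), probability in sorted(cpt.items()):
    print(f"P(vulnerability={vulnerability} | security={security}) = {probability:.2f}")
```

With only eight records the estimates are crude. In practice the counts would come from much larger telemetry datasets, and a library estimator would handle missing combinations and smoothing.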

Why pgmpy is Excellent for Bayesian Networks in Cybersecurity Risk Analysis

Complex Dependency Modeling:
Bayesian Networks in cybersecurity need to model complex, non-linear relationships between risk factors, like threat likelihood, vulnerability impacts, and mitigation strategies. pgmpy allows for the detailed representation of these dependencies, providing a clear structure to the network that reflects the intricate interactions in a cybersecurity context.

Dynamic Learning Capability:
Cyber threats evolve rapidly, necessitating a system that can learn from new data and update its beliefs. With pgmpy, as new cybersecurity data (such as incident reports, threat intelligence feeds, and vulnerability updates) becomes available, the network’s parameters can be re-estimated to reflect these changes, keeping the risk analysis current and relevant.
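
The idea of updating beliefs as data arrives can be sketched with a Beta distribution over a single probability, written in plain Python rather than with pgmpy. The prior, the monthly batches, and the BetaBelief class are all hypothetical and chosen only to show the mechanics: each new batch of observations is added to the running pseudo-counts.

```python
from dataclasses import dataclass


@dataclass
class BetaBelief:
    """Beta(alpha, beta) belief about the probability of a binary event."""
    alpha: float  # pseudo-count of times the event occurred
    beta: float   # pseudo-count of times it did not occur

    def update(self, occurred: int, not_occurred: int) -> "BetaBelief":
        """Return the posterior belief after observing new counts."""
        return BetaBelief(self.alpha + occurred, self.beta + not_occurred)

    @property
    def mean(self) -> float:
        """Expected probability of the event under the current belief."""
        return self.alpha / (self.alpha + self.beta)


# Hypothetical prior: with weak controls, a system is highly vulnerable about
# 90% of the time, held with the weight of 10 earlier observations.
belief = BetaBelief(alpha=9.0, beta=1.0)

# Hypothetical monthly scan results: (highly vulnerable systems, other systems).
monthly_batches = [(6, 4), (5, 5), (3, 7)]

history = [belief.mean]
for high, other in monthly_batches:
    belief = belief.update(high, other)
    history.append(belief.mean)

print([round(p, 3) for p in history])  # [0.9, 0.75, 0.667, 0.575]
```

The estimate drifts from the prior toward what the data shows, and the prior's influence shrinks as evidence accumulates. The same principle applies to every entry of a network's conditional probability tables.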

Inference and Prediction:
With pgmpy, one can perform inference to predict the probability of future cybersecurity incidents based on the current network state. This predictive capability is crucial for identifying potential risks and implementing proactive measures to mitigate them. For example, if a new vulnerability is discovered, pgmpy can help estimate the increased likelihood of a security breach.
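
As a rough illustration of the vulnerability example, the sketch below compares the breach probability before and after a high vulnerability is confirmed, using the law of total probability in plain Python. It reuses the illustrative numbers from the example program in this article, read so that a breach is most likely when vulnerability and threat are both high. That reading of the state labels is an assumption, as are the helper and variable names.

```python
# Illustrative numbers only; the state labels ("high"/"low") are assumptions.
# 0.34 = 0.7 * 0.1 + 0.3 * 0.9, the chance of high vulnerability before any finding.
p_high_vulnerability = 0.34
p_high_threat = 0.5

# P(breach | vulnerability, threat)
p_breach = {
    ("high", "high"): 0.99,
    ("high", "low"): 0.90,
    ("low", "high"): 0.60,
    ("low", "low"): 0.10,
}


def breach_probability(p_vuln_high: float) -> float:
    """Sum P(vulnerability) * P(threat) * P(breach | vulnerability, threat)."""
    total = 0.0
    for vuln, p_v in (("high", p_vuln_high), ("low", 1.0 - p_vuln_high)):
        for threat, p_t in (("high", p_high_threat), ("low", 1.0 - p_high_threat)):
            total += p_v * p_t * p_breach[(vuln, threat)]
    return total


baseline = breach_probability(p_high_vulnerability)  # before the discovery
after_finding = breach_probability(1.0)              # high vulnerability confirmed

print(f"Baseline breach probability: {baseline:.4f}")        # 0.5523
print(f"After the new vulnerability: {after_finding:.4f}")   # 0.9450
```

An inference engine performs this same bookkeeping automatically across the entire network, which matters once there are more than a handful of nodes.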

Scalability and Performance:
Handling the vast amount of data generated in cloud environments is critical for effective cybersecurity risk analysis. pgmpy, with Python’s computational efficiency, is well-suited for processing large datasets, making it capable of scaling to the needs of large-scale cloud infrastructures.

Example Scenario

Consider a simple scenario where a cybersecurity team wants to assess the risk of a data breach. I will keep this simple for illustration purposes. In reality, many more factors would likely be considered.

Using pgmpy, we can create a Bayesian Network that includes nodes representing different risk factors, such as external threat levels, system vulnerabilities, and the effectiveness of current security measures.

As new data about emerging threats or detected vulnerabilities is received, the network can be updated to reflect these changes. This is a key concept that makes Bayesian Networks so much more powerful and useful than other static and qualitative approaches, like the Risk Matrix, which is so common in many organizations today.

Comparing Bayesian Networks to the Risk Matrix is akin to contrasting riding a camel with driving a Ferrari: Bayesian Networks, like a Ferrari, offer a fast, advanced, and dynamic approach to analyzing probabilities and outcomes, capable of handling complex, interconnected variables efficiently. In contrast, the Risk Matrix, similar to camel riding, provides a more static and slower method that attempts to communicate risks less dynamically.

Next, we can use pgmpy’s inference capabilities to estimate the probability of a data breach, helping the team prioritize security investments and interventions effectively.

In summary, pgmpy’s comprehensive features for building, updating, and querying Bayesian Networks make it an excellent choice for developing sophisticated and dynamic cybersecurity risk analysis tools capable of addressing the complex and evolving threat landscape of cloud computing environments.

Python Program For The Example Scenario

In this example program, I wanted to illustrate how elegant and straightforward it is to use the power of Bayesian Networks for cybersecurity risk analysis.

This program is intended for illustrative purposes, and a real-world version would be supported by internal telemetry and empirical data to ensure the models are trustworthy and defensible.

I would also create intuitive visualizations to help stakeholders consume the information easily and quickly.

In this simple program:

I use a simple five-step process to create the Bayesian Network and compute the probability of a cyber breach.
We define a Bayesian Network with nodes representing external threats, system vulnerabilities, data breaches, and security measures.
We set up the Conditional Probability Distributions (CPDs) for each node, giving the probability of each state of the node for every combination of its parents' states.
We add these CPDs to our model and validate the model structure.
Finally, we perform inference to calculate the probability of a data breach given that security measures are weak (represented as 1 in the evidence; state 0 is strong).

This example is simplified for illustrative purposes and would need to be expanded with more detailed data and nuanced relationships for a real-world application. However, it demonstrates how pgmpy can model and analyze cybersecurity risks dynamically.

I would also use a library like Matplotlib to create visually expressive data visualizations that help stakeholders consume information quickly and easily.

With the comments and blank lines removed, this program is about 30 lines of code, which illustrates the power of this library.

The DAG (Directed Acyclic Graph) for this simple scenario can be visualized as follows:

SecurityMeasures --> SystemVulnerability --> DataBreach
ExternalThreat --> DataBreach

In this DAG:

SecurityMeasures affects SystemVulnerability, indicating that the strength of security measures impacts the system’s vulnerability.
SystemVulnerability and ExternalThreat both influence DataBreach, showing that the likelihood of a data breach depends on both the system’s vulnerability and the level of external threat.
Arrows (-->) represent the direction of dependency from cause to effect.

This diagram succinctly encapsulates the relationships modeled in the Bayesian Network, where SecurityMeasures indirectly impacts DataBreach through its effect on SystemVulnerability, and ExternalThreat directly affects DataBreach.
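
The structure described above can also be checked without pgmpy. The sketch below is one way to do it with Python's standard-library graphlib module (Python 3.9+): it builds a parent map from the same three edges, confirms the graph is acyclic, and lists the nodes so that every cause appears before its effects. The variable names are illustrative.

```python
from graphlib import CycleError, TopologicalSorter

# Edges of the example network, written as (cause, effect).
edges = [
    ("ExternalThreat", "DataBreach"),
    ("SystemVulnerability", "DataBreach"),
    ("SecurityMeasures", "SystemVulnerability"),
]

# Map each node to the set of its parents (its direct causes).
parents = {}
for cause, effect in edges:
    parents.setdefault(cause, set())
    parents.setdefault(effect, set()).add(cause)

try:
    # static_order() raises CycleError if the graph contains a cycle.
    order = list(TopologicalSorter(parents).static_order())
    is_dag = True
except CycleError:
    order, is_dag = [], False

print("Acyclic:", is_dag)
print("Causes before effects:", ", ".join(order))
print("Parents of DataBreach:", sorted(parents["DataBreach"]))
```

The parent sets are exactly what determine the shape of each CPD: a node with no parents needs a single column of probabilities, while DataBreach, with two binary parents, needs four columns.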

Python Program

# Import necessary classes from pgmpy
from pgmpy.models import BayesianNetwork  # replaces the older BayesianModel; recent pgmpy releases name this class DiscreteBayesianNetwork
from pgmpy.factors.discrete import TabularCPD
from pgmpy.inference import VariableElimination

# Step 1: Define the structure of the Bayesian Network
model = BayesianNetwork([
    ('ExternalThreat', 'DataBreach'),
    ('SystemVulnerability', 'DataBreach'),
    ('SecurityMeasures', 'SystemVulnerability')
])

# Step 2: Define the Conditional Probability Distributions (CPDs)
cpd_external = TabularCPD(variable='ExternalThreat', variable_card=2,
                          values=[[0.5], [0.5]])  # 50% chance for each state

cpd_security = TabularCPD(variable='SecurityMeasures', variable_card=2,
                          values=[[0.7], [0.3]])  # state 0 = strong (70%), state 1 = weak (30%)

# SystemVulnerability depends on SecurityMeasures only, according to the network structure
# Columns: SecurityMeasures = 0 (strong), 1 (weak)
cpd_vulnerability = TabularCPD(variable='SystemVulnerability', variable_card=2,
                               values=[[0.1, 0.9],  # state 0: high vulnerability
                                       [0.9, 0.1]], # state 1: low vulnerability
                               evidence=['SecurityMeasures'],
                               evidence_card=[2])

# DataBreach depends on ExternalThreat and SystemVulnerability
# Columns: (SystemVulnerability, ExternalThreat) = (0, 0), (0, 1), (1, 0), (1, 1)
cpd_breach = TabularCPD(variable='DataBreach', variable_card=2,
                        values=[[0.01, 0.1, 0.4, 0.9],  # state 0: no breach
                                [0.99, 0.9, 0.6, 0.1]], # state 1: breach
                        evidence=['SystemVulnerability', 'ExternalThreat'],
                        evidence_card=[2, 2])

# Step 3: Add the CPDs to the model
model.add_cpds(cpd_external, cpd_security, cpd_vulnerability, cpd_breach)

# Step 4: Validate the model (check_model() returns True, or raises an error if a CPD is inconsistent)
if model.check_model():
    print("Model is valid.")
else:
    print("Model is invalid.")

# Step 5: Perform inference on the model
inference = VariableElimination(model)

# Query the probability of a Data Breach given weak Security Measures (state 1)
prob_breach = inference.query(variables=['DataBreach'], evidence={'SecurityMeasures': 1})
print(prob_breach)

The text-based output of this program is as follows:

Model is valid.
+---------------+-------------------+
| DataBreach    |   phi(DataBreach) |
+===============+===================+
| DataBreach(0) |            0.1145 |
+---------------+-------------------+
| DataBreach(1) |            0.8855 |
+---------------+-------------------+

The output of this program can be understood as follows:

Model is valid: This message confirms that the Bayesian Network model is correctly structured. This means that the conditional probability distributions (CPDs) are properly defined and aligned with the network’s structure, and the model setup has no inconsistencies or errors. I comment this out once I know my code is working properly.

Table Output (Probability Distribution): This table shows the probability distribution for the DataBreach node after performing inference, given the evidence (conditions) provided in the query.

DataBreach(0) and DataBreach(1) are the two states of the DataBreach node: 0 denotes no data breach and 1 denotes a data breach.
phi(DataBreach) is pgmpy's label for the resulting factor over DataBreach. After the query it is normalized, so each value is the probability of that state given the evidence.

Probability Values:

The value 0.1145 next to DataBreach(0) indicates that, given the evidence specified in the inference query (SecurityMeasures = 1, i.e., weak security measures), there is approximately an 11.45% chance that there will be no data breach.
The value 0.8855 next to DataBreach(1) indicates that, under the same conditions, there is an 88.55% chance of experiencing a data breach.

To read and understand this output, one should recognize that it represents the model’s computed probabilities for the occurrence and non-occurrence of a data breach based on the current network configuration and evidence provided. In this context, the model predicts a higher likelihood of a data breach (88.55%) given the specific conditions set in the query, which might indicate a need to reassess the risk factors and security measures in place.
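
The numbers in the table can be reproduced by hand, which is a useful sanity check on any small network. The sketch below is a plain-Python enumeration, not pgmpy's algorithm. It copies the CPD values from the program, works purely with the state numbers 0 and 1, and sums over the unobserved nodes. The function name is illustrative.

```python
from itertools import product

# CPD values copied from the example program; indices 0/1 are the state numbers.
p_threat = [0.5, 0.5]            # P(ExternalThreat)
p_vuln_given_sec = [[0.1, 0.9],  # P(SystemVulnerability=0 | SecurityMeasures=0, 1)
                    [0.9, 0.1]]  # P(SystemVulnerability=1 | SecurityMeasures=0, 1)
p_breach_given = {               # P(DataBreach=1 | SystemVulnerability, ExternalThreat)
    (0, 0): 0.99, (0, 1): 0.90, (1, 0): 0.60, (1, 1): 0.10,
}


def breach_distribution(security_state):
    """Return [P(DataBreach=0), P(DataBreach=1)] given the SecurityMeasures state."""
    p_breach = 0.0
    for vuln, threat in product((0, 1), repeat=2):
        weight = p_vuln_given_sec[vuln][security_state] * p_threat[threat]
        p_breach += weight * p_breach_given[(vuln, threat)]
    return [1.0 - p_breach, p_breach]


for state in (0, 1):
    no_breach, breach = breach_distribution(state)
    print(f"SecurityMeasures={state}: "
          f"P(DataBreach=0)={no_breach:.4f}, P(DataBreach=1)={breach:.4f}")
# SecurityMeasures=0: P(DataBreach=0)=0.5905, P(DataBreach=1)=0.4095
# SecurityMeasures=1: P(DataBreach=0)=0.1145, P(DataBreach=1)=0.8855
```

The second line matches the table printed by pgmpy for evidence SecurityMeasures = 1. The first line shows how the same network answers the query for the other evidence state.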

These probabilities are illustrative rather than realistic. A real-world model would be built on internal telemetry and/or reliable empirical data. I would also extend the program to read from data sources or prompt the user for inputs, and add supporting data visualizations using a library like Matplotlib.

Summary

The pgmpy library, in tandem with Python, stands out as a robust toolset for constructing and applying Bayesian Networks in the realm of dynamic cybersecurity risk analysis. This combination leverages Python's computational strengths and pgmpy's specialized functionalities to model the intricate and evolving nature of cyber threats. It is possible to tap into APIs and analytic workspaces in the cloud to bring a new level of real-time analysis.

With pgmpy, users can define complex dependencies between risk factors, integrate new data, and update their models in real time, ensuring that the risk analysis remains current and reflects the actual threat landscape. The library's support for various inference algorithms enables the prediction of potential security incidents, facilitating proactive risk management.

Python’s role is pivotal, providing a versatile and efficient environment for handling large datasets typical in cloud-based systems. Its extensive ecosystem, including libraries like pgmpy, allows for the detailed and efficient execution of Bayesian models, making it indispensable in processing the voluminous and continuous data flow inherent in cybersecurity operations.

In essence, the integration of pgmpy and Python equips cybersecurity professionals with the means to develop dynamic, predictive models of cybersecurity risk. This approach significantly advances traditional, static risk analysis methods, offering a more nuanced and timely assessment of potential threats in the ever-changing cybersecurity landscape.

Through the power of Bayesian Networks and Python, stakeholders can make informed decisions, prioritize security measures, and allocate resources more effectively, ultimately enhancing the resilience of cyber infrastructures against potential threats.

Written by Tim Layton