Statistics, Cybersecurity [Year 2024 - 25]
Topics on Statistics with intensive computer applications
$ \int_0^t d S_u = \int_0^t \mu(S_u, u) du + \int_0^t\sigma(S_u, u) dW_u $
Supporto al corso e alla didattica telematica, by T. Gastaldi #Sapienzanonsiferma #Sapienzadoesnotstop
(Instructor: tommaso.gastaldi@gmail.com,
https://www.datatime.eu/public/cybersecurity/)
WhatsApp group for the students of this course
Invitation to join the WhatsApp group for this course: https://chat.whatsapp.com/Kk3wRGmmxWH9RNUo01zFdX
(When first joining, send a message with your name and id ("matricola"))
____________________________________________________________________________________
General notes for all homeworks
-Implement the exercises in your choice of C#, VB or JavaScript. For JavaScript, always use the latest
ECMAScript (classes, let, const, no var, etc.) and strict mode (WebStorm or Rider can be of
great help to stay up to date with the latest language updates and to check syntax). Publish the JavaScript programs directly online as web pages.
-All important code must be shown and possibly discussed (the crucial parts only) on the homework web page, so that one can understand the main points.
(The full version can be stored on GitHub or as a zip file containing the "solution", if you like, but that is not required.)
-Never use any third-party library external to the language, or higher-level languages (e.g., SAS, R, Python, MATLAB, Minitab, etc.), because our purpose is to implement the very basics from scratch in order to deeply understand our topics. (Using other people's "black boxes" would defeat our learning purpose.)
-Always exercise your capacity for abstraction. Never write algorithms that work only on specific cases or data; on the contrary, try to be as general as possible in all of your creations and logic. Use smart personal implementations to show your intelligence and insight! Originality and deep thinking are the most appreciated values in this course.
-Always acknowledge your sources and use quotes when you copy and paste text from other sources (note that what you copy may be wrong!).
Homework 1
Theory (intro)
- Basic notions in Statistics: Population, Statistical Units, Distribution, Frequency (relative, absolute, percentage);
- Notion of arithmetic average. Derivation. Computational problems with floating point representation (errors, catastrophic cancellation) and numerical solution (Knuth);
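As a minimal sketch of the numerical issue mentioned above, the following compares a plain sum-then-divide average with the incremental update m_k = m_{k-1} + (x_k - m_{k-1}) / k. The function names `naiveMean` and `runningMean` and the test data are illustrative assumptions, not part of the course material.

```javascript
"use strict";

// Naive mean: accumulate the whole sum, divide at the end.
// The partial sums can overflow or lose low-order digits.
function naiveMean(values) {
  let sum = 0;
  for (const x of values) sum += x;
  return sum / values.length;
}

// Running mean: m_k = m_{k-1} + (x_k - m_{k-1}) / k.
// The accumulator stays on the scale of the data, not of the sum.
function runningMean(values) {
  let mean = 0;
  let k = 0;
  for (const x of values) {
    k += 1;
    mean += (x - mean) / k;
  }
  return mean;
}

// Hypothetical data: four values close to the largest representable double
const big = [1e308, 1e308, 1e308, 1e308];
console.log(naiveMean(big));   // the sum overflows to Infinity
console.log(runningMean(big)); // stays finite
```

The running form needs one division per observation, which is the price paid for keeping intermediate values bounded.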
Applications / practice
We have n servers and m attackers. Each attacker has probability p of penetrating each server. Make a graphical representation of each attacker's trajectory (the line stays flat if the attacker does not penetrate a server and jumps by 1 if he does), and try different values of n, m, p.
At time n, we want the complete distribution of how many attackers reached each level. (Draw the distribution histogram vertically
at the end of the chart, so that each rectangle representing the attackers' frequency is placed on the corresponding number of penetrations (or "successes") they achieved).
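The data behind such a chart could be generated along these lines. This is only a sketch of the simulation logic (no drawing); the function name `simulateAttacks` and the optional `rng` parameter are assumptions introduced for illustration.

```javascript
"use strict";

// Simulate m attackers, each attempting n servers in sequence.
// Each attempt succeeds independently with probability p.
// Returns the m trajectories (cumulative successes) and the final histogram.
function simulateAttacks(n, m, p, rng = Math.random) {
  const trajectories = [];
  const histogram = new Array(n + 1).fill(0); // index = number of successes
  for (let a = 0; a < m; a++) {
    const path = [0];
    let successes = 0;
    for (let s = 0; s < n; s++) {
      if (rng() < p) successes += 1; // jump of +1 on penetration, flat otherwise
      path.push(successes);
    }
    trajectories.push(path);
    histogram[successes] += 1; // absolute frequency of each final level
  }
  return { trajectories, histogram };
}

const { histogram } = simulateAttacks(20, 500, 0.3);
console.log(histogram);
```

Each entry `histogram[k]` is the number of attackers who ended with k successes, which is the quantity to draw vertically at the right edge of the chart.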
Some resources:
https://en.wikipedia.org/wiki/Variable_and_attribute_(research)
https://www.investopedia.com/terms/s/statistics.asp
https://www.scribbr.com/methodology/sampling-methods/#:~:text=Probability%20sampling%20methods%20include%20simple,a%20chance%20of%20being%20included.
https://en.wikipedia.org/wiki/Design_of_experiments
https://www.surveymonkey.com/mp/open-ended-questions-get-more-context-to-enrich-your-data/#:~:text=open%2Dended%20questions%3F-,So%20what%20are%20open%2Dended%20questions%3F,or%20other%20closed%2Dended%20format.
https://en.wikipedia.org/wiki/Level_of_measurement
https://www.youtube.com/watch?v=uHRqkGXX55I&ab_channel=SimpleLearningPro
https://www.youtube.com/watch?v=EZrP_av3cmA&ab_channel=SimpleLearningPro
https://www.youtube.com/watch?v=pTuj57uXWlk&ab_channel=SimpleLearningPro
https://www.youtube.com/watch?v=10ikXret7Lk&ab_channel=SimpleLearningPro
Homework 2
Theory
Find the simplest and most elegant way to show the Welford recursion.
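One possible route to the recursion, given here only as a sketch (the notation m_n for the mean of the first n values and S_n for the sum of squared deviations is an assumption of this note), starts from the identity S_n = sum of x_i^2 minus n m_n^2:

```latex
\begin{align*}
m_n &= \frac{1}{n}\sum_{i=1}^{n} x_i, \qquad
S_n = \sum_{i=1}^{n} (x_i - m_n)^2 = \sum_{i=1}^{n} x_i^2 - n\,m_n^2 \\
n\,m_n &= (n-1)\,m_{n-1} + x_n
\;\Longrightarrow\; m_n = m_{n-1} + \frac{x_n - m_{n-1}}{n} \\
S_n - S_{n-1} &= x_n^2 - n\,m_n^2 + (n-1)\,m_{n-1}^2 \\
&= x_n^2 - m_n\bigl[(n-1)\,m_{n-1} + x_n\bigr] + m_{n-1}\bigl[n\,m_n - x_n\bigr] \\
&= x_n^2 - x_n m_n - x_n m_{n-1} + m_n m_{n-1} \\
&= (x_n - m_{n-1})(x_n - m_n)
\end{align*}
```

The variance after n observations is then S_n / n (or S_n / (n - 1) for the corrected version).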
Application / practice
Refine your Euler–Maruyama simulator, which approximates numerical solutions of stochastic differential equations (SDEs), by adding the following variants to the existing framework:
A. Jumps -1 +1 with prob. p [random walk]
B. Absolute and relative frequency trajectories
C. Final distribution and intermediate distributions (at one internal time/step selectable from the gui),
with mean and variance (make it all parametric so that one unique interface will handle it all).
Research
Make your personal notes about the behavior of mean and variance with respect to time. For instance:
What did you observe in all the 4 different cases (relative/absolute frequency & Bernoulli/random walk)?
What are the main differences between the distribution of the absolute number of successes
and that of the relative frequencies?
Some resources:
https://en.wikipedia.org/wiki/Euler%E2%80%93Maruyama_method
https://en.wikipedia.org/wiki/Random_walk
https://en.wikipedia.org/wiki/Multinomial_distribution
Homework 3
Theory/Research
Illustrate formally, in the simplest possible way, why the median is the value c that minimizes the sum of |x(i) - c| (sum of absolute deviations).
Find all the conceptually different ways to define a "location" statistic (sometimes also called "center" or "central tendency"), or synthesis, of a distribution, showing how the generalization of these ideas can potentially lead to infinitely many other definitions.
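Before attempting the formal argument for the median property, it can help to check it numerically. The sketch below scans candidate centers on a grid; the data set and the helper names `sumAbsDev` and `median` are hypothetical.

```javascript
"use strict";

// Sum of absolute deviations of the data from a candidate center c
function sumAbsDev(xs, c) {
  return xs.reduce((acc, x) => acc + Math.abs(x - c), 0);
}

// Median: middle value of the sorted data (average of the two middle values if n is even)
function median(xs) {
  const s = [...xs].sort((a, b) => a - b);
  const mid = Math.floor(s.length / 2);
  return s.length % 2 === 1 ? s[mid] : (s[mid - 1] + s[mid]) / 2;
}

const data = [2, 3, 7, 8, 50]; // hypothetical data with one outlier
const med = median(data);
const mean = data.reduce((a, b) => a + b, 0) / data.length;

// Scan candidate centers on a grid and keep the one with the smallest sum
let best = { c: null, value: Infinity };
for (let c = 0; c <= 50; c += 0.5) {
  const v = sumAbsDev(data, c);
  if (v < best.value) best = { c, value: v };
}

console.log({ med, atMedian: sumAbsDev(data, med), atMean: sumAbsDev(data, mean), best });
```

Replacing the absolute value with another loss function in `sumAbsDev` (for example a squared difference) gives a different minimizer, which is one way to see how further location statistics can be generated.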
Application / practice
Refine your SDE simulator to simulate a continuous time process where we can have an attack (indicated with a jump of +1) at any
time with a constant rate of attack.
To create the approximation of time continuity subdivide your reference temporal window into numerous intervals
of vanishing size dt = 1/n and to each infinitesimal interval assign a probability of a +1 "jump" (attack success) equal
to Lambda * dt, where Lambda is a simulation parameter, having the meaning of expected total number of attacks in the reference
period.
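A minimal sketch of this discretization, assuming a reference window of length 1, could look as follows. The names `simulateJumpProcess` and `meanFinalCount` and the optional `rng` argument are illustrative assumptions.

```javascript
"use strict";

// Approximate a counting process on [0, 1] with n sub-intervals of size dt = 1/n.
// On each sub-interval a +1 jump occurs with probability lambda * dt.
function simulateJumpProcess(lambda, n, rng = Math.random) {
  const dt = 1 / n;
  const pJump = lambda * dt; // needs n >= lambda so that pJump <= 1
  const path = [0];
  let count = 0;
  for (let i = 0; i < n; i++) {
    if (rng() < pJump) count += 1;
    path.push(count);
  }
  return path;
}

// Average the final count over many runs with a running mean
function meanFinalCount(lambda, n, runs, rng = Math.random) {
  let mean = 0;
  for (let r = 1; r <= runs; r++) {
    const path = simulateJumpProcess(lambda, n, rng);
    mean += (path[n] - mean) / r;
  }
  return mean;
}

// The average number of jumps should come out close to lambda
console.log(meanFinalCount(5, 1000, 2000));
```

Since each of the n intervals contributes an expected lambda / n jumps, the expected total is lambda, consistent with the meaning given to the parameter above.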
Some resources:
https://en.wikipedia.org/wiki/Stochastic_simulation
https://www.probabilitycourse.com/chapter11/11_1_2_basic_concepts_of_the_poisson_process.php
https://www.probabilitycourse.com/chapter7/7_1_1_law_of_large_numbers.php
Homework 4
Theory/Research
Illustrate the concept of statistical independence, showing also the analogies with the formal definitions in probability theory.
Application / practice
Refine your SDE simulator to generate a continuous-time process representing the scaling limit of the random walk.
To create the approximation of time continuity, subdivide your reference temporal window into vanishing intervals
dt and, on each (theoretically infinitesimal) interval, assign probability p or 1 - p to a jump of +sqrt(dt) or -sqrt(dt), respectively (p = 1/2 gives the symmetric walk).
Note the significance of the simulation (Donsker's invariance principle, also known as the functional central limit theorem)
in relation to the Wiener process.
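The construction can be sketched as follows. The names `scaledRandomWalk` and `endpointMoments`, the window length `T`, and the optional `rng` argument are assumptions made for this example.

```javascript
"use strict";

// Scaled random walk on [0, T]: n steps of size dt = T / n,
// each step is +sqrt(dt) with probability p and -sqrt(dt) otherwise.
function scaledRandomWalk(T, n, p = 0.5, rng = Math.random) {
  const dt = T / n;
  const step = Math.sqrt(dt);
  const path = [0];
  let w = 0;
  for (let i = 0; i < n; i++) {
    w += rng() < p ? step : -step;
    path.push(w);
  }
  return path;
}

// Running (Welford-style) mean and variance of the endpoint over many symmetric paths
function endpointMoments(T, n, runs, rng = Math.random) {
  let mean = 0;
  let m2 = 0;
  for (let r = 1; r <= runs; r++) {
    const path = scaledRandomWalk(T, n, 0.5, rng);
    const x = path[n];
    const delta = x - mean;
    mean += delta / r;
    m2 += delta * (x - mean);
  }
  return { mean, variance: m2 / runs };
}

// For p = 1/2 the endpoint has mean 0 and variance n * dt = T
console.log(endpointMoments(1, 500, 2000));
```

The step size sqrt(dt) is what keeps the endpoint variance equal to T regardless of n, so the paths do not blow up or collapse as dt shrinks.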
Some resources:
https://en.wikipedia.org/wiki/Donsker%27s_theorem
https://www.youtube.com/watch?v=sJPlOMrcJXo&ab_channel=ResearchMethodsandStatistics%28FMG%2CUvA%29
Homework 5
- Prove in the simplest possible way the C-S (Cauchy-Schwarz) inequality
(r coefficient normalizing denominator)
- Reflect on the concepts of independence and uncorrelation, pointing
out conceptual differences and possible measures.
- E-M Simulator Enhancement:
Enhance your existing Euler-Maruyama (E-M) simulator by developing a unified simulation framework. Create a general central class that can possibly manage various types of stochastic differential
equations (SDEs).
Optional: Regression Coefficients:
Derive the coefficients (b) and (a) of the two regression lines (y on x, and x on y) using the least squares method, and show their relationship with R^2.
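For the optional part, the closed-form least-squares coefficients can be checked numerically. The sketch below uses hypothetical data and an illustrative function name, `leastSquares`; it fits both regression lines and compares the product of the two slopes with R^2.

```javascript
"use strict";

// Least-squares fit of y on x: y = a + b x, with b = Sxy / Sxx and a = my - b * mx.
function leastSquares(xs, ys) {
  const n = xs.length;
  const mx = xs.reduce((s, v) => s + v, 0) / n;
  const my = ys.reduce((s, v) => s + v, 0) / n;
  let sxx = 0;
  let syy = 0;
  let sxy = 0;
  for (let i = 0; i < n; i++) {
    sxx += (xs[i] - mx) ** 2;
    syy += (ys[i] - my) ** 2;
    sxy += (xs[i] - mx) * (ys[i] - my);
  }
  const b = sxy / sxx;
  const a = my - b * mx;
  // Cauchy-Schwarz gives Sxy^2 <= Sxx * Syy, so r2 lies in [0, 1]
  const r2 = (sxy * sxy) / (sxx * syy);
  return { a, b, r2 };
}

const x = [1, 2, 3, 4, 5]; // hypothetical data
const y = [2, 4, 5, 4, 5];
const yOnX = leastSquares(x, y); // regression of y on x
const xOnY = leastSquares(y, x); // regression of x on y
console.log(yOnX, xOnY, yOnX.b * xOnY.b); // the product of the slopes equals r2
```

The same Sxx * Syy product that appears in the Cauchy-Schwarz inequality is the normalizing denominator of r, which ties the first and the optional items of this homework together.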
Homework 6
Theory/Research
Research: Recall the fundamental theorem of calculus and demonstrate its relationship with density
functions and cumulative distribution functions (CDFs).
Application / practice
Exercise: Generate realizations from a discrete univariate probability distribution with arbitrary probabilities.
Graphically show the convergence of the empirical distribution to the theoretical distribution as the sample size increases.
Compute also, during the generation, the mean and variance using recursive methods (e.g., Knuth's/Welford's algorithms)
and compare these results with the theoretical mean and variance, discussing the relationship.
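One possible skeleton for this exercise is sketched below (without the graphical part). Values are drawn by inverting the cumulative probabilities, and the moments are updated recursively. The names `sampleDiscrete` and `simulate` and the example distribution are assumptions for illustration.

```javascript
"use strict";

// Draw one value from a discrete distribution by inverting the cumulative sums.
function sampleDiscrete(values, probs, rng = Math.random) {
  const u = rng();
  let cumulative = 0;
  for (let i = 0; i < values.length; i++) {
    cumulative += probs[i];
    if (u < cumulative) return values[i];
  }
  return values[values.length - 1]; // guard against rounding in the cumulative sum
}

// Generate n draws, tracking empirical counts plus running mean and variance.
function simulate(values, probs, n, rng = Math.random) {
  const counts = new Map(values.map((v) => [v, 0]));
  let mean = 0;
  let m2 = 0;
  for (let k = 1; k <= n; k++) {
    const x = sampleDiscrete(values, probs, rng);
    counts.set(x, counts.get(x) + 1);
    const delta = x - mean;
    mean += delta / k;        // recursive update of the mean
    m2 += delta * (x - mean); // recursive update of the sum of squared deviations
  }
  return { counts, mean, variance: m2 / n };
}

// Hypothetical distribution
const values = [0, 1, 2, 5];
const probs = [0.1, 0.4, 0.3, 0.2];
const theoMean = values.reduce((s, v, i) => s + v * probs[i], 0);
const theoVar = values.reduce((s, v, i) => s + (v - theoMean) ** 2 * probs[i], 0);
console.log(simulate(values, probs, 10000), { theoMean, theoVar });
```

Dividing each count by n gives the empirical distribution to plot against `probs` for increasing n.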
Homework 7
Theory/Research
Application / practice
Using the setup of previous homework, from a discrete distribution generate m (e.g. m=1000 ...) samples
of size n (e.g., n = 20, 30, 100, ...). Compute the distribution of the sampling average.
Determine the average and variance of the distribution of the averages of the samples, and represent the distribution,
discussing the observed relationship with the mean and variance of the parent (theoretical) distribution.
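A compact sketch of this experiment follows. It draws m samples of size n from a hypothetical fair-die distribution and summarizes the m sample averages; the function names are illustrative assumptions.

```javascript
"use strict";

// Draw one value from a discrete distribution by inverting the cumulative sums.
function sampleDiscrete(values, probs, rng = Math.random) {
  const u = rng();
  let cumulative = 0;
  for (let i = 0; i < values.length; i++) {
    cumulative += probs[i];
    if (u < cumulative) return values[i];
  }
  return values[values.length - 1];
}

// Draw m samples of size n and return the m sample means.
function sampleMeans(values, probs, m, n, rng = Math.random) {
  const means = [];
  for (let j = 0; j < m; j++) {
    let mean = 0;
    for (let k = 1; k <= n; k++) {
      mean += (sampleDiscrete(values, probs, rng) - mean) / k;
    }
    means.push(mean);
  }
  return means;
}

// Mean and (uncorrected) variance of an array, computed recursively
function moments(xs) {
  let mean = 0;
  let m2 = 0;
  xs.forEach((x, i) => {
    const delta = x - mean;
    mean += delta / (i + 1);
    m2 += delta * (x - mean);
  });
  return { mean, variance: m2 / xs.length };
}

const values = [1, 2, 3, 4, 5, 6];
const probs = values.map(() => 1 / 6);
const n = 30;
const result = moments(sampleMeans(values, probs, 1000, n));
// Parent distribution: mean 3.5 and variance 35/12; compare the observed variance with (35/12) / n
console.log(result, { parentMean: 3.5, parentVarianceOverN: 35 / 12 / n });
```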
Optional
Given the random variable Y = g^U mod n (meaning the remainder of the division by n)
where U is a Uniform in [1, max_U] (max_U is a user param)
A) Generate the distributions of Y for n = 19 and g = 2, 3, 10, 17
B) Generate the distributions of Y for n = 15 and g = 3, 6, 9, 12
Observe the shape of the distributions and compute the entropy or other diversity indexes. Give your opinion on
the implications of any observed differences in terms of cryptographic properties (uniformity, predictability)
and potential applications. Why might case A be better suited for cryptographic applications? Why might case B
(predictability, lower entropy?) illustrate possible vulnerabilities, if any?
Why did we choose the set { 2, 3, 10, 17 } in case A? Spot possible errors in the exercise.
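A sketch of the computation is given below. It assumes U takes integer values 1, ..., max_U with equal probability and enumerates them exactly instead of sampling; the names `modPow`, `distributionOfY` and `entropyBits`, and the value 180 used for max_U, are illustrative choices.

```javascript
"use strict";

// Square-and-multiply modular exponentiation, using BigInt to avoid overflow.
function modPow(base, exp, mod) {
  const m = BigInt(mod);
  let result = 1n % m;
  let b = BigInt(base) % m;
  let e = BigInt(exp);
  while (e > 0n) {
    if (e & 1n) result = (result * b) % m;
    b = (b * b) % m;
    e >>= 1n;
  }
  return Number(result);
}

// Exact distribution of Y = g^U mod n when U is uniform on {1, ..., maxU}.
function distributionOfY(g, n, maxU) {
  const probs = new Array(n).fill(0);
  for (let u = 1; u <= maxU; u++) probs[modPow(g, u, n)] += 1 / maxU;
  return probs;
}

// Shannon entropy in bits, skipping zero-probability outcomes.
function entropyBits(probs) {
  return probs.reduce((h, p) => (p > 0 ? h - p * Math.log2(p) : h), 0);
}

for (const [n, gs] of [[19, [2, 3, 10, 17]], [15, [3, 6, 9, 12]]]) {
  for (const g of gs) {
    const h = entropyBits(distributionOfY(g, n, 180));
    console.log(`n = ${n}, g = ${g}, entropy = ${h.toFixed(3)} bits`);
  }
}
```

Comparing the printed entropies with log2(n - 1) shows how far each choice of g is from covering all nonzero residues uniformly.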
Homework 8
Theory/Research
Recall the notion of Shannon entropy and other diversity measures of distributions.
Recall the notion of primitive root (a primitive root modulo a prime number p is a number g such that, for every
integer a that is coprime to p, there exists an integer k such that g^k ≡ a (mod p)).
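The definition of primitive root can be tested by brute force for small primes. The sketch below computes the multiplicative order of g and compares it with p - 1; the function names are assumptions for illustration, and the loop is only suitable for small moduli.

```javascript
"use strict";

// Multiplicative order of g modulo p: smallest k >= 1 with g^k mod p = 1.
// Returns 0 when no such k exists (g not invertible modulo p).
function multiplicativeOrder(g, p) {
  const start = g % p;
  let value = start;
  for (let k = 1; k < p; k++) {
    if (value === 1) return k;
    value = (value * start) % p;
  }
  return 0;
}

// For a prime p, g is a primitive root exactly when its order is p - 1.
function isPrimitiveRoot(g, p) {
  return multiplicativeOrder(g, p) === p - 1;
}

// List all primitive roots of a small prime
function primitiveRoots(p) {
  const roots = [];
  for (let g = 1; g < p; g++) {
    if (isPrimitiveRoot(g, p)) roots.push(g);
  }
  return roots;
}

console.log(primitiveRoots(7)); // [3, 5]
```

When g is a primitive root, the powers g^1, ..., g^(p-1) visit every nonzero residue exactly once, which is the property that makes the distribution of g^k mod p uniform over those residues.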
Application / practice
Part 1
Find and compile a sufficiently large piece of text by selecting several web pages and create a letter frequency distribution.
Choose a random shift value (e.g., 1-25, with wrap-around) and apply the Caesar cipher to encrypt the original text:
E = L + shift for each letter L of the message.
Use frequency analysis or find any efficient and effective strategy to find the shift and decrypt the message.
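One possible strategy for recovering the shift, sketched here under the assumption that the plaintext is ordinary English, is to try all 26 shifts and keep the one whose letter counts are closest to a reference distribution in the chi-squared sense. The reference percentages below are approximate, and the names `caesar`, `letterCounts` and `guessShift` are illustrative (this simplified `caesar` handles letters only and outputs upper case).

```javascript
"use strict";

// Approximate English letter frequencies in percent, A..Z (assumed reference values)
const ENGLISH = [8.17, 1.49, 2.78, 4.25, 12.70, 2.23, 2.02, 6.09, 7.00, 0.15, 0.77, 4.03, 2.41,
  6.75, 7.51, 1.93, 0.10, 5.99, 6.33, 9.06, 2.76, 0.98, 2.36, 0.15, 1.97, 0.07];

// Shift letters by `shift` positions with wrap-around; other characters are kept.
function caesar(text, shift) {
  const s = ((shift % 26) + 26) % 26; // also handles negative shifts
  return text.toUpperCase().replace(/[A-Z]/g, (ch) =>
    String.fromCharCode(((ch.charCodeAt(0) - 65 + s) % 26) + 65));
}

// Absolute frequencies of the letters A..Z
function letterCounts(text) {
  const counts = new Array(26).fill(0);
  for (const ch of text.toUpperCase()) {
    const i = ch.charCodeAt(0) - 65;
    if (i >= 0 && i < 26) counts[i] += 1;
  }
  return counts;
}

// Try all 26 shifts; keep the one with the smallest chi-squared distance to English.
function guessShift(ciphertext) {
  const counts = letterCounts(ciphertext);
  const total = counts.reduce((a, b) => a + b, 0);
  let best = { shift: 0, score: Infinity };
  for (let shift = 0; shift < 26; shift++) {
    let score = 0;
    for (let i = 0; i < 26; i++) {
      const expected = (ENGLISH[i] / 100) * total;
      const observed = counts[(i + shift) % 26]; // plaintext letter i shows up as i + shift
      score += (observed - expected) ** 2 / expected;
    }
    if (score < best.score) best = { shift, score };
  }
  return best.shift;
}

const message = "Frequency analysis is the study of how often letters occur in a ciphertext, " +
  "and it is a classical aid to breaking substitution ciphers such as the Caesar cipher";
const secret = caesar(message, 7);
const recovered = guessShift(secret);
console.log(recovered, caesar(secret, -recovered));
```

Short or unusual texts can mislead this kind of test, so in practice it is worth inspecting the few best-scoring shifts rather than only the first.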
Part 2 Optional (Modular exponentiation)
Convert each letter of the original text to a numeric representation (A = 0, B = 1, ..., Z = 25).
Choose Parameters: Choose an exponent e and a modulus P. Ensure that e and P are coprime
(for example, you might choose e = 3 and P = 37).
Calculate Encoded Values: Calculate the encoded values using the formula: E = L^e mod P
for each letter L of the message, where L is the numeric representation of the letter. Try also encoding the
entire message rather than the single letters, and discuss the difference.
See if you can find strategies and effective ways to get back the values of e and P.
(In practice, certain values of e, like 3 or 65537 are commonly used. You may start with these values for e)
Visualize the distributions and calculate the Shannon entropy of the transformed distributions.
Summarize the findings from both parts of the exercise. Discuss how statistical analysis enhances understanding
of cryptographic algorithms and the importance of these skills in cybersecurity.
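For Part 2, the letter-by-letter encoding and its entropy can be sketched as below, using the example parameters e = 3 and P = 37 from the text. The helper names are illustrative assumptions; counting the distinct codes is one simple way to see whether the encoding can be inverted unambiguously.

```javascript
"use strict";

// Modular exponentiation by repeated squaring (adequate for small moduli such as 37).
function modPow(base, exp, mod) {
  let result = 1 % mod;
  let b = base % mod;
  let e = exp;
  while (e > 0) {
    if (e % 2 === 1) result = (result * b) % mod;
    b = (b * b) % mod;
    e = Math.floor(e / 2);
  }
  return result;
}

// Encode letters A..Z (0..25) as L^e mod P; other characters are skipped.
function encodeLetters(text, e, P) {
  const codes = [];
  for (const ch of text.toUpperCase()) {
    const L = ch.charCodeAt(0) - 65;
    if (L >= 0 && L < 26) codes.push(modPow(L, e, P));
  }
  return codes;
}

// Shannon entropy (bits) of the empirical distribution of an array of symbols.
function entropyBits(symbols) {
  const counts = new Map();
  for (const s of symbols) counts.set(s, (counts.get(s) || 0) + 1);
  let h = 0;
  for (const c of counts.values()) {
    const p = c / symbols.length;
    h -= p * Math.log2(p);
  }
  return h;
}

const alphabet = "ABCDEFGHIJKLMNOPQRSTUVWXYZ";
const encoded = encodeLetters(alphabet, 3, 37);
// Fewer than 26 distinct codes means that several letters collide
console.log(encoded, new Set(encoded).size, entropyBits(encoded).toFixed(3));
```

Trying other pairs (e, P) and watching how the number of distinct codes changes is a quick way to explore which parameter choices keep the mapping one-to-one.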
Hints and resources:
Function CaesarShift(input As String, shift As Integer) As String
    Dim result As New System.Text.StringBuilder()
    For Each ch As Char In input
        ' The double Mod keeps the index non-negative, so negative shifts (decryption) also work
        If Char.IsUpper(ch) Then
            ' Handle uppercase letters
            Dim offset As Integer = Asc("A")
            Dim shiftedChar As Char = Chr(((Asc(ch) - offset + shift) Mod 26 + 26) Mod 26 + offset)
            result.Append(shiftedChar)
        ElseIf Char.IsLower(ch) Then
            ' Handle lowercase letters
            Dim offset As Integer = Asc("a")
            Dim shiftedChar As Char = Chr(((Asc(ch) - offset + shift) Mod 26 + 26) Mod 26 + offset)
            result.Append(shiftedChar)
        ElseIf Char.IsDigit(ch) Then
            ' Handle digits (0-9)
            Dim offset As Integer = Asc("0")
            Dim shiftedChar As Char = Chr(((Asc(ch) - offset + shift) Mod 10 + 10) Mod 10 + offset)
            result.Append(shiftedChar)
        Else
            ' Any other character is left unchanged
            result.Append(ch)
        End If
    Next
    Return result.ToString()
End Function
//-----------------------
function caesarShift(input, shift) {
    // Rotate ch within a block of `size` symbols starting at `first`;
    // the double modulo keeps the index non-negative for negative shifts.
    const rotate = (ch, first, size) => {
        const offset = first.charCodeAt(0);
        const index = (((ch.charCodeAt(0) - offset + shift) % size) + size) % size;
        return String.fromCharCode(index + offset);
    };
    let result = '';
    for (const ch of input) {
        if (ch >= 'A' && ch <= 'Z') {
            result += rotate(ch, 'A', 26); // uppercase letters
        } else if (ch >= 'a' && ch <= 'z') {
            result += rotate(ch, 'a', 26); // lowercase letters
        } else if (ch >= '0' && ch <= '9') {
            result += rotate(ch, '0', 10); // digits (0-9)
        } else {
            result += ch; // any other character is left unchanged
        }
    }
    return result;
}
// Example usage
console.log(caesarShift("Hello, World! 123", 3)); // Output: "Khoor, Zruog! 456"
### English (EN) Letter Frequency Distribution
1. **K. M. O’Hara's Studies**: This study provides a comprehensive analysis of letter frequencies in English text. Here's a rough frequency distribution based on various sources:
- E: 12.70%
- T: 9.06%
- A: 8.17%
- O: 7.51%
- I: 7.00%
- N: 6.75%
- S: 6.33%
- H: 6.09%
- R: 5.99%
- D: 4.25%
- L: 4.03%
- C: 2.78%
- U: 2.76%
- M: 2.41%
- W: 2.36%
- F: 2.23%
- G: 2.02%
- Y: 1.97%
- P: 1.93%
- B: 1.49%
- V: 0.98%
- K: 0.77%
- J: 0.15%
- X: 0.15%
- Q: 0.10%
- Z: 0.07%
2. **Wikipedia**: The page on [Letter Frequencies](https://en.wikipedia.org/wiki/Letter_frequency) provides a good overview and includes references to original studies.
3. **Cryptography and Information Security**: There are many cryptography textbooks that discuss letter frequency, including authors like Bruce Schneier and William Stallings.
### Italian (ITA) Letter Frequency Distribution
1. **Italian Language Studies**: Here’s a common frequency distribution for the Italian language:
- E: 11.79%
- A: 10.49%
- I: 9.96%
- O: 8.76%
- T: 6.87%
- N: 6.73%
- R: 6.52%
- S: 5.38%
- L: 5.26%
- U: 3.33%
- D: 3.41%
- C: 3.29%
- M: 2.51%
- P: 2.49%
- H: 0.77%
- B: 0.81%
- F: 0.84%
- X: 0.10%
- J: 0.12%
- K: 0.03%
- Q: 0.52%
- Z: 0.39%
- W: 0.00%
Some resources:
https://en.wikipedia.org/wiki/Entropy_(information_theory)
https://en.wikipedia.org/wiki/Letter_frequency
https://en.wikipedia.org/wiki/Kullback%E2%80%93Leibler_divergence
https://en.wikipedia.org/wiki/Majorization
https://en.wikipedia.org/wiki/Modular_exponentiation
Homework 9
Theory/Research
Mention the main properties of the sampling mean and variance.
Illustrate the law of large numbers and some possible applications, especially related to cybersecurity concepts.
Application / practice
Following the same scheme as Homework 7, compute the distribution of the sampling variance ("corrected" or not).
Determine the distribution of the variances of the samples, together with its mean and variance,
discussing the observed relationship with the mean and variance of the parent (theoretical) distribution.
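The sketch below illustrates the experiment for a hypothetical fair-die parent distribution. For every sample it computes both the uncorrected (divide by n) and the corrected (divide by n - 1) variance; the function names are assumptions for illustration.

```javascript
"use strict";

// Draw one value from a discrete distribution by inverting the cumulative sums.
function sampleDiscrete(values, probs, rng = Math.random) {
  const u = rng();
  let cumulative = 0;
  for (let i = 0; i < values.length; i++) {
    cumulative += probs[i];
    if (u < cumulative) return values[i];
  }
  return values[values.length - 1];
}

// Both variance estimators for one sample, via the recursive update.
function sampleVariances(sample) {
  let mean = 0;
  let m2 = 0;
  sample.forEach((x, i) => {
    const delta = x - mean;
    mean += delta / (i + 1);
    m2 += delta * (x - mean);
  });
  return { uncorrected: m2 / sample.length, corrected: m2 / (sample.length - 1) };
}

// m samples of size n: collect the two estimators for every sample.
function varianceDistribution(values, probs, m, n, rng = Math.random) {
  const uncorrected = [];
  const corrected = [];
  for (let j = 0; j < m; j++) {
    const sample = Array.from({ length: n }, () => sampleDiscrete(values, probs, rng));
    const v = sampleVariances(sample);
    uncorrected.push(v.uncorrected);
    corrected.push(v.corrected);
  }
  return { uncorrected, corrected };
}

const average = (xs) => xs.reduce((s, v) => s + v, 0) / xs.length;

const values = [1, 2, 3, 4, 5, 6];
const probs = values.map(() => 1 / 6);
const dist = varianceDistribution(values, probs, 2000, 10);
// Parent variance is 35/12; compare it with the average of each estimator
console.log(average(dist.uncorrected), average(dist.corrected), 35 / 12);
```

The arrays in `dist` can be fed to a histogram routine to display the two sampling distributions side by side.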
Optional
Theory/Research
Research: Recall the fundamental ideas of the main encryption methods and their statistical properties.
AES-Inspired Encryption, Didactical "Toy" Version exercise
Objective: Apply statistics to learn about encryption and decryption using a simple substitution and permutation cipher. Gain insight into the fundamentals of encryption, key management, and frequency analysis, similar to concepts used in AES and RSA, particularly focusing on how these methods affect frequency distribution and entropy.
Create a Substitution Cipher
Generate a Substitution Key: Create a random mapping of the letters A-Z. Each letter should map to a unique letter.
Example:
A -> Q, B -> Z, C -> X, D -> W, E -> V, F -> U, G -> T,
H -> R, I -> S, J -> P, K -> O, L -> N, M -> M,
N -> L, O -> K, P -> J, Q -> I, R -> H, S -> G,
T -> F, U -> E, V -> D, W -> C, X -> B, Y -> A, Z -> Y
Choose a message: Pick a short message to encrypt. Example: "HELLO WORLD".
Encrypt the Text: Use your substitution key to transform each letter of your message. Write down the encrypted message.
Statistical Analysis:
Frequency Distribution:
Analyze the final encrypted message. Compare the frequency of letters in your original message and in the encrypted message. Discuss how the substitution cipher affects the distribution of letters.
Entropy: Calculate the entropy of both the original and the encrypted messages. Discuss how the substitution affects the amount of uncertainty or randomness in the message.
Permutation Step:
Reverse the order of the encrypted letters to create the final encrypted output.
Example: If your encrypted message is "RVNNK CKHNW" (the encryption of "HELLO WORLD" with the example key above), reverse it to "WNHKC KNNVR".
Discuss how reversing the order of letters affects the frequency distribution and entropy. Does it reveal or obscure any patterns?
Encryption/Decryption Challenge.
Exchange encrypted messages with a classmate. Attempt to decode each other's messages as a challenge. Start with an encrypted message such as "WNHKC KNNVR".
Guess the original message using frequency analysis or pattern recognition. If you know the substitution key, decode it by reversing both the substitution and the permutation.
Statistical Discussion:
Frequency Distribution Changes: How did the frequency distribution of letters change after applying the permutation step? Discuss the significance of this change in terms of statistical analysis and cryptography.
Entropy Considerations: Discuss how the overall entropy of the original and final messages compares. What does this indicate about the security and unpredictability of the encrypted message compared to the original?
Contrast with RSA concepts. Discuss how textbook RSA (deterministic, without padding), when applied symbol by symbol, maintains the structure of the frequency distribution, while its security rests on the difficulty of recovering the private key.
Final thoughts on entropy and security: Reflect on the importance of entropy in cryptography. Consider how higher entropy in an encrypted message can enhance security by making it harder for attackers to predict or analyze the message content.
[
Notes:
Comparison with AES: In your discussions, compare how AES significantly alters the frequency distribution and entropy of plaintext through complex transformations, including multiple layers of substitution and permutation, as well as an integral XOR operation with a key. Although we are skipping the XOR steps for simplicity in this assignment, they are crucial in the AES process, as they introduce an additional layer of complexity that enhances security and makes reverse engineering through frequency analysis difficult.]
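The toy cipher described in this exercise (substitution followed by reversal) can be sketched as follows, using the example key from the text. The function names are illustrative assumptions, and the entropy is computed on letters only.

```javascript
"use strict";

// Example substitution key from the exercise: PLAIN[i] is replaced by KEY[i]
const PLAIN = "ABCDEFGHIJKLMNOPQRSTUVWXYZ";
const KEY = "QZXWVUTRSPONMLKJIHGFEDCBAY";

// Replace each character found in `from` with the character at the same position in `to`
function substitute(text, from, to) {
  return [...text].map((ch) => {
    const i = from.indexOf(ch);
    return i >= 0 ? to[i] : ch; // spaces and punctuation are left unchanged
  }).join("");
}

const reverse = (text) => [...text].reverse().join("");
const encrypt = (msg) => reverse(substitute(msg.toUpperCase(), PLAIN, KEY));
const decrypt = (msg) => substitute(reverse(msg), KEY, PLAIN);

// Shannon entropy (bits) of the letter distribution of a text
function entropyBits(text) {
  const counts = new Map();
  let total = 0;
  for (const ch of text) {
    if (ch >= "A" && ch <= "Z") {
      counts.set(ch, (counts.get(ch) || 0) + 1);
      total += 1;
    }
  }
  let h = 0;
  for (const c of counts.values()) h -= (c / total) * Math.log2(c / total);
  return h;
}

const cipher = encrypt("HELLO WORLD");
console.log(cipher);          // WNHKC KNNVR
console.log(decrypt(cipher)); // HELLO WORLD
console.log(entropyBits("HELLO WORLD"), entropyBits(cipher));
```

Comparing the two printed entropies gives a concrete starting point for the discussion of how substitution and reversal affect the letter distribution.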
Homework 10
Theory/Research
General concept of sampling mean and variance and main features of their distributions
General idea of Lebesgue–Stieltjes integration and applications to Probability theory and to Measure theory
Application / practice
Try to compute a Lebesgue integral numerically and compare it with the Riemann integral (you might
compute the mean or variance of a distribution).
https://en.wikipedia.org/wiki/Lebesgue%E2%80%93Stieltjes_integration
https://en.wikipedia.org/wiki/Measure_(mathematics)
https://www.stat.berkeley.edu/~wfithian/courses/stat210a/measure-theory-basics.html
https://www.youtube.com/watch?v=TG67nsccqeQ
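One way to make the contrast concrete, offered only as a sketch, is to integrate the same non-negative function twice: once by slicing the domain (Riemann) and once by slicing the range and measuring the set where the function exceeds each level (the "layer cake" form of the Lebesgue integral). The function names, grid sizes, and the choice g(x) = x^2 on [0, 1] (the second moment of a uniform variable) are assumptions of this example.

```javascript
"use strict";

// Riemann view: partition the domain [a, b] and sum g(x) * dx (midpoint rule).
function riemannIntegral(g, a, b, steps) {
  const dx = (b - a) / steps;
  let total = 0;
  for (let i = 0; i < steps; i++) total += g(a + (i + 0.5) * dx) * dx;
  return total;
}

// Lebesgue view for g >= 0: partition the range [0, gMax] into levels t and
// sum measure({x : g(x) > t}) * dt. The measure is estimated on a fine grid.
function lebesgueIntegral(g, a, b, gMax, levels, gridPoints) {
  const dx = (b - a) / gridPoints;
  const dt = gMax / levels;
  const heights = [];
  for (let i = 0; i < gridPoints; i++) heights.push(g(a + (i + 0.5) * dx));
  let total = 0;
  for (let j = 0; j < levels; j++) {
    const t = (j + 0.5) * dt;
    const measure = heights.filter((h) => h > t).length * dx;
    total += measure * dt;
  }
  return total;
}

const g = (x) => x * x; // E[X^2] for X uniform on (0, 1); the exact value is 1/3
console.log(riemannIntegral(g, 0, 1, 1000));
console.log(lebesgueIntegral(g, 0, 1, 1, 1000, 1000));
```

Both numbers approximate the same integral; the difference lies in which axis is partitioned, which is why the Lebesgue form extends naturally to functions defined on general measure spaces.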
Optional
SSL/TLS Certificate Transparency Stat Analysis
Analyze publicly available SSL/TLS certificate data to identify potential security insights and patterns.
Collect a sample of certificates and do some statistical processing on their features:
Extract key statistical information:
Certificate issuer distribution
Certificate validity periods
Geographic distribution of certificates
Types of encryption used
...
Statistical Analysis:
Example, calculate:
Mean and median certificate validity duration
Most common certificate authorities
Distribution of key lengths
Proportion of short vs. long-lived certificates
Potential Insights:
Identify potential security risks
Detect unusual certificate patterns
Compare certificate practices across different domains/industries
... or anything you find interesting to study
Example visualizations:
Pie charts of certificate issuers
Bar graphs of key lengths
Timeline of certificate expirations
Core Functionality in VS:
Imports System.Security.Cryptography.X509Certificates
Primary Certificate Management Classes
- X509Certificate
- X509Certificate2
- X509Store
- X509Chain
RSA Roles:
1. Key Generation
2. Public/Private Key Pair
3. Digital Signature
4. Encryption Mechanism
5. Certificate Signing
RSA Versions/Key Lengths:
1. RSA-1024 (Deprecated)
2. RSA-2048 (Current Standard)
3. RSA-4096 (High Security)
Current SSL/TLS Versions:
Active Versions:
1. TLS 1.2 (Widely Used)
2. TLS 1.3 (Latest Recommended)
Deprecated:
- SSL 3.0 (Obsolete)
- TLS 1.0 (Insecure)
- TLS 1.1 (Deprecated)
TLS 1.3 Key Improvements:
tools:
SSL Labs (https://www.ssllabs.com/ssltest/)
SSL (Secure Sockets Layer)
↓
TLS (Transport Layer Security)
- SSL 3.0 → TLS 1.0
- Developed by IETF
- Successor to SSL
Certificate Components:
- Public Key
- Private Key (held by the certificate owner; not part of the certificate itself)
- Digital Signature
- Encryption Algorithm
RSA in TLS Handshake (RSA key exchange as used up to TLS 1.2; TLS 1.3 removed it):
Phases:
1. Key Exchange
2. Initial Authentication
3. Symmetric Key Establishment
Detailed Handshake Process:
Client Hello → Server Hello
↓ RSA Used for:
- Initial Key Exchange
- Certificate Authentication
- Asymmetric Encryption of Shared Secret
After Handshake:
- Switch to Symmetric Encryption (Faster)
- Uses Session Key
Handshake Mechanism:
1. Asymmetric Encryption (RSA)
- Slow but Secure
- Used for Initial Key Exchange
2. Symmetric Encryption (AES)
- Fast
- Used for Actual Data Transfer
Technical Workflow:
Client Steps:
1. Generate Random Premaster Secret
2. Encrypt with Server's Public RSA Key
3. Send Encrypted Premaster Secret
Server Steps:
1. Decrypt Premaster Secret using Private RSA Key
2. Derive Session Keys
3. Establish Symmetric Encryption
First Hints and some snippets:
Imports System.Net.Http
Imports System.Text.Json
Imports System.Linq
Public Class CertificateAnalyzer
Private Const API_ENDPOINT As String = "https://crt.sh/?q={0}&output=json"
' Main method to run the analysis
Public Shared Sub Main()
' List of domains to analyze
Dim domains As String() = {"google.com", "microsoft.com", "github.com"}
' Analyze certificates for each domain, waiting for each request to finish
' (otherwise the program may exit before any result is printed)
For Each domain In domains
AnalyzeDomainCertificates(domain).GetAwaiter().GetResult()
Next
End Sub
' Method to fetch and analyze certificates for a specific domain
' (returns a Task so that the caller can wait for completion)
Public Shared Async Function AnalyzeDomainCertificates(domain As String) As Task
Try
' Fetch certificate data
Dim certificates As List(Of Certificate) = Await FetchCertificatesAsync(domain)
' Perform statistical analysis
Dim analysis As CertificateAnalysis = PerformAnalysis(certificates)
' Display results
DisplayResults(domain, analysis)
Catch ex As Exception
Console.WriteLine($"Error analyzing {domain}: {ex.Message}")
End Try
End Function
' Fetch certificates from crt.sh API
Private Shared Async Function FetchCertificatesAsync(domain As String) As Task(Of List(Of Certificate))
Using client As New HttpClient()
Dim url As String = String.Format(API_ENDPOINT, domain)
Dim response As String = Await client.GetStringAsync(url)
' Parse JSON response
Dim options As New JsonSerializerOptions()
options.PropertyNameCaseInsensitive = True
Dim certData As List(Of Certificate) =
JsonSerializer.Deserialize(Of List(Of Certificate))(response, options)
Return certData
End Using
End Function
' Perform statistical analysis on certificates
Private Shared Function PerformAnalysis(certificates As List(Of Certificate)) As CertificateAnalysis
Dim analysis As New CertificateAnalysis()
' Calculate certificate issuer distribution
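' Trailing dots keep the method chain on one logical line;
' Cast(Of Object) matches the List(Of Object) property
' (reading .Issuer and .Count later relies on late binding, i.e. Option Strict Off)
analysis.IssuerDistribution =
certificates.GroupBy(Function(c) c.Issuer_name).
Select(Function(g) New With {
.Issuer = g.Key,
.Count = g.Count()
}).
OrderByDescending(Function(x) x.Count).
Cast(Of Object)().
ToList()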
' Calculate average validity period
analysis.AverageValidityDays =
certificates.Average(Function(c) (c.Not_after - c.Not_before).TotalDays)
' Count unique key lengths
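' Trailing dots keep the method chain on one logical line;
' Cast(Of Object) matches the List(Of Object) property
' (reading .KeyLength and .Count later relies on late binding, i.e. Option Strict Off)
analysis.KeyLengthDistribution =
certificates.GroupBy(Function(c) c.Pubkey_size).
Select(Function(g) New With {
.KeyLength = g.Key,
.Count = g.Count()
}).
OrderBy(Function(x) x.KeyLength).
Cast(Of Object)().
ToList()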
Return analysis
End Function
' Display analysis results
Private Shared Sub DisplayResults(domain As String, analysis As CertificateAnalysis)
Console.WriteLine($"Certificate Analysis for {domain}")
Console.WriteLine("----------------------------")
' Display issuer distribution
Console.WriteLine("Certificate Issuer Distribution:")
For Each issuer In analysis.IssuerDistribution
Console.WriteLine($"{issuer.Issuer}: {issuer.Count} certificates")
Next
' Display average validity
Console.WriteLine($"Average Certificate Validity: {analysis.AverageValidityDays:F2} days")
' Display key length distribution
Console.WriteLine("Key Length Distribution:")
For Each keyLength In analysis.KeyLengthDistribution
Console.WriteLine($"{keyLength.KeyLength} bits: {keyLength.Count} certificates")
Next
End Sub
' Certificate data model
Public Class Certificate
Public Property Issuer_name As String
Public Property Not_before As Date
Public Property Not_after As Date
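' Assumption: the JSON response exposes a key-size field with this name;
' check the actual output, otherwise this property stays 0
Public Property Pubkey_size As Integer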
End Class
' Analysis results container
Public Class CertificateAnalysis
Public Property IssuerDistribution As List(Of Object)
Public Property AverageValidityDays As Double
Public Property KeyLengthDistribution As List(Of Object)
End Class
End Class