Statistics, Cybersecurity [Year 2024 - 25]
Topics on Statistics with intensive computer applications
$ \int_0^t d S_u = \int_0^t \mu(S_u, u) du + \int_0^t\sigma(S_u, u) dW_u $
Course and online teaching support, by T. Gastaldi #Sapienzanonsiferma #Sapienzadoesnotstop
(Instructor: tommaso.gastaldi@gmail.com,
https://www.datatime.eu/public/cybersecurity/)
WhatsApp group for the students of this course
Invitation to join the WhatsApp group for this course: https://chat.whatsapp.com/Kk3wRGmmxWH9RNUo01zFdX
(When first joining, send a message with your name and id ("matricola"))
____________________________________________________________________________________
General notes for all homeworks
-Implement the exercises in your choice of C#, VB or JavaScript. For JavaScript, always use the latest
ECMAScript (classes, let, const, no var, etc.) and strict mode (WebStorm or Rider can be of
great help to stay up to date with the latest language updates and to check syntax). Put the JavaScript programs directly online as a web page.
-All important code must be shown and possibly discussed (the crucial parts only) in the homework web page, so that one can understand the main points.
(The full version can be stored on GitHub or as a zip file containing the "solution", if you like, but that is not required.)
-Never use any third-party library external to the language, or higher-level languages (e.g., SAS, R, Python, MATLAB, Minitab, etc.), because our purpose is to implement the very basics from scratch in order to understand our topics deeply. (Using other people's "black boxes" would defeat our learning purpose.)
-Always exercise your capacity for abstraction. Never write algorithms that work only on specific cases or data; on the contrary, try to be as general as possible in all of your creations and logic. Use smart personal implementations to show your intelligence and insight! Originality and deep thinking are the most appreciated values in this course.
-Always acknowledge your sources and use quotes when you copy and paste text from other sources (note that what you copy may be wrong!).
Homework 1
Theory (intro)
- Basic notions in Statistics: Population, Statistical Units, Distribution, Frequency (relative, absolute, percentage);
- Notion of arithmetic average. Derivation. Computational problems with floating point representation (errors, catastrophic cancellation) and numerical solution (Knuth);
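As a minimal sketch of the numerical point above, the incremental update commonly attributed to Knuth and Welford can be compared with the naive "sum, then divide" average. The function names `runningMean` and `naiveMean` are illustrative choices, and the overflow case is just one easy-to-reproduce failure of the naive formula.

```javascript
"use strict";

// Incremental mean: m_k = m_{k-1} + (x_k - m_{k-1}) / k.
// The running value stays on the scale of the data; no large partial sum is formed.
function runningMean(values) {
  let mean = 0;
  let count = 0;
  for (const x of values) {
    count += 1;
    mean += (x - mean) / count;
  }
  return count === 0 ? NaN : mean;
}

// Naive mean for comparison: accumulate the full sum, then divide.
function naiveMean(values) {
  let sum = 0;
  for (const x of values) sum += x;
  return values.length === 0 ? NaN : sum / values.length;
}

// Two values close to the largest representable double:
// the naive sum overflows to Infinity, the incremental mean does not.
const huge = [1.7e308, 1.7e308];
console.log(naiveMean(huge), runningMean(huge));
```

The same update is the first half of the recursion for the variance, so it can be reused when both moments are needed in one pass.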
Applications / practice
We have n servers and m attackers. Each attacker has probability p of penetrating each server. Make a graphical representation of each attacker's trajectory (the line stays flat if the attacker does not penetrate a server and jumps by 1 if he does); try different n, m, p.
At time n, we want the complete distribution of how many attackers reached each level. (Draw the distribution histogram vertically
at the end of the chart, so that each rectangle representing the attackers' frequency is placed on the corresponding number of penetrations (or "successes") they achieved.)
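The data behind such a chart could be produced along these lines. This is a sketch only: the names `simulateAttacks` and `finalDistribution` and the optional `rng` parameter (any function returning a uniform number in [0, 1)) are assumptions made for illustration, and the drawing code is left out.

```javascript
"use strict";

// m attackers against n servers; each attempt succeeds with probability p.
// Returns one trajectory per attacker: cumulative successes after 0, 1, ..., n servers.
function simulateAttacks(n, m, p, rng = Math.random) {
  const paths = [];
  for (let a = 0; a < m; a++) {
    const path = [0];
    for (let s = 1; s <= n; s++) {
      const success = rng() < p ? 1 : 0;   // flat segment (0) or jump of 1
      path.push(path[s - 1] + success);
    }
    paths.push(path);
  }
  return paths;
}

// Absolute frequency of attackers at each final level 0..n
// (the bars of the vertical histogram at the end of the chart).
function finalDistribution(paths, n) {
  const counts = new Array(n + 1).fill(0);
  for (const path of paths) counts[path[n]] += 1;
  return counts;
}

const paths = simulateAttacks(20, 50, 0.3);
console.log(finalDistribution(paths, 20));
```

Dividing each count by m gives the relative frequencies, and multiplying by 100 the percentages.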
Some resources:
https://en.wikipedia.org/wiki/Variable_and_attribute_(research)
https://www.investopedia.com/terms/s/statistics.asp
https://www.scribbr.com/methodology/sampling-methods/#:~:text=Probability%20sampling%20methods%20include%20simple,a%20chance%20of%20being%20included.
https://en.wikipedia.org/wiki/Design_of_experiments
https://www.surveymonkey.com/mp/open-ended-questions-get-more-context-to-enrich-your-data/#:~:text=open%2Dended%20questions%3F-,So%20what%20are%20open%2Dended%20questions%3F,or%20other%20closed%2Dended%20format.
https://en.wikipedia.org/wiki/Level_of_measurement
https://www.youtube.com/watch?v=uHRqkGXX55I&ab_channel=SimpleLearningPro
https://www.youtube.com/watch?v=EZrP_av3cmA&ab_channel=SimpleLearningPro
https://www.youtube.com/watch?v=pTuj57uXWlk&ab_channel=SimpleLearningPro
https://www.youtube.com/watch?v=10ikXret7Lk&ab_channel=SimpleLearningPro
Homework 2
Theory
Find the simplest and most elegant way to show the Welford recursion.
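For reference, one common way to state the recursion is sketched below in LaTeX. The symbols \bar{x}_k (mean of the first k values) and S_k (sum of squared deviations of the first k values) are notation chosen for this note, and the derivation is only outlined.

```latex
\begin{aligned}
k\,\bar{x}_k &= (k-1)\,\bar{x}_{k-1} + x_k
  \;\Longrightarrow\;
  \bar{x}_k = \bar{x}_{k-1} + \frac{x_k - \bar{x}_{k-1}}{k}, \\[4pt]
S_k &= \sum_{i=1}^{k} (x_i - \bar{x}_k)^2
     = S_{k-1} + (x_k - \bar{x}_{k-1})\,(x_k - \bar{x}_k), \\[4pt]
\sigma_k^2 &= \frac{S_k}{k}, \qquad \bar{x}_0 = 0,\; S_0 = 0 .
\end{aligned}
```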
Application / practice
Refine your Euler–Maruyama simulator, which approximates numerical solutions of stochastic differential equations (SDEs), by adding the following variants to the existing framework:
A. Jumps -1 +1 with prob. p [random walk]
B. Absolute and relative frequency trajectories
C. Final distribution and intermediate distributions (at one internal time/step selectable from the gui),
with mean and variance (make it all parametric so that one unique interface will handle it all).
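Variants A to C could share one parametric routine, as in the sketch below. The names `trajectory` and `momentsAt`, the `mode` strings, and the `rng` parameter are illustrative assumptions, not part of any existing simulator.

```javascript
"use strict";

// One trajectory of n steps.
// mode "bernoulli": increments 1 (prob. p) or 0; mode "walk": increments +1 (prob. p) or -1.
function trajectory(n, p, mode, rng = Math.random) {
  const abs = [0];   // absolute (cumulative) value after each step
  const rel = [0];   // cumulative value divided by the number of steps so far
  for (let t = 1; t <= n; t++) {
    const hit = rng() < p;
    const step = mode === "walk" ? (hit ? 1 : -1) : (hit ? 1 : 0);
    abs.push(abs[t - 1] + step);
    rel.push(abs[t] / t);
  }
  return { abs, rel };
}

// Mean and (population) variance across trajectories at a chosen step t,
// usable both for an intermediate step and for the final one.
function momentsAt(series, t) {
  const k = series.length;
  const mean = series.reduce((acc, s) => acc + s[t], 0) / k;
  const variance = series.reduce((acc, s) => acc + (s[t] - mean) ** 2, 0) / k;
  return { mean, variance };
}

const runs = Array.from({ length: 500 }, () => trajectory(100, 0.5, "walk"));
console.log(momentsAt(runs.map(r => r.abs), 50), momentsAt(runs.map(r => r.rel), 100));
```

Calling `momentsAt` at several values of t, for both `abs` and `rel` and for both modes, gives the raw numbers needed for the four cases.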
Research
Make your personal notes about the behavior of mean and variance with respect to time. For instance:
What did you observe in all 4 cases (relative/absolute frequency & Bernoulli/random walk)?
What are the main differences between the distribution of the absolute number of successes
and that of the relative frequencies?
Some resources:
https://en.wikipedia.org/wiki/Euler%E2%80%93Maruyama_method
https://en.wikipedia.org/wiki/Random_walk
https://en.wikipedia.org/wiki/Multinomial_distribution
Homework 3
Theory/Research
Illustrate formally, in the simplest possible way, why the median is the value c that minimizes the sum of |x(i) - c| (the sum of absolute deviations).
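Before attempting the formal argument, the claim can be checked numerically. The sketch below is a brute-force check, not a proof; the function names and the grid search are illustrative assumptions.

```javascript
"use strict";

// Sum of absolute deviations of the data from a candidate center c.
function sumAbsDev(xs, c) {
  return xs.reduce((acc, x) => acc + Math.abs(x - c), 0);
}

// Median of a numeric array (midpoint of the two central values for even sizes).
function median(xs) {
  const s = [...xs].sort((a, b) => a - b);
  const mid = Math.floor(s.length / 2);
  return s.length % 2 === 1 ? s[mid] : (s[mid - 1] + s[mid]) / 2;
}

// Brute-force minimizer of sumAbsDev over the grid lo, lo + step, ..., hi.
function argminOnGrid(xs, lo, hi, step) {
  const steps = Math.round((hi - lo) / step);
  let best = lo;
  let bestValue = sumAbsDev(xs, lo);
  for (let i = 1; i <= steps; i++) {
    const c = lo + i * step;
    const value = sumAbsDev(xs, c);
    if (value < bestValue) {
      bestValue = value;
      best = c;
    }
  }
  return best;
}

const data = [1, 2, 3, 4, 100];
console.log(median(data), argminOnGrid(data, 0, 100, 0.5));
```

With an even number of observations every point between the two central values attains the minimum, so the grid search may return any of them.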
Find all the conceptually different ways to define a "location" statistic (sometimes also called "center" or "central tendency") or synthesis of a distribution. Show how the generalization of these ideas can potentially lead to infinitely many other definitions.
Application / practice
Refine your SDE simulator to simulate a continuous time process where we can have an attack (indicated with a jump of +1) at any
time with a constant rate of attack.
To create the approximation of time continuity subdivide your reference temporal window into numerous intervals
of vanishing size dt = 1/n and to each infinitesimal interval assign a probability of a +1 "jump" (attack success) equal
to Lambda * dt, where Lambda is a simulation parameter, having the meaning of expected total number of attacks in the reference
period.
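The discretization described above can be sketched as follows, assuming a reference window of length 1. The name `simulateJumpProcess` and the `rng` parameter are illustrative; the approximation only makes sense when Lambda * dt is well below 1.

```javascript
"use strict";

// Counting process on [0, 1] split into n sub-intervals of size dt = 1/n.
// In each sub-interval a +1 jump (successful attack) occurs with probability lambda * dt.
function simulateJumpProcess(lambda, n, rng = Math.random) {
  const dt = 1 / n;
  const jumpProbability = lambda * dt;
  const path = [0];
  for (let i = 1; i <= n; i++) {
    const jump = rng() < jumpProbability ? 1 : 0;
    path.push(path[i - 1] + jump);
  }
  return path;
}

// Average number of jumps at the end of the window over many runs;
// it should settle near lambda as the number of runs grows.
function averageFinalCount(lambda, n, runs, rng = Math.random) {
  let total = 0;
  for (let r = 0; r < runs; r++) total += simulateJumpProcess(lambda, n, rng)[n];
  return total / runs;
}

console.log(averageFinalCount(5, 1000, 2000));
```

Plotting several paths on the same axes, together with the histogram of the final counts, shows the staircase trajectories and their distribution at the end of the window.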
Some resources:
https://en.wikipedia.org/wiki/Stochastic_simulation
https://www.probabilitycourse.com/chapter11/11_1_2_basic_concepts_of_the_poisson_process.php
https://www.probabilitycourse.com/chapter7/7_1_1_law_of_large_numbers.php
Homework 4
Theory/Research
Illustrate the concept of statistical independence, showing also the analogies with the formal definitions in probability theory.
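On a two-way frequency table, statistical independence means that every joint relative frequency equals the product of the corresponding marginal relative frequencies, mirroring P(A and B) = P(A) P(B). A minimal check under that definition is sketched below; `independenceGap` is a hypothetical helper name.

```javascript
"use strict";

// table[i][j] = absolute joint frequency of (i-th value of X, j-th value of Y).
// Returns the largest absolute difference between a joint relative frequency
// and the product of its two marginal relative frequencies.
function independenceGap(table) {
  const total = table.flat().reduce((a, b) => a + b, 0);
  const rowSums = table.map(row => row.reduce((a, b) => a + b, 0));
  const colSums = table[0].map((_, j) => table.reduce((a, row) => a + row[j], 0));
  let maxGap = 0;
  for (let i = 0; i < table.length; i++) {
    for (let j = 0; j < table[i].length; j++) {
      const joint = table[i][j] / total;
      const product = (rowSums[i] / total) * (colSums[j] / total);
      maxGap = Math.max(maxGap, Math.abs(joint - product));
    }
  }
  return maxGap;   // 0 (up to rounding) means statistical independence
}

console.log(independenceGap([[10, 20], [20, 40]]));   // proportional rows
console.log(independenceGap([[10, 0], [0, 10]]));     // perfectly associated
```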
Application / practice
Refine your SDE simulator to generate a continuous-time process representing the scaling limit of the random walk.
To create the approximation of time continuity, subdivide your reference temporal window into vanishing intervals
dt and, on each infinitesimal interval, make a jump of +sqrt(dt) or -sqrt(dt) with probability p or 1 - p (with p = 1/2 in the symmetric case).
Note the significance of the simulation (Donsker invariance principle/ theorem or the functional central limit theorem)
in relation to the Wiener process.
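A sketch of the rescaled walk, assuming the symmetric case p = 1/2 by default and a window [0, T]. The names `scaledRandomWalk` and `quadraticVariation` and the `rng` parameter are illustrative.

```javascript
"use strict";

// Random walk with n steps on [0, T]: each step is +sqrt(dt) with probability p,
// otherwise -sqrt(dt), where dt = T / n.
function scaledRandomWalk(n, T = 1, p = 0.5, rng = Math.random) {
  const dt = T / n;
  const jump = Math.sqrt(dt);
  const path = [0];
  for (let i = 1; i <= n; i++) {
    path.push(path[i - 1] + (rng() < p ? jump : -jump));
  }
  return path;
}

// Sum of squared increments along a path. For this walk every squared increment
// equals dt, so the total is n * dt = T on every path (up to rounding).
function quadraticVariation(path) {
  let total = 0;
  for (let i = 1; i < path.length; i++) total += (path[i] - path[i - 1]) ** 2;
  return total;
}

const walk = scaledRandomWalk(10000);
console.log(walk[walk.length - 1], quadraticVariation(walk));
```

The sqrt(dt) scaling is what keeps the variance of the final value equal to T for p = 1/2, whatever n is, which is the behavior expected of a Wiener process at time T.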
Some resources:
https://en.wikipedia.org/wiki/Donsker%27s_theorem
https://www.youtube.com/watch?v=sJPlOMrcJXo&ab_channel=ResearchMethodsandStatistics%28FMG%2CUvA%29
Homework 5
- Prove in the simplest possible way the C-S (Cauchy-Schwarz) inequality
(r coefficient normalizing denominator)
- Reflect on the concepts of independence and uncorrelation, pointing
out conceptual differences and possible measures.
- E-M Simulator Enhancement:
Enhance your existing Euler-Maruyama (E-M) simulator by developing a unified simulation framework. Create a general central class that can possibly manage various types of stochastic differential
equations (SDEs).
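One possible shape for such a central class is sketched below: the scheme is written once, and each SDE is described only by its drift mu(x, t) and diffusion sigma(x, t). The class name, the option names, the example coefficients, and the use of a Box-Muller transform for the Gaussian increments are all assumptions of this sketch.

```javascript
"use strict";

// Standard normal draw via the Box-Muller transform.
function standardNormal(rng = Math.random) {
  const u1 = 1 - rng();   // in (0, 1], avoids log(0)
  const u2 = rng();
  return Math.sqrt(-2 * Math.log(u1)) * Math.cos(2 * Math.PI * u2);
}

// Generic Euler-Maruyama integrator for dX = mu(X, t) dt + sigma(X, t) dW.
class EulerMaruyama {
  constructor({ mu, sigma, x0 = 0, T = 1, steps = 1000, rng = Math.random }) {
    Object.assign(this, { mu, sigma, x0, T, steps, rng });
  }

  // Returns the path X_0, X_dt, ..., X_T as an array of steps + 1 values.
  simulatePath() {
    const dt = this.T / this.steps;
    const path = [this.x0];
    let x = this.x0;
    for (let i = 0; i < this.steps; i++) {
      const t = i * dt;
      const dW = Math.sqrt(dt) * standardNormal(this.rng);
      x += this.mu(x, t) * dt + this.sigma(x, t) * dW;
      path.push(x);
    }
    return path;
  }
}

// Example models, each expressed only through its coefficients (values are arbitrary).
const models = {
  arithmeticBrownian: { mu: () => 0.1, sigma: () => 0.3 },
  geometricBrownian: { mu: x => 0.1 * x, sigma: x => 0.3 * x },
  ornsteinUhlenbeck: { mu: x => 2 * (1 - x), sigma: () => 0.3 },
};

const gbm = new EulerMaruyama({ ...models.geometricBrownian, x0: 1 });
console.log(gbm.simulatePath().length);
```

Jump processes such as those of the earlier homeworks would need an extra increment term (or a pluggable "noise" function) in `simulatePath`; that extension is left open here.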
Optional: Regression Coefficients:
Derive the coefficients (b) and (a) of the two regression lines using the least squares method, and show the relationships with R^2.
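The closed-form least squares results can be checked numerically with a sketch like the following. The function and field names are illustrative; the formulas used are b = S_xy / S_xx and a = mean(y) - b mean(x) for the line of y on x, and symmetrically for x on y.

```javascript
"use strict";

// Least squares coefficients of both regression lines and the correlation coefficient r.
function regressionLines(xs, ys) {
  const n = xs.length;
  const meanX = xs.reduce((a, b) => a + b, 0) / n;
  const meanY = ys.reduce((a, b) => a + b, 0) / n;
  let sxx = 0;
  let syy = 0;
  let sxy = 0;
  for (let i = 0; i < n; i++) {
    sxx += (xs[i] - meanX) ** 2;
    syy += (ys[i] - meanY) ** 2;
    sxy += (xs[i] - meanX) * (ys[i] - meanY);
  }
  const bYonX = sxy / sxx;                 // slope of y = a + b x
  const aYonX = meanY - bYonX * meanX;
  const bXonY = sxy / syy;                 // slope of x = a' + b' y
  const aXonY = meanX - bXonY * meanY;
  const r = sxy / Math.sqrt(sxx * syy);    // Cauchy-Schwarz guarantees |r| <= 1
  return { bYonX, aYonX, bXonY, aXonY, r, r2: r * r };
}

const fit = regressionLines([1, 2, 3, 4, 5], [2.1, 3.9, 6.2, 7.8, 10.1]);
console.log(fit.bYonX * fit.bXonY, fit.r2);   // the two numbers agree up to rounding
```

The product of the two slopes equals R^2, since (S_xy / S_xx)(S_xy / S_yy) = S_xy^2 / (S_xx S_yy).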
Homework 6
Theory/Research
Research: Recall the fundamental theorem of calculus and demonstrate its relationship with density
functions and cumulative distribution functions (CDFs).
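As a compact reminder, the relationship can be written as follows for a continuous random variable X with density f and CDF F (the second identity holds at points where f is continuous):

```latex
\begin{aligned}
F(x) &= P(X \le x) = \int_{-\infty}^{x} f(t)\,dt, \\[4pt]
F'(x) &= f(x), \\[4pt]
P(a < X \le b) &= F(b) - F(a) = \int_{a}^{b} f(t)\,dt .
\end{aligned}
```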
Application / practice
Exercise: Generate realizations from a discrete univariate probability distribution with arbitrary probabilities.
Graphically show the convergence of the empirical distribution to the theoretical distribution as the sample size increases.
Compute also, during the generation, the mean and variance using recursive methods (e.g., Knuth's/Welford's algorithms)
and compare these results with the theoretical mean and variance, discussing the relationship.
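A possible core for this exercise is sketched below, assuming inverse-CDF sampling for the generation and Welford's recursion for the running moments. The names `sampleDiscrete` and `RunningStats` are illustrative, and the plotting of the empirical distribution is left out.

```javascript
"use strict";

// Draw an index from a discrete distribution by scanning the cumulative sums
// until they exceed a uniform number u in [0, 1).
function sampleDiscrete(probabilities, rng = Math.random) {
  const u = rng();
  let cumulative = 0;
  for (let i = 0; i < probabilities.length; i++) {
    cumulative += probabilities[i];
    if (u < cumulative) return i;
  }
  return probabilities.length - 1;   // guard against rounding in the last cumulative sum
}

// Online mean and population variance (Welford's recursion).
class RunningStats {
  constructor() {
    this.count = 0;
    this.mean = 0;
    this.m2 = 0;   // running sum of squared deviations from the current mean
  }

  push(x) {
    this.count += 1;
    const delta = x - this.mean;
    this.mean += delta / this.count;
    this.m2 += delta * (x - this.mean);
  }

  get variance() {
    return this.count > 0 ? this.m2 / this.count : NaN;
  }
}

// Example: values and probabilities are arbitrary illustration data.
const values = [0, 1, 2, 5];
const probabilities = [0.1, 0.4, 0.3, 0.2];
const stats = new RunningStats();
const counts = new Array(values.length).fill(0);
for (let k = 0; k < 10000; k++) {
  const i = sampleDiscrete(probabilities);
  counts[i] += 1;
  stats.push(values[i]);
}
console.log(counts.map(c => c / 10000), stats.mean, stats.variance);
```

Recording `counts` and `stats` at increasing sample sizes gives the series needed to show the convergence toward the theoretical probabilities, mean, and variance.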
Homework 7
Theory/Research
Application / practice
Using the setup of previous homework, from a discrete distribution generate m (e.g. m=1000 ...) samples
of size n (e.g., n = 20, 30, 100, ...). Compute the distribution of the sampling average.
Determine the average of the distribution of the averages of the samples, and the variance,
discussing the observed relationship with the mean and variance of the parent (theoretical) distribution.
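The experiment can be organized as in the sketch below, where a fair die is used as an arbitrary parent distribution. All function names are illustrative assumptions.

```javascript
"use strict";

// Inverse-CDF draw of one value from a discrete distribution.
function drawValue(values, probabilities, rng) {
  const u = rng();
  let cumulative = 0;
  for (let i = 0; i < values.length; i++) {
    cumulative += probabilities[i];
    if (u < cumulative) return values[i];
  }
  return values[values.length - 1];   // guard against rounding of the cumulative sum
}

// Averages of m independent samples, each of size n.
function sampleMeans(values, probabilities, m, n, rng = Math.random) {
  const means = [];
  for (let j = 0; j < m; j++) {
    let sum = 0;
    for (let i = 0; i < n; i++) sum += drawValue(values, probabilities, rng);
    means.push(sum / n);
  }
  return means;
}

// Mean and population variance of an array of numbers.
function meanAndVariance(xs) {
  const mean = xs.reduce((a, b) => a + b, 0) / xs.length;
  const variance = xs.reduce((a, b) => a + (b - mean) ** 2, 0) / xs.length;
  return { mean, variance };
}

// Mean and variance of the parent (theoretical) distribution.
function theoreticalMoments(values, probabilities) {
  const mean = values.reduce((acc, v, i) => acc + v * probabilities[i], 0);
  const variance = values.reduce((acc, v, i) => acc + (v - mean) ** 2 * probabilities[i], 0);
  return { mean, variance };
}

const die = [1, 2, 3, 4, 5, 6];
const fair = die.map(() => 1 / 6);
const n = 25;
console.log(meanAndVariance(sampleMeans(die, fair, 2000, n)), theoreticalMoments(die, fair));
```

Comparing the variance of the averages with the parent variance divided by n, for several values of n, is one direct way to make the relationship visible.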
Optional exercise
Given the random variable Y = g^U mod n (meaning the remainder of the division by n)
where U is a Uniform in [1, max_U] (max_U is a user param)
A) Generate the distributions of Y for n = 19 and g = 2, 3, 10, 17
B) Generate the distributions of Y for n = 15 and g = 3, 6, 9, 12
Observe the shape of the distributions and compute the entropy or other diversity indexes. Give your opinion on
the implications of any observed differences in terms of cryptographic properties (uniformity, predictability)
and potential applications. Why might case A be better suited for cryptographic applications? Why might case B
(predictability, lower entropy?) illustrate possible vulnerabilities, if any?
Why do we choose the set { 2, 3, 10, 17 } in case A? Spot possible errors in the exercise.
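A small sketch of the computation, assuming U takes each integer value 1, ..., max_U with equal probability, so that the exact distribution can be obtained by visiting every exponent once instead of sampling. The function names are illustrative.

```javascript
"use strict";

// Exact absolute frequencies of Y = g^U mod n for U uniform on {1, ..., maxU}.
// The power is updated by one multiplication per step, which keeps the numbers small.
function powerResidueCounts(g, n, maxU) {
  const counts = new Map();
  let y = 1;
  for (let u = 1; u <= maxU; u++) {
    y = (y * g) % n;
    counts.set(y, (counts.get(y) || 0) + 1);
  }
  return counts;
}

// Shannon entropy in bits of a frequency table.
function shannonEntropyBits(counts) {
  const total = [...counts.values()].reduce((a, b) => a + b, 0);
  let h = 0;
  for (const c of counts.values()) {
    const q = c / total;
    if (q > 0) h -= q * Math.log2(q);
  }
  return h;
}

for (const [g, n] of [[2, 19], [3, 15]]) {
  const counts = powerResidueCounts(g, n, 360);
  console.log(g, n, counts.size, shannonEntropyBits(counts));
}
```

The number of distinct residues (`counts.size`) and the entropy can then be compared across all the values of g listed in cases A and B.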
Homework 8
Theory/Research
Recall the notion of Shannon entropy and other diversity measures of distributions.
Recall the notion of primitive root (a primitive root modulo p, with p a prime number, is a number g such that for every
integer a coprime to p there exists an integer k such that g^k ≡ a (mod p)).
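The definition can be tested directly for small primes by computing the multiplicative order of g: for a prime p, g is a primitive root exactly when its order is p - 1. The sketch below uses plain iteration, which is adequate only for small p; the function names are illustrative.

```javascript
"use strict";

// Multiplicative order of g modulo p: the smallest k >= 1 with g^k mod p = 1.
// Assumes gcd(g, p) = 1; returns 0 if no such k is found below p.
function multiplicativeOrder(g, p) {
  let y = 1;
  for (let k = 1; k < p; k++) {
    y = (y * (g % p)) % p;
    if (y === 1) return k;
  }
  return 0;
}

// For a prime p, g is a primitive root exactly when its order is p - 1.
function isPrimitiveRoot(g, p) {
  return multiplicativeOrder(g, p) === p - 1;
}

// All primitive roots of a (small) prime p.
function primitiveRoots(p) {
  const roots = [];
  for (let g = 1; g < p; g++) {
    if (isPrimitiveRoot(g, p)) roots.push(g);
  }
  return roots;
}

console.log(primitiveRoots(7));
```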
Application / practice
Part 1
Find and compile a sufficiently large piece of text by selecting several web pages and create a letter frequency distribution.
Choose a random shift value (e.g., 1-25, with wrap-around) and apply the Caesar cipher to encrypt the original text.
Use frequency analysis or find any efficient and effective strategy to attempt to decrypt the message you obtained in the previous step.
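One possible strategy for the decryption step is to try all 26 shifts and keep the one whose letter counts are closest to a reference English distribution, measured with a chi-squared statistic. The sketch below makes several assumptions: the reference table is an approximate one (the value 2.02 used for G comes from commonly quoted tables), only the letters A-Z are shifted, and all names are illustrative.

```javascript
"use strict";

// Approximate English letter frequencies in percent, for A..Z.
const ENGLISH = [8.17, 1.49, 2.78, 4.25, 12.70, 2.23, 2.02, 6.09, 7.00, 0.15, 0.77, 4.03, 2.41,
  6.75, 7.51, 1.93, 0.10, 5.99, 6.33, 9.06, 2.76, 0.98, 2.36, 0.15, 1.97, 0.07];

// Absolute letter counts (case-insensitive); non-letters are ignored.
function letterCounts(text) {
  const counts = new Array(26).fill(0);
  for (const ch of text.toUpperCase()) {
    const i = ch.charCodeAt(0) - 65;
    if (i >= 0 && i < 26) counts[i] += 1;
  }
  return counts;
}

// Caesar shift of the letters A-Z and a-z; negative shifts decrypt.
function shiftLetters(text, shift) {
  const s = ((shift % 26) + 26) % 26;
  return text.replace(/[A-Za-z]/g, ch => {
    const base = ch <= "Z" ? 65 : 97;
    return String.fromCharCode((ch.charCodeAt(0) - base + s) % 26 + base);
  });
}

// Shift whose implied plaintext counts minimize the chi-squared distance from ENGLISH.
function guessShift(ciphertext) {
  const counts = letterCounts(ciphertext);
  const total = counts.reduce((a, b) => a + b, 0);
  let bestShift = 0;
  let bestScore = Infinity;
  for (let s = 0; s < 26; s++) {
    let chi2 = 0;
    for (let j = 0; j < 26; j++) {
      const expected = total * ENGLISH[j] / 100;
      const observed = counts[(j + s) % 26];   // plaintext letter j shows up as letter j + s
      chi2 += (observed - expected) ** 2 / expected;
    }
    if (chi2 < bestScore) {
      bestScore = chi2;
      bestShift = s;
    }
  }
  return bestShift;
}

const secret = shiftLetters("Frequency analysis works because letters are not equally likely.", 11);
console.log(secret, guessShift(secret));
```

The method needs a reasonably long text to be reliable; on very short messages the counts are too noisy, and ranking the few best shifts for manual inspection may work better.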
Part 2 Optional (Modular exponentiation)
Convert each letter of the original text to a numeric representation (A = 0, B = 1, ..., Z = 25).
Then choose a base N and a prime modulus P > 26, and calculate the encoded value N^k mod P for each letter k
(e.g., N = 2, P = 37; note that 2 is a primitive root modulo 37, whereas 10 is not, since 10^3 = 1000 = 27 * 37 + 1).
Compare the simplicity of frequency analysis in the Caesar cipher to the infeasibility of breaking RSA without
access to critical information.
Visualize the distributions and calculate the Shannon entropy of the transformed distributions.
Summarize the findings from both parts of the exercise. Discuss how statistical analysis enhances understanding
of cryptographic algorithms and the importance of these skills in cybersecurity.
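For Part 2, the encoding step could be sketched as follows, using square-and-multiply modular exponentiation on plain Numbers (adequate for moduli below 2^26, where intermediate products stay exactly representable). The helper names are illustrative, and `distinctCodes` is added only to show how the choice of N affects the number of different codes.

```javascript
"use strict";

// Square-and-multiply modular exponentiation: base^exponent mod modulus.
function modPow(base, exponent, modulus) {
  let result = 1 % modulus;
  let b = base % modulus;
  let e = exponent;
  while (e > 0) {
    if (e % 2 === 1) result = (result * b) % modulus;
    b = (b * b) % modulus;
    e = Math.floor(e / 2);
  }
  return result;
}

// Map A..Z to k = 0..25 and encode each letter as N^k mod P; other characters are skipped.
function encodeLetters(text, N, P) {
  const codes = [];
  for (const ch of text.toUpperCase()) {
    const k = ch.charCodeAt(0) - 65;
    if (k >= 0 && k < 26) codes.push(modPow(N, k, P));
  }
  return codes;
}

// Number of different codes produced by the 26 letters for a given N and P.
function distinctCodes(N, P) {
  const seen = new Set();
  for (let k = 0; k < 26; k++) seen.add(modPow(N, k, P));
  return seen.size;
}

console.log(encodeLetters("ABC", 2, 37), distinctCodes(2, 37), distinctCodes(10, 37));
```

When `distinctCodes` returns 26 the encoding is a one-to-one relabeling of the alphabet, so the entropy of the code distribution equals that of the letters and frequency analysis still applies; a smaller value means several letters collapse onto the same code.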
Hints and resources:
' Caesar shift of letters (A-Z, a-z) and digits (0-9); negative shifts are allowed.
Function CaesarShift(input As String, shift As Integer) As String
    Dim result As New System.Text.StringBuilder()
    For Each ch As Char In input
        If Char.IsUpper(ch) Then
            ' Handle uppercase letters
            Dim offset As Integer = Asc("A")
            Dim shiftedChar As Char = Chr(((Asc(ch) - offset + shift) Mod 26 + 26) Mod 26 + offset)
            result.Append(shiftedChar)
        ElseIf Char.IsLower(ch) Then
            ' Handle lowercase letters
            Dim offset As Integer = Asc("a")
            Dim shiftedChar As Char = Chr(((Asc(ch) - offset + shift) Mod 26 + 26) Mod 26 + offset)
            result.Append(shiftedChar)
        ElseIf Char.IsDigit(ch) Then
            ' Handle digits (0-9)
            Dim offset As Integer = Asc("0")
            Dim shiftedChar As Char = Chr(((Asc(ch) - offset + shift) Mod 10 + 10) Mod 10 + offset)
            result.Append(shiftedChar)
        Else
            ' All other characters are left unchanged
            result.Append(ch)
        End If
    Next
    Return result.ToString()
End Function
//-----------------------
// Caesar shift of letters (A-Z, a-z) and digits (0-9); negative shifts are allowed.
function caesarShift(input, shift) {
  let result = '';
  for (const ch of input) {
    if (ch >= 'A' && ch <= 'Z') {
      // Handle uppercase letters
      const offset = 'A'.charCodeAt(0);
      result += String.fromCharCode(((ch.charCodeAt(0) - offset + shift) % 26 + 26) % 26 + offset);
    } else if (ch >= 'a' && ch <= 'z') {
      // Handle lowercase letters
      const offset = 'a'.charCodeAt(0);
      result += String.fromCharCode(((ch.charCodeAt(0) - offset + shift) % 26 + 26) % 26 + offset);
    } else if (ch >= '0' && ch <= '9') {
      // Handle digits (0-9)
      const offset = '0'.charCodeAt(0);
      result += String.fromCharCode(((ch.charCodeAt(0) - offset + shift) % 10 + 10) % 10 + offset);
    } else {
      // All other characters are left unchanged
      result += ch;
    }
  }
  return result;
}
// Example usage
console.log(caesarShift("Hello, World! 123", 3)); // Output: "Khoor, Zruog! 456"
### English (EN) Letter Frequency Distribution
1. **K. M. O’Hara's Studies**: This study provides a comprehensive analysis of letter frequencies in English text. Here's a rough frequency distribution based on various sources:
- E: 12.70%
- T: 9.06%
- A: 8.17%
- O: 7.51%
- I: 7.00%
- N: 6.75%
- S: 6.33%
- H: 6.09%
- R: 5.99%
- D: 4.25%
- L: 4.03%
- C: 2.78%
- U: 2.76%
- M: 2.41%
- W: 2.36%
- F: 2.23%
- G: 2.02%
- Y: 1.97%
- P: 1.93%
- B: 1.49%
- V: 0.98%
- K: 0.77%
- J: 0.15%
- X: 0.15%
- Q: 0.10%
- Z: 0.07%
2. **Wikipedia**: The page on [Letter Frequencies](https://en.wikipedia.org/wiki/Letter_frequency) provides a good overview and includes references to original studies.
3. **Cryptography and Information Security**: There are many cryptography textbooks that discuss letter frequency, including authors like Bruce Schneier and William Stallings.
### Italian (ITA) Letter Frequency Distribution
1. **Italian Language Studies**: Here’s a common frequency distribution for the Italian language:
- E: 11.79%
- A: 10.49%
- I: 9.96%
- O: 8.76%
- T: 6.87%
- N: 6.73%
- R: 6.52%
- S: 5.38%
- L: 5.26%
- U: 3.33%
- D: 3.41%
- C: 3.29%
- M: 2.51%
- P: 2.49%
- H: 0.77%
- B: 0.81%
- F: 0.84%
- X: 0.10%
- J: 0.12%
- K: 0.03%
- Q: 0.52%
- Z: 0.39%
- W: 0.00%
Some resources:
https://en.wikipedia.org/wiki/Entropy_(information_theory)
https://en.wikipedia.org/wiki/Letter_frequency
https://en.wikipedia.org/wiki/Kullback%E2%80%93Leibler_divergence
https://en.wikipedia.org/wiki/Majorization
https://en.wikipedia.org/wiki/Modular_exponentiation