Application – Pagina 2 – Statistics Course

Homework_21A

Consider the general scheme we have used so far to simulate some stochastic processes (such as the relative frequency of success in a sequence of trials, the sample mean, the random walk, the Poisson point process) and now add this new process to our simulator.
Same scheme as previous simulations programs, except changing the way to compute the values of the paths at each time. Starting from value 0 at time 0, for each of m paths, at each new time compute P(t) = P(t-1) + Random step(t), for t = 1, …, n, where Random step(t) is now: σ * sqrt(1/n) * Z(t), where Z(t) is a N(0,1) random variable (the deviation σ is a user parameter, to scale the process dispersion).
At time n (last time) and one (or more) other chosen inner time 1<j<n (j is a program parameter) create and represent with histogram the distribution of P(t). Observe the behavior of the process for large n.

Code VB.NET

https://drive.google.com/file/d/18VrAuvDyjSh2ntiNrCYFsxfhpS_3jtxi/view?usp=sharing

Homework_12RA

Find out what you have just generated in exercise 18_A. How can you interpret what you see? Can you find out about all the well known distributions that “naturally (and “magically”) arise” in this process ?

Poisson Distribution

A Poisson Process is a model for a series of discrete event where the average time between events is known, but the exact timing of events is random.

A Poisson Process meets the following criteria

Events are independent of each other. The occurrence of one event does not affect the probability another event will occur.
The average rate (events per time period) is constant.
Two events cannot occur at the same time.

A discrete random variable X is said to have a Poisson distribution with parameter λ > 0 if for k = 0, 1, 2, …, the probability mass function of X is given by:

k is the number of occurrences
λ = E(x) = Var(x)

Homework_19A

Consider the general scheme we have used so far to simulate some stochastic processes (such as the relative frequency of success in a sequence of trials, the sample mean and the random walk) and now add this new process to our simulator.
Same scheme as previous program (random walk), except changing the way to compute the values of the paths at each time. Starting from value 0 at time 0, for each of m paths, at each new time compute P(i) = P(i-1) + Random step(i), for i = 1, …, n, where Random step(i) is now a Bernoulli random variable with success probability equal to λ * (1/n) (where λ is a user parameter, eg. 50, 100, …).
At time n (last time) and one (or more) other chosen inner time 1<j<n (j is a program parameter) create and represent with histogram the distribution of P(i).

Represent also the distributions of the following quantities (and any other quantity that you think of interest):
– Distance (time elapsed) of individual jumps from the origin
– Distance (time elapsed) between consecutive jumps

19_A Add to your statistical application, on each variable histogram, and across the scatterplot, 3 lines indicating the 3 quartiles (use online algos for computations).

Update: Code VB.Net v2.0

https://drive.google.com/file/d/1j3oEg5IOTGTaJl7pHzdeTjOuToOulNcl/view?usp=sharing

Code VB.Net

I must fix hinstogram and add distance, so update coming soon.

https://drive.google.com/file/d/1p_8JK3LvxbFk4SnOp73wPlG0z-_Ba2Qp/view?usp=sharing

Homework15_A

Simple illustration of the Glivenko-Cantelli theorem ( http://home.uchicago.edu/~amshaikh/webfiles/glivenko-cantelli_topics.pdf ).
Consider random variables from a Uniform distribution (not necessarily in the same range), and create both the histogram and the empirical CDF of the sample mean. Show with an animation what happens when the number of observations increases. What do we see here?

Update: Code VB.Net v2.0

https://drive.google.com/file/d/1-6fRYwmREvvVgiv-hI7khTEmcMpuZjhg/view?usp=sharing

Changed way to build the path, follow the definition:

1/n * sum (indicator function) with x in R.

VB.NET Code

https://drive.google.com/file/d/1xbpG-stSMS-8T-M6QnoHNYb6cRhZxHw5/view?usp=sharing

We can see that as the number of observations increases, the empirical cdf closest to the real cdf and the histogram take a bell shape. The empirical CDF, infact, usually approximates the CDF quite well, especially for large samples.

I have to thank Luca Scarmozzino who helped me with this homework!
https://stats4cyber.wordpress.com/

https://stats.stackexchange.com/questions/239937/empirical-cdf-vs-cdf

Homework11_RA

Do a research about the random walk and its properties. Looking at your possible simulation in exercise 15_A, how would you describe the beaviour of the distribution of Y, as n increases ? What are mean and variance of Y at step n ?

The Random Walk

A random walk is a mathematical object, known as a stochastic or random process, that describes a path that consists of a succession of random steps on some mathematical space such as the integers.

More in general, a random walk refers to any process in which there is no observable pattern or trend; that is, where the movements of an object, or the values taken by a certain variable, are completely random. Certain real-life scenarios that could be modeled as random walks could be:

The movements of an animal foraging for food in the wilderness
The path traced by a molecule as it moves through a liquid or a gas (diffusion)
The price of a stock as it moves up and down
The path of a drunkard wandering through Greenwich Village
The financial status of a gambler at the roulette wheel in Las Vegas

A Fun Caveat on Random Walks: Monkeys and Typewriters

As n (steps) gets larger, we expect to get further and further from the origin. In fact, for an infinite number of random walks with an infinitely large number of steps, we will end up visiting every number on the number line.

This idea was illustrated quite vividly and humorously in the book Fooled by Randomness., by Nassim Nicholas Taleb In that book, Taleb posits the idea that:

if a sufficiently large group of monkeys were to type randomly on typewriters (a more complex variety of a random walk) for an infinite amount of time, one of those monkeys would necessarily compose The Iliad. And not only that—given infinite time, one of the monkeys who composes the Iliad will then go on to compose the Odyssey.

Distribution

The mean of the empirical distribution is an unbiased estimator (is the difference between this estimator’s expected value and the true value of the parameter being estimated. An estimator or decision rule with zero bias is called unbiased) of the mean of the population distribution.

When n increase, the mean is closer to the real value of theorical CDF.

The variance of the empirical distribution times n/n − 1 is an unbiased estimator of the variance of the population distribution.

https://en.wikipedia.org/wiki/Random_walk
https://en.wikipedia.org/wiki/Empirical_distribution_function
https://en.wikipedia.org/wiki/Bias_of_an_estimator
https://magoosh.com/statistics/what-is-random-walk-theory/

Homework14_A

Generate and represent m paths of n point each (m, n are program parameters), where each point represents a pair of:

time index
relative frequency of success (i, f(i))
where f(i) is the sum of i Bernoulli random variables p^x(1-p)^(1-x) observed at the times j=1, …, i.

At time n (last time) and any other inner time 1<j<n (j is a program parameter) create and represent with histogram the distribution of f(i).
At the same times (j and n), compute the absolute and relative frequency of the f(i)’s contained in the interval (p-ε, p+ε), where ε is a program parameter.

Code VB.Net

https://drive.google.com/file/d/1r2JZnVztQlfUp-w0ZCvnNpYgTZ3BU9X_/view?usp=sharing

Homework10_RA

Do a research about the various methods proposed to compute the running median (one pass, online algorithms). Store (cite sources) the algorithm that you think is a good candidate and explaining briefly how it works and possibly show a quick demo.

The running median

The median of a set of integers is the midpoint value of the data set for which an equal number of integers are less than and greater than the value.

If we keep each number in a sorted sequence then cost of single entry is O(n) and finding median is O(n).
A slight modification can be done by keeping the middle pointer and adjusting it based on the insertion on its left side and right side. In that case finding median after insertion is O(1). But the overall cost for finding median still remains O(n) as insertion in sorted sequence is necessary after each number is entered.
We can keep two heaps which divides the entered number in two almost equal halves. Half of the number would be greater than the median and the rest would be lesser. The upper half will be maintained in a min heap and the lower half will be maintained in a max heap. In this arrangement we can find out in O(1) time whether a new number would go to the upper half or lower half. All we need to do is to compare the new number with the head of two heaps. After deciding we can insert in a heap in O(log n) time. After this insertion if the heaps are unbalanced, we can just move from one heap to another. which is again of O(log n) complexity. And now we can find the median in O(1) time. If two heaps contain same number of elements then median is the average of the head of two heaps. If one is greater, then median is the head of the larger heap.

import java.util.Comparator;
import java.util.PriorityQueue;
import java.util.Random;


public class RunningMedian
{
    PriorityQueue<Integer> upperQueue;
    PriorityQueue<Integer> lowerQueue;

    public RunningMedian()
    {
        lowerQueue=new PriorityQueue<Integer>(
          20,new Comparator<Integer>()
        {

            @Override
            public int compare(Integer o1, Integer o2)
            {

                return -o1.compareTo(o2);
            }

        });
        upperQueue=new PriorityQueue<Integer>();
        upperQueue.add(Integer.MAX_VALUE);
        lowerQueue.add(Integer.MIN_VALUE);
    }

    public double getMedian(int num)
    {
        //adding the number to proper heap
            if(num>=upperQueue.peek())
                upperQueue.add(num);
            else
                lowerQueue.add(num);
        //balancing the heaps
        if(upperQueue.size()-lowerQueue.size()==2)
            lowerQueue.add(upperQueue.poll());
        else if(lowerQueue.size()-upperQueue.size()==2)
            upperQueue.add(lowerQueue.poll());
        //returning the median
        if(upperQueue.size()==lowerQueue.size())
            return(upperQueue.peek()+lowerQueue.peek())/2.0;
        else if(upperQueue.size()>lowerQueue.size())
            return upperQueue.peek();
        else
            return lowerQueue.peek();

    }
    public static void main(String[] args)
    {
        Random random=new Random();
        RunningMedian runningMedian=new RunningMedian();
        System.out.println("num\tmedian");
        for(int i=0;i<50;++i)
        {
            int num=random.nextInt(100);
            System.out.print(num);
            System.out.print("\t");
            System.out.println(
              runningMedian.getMedian(num));
        }

    }

}

http://www.dsalgo.com/2013/02/RunningMedian.php.html

https://dcc-backup.ligo.org/public/0027/T030168/000/T030168-00.pdf

Homework11_A

Make a short demonstrative program where you apply both the Riemann and Lebesgue approach to integration to compute numerically (with an increasingly large number of subdivisions) the integral on a bounded continuous function of your choice and compare the results. [Optionally, show with an animation, using the graphics object, the convergence to a limit, as the number of subdivisions of the function domain (for Riemann) or range (for Lebesgue) increases.]

Update: Code VB.Net v2.0

https://drive.google.com/file/d/1wH6vjj-yGOwr3C8F_GePlK2EpNK2HS-7/view?usp=sharing

Code VB.Net

https://drive.google.com/file/d/1m_H-ZOTsFOgt_2liKjBmuKxZbNFeZkqj/view?usp=sharing

Riemann integral(left) and Lesbesgue integral(right)

'take x_1 and x_2 that denote 'the range of function' in x axis
'take interval to divide the axis(in this case y axis)
'take function f to map the x in y axis with respect to the given function
'take g as inverse to f to return to the x axis

Public Function Lebesgue_int(x_1 As Integer, x_2 As Integer, interval As Integer, f As Func(Of Double, Double), g As Func(Of Double, Double)) As Double

...

'calculate the 'interval size' as height of horizontal rectangle
Dim dy As Double = range / CDbl(interval)

...

'calculate the area of rectangle and sum it
sum += dy * (x_2 - g(low_bnd))

...

End Function

While the Riemann integral considers the area under a curve as made out of vertical rectangles, the Lebesgue definition considers horizontal slabs that are not necessarily just rectangles, and so it is more flexible. For this reason, the Lebesgue definition makes it possible to calculate integrals for a broader class of functions.

One approach to constructing the Lebesgue integral is to make use of so-called simple functions: finite real-linear combinations of indicator functions. Simple functions can be used to approximate a measurable function, by partitioning the range into layers. The integral of a simple function is equal to the measure of a given layer, times the height of that layer.

https://www.quantstart.com/articles/Function-Objects-Functors-in-C-Part-1/
https://stackoverflow.com/questions/29318682/integration-with-riemann-sum-python
https://towardsdatascience.com/integrals-are-easy-visualized-riemann-integration-in-python-87dd02e90994
https://commons.wikimedia.org/wiki/File:Lebesgue_Integration_and_Lower_Sums.gif
https://en.wikipedia.org/wiki/Lebesgue_integration

Homework9_RA

Do a research about the various methods to generate, from a Uniform([0,1)), all the most important random variables (discrete and continuous).

Random Variable

Informally is a variable that is described as a variable whose values depend on outcomes of a random phenomenon.

In formal a random variable is understood as a measurable function defined on a probability space that maps from the sample space to the real numbers.
So the random variable is defined as a function, that must be measurable, which performs the mapping of the outcomes of a random process to a numeric value.

Random variables can be either discrete or continuous.

Not long after research began at RAND in 1946, the need arose for random numbers that could be used to solve problems of various kinds of experimental probability procedures. These applications, called Monte Carlo methods(therefore, a class of techniques for randomly sampling a probability distribution), required a large supply of random digits and normal deviates of high quality.

Random number generators

The purpose of random number generators (RNGs) is to produce sequences of numbers that appear as if they were generated randomly from a specified probability distribution.

A random number generator produces truly random numbers (the results are unpredictable). These are generally produced by physical devices also known as noise generator which are coupled with a computer. In computing, an apparatus that produces random numbers from a physical process is called a hardware random number generator or TRNG (for true random number generator).
Computers are deterministic in nature so producing truly random numbers with a computer is challenging, which is why we generally resort to using noise generators if we need “true” randomness. However, what we can do on a computer, is develop some sort of algorithm for generating a sequence of numbers that approximates the properties of random numbers.
When numbers are produced by some sort of algorithm or formula that simulates the values of a random variable X, they are called pseudorandom numbers. And the algorithm is called a pseudorandom number generator (or PRNG). The term “simulate” here is important: it simply means that the algorithm can generate sequences of numbers which have statistical properties that are similar (and this can be tested) to that of the random variable we want to simulate. For instance if we need to simulate a random variable X with probability distribution D, then we will need to test whether the sequence of numbers produced by our PRNG has the same distribution.

For some applications, such as cryptography, it is necessary to have pseudo-random number sequences for which prediction is computationally infeasible.

Uniform random generator

The binomial distribution with parameters n and p is the discrete probability distribution of the number of successes in a sequence of n independent experiments, each asking a yes–no question, and each with its own Boolean-valued outcome: success/yes (with probability p) or failure/no(with probability q = 1 − p).

The Bernoulli distribution, (a special case of the binomial distribution where a single trial is conducted (so n would be 1 for such a binomial distribution)),

Image for post — Bernoulli Distribution:
https://en.wikipedia.org/wiki/Bernoulli_distribution

is the discrete probability distribution of a random variable which takes the value 1 with probability p and the value 0 with probability q = 1 − p.

The Poisson distribution is a discrete probability distribution that expresses the probability of a given number of events occurring in a fixed interval of time or space if these events occur with a known constant mean rate and independently of the time since the last event. The Poisson distribution can also be used for the number of events in other specified intervals such as distance, area or volume.

The exponential distribution is the probability distribution of the time between events in a Poisson point process, i.e., a process in which events occur continuously and independently at a constant average rate. It is a particular case of the gamma distribution. It is the continuous analogue of the geometric distribution, and it has the key property of being memoryless.

The general form of Normal Distribution.
https://en.wikipedia.org/wiki/Normal_distribution

A normal (or Gaussian or Gauss or Laplace–Gauss) distribution is a type of continuous probability distribution for a real-valued random variable.

https://towardsdatascience.com/understanding-random-variable-a618a2e99b93
https://towardsdatascience.com/how-to-generate-random-variables-from-scratch-no-library-used-4b71eb3c8dc7
https://www.britannica.com/science/statistics/Random-variables-and-probability-distributions
https://statweb.stanford.edu/~owen/mc/Ch-unifrng.pdf
https://www.scratchapixel.com/lessons/mathematics-physics-for-computer-graphics/monte-carlo-methods-in-practice/generating-random-numbers
https://machinelearningmastery.com/monte-carlo-sampling-for-probability/
https://en.wikipedia.org/wiki/Poisson_distribution

Homework9_A

Prepare separately the following charts:
1) Scatterplot
2) Histogram/Column chart [in the histogram, within each class interval, draw also a vertical colored line where lies the true mean of the observations falling in that class]
3) Contingency table, using the graphics object and the Drawstring(), MeasureString(), DrawLine(), etc. methods.
When done, merge these charts in your previous application 7_A.
Use them to represent 2 numerical variables that you select from a CSV file. In particular, in the same picture box, you will make 2 separate charts:
1 rectangle (chart) will contain the contingency table
1 rectangle (chart) will contain the scatterplot, with the histograms/column charts and rug plots drawn respectively near the two axis (and oriented accordingly).

UPDATE: CodeVB.Net version 2.0

https://drive.google.com/file/d/1Os4CTy97ceuZFr3onDnqK09QQVE65dGV/view?usp=sharing

CodeVB.Net

https://drive.google.com/file/d/1EV97N2hHhMmilz8th-7K_i3Bf-e6c7jJ/view?usp=sharing

Some control miss. But i have fixed the problem with histogram and add the mean and the calculate of distribution. Fix also the proportion of height of histogram.
Add also move and wheel for graph(table and scatterplot+histo)

http://www.devcity.net/printarticle.aspx?articleid=138

Code VB.NET

Poisson Distribution

Update: Code VB.Net v2.0

Code VB.Net

Update: Code VB.Net v2.0

VB.NET Code

I have to thank Luca Scarmozzino who helped me with this homework!https://stats4cyber.wordpress.com/

The Random Walk

A Fun Caveat on Random Walks: Monkeys and Typewriters

Distribution

Code VB.Net

The running median

Update: Code VB.Net v2.0

Code VB.Net

Random Variable

Random number generators

Uniform random generator

I have to thank Luca Scarmozzino who helped me with this homework!
https://stats4cyber.wordpress.com/