Understand how the floating point representation works and describe systematically (possibly using categories) all the possible problems that can happen. Try to classify the various issues and limitations (representation, comparison, rounding, propagation, approximation, loss of significance, cancellation, etc.) and provide simple examples for each of the categories you have identified.
How it works
Floating-point numbers are represented in computer hardware as base 2 (binary) fractions.
Since computer memory is limited, you cannot store numbers with infinite precision; no matter whether you use binary fractions or decimal ones, at some point you have to cut off.
The idea is to compose a number of two main parts:
- A significand that contains the number’s digits. Negative significands represent negative numbers.
- An exponent that says where the decimal (or binary) point is placed relative to the beginning of the significand. Negative exponents represent numbers that are very small (i.e. close to zero).
Nearly all hardware and programming languages use floating-point numbers in the same binary formats, which are defined in the IEEE 754 standard. The usual formats are 32 or 64 bits in total length.
This scheme lets us represent both very large and very small numbers in a compact way.
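To make the layout concrete, here is a small Python sketch (assuming a standard IEEE 754 64-bit double) that pulls the sign, exponent, and significand fields out of a float's bit pattern:

```python
import struct

# Unpack a 64-bit IEEE 754 double into its three fields:
# 1 sign bit, 11 exponent bits, 52 significand (fraction) bits.
bits = struct.unpack(">Q", struct.pack(">d", -6.25))[0]
sign = bits >> 63
exponent = (bits >> 52) & 0x7FF       # biased by 1023
fraction = bits & ((1 << 52) - 1)     # the implicit leading 1 is not stored

# -6.25 = -1.5625 * 2**2, so the sign bit is set
# and the unbiased exponent is 2.
print(sign)             # 1
print(exponent - 1023)  # 2

# Reconstructing the value from the three fields gives the number back:
print((-1) ** sign * (1 + fraction / 2 ** 52) * 2 ** (exponent - 1023))  # -6.25
```

The same decomposition works for any finite normal double; only the special exponent values (all zeros, all ones) encode zeros, subnormals, infinities and NaNs differently.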
Issues
Representation
Representation error refers to the fact that some decimal fractions cannot be represented exactly as binary (base 2) fractions:
```python
>>> # 1/10 and 2/10 are not exactly representable as binary fractions
>>> 0.1 + 0.2
0.30000000000000004
```
Rounding
As already said, floating-point numbers have a limited number of digits, so they cannot represent all real numbers accurately: when there are more digits than the format allows, the leftover ones are omitted – the number is rounded.
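A quick Python illustration of both effects: a double holds 53 significant bits (roughly 15–17 decimal digits), so above 2^53 adding 1 is rounded away entirely, and the stored value of 0.1 is itself already rounded:

```python
# Above 2**53 the spacing between adjacent doubles exceeds 1,
# so the +1.0 below is rounded off completely.
big = 2.0 ** 53
print(big + 1.0 == big)     # True

# Printing 0.1 with more digits reveals the rounded binary value
# actually stored:
print(format(0.1, ".20f"))  # 0.10000000000000000555
```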
Cancellation
There are two kinds of cancellation: catastrophic and benign.
Catastrophic cancellation occurs when subtracting operands that are themselves subject to rounding errors. The subtraction itself does not introduce any error; rather, it exposes the error introduced in earlier operations (e.g. multiplications).
Benign cancellation occurs when subtracting exactly known quantities: if x and y have no rounding error, then, provided the subtraction is done with a guard digit, the difference x − y has a very small relative error.
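A short Python sketch of catastrophic cancellation (the 1 − cos(x) example is a standard illustration, not taken from the text above): for tiny x, cos(x) rounds to exactly 1.0, so the subtraction wipes out every significant digit, while an algebraically equivalent formula avoids the subtraction entirely:

```python
import math

x = 1e-8

# Naive form: 1 - cos(x) subtracts two nearly equal numbers.
# The true value of (1 - cos(x)) / x**2 is ~0.5, but cos(1e-8)
# rounds to 1.0, so every significant digit cancels.
naive = (1 - math.cos(x)) / x ** 2
print(naive)   # 0.0

# Equivalent identity 1 - cos(x) = 2 * sin(x/2)**2 has no cancellation:
stable = 2 * math.sin(x / 2) ** 2 / x ** 2
print(stable)  # ≈ 0.5
```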
Guard Digits
One method of computing the difference between two floating-point numbers is to compute the difference exactly and then round it to the nearest floating-point number:
x = 2.15 × 10^12
y = 0.0000000000000000125 × 10^12
x − y = 2.1499999999999999875 × 10^12, which rounds to 2.15 × 10^12
But floating-point hardware normally operates on a fixed number of digits (computing the difference exactly is very expensive when the operands differ greatly in size), so the smaller operand is shifted and its excess digits are discarded:
x = 2.15 × 10^12
y = 0.00 × 10^12
x − y = 2.15 × 10^12
Here the answer happens to be exactly the same as if the difference had been computed exactly and then rounded; with other operands, however, discarding the shifted-out digits can make the result badly wrong, which is why hardware keeps one or more extra guard digits during the subtraction.
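The compute-exactly-then-round behaviour can be imitated with Python's decimal module, here configured (purely as an illustration) to 3 significant digits:

```python
from decimal import Decimal, getcontext

# Simulate 3-significant-digit arithmetic. decimal behaves like
# IEEE 754: each operation is computed exactly, then rounded to
# the working precision.
getcontext().prec = 3
x = Decimal("10.1")
y = Decimal("9.93")
print(x - y)  # 0.17 -- the exact result fits in 3 digits

# Without a guard digit, y would first be shifted and truncated to
# align with x (9.93 -> 9.9), giving 10.1 - 9.9 = 0.2: the last
# digit would be completely wrong.
```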
Observation
Since many floating-point numbers are merely approximations of the exact value, for a given approximation f of a real number r there can be infinitely many other real numbers r1, r2, … that map to exactly the same approximation. Those numbers lie in a certain interval.
https://stackoverflow.com/questions/2100490/floating-point-inaccuracy-examples
This is an important phenomenon because if you perform calculations on that number (adding, subtracting, multiplying, etc.), you lose precision. Every number is just an approximation, so you are actually performing calculations with intervals.
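This can be observed directly in Python: distinct decimal literals collapse onto the very same double, and (on Python 3.9+) math.nextafter exposes the neighbouring representable values, i.e. the boundaries of the interval of reals that all round to the double near 0.1:

```python
import math

# Two different decimal literals, one and the same double:
print(0.1 == 0.10000000000000001)  # True

# The adjacent representable doubles around 0.1 (requires Python 3.9+):
print(math.nextafter(0.1, 0.0))    # largest double below 0.1
print(math.nextafter(0.1, 1.0))    # smallest double above 0.1
```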
There is a mathematical technique used to put bounds on rounding errors and measurement errors in mathematical computation:
Interval arithmetic (also known as interval mathematics, interval analysis, or interval computation)
A standard for interval arithmetic, IEEE Std 1788-2015, was approved in June 2015. Two reference implementations are freely available, developed by members of the standard's working group: the libieeep1788 library for C++ and the interval package for GNU Octave.
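As a rough sketch of the idea (a toy class, not one of the IEEE 1788 reference implementations), interval addition with outward rounding might look like this: every operation widens the bounds by one ulp in each direction, so the true real result is always contained.

```python
import math

class Interval:
    """Toy interval: [lo, hi] is guaranteed to contain the true value."""

    def __init__(self, lo, hi):
        self.lo, self.hi = lo, hi

    def __add__(self, other):
        # Round the lower bound down and the upper bound up by one ulp
        # so rounding in the additions can never lose the true result.
        return Interval(
            math.nextafter(self.lo + other.lo, -math.inf),
            math.nextafter(self.hi + other.hi, math.inf),
        )

    def __repr__(self):
        return f"[{self.lo}, {self.hi}]"

# 0.1 is not exactly representable, so start from the interval
# enclosing it:
tenth = Interval(math.nextafter(0.1, 0.0), math.nextafter(0.1, 1.0))
total = tenth + tenth + tenth        # encloses the real number 0.3
print(total.lo <= 0.3 <= total.hi)   # True
```

Real interval libraries also handle multiplication, division, sign changes and directed rounding modes; the sketch only shows the containment principle.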
Curiosity: a real-world example
https://en.wikipedia.org/wiki/Round-off_error#Real_world_example:_Patriot_missile_failure_due_to_magnification_of_roundoff_error
https://docs.python.org/2/tutorial/floatingpoint.html
https://floating-point-gui.de/errors/rounding/
https://floating-point-gui.de/formats/fp/
https://docs.oracle.com/cd/E19957-01/806-3568/ncg_goldberg.html
https://en.wikipedia.org/wiki/Interval_arithmetic#Application