Lesson_04 – Statistics Course

Homework10_A

Implement your own algorithm to compute a frequency distribution of the words from any text (possibly judiciously scraped from websites) and draw some personal graphical representation of the “word cloud”

Code VB.NET

https://drive.google.com/file/d/1nHYV1BiQIPjvwI08zaqnm7j_e2BoSfGO/view?usp=sharing

Connect to web page

' Suppress script-error pop-ups
WebBrowser1.ScriptErrorsSuppressed = True

If String.IsNullOrEmpty(url1) Then Return
If url1.Equals("about:blank") Then Return
' Add a scheme if the user typed the address without one
If Not url1.StartsWith("http://") AndAlso
   Not url1.StartsWith("https://") Then
    url1 = "https://" & url1
End If

' Load the page at url1 in the WebBrowser control
Try
    WebBrowser1.Navigate(New Uri(url1))
Catch ex As System.UriFormatException
    Return
End Try

Copy all the words

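' Select the whole page, copy it to the clipboard, then clear the selection
WebBrowser1.Document.ExecCommand("SelectAll", False, Nothing)
WebBrowser1.Document.ExecCommand("Copy", False, Nothing)
WebBrowser1.Document.ExecCommand("Unselect", False, Type.Missing)
' Put the copied text into the RichTextBox and empty the clipboard
RTXBCloud.Text = Clipboard.GetText()
Clipboard.Clear()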
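
Once the page text is available as a string, the frequency distribution itself is a plain counting step. The following is a minimal sketch in Python rather than VB.NET, assuming that a word is a run of letters or apostrophes and that case is ignored. The function name word_frequencies and the sample sentence are illustrative and are not taken from the linked project.

```python
import re
from collections import Counter

def word_frequencies(text):
    """Return (word, count) pairs sorted by descending count, ties alphabetically."""
    # Assumption: a "word" is a run of letters or apostrophes; case is ignored.
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter(words)
    return sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))

sample = "The cat and the dog and the bird"
for word, count in word_frequencies(sample):
    print(word, count)
```

Sorting in descending order gives the same kind of ordering that the drawing loop expects from orderDistrW.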

Build rectangles

' orderDistrW is sorted in descending order of frequency
For Each kvp In orderDistrW
    Dim rect As RectangleF
    ' Font size is proportional to the frequency (the most frequent word gets size 200)
    Dim f As New Font("Arial", CSng(kvp.Value * 200 / orderDistrW.First.Value))
    Dim s As SizeF = g.MeasureString(kvp.Key, f)

    ' Start the attempt counter at zero for every word
    Dim tries As Integer = 0
    Do
        ' Build a rectangle with the size of the word at a random position and use
        ' the Occupied function to check that this area of the PictureBox is free.
        ' x and y are chosen so that the rectangle stays inside the PictureBox.
        Dim x = viewport.Left + ((viewport.Right + 1 - s.Width) - viewport.Left) * R.NextDouble()
        Dim y = viewport.Top + ((viewport.Bottom + 1 - s.Height) - viewport.Top) * R.NextDouble()

        rect = New RectangleF(New PointF(CSng(x), CSng(y)), s)
        If Not Occupied(rect, listOfRect) Then Exit Do
        tries += 1
        ' Give up on this word after too many failed attempts
        If tries >= 100000 Then Continue For
    Loop

    ' Draw the word in a random color and remember the area it occupies
    g.DrawString(kvp.Key, f, New SolidBrush(Color.FromArgb(R.Next(256), R.Next(256), R.Next(256))), rect.Location)
    listOfRect.Add(rect)
Next
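
The Occupied function is called in the loop above but is not shown here. One plausible version is an axis-aligned rectangle overlap test. The sketch below is in Python and assumes rectangles are (x, y, width, height) tuples; the names overlaps and occupied are illustrative, and the actual VB.NET function in the linked project may work differently.

```python
def overlaps(a, b):
    """True if rectangles a and b, given as (x, y, width, height), share any area."""
    ax, ay, aw, ah = a
    bx, by, bw, bh = b
    return ax < bx + bw and bx < ax + aw and ay < by + bh and by < ay + ah

def occupied(rect, placed):
    """True if rect overlaps any rectangle that was already placed."""
    return any(overlaps(rect, other) for other in placed)

placed = [(0, 0, 10, 5)]
print(occupied((5, 2, 10, 5), placed))   # overlaps the placed rectangle
print(occupied((20, 0, 10, 5), placed))  # far to the right, so the spot is free
```

With this test, rectangles that only touch along an edge do not count as overlapping, so words can sit directly next to each other.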

Homework9_A

Prepare separately the following charts:
1) Scatterplot
2) Histogram/Column chart [in the histogram, within each class interval, draw also a vertical colored line where lies the true mean of the observations falling in that class]
3) Contingency table, using the graphics object and the DrawString(), MeasureString(), DrawLine(), etc. methods.
When done, merge these charts in your previous application 7_A.
Use them to represent 2 numerical variables that you select from a CSV file. In particular, in the same picture box, you will make 2 separate charts:
1 rectangle (chart) will contain the contingency table
1 rectangle (chart) will contain the scatterplot, with the histograms/column charts and rug plots drawn respectively near the two axes (and oriented accordingly).

UPDATE: Code VB.NET version 2.0

https://drive.google.com/file/d/1Os4CTy97ceuZFr3onDnqK09QQVE65dGV/view?usp=sharing

Code VB.NET

https://drive.google.com/file/d/1EV97N2hHhMmilz8th-7K_i3Bf-e6c7jJ/view?usp=sharing

Some controls are still missing, but I have fixed the problem with the histogram and added the mean and the computation of the distribution. I also fixed the proportions of the histogram heights.
I also added mouse move and mouse wheel support for the charts (table and scatterplot + histograms).

http://www.devcity.net/printarticle.aspx?articleid=138

Homework10_R

Explain a unified conceptual framework to obtain all most common measures of central tendency using the concept of distance (or “premetric” in general).

Measures of Central Tendency

A measure of central tendency is a single value that attempts to describe a set of data by identifying the central position within that set of data. As such, measures of central tendency are sometimes called measures of central location. They are also classed as summary statistics.

The p-norm and L^p spaces

For a real number p ≥ 1, the p-norm or L^p-norm of x is defined by:

$$\left\|x\right\|_{p}=\left(|x_{1}|^{p}+|x_{2}|^{p}+\dotsb +|x_{n}|^{p}\right)^{1/p}$$

The Euclidean norm is the 2-norm of this family, and the 1-norm corresponds to the rectilinear (Manhattan) distance.

The length of a vector x = (x₁, x₂, …, x_n) in the n-dimensional real vector space Rⁿ is usually given by the Euclidean norm:

$$\left\|x\right\|_{2}=\left(x_{1}^{2}+x_{2}^{2}+\dotsb +x_{n}^{2}\right)^{1/2}$$

The Euclidean distance between two points x and y is the length ||x − y||₂ of the straight line between the two points.

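For 0 < p < 1 the p-norm formula no longer defines a norm, because the triangle inequality fails, but the function $d_{p}(x,y)=\sum _{i=1}^{n}|x_{i}-y_{i}|^{p}$ (without the 1/p root) still defines a metric.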

The L^p spaces are function spaces defined using a natural generalization of the p-norm for finite-dimensional vector spaces.
In statistics, measures of central tendency and statistical dispersion, such as the mean, median, and standard deviation, are defined in terms of L^p metrics, and measures of central tendency can be characterized as solutions to variational problems, in the sense of the calculus of variations, namely minimizing variation from the center.
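In the sense of L^p spaces, the correspondence is:

L^0: dispersion = variation ratio, central tendency = mode
L^1: dispersion = average absolute deviation, central tendency = median
L^2: dispersion = standard deviation, central tendency = mean
L^∞: dispersion = maximum deviation, central tendency = midrange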

In equations, for a given (finite) data set X, thought of as a vector x = (x₁,…,x_n), the dispersion about a point c is the “distance” from x to the constant vector c = (c,…,c) in the p-norm (normalized by the number of points n):

$$f_{p}(c)=\left\|\mathbf{x}-\mathbf{c}\right\|_{p}:=\left(\frac{1}{n}\sum_{i=1}^{n}\left|x_{i}-c\right|^{p}\right)^{1/p}$$

For p = 0 and p = ∞ these functions are defined by taking limits. The measure of central tendency associated with each p is the value of c that minimizes f_p(c).
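
The correspondence can be checked numerically by searching for the c that minimizes f_p(c). The sketch below is a brute-force check on a grid of candidate centers, not the way these measures are computed in practice. It assumes the counting convention for p = 0 (the fraction of points that differ from c) and the maximum deviation for p = ∞; the data set and the names dispersion and best_center are made up for illustration.

```python
import math

def dispersion(data, c, p):
    """f_p(c): normalized p-norm distance between the data and the constant c."""
    n = len(data)
    if p == 0:
        # Assumed convention for p = 0: fraction of points that differ from c.
        return sum(1 for x in data if x != c) / n
    if p == math.inf:
        # For p = infinity the largest deviation dominates.
        return max(abs(x - c) for x in data)
    return (sum(abs(x - c) ** p for x in data) / n) ** (1 / p)

data = [1, 1, 2, 4, 12]
grid = [i / 2 for i in range(25)]  # candidate centers 0.0, 0.5, ..., 12.0

def best_center(p):
    """Candidate center with the smallest dispersion f_p."""
    return min(grid, key=lambda c: dispersion(data, c, p))

for p, name in [(0, "mode"), (1, "median"), (2, "mean"), (math.inf, "midrange")]:
    print(p, name, best_center(p))
```

For this data set the four minimizers are 1 (mode), 2 (median), 4 (mean) and 6.5 (midrange).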

Clustering

Instead of a single central point, one can ask for multiple points such that the variation from these points is minimized. This leads to cluster analysis, where each point in the data set is clustered with the nearest “center”.
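
A small sketch of this idea in one dimension, using the squared (L^2) distance so that each center is the mean of its cluster. This is the usual k-means (Lloyd) iteration; the starting centers, the data and the function name k_means_1d are arbitrary choices made for the example.

```python
def k_means_1d(data, centers, iterations=10):
    """Lloyd's algorithm in one dimension with the squared (L^2) distance."""
    centers = list(centers)
    for _ in range(iterations):
        # Assignment step: each point joins its nearest center.
        clusters = [[] for _ in centers]
        for x in data:
            nearest = min(range(len(centers)), key=lambda i: abs(x - centers[i]))
            clusters[nearest].append(x)
        # Update step: each center moves to the mean of its cluster
        # (an empty cluster keeps its previous center).
        centers = [sum(c) / len(c) if c else centers[i]
                   for i, c in enumerate(clusters)]
    return centers, clusters

data = [1, 2, 3, 10, 11, 12]
centers, clusters = k_means_1d(data, centers=[0, 5])
print(centers)
print(clusters)
```

Replacing the mean in the update step with the median would minimize the L^1 variation instead.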

Mode, median and mean

Measures of central tendency help you find the middle, or the average, of a data set. The 3 most common measures of central tendency are:

mode: the most frequent value.
median: the middle number in an ordered data set.
mean: the sum of all values divided by the total number of values.

https://www.scribbr.com/statistics/central-tendency/
https://en.wikipedia.org/wiki/Lp_space#The_p-norm_in_finite_dimensions
https://en.wikipedia.org/wiki/Central_tendency

Homework11_R

What are the most common types of means known? Find one example where these two types of means arise naturally: geometric, harmonic.

General or power mean

In mathematics, generalized means (or power means) are a family of functions for aggregating sets of numbers that include as special cases the Pythagorean means (arithmetic, geometric, and harmonic means).
The generalized mean or power mean is:

$$M_{p}(x_{1},\dots ,x_{n})=\left(\frac{1}{n}\sum_{i=1}^{n}x_{i}^{p}\right)^{1/p}$$

Special cases:
https://en.wikipedia.org/wiki/Generalized_mean#Special_cases
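
As an illustration, the power mean can be computed directly from the formula. This is a minimal sketch for positive numbers, in which p = 0 is handled as the limit case (the geometric mean); the function name power_mean and the sample values are illustrative.

```python
import math

def power_mean(values, p):
    """Generalized mean M_p of positive numbers; p = 0 is taken as the limit."""
    n = len(values)
    if p == 0:
        # Limit p -> 0: geometric mean, computed through logarithms.
        return math.exp(sum(math.log(x) for x in values) / n)
    return (sum(x ** p for x in values) / n) ** (1 / p)

values = [1, 2, 4]
for p, name in [(-1, "harmonic"), (0, "geometric"), (1, "arithmetic"), (2, "quadratic")]:
    print(p, name, power_mean(values, p))
```

For the same data the result grows with p, so the harmonic mean is the smallest of the three Pythagorean means and the arithmetic mean the largest.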

Arithmetic mean

It is generally referred to as the average, or simply the mean (p = 1).

Geometric mean

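It indicates the central tendency or typical value of a set of numbers by using the product of their values (the limit of the power mean as p → 0):

$$\left(\prod_{i=1}^{n}x_{i}\right)^{1/n}=\sqrt[n]{x_{1}x_{2}\cdots x_{n}}$$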

The geometric mean can be understood in terms of geometry. The geometric mean of two numbers, $a$ and $b$, is the length of one side of a square whose area is equal to the area of a rectangle with sides of lengths $a$ and $b$.
The geometric mean is used in finance to calculate average growth rates, where it is referred to as the compound annual growth rate (CAGR).
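
A short sketch of the growth-rate case, with made-up yearly returns. It shows that compounding the geometric mean of the yearly growth factors reproduces the total growth, which the arithmetic mean does not.

```python
import math

def geometric_mean(values):
    """n-th root of the product of n positive numbers."""
    return math.prod(values) ** (1 / len(values))

# Hypothetical yearly returns of +10%, +20% and -10%, written as growth factors.
growth_factors = [1.10, 1.20, 0.90]
g = geometric_mean(growth_factors)
arithmetic = sum(growth_factors) / len(growth_factors)

print(f"geometric mean growth factor:  {g:.4f}")
print(f"arithmetic mean growth factor: {arithmetic:.4f}")
# Compounding the geometric mean for 3 years gives back the total growth.
print(f"total growth: {math.prod(growth_factors):.4f}, g**3: {g ** 3:.4f}")
```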

Harmonic Mean

Typically, it is appropriate for situations where the average of rates is desired (p = -1):

$$H=\frac{n}{\frac{1}{x_{1}}+\frac{1}{x_{2}}+\dotsb +\frac{1}{x_{n}}}$$

In computer science, specifically information retrieval and machine learning, the harmonic mean of the precision (true positives per predicted positive) and the recall (true positives per real positive) is often used as an aggregated performance score for the evaluation of algorithms and systems: the F-score (or F-measure). This is used in information retrieval because only the positive class is of relevance, while number of negatives, in general, is large and unknown.
The weighted harmonic mean is used in finance to average multiples like the price-earnings ratio because it gives equal weight to each data point.
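
Two small examples of the harmonic mean, both with made-up numbers: the average speed over a round trip driven at two different speeds, and the F-score as the harmonic mean of precision and recall.

```python
def harmonic_mean(values):
    """Number of values divided by the sum of their reciprocals."""
    return len(values) / sum(1 / x for x in values)

# The same distance is driven at 60 km/h and then back at 40 km/h.
average_speed = harmonic_mean([60, 40])
print(average_speed)

# F-score as the harmonic mean of precision and recall.
precision, recall = 0.5, 1.0
f_score = harmonic_mean([precision, recall])
print(f_score)
```

The average speed comes out at about 48 km/h, not the 50 km/h suggested by the arithmetic mean, because more time is spent driving at the lower speed.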

https://en.wikipedia.org/wiki/Geometric_mean
https://www.investopedia.com/ask/answers/060115/what-are-some-examples-applications-geometric-mean.asp
https://econtutorials.com/blog/mean-and-its-types-in-statistics/
https://en.wikipedia.org/wiki/Harmonic_mean
https://www.investopedia.com/terms/h/harmonicaverage.asp

Homework12_R

Explain the idea underlying the measures of dispersion and the reasons of their importance.

Dispersion

In statistics, dispersion (also called variability, scatter, or spread) is the extent to which a distribution is stretched or squeezed or, also, is a way of describing how spread out a set of data is. Common examples of measures of statistical dispersion are the variance and standard deviation.

Dispersion is contrasted with location or central tendency, and together they are the most used properties of distributions.

Some measures of the dispersion

Range: the simplest measure of dispersion, defined as the difference between the largest value and the smallest value.

Standard deviation: the most commonly used measure. It describes the spread of the data about the mean.

Why is it necessary?

While measures of central tendency are used to estimate “normal” values of a dataset, measures of dispersion are important for describing the spread of the data, or its variation around a central value.
Two distinct samples may have the same mean or median, but completely different levels of variability, or vice versa. A proper description of a set of data should include both of these characteristics.
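
A quick illustration with two made-up samples that share the same mean and median but have very different spread:

```python
from statistics import mean, median, pstdev

a = [48, 49, 50, 51, 52]
b = [10, 30, 50, 70, 90]

for name, sample in (("a", a), ("b", b)):
    value_range = max(sample) - min(sample)
    # pstdev is the population standard deviation (divides by n).
    print(name, mean(sample), median(sample), value_range, round(pstdev(sample), 2))
```

Both samples have mean and median 50, but the range is 4 for the first and 80 for the second, and the standard deviations differ accordingly.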

When it comes to samples, dispersion is important because it determines the margin of error you will have when making inferences about measures of central tendency, such as averages.
In short, measures of dispersion show you how variable your data are.

https://en.wikipedia.org/wiki/Statistical_dispersion
https://iridl.ldeo.columbia.edu/dochelp/StatTutorial/Dispersion/index.html
https://www.statisticssolutions.com/dispersion/
https://exploringyourmind.com/measures-of-dispersion-in-statistics/

Homework13_R

Find out all the most important properties of the linear regression.

What is it?

In statistics, linear regression is a linear approach to modeling the relationship between a scalar response (or dependent variable) and one or more explanatory variables (or independent variables).

Linear regression finds the straight line, called the least squares regression line (LSRL), that best represents the observations in a bivariate data set.
Suppose Y is a dependent variable, and X is an independent variable. The population regression line is:

Y = a + bX

where a is a constant (the intercept) and b is the regression coefficient (the slope, or angular coefficient, of the line).

The regression line has the following properties:

The line minimizes the sum of squared differences between observed values and predicted values
The regression line passes through the point whose coordinates are the mean of the X values and the mean of the Y values.
The regression constant (a) is equal to the y intercept of the regression line.
The regression coefficient (b) is the average change in the dependent variable (Y) for a 1-unit change in the independent variable (X). It is the slope of the regression line.
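
These properties can be verified on a small example. The sketch below fits the line with the standard least-squares formulas and checks that it passes through the point of means; it also prints the sum of the residuals, which is zero for a least-squares line with an intercept. The data and the function name least_squares_line are made up for illustration.

```python
from statistics import mean

def least_squares_line(xs, ys):
    """Return (a, b) of the line y = a + b*x that minimizes the squared residuals."""
    x_bar, y_bar = mean(xs), mean(ys)
    b = (sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
         / sum((x - x_bar) ** 2 for x in xs))
    a = y_bar - b * x_bar
    return a, b

xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
a, b = least_squares_line(xs, ys)
residuals = [y - (a + b * x) for x, y in zip(xs, ys)]

print(a, b)                         # intercept and slope
print(a + b * mean(xs), mean(ys))   # the line passes through the point of means
print(sum(residuals))               # numerically zero
```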

https://en.wikipedia.org/wiki/Linear_regression
https://www.tandfonline.com/doi/abs/10.1080/00220671.1947.10881608?journalCode=vjer20
https://stattrek.com/regression/linear-regression.aspx

Homework7_RA

Research the real-world window-to-viewport transformation.

What is it?

Window-to-viewport transformation is the process of transforming 2D world-coordinate objects to device coordinates. Objects inside the world (or clipping) window are mapped to the viewport, which is the area of the screen where world coordinates are displayed.

General Terms:

World coordinate – the Cartesian coordinate system with respect to which the diagram is defined, with bounds such as X_wmin, X_wmax, Y_wmin, Y_wmax
Device coordinate – the screen coordinate system where the objects are to be displayed, with bounds such as X_vmin, X_vmax, Y_vmin, Y_vmax
Window – the area in world coordinates selected for display.
Viewport – the area in device coordinates where the graphics are to be displayed.

Mathematical Calculation of Window to Viewport:

It may be possible that the size of the Viewport is much smaller or greater than the Window. In these cases, we have to increase or decrease the size of the Window according to the Viewport and for this, we need some mathematical calculations.

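(x_w, y_w): a point in the window
(x_v, y_v): the corresponding point in the viewport

x_v = x_vmin + (x_w - x_wmin) * Sx
y_v = y_vmin + (y_w - y_wmin) * Sy

where Sx and Sy are the scaling factors:

Sx = (x_vmax - x_vmin) / (x_wmax - x_wmin)
Sy = (y_vmax - y_vmin) / (y_wmax - y_wmin)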

Example in C#

Manual transformation.

// C# program to implement
// window-to-viewport transformation
using System;

class GFG
{
    // Maps the window point (x_w, y_w) to the viewport and prints the result
    static void WindowtoViewport(int x_w, int y_w,
                                 int x_wmax, int y_wmax,
                                 int x_wmin, int y_wmin,
                                 int x_vmax, int y_vmax,
                                 int x_vmin, int y_vmin)
    {
        // Scaling factors for the x coordinate and the y coordinate
        float sx = (float)(x_vmax - x_vmin) / (x_wmax - x_wmin);
        float sy = (float)(y_vmax - y_vmin) / (y_wmax - y_wmin);

        // Point on the viewport (truncated to integer coordinates)
        int x_v = (int)(x_vmin + (x_w - x_wmin) * sx);
        int y_v = (int)(y_vmin + (y_w - y_wmin) * sy);

        Console.WriteLine("The point on viewport: ({0}, {1})", x_v, y_v);
    }

    // Driver code
    public static void Main(string[] args)
    {
        // Boundary values for the window
        int x_wmax = 80, y_wmax = 80, x_wmin = 20, y_wmin = 40;

        // Boundary values for the viewport
        int x_vmax = 60, y_vmax = 60, x_vmin = 30, y_vmin = 40;

        // Point on the window
        int x_w = 30, y_w = 80;

        WindowtoViewport(x_w, y_w, x_wmax, y_wmax, x_wmin, y_wmin,
                         x_vmax, y_vmax, x_vmin, y_vmin);
    }
}

// This code is contributed by PrinciRaj1992
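
As a check, the values used in Main (window from (20, 40) to (80, 80), viewport from (30, 40) to (60, 60), window point (30, 80)) can be worked out by hand:

```latex
\begin{aligned}
S_x &= \frac{60-30}{80-20} = 0.5, & S_y &= \frac{60-40}{80-40} = 0.5,\\
x_v &= 30 + (30-20)\cdot 0.5 = 35, & y_v &= 40 + (80-40)\cdot 0.5 = 60.
\end{aligned}
```

So the program reports the viewport point (35, 60).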

https://www.geeksforgeeks.org/window-to-viewport-transformation-in-computer-graphics-with-implementation/

https://www.javatpoint.com/computer-graphics-window-to-viewport-co-ordinate-transformation

Homework8_RA (To be reviewed)

Research, with examples, how matrices and homogeneous coordinates can be useful for graphics transformations and charts.

Homogeneous coordinates

In mathematics, homogeneous coordinates are a system of coordinates used in projective geometry, as Cartesian coordinates are used in Euclidean geometry.

They have the advantage that the coordinates of points, including points at infinity, can be represented using finite coordinates.
Formulas involving homogeneous coordinates are often simpler and more symmetric than their Cartesian counterparts.
Homogeneous coordinates have a range of applications, including computer graphics and 3D computer vision, where they allow affine transformations and, in general, projective transformations to be easily represented by a matrix.

Any point in the projective plane is represented by a triple (X, Y, Z), called the homogeneous coordinates or projective coordinates of the point, where X, Y and Z are not all 0.
The point represented by a given set of homogeneous coordinates is unchanged if the coordinates are multiplied by a common factor.
Conversely, two sets of homogeneous coordinates represent the same point if and only if one is obtained from the other by multiplying all the coordinates by the same non-zero constant.
When Z is not 0 the point represented is the point (X/Z, Y/Z) in the Euclidean plane.
When Z is 0 the point represented is a point at infinity.

Matrices and transformations

Homogeneous coordinates make it possible to compute transformations by matrix multiplication, which is very efficient: a whole chain of transformations can be combined into a single matrix.

A translation in the plane cannot be written as multiplication by a 2×2 matrix, but with homogeneous coordinates it becomes multiplication by a 3×3 matrix.
A point (x, y) can be rewritten in homogeneous coordinates as (xw, yw, w).
The homogeneous parameter w is a non-zero value, so the x and y coordinates can easily be recovered by dividing the first and second numbers by the third.
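
A minimal sketch of how this is used for charts, in plain Python with 3×3 matrices stored as nested lists. It builds the window-to-viewport mapping from Homework7_RA as a single matrix (translate the window corner to the origin, scale, translate to the viewport corner) and applies it to the same point. The helper names mat_mul, translation, scaling and apply are illustrative.

```python
def mat_mul(a, b):
    """Multiply two 3x3 matrices given as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]

def translation(tx, ty):
    """Homogeneous 3x3 matrix that moves a point by (tx, ty)."""
    return [[1, 0, tx], [0, 1, ty], [0, 0, 1]]

def scaling(sx, sy):
    """Homogeneous 3x3 matrix that scales x by sx and y by sy."""
    return [[sx, 0, 0], [0, sy, 0], [0, 0, 1]]

def apply(m, point):
    """Apply a homogeneous matrix to a 2D point and convert back to Cartesian."""
    x, y = point
    xh, yh, w = (m[i][0] * x + m[i][1] * y + m[i][2] for i in range(3))
    return (xh / w, yh / w)

# Window (20, 40)-(80, 80) mapped to viewport (30, 40)-(60, 60):
# the rightmost matrix is applied first.
m = mat_mul(translation(30, 40), mat_mul(scaling(0.5, 0.5), translation(-20, -40)))
print(apply(m, (30, 80)))
```

The combined matrix sends the window point (30, 80) to (35, 60), and the same matrix can then be reused for every point of the chart.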

Insights into geometry
http://precollegiate.stanford.edu/circle/math/notes06f/homogenous.pdf

https://en.wikipedia.org/wiki/Homogeneous_coordinates
https://uomustansiriyah.edu.iq/media/lectures/9/9_2019_04_24!06_36_54_PM.pdf