Homework5_R

Describe the most common configuration of data repositories in the real world and in corporate environments, covering concepts such as operational systems (OLTP), data warehouses (DW), data marts, and analytical and statistical systems (OLAP). Try to draw a conceptual picture of how all these elements work together and how the flow of data and information is processed to extract useful knowledge from raw data.

It’s one thing to collect and store data
it’s another to accurately decipher what the data is saying

“Statistics are like bikinis. What they reveal is suggestive, but what they conceal is vital.”
Aaron Levenstein, Business Professor at Baruch College

Successful organizations continue to derive business value from their data. One of the first steps towards a successful big data strategy is choosing the underlying technology of how data will be stored, searched, analyzed, and reported on.

Data repository

A data repository, also known as a data library or data archive, is a large database infrastructure that collects, manages, and stores data sets for data analysis, sharing, and reporting.
Examples are:

  • data warehouse
  • data lake
  • data marts

We start from a source of data (e.g. a business), which streams its data into a repository (e.g. a data lake).

OLTP (On-Line Transaction Processing) systems are the operational systems that provide the source data to a data repository such as a data lake.

Data lakes are a great way to store huge amounts of data and drive business insights, but they have limited governance and weak traceability, lineage, and quality. Many lakes have turned into swamps, that is, disorganized storage where data is hard to find and trust.

With operations such as ETL (Extract, Transform, and Load) or ELT (Extract, Load, and Transform), we can build a data warehouse, a more organized data repository, and store the data in it.
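As an illustration of the ETL step described above, here is a minimal sketch in Python. The record fields, the `orders` table, and the in-memory SQLite database standing in for the warehouse are all assumptions made for the example, not part of any specific product.

```python
import sqlite3

# Hypothetical raw records as they might sit in a data lake: untyped strings,
# inconsistent formatting, and one row that cannot be parsed.
raw_records = [
    {"order_id": "1", "customer": " alice ", "amount": "19.90", "country": "it"},
    {"order_id": "2", "customer": "Bob", "amount": "5.00", "country": "IT"},
    {"order_id": "3", "customer": "carol", "amount": "n/a", "country": "fr"},
]


def extract(records):
    """Extract: read the raw records from the source (here, an in-memory list)."""
    return list(records)


def transform(records):
    """Transform: cast types, normalize text, and drop rows that cannot be parsed."""
    clean = []
    for rec in records:
        try:
            amount = float(rec["amount"])
        except ValueError:
            continue  # skip rows whose amount is not numeric
        clean.append(
            (
                int(rec["order_id"]),
                rec["customer"].strip().title(),
                amount,
                rec["country"].upper(),
            )
        )
    return clean


def load(rows, conn):
    """Load: write the cleaned rows into a warehouse table with a fixed schema."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS orders ("
        "order_id INTEGER PRIMARY KEY, customer TEXT, amount REAL, country TEXT)"
    )
    conn.executemany("INSERT INTO orders VALUES (?, ?, ?, ?)", rows)
    conn.commit()


warehouse = sqlite3.connect(":memory:")  # stand-in for a real data warehouse
load(transform(extract(raw_records)), warehouse)
loaded = warehouse.execute("SELECT * FROM orders ORDER BY order_id").fetchall()
print(loaded)
```

In an ELT variant of the same sketch, the raw rows would be loaded first and the cleaning would run inside the warehouse afterwards.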

A data warehouse stores data in a structured, hierarchical form (files or folders with a predefined schema), which helps to organize the data and use it for strategic decisions.

From the data warehouse we can create another kind of repository that is more focused on a specific task: data marts.

On top of the data marts sit the systems that analyze the data, such as OLAP (On-Line Analytical Processing).
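The step from warehouse to data mart to OLAP can be sketched as follows. This is only a toy example: the `sales` table, the `retail_mart` name, and the figures are hypothetical, and an in-memory SQLite database stands in for both the warehouse and the mart.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # stand-in for the data warehouse

# Warehouse table covering every department (hypothetical figures).
conn.execute(
    "CREATE TABLE sales (department TEXT, region TEXT, quarter TEXT, revenue REAL)"
)
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?, ?)",
    [
        ("retail", "north", "Q1", 100.0),
        ("retail", "north", "Q2", 150.0),
        ("retail", "south", "Q1", 80.0),
        ("online", "north", "Q1", 200.0),
    ],
)

# Data mart: a smaller repository focused on one task (here, retail sales only).
conn.execute(
    "CREATE TABLE retail_mart AS "
    "SELECT region, quarter, revenue FROM sales WHERE department = 'retail'"
)

# OLAP-style roll-up: aggregate revenue by region across all quarters.
rollup = conn.execute(
    "SELECT region, SUM(revenue) FROM retail_mart GROUP BY region ORDER BY region"
).fetchall()

# OLAP-style drill-down: the same measure, broken out by region and quarter.
drilldown = conn.execute(
    "SELECT region, quarter, SUM(revenue) FROM retail_mart "
    "GROUP BY region, quarter ORDER BY region, quarter"
).fetchall()

print(rollup)
print(drilldown)
```

The mart holds only the rows and columns the retail analysts need, so the analytical queries run against a much smaller table than the full warehouse.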

Conclusions

Data lakes offer the flexibility of storing raw data, including all the metadata, and a schema can be applied when the data is extracted for analysis (schema-on-read). Databases and data warehouses require ETL processes in which the raw data is transformed into a predetermined structure, an approach known as schema-on-write.
Data warehouses typically deal with large data sets, but data analysis requires easy-to-find and readily available data. That’s why smart companies use data marts.
Data marts are one key to efficiently transforming information into insights.
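The difference between applying a schema when reading (data lake) and when writing (data warehouse) can be illustrated with a short sketch. The `SCHEMA` fields, the `conform` helper, and the sample records are assumptions invented for the example.

```python
import json

# Hypothetical target schema: field name -> type used to cast the raw value.
SCHEMA = {"user_id": int, "temperature": float}


def conform(record):
    """Cast a raw dict to SCHEMA; raises KeyError or ValueError if it does not fit."""
    return {field: cast(record[field]) for field, cast in SCHEMA.items()}


raw_lines = [
    '{"user_id": "7", "temperature": "21.5", "note": "extra field"}',
    '{"user_id": "8", "temperature": "unknown"}',
]

# Schema-on-read (data lake): store every raw line as-is and
# apply the schema only when the data is read for analysis.
lake = list(raw_lines)


def read_from_lake(lines):
    rows = []
    for line in lines:
        try:
            rows.append(conform(json.loads(line)))
        except (KeyError, ValueError):
            pass  # the raw record stays in the lake; this reader just skips it
    return rows


# Schema-on-write (data warehouse): validate before storing and
# reject records that do not match the predetermined structure.
warehouse, rejected = [], []
for line in raw_lines:
    try:
        warehouse.append(conform(json.loads(line)))
    except (KeyError, ValueError):
        rejected.append(line)

print(len(lake), read_from_lake(lake))
print(warehouse, len(rejected))
```

In the lake nothing is lost at write time, so a later reader with a different schema could still use the second record; in the warehouse the cost of cleaning is paid once, up front.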

Even with the improved flexibility and efficiency that data marts offer, big data (and big business) is still becoming too big for many on-premises solutions. As data warehouses and data lakes move to the cloud, so too do data marts.

Sources:

Data Repository

OLTP and OLAP

ELT and ETL

https://www.mackenziecorp.com/much-data-not-enough-data-lets-start/
https://www.confluent.io/learn/database-data-lake-data-warehouse-compared/
https://www.talend.com/resources/what-is-data-mart/

Homework4_R

A characteristic (or attribute, feature, or property) of the units of observation can be measured and operationalized at different “levels” on a given unit of observation, giving rise to different possible operative variables. Find out about the proposed classifications of variables and express your opinion about their respective usefulness.

Level of measurement

Level of measurement, also called scale of measure, is a classification that describes the nature of the information within the values assigned to variables.
Stanley Smith Stevens developed the best-known classification with four levels, or scales, of measurement.

More generally, we have:

Qualitative/Categorical

  • Nominal: categories that cannot be put in any order
  • Ordinal: categories that, even if they aren’t numbers, can be ordered, but that still do not allow for a relative degree of difference between them

Quantitative/Numerical

  • Interval: differences are meaningful (values have an order, like ordinal, but there are also equal intervals between adjacent values)
  • Ratio: differences are meaningful (like interval), but there is also a true zero point
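One way to see the practical difference between the four levels is to look at which summary statistics make sense for each. The following is a small sketch with made-up data; the variable names and values are assumptions chosen only for illustration.

```python
import statistics

blood_type = ["A", "O", "O", "B", "O"]                      # nominal
satisfaction = ["low", "high", "medium", "medium", "high"]  # ordinal
ORDER = ["low", "medium", "high"]                           # ranking of the ordinal labels
temperature_c = [10.0, 20.0, 30.0]                          # interval (no true zero)
weight_kg = [50.0, 100.0]                                   # ratio (true zero)

# Nominal: only counting is meaningful, so the mode is the natural summary.
nominal_mode = statistics.mode(blood_type)

# Ordinal: order is meaningful, so a median can be taken on the ranks.
ranks = [ORDER.index(value) for value in satisfaction]
ordinal_median = ORDER[statistics.median_low(ranks)]

# Interval: differences are meaningful, so the mean is meaningful too.
interval_mean = statistics.mean(temperature_c)


def c_to_f(celsius):
    """Convert Celsius to Fahrenheit."""
    return celsius * 9 / 5 + 32


# Interval ratios are NOT meaningful: they change with the unit of measurement.
ratio_c = temperature_c[1] / temperature_c[0]
ratio_f = c_to_f(temperature_c[1]) / c_to_f(temperature_c[0])

# Ratio-level ratios ARE meaningful: they survive a change of unit.
KG_TO_LB = 2.20462
ratio_kg = weight_kg[1] / weight_kg[0]
ratio_lb = (weight_kg[1] * KG_TO_LB) / (weight_kg[0] * KG_TO_LB)

print(nominal_mode, ordinal_median, interval_mean)
print(ratio_c, ratio_f, ratio_kg, ratio_lb)
```

The temperature ratio depends on whether Celsius or Fahrenheit is used, which is why "twice as hot" is not a valid statement on an interval scale, while "twice as heavy" is valid on a ratio scale.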

Usefulness

While these levels are reasonable, they are not exhaustive. Other statisticians have proposed new typologies, but Stevens's seems to be the most used, because the extended levels of measurement are rarely used outside of academic geography.

We need to pay attention, because the same variable may have a different scale type depending on how it is measured and on the goals of the analysis:

Age is usually ratio data (quantitative), but in some cases we can treat age as qualitative, for example when it is recorded in groups such as child, adult, and senior.
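A small sketch of how the same characteristic, age, can end up on two different scales. The group names and cut-off ages below are hypothetical and chosen only for the example.

```python
ages_years = [4, 17, 35, 70]  # ratio scale: true zero, so ratios are meaningful

# Hypothetical cut-offs: (lower bound in years, group label), in increasing order.
AGE_GROUPS = [(0, "child"), (18, "adult"), (65, "senior")]


def age_group(age):
    """Map an age in years to an ordered category (ordinal scale)."""
    label = AGE_GROUPS[0][1]
    for lower_bound, name in AGE_GROUPS:
        if age >= lower_bound:
            label = name
    return label


groups = [age_group(age) for age in ages_years]

# On the ratio scale we can say that one person is twice as old as another.
twice_as_old = ages_years[3] / ages_years[2]

# On the ordinal scale we can only compare positions in the ordering.
rank = {name: position for position, (_, name) in enumerate(AGE_GROUPS)}
senior_after_adult = rank["senior"] > rank["adult"]

print(groups, twice_as_old, senior_after_adult)
```

After grouping, the information that a 70-year-old is twice as old as a 35-year-old is lost; only the ordering of the groups remains.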

Example of advantage and disadvantage

Ordinal measurement is normally used for surveys and questionnaires. Statistical analysis is applied to the responses once they are collected, to place the people who took the survey into the various categories. The data is then compared to draw inferences and conclusions about the whole surveyed population with regard to the specific variables. The advantage of using ordinal measurement is ease of collation and categorization. If you ask a survey question without providing a fixed set of response options, the answers are likely to be so diverse that they cannot be converted to statistics.

The same characteristics of ordinal measurement that create its advantages also create certain disadvantages. The responses are often so narrow in relation to the question that they create or magnify bias that is not factored into the survey. For example, on a question about satisfaction with the governor, people might be satisfied with his job performance but upset about a recent sex scandal. The survey question might lead respondents to state their dissatisfaction about the scandal, in spite of their satisfaction with his job performance, but the statistical conclusion will not differentiate between the two.

Statistics and geostatistics

https://petrowiki.org/Statistical_concepts

Sources:

https://en.wikipedia.org/wiki/Level_of_measurement

https://www.youtube.com/watch?v=KIBZUk39ncI

https://www.youtube.com/watch?v=eghn__C7JLQ

https://sciencing.com/advantages-disadvantages-using-ordinal-measurement-12043783.html