Data Warehouse Concepts

Data Warehouse – 

A data warehouse is a subject oriented integrated time variant and nonvolatile collection of data in support of management decision-making process.

Data Warehousing – 

Data warehousing is the process of constructing and using the data warehouse. The major processes involved are data cleaning, data integration, and data consolidation.

Data Mart –

A data mart is the subset of a data warehouse that’s of value to a specific group of users and catering to a specific subject area of the business.

Data Cube –

A data cube allows data to be modeled and viewed in multiple dimensions. it’s defined by dimensions and facts.

Dimensions-

Dimensions are the perspectives or entities with respect to which an organization wants to keep records. for example time, item, product, location etc are the dimensions of sales.

Facts – 

Facts are numeric measures. they are the quantities by which we want to analyze the relationship between dimensions. The quantity of sales, Sales Amount etc.

Start Schema –

The most common modeling paradigm for a data warehouse, in which the model contains a large central fact table and all dimension tables are directly linked to the fact table.

Snowflake Schema –

The snowflake schema is a variant of the star schema model where some dimension table is normalized, thereby splitting the data into additional dimension tables.

Fact Constellation –

A collection of stars in which multiple fact tables share same dimension tables.

Hierarchy-

A sequence of mappings from a set of low-level concepts to higher level more general concepts. Day >> Week >> month >> quarter >> Year

Roll-up-

The roll-up operation performs aggregation on a data cube, either by climbing up a dimension hierarchy or dimension reduction.

Drill Down-

The reverse of roll-up. it navigates from less detailed data to more detailed data.

Slice-

The slice operation performs a selection on one dimension of the given cube, resulting in a subcube.

Dice-

The dice operation defines a subcube by performing a selection on two or more dimensions.

Pivot-

Pivot is a visualization operation that rotates the data axes in views to provide an alternate data presentation.

Drill through- 

Navigating to the detailed data from the bottom levels.

 

 

 

 

Techinques for Filling Missing Values

There are many methods to fill missing values in a dataset while cleaning.

  1. Ignore the tuple
  2. Fill in the missing values manually – Time consuming not practical with large datasets
  3. Use a global constant to fill in the missing value – Ie replace all missing values with unknown or something similar
  4. Use a measure of central tendency for the attribute such as mean, median etc
  5. Use the most probable value to fill the missing value

Histograms and Scatter Plots

Histogram –  Histogram is a chart of poles. Plotting histogram is a graphical method for summarizing the distribution of a given attribute, the resulting distribution is most commonly known as bar chart.

Scatter Plot- Scatter plot is one of the most effective graphical methods for determining if there appears to be a relationship or pattern between two numeric attributes.

Positive Correlation – If the pattern of the points slopes from Lower left to Upper right.

/

Negative Correlation –  If the pattern of the points slopes from upper left to lower right.

\

Image result for histograms and scatter plots

Varience and Standard Deviation

Variance and Standard deviation are measures of data dispersion, they indicate how spread out a data distribution is.

A low standard deviation means that the data observations tend to be very close to the mean. while a high standard deviation indicates that the data is spread out over a large range of values.

Variance – 

Variance is the average of the squared differences from the mean,  it gives us a very general idea of the spread of our data. A value of zero means that there is no variability; All the numbers in the data set are the same.

Varience = 1/N * SUM( (xi – X) 2) ::: X = Mean, N= Number of items in the sample

Standard Deviation – 

The square root of the variance is the standard deviation. While variance gives us a rough idea of spread, the standard deviation is more concrete, giving you exact distances from the mean.

Standard Deviation = SQRT (varience)

Five Number Summary and Boxplots

If the distribution is skewed the mean, median and mode are really not very informative. it’s more informative to also provide quartiles also.

Five – number Summary – The five number summary of a distribution consists of the Median, the quatiles Q1 and Q3, and the smallest and largest observations written in the order of Minimum, Q1, Median, Q3 and Maximum.

Boxplots are a popular way of visualizing a distribution, specifically five number summary. A Boxplot incorporates the five number summary as follows.

  • The ends of the box are at the quartiles
  • The median is marked by a line within the box
  • Two lines outside the box extend the smallest and largest observations.

Image result for box plot example

Measuring Dispersion of Data

The dispersion measures the spread of numeric data. the measures include range, quartiles, quantiles, and percentiles.

Range –  The range of a set of values is the difference between the largest and smallest values in the set.

Quantiles – Quantiles are the points taken at regular intervals of a data distribution, dividing it into essentially equal size consecutive sets.

Quartiles – If the total set is divided into four equal parts then it’s known as quartiles.

Percentile – If we divide the set into 100 equal parts or quantiles then it’s known as a percentile.

 

Measuring Central Tendency of Data

Assume that we have a set of values or observations if we were to find where would most of the values fall? this gives us an idea of the central tendency of the data. There are various ways to measure the central tendency of data. Let’s explore Mean, Median and Mode.

Assume we have the salary data of N employees as below. 30,36,47,50,52,52,56,60,63,70,70,110.

Mean (Avg() in SQL)

Mean is the most common and effective numeric measure of the center of a set of data. let x1, x2,…xN be a set of N values or observations such as for some numeric attributes like salary the mean is X1+x2+….+xN/N.

Example:

Using the above salary data mean = 696/12 = 58.

Sometimes each value of the observation may be associated with a weight, which reflects the significance in such cases we do the weighted arithmetic mean.

Challenges with mean

  • Sensitivity to extreme values (some small number of extreme higher values or lower values can corrupt the mean.
  • In such cases, we may use trimmed mean, mean by removing the extreme values

Median

Median is a better measure of the center of data. which is the middle value in a set of ordered data values.  it’s the data value that separates the lower half from the upper half of data set.

If the number of values in a data set is odd then the median is the central value, if the number of values is even then two median exists. if the values are numeric then we could take the average of the two median values.

Example:

Using the above salary data, we have 12 values and the middle values are 52 and 56. Since the values are numeric we can find the median as 56+52/2 = 54.

Mode

The mode for a set of data is the value that occurs most frequently in the set. therefore it can be determined for quantitative and qualitative attributes. It’s possible that greatest frequency corresponds to multiple values in a dataset and hence more than one mode.

Example:

Using the above salary data, we have 52 and 70 repeating twice hence two modes exist, 52 and 72.

Midrange

Midrange is the average of the largest and smallest values in the set.

Example:

Using the above salary data, the max value is 110 and min value is 30 hence midrange = 30+110/2 = 70

In the real world, data is not always symmetric. they may be positively skewed or negatively skewed.

Positively Skewed – The mode occurs at a value that’s smaller than the median

Negatively Skewed – The mode occurs at a value that’s higher than the median