ROLAP vs MOLAP

ROLAP – 

These are the intermediate servers that stand in between a relational back-end server and client front-end tools. Uses relational or extended-relational DBMS to store and mange warehouse data.

MOLAP –

These servers support multidimensional data views through array-based multidimensional storage engines.

Data Warehouse Concepts

Data Warehouse – 

A data warehouse is a subject oriented integrated time variant and nonvolatile collection of data in support of management decision-making process.

Data Warehousing – 

Data warehousing is the process of constructing and using the data warehouse. The major processes involved are data cleaning, data integration, and data consolidation.

Data Mart –

A data mart is the subset of a data warehouse that’s of value to a specific group of users and catering to a specific subject area of the business.

Data Cube –

A data cube allows data to be modeled and viewed in multiple dimensions. it’s defined by dimensions and facts.

Dimensions-

Dimensions are the perspectives or entities with respect to which an organization wants to keep records. for example time, item, product, location etc are the dimensions of sales.

Facts – 

Facts are numeric measures. they are the quantities by which we want to analyze the relationship between dimensions. The quantity of sales, Sales Amount etc.

Start Schema –

The most common modeling paradigm for a data warehouse, in which the model contains a large central fact table and all dimension tables are directly linked to the fact table.

Snowflake Schema –

The snowflake schema is a variant of the star schema model where some dimension table is normalized, thereby splitting the data into additional dimension tables.

Fact Constellation –

A collection of stars in which multiple fact tables share same dimension tables.

Hierarchy-

A sequence of mappings from a set of low-level concepts to higher level more general concepts. Day >> Week >> month >> quarter >> Year

Roll-up-

The roll-up operation performs aggregation on a data cube, either by climbing up a dimension hierarchy or dimension reduction.

Drill Down-

The reverse of roll-up. it navigates from less detailed data to more detailed data.

Slice-

The slice operation performs a selection on one dimension of the given cube, resulting in a subcube.

Dice-

The dice operation defines a subcube by performing a selection on two or more dimensions.

Pivot-

Pivot is a visualization operation that rotates the data axes in views to provide an alternate data presentation.

Drill through- 

Navigating to the detailed data from the bottom levels.

 

 

 

 

Techinques for Filling Missing Values

There are many methods to fill missing values in a dataset while cleaning.

  1. Ignore the tuple
  2. Fill in the missing values manually – Time consuming not practical with large datasets
  3. Use a global constant to fill in the missing value – Ie replace all missing values with unknown or something similar
  4. Use a measure of central tendency for the attribute such as mean, median etc
  5. Use the most probable value to fill the missing value

Histograms and Scatter Plots

Histogram –  Histogram is a chart of poles. Plotting histogram is a graphical method for summarizing the distribution of a given attribute, the resulting distribution is most commonly known as bar chart.

Scatter Plot- Scatter plot is one of the most effective graphical methods for determining if there appears to be a relationship or pattern between two numeric attributes.

Positive Correlation – If the pattern of the points slopes from Lower left to Upper right.

/

Negative Correlation –  If the pattern of the points slopes from upper left to lower right.

\

Image result for histograms and scatter plots

Varience and Standard Deviation

Variance and Standard deviation are measures of data dispersion, they indicate how spread out a data distribution is.

A low standard deviation means that the data observations tend to be very close to the mean. while a high standard deviation indicates that the data is spread out over a large range of values.

Variance – 

Variance is the average of the squared differences from the mean,  it gives us a very general idea of the spread of our data. A value of zero means that there is no variability; All the numbers in the data set are the same.

Varience = 1/N * SUM( (xi – X) 2) ::: X = Mean, N= Number of items in the sample

Standard Deviation – 

The square root of the variance is the standard deviation. While variance gives us a rough idea of spread, the standard deviation is more concrete, giving you exact distances from the mean.

Standard Deviation = SQRT (varience)

Five Number Summary and Boxplots

If the distribution is skewed the mean, median and mode are really not very informative. it’s more informative to also provide quartiles also.

Five – number Summary – The five number summary of a distribution consists of the Median, the quatiles Q1 and Q3, and the smallest and largest observations written in the order of Minimum, Q1, Median, Q3 and Maximum.

Boxplots are a popular way of visualizing a distribution, specifically five number summary. A Boxplot incorporates the five number summary as follows.

  • The ends of the box are at the quartiles
  • The median is marked by a line within the box
  • Two lines outside the box extend the smallest and largest observations.

Image result for box plot example

Measuring Dispersion of Data

The dispersion measures the spread of numeric data. the measures include range, quartiles, quantiles, and percentiles.

Range –  The range of a set of values is the difference between the largest and smallest values in the set.

Quantiles – Quantiles are the points taken at regular intervals of a data distribution, dividing it into essentially equal size consecutive sets.

Quartiles – If the total set is divided into four equal parts then it’s known as quartiles.

Percentile – If we divide the set into 100 equal parts or quantiles then it’s known as a percentile.