Variance and Standard Deviation

Variance and standard deviation are measures of data dispersion. They indicate how spread out a data distribution is.

A low standard deviation means that the data observations tend to be very close to the mean, while a high standard deviation indicates that the data is spread out over a large range of values.

Variance – 

Variance is the average of the squared differences from the mean. It gives a general idea of the spread of the data. A variance of zero means that there is no variability: all the numbers in the data set are the same.

$\text{Variance} = \frac{1}{N} \sum_{i=1}^{N} (x_i - \bar{x})^2$, where $\bar{x}$ is the mean and $N$ is the number of items in the sample.

Standard Deviation – 

The standard deviation is the square root of the variance. While variance gives a rough idea of spread, the standard deviation is easier to interpret: it is expressed in the same units as the data and describes the typical distance of the observations from the mean.

$\text{Standard deviation} = \sqrt{\text{Variance}}$
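The two formulas above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the function names and the sample data are assumptions made for the example, and the code divides by N as in the formula above (some tools divide by N - 1 when estimating variance from a sample).

```python
import math


def population_variance(values):
    """Average of the squared differences from the mean (divides by N)."""
    n = len(values)
    mean = sum(values) / n
    return sum((x - mean) ** 2 for x in values) / n


def population_std(values):
    """Square root of the population variance."""
    return math.sqrt(population_variance(values))


# Illustrative data, not taken from the text above.
data = [2, 4, 4, 4, 5, 5, 7, 9]
print(population_variance(data))  # 4.0
print(population_std(data))       # 2.0

# A data set with identical values has no variability.
print(population_variance([3, 3, 3]))  # 0.0
```

Python's standard library offers `statistics.pvariance` and `statistics.pstdev` for the same divide-by-N definitions.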


Five Number Summary and Boxplots

If the distribution is skewed, the mean, median and mode alone are not very informative. It is more informative to provide the quartiles as well.

Five-number summary – The five-number summary of a distribution consists of the median, the quartiles Q1 and Q3, and the smallest and largest observations, written in the order Minimum, Q1, Median, Q3, Maximum.
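As a rough sketch, the five-number summary can be computed with Python's standard library. The function name and the sample data are assumptions made for this example. Quartile conventions differ between tools, so the choice of the "inclusive" method of `statistics.quantiles` is also an assumption; other methods can give slightly different Q1 and Q3 values.

```python
import statistics


def five_number_summary(values):
    """Return (minimum, Q1, median, Q3, maximum).

    Uses the 'inclusive' quartile method, which is one of several
    common conventions.
    """
    ordered = sorted(values)
    q1, median, q3 = statistics.quantiles(ordered, n=4, method="inclusive")
    return ordered[0], q1, median, q3, ordered[-1]


# Illustrative data, not taken from the text above.
data = [1, 2, 3, 4, 5, 6, 7, 8, 9]
print(five_number_summary(data))  # (1, 3.0, 5.0, 7.0, 9)
```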

Boxplots are a popular way of visualizing a distribution, specifically its five-number summary. A boxplot incorporates the five-number summary as follows:

  • The ends of the box are at the quartiles.
  • The median is marked by a line within the box.
  • Two lines (whiskers) outside the box extend to the smallest and largest observations.

[Figure: box plot example]

Measuring Dispersion of Data

Dispersion measures describe the spread of numeric data. They include the range, quantiles, quartiles and percentiles.

Range –  The range of a set of values is the difference between the largest and smallest values in the set.

Quantiles – Quantiles are points taken at regular intervals of a data distribution, dividing it into consecutive sets of essentially equal size.

Quartiles – The three quantiles that divide the distribution into four equal parts are known as quartiles.

Percentiles – The 99 quantiles that divide the distribution into 100 equal parts are known as percentiles.
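The measures above can be illustrated with a short Python sketch. The helper names and the data (the integers 1 to 101) are assumptions chosen so that the cut points are easy to check by hand, and the "inclusive" interpolation method is one convention among several.

```python
import statistics


def value_range(values):
    """Difference between the largest and smallest values."""
    return max(values) - min(values)


def cut_points(values, parts):
    """Return the (parts - 1) quantiles that split the data into `parts` groups."""
    return statistics.quantiles(values, n=parts, method="inclusive")


# Illustrative data: the integers 1..101.
data = list(range(1, 102))

quartiles = cut_points(data, 4)      # 3 cut points
percentiles = cut_points(data, 100)  # 99 cut points

print(value_range(data))   # 100
print(quartiles)           # [26.0, 51.0, 76.0]
print(len(percentiles))    # 99
print(percentiles[49])     # 51.0 (the 50th percentile, equal to the median)
```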

 

Measuring Central Tendency of Data

Suppose we have a set of values or observations. Where would most of the values fall? The answer gives us an idea of the central tendency of the data. There are various ways to measure central tendency. Let’s explore the mean, median, mode and midrange.

Assume we have the salary data of N = 12 employees, sorted in increasing order: 30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110.

Mean (Avg() in SQL)

The mean is the most common and effective numeric measure of the center of a set of data. Let x1, x2, …, xN be a set of N values or observations of some numeric attribute, such as salary. The mean is $\bar{x} = \frac{x_1 + x_2 + \dots + x_N}{N}$.

Example:

Using the above salary data mean = 696/12 = 58.
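The same calculation as a minimal Python sketch, using the salary data from this section (the function and variable names are assumptions for the example):

```python
salaries = [30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110]


def mean(values):
    """Arithmetic mean: the sum of the values divided by their count."""
    return sum(values) / len(values)


print(sum(salaries))   # 696
print(mean(salaries))  # 58.0
```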

Sometimes each value may be associated with a weight that reflects its significance or importance. In such cases we compute the weighted arithmetic mean.
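A weighted mean multiplies each value by its weight, sums the products, and divides by the sum of the weights. The sketch below is illustrative only: the function name, the values and the weights are hypothetical and do not come from the salary data.

```python
def weighted_mean(values, weights):
    """Sum of value * weight divided by the sum of the weights."""
    if len(values) != len(weights):
        raise ValueError("values and weights must have the same length")
    total_weight = sum(weights)
    if total_weight == 0:
        raise ValueError("weights must not sum to zero")
    return sum(v * w for v, w in zip(values, weights)) / total_weight


# Hypothetical values and weights: the middle value counts twice as much.
values = [50, 60, 70]
weights = [1, 2, 1]
print(weighted_mean(values, weights))  # 60.0
```

With equal weights, the weighted mean reduces to the ordinary mean.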

Challenges with mean

  • Sensitivity to extreme values: a small number of extremely high or low values can distort the mean.
  • In such cases, we may use the trimmed mean, which is the mean obtained after removing the extreme values.
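One simple way to trim is to drop a fixed number of values from each end of the sorted data. The sketch below assumes that approach (other definitions trim a percentage instead); the function name and the parameter `k` are hypothetical, and the data is the salary list from this section.

```python
salaries = [30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110]


def trimmed_mean(values, k):
    """Mean after dropping the k smallest and k largest values."""
    if k < 0 or 2 * k >= len(values):
        raise ValueError("k must leave at least one value")
    ordered = sorted(values)
    kept = ordered[k:len(ordered) - k]
    return sum(kept) / len(kept)


print(trimmed_mean(salaries, 0))  # 58.0 (no trimming, the ordinary mean)
print(trimmed_mean(salaries, 1))  # 55.6 (30 and 110 removed)
```

Removing the single outlier 110 (together with the minimum 30) pulls the result from 58 down to 55.6.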

Median

For skewed data, the median is a better measure of the center. It is the middle value in a set of ordered data values, that is, the value that separates the lower half of the data set from the upper half.

If the number of values in a data set is odd, the median is the central value. If the number of values is even, there are two middle values; if the values are numeric, we take the average of the two.

Example:

Using the above salary data, we have 12 values and the middle values are 52 and 56. Since the values are numeric, the median is (52+56)/2 = 54.
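The odd and even cases can be sketched as follows. The function name is an assumption for the example, and the data is the salary list from this section.

```python
salaries = [30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110]


def median(values):
    """Middle value of the sorted data; average of the two middle values if the count is even."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2


print(median(salaries))   # 54.0 (average of 52 and 56)
print(median([3, 1, 2]))  # 2 (odd count: the central value)
```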

Mode

The mode of a set of data is the value that occurs most frequently in the set. It can therefore be determined for both quantitative and qualitative attributes. The greatest frequency may correspond to several values in a data set, in which case there is more than one mode.

Example:

Using the above salary data, 52 and 70 each occur twice, so two modes exist: 52 and 70.
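A short sketch that returns every mode, so that multimodal data is handled. The function name is an assumption for the example. Because it only counts occurrences, it also works for qualitative values such as strings.

```python
from collections import Counter

salaries = [30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110]


def modes(values):
    """Return all values that occur with the highest frequency, sorted."""
    counts = Counter(values)
    highest = max(counts.values())
    return sorted(v for v, c in counts.items() if c == highest)


print(modes(salaries))                    # [52, 70]
print(modes(["black", "brown", "black"])) # ['black']
```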

Midrange

Midrange is the average of the largest and smallest values in the set.

Example:

Using the above salary data, the max value is 110 and the min value is 30, hence midrange = (30+110)/2 = 70.

In the real world, data is not always symmetric. It may be positively skewed or negatively skewed.
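For completeness, the midrange as a one-function Python sketch on the same salary data (the function name is an assumption for the example):

```python
salaries = [30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110]


def midrange(values):
    """Average of the smallest and largest values."""
    return (min(values) + max(values)) / 2


print(midrange(salaries))  # 70.0
```

Note the parentheses around the sum: without them, the division would apply only to the maximum.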

Positively Skewed – The mode occurs at a value that’s smaller than the median

Negatively Skewed – The mode occurs at a value that’s higher than the median

Attribute Types

Attribute

An attribute is a data field representing a characteristic or feature of a data object or an entity such as a customer or a store item. The nouns attribute, dimension, feature, and variable are often used interchangeably in the industry. The term dimension is common in the data warehousing world, feature in machine learning, variable in statistics, and attribute in data mining and databases.

Example

Name, address, and ID are attributes of the customer object.

1. Nominal Attribute –  The values of a nominal attribute are symbols or names of things. Each value represents some kind of category, code or state. Nominal attributes are also known as categorical.

Example – Hair color is a nominal attribute of a person object; possible values are black, brown, blond, red, etc.

2. Binary Attribute – A binary attribute is a nominal attribute with only two categories or states, 0 and 1.

Example – The smoker attribute of a person object is a binary attribute, where 1 indicates that the person smokes and 0 indicates a non-smoker.

Symmetric vs. asymmetric binary attributes

Symmetric – Both states are equally valuable and carry the same weight. Gender is a symmetric attribute with values male and female.

Asymmetric – The outcomes of the states are not equally important. The positive and negative outcomes of a medical test for HIV are an example.

3. Ordinal Attributes – Attributes with possible values that have a meaningful order or ranking among them, but where the magnitude between successive values is not known. Customer satisfaction is an example, with ordinal values very dissatisfied, somewhat dissatisfied, neutral, satisfied and very satisfied.

4. Numeric Attributes – Quantitative attributes, that is, measurable quantities represented as integer or real values. Numeric attributes are generally interval-scaled or ratio-scaled.

Interval-scaled – Measured on a scale of equal-size units. Temperature in Celsius or Fahrenheit is an example.

Ratio-scaled – Have an inherent zero point. Years of experience is an example.

5. Discrete Attributes – A discrete attribute has a finite or countably infinite set of values, which may or may not be represented as integers. The hair color and smoker attributes are examples.

6. Continuous Attributes – A continuous attribute can take any real value within a range. Weight is an example.
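One possible way to mirror these attribute types in code is sketched below in Python. The class names, field names and units are hypothetical and chosen only for illustration. A plain `Enum` models a nominal attribute (no ordering), an `IntEnum` models an ordinal attribute (ordered categories), a `bool` models a binary attribute, and `float` fields model numeric attributes.

```python
from dataclasses import dataclass
from enum import Enum, IntEnum


class HairColor(Enum):
    """Nominal: named categories with no meaningful order."""
    BLACK = "black"
    BROWN = "brown"
    BLOND = "blond"
    RED = "red"


class Satisfaction(IntEnum):
    """Ordinal: the order is meaningful, the gaps between levels are not.

    IntEnum also permits arithmetic, which has no meaning for ordinal data;
    only comparisons should be relied on here.
    """
    VERY_DISSATISFIED = 1
    SOMEWHAT_DISSATISFIED = 2
    NEUTRAL = 3
    SATISFIED = 4
    VERY_SATISFIED = 5


@dataclass
class Person:
    hair_color: HairColor       # nominal, discrete
    smoker: bool                # binary
    satisfaction: Satisfaction  # ordinal
    years_experience: float     # numeric, ratio-scaled
    weight_kg: float            # numeric, continuous (unit is an assumption)


person = Person(HairColor.BROWN, False, Satisfaction.SATISFIED, 7.5, 68.2)

# Ordinal values can be compared by rank.
print(person.satisfaction > Satisfaction.NEUTRAL)  # True

# Nominal values can only be tested for equality; ordering is not defined.
print(person.hair_color is HairColor.BROWN)  # True
```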