Suppose we have a set of values or observations. Where would most of the values fall? The answer gives us an idea of the central tendency of the data. There are various ways to measure central tendency. Let’s explore the mean, median, mode and midrange.
Assume we have the salary data of N = 12 employees, as below: 30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110.
Mean (Avg() in SQL)
The mean is the most common and effective numeric measure of the center of a set of data. Let x1, x2, …, xN be a set of N values or observations of some numeric attribute, such as salary. The mean is (x1 + x2 + … + xN) / N.
Using the above salary data, mean = 696 / 12 = 58.
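As a quick check of the arithmetic above, here is a minimal Python sketch. The variable names are only illustrative.

```python
# Salary data from the example above.
salaries = [30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110]

# Arithmetic mean: sum of the values divided by the number of values.
total = sum(salaries)
mean = total / len(salaries)

print(total, len(salaries), mean)  # 696 12 58.0
```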
Sometimes each value may be associated with a weight that reflects its significance. In such cases we compute the weighted arithmetic mean: (w1·x1 + w2·x2 + … + wN·xN) / (w1 + w2 + … + wN).
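A weighted mean can be sketched as below. The salary bands and weights here are hypothetical (they are not part of the salary data above); think of each weight as the number of employees earning that salary.

```python
def weighted_mean(values, weights):
    """Weighted arithmetic mean: sum(w * x) / sum(w)."""
    if len(values) != len(weights):
        raise ValueError("values and weights must have the same length")
    return sum(w * x for x, w in zip(values, weights)) / sum(weights)

# Hypothetical example: three salary bands and how many people are in each.
band_salaries = [30, 50, 110]
headcounts = [4, 4, 2]

print(weighted_mean(band_salaries, headcounts))  # 54.0
```

Note how the weighted mean (54) sits well below the plain mean of the three band values (about 63.3), because only two people earn the highest salary.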
Challenges with mean
- Sensitivity to extreme values: a small number of extremely high or extremely low values can distort the mean.
- In such cases, we may use the trimmed mean, which is the mean computed after removing the extreme values.
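One simple way to trim, sketched below, is to drop the k smallest and k largest values before averaging. The helper name `trimmed_mean` and the choice k = 1 are assumptions for illustration; other trimming rules (for example, dropping a fixed percentage at each end) are also common.

```python
def trimmed_mean(values, k):
    """Mean after dropping the k smallest and k largest values."""
    if k < 0 or 2 * k >= len(values):
        raise ValueError("k must leave at least one value")
    ordered = sorted(values)
    kept = ordered[k:len(ordered) - k]
    return sum(kept) / len(kept)

salaries = [30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110]

# Dropping 30 and 110 leaves ten values that sum to 556.
print(trimmed_mean(salaries, 1))  # 55.6
```

With the outlier 110 removed, the result (55.6) moves down from the plain mean of 58.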
For skewed data, the median is a better measure of the center. It is the middle value in a set of ordered data values: the value that separates the lower half of the data set from the upper half.
If the number of values in a data set is odd, the median is the central value. If the number of values is even, there are two middle values and the median is not unique. If the values are numeric, we can take the average of the two middle values.
Using the above salary data, we have 12 values, and the two middle values are 52 and 56. Since the values are numeric, the median is (52 + 56) / 2 = 54.
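The odd/even rule above can be sketched in Python as follows. The function name `median_of` is just illustrative; the standard library's `statistics.median` applies the same rule and is used here as a cross-check.

```python
import statistics

def median_of(values):
    """Middle value of the sorted data; average of the two middle values if n is even."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2

salaries = [30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110]

print(median_of(salaries))            # 54.0
print(statistics.median(salaries))    # 54.0
```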
The mode of a set of data is the value that occurs most frequently in the set. It can therefore be determined for both quantitative and qualitative attributes. The greatest frequency may correspond to several values in a dataset, in which case there is more than one mode.
Using the above salary data, 52 and 70 each occur twice, hence two modes exist: 52 and 70.
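To find every mode rather than just one, we can count occurrences and keep all values that reach the highest count. This sketch uses `collections.Counter`, with `statistics.multimode` (available from Python 3.8) as a cross-check. The qualitative example at the end is hypothetical.

```python
from collections import Counter
import statistics

salaries = [30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110]

# Count how often each value occurs, then keep the values with the top count.
counts = Counter(salaries)
top = max(counts.values())
modes = sorted(value for value, count in counts.items() if count == top)

print(modes)                           # [52, 70]
print(statistics.multimode(salaries))  # [52, 70]

# The mode also works for qualitative data (hypothetical department labels).
departments = ["sales", "hr", "sales", "it"]
print(statistics.multimode(departments))  # ['sales']
```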
Midrange is the average of the largest and smallest values in the set.
Using the above salary data, the max value is 110 and the min value is 30, hence midrange = (30 + 110) / 2 = 70.
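The midrange is a one-liner in Python; a minimal sketch on the same salary data:

```python
salaries = [30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110]

# Midrange: average of the smallest and largest values.
midrange = (min(salaries) + max(salaries)) / 2

print(midrange)  # 70.0
```

Because it depends only on the two extremes, the midrange (70) is pulled far above the mean (58) and the median (54) by the single value 110.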
In the real world, data is not always symmetric. It may be positively skewed or negatively skewed.
Positively Skewed – The mode occurs at a value that’s smaller than the median
Negatively Skewed – The mode occurs at a value that’s higher than the median
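The mode-versus-median rule above can be turned into a rough check, sketched below. The function name `skew_direction` and the two small datasets are hypothetical, and this is only a rule of thumb: when a dataset has several modes on both sides of the median, as the salary data does, the rule gives no clear answer.

```python
from statistics import median, multimode

def skew_direction(values):
    """Rough skew label from comparing the mode(s) with the median."""
    mid = median(values)
    modes = multimode(values)
    if all(m < mid for m in modes):
        return "positively skewed"
    if all(m > mid for m in modes):
        return "negatively skewed"
    return "symmetric or unclear"

# Hypothetical datasets with a long right tail and a long left tail.
right_tailed = [1, 2, 2, 2, 3, 4, 5, 9, 12]
left_tailed = [1, 4, 8, 9, 10, 11, 11, 11, 12]
salaries = [30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110]

print(skew_direction(right_tailed))  # positively skewed
print(skew_direction(left_tailed))   # negatively skewed
print(skew_direction(salaries))      # symmetric or unclear
```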