What’s MongoDB

MongoDB

MongoDB is a cross-platform, document-oriented NoSQL database that provides high performance, high availability, and easy scalability. MongoDB works on the concepts of collections and documents.

A MongoDB database is a set of collections. Each collection can have one or more documents. A record in MongoDB is a document, which is a data structure composed of field and value pairs. MongoDB documents are similar to JSON objects. The values of fields may include other documents, arrays, and arrays of documents.

{
  name: "Abdul",
  Age: 26
}
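As a rough illustration of how a document can nest other documents, arrays, and arrays of documents, here is a minimal Python sketch. The `address`, `phones`, and `orders` fields and all of their values are hypothetical and only extend the example above for illustration.

```python
import json

# Hypothetical customer document; only name and Age come from the example above.
customer = {
    "name": "Abdul",
    "Age": 26,
    "address": {"city": "Springfield", "country": "Exampleland"},  # embedded document
    "phones": ["555-0100", "555-0101"],                            # array
    "orders": [                                                    # array of documents
        {"item": "book", "qty": 2},
        {"item": "pen", "qty": 10},
    ],
}

# Serialize to JSON text, the representation that MongoDB documents resemble.
as_json = json.dumps(customer, indent=2)
print(as_json)

# Nested values are reached with ordinary indexing, with no join required.
first_order_item = customer["orders"][0]["item"]
```

Because the orders live inside the customer document, reading them needs no second lookup, which is the point behind the "embedded documents reduce joins" advantage.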

The advantages of using documents are:

  • Documents correspond to native data types in many programming languages.
  • Embedded documents and arrays reduce the need for expensive joins.
  • Dynamic schema supports fluent polymorphism.

Key Features

  • High performance
  • High availability
  • Easy scalability
  • NoSQL
  • Open source
  • Rich query support
  • Flexible schema or schemaless design
  • JSON support

Measuring Central Tendency of Data

Assume that we have a set of values or observations. Where would most of the values fall? The answer gives us an idea of the central tendency of the data. There are various ways to measure the central tendency of data. Let’s explore mean, median, mode and midrange.

Assume we have the salary data of N = 12 employees as below: 30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110.

Mean (Avg() in SQL)

Mean is the most common and effective numeric measure of the center of a set of data. Let x1, x2, …, xN be a set of N values or observations of some numeric attribute such as salary. The mean is (x1 + x2 + … + xN) / N.

Example:

Using the above salary data mean = 696/12 = 58.
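The same arithmetic can be checked with a short Python sketch. It uses the salary values listed above and the standard library's `statistics.mean`.

```python
from statistics import mean

# Salary data from the example above.
salaries = [30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110]

total = sum(salaries)          # x1 + x2 + ... + xN
n = len(salaries)              # N
manual_mean = total / n        # (x1 + x2 + ... + xN) / N

print(total, n, manual_mean)
print(mean(salaries))          # library result for comparison
```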

Sometimes each value of the observation is associated with a weight that reflects its significance. In such cases we compute the weighted arithmetic mean.
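A minimal sketch of the weighted arithmetic mean, assuming each value comes with a non-negative weight. The function name `weighted_mean` and the sample weights are hypothetical and chosen only for illustration.

```python
def weighted_mean(values, weights):
    """Return sum(w_i * x_i) / sum(w_i) for paired values and weights."""
    if len(values) != len(weights):
        raise ValueError("values and weights must have the same length")
    total_weight = sum(weights)
    if total_weight == 0:
        raise ValueError("weights must not sum to zero")
    return sum(v * w for v, w in zip(values, weights)) / total_weight

# Hypothetical weights: the middle value counts twice as much as the others.
example = weighted_mean([30, 50, 110], [1, 2, 1])
print(example)
```

With equal weights the result reduces to the ordinary mean.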

Challenges with mean

  • Sensitivity to extreme values: a small number of extremely high or low values can distort the mean.
  • In such cases, we may use the trimmed mean, which is the mean computed after removing the extreme values.
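One possible way to compute a trimmed mean is to drop the k smallest and k largest values before averaging. The sketch below assumes that convention; the function name `trimmed_mean` and the choice k = 1 are illustrative, not a fixed rule.

```python
def trimmed_mean(values, k):
    """Mean after dropping the k smallest and k largest values."""
    if k < 0 or 2 * k >= len(values):
        raise ValueError("k must leave at least one value")
    ordered = sorted(values)
    kept = ordered[k:len(ordered) - k]
    return sum(kept) / len(kept)

salaries = [30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110]

# Dropping 30 and 110 leaves ten values to average.
trimmed = trimmed_mean(salaries, 1)
print(trimmed)
```

Removing the single high value of 110 (and the low value of 30) pulls the result below the untrimmed mean of 58.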

Median

Median is a better measure of the center for skewed data, because it is not pulled by extreme values. It is the middle value in a set of ordered data values, that is, the value that separates the lower half of the data set from the upper half.

If the number of values in a data set is odd, the median is the central value. If the number of values is even, there are two middle values; if the values are numeric, we take the average of the two.

Example:

Using the above salary data, we have 12 values and the middle values are 52 and 56. Since the values are numeric, we can find the median as (52 + 56) / 2 = 54.
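The odd/even rule can be sketched in a few lines of Python. The helper name `median_of` is hypothetical; `statistics.median` from the standard library is shown for comparison.

```python
from statistics import median

def median_of(values):
    """Middle value of the ordered data; average of the two middle values if N is even."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2

salaries = [30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110]

salary_median = median_of(salaries)
print(salary_median, median(salaries))
```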

Mode

The mode for a set of data is the value that occurs most frequently in the set. It can therefore be determined for both quantitative and qualitative attributes. It’s possible that the greatest frequency corresponds to multiple values in a data set, in which case there is more than one mode.

Example:

Using the above salary data, 52 and 70 each occur twice, hence two modes exist: 52 and 70.
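In Python 3.8 and later, `statistics.multimode` returns every value that shares the highest frequency, which makes it a convenient way to check for multiple modes. The hair color list below is hypothetical and only illustrates that the mode also works for qualitative values.

```python
from statistics import multimode

salaries = [30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110]

# Every value tied for the highest frequency, in order of first appearance.
salary_modes = multimode(salaries)
print(salary_modes)

# The mode is also defined for qualitative (nominal) data.
hair_colors = ["black", "brown", "black", "red"]   # hypothetical sample
hair_modes = multimode(hair_colors)
print(hair_modes)
```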

Midrange

Midrange is the average of the largest and smallest values in the set.

Example:

Using the above salary data, the max value is 110 and the min value is 30, hence midrange = (30 + 110) / 2 = 70.
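A short Python check of the midrange, using the same salary values:

```python
salaries = [30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110]

# Average of the smallest and largest values.
midrange = (min(salaries) + max(salaries)) / 2
print(midrange)
```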

In the real world, data is not always symmetric. It may be positively skewed or negatively skewed.

Positively Skewed – The mode occurs at a value that’s smaller than the median

Negatively Skewed – The mode occurs at a value that’s higher than the median
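The mode-versus-median rule above can be turned into a rough heuristic in Python. This is only a sketch: it assumes the data has a single mode, the function name `skew_direction` is hypothetical, and both sample lists are made up for illustration.

```python
from statistics import median, multimode

def skew_direction(values):
    """Compare the (single) mode with the median, as described above."""
    modes = multimode(values)
    if len(modes) != 1:
        return "undetermined"          # the heuristic needs exactly one mode
    mode, mid = modes[0], median(values)
    if mode < mid:
        return "positive"
    if mode > mid:
        return "negative"
    return "no skew detected"

# Hypothetical samples: a long tail to the right, then a long tail to the left.
right_tailed = [1, 2, 2, 2, 3, 3, 4, 5, 9]
left_tailed = [1, 5, 6, 7, 7, 8, 8, 8, 9]

print(skew_direction(right_tailed), skew_direction(left_tailed))
```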

Attribute Types

Attribute

An attribute is a data field representing a characteristic or feature of a data object or entity, such as a customer or a store item. The nouns attribute, dimension, feature, and variable are often used interchangeably in the industry. The term dimension is common in the data warehousing world, feature in machine learning, variable in statistics, and attribute in data mining and databases.

Example

Name, address, and ID are attributes of the customer object.

1. Nominal Attribute – The values of a nominal attribute are symbols or names of things. Each value represents some kind of category, code or state. Nominal attributes are also known as categorical attributes.

Example – Hair color is a nominal attribute of a person object; possible values are black, brown, blond, red etc.

2. Binary Attribute – A binary attribute is a nominal attribute with only two categories or states, 0 and 1.

Example – The smoker attribute of a person object is a binary attribute, where 1 indicates that the person smokes and 0 indicates a non-smoker.

Symmetric vs Asymmetric Binary Attributes

Symmetric – Both states are equally valuable and carry the same weight. Gender is a symmetric attribute with values male and female.

Asymmetric – The outcomes of the states are not equally important. The positive and negative outcomes of a medical test for HIV are an example.

3. Ordinal Attributes – Attributes whose possible values have a meaningful order or ranking among them, but the magnitude between successive values is not known. Customer satisfaction is an example, with the ordinal values very dissatisfied, somewhat dissatisfied, neutral, satisfied and very satisfied.

4. Numeric Attributes – Quantitative attributes, that is, measurable quantities represented as integer or real values. Numeric attributes are either interval-scaled or ratio-scaled.

Interval-scaled – Measured on a scale of equal-size units. Temperature in Celsius or Fahrenheit is an example.

Ratio-scaled – Have an inherent zero point, so one value can be expressed as a multiple of another. Years of experience is an example.

5. Discrete Attributes – A discrete attribute has a finite or countably infinite set of values, which may or may not be represented as integers. The hair color and smoker attributes are examples.

6. Continuous Attributes – A continuous attribute can take any real value within a range. Weight is an example.
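To make the ordinal case concrete: because ordinal values have a ranking but no known distance between them, ordering and a middle value are meaningful while an arithmetic mean is not. The sketch below assumes the customer satisfaction scale from item 3; the `rank` helper and the sample responses are hypothetical.

```python
# Ordinal scale from the customer satisfaction example, lowest to highest.
SATISFACTION_ORDER = [
    "very dissatisfied",
    "somewhat dissatisfied",
    "neutral",
    "satisfied",
    "very satisfied",
]

def rank(level):
    """Position of a level on the scale; only the order is meaningful."""
    return SATISFACTION_ORDER.index(level)

# Hypothetical survey responses.
responses = ["neutral", "very satisfied", "somewhat dissatisfied", "satisfied", "neutral"]

# Sorting and picking the middle response relies only on the ranking.
ordered = sorted(responses, key=rank)
middle_response = ordered[len(ordered) // 2]
print(ordered)
print(middle_response)
```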

 

What’s Machine Learning

Machine Learning

The term machine learning refers to the automated detection of meaningful patterns in data. Machine learning investigates how computer programs can learn to recognize complex patterns from past data and make intelligent decisions based on them. That is, machines are taught using past experience to make intelligent decisions.

Example:
A program to automatically recognize handwritten postal codes on mail after learning from a set of examples

Types of Learning:

Supervised – In this method the learning comes from the labeled examples in the training data. Training data includes both the input and the desired results. Classification problems are an example.

Unsupervised – The model is not provided with the correct results during the training. Clustering problems are an example.
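The difference between the two types of learning can be sketched with a toy example in plain Python. Everything here is an illustrative assumption: the one-dimensional data, the labels, a 1-nearest-neighbour rule standing in for classification, and a simple two-cluster k-means standing in for clustering. The clustering helper assumes the data contains at least two distinct values.

```python
# Supervised: each training example pairs an input with the desired result.
training = [(1.0, "small"), (1.5, "small"), (2.0, "small"),
            (8.0, "large"), (9.0, "large"), (9.5, "large")]

def classify(x):
    """1-nearest-neighbour: return the label of the closest training input."""
    _, nearest_label = min(training, key=lambda pair: abs(pair[0] - x))
    return nearest_label

# Unsupervised: the same inputs, but with the labels withheld.
points = [x for x, _ in training]

def two_means(data, iterations=10):
    """Split 1-D data into two clusters by repeatedly updating two centroids."""
    c1, c2 = min(data), max(data)                  # initial centroids
    for _ in range(iterations):
        g1 = [x for x in data if abs(x - c1) <= abs(x - c2)]
        g2 = [x for x in data if abs(x - c1) > abs(x - c2)]
        c1, c2 = sum(g1) / len(g1), sum(g2) / len(g2)
    return sorted(g1), sorted(g2)

print(classify(1.2), classify(7.0))
low_cluster, high_cluster = two_means(points)
print(low_cluster, high_cluster)
```

The classifier can name its answer ("small" or "large") because it was given labels; the clustering step only discovers that the points fall into two groups.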

What’s Data Mining

Data Mining

Data mining is the process of discovering interesting patterns and knowledge from large amounts of data. In the industry, the term data mining is often used to refer to the whole knowledge discovery process. Many people treat data mining as a synonym for the knowledge discovery process, while many others view data mining as one step in that process.

The typical steps involved in the knowledge discovery process are as follows:

  1. Data Cleaning – Removing noise and inconsistent data
  2. Data Integration – Combine data from multiple data sources
  3. Data Selection – Filter and retrieve data that is relevant for the analysis task
  4. Data Transformation – Consolidate and transform the selected data to a form appropriate for data mining
  5. Data Mining – Apply intelligent methods to extract patterns and trends
  6. Pattern Evaluation – Identify truly interesting patterns of knowledge
  7. Knowledge Presentation – Present mined knowledge to users using visualization and presentation techniques.
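The seven steps above can be sketched end to end on a toy data set. This is a minimal illustration in Python, not a real mining system: the records, the field names, the income threshold of 50, and the "default rate per income band" pattern are all hypothetical.

```python
from collections import Counter

# Two hypothetical sources of customer records (all values are made up).
source_a = [{"id": 1, "age": 25, "income": 30, "defaulted": False},
            {"id": 2, "age": None, "income": 52, "defaulted": False}]  # noisy row
source_b = [{"id": 3, "age": 41, "income": 47, "defaulted": True},
            {"id": 4, "age": 38, "income": 36, "defaulted": True},
            {"id": 5, "age": 29, "income": 70, "defaulted": False}]

def clean(records):
    """1. Data cleaning: drop records with missing values."""
    return [r for r in records if all(v is not None for v in r.values())]

# 2. Data integration: combine the cleaned sources.
integrated = clean(source_a) + clean(source_b)

# 3. Data selection: keep only the fields relevant to the analysis.
selected = [{"income": r["income"], "defaulted": r["defaulted"]} for r in integrated]

# 4. Data transformation: bucket income into bands (threshold is arbitrary).
transformed = [("low" if r["income"] < 50 else "high", r["defaulted"]) for r in selected]

# 5. Data mining: count how often each (band, defaulted) combination occurs.
counts = Counter(transformed)

# 6. Pattern evaluation: default rate per income band.
default_rate = {
    band: counts[(band, True)] / (counts[(band, True)] + counts[(band, False)])
    for band in ("low", "high")
}

# 7. Knowledge presentation: report the result.
for band, rate in default_rate.items():
    print(f"{band} income: {rate:.0%} defaulted")
```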

Example of Data Mining:

Data mining systems can analyze customer data and predict the credit risk of new customers based on their income, age, and previous credit information.

 

Execution Plan Basics

An execution plan is the query optimizer’s output after it has evaluated a number of possible ways to execute a T-SQL request and chosen the most efficient one it found. It tells us how a query is going to be executed, or was executed, by SQL Server. It’s a series of steps, or physical and logical operations, that need to be performed to produce the required result and satisfy the T-SQL request.

The query optimizer generates the execution plan from the query processor tree (or query tree) produced by the query parser after parsing and algebrizing the T-SQL request, using the statistics it has about the data.

There are two distinct types of execution plans:
• Estimated Execution Plan
• Actual Execution Plan

Estimated Execution Plan: This represents the optimizer’s view of the plan. This is a logical plan, generated by the optimizer without actually executing the T-SQL request.

Actual Execution Plan: This represents the output from the actual query execution. It tells what actually happened when the T-SQL request was executed.

Execution Plan Formats

SQL Server execution plans can be viewed in three different formats:
• Graphical Plan
• Text Plan
• XML Plan

Graphical Plan

A quick and easy-to-read graphical form of the plan that uses icons. Both estimated and actual plans can be viewed in graphical format.

The graphical estimated execution plan can be viewed by using the shortcut key CTRL+L, from the Query menu as shown in the screenshot below, or by clicking the Display Estimated Execution Plan toolbar icon.

The graphical actual execution plan can be viewed by enabling it with the shortcut key CTRL+M, from the Query menu as shown in the screenshot below, or by clicking the Include Actual Execution Plan toolbar icon, and then executing the query.

Textual Plan

A bit harder to read; displays the plan as text.

The estimated execution plan can be viewed in text format with either of the SET options below:

1. SET SHOWPLAN_ALL ON

2. SET SHOWPLAN_TEXT ON

The actual execution plan can be viewed in text format with the SET option below:

1. SET STATISTICS PROFILE ON

XML Plan

Displays the complete set of data of a plan in XML format.

The estimated execution plan can be viewed in XML format with the SET option below:

1. SET SHOWPLAN_XML ON

The actual execution plan can be viewed in XML format with the SET option below:

1. SET STATISTICS XML ON
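To show how these SET options are typically wrapped around a query, here is a small Python sketch that only builds the T-SQL script text; it does not connect to SQL Server. The helper name `wrap_with_plan`, the dictionary keys, and the sample query against `dbo.Customers` are hypothetical. The `GO` separators are batch separators understood by tools such as SSMS and sqlcmd, and are used here because the SHOWPLAN options have to be the only statement in their batch.

```python
# SET options named in the text, keyed by (plan type, output format).
PLAN_OPTIONS = {
    ("estimated", "text"): "SHOWPLAN_TEXT",
    ("estimated", "text_detailed"): "SHOWPLAN_ALL",
    ("estimated", "xml"): "SHOWPLAN_XML",
    ("actual", "text"): "STATISTICS PROFILE",
    ("actual", "xml"): "STATISTICS XML",
}

def wrap_with_plan(query, plan_type, fmt):
    """Return a script that turns the option on, runs the query, and turns it off."""
    option = PLAN_OPTIONS[(plan_type, fmt)]
    return "\n".join([
        f"SET {option} ON;",
        "GO",                    # batch separator (SSMS / sqlcmd)
        query,
        "GO",
        f"SET {option} OFF;",
        "GO",
    ])

# Hypothetical query; the table and column names are placeholders.
script = wrap_with_plan("SELECT name FROM dbo.Customers WHERE Age > 26;", "estimated", "xml")
print(script)
```

With an estimated option switched on, SQL Server returns the plan instead of running the query; with the STATISTICS options, the query runs and the plan is returned alongside the results.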