Microsoft Data Mining Algorithms

The data mining algorithm is the mechanism that creates a data mining model. To create a model, an algorithm first analyzes a set of data and looks for specific patterns and trends. The algorithm uses the results of this analysis to define the parameters of the mining model. These parameters are then applied across the entire data set to extract actionable patterns and detailed statistics.

The mining model that an algorithm creates can take various forms, including:

  • A set of rules that describe how products are grouped together in a transaction.
  • A decision tree that predicts whether a particular customer will buy a product.
  • A mathematical model that forecasts sales.
  • A set of clusters that describe how the cases in a dataset are related.

Microsoft SQL Server Analysis Services provides several algorithms for use in your data mining solutions. These algorithms are a subset of all the algorithms that can be used for data mining.

Microsoft Association Algorithm

The Microsoft Association algorithm is an association algorithm provided by Analysis Services that is useful for recommendation engines. A recommendation engine recommends products to customers based on items they have already bought, or in which they have indicated an interest. The Microsoft Association algorithm is also useful for market basket analysis.
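As a rough illustration of market basket analysis, the sketch below counts how often pairs of items appear together and derives the support and confidence of each pairwise rule. It is a generic pair-counting approach, not the Analysis Services implementation; the function name `pair_rules`, the `min_support` threshold, and the sample baskets are all assumptions made for the example.

```python
from collections import Counter
from itertools import combinations

def pair_rules(transactions, min_support=0.5):
    """Return (antecedent, consequent, support, confidence) for item pairs.

    Hypothetical helper for illustration only.
    """
    n = len(transactions)
    item_counts = Counter()
    pair_counts = Counter()
    for basket in transactions:
        items = sorted(set(basket))          # ignore duplicates within a basket
        item_counts.update(items)
        pair_counts.update(combinations(items, 2))

    rules = []
    for (a, b), count in pair_counts.items():
        support = count / n                  # share of baskets containing both items
        if support < min_support:
            continue
        # Confidence of "a implies b" is P(b | a); likewise for the reverse rule.
        rules.append((a, b, support, count / item_counts[a]))
        rules.append((b, a, support, count / item_counts[b]))
    return rules

# Made-up transactions.
baskets = [
    ["bike", "helmet"],
    ["bike", "helmet", "gloves"],
    ["bike", "gloves"],
    ["helmet"],
]
rules = pair_rules(baskets)
for rule in rules:
    print(rule)
```

A recommendation engine could then suggest the consequent of any high-confidence rule whose antecedent is already in the customer's basket.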

Microsoft Clustering Algorithm

The Microsoft Clustering algorithm uses iterative techniques to group cases in a dataset into clusters that contain similar characteristics. These groupings are useful for exploring data, identifying anomalies in the data, and creating predictions. Clustering models identify relationships in a dataset that you might not logically derive through casual observation. For example, you can logically discern that people who commute to their jobs by bicycle do not typically live a long distance from where they work. The algorithm, however, can find other characteristics about bicycle commuters that are not as obvious.
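To make "iterative techniques" concrete, here is a minimal one-dimensional k-means sketch: it repeatedly assigns each case to its nearest center and then moves each center to the mean of its group. k-means is used here only as one well-known iterative clustering method; the function names, the starting centers, and the commute distances are assumptions for illustration.

```python
def assign(values, centers):
    """Group each value with its nearest center."""
    groups = [[] for _ in centers]
    for v in values:
        nearest = min(range(len(centers)), key=lambda i: abs(v - centers[i]))
        groups[nearest].append(v)
    return groups

def kmeans_1d(values, centers, iterations=10):
    """Hypothetical helper: plain k-means on a single numeric attribute."""
    centers = list(centers)
    for _ in range(iterations):
        groups = assign(values, centers)
        # Move each center to the mean of its group; keep it if the group is empty.
        centers = [sum(g) / len(g) if g else c for g, c in zip(groups, centers)]
    return centers, assign(values, centers)

# Made-up commute distances in kilometers.
commute_km = [1, 2, 3, 20, 22, 25]
centers, groups = kmeans_1d(commute_km, centers=[0, 10])
print(centers, groups)
```

With more attributes per case, the same loop would use a multi-dimensional distance in place of the absolute difference.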


Microsoft Decision Trees Algorithm

For discrete attributes, the Microsoft Decision Trees algorithm makes predictions based on the relationships between input columns in a dataset. It uses the values, known as states, of those columns to predict the states of a column that you designate as predictable. Specifically, the algorithm identifies the input columns that are correlated with the predictable column. For example, in a scenario to predict which customers are likely to purchase a bicycle, if nine out of ten younger customers buy a bicycle, but only two out of ten older customers do so, the algorithm infers that age is a good predictor of bicycle purchase. The decision tree makes predictions based on this tendency toward a particular outcome. For continuous attributes, the algorithm uses linear regression to determine where a decision tree splits. If more than one column is set to predictable, or if the input data contains a nested table that is set to predictable, the algorithm builds a separate decision tree for each predictable column.
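The bicycle example can be checked numerically. The sketch below scores an input column by entropy-based information gain, a common way to measure how well a column separates buyers from non-buyers. The text does not say which score Analysis Services uses, so treat information gain, and the helper names, as assumptions; the second column's counts are invented for comparison.

```python
from math import log2

def entropy(positives, total):
    """Binary entropy, in bits, of a group with `positives` buyers out of `total`."""
    if positives in (0, total):
        return 0.0
    p = positives / total
    return -(p * log2(p) + (1 - p) * log2(1 - p))

def information_gain(groups):
    """groups: one (buyers, customers) pair per state of the input column."""
    total = sum(n for _, n in groups)
    buyers = sum(b for b, _ in groups)
    before = entropy(buyers, total)
    # Weighted entropy after splitting the cases by the column's states.
    after = sum(n / total * entropy(b, n) for b, n in groups)
    return before - after

# From the text: 9 of 10 younger and 2 of 10 older customers buy a bicycle.
age_gain = information_gain([(9, 10), (2, 10)])
# Invented weak column for comparison: 5 of 10 versus 6 of 10.
weak_gain = information_gain([(5, 10), (6, 10)])
print(age_gain, weak_gain)
```

The age column yields a much larger gain than the weak column, which is the sense in which age is a good predictor of bicycle purchase.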

Microsoft Linear Regression Algorithm

The Microsoft Linear Regression algorithm is a variation of the Microsoft Decision Trees algorithm that helps you calculate a linear relationship between a dependent and independent variable, and then use that relationship for prediction. The relationship takes the form of an equation for a line that best represents a series of data.
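The "equation for a line that best represents a series of data" can be sketched with an ordinary least-squares fit. This is the textbook closed-form solution for one independent variable, shown under the assumption that it stands in for whatever fitting procedure the algorithm uses internally; `fit_line` and the sample points are hypothetical.

```python
def fit_line(xs, ys):
    """Return (slope, intercept) of the least-squares line through the points."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Made-up data that lies exactly on y = 2x + 1.
slope, intercept = fit_line([1, 2, 3, 4], [3, 5, 7, 9])

# Use the fitted relationship for prediction at a new x value.
prediction = slope * 5 + intercept
print(slope, intercept, prediction)
```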


Microsoft Logistic Regression Algorithm

The Microsoft Logistic Regression algorithm is a variation of the Microsoft Neural Network algorithm. Logistic regression is a well-known statistical technique that is used for modeling binary outcomes, such as a yes/no outcome. Logistic regression is highly flexible, taking any kind of input, and supports several different analytical tasks:

  • Use demographics to make predictions about outcomes, such as risk for a certain disease.
  • Explore and weight the factors that contribute to a result. For example, find the factors that influence customers to make a repeat visit to a store.
  • Classify documents, e-mail, or other objects that have many attributes.
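As a small illustration of modeling a yes/no outcome, the sketch below fits a one-variable logistic regression by gradient descent and then estimates the probability of a repeat visit. The training routine, learning rate, epoch count, and data are assumptions chosen for the example and do not describe how Analysis Services trains its models.

```python
from math import exp

def sigmoid(z):
    """Map any real number to a probability between 0 and 1."""
    return 1.0 / (1.0 + exp(-z))

def train_logistic(xs, ys, rate=0.1, epochs=5000):
    """Hypothetical helper: batch gradient descent on the log loss."""
    w, b = 0.0, 0.0
    for _ in range(epochs):
        grad_w = grad_b = 0.0
        for x, y in zip(xs, ys):
            error = sigmoid(w * x + b) - y   # predicted probability minus label
            grad_w += error * x
            grad_b += error
        w -= rate * grad_w / len(xs)
        b -= rate * grad_b / len(xs)
    return w, b

# Made-up data: number of prior purchases, and whether the customer returned (1) or not (0).
purchases = [1, 2, 3, 4, 5, 6]
returned = [0, 0, 0, 1, 1, 1]

w, b = train_logistic(purchases, returned)
p_low = sigmoid(w * 1 + b)    # estimated probability of a repeat visit after 1 purchase
p_high = sigmoid(w * 6 + b)   # estimated probability after 6 purchases
print(p_low, p_high)
```

The sign and size of the learned weight `w` indicate how strongly the input pushes the outcome toward "yes", which is the sense in which the model can weight contributing factors.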

Microsoft Naive Bayes Algorithm

The Microsoft Naive Bayes algorithm is a classification algorithm provided by Microsoft SQL Server Analysis Services for use in predictive modeling. The name Naive Bayes derives from the fact that the algorithm uses Bayes' theorem but does not take into account dependencies that may exist among the input attributes, and therefore its assumptions are said to be naive. This algorithm is less computationally intense than other Microsoft algorithms, and therefore is useful for quickly generating mining models to discover relationships between input columns and predictable columns. You can use this algorithm to do initial explorations of data, and then later you can apply the results to create additional mining models with other algorithms that are more computationally intense and more accurate.
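The naive assumption means each input column contributes its own conditional probability, and the contributions are simply multiplied. The sketch below shows this for discrete columns, with add-one (Laplace) smoothing so unseen values do not zero out a score. The function names, the smoothing choice, and the customer data are assumptions for illustration.

```python
from collections import Counter, defaultdict

def train_naive_bayes(rows, labels):
    """Count how often each column value occurs with each label."""
    label_counts = Counter(labels)
    value_counts = defaultdict(Counter)   # (column index, label) -> value counts
    values_seen = defaultdict(set)        # column index -> distinct values
    for row, label in zip(rows, labels):
        for col, value in enumerate(row):
            value_counts[(col, label)][value] += 1
            values_seen[col].add(value)
    return label_counts, value_counts, values_seen

def predict(model, row):
    """Return a normalized probability for each label."""
    label_counts, value_counts, values_seen = model
    total = sum(label_counts.values())
    scores = {}
    for label, count in label_counts.items():
        score = count / total             # prior probability of the label
        for col, value in enumerate(row):
            # Columns are treated as independent given the label (the naive step).
            numerator = value_counts[(col, label)][value] + 1
            denominator = count + len(values_seen[col])
            score *= numerator / denominator
        scores[label] = score
    norm = sum(scores.values())
    return {label: s / norm for label, s in scores.items()}

# Made-up cases: (age group, commute distance) and whether the customer bought a bicycle.
rows = [("young", "short"), ("young", "long"), ("young", "short"),
        ("old", "long"), ("old", "long"), ("old", "short")]
labels = ["buy", "buy", "buy", "no", "no", "buy"]

model = train_naive_bayes(rows, labels)
result = predict(model, ("young", "short"))
print(result)
```

Training is a single counting pass over the data, which is why this kind of model is cheap to build for initial exploration.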

Microsoft Neural Network Algorithm

In SQL Server Analysis Services, the Microsoft Neural Network algorithm combines each possible state of the input attribute with each possible state of the predictable attribute, and uses the training data to calculate probabilities. You can later use these probabilities for classification or regression, and to predict an outcome of the predicted attribute, based on the input attributes. A mining model that is constructed with the Microsoft Neural Network algorithm can contain multiple networks, depending on the number of columns that are used for both input and prediction, or that are used only for prediction. The number of networks that a single mining model contains depends on the number of states that are contained by the input columns and predictable columns that the mining model uses.
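To show how a trained network turns input states into probabilities for each state of the predictable attribute, here is a minimal forward pass through one hidden layer. The layer sizes, activation functions, and weights are invented for illustration; a real model would learn its weights from the training data, and this generic layout is not claimed to match the internal structure used by Analysis Services.

```python
from math import exp, tanh

def forward(inputs, hidden_weights, output_weights):
    """Each weight row holds one weight per incoming value, followed by a bias."""
    hidden = [
        tanh(sum(w * x for w, x in zip(row, inputs)) + row[-1])
        for row in hidden_weights
    ]
    scores = [
        sum(w * h for w, h in zip(row, hidden)) + row[-1]
        for row in output_weights
    ]
    # Softmax converts the raw scores into probabilities that sum to 1.
    peak = max(scores)
    exps = [exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Input states are one-hot encoded: [age is young, age is old].
# All weights below are hypothetical, not learned.
hidden_weights = [[1.5, -1.5, 0.0], [-1.0, 1.0, 0.0]]
output_weights = [[2.0, -2.0, 0.0], [-2.0, 2.0, 0.0]]
states = ["buys", "does not buy"]

probabilities = forward([1.0, 0.0], hidden_weights, output_weights)
print(dict(zip(states, probabilities)))
```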

Microsoft Sequence Clustering Algorithm

The Microsoft Sequence Clustering algorithm is a sequence analysis algorithm provided by Microsoft SQL Server Analysis Services. You can use this algorithm to explore data that contains events that can be linked by following paths, or sequences. The algorithm finds the most common sequences by grouping, or clustering, sequences that are identical. The following are some examples of sequences:

  • Data that describes the click paths that are created when users navigate or browse a Web site.
  • Data that describes the order in which a customer adds items to a shopping cart at an online retailer.

This algorithm is similar in many ways to the Microsoft Clustering algorithm. However, instead of finding clusters of cases that contain similar attributes, the Microsoft Sequence Clustering algorithm finds clusters of cases that contain similar paths in a sequence.
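A simple way to explore sequence data of this kind is to count identical paths and to estimate how often one step follows another. The sketch below does both for a handful of click paths. It only illustrates the idea of grouping by path; the helper names and the sample paths are assumptions, and the actual clustering method used by the algorithm is not described here.

```python
from collections import Counter

def transition_probabilities(paths):
    """Estimate P(next page | current page) from observed paths."""
    counts = Counter()
    totals = Counter()
    for path in paths:
        for current, nxt in zip(path, path[1:]):
            counts[(current, nxt)] += 1
            totals[current] += 1
    return {pair: c / totals[pair[0]] for pair, c in counts.items()}

def most_common_paths(paths, top=2):
    """Group identical sequences and return the most frequent ones."""
    return Counter(tuple(p) for p in paths).most_common(top)

# Made-up click paths through a Web site.
paths = [
    ["home", "products", "cart"],
    ["home", "products", "cart"],
    ["home", "search", "products"],
    ["home", "products", "reviews"],
]

transitions = transition_probabilities(paths)
common = most_common_paths(paths, top=1)
print(transitions)
print(common)
```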

Microsoft Time Series Algorithm

The Microsoft Time Series algorithm provides regression algorithms that are optimized for the forecasting of continuous values, such as product sales, over time. Whereas other Microsoft algorithms, such as decision trees, require additional columns of new information as input to predict a trend, a time series model does not. A time series model can predict trends based only on the original dataset that is used to create the model. You can also add new data to the model when you make a prediction and automatically incorporate the new data in the trend analysis.

The following diagram shows a typical model for forecasting sales of a product in four different sales regions over time. Sales for each region are plotted as red, yellow, purple, and blue lines. The line for each region has two parts:

  • Historical information appears to the left of the vertical line and represents the data that the algorithm uses to create the model.
  • Predicted information appears to the right of the vertical line and represents the forecast that the model makes.

The combination of the source data and the prediction data is called a series.

An important feature of the Microsoft Time Series algorithm is that it can perform cross prediction. If you train the algorithm with two separate, but related, series, you can use the resulting model to predict the outcome of one series based on the behavior of the other series. For example, the observed sales of one product can influence the forecasted sales of another product. Cross prediction is also useful for creating a general model that can be applied to multiple series. For example, suppose the predictions for a particular region are unstable because the series lacks good-quality data. You could train a general model on an average of all four regions, and then apply the model to the individual series to create more stable predictions for each region.
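The general-model idea can be sketched with a very simple forecaster. Below, a first-order autoregression (each value predicted from the previous one) is fitted to the average of four regional series and then applied to an individual region. The autoregression is a stand-in chosen for brevity, and the function names and sales figures are assumptions; none of this is presented as the algorithm's actual method.

```python
def fit_ar1(series):
    """Least-squares fit of x[t] = a * x[t-1] + c. Returns (a, c)."""
    prev, curr = series[:-1], series[1:]
    n = len(prev)
    mean_p = sum(prev) / n
    mean_c = sum(curr) / n
    sxx = sum((p - mean_p) ** 2 for p in prev)
    sxy = sum((p - mean_p) * (c - mean_c) for p, c in zip(prev, curr))
    a = sxy / sxx
    return a, mean_c - a * mean_p

def forecast(series, model, steps):
    """Extend a series by repeatedly applying the fitted relationship."""
    a, c = model
    values = []
    last = series[-1]
    for _ in range(steps):
        last = a * last + c
        values.append(last)
    return values

# Made-up historical sales for four regions.
regions = {
    "north": [10, 12, 14, 16],
    "south": [20, 22, 24, 26],
    "east": [30, 32, 34, 36],
    "west": [40, 42, 44, 46],
}

# Train one general model on the average of all regions...
average = [sum(vals) / len(regions) for vals in zip(*regions.values())]
general = fit_ar1(average)

# ...then apply it to an individual region's history.
north_forecast = forecast(regions["north"], general, steps=2)
print(general, north_forecast)
```

In the diagram's terms, the input list is the historical part of a series and the returned list is the predicted part.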

Figure: Microsoft Time Series

References

SQL Server Books
