Predictive Analytics: Analyst/X

Visualization empowers the analyst to discover patterns and anomalies in data, by noticing unexpected relationships or by actively searching. Predictive analytics (sometimes called “data mining”) provides a powerful adjunct to this: algorithms are used to find relationships in data, and these relationships can be used with new data to predict values.

With the predictive analytics in ADVIZOR/X you can:

Build a model of your data that describes what fields in a table influence the value of a target field.
Evaluate the quality of models you build using quality metrics.
Examine the model to understand the relationships between the target field and the explanatory fields.
Use the model with new data to predict values.

You can also model your ADVIZOR selection state to get a concise description of that set of selected items.

Start with a Question

You begin the process of modeling with a business question. The question must be about the relationships between data fields in a single table. The business question must be in terms of the values of a single field in the data being analyzed, the target field.

The target field must be either a numeric field or an integer field with exactly two values, "0" and "1". For example, if you have customer sales data and you want to understand the characteristics of highly profitable customers, then your data table must contain a field with customer profitability; this will be the target in your model.

The target field can be an existing field in a table, or you can create a model on the current selection state from interacting with charts in ADVIZOR Analyst. A new field (with values "0" and "1") can be created from the current selection state, which can then be used as the model target.
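As an illustration only, the idea of turning a selection state into a "0"/"1" target field can be sketched in Python. ADVIZOR Analyst creates this field for you; the function name and the row IDs below are hypothetical and are not part of the product.

```python
def selection_to_target(row_ids, selected_ids):
    """Return a 0/1 value per row: 1 if the row is selected, else 0.

    Hypothetical helper for illustration; not an ADVIZOR API.
    """
    selected = set(selected_ids)  # set lookup keeps the membership test fast
    return [1 if row_id in selected else 0 for row_id in row_ids]


# Four rows in the table, two of them currently selected in a chart.
target = selection_to_target([101, 102, 103, 104], [102, 104])
print(target)  # [0, 1, 0, 1]
```

The resulting list plays the role of the new field: selected rows are "in" (1) and all other rows are "out" (0).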

The model you create gives the relationship of the single target field with all of the other fields in the same table, the "explanatory fields". The process therefore begins as follows:

Start with a question.
Pick the table in your project that contains data relevant to the question.
Pick a single field that answers the question; this is the target field. The other fields in the table are "explanatory fields" that are used to explain the value of the target field.
Build a model using ADVIZOR Analyst that describes the relationship between the target field and explanatory fields.

There are two types of models that may be built:

Predict a numeric value, or
Classify data into two classes, where each case in your data is "in" or "out" (has a target field with values of "0" or "1").

Models

Predictive analytics creates a mathematical model. This model describes the relationship between the target field and the explanatory fields in a single data table. Because the model describes this relationship, there must be a value for the target field in every row of the data table. You must also have enough data to build a valid model that is both relevant and robust. For example, a model generated from a data set of 50 rows may not do well when applied to different data, and it may not do a good job of predicting the target values.

The models created by Analyst/X are regression models: mathematical polynomial functions that relate the descriptive attributes (model inputs) to a target attribute (model output). The returned models are expressed as a first-degree polynomial expression of the inputs. A polynomial of degree 1 is of the form:

f(X1, X2, ..., Xn) = w0 + w1.X1 + w2.X2 + ... + wn.Xn

where the “w”s are weights and the “X”s are fields. Although higher-degree polynomials could also be used to define this relationship, in the large majority of cases a first-degree polynomial is sufficient to generate a relevant and robust model. ADVIZOR/X currently supports only first-degree polynomials.
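A minimal Python sketch of how such a first-degree polynomial is evaluated for one row follows. The field names, weights, and intercept are invented for illustration; a real model's weights come from the modeling step, and this is not the code ADVIZOR/X runs.

```python
def predict_linear(weights, intercept, row):
    """Evaluate f(X1..Xn) = w0 + w1*X1 + ... + wn*Xn for a single row.

    weights:   dict mapping field name -> weight (w1..wn)
    intercept: the constant term w0
    row:       dict mapping field name -> numeric field value (X1..Xn)
    """
    return intercept + sum(weights[name] * row[name] for name in weights)


# Hypothetical model for customer profitability.
weights = {"orders_per_year": 120.0, "avg_discount_pct": -15.0}
intercept = 500.0

row = {"orders_per_year": 10, "avg_discount_pct": 20}
predicted = predict_linear(weights, intercept, row)
print(predicted)  # 1400.0
```

Each weight shows how much the prediction changes when that field increases by one unit while the other fields stay fixed.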

For classification ("0" or "1") models, a slightly different approach called "logistic regression" is used. The result is still a polynomial equation, but the prediction is a "score": the predicted probability that the case (row) falls into the "1" category. This score is used to predict the result; it may also be used to group cases into categories based on how likely they are to fall into the "1" category.
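The following Python sketch shows the standard logistic regression form, in which the first-degree polynomial is passed through the logistic function to give a score between 0 and 1. This is the textbook formulation, assumed here for illustration; the exact computation inside ADVIZOR/X is not described in this document. The field name, weights, and the 0.5 threshold are hypothetical.

```python
import math


def logistic_score(weights, intercept, row):
    """Return the predicted probability that a row falls into the "1" category."""
    linear = intercept + sum(weights[name] * row[name] for name in weights)
    return 1.0 / (1.0 + math.exp(-linear))  # logistic function maps to (0, 1)


def classify(score, threshold=0.5):
    """Turn a score into a 0/1 prediction; the threshold is an assumed choice."""
    return 1 if score >= threshold else 0


# Hypothetical one-field model: the linear part is -1.0 + 2.0 * 1.5 = 2.0.
score = logistic_score({"visits": 2.0}, -1.0, {"visits": 1.5})
print(round(score, 3), classify(score))
```

Because the score is a probability, cases can also be sorted or binned by score to group them by how likely they are to be in the "1" category.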

Condition data

A model is the relationship between many explanatory fields and one target field in one table. There are constraints on which fields are usable as explanatory fields. There may also be data in other tables that you want to include in your model. Data may need to be conditioned before a model is built; this is described as part of the Analytics Process description.

Evaluate the Quality Indicator

Every model must be evaluated for adequacy before it is used. The quality metric, also called the information indicator, is a number between 0.0 and 1.0 that gives the quality of the model. Models can be compared with each other based on this indicator.

For ordinary regression models, the information indicator is the "coefficient of determination" (often called R² or "R squared"). This corresponds to the proportion of the information (variability) in the target field that the explanatory fields are able to explain. For example, a model with an information indicator of “0.79” explains 79% of the information contained in the target field using the explanatory fields defined.
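The coefficient of determination can be computed from actual and predicted target values with the usual formula, R² = 1 - SS_res / SS_tot. The Python sketch below uses made-up numbers and is meant only to show what the indicator measures, not how ADVIZOR/X computes it internally.

```python
def r_squared(actual, predicted):
    """Coefficient of determination: 1 - (residual SS / total SS)."""
    mean = sum(actual) / len(actual)
    ss_tot = sum((y - mean) ** 2 for y in actual)                 # total variability
    ss_res = sum((y - p) ** 2 for y, p in zip(actual, predicted))  # unexplained part
    if ss_tot == 0:
        raise ValueError("target field is constant; R squared is undefined")
    return 1.0 - ss_res / ss_tot


actual = [1.0, 2.0, 3.0, 4.0]
predicted = [1.5, 2.0, 3.0, 3.5]
print(r_squared(actual, predicted))
```

Predictions that match the target exactly give 1.0, while always predicting the mean of the target gives 0.0.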

For classification models, the R² statistic is not appropriate. A "Percent Concordant" metric is used instead. This does NOT give the amount of variability in the target explained by the model, but it may be used to compare the quality of different models for the same target.
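This document does not define how ADVIZOR/X computes or scales Percent Concordant. The Python sketch below uses a common textbook definition, stated here as an assumption: among all pairs made of one actual "1" case and one actual "0" case, the fraction in which the "1" case received the higher score. Under this unscaled definition, random scores give roughly 0.5, so the product's reported value may be scaled differently.

```python
def percent_concordant(actual, scores):
    """Fraction of (1-case, 0-case) pairs where the 1-case has the higher score.

    Textbook definition, assumed for illustration; ties count as not concordant.
    """
    ones = [s for y, s in zip(actual, scores) if y == 1]
    zeros = [s for y, s in zip(actual, scores) if y == 0]
    pairs = len(ones) * len(zeros)
    if pairs == 0:
        raise ValueError("need at least one '1' case and one '0' case")
    concordant = sum(1 for s1 in ones for s0 in zeros if s1 > s0)
    return concordant / pairs


actual = [1, 1, 0, 0]
scores = [0.9, 0.4, 0.5, 0.1]
print(percent_concordant(actual, scores))  # 0.75
```

Here three of the four pairs are ordered correctly; only the pair (0.4, 0.5) ranks a "0" case above a "1" case.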

A perfect model would have an indicator of “1”; a random model has an indicator of “0”. A model with an indicator greater than or equal to 0.95 has excellent predictive power, but any score above 0 indicates some predictive power, better than random results. One way to try to improve the information indicator of a model is to add new, relevant fields to the data table.

Interpret the Model

After a model has been determined to be adequate based on its quality indicators, it can be used to understand the relationships within the data. The major relationship described by the model is the contribution of each explanatory field to predicting the target: how much of the variability of the target is explained by each explanatory field. This information is shown in two pages that are added to your project.

Apply the Model to New Data

Modeling produces an equation that may be saved with the project as an Expression Builder expression. The expression is evaluated whenever the project is loaded, so the model is applied each time your project is regenerated with new data.

What's Next?

Read the detailed process for using Analyst/X predictive analytics.
Understand the Predictive Modeling pane, the user interface to predictive analytics.
Use a model to predict values in new data.
Read techniques for modeling Zip Codes.