Predictive Modelling: Popular open source and commercial tools

Analytics | Published August 11, 2014

Predictive modeling is the process of creating a statistical model to predict future behavior. It is the area of data mining concerned with forecasting probabilities and trends. A predictive model is made up of predictors: factors that influence future results. For example, a retail shop’s model might use a customer’s gender, age and purchase history to predict future sales.

Key features: A predictive model is described by three key features:

1. The predicted outcome

2. Predictors

3. How the predictors are combined to produce the outcome

Scope: The need for predictive models is on the rise. Initially, predictive analytics was used mainly in spam filtering systems. Now the scenario has changed: predictive models have become a vital part of CRM, change management, disaster recovery, security management, and meteorology.

Predictive analytics supports decision making by diagnosing the state of the business. It helps a firm eliminate time-consuming processes. It can be applied in almost every field, with marketing and pricing using it predominantly.

Developing a model: The first step in developing a predictive model is selecting relevant candidate predictor variables for possible inclusion in the model. A limited number of variables must be chosen from a vast list, which can introduce bias into the selection process. Inappropriate selection of variables is an important and common cause of poor model performance. Another major issue in developing a predictive model is dealing with missing data.
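Missing data is typically handled before any model fitting. A minimal sketch of one common strategy, mean imputation, using scikit-learn and a hypothetical toy predictor matrix:

```python
# Sketch: filling in missing predictor values before model fitting.
# The toy data below is an assumption for illustration only.
import numpy as np
from sklearn.impute import SimpleImputer

# Predictor matrix (e.g. age, purchase flag) with missing entries as np.nan.
X = np.array([[25.0, 1.0],
              [np.nan, 0.0],
              [40.0, np.nan],
              [35.0, 1.0]])

# Replace each missing value with the mean of its column.
imputer = SimpleImputer(strategy="mean")
X_filled = imputer.fit_transform(X)
print(X_filled)  # no NaNs remain
```

Mean imputation is only one option; dropping incomplete rows or multiple imputation may be more appropriate depending on how much data is missing and why.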

Validating the model: Validation can be performed internally or externally. A common approach to internal validation is to split the data set into two portions: a “training set” and a “validation set”. The objective of external validation is to apply a previously developed model to new individuals whose data were not used in model development, and to quantify the model’s predictive performance.
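The internal train/validation split described above can be sketched in a few lines of scikit-learn. The dataset, classifier, and 80/20 split below are illustrative assumptions, not a prescription:

```python
# Sketch of internal validation via a train/validation split.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

X, y = load_breast_cancer(return_X_y=True)

# Hold out 20% of the data; the model never sees it during fitting.
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, random_state=0
)

model = LogisticRegression(max_iter=5000)
model.fit(X_train, y_train)
print(f"validation accuracy: {model.score(X_val, y_val):.3f}")
```

Performance measured on the held-out portion estimates how the model will behave on new data; external validation repeats the measurement on an entirely separate cohort.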

When a validation study shows disappointing results, researchers are often tempted to reject the initial model and to develop a new predictive model using the validation cohort data.

Assessing the performance of the model: When assessing model performance, it is important to remember that explanatory models are judged based on strength of associations, whereas predictive models are judged solely based on their ability to make accurate predictions. The performance of a predictive model is assessed using several complementary tests, which assess overall performance, calibration, discrimination, and reclassification.
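Two of the measures named above, discrimination and overall calibration-sensitive accuracy, can be sketched with standard scikit-learn metrics. The true labels and predicted probabilities below are made-up toy values:

```python
# Sketch: discrimination (ROC AUC) and a calibration-sensitive score (Brier).
# y_true and y_prob are assumed toy values for illustration.
import numpy as np
from sklearn.metrics import roc_auc_score, brier_score_loss

y_true = np.array([0, 0, 1, 1, 1, 0, 1, 0])
y_prob = np.array([0.1, 0.3, 0.8, 0.7, 0.6, 0.2, 0.9, 0.4])

auc = roc_auc_score(y_true, y_prob)       # how well the model ranks cases vs. non-cases
brier = brier_score_loss(y_true, y_prob)  # mean squared error of the probabilities
print(f"AUC = {auc:.2f}, Brier score = {brier:.3f}")
```

An AUC of 1.0 means every case received a higher predicted probability than every non-case; a lower Brier score indicates probabilities closer to the observed outcomes.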

Several factors are driving predictive analytics:

Technological advances: The techniques used in predictive analysis are chosen according to the amount of data the process must handle. Some require thousands of calculations to be carried out with high performance, so advanced hardware and software packages are needed to handle them.

Data availability: The validity of a predictive model depends on the data available to develop it; enough data to fit and validate a model is mandatory. In addition to the business’s own data, third-party data may be needed to build the model.

Limitations: Predictive models have their limitations. A model cannot be created without a sufficient dataset, and the outcome to be predicted must be clearly defined. Business conditions are likely to change, so one cannot be sure that historical data will yield a model that still holds.

Some companies treat the predictive model as a black box. The results of a predictive model should always be checked to see whether they make sense.


There are a number of tools available in the market that can help with predictive analytics.

Some notable open source tools are:

1. scikit-learn

Scikit-learn is an open source machine learning library for the Python programming language.


2. KNIME

KNIME is an open source data analytics, reporting and integration platform that integrates various components for machine learning and data mining.

3. OpenNN

OpenNN is a software library which is able to learn from both datasets and mathematical models. The software can be used to solve pattern recognition problems.

4. Orange

Orange is a component-based data mining and machine learning software suite. It includes a set of components for data pre-processing, modelling, model evaluation, and exploration techniques.

5. R

R is a free software programming language and software environment for statistical computing and graphics.

6. RapidMiner

RapidMiner is a software platform that provides an integrated environment for machine learning, data mining, text mining, predictive analytics and business analytics.

7. Weka

Weka is a popular suite of machine learning software written in Java, developed at the University of Waikato, New Zealand. It contains a collection of visualization tools and algorithms for analysis and predictive modelling.

8. GNU Octave

GNU Octave is a high-level programming language, primarily intended for numerical computations.

9. Apache Mahout

Apache Mahout is a project to produce free implementations of distributed or otherwise scalable machine learning algorithms focused primarily in the areas of collaborative filtering, clustering and classification.

Notable commercial predictive analytics tools are:

1. Alpine Data Labs

Alpine Data Labs is an advanced analytics interface working with Apache Hadoop and big data.

2. BIRT Analytics

BIRT Analytics is a self-service predictive analytics tool that allows non-technical business users to engage in visual data mining with Big Data from multiple sources without IT help.

3. Angoss Knowledge STUDIO

Angoss provides client-server and big data analytics software products and cloud solutions.

4. IBM SPSS Modeler

IBM SPSS Modeler is a data mining and text analytics software application built by IBM. It is used to build predictive models and conduct other analytic tasks.


5. InfiniteInsight

InfiniteInsight is a predictive modeling suite developed by KXEN that helps analytic professionals and business executives extract information from data.

6. Mathematica

Mathematica is a computational software program used in many scientific, engineering, mathematical and computing fields, based on symbolic mathematics.


7. MATLAB

MATLAB is a multi-paradigm numerical computing environment and fourth-generation programming language. MATLAB offers features to develop algorithms and create models and applications.

8. Minitab

Minitab is a statistics package originally developed at the Pennsylvania State University. It helps users analyze data and improve products and services, and is widely used for quality improvement.

9. Oracle Data Mining (ODM)

Oracle Data Mining provides data mining functionality as native SQL functions within the Oracle database. The Oracle spreadsheet add-in for predictive analytics provides predictive analytics operations.

10. Revolution Analytics

Revolution Analytics is a statistical software company focused on developing “open-core” versions of the free and open source software R for enterprise, academic, and analytics customers.

11. SAS Enterprise Miner

SAS (Statistical Analysis System; not to be confused with SAP) is a software suite developed by SAS Institute for advanced analytics, business intelligence, data management, and predictive analytics.


12. Stata

Stata is a general-purpose statistical software package created in 1985 by StataCorp. Stata’s capabilities include data management, statistical analysis, graphics, simulations, and regression analysis.


13. STATISTICA

STATISTICA is a statistics and analytics software package developed by StatSoft. STATISTICA provides data analysis, data management, statistics, data mining, and data visualization procedures.


14. TIBCO Spotfire

TIBCO Spotfire is an analytics platform that lets users visualize and interact with their data to quickly develop insights and take action.

15. FICO

FICO provides analytics software and tools used across multiple industries to manage risk, fight fraud, build more profitable customer relationships, and optimize operations.