AutoML: Hyperparameter Optimization

Hyperparameter Optimization is an AutoML component that provides an environment for optimizing the hyperparameters of data mining models in TIBCO Data Science / Statistica.

It helps identify optimal:

  • Model parameters
  • Misclassification costs
  • Stratification strategies
  • Feature selection

It provides deep insights into:

  • Sensible ranges for the parameters
  • Interdependencies of parameters
  • Validation of the current configuration
  • Influence of sampling
  • Influence / Predictability of individual cases / observations
  • Configuration of the models
  • Relation of different error/accuracy measures

It additionally offers:

  • Easy setup of experiments
  • Automated visualization and summarization of the results
  • Simple expandability
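
As a rough illustration of what such an experiment involves, the sketch below runs an exhaustive grid search in plain Python. It is not the component's own algorithm; the source does not state which search strategy is used. The names `grid_search`, `toy_error`, `depth` and `fp_cost` are hypothetical, and the error function is a stand-in for a real validation error.

```python
from itertools import product


def grid_search(evaluate, grid):
    """Evaluate every parameter combination and return the best one.

    Returns (best_params, best_score, results), where lower scores are better
    and `results` holds every (params, score) pair that was evaluated.
    """
    names = list(grid)
    results = []
    for values in product(*(grid[name] for name in names)):
        params = dict(zip(names, values))
        results.append((params, evaluate(params)))
    best_params, best_score = min(results, key=lambda pair: pair[1])
    return best_params, best_score, results


def toy_error(params):
    # Hypothetical error surface standing in for a validated model error.
    return (params["depth"] - 4) ** 2 + abs(params["fp_cost"] - 2.0)


# Assumed search space: one model parameter and one misclassification cost.
grid = {"depth": [2, 4, 6], "fp_cost": [1.0, 2.0, 5.0]}
best_params, best_score, results = grid_search(toy_error, grid)
```

Keeping every evaluated combination in `results`, rather than only the winner, is what makes it possible to inspect sensible parameter ranges and interdependencies afterwards.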

Optimal Binning

The Optimal Binning Node combines the levels of individual variables into groups (bins) that have similar properties with regard to a target variable.

Why use (supervised) Binning?

  • Easier interpretation of the data
  • Helpful for data selection (based on a target variable)
  • Reduction of data complexity
  • Handling of missing data
  • Preparation for using methods with a linearity assumption
  • Preparation for use of methods that only support categorical data

The bin variables that are created are always categorical; the input variables can be categorical or continuous (metric).

To create the bins, a CHAID tree is fitted to the data. The tree is then converted into Statistica formulas, which are used to create the bins.
The CHAID tree is controlled by a single parameter: the p-value for merging.
Cases with missing data are excluded while the tree is fitted and the bin formulas are created, but each formula is extended by a rule that assigns cases with missing data to a separate bin, “Missing”. New levels (not previously observed in the data) are assigned to the bin “Unknown”.
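
The handling of missing and previously unseen levels can be sketched in Python as follows. This is only an illustration of the rule logic described above, not the Statistica formula syntax; the function `make_bin_rule`, the variable "region" and the bin labels "Bin 1" and "Bin 2" are hypothetical.

```python
import math


def make_bin_rule(level_to_bin):
    """Build a function that maps a categorical level to its bin label."""
    def assign(value):
        # Missing values get their own bin, mirroring the "Missing" rule.
        if value is None or (isinstance(value, float) and math.isnan(value)):
            return "Missing"
        # Levels that were not seen during fitting fall into "Unknown".
        return level_to_bin.get(value, "Unknown")
    return assign


# Hypothetical grouping such as a fitted tree might yield for "region".
assign_region = make_bin_rule(
    {"north": "Bin 1", "east": "Bin 1", "south": "Bin 2"}
)
labels = [assign_region(v) for v in ["north", "south", None, "west"]]
```

Here "west" was not part of the assumed training levels, so it is assigned to "Unknown", while the `None` entry is assigned to "Missing".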

Advantages

  • Simple deployment of the binning solution
  • Works for classification and regression tasks, with metric and categorical inputs
  • Can be automated

The primary output of the node is a set of Statistica formulas. When applied to data, these calculate the bin variables. The formulas can easily be copied and used wherever they are needed. It is also possible to output the transformed data directly.

Report Node

The StatSoft Report Node turns the creation of large, polished reports from analytical results into a simple, automatic procedure.
The node extracts the workbook "Reporting Documents" from the workspace and adds the workbook items (spreadsheets, graphs, reports) to a Word report.

This process requires a Word template (conveniently stored in Statistica Enterprise or the file system) as a starting point. At minimum, this template can be an empty Word document (.docx).
The interaction with Microsoft Word is handled via Microsoft's own OpenXML library, which results in standards-compliant Word documents.

The most important properties of the items and their placement can be set in a configuration table. Definitions can be provided for individual items and, via wildcards, for groups of items.
The node and its interface are simple, but do not let that fool you: it is a powerful tool that brings the automation of your reporting to a new level of convenience and effectiveness.
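
To illustrate how wildcard-based definitions for groups of items might be resolved, here is a minimal Python sketch. The actual layout of the configuration table is not described here, so the table rows, the property names `width_cm` and `caption`, the first-match-wins rule and the function `settings_for` are all assumptions made for the example.

```python
from fnmatch import fnmatchcase

# Hypothetical configuration rows: (item-name pattern, item properties).
CONFIG = [
    ("Histogram*", {"width_cm": 12, "caption": True}),
    ("Summary Table", {"width_cm": 16, "caption": False}),
    ("*", {"width_cm": 14, "caption": True}),  # fallback for all other items
]


def settings_for(item_name, config=CONFIG):
    """Return the properties of the first pattern that matches the item name."""
    for pattern, props in config:
        if fnmatchcase(item_name, pattern):
            return props
    return {}


histogram = settings_for("Histogram of Age")
table = settings_for("Summary Table")
other = settings_for("Scatterplot")
```

In this sketch, more specific patterns are listed first so that the catch-all `*` row only applies to items without a definition of their own.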