
TPOT - AutoML



The last article I wrote was about LazyPredict and how it can help you find an optimal supervised algorithm. This article takes us one step further.

Most of us (machine learning engineers and aspirants) face one daunting task during modeling: tuning hyperparameters. Automated Machine Learning (AutoML) is not just a tool; it is your helper or assistant, or whatever you like to call it. It makes your work easier by identifying the best pipeline for you. As the title says, we are going to talk about one of the first AutoML tools: TPOT.

The Tree-based Pipeline Optimization Tool (TPOT) was released by the Epistasis Lab. It is based on genetic programming, which is inspired by Darwin’s idea of natural selection. Genetic programming uses three operations:


1. Selection: A fitness function evaluates each individual, and the fitness values are normalized so that each lies between 0 and 1 and they sum to 1. A random number between 0 and 1 is then drawn, and only the individuals whose normalized fitness is greater than or equal to it are kept.


2. Crossover: Do you remember what happens during crossover in natural selection theory? If you don’t, no worries. The fittest individuals picked in the selection phase are combined to generate a new population. That’s it.


3. Mutation: The new individuals produced by crossover then go through mutation, where some random modifications are applied. These steps are repeated until we get the best population.
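
To make the three steps concrete, here is a minimal sketch of a genetic loop on a toy problem (maximizing the number of 1s in a bit list). It is not TPOT's implementation: the function names are hypothetical, and it uses fitness-proportionate (roulette-wheel) selection plus elitism rather than the exact threshold rule described above.

```python
import random


def fitness(individual):
    """Toy fitness: the number of 1s in the bit list."""
    return sum(individual)


def select(population, rng):
    """Fitness-proportionate (roulette-wheel) selection of one parent."""
    weights = [fitness(ind) + 1 for ind in population]  # +1 avoids all-zero weights
    return rng.choices(population, weights=weights, k=1)[0]


def crossover(parent_a, parent_b, rng):
    """One-point crossover: a prefix of one parent joined to the suffix of the other."""
    point = rng.randint(1, len(parent_a) - 1)
    return parent_a[:point] + parent_b[point:]


def mutate(individual, rng, rate=0.05):
    """Flip each bit independently with probability `rate`."""
    return [1 - bit if rng.random() < rate else bit for bit in individual]


def evolve(n_bits=20, population_size=30, generations=40, seed=0):
    """Run selection, crossover and mutation for a fixed number of generations."""
    rng = random.Random(seed)
    population = [[rng.randint(0, 1) for _ in range(n_bits)]
                  for _ in range(population_size)]
    best_per_generation = []
    for _ in range(generations):
        best = max(population, key=fitness)
        best_per_generation.append(fitness(best))
        # Elitism: carry the current best individual forward unchanged.
        next_population = [best]
        while len(next_population) < population_size:
            child = crossover(select(population, rng), select(population, rng), rng)
            next_population.append(mutate(child, rng))
        population = next_population
    return max(population, key=fitness), best_per_generation


best, history = evolve()
print(fitness(best), history[0], history[-1])
```

Because the best individual is always kept, the best fitness never decreases from one generation to the next. In TPOT the individuals are whole pipelines and the fitness is a cross-validated score, which is why each generation is so much more expensive to evaluate.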



TPOT uses tree-based structures to represent pipelines for predictive modeling. A pipeline covers both the preparation of the data and the model itself, together with its tuned hyperparameters.
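
As a rough illustration of what "tree-based" means here, the sketch below models a pipeline as a small tree in which each node is an operator and its children are the steps that feed into it. The Node class and the render and depth helpers are hypothetical names made up for this example, not TPOT's internal classes; the operators mirror the MinMaxScaler and XGBClassifier pipeline shown later in this article.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Node:
    """One operator in a pipeline tree (illustrative only, not a TPOT class)."""
    name: str
    params: Dict[str, object] = field(default_factory=dict)
    children: List["Node"] = field(default_factory=list)


def render(node):
    """Render the tree as a nested expression, with inputs as inner arguments."""
    if not node.children and not node.params:
        return node.name  # a leaf such as the raw input data
    args = [render(child) for child in node.children]
    args += [f"{key}={value!r}" for key, value in node.params.items()]
    return f"{node.name}({', '.join(args)})"


def depth(node):
    """Number of levels in the tree, counting the input leaf."""
    return 1 + max((depth(child) for child in node.children), default=0)


# Raw data -> scaler -> classifier, written as a tree rooted at the classifier.
pipeline = Node(
    "XGBClassifier",
    params={"learning_rate": 0.1, "max_depth": 4},
    children=[Node("MinMaxScaler", children=[Node("input_matrix")])],
)

print(render(pipeline))
print(depth(pipeline))
```

Crossover and mutation then amount to swapping subtrees between two such trees or changing a node's operator or parameters.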

Let us start by installing TPOT:

!pip install tpot

Once the installation is done, we can move forward and import the required libraries along with TPOT. TPOT is simple to use: create an instance of TPOTRegressor or TPOTClassifier depending on the problem, define the evaluation procedure, run the search by fitting the instance, and then export the resulting pipeline.

I came across this recommendation: explicitly specify a cross-validation class with your chosen configuration, along with the performance metric to use. I am going to use StratifiedKFold with accuracy scoring.

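from sklearn.model_selection import StratifiedKFold
from tpot import TPOTClassifier

# 10-fold stratified cross-validation, scored by accuracy
cv = StratifiedKFold(n_splits=10)
model = TPOTClassifier(generations=5, population_size=50, cv=cv, scoring='accuracy', verbosity=2, random_state=1, n_jobs=-1)
model.fit(x_train, y_train)
model.export('tpot_data.py')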
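
If the evaluation setup above feels abstract, the following standard-library sketch shows the idea behind stratified k-fold splitting with accuracy scoring. It is a simplification written for this article, not scikit-learn's algorithm: the function names are hypothetical, there is no shuffling, and the "model" is just a baseline that always predicts the majority class of the training folds.

```python
from collections import Counter, defaultdict


def stratified_folds(labels, n_splits):
    """Assign sample indices to folds so each fold keeps roughly the class mix."""
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)
    folds = [[] for _ in range(n_splits)]
    for indices in by_class.values():
        # Deal each class's samples round-robin across the folds.
        for position, index in enumerate(indices):
            folds[position % n_splits].append(index)
    return folds


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def cross_validate_majority(labels, n_splits):
    """Mean held-out accuracy of a majority-class baseline over all folds."""
    scores = []
    for test_indices in stratified_folds(labels, n_splits):
        held_out = set(test_indices)
        train_labels = [label for i, label in enumerate(labels) if i not in held_out]
        majority = Counter(train_labels).most_common(1)[0][0]
        test_labels = [labels[i] for i in test_indices]
        scores.append(accuracy(test_labels, [majority] * len(test_labels)))
    return sum(scores) / len(scores)


labels = [0] * 8 + [1] * 4          # imbalanced toy labels: 8 of class 0, 4 of class 1
folds = stratified_folds(labels, n_splits=4)
print(folds)                         # every fold holds two 0s and one 1
print(cross_validate_majority(labels, n_splits=4))
```

TPOT does the same kind of averaging for every candidate pipeline, which is where the "Average CV score" comment in an exported file comes from.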

Be warned: this search can take a long time, maybe hours, maybe days.

The best-fitting pipeline that TPOT exported for my dataset was:


import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler
from xgboost import XGBClassifier
from tpot.export_utils import set_param_recursive
# NOTE: Make sure that the outcome column is labeled 'target' in the data file
tpot_data = pd.read_csv('PATH/TO/DATA/FILE', sep='COLUMN_SEPARATOR', dtype=np.float64)
features = tpot_data.drop('target', axis=1)
training_features, testing_features, training_target, testing_target = \
    train_test_split(features, tpot_data['target'], random_state=1)
# Average CV score on the training set was: 0.8073042720825973
exported_pipeline = make_pipeline(
 MinMaxScaler(),
 XGBClassifier(learning_rate=0.1, max_depth=4, min_child_weight=7, n_estimators=100, n_jobs=1, subsample=0.8500000000000001, verbosity=0)
)
# Fix random state for all the steps in exported pipeline
set_param_recursive(exported_pipeline.steps, 'random_state', 1)
exported_pipeline.fit(training_features, training_target)
results = exported_pipeline.predict(testing_features)

The output will look something like this:

[Image: TPOT output]

The export call also writes a new Python file (tpot_data.py in this case) that you can use in your own code.


TPOT comes with several built-in configurations. They are listed below:


1. TPOT light: Uses only simple, fast-executing operators in the pipelines.

2. TPOT MDR: Meant for problems in bioinformatics; this configuration is ideal for genome-wide association studies.

3. TPOT sparse: A configuration suitable for sparse matrices.

4. TPOT NN: Adds neural network estimators to default TPOT. These estimators are written in PyTorch.

5. TPOT cuML: For medium-sized or large datasets; it searches for the best pipelines over a limited configuration using GPU-accelerated estimators.
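
A configuration from the list above is selected by name. The sketch below assumes the classic TPOT 0.x API, where the constructor accepts a config_dict argument taking these strings; newer TPOT releases may differ. The helper names and short aliases (tpot_kwargs, make_classifier, "light", and so on) are hypothetical conveniences for this example, and TPOT is imported lazily so the helpers themselves run without it.

```python
# Built-in configuration names as passed to config_dict
# (assumption: classic TPOT 0.x API; check the version you have installed).
BUILT_IN_CONFIGS = {
    "light": "TPOT light",
    "mdr": "TPOT MDR",
    "sparse": "TPOT sparse",
    "nn": "TPOT NN",
    "cuml": "TPOT cuML",
}


def tpot_kwargs(config=None, generations=5, population_size=50):
    """Build constructor keyword arguments; config=None means default TPOT."""
    kwargs = {
        "generations": generations,
        "population_size": population_size,
        "verbosity": 2,
        "random_state": 1,
    }
    if config is not None:
        if config not in BUILT_IN_CONFIGS:
            raise ValueError(f"unknown configuration: {config!r}")
        kwargs["config_dict"] = BUILT_IN_CONFIGS[config]
    return kwargs


def make_classifier(config=None):
    """Create a TPOTClassifier for the chosen configuration (needs TPOT installed)."""
    from tpot import TPOTClassifier  # lazy import: only required when called
    return TPOTClassifier(**tpot_kwargs(config))


print(tpot_kwargs("light"))
```

For example, make_classifier("light") would give a classifier restricted to the fast operators, which is a sensible first run before committing hours to the default search. Note that the NN and cuML configurations also need PyTorch and cuML installed, respectively.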


Because the search is stochastic, TPOT can recommend different pipelines for the same dataset from one run to the next.


Warning: Don’t try to substitute this for your own work and knowledge. It is just a helper for when you are stuck; it will make your work easier and a little bit faster. I hope this gives you a little insight into TPOT and how to use it.

 
 
 
