Model selection with scikit-learn and ploomber

Model selection is an important part of any Machine Learning task. Since each model encodes its own inductive bias, it is important to compare models to understand their subtleties and choose the best one for the problem at hand. While knowing each learning algorithm in detail helps build an intuition about which ones to try, it is always helpful to visualize actual results on our data.

Note: This blog post assumes you are familiar with the model selection framework via nested cross-validation and with the following scikit-learn modules (click for documentation): GridSearchCV, cross_val_predict and Pipeline.
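
As a refresher, the snippet below is a minimal sketch of that pattern on synthetic data: GridSearchCV runs the inner hyperparameter-tuning loop, while cross_val_predict provides the outer loop that estimates generalization performance.

# a minimal sketch of nested cross-validation on synthetic data
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV, cross_val_predict

X, y = make_regression(n_samples=200, n_features=10, random_state=0)
grid = GridSearchCV(Ridge(), {'alpha': [0.5, 1.0, 2.0]})

# each outer fold re-runs the inner grid search on its training split,
# so predictions come from models tuned without seeing that fold
y_pred = cross_val_predict(grid, X, y)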

The quick and dirty approach for model selection would be to have a long Jupyter notebook, where we train all models and output charts for each one. In this post we will show how to achieve this in a cleaner way by using scikit-learn and ploomber.

Project layout

We split the code into three files:

  1. pipelines.py. Contains functions to instantiate scikit-learn pipelines
  2. report.py. Contains the source code that performs hyperparameter tuning and model evaluation; it imports the pipelines defined in pipelines.py
  3. main.py. Contains the loop that executes report.py for each pipeline using ploomber

Unless otherwise noted, the snippets shown in this post belong to main.py.

Functions to instantiate pipelines (pipelines.py)

We start by declaring each of our model pipelines as functions that return a scikit-learn Pipeline instance. We will use these in a nested cross-validation loop to choose the best hyperparameters and estimate generalization performance.

# Content of pipelines.py
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import Ridge
from sklearn.svm import NuSVR


def ridge():
    return Pipeline([('scaler', StandardScaler()),
                     ('reg', Ridge())])


def nusvr():
    return Pipeline([('scaler', StandardScaler()),
                     ('reg', NuSVR())])


We have one factory for NuSVR and another one for Ridge Regression. Since these two models are sensitive to feature scaling, each factory wraps the model in a scikit-learn pipeline that scales all features before feeding the data into the model.
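
As a quick sanity check, each factory can be used like any other scikit-learn estimator. A sketch, assuming pipelines.py is importable and using synthetic data:

from sklearn.datasets import make_regression
from pipelines import ridge

X, y = make_regression(n_samples=100, n_features=5, random_state=0)
pipe = ridge()   # StandardScaler -> Ridge
pipe.fit(X, y)
print(pipe.predict(X[:3]))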

Hyperparameter tuning and performance estimation (report.py)

We will process each model separately, generating three HTML reports in total. All reports are generated from the following source code:

# Content of report.py
from IPython.display import Markdown
import importlib

from sklearn.datasets import load_boston
from sklearn.model_selection import cross_val_predict, GridSearchCV
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# + tags=["parameters"]
m_init = None
m_params = None
# -

Markdown('# Report for {}'.format(m_init))

print('Params: ', m_params)

# +
# m_init is module.sub_module.constructor, import it from the string
parts = m_init.split('.')
mod_str, constructor = '.'.join(parts[:-1]), parts[-1]
mod = importlib.import_module(mod_str)

# instantiate it
model = getattr(mod, constructor)()
print(model)
# -

# load data
dataset = load_boston()
X = pd.DataFrame(dataset.data, columns=dataset.feature_names)
y = dataset.target

# +
# Perform grid search over the passed parameters
grid = GridSearchCV(model, m_params, n_jobs=-1)

# We want to estimate generalization performance *and* tune hyperparameters
# so we are using nested cross-validation
y_pred = cross_val_predict(grid, X, y)
# -

# predicted vs actual scatter plot
fig, ax = plt.subplots()
fig.set_size_inches(6, 6)
ax.scatter(y_pred, y)
ax.grid()
ax.set_xlabel('Predicted')
ax.set_ylabel('Actual')

# residuals
fig, ax = plt.subplots()
fig.set_size_inches(6, 6)
res = y - y_pred
ax.scatter(np.arange(len(res)), res)
ax.grid()
ax.set_ylabel('Residual')

# residuals distribution
fig, ax = plt.subplots()
fig.set_size_inches(8, 6)
sns.distplot(res, ax=ax)
ax.grid()
ax.set_title('Residual distribution')

# print metrics
mae = np.abs(y - y_pred).mean()
mse = ((y - y_pred) ** 2).mean()
print(f'MAE: {mae:.2f}')
print(f'MSE: {mse:.2f}')


Running the execution loop (main.py)

We now turn our attention to the main script that takes the model pipelines and the report source code and executes them. First, we define the parameters we want to try for each model, one dictionary per model: the m_init key has the pipeline location (we will dynamically import it using the importlib library), and the m_params key contains the hyperparameters to try. Note that for Ridge Regression and NuSVR we have to add a reg__ prefix to each parameter name; this is because the factories return scikit-learn Pipeline objects, and we need to specify which step each parameter belongs to.
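
If you are unsure which prefixed names a pipeline accepts, get_params() lists all of them. A quick check, again assuming pipelines.py is importable:

from pipelines import ridge

# every Ridge hyperparameter is exposed under the 'reg__' prefix
print([k for k in ridge().get_params() if k.startswith('reg__')])
# e.g. ['reg__alpha', 'reg__fit_intercept', ...]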

from pathlib import Path

from ploomber.tasks import NotebookRunner
from ploomber.products import File
from ploomber import DAG

# Ridge Regression grid
params_ridge = {
    'm_init': 'pipelines.ridge',
    'm_params': {
        'reg__alpha': [0.5, 1.0, 1.5, 2.0, 3.0]
    }
}

# Random Forest Regression grid
params_rf = {
    'm_init': 'sklearn.ensemble.RandomForestRegressor',
    'm_params': {
        'n_estimators': [5, 50, 100],
        'min_samples_leaf': [5, 10, 20],
    }
}

# Nu Support Vector Regression grid
params_nusvr = {
    'm_init': 'pipelines.nusvr',
    'm_params': {
        'reg__nu': [0.3, 0.5, 0.8],
        'reg__C': [0.5, 1.0, 1.5, 2.0],
        'reg__kernel': ['rbf', 'sigmoid']
    }
}


Note that we do not have a pipeline factory for RandomForestRegressor: Random Forest is not sensitive to scaling, so we use the model directly.
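
The same dynamic-import logic we wrote in report.py resolves this fully qualified string just as well as our factory locations:

import importlib

parts = 'sklearn.ensemble.RandomForestRegressor'.split('.')
mod = importlib.import_module('.'.join(parts[:-1]))  # sklearn.ensemble
model = getattr(mod, parts[-1])()                    # instantiate it
print(model)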

We now add the execution loop, which we will run with ploomber. We just have to tell ploomber where to load the source code from, which parameters to use on each iteration, and where to save the output:

# load report source code
notebook = Path('report.py').read_text()

# we will save all notebooks in the artifacts/ folder
out = Path('artifacts')
out.mkdir(exist_ok=True)

params_all = {'ridge': params_ridge, 'rf': params_rf, 'nusvr': params_nusvr}

dag = DAG()

# loop over params and create one notebook task for each...
for name, params in params_all.items():
    # NotebookRunner is able to execute ipynb files using
    # papermill under the hood, if the input file has a
    # different extension (like in our case), it will first
    # convert it to an ipynb file using jupytext
    NotebookRunner(notebook,
                   # save it in artifacts/{name}.html
                   # NotebookRunner will generate ipynb files by
                   # default, but you can choose other formats,
                   # any format supported by the official nbconvert
                   # package is supported here
                   product=File(out / (name + '.html')),
                   dag=dag,
                   name=name,
                   # pass the parameters
                   params=params,
                   ext_in='py',
                   kernelspec_name='python3')


Build the DAG:

dag.build()


# Output:
name    Ran?      Elapsed (s)    Percentage
------  ------  -------------  ------------
nusvr   True          6.95555       27.8197
rf      True         11.6961        46.78
ridge   True          6.35066       25.4003


That’s it. After building the DAG, each model generates one report; you can see them here: Ridge, Random Forest and NuSVR.

Splitting the logic into separate files improves readability and maintainability: if we want to add another model, we only have to add a new dictionary with the parameter grid, and if preprocessing is needed, we just add a factory in pipelines.py.
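
For example, supporting a (hypothetical) Lasso model would only take a few lines:

# hypothetical factory added to pipelines.py (Lasso is also sensitive to
# scaling; Pipeline and StandardScaler are already imported at the top)
from sklearn.linear_model import Lasso


def lasso():
    return Pipeline([('scaler', StandardScaler()),
                     ('reg', Lasso())])

# ...and one more grid in main.py
params_lasso = {
    'm_init': 'pipelines.lasso',
    'm_params': {
        'reg__alpha': [0.1, 0.5, 1.0]
    }
}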

Using ploomber provides a concise and clean framework for generating reports: in just a few lines of code, we generated all our reports. However, we made a big simplification in our report.py file: we load, train, and evaluate in a single source file, so if we made even a small change to our charts, we would have to re-train every model. A better approach is to split that logic into several steps, and that scenario is where ploomber is very effective:

  1. Clean raw data (save clean dataset)
  2. Train model and predict (save predictions)
  3. Evaluate predictions

If we split each model pipeline into these three steps and run the build, we obtain the same results. Now let’s say you want to add a new chart, so you modify step 3. All you have to do to update your reports is call dag.build() again: ploomber will figure out that it does not have to re-run steps 1 and 2, and it will overwrite the old reports with the new ones.
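
A minimal sketch of what such a three-step pipeline could look like, using ploomber's PythonCallable task (this is an assumption-laden sketch, not the post's actual code: clean, train, and the file names are hypothetical placeholders, and it assumes PythonCallable passes product and upstream to each function as in the ploomber version used here):

import pandas as pd
from ploomber import DAG
from ploomber.tasks import PythonCallable
from ploomber.products import File


def clean(product):
    # step 1: save a clean dataset (raw.csv is a placeholder)
    df = pd.read_csv('raw.csv')
    df.dropna().to_csv(str(product), index=False)


def train(product, upstream):
    # step 2: read the clean data, fit, and save predictions
    df = pd.read_csv(str(upstream['clean']))
    # ... fit the model and predict here ...
    df.to_csv(str(product), index=False)


dag = DAG()
t_clean = PythonCallable(clean, File('clean.csv'), dag, name='clean')
t_train = PythonCallable(train, File('preds.csv'), dag, name='train')
# step 3 would be a NotebookRunner like the one above, downstream of 'train'
t_train.set_upstream(t_clean)
dag.build()  # on later builds, up-to-date steps are skipped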

Closing remarks

Developing a Machine Learning model is an iterative process: by breaking down the entire pipeline logic into small steps and maximizing code reusability, we can develop short and maintainable pipelines. Jupyter is a superb tool (I use it every day, and I’m actually writing this blog post from Jupyter), but do not fall into the habit of coding everything in one big notebook, which inevitably leads to unmaintainable code; prefer many short notebooks (or .py files) over a single big one.

Source code for this post is available here.

Found an error in this post? Click here to let us know.

This blog post was generated using package versions:

# Output:
matplotlib==3.1.3
numpy==1.18.1
pandas==1.0.1
scikit-learn==0.22.2
seaborn==0.10.0

