When you’re working with machine learning, deciding which algorithm performs best across multiple datasets can be quite challenging. Simply comparing performance metrics might not be enough; you need statistical methods to be confident. That’s where the Friedman Test and Critical Difference (CD) Diagrams come in.
My classmates and I faced this challenge firsthand when preparing a project presentation. We struggled to find a clear way to generate the diagram, so after finally figuring it out, I decided to share this guide to save others time.
In this article, you’ll find a Python script that performs this evaluation and visualization. You can also access the complete code in my GitHub gist.
I’ll also show you how to modify the code to use accuracy instead of the error rate. The script has been tested on Python 3.8 and above.
The Python script does three main things:
- Performs the Friedman Test to statistically evaluate performance differences.
- Creates a ranking table comparing the algorithm scores.
- Generates and saves a PNG image of the Critical Difference Diagram and the ranking table.
Critical Difference Diagram generated
In the diagram, algorithms connected by a horizontal bar are not significantly different from each other according to the statistical test. Algorithms with lower average ranks (positioned further right) performed better. The ranking is the same whether you use error rate or accuracy as the performance metric.
Ranking table generated
This table shows the error rates of each algorithm across all datasets. Each cell contains the error rate along with its ranking in parentheses, where 1 is the best and the worst possible rank equals the number of algorithms compared (15 in my case).
At the bottom, you’ll find the rank sums and average rankings for each algorithm, for better overall comparison.
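To make the rank sums and average ranks concrete, here is a minimal sketch with made-up error values (the dataset and algorithm names are placeholders). It uses the same `rank(axis=1, method='min')` call as the full script below, where 1 is the best rank within each dataset:

```python
import pandas as pd

# Hypothetical error rates: rows are datasets, columns are algorithms
df = pd.DataFrame(
    {'A': [0.10, 0.12], 'B': [0.15, 0.11], 'C': [0.30, 0.25]},
    index=['dataset1', 'dataset2'],
)

# Rank within each dataset row: 1 = lowest error = best
ranks = df.rank(axis=1, method='min')
print(ranks)
print("Rank sums:")
print(ranks.sum())   # totals per algorithm
print("Average ranks:")
print(ranks.mean())  # these averages are what the CD diagram is built from
```

Here algorithm C always has the highest error, so it gets rank 3 on both datasets and the worst average rank.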
I edited the original image and removed some columns for better readability here.
Why Use the Friedman Test and Critical Difference Diagram?
The Friedman test is a non-parametric statistical test designed to detect differences between multiple algorithms across various datasets. It ranks algorithms based on their performance, helping you understand if differences in performance are genuinely significant or just due to chance.
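As a quick standalone illustration (with invented numbers, before the full implementation below), `scipy.stats.friedmanchisquare` takes one sample per algorithm, each containing that algorithm's measurements across all datasets:

```python
import numpy as np
from scipy.stats import friedmanchisquare

# Hypothetical error rates: 4 datasets (rows) x 3 algorithms (columns)
errors = np.array([
    [0.10, 0.15, 0.30],
    [0.12, 0.11, 0.25],
    [0.08, 0.14, 0.28],
    [0.09, 0.13, 0.27],
])

# Transpose so each argument is one algorithm's errors across the datasets
stat, p = friedmanchisquare(*errors.T)
print(f"statistic={stat:.3f}, p-value={p:.4f}")
if p < 0.05:
    print("At least one algorithm's performance differs significantly.")
```

A small p-value only tells you that *some* difference exists; the CD diagram is what shows *which* algorithms differ.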
The Critical Difference Diagram visually presents these rankings. It clearly shows which algorithms perform similarly and which are significantly better or worse, making it easy to interpret results at a glance. This diagram is particularly useful when comparing numerous algorithms across multiple datasets.
Preparing Your Data
Now for the implementation, you’ll need your data structured like this:
- `Datasets`: names of your datasets (e.g., MNIST, Fashion-MNIST).
- `Algorithms`: names of the algorithms you’re evaluating. Keep these ordered consistently.
- `Performance (Error)`: one list of error rates per dataset, aligned with the `Algorithms` list.
For example:
```python
data = {
    'Datasets': ['MNIST', 'Fashion-MNIST', ...],
    'Algorithms': ['NaiveBayes', 'IBk', ..., 'RandomForest'],
    'Performance (Error)': [
        ['30.34%', '3.09%', ..., '88.65%'],  # MNIST
        ['36.72%', '14.35%', ..., '90%'],    # Fashion-MNIST
        # Other datasets...
    ]
}
```
Make sure the error rates are listed in the same order as their corresponding algorithms. For example, if NaiveBayes is the first algorithm in the list, its performance values should always appear first in each dataset’s row.
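Because `zip()` silently stops at the shorter list, a misaligned row would corrupt the rankings without any error message. A simple pre-flight check you might add (shown here with shortened, hypothetical lists) catches this early:

```python
# Hypothetical sanity check: each dataset row must hold exactly one value per algorithm
algorithms = ['NaiveBayes', 'IBk', 'RandomForest']
performance = [
    ['30.34%', '3.09%', '3.51%'],    # MNIST
    ['36.72%', '14.35%', '11.92%'],  # Fashion-MNIST
]
for i, row in enumerate(performance):
    if len(row) != len(algorithms):
        raise ValueError(f"Row {i} has {len(row)} values, expected {len(algorithms)}")
print("All rows aligned")
```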
If you prefer to use accuracy instead of error rates, you can either replace the values in the Performance field with accuracy scores or simply subtract the error rates from 1. I’ll also demonstrate how to do this right after the implementation.
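As a quick preview of that conversion, with a few hypothetical error strings:

```python
import pandas as pd

# Hypothetical example: turn error-rate strings into accuracy fractions
errors = pd.Series(['30.34%', '3.09%', '0%'])
error_frac = errors.str.rstrip('%').astype(float) / 100
accuracy = 1 - error_frac
print(accuracy.round(4).tolist())  # [0.6966, 0.9691, 1.0]
```

When plotting accuracies instead of errors, remember to call `plot_critical_difference` with `lower_better=False`, since higher accuracy is better.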
Python Implementation
Here’s the Python code, also available in this gist on GitHub:
```python
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import friedmanchisquare
from aeon.visualisation import plot_critical_difference

data = {
    'Datasets': [
        'MNIST', 'Fashion-MNIST', 'e1-Spiral', 'e1-Android',
        'e2-andorinhas', 'e2-chinese', 'e3-user', 'e3-ecommerce',
        'e4-wine', 'e4-heart', 'e5-mamiferos', 'e5-titanic'
    ],
    'Algorithms': [
        'NaiveBayes', 'IBk', 'J48', 'RandomForest', 'LMT',
        'XGBoost', 'SVM', 'LGBM', 'Bagging', 'AdaBoost',
        'KStar', 'M5P', 'MLP', 'HC', 'E-M'
    ],
    'Performance (Error)': [
        [  # MNIST
            '30.34%', '3.09%', '10.67%', '3.51%', '5.70%',
            '2.05%', '2.61%', '2.26%', '5.07%', '11.74%',
            '89.80%', '47.31%', '0%', '44.96%', '88.65%'
        ],
        [  # Fashion-MNIST
            '36.72%', '14.35%', '18.27%', '11.92%', '13.62%',
            '8.58%', '9.47%', '9.50%', '12.20%', '19.00%',
            '90%', '46.78%', '0.52%', '51.45%', '90%'
        ],
        [  # e1-Spiral
            '29.125%', '0.38%', '2.25%', '1.75%', '2.375%',
            '3.12%', '1.88%', '3.12%', '0%', '4.37%',
            '5.51%', '1.43%', '0%', '49.75%', '72.50%'
        ],
        [  # e1-Android
            '8.1317%', '7.7285%', '4.7491%', '4.5475%', '4.3683%',
            '4.03%', '4.37%', '3.47%', '6.38%', '5.60%',
            '8.94%', '7.61%', '1.67%', '49.98%', '38.95%'
        ],
        [  # e2-andorinhas
            '8.1%', '6.65%', '5.60%', '4.90%', '4.60%',
            '4.00%', '3.75%', '4.25%', '3.50%', '4.75%',
            '4.36%', '2.85%', '3.92%', '48.60%', '49.25%'
        ],
        [  # e2-chinese
            '27.1589%', '12.8911%', '34.9186%', '7.5094%', '10.6383%',
            '7.50%', '6.25%', '5.63%', '16.25%', '34.38%',
            '0%', '1.21%', '0%', '87.36%', '78.22%'
        ],
        [  # e3-user
            '0%', '4.8571%', '0%', '0%', '0%', '0.1429%',
            '2.14%', '0%', '0%', '0%', '0%', '0%',
            '0%', '0.39%', '0%', '79.14%', '4.57%'
        ],
        [  # e3-ecommerce
            '11.37%', '11.15%', '2.39%', '2.07%', '2.42%',
            '0.90%', '8.80%', '0.70%', '10.35%', '2.85%',
            '0.02%', '7.56%', '3.96%', '22.11%', '41.49%'
        ],
        [  # e4-wine
            '44.96%', '35.21%', '38.59%', '29.89%', '39.65%',
            '48.95%', '56.56%', '46.85%', '43.94%', '50.99%',
            '39.23%', '50.82%', '36.51%', '57.34%', '77.98%'
        ],
        [  # e4-heart
            '43.51%', '46.61%', '35.82%', '37.20%', '35.88%',
            '45.71%', '34.51%', '45.73%', '44.16%', '46.1%',
            '46.15%', '64.18%', '49.22%', '88.1962%', '69.94%'
        ],
        [  # e5-mamiferos
            '0%', '0%', '0%', '0%', '0%', '0%',
            '0%', '0%', '0%', '0%', '0%', '0%',
            '1.57%', '0%', '0.20%', '31.20%', '44.80%'
        ],
        [  # e5-titanic
            '21.3244%', '22.7834%', '22.5589%', '28.3951%', '19.7531%',
            '16.76%', '27.56%', '14.85%', '7.48%', '10.79%',
            '27.18%', '61.62%', '26.76%', '38.16%', '38.50%'
        ]
    ]
}

# Convert the data into a DataFrame
datasets = data['Datasets']
algorithms = data['Algorithms']
performance_data = data['Performance (Error)']

# Create one dictionary per dataset, pairing each algorithm with its error
rows = []
for dataset, performance in zip(datasets, performance_data):
    row = {'Dataset': dataset}
    row.update({alg: perf for alg, perf in zip(algorithms, performance)})
    rows.append(row)

df = pd.DataFrame(rows)

# Convert string percentages (including comma decimals) to floats
for alg in algorithms:
    df[alg] = df[alg].str.replace(',', '.').str.rstrip('%').astype(float) / 100

# Calculate the ranking of each algorithm for each dataset (1 = lowest error)
rankings_matrix = df[algorithms].rank(axis=1, method='min', ascending=True)

# Format each cell as "error (rank)"
formatted_results = df[algorithms].copy()
for col in formatted_results.columns:
    formatted_results[col] = (
        formatted_results[col].round(3).astype(str)
        + " (" + rankings_matrix[col].astype(int).astype(str) + ")"
    )

# Add rows for the sum and average of ranks
sum_ranks = rankings_matrix.sum().round(3).rename('Sum Ranks')
average_ranks = rankings_matrix.mean().round(3).rename('Average Ranks')
formatted_results = pd.concat(
    [formatted_results, sum_ranks.to_frame().T, average_ranks.to_frame().T]
)

# Add the 'Dataset' column to the formatted DataFrame
formatted_results.insert(
    0, 'Dataset', df['Dataset'].tolist() + ['Sum Ranks', 'Average Ranks']
)

# Display the table
print("Error Table (%) with Ranking:")
print(formatted_results)

# Save the formatted table as an image
fig, ax = plt.subplots(figsize=(14, 8))
ax.axis('tight')
ax.axis('off')
table = ax.table(cellText=formatted_results.values,
                 colLabels=formatted_results.columns,
                 cellLoc='center', loc='center')
table.auto_set_font_size(False)
table.set_fontsize(12)
table.scale(2.5, 2.5)
plt.subplots_adjust(left=0.2, bottom=0.2, right=0.8, top=1, wspace=0.2, hspace=0.2)
plt.savefig('table_with_rankings.png', format="png", bbox_inches="tight", dpi=300)
plt.show()
print("Table saved as 'table_with_rankings.png'")

# Perform the Friedman Test (one sample per algorithm, across datasets)
friedman_stat, p_value = friedmanchisquare(*rankings_matrix.T.values)
print(f"Friedman test statistic: {friedman_stat}, p-value = {p_value}")

# Convert the error matrix into a NumPy array for the critical difference diagram
scores = df[algorithms].values
classifiers = df[algorithms].columns.tolist()
print("Algorithms:", classifiers)
print("Errors:", scores)

# Set the figure size before plotting
plt.figure(figsize=(16, 12))  # adjust as needed

# Generate the critical difference diagram
plot_critical_difference(
    scores,
    classifiers,
    lower_better=True,   # errors: lower is better
    test='wilcoxon',     # or 'nemenyi'
    correction='holm',   # or 'bonferroni' or None
)

# Adjust font size and rotation of the x-axis labels
ax = plt.gca()
for label in ax.get_xticklabels():
    label.set_fontsize(14)
    label.set_rotation(45)
    label.set_horizontalalignment('right')

# Increase padding between labels and axis, and leave room below the plot
ax.tick_params(axis='x', which='major', pad=20)
plt.subplots_adjust(bottom=0.35)
ax.tick_params(axis='y', labelsize=12)

# Save and display the plot
plt.savefig('critical_difference_diagram.png', format="png",
            bbox_inches="tight", dpi=300)
plt.show()
```
<span>'</span><span>7.7285%</span><span>'</span><span>,</span> <span>'</span><span>4.7491%</span><span>'</span><span>,</span> <span>'</span><span>4.5475%</span><span>'</span><span>,</span> <span>'</span><span>4.3683%</span><span>'</span><span>,</span> <span>'</span><span>4.03%</span><span>'</span><span>,</span> <span>'</span><span>4.37%</span><span>'</span><span>,</span> <span>'</span><span>3.47%</span><span>'</span><span>,</span> <span>'</span><span>6.38%</span><span>'</span><span>,</span> <span>'</span><span>5.60%</span><span>'</span><span>,</span> <span>'</span><span>8.94%</span><span>'</span><span>,</span> <span>'</span><span>7.61%</span><span>'</span><span>,</span> <span>'</span><span>1.67%</span><span>'</span><span>,</span> <span>'</span><span>49.98%</span><span>'</span><span>,</span> <span>'</span><span>38.95%</span><span>'</span> <span>],</span> <span>[</span> <span>'</span><span>8.1%</span><span>'</span><span>,</span> <span>'</span><span>6.65%</span><span>'</span><span>,</span> <span>'</span><span>5.60%</span><span>'</span><span>,</span> <span>'</span><span>4.90%</span><span>'</span><span>,</span> <span>'</span><span>4.60%</span><span>'</span><span>,</span> <span>'</span><span>4.00%</span><span>'</span><span>,</span> <span>'</span><span>3.75%</span><span>'</span><span>,</span> <span>'</span><span>4.25%</span><span>'</span><span>,</span> <span>'</span><span>3.50%</span><span>'</span><span>,</span> <span>'</span><span>4.75%</span><span>'</span><span>,</span> <span>'</span><span>4.36%</span><span>'</span><span>,</span> <span>'</span><span>2.85%</span><span>'</span><span>,</span> <span>'</span><span>3.92%</span><span>'</span><span>,</span> <span>'</span><span>48.60%</span><span>'</span><span>,</span> <span>'</span><span>49.25%</span><span>'</span> <span>],</span> <span>[</span> <span># e2-chinese -ok </span> <span>'</span><span>27.1589%</span><span>'</span><span>,</span> <span>'</span><span>12.8911%</span><span>'</span><span>,</span> 
<span>'</span><span>34.9186%</span><span>'</span><span>,</span> <span>'</span><span>7.5094%</span><span>'</span><span>,</span> <span>'</span><span>10.6383%</span><span>'</span><span>,</span> <span>'</span><span>7.50%</span><span>'</span><span>,</span> <span>'</span><span>6.25%</span><span>'</span><span>,</span> <span>'</span><span>5.63%</span><span>'</span><span>,</span> <span>'</span><span>16.25%</span><span>'</span><span>,</span> <span>'</span><span>34.38%</span><span>'</span><span>,</span> <span>'</span><span>0%</span><span>'</span><span>,</span> <span>'</span><span>1.21%</span><span>'</span><span>,</span> <span>'</span><span>0%</span><span>'</span><span>,</span> <span>'</span><span>87.36%</span><span>'</span><span>,</span> <span>'</span><span>78.22%</span><span>'</span> <span>],</span> <span>[</span> <span>'</span><span>0%</span><span>'</span><span>,</span> <span>'</span><span>4.8571%</span><span>'</span><span>,</span> <span>'</span><span>0%</span><span>'</span><span>,</span> <span>'</span><span>0%</span><span>'</span><span>,</span> <span>'</span><span>0%</span><span>'</span><span>,</span> <span>'</span><span>0.1429%</span><span>'</span><span>,</span> <span>'</span><span>2.14%</span><span>'</span><span>,</span> <span>'</span><span>0%</span><span>'</span><span>,</span> <span>'</span><span>0%</span><span>'</span><span>,</span> <span>'</span><span>0%</span><span>'</span><span>,</span> <span>'</span><span>0%</span><span>'</span><span>,</span> <span>'</span><span>0%</span><span>'</span><span>,</span> <span>'</span><span>0%</span><span>'</span><span>,</span> <span>'</span><span>0.39%</span><span>'</span><span>,</span> <span>'</span><span>0%</span><span>'</span><span>,</span> <span>'</span><span>79.14%</span><span>'</span><span>,</span> <span>'</span><span>4.57%</span><span>'</span> <span>],</span> <span>[</span> <span># e3-ecommerce -ok </span> <span>'</span><span>11.37%</span><span>'</span><span>,</span> <span>'</span><span>11.15%</span><span>'</span><span>,</span> 
<span>'</span><span>2.39%</span><span>'</span><span>,</span> <span>'</span><span>2.07%</span><span>'</span><span>,</span> <span>'</span><span>2.42%</span><span>'</span><span>,</span> <span>'</span><span>0.90%</span><span>'</span><span>,</span> <span>'</span><span>8.80%</span><span>'</span><span>,</span> <span>'</span><span>0.70%</span><span>'</span><span>,</span> <span>'</span><span>10.35%</span><span>'</span><span>,</span> <span>'</span><span>2.85%</span><span>'</span><span>,</span> <span>'</span><span>0.02%</span><span>'</span><span>,</span> <span>'</span><span>7.56%</span><span>'</span><span>,</span> <span>'</span><span>3.96%</span><span>'</span><span>,</span> <span>'</span><span>22.11%</span><span>'</span><span>,</span> <span>'</span><span>41.49%</span><span>'</span> <span>],</span> <span>[</span> <span># e4-wine -ok </span> <span>'</span><span>44.96%</span><span>'</span><span>,</span> <span>'</span><span>35.21%</span><span>'</span><span>,</span> <span>'</span><span>38.59%</span><span>'</span><span>,</span> <span>'</span><span>29.89%</span><span>'</span><span>,</span> <span>'</span><span>39.65%</span><span>'</span><span>,</span> <span>'</span><span>48.95%</span><span>'</span><span>,</span> <span>'</span><span>56.56%</span><span>'</span><span>,</span> <span>'</span><span>46.85%</span><span>'</span><span>,</span> <span>'</span><span>43.94%</span><span>'</span><span>,</span> <span>'</span><span>50.99%</span><span>'</span><span>,</span> <span>'</span><span>39.23%</span><span>'</span><span>,</span> <span>'</span><span>50.82%</span><span>'</span><span>,</span> <span>'</span><span>36.51%</span><span>'</span><span>,</span> <span>'</span><span>57.34%</span><span>'</span><span>,</span> <span>'</span><span>77.98%</span><span>'</span> <span>],</span> <span>[</span> <span># e4-heart -ok </span> <span>'</span><span>43.51%</span><span>'</span><span>,</span> <span>'</span><span>46.61%</span><span>'</span><span>,</span> 
<span>'</span><span>35.82%</span><span>'</span><span>,</span> <span>'</span><span>37.20%</span><span>'</span><span>,</span> <span>'</span><span>35.88%</span><span>'</span><span>,</span> <span>'</span><span>45.71%</span><span>'</span><span>,</span> <span>'</span><span>34.51%</span><span>'</span><span>,</span> <span>'</span><span>45.73%</span><span>'</span><span>,</span> <span>'</span><span>44.16%</span><span>'</span><span>,</span> <span>'</span><span>46.1%</span><span>'</span><span>,</span> <span>'</span><span>46.15%</span><span>'</span><span>,</span> <span>'</span><span>64.18%</span><span>'</span><span>,</span> <span>'</span><span>49.22%</span><span>'</span><span>,</span> <span>'</span><span>88.1962%</span><span>'</span><span>,</span> <span>'</span><span>69.94%</span><span>'</span> <span>],</span> <span>[</span> <span># e5-mamiferos -ok </span> <span>'</span><span>0%</span><span>'</span><span>,</span> <span>'</span><span>0%</span><span>'</span><span>,</span> <span>'</span><span>0%</span><span>'</span><span>,</span> <span>'</span><span>0%</span><span>'</span><span>,</span> <span>'</span><span>0%</span><span>'</span><span>,</span> <span>'</span><span>0%</span><span>'</span><span>,</span> <span>'</span><span>0%</span><span>'</span><span>,</span> <span>'</span><span>0%</span><span>'</span><span>,</span> <span>'</span><span>0%</span><span>'</span><span>,</span> <span>'</span><span>0%</span><span>'</span><span>,</span> <span>'</span><span>0%</span><span>'</span><span>,</span> <span>'</span><span>0%</span><span>'</span><span>,</span> <span>'</span><span>1.57%</span><span>'</span><span>,</span> <span>'</span><span>0%</span><span>'</span><span>,</span> <span>'</span><span>0.20%</span><span>'</span><span>,</span> <span>'</span><span>31.20%</span><span>'</span><span>,</span> <span>'</span><span>44.80%</span><span>'</span> <span>],</span> <span>[</span> <span># e5-titanic -ok </span> <span>'</span><span>21.3244%</span><span>'</span><span>,</span> 
<span>'</span><span>22.7834%</span><span>'</span><span>,</span> <span>'</span><span>22.5589%</span><span>'</span><span>,</span> <span>'</span><span>28.3951%</span><span>'</span><span>,</span> <span>'</span><span>19.7531%</span><span>'</span><span>,</span> <span>'</span><span>16.76%</span><span>'</span><span>,</span> <span>'</span><span>27.56%</span><span>'</span><span>,</span> <span>'</span><span>14.85%</span><span>'</span><span>,</span> <span>'</span><span>7.48%</span><span>'</span><span>,</span> <span>'</span><span>10.79%</span><span>'</span><span>,</span> <span>'</span><span>27.18%</span><span>'</span><span>,</span> <span>'</span><span>61.62%</span><span>'</span><span>,</span> <span>'</span><span>26.76%</span><span>'</span><span>,</span> <span>'</span><span>38.16%</span><span>'</span><span>,</span> <span>'</span><span>38.50%</span><span>'</span> <span>]</span> <span>]</span> <span>}</span> <span># Convert the data into a DataFrame </span><span>datasets</span> <span>=</span> <span>data</span><span>[</span><span>'</span><span>Datasets</span><span>'</span><span>]</span> <span>algorithms</span> <span>=</span> <span>data</span><span>[</span><span>'</span><span>Algorithms</span><span>'</span><span>]</span> <span>performance_data</span> <span>=</span> <span>data</span><span>[</span><span>'</span><span>Performance (Error)</span><span>'</span><span>]</span> <span># Create a list of dictionaries for each dataset </span><span>rows</span> <span>=</span> <span>[]</span> <span>for</span> <span>dataset</span><span>,</span> <span>performance</span> <span>in</span> <span>zip</span><span>(</span><span>datasets</span><span>,</span> <span>performance_data</span><span>):</span> <span>row</span> <span>=</span> <span>{</span><span>'</span><span>Dataset</span><span>'</span><span>:</span> <span>dataset</span><span>}</span> <span>row</span><span>.</span><span>update</span><span>({</span><span>alg</span><span>:</span> <span>perf</span> <span>for</span> <span>alg</span><span>,</span> 
<span>perf</span> <span>in</span> <span>zip</span><span>(</span><span>algorithms</span><span>,</span> <span>performance</span><span>)})</span> <span>rows</span><span>.</span><span>append</span><span>(</span><span>row</span><span>)</span> <span># Create the DataFrame </span><span>df</span> <span>=</span> <span>pd</span><span>.</span><span>DataFrame</span><span>(</span><span>rows</span><span>)</span> <span># Convert string percentages to floats </span><span>for</span> <span>alg</span> <span>in</span> <span>algorithms</span><span>:</span> <span>df</span><span>[</span><span>alg</span><span>]</span> <span>=</span> <span>df</span><span>[</span><span>alg</span><span>].</span><span>str</span><span>.</span><span>replace</span><span>(</span><span>'</span><span>,</span><span>'</span><span>,</span> <span>'</span><span>.</span><span>'</span><span>).</span><span>str</span><span>.</span><span>rstrip</span><span>(</span><span>'</span><span>%</span><span>'</span><span>).</span><span>astype</span><span>(</span><span>float</span><span>)</span> <span>/</span> <span>100</span> <span># Calculate the ranking of each algorithm for each dataset </span><span>rankings_matrix</span> <span>=</span> <span>df</span><span>[</span><span>algorithms</span><span>].</span><span>rank</span><span>(</span><span>axis</span><span>=</span><span>1</span><span>,</span> <span>method</span><span>=</span><span>'</span><span>min</span><span>'</span><span>,</span> <span>ascending</span><span>=</span><span>True</span><span>)</span> <span># Format the results </span><span>formatted_results</span> <span>=</span> <span>df</span><span>[</span><span>algorithms</span><span>].</span><span>copy</span><span>()</span> <span>for</span> <span>col</span> <span>in</span> <span>formatted_results</span><span>.</span><span>columns</span><span>:</span> <span>formatted_results</span><span>[</span><span>col</span><span>]</span> <span>=</span> 
<span>formatted_results</span><span>[</span><span>col</span><span>].</span><span>round</span><span>(</span><span>3</span><span>).</span><span>astype</span><span>(</span><span>str</span><span>)</span> <span>+</span> <span>"</span><span> (</span><span>"</span> <span>+</span> <span>rankings_matrix</span><span>[</span><span>col</span><span>].</span><span>astype</span><span>(</span><span>int</span><span>).</span><span>astype</span><span>(</span><span>str</span><span>)</span> <span>+</span> <span>"</span><span>)</span><span>"</span> <span># Add a row for the sum of ranks and average of ranks </span><span>sum_ranks</span> <span>=</span> <span>rankings_matrix</span><span>.</span><span>sum</span><span>().</span><span>round</span><span>(</span><span>3</span><span>).</span><span>rename</span><span>(</span><span>'</span><span>Sum Ranks</span><span>'</span><span>)</span> <span>average_ranks</span> <span>=</span> <span>rankings_matrix</span><span>.</span><span>mean</span><span>().</span><span>round</span><span>(</span><span>3</span><span>).</span><span>rename</span><span>(</span><span>'</span><span>Average Ranks</span><span>'</span><span>)</span> <span># Add the rows to the formatted DataFrame using concat </span><span>formatted_results</span> <span>=</span> <span>pd</span><span>.</span><span>concat</span><span>([</span><span>formatted_results</span><span>,</span> <span>sum_ranks</span><span>.</span><span>to_frame</span><span>().</span><span>T</span><span>,</span> <span>average_ranks</span><span>.</span><span>to_frame</span><span>().</span><span>T</span><span>])</span> <span># Add the 'Dataset' column to the formatted DataFrame </span><span>formatted_results</span><span>.</span><span>insert</span><span>(</span><span>0</span><span>,</span> <span>'</span><span>Dataset</span><span>'</span><span>,</span> <span>df</span><span>[</span><span>'</span><span>Dataset</span><span>'</span><span>].</span><span>tolist</span><span>()</span> <span>+</span> <span>[</span><span>'</span><span>Sum 
Ranks</span><span>'</span><span>,</span> <span>'</span><span>Average Ranks</span><span>'</span><span>])</span> <span># Display the table </span><span>print</span><span>(</span><span>"</span><span>Error Table (%) with Ranking:</span><span>"</span><span>)</span> <span>print</span><span>(</span><span>formatted_results</span><span>)</span> <span># Save the formatted table as an image </span><span>fig</span><span>,</span> <span>ax</span> <span>=</span> <span>plt</span><span>.</span><span>subplots</span><span>(</span><span>figsize</span><span>=</span><span>(</span><span>14</span><span>,</span> <span>8</span><span>))</span> <span>ax</span><span>.</span><span>axis</span><span>(</span><span>'</span><span>tight</span><span>'</span><span>)</span> <span>ax</span><span>.</span><span>axis</span><span>(</span><span>'</span><span>off</span><span>'</span><span>)</span> <span>table</span> <span>=</span> <span>ax</span><span>.</span><span>table</span><span>(</span><span>cellText</span><span>=</span><span>formatted_results</span><span>.</span><span>values</span><span>,</span> <span>colLabels</span><span>=</span><span>formatted_results</span><span>.</span><span>columns</span><span>,</span> <span>cellLoc</span><span>=</span><span>'</span><span>center</span><span>'</span><span>,</span> <span>loc</span><span>=</span><span>'</span><span>center</span><span>'</span><span>)</span> <span>table</span><span>.</span><span>auto_set_font_size</span><span>(</span><span>False</span><span>)</span> <span>table</span><span>.</span><span>set_fontsize</span><span>(</span><span>12</span><span>)</span> <span>table</span><span>.</span><span>scale</span><span>(</span><span>2.5</span><span>,</span> <span>2.5</span><span>)</span> <span>plt</span><span>.</span><span>subplots_adjust</span><span>(</span><span>left</span><span>=</span><span>0.2</span><span>,</span> <span>bottom</span><span>=</span><span>0.2</span><span>,</span> <span>right</span><span>=</span><span>0.8</span><span>,</span> 
<span>top</span><span>=</span><span>1</span><span>,</span> <span>wspace</span><span>=</span><span>0.2</span><span>,</span> <span>hspace</span><span>=</span><span>0.2</span><span>)</span> <span>plt</span><span>.</span><span>savefig</span><span>(</span><span>'</span><span>table_with_rankings.png</span><span>'</span><span>,</span> <span>format</span><span>=</span><span>"</span><span>png</span><span>"</span><span>,</span> <span>bbox_inches</span><span>=</span><span>"</span><span>tight</span><span>"</span><span>,</span> <span>dpi</span><span>=</span><span>300</span><span>)</span> <span>plt</span><span>.</span><span>show</span><span>()</span> <span>print</span><span>(</span><span>"</span><span>Table saved as </span><span>'</span><span>table_with_rankings.png</span><span>'"</span><span>)</span> <span># Perform the Friedman Test </span><span>friedman_stat</span><span>,</span> <span>p_value</span> <span>=</span> <span>friedmanchisquare</span><span>(</span><span>*</span><span>rankings_matrix</span><span>.</span><span>T</span><span>.</span><span>values</span><span>)</span> <span>print</span><span>(</span><span>f</span><span>"</span><span>Friedman test statistic: </span><span>{</span><span>friedman_stat</span><span>}</span><span>, p-value = </span><span>{</span><span>p_value</span><span>}</span><span>"</span><span>)</span> <span># Convert the accuracy matrix into a NumPy array for the critical difference diagram </span><span>scores</span> <span>=</span> <span>df</span><span>[</span><span>algorithms</span><span>].</span><span>values</span> <span>classifiers</span> <span>=</span> <span>df</span><span>[</span><span>algorithms</span><span>].</span><span>columns</span><span>.</span><span>tolist</span><span>()</span> <span>print</span><span>(</span><span>"</span><span>Algorithms:</span><span>"</span><span>,</span> <span>classifiers</span><span>)</span> <span>print</span><span>(</span><span>"</span><span>Errors:</span><span>"</span><span>,</span> <span>scores</span><span>)</span> 
<span># Set the figure size before plotting </span><span>plt</span><span>.</span><span>figure</span><span>(</span><span>figsize</span><span>=</span><span>(</span><span>16</span><span>,</span> <span>12</span><span>))</span> <span># Adjust the figure size as needed </span> <span># Generate the critical difference diagram </span><span>plot_critical_difference</span><span>(</span> <span>scores</span><span>,</span> <span>classifiers</span><span>,</span> <span>lower_better</span><span>=</span><span>True</span><span>,</span> <span>test</span><span>=</span><span>'</span><span>wilcoxon</span><span>'</span><span>,</span> <span># or nemenyi </span> <span>correction</span><span>=</span><span>'</span><span>holm</span><span>'</span><span>,</span> <span># or bonferroni or none </span><span>)</span> <span># Get the current axes </span><span>ax</span> <span>=</span> <span>plt</span><span>.</span><span>gca</span><span>()</span> <span># Adjust font size and rotation of x-axis labels </span><span>for</span> <span>label</span> <span>in</span> <span>ax</span><span>.</span><span>get_xticklabels</span><span>():</span> <span>label</span><span>.</span><span>set_fontsize</span><span>(</span><span>14</span><span>)</span> <span>label</span><span>.</span><span>set_rotation</span><span>(</span><span>45</span><span>)</span> <span>label</span><span>.</span><span>set_horizontalalignment</span><span>(</span><span>'</span><span>right</span><span>'</span><span>)</span> <span># Increase padding between labels and axis </span><span>ax</span><span>.</span><span>tick_params</span><span>(</span><span>axis</span><span>=</span><span>'</span><span>x</span><span>'</span><span>,</span> <span>which</span><span>=</span><span>'</span><span>major</span><span>'</span><span>,</span> <span>pad</span><span>=</span><span>20</span><span>)</span> <span># Adjust margins to provide more space for labels 
</span><span>plt</span><span>.</span><span>subplots_adjust</span><span>(</span><span>bottom</span><span>=</span><span>0.35</span><span>)</span> <span># Optionally adjust y-axis label font size </span><span>ax</span><span>.</span><span>tick_params</span><span>(</span><span>axis</span><span>=</span><span>'</span><span>y</span><span>'</span><span>,</span> <span>labelsize</span><span>=</span><span>12</span><span>)</span> <span># Save and display the plot </span><span>plt</span><span>.</span><span>savefig</span><span>(</span><span>'</span><span>critical_difference_diagram.png</span><span>'</span><span>,</span> <span>format</span><span>=</span><span>"</span><span>png</span><span>"</span><span>,</span> <span>bbox_inches</span><span>=</span><span>"</span><span>tight</span><span>"</span><span>,</span> <span>dpi</span><span>=</span><span>300</span><span>)</span> <span>plt</span><span>.</span><span>show</span><span>()</span>import pandas as pd import numpy as np import matplotlib.pyplot as plt from scipy.stats import friedmanchisquare from aeon.visualisation import plot_critical_difference data = { 'Datasets': [ 'MNIST', 'Fashion-MNIST', 'e1-Spiral', 'e1-Android', 'e2-andorinhas', 'e2-chinese', 'e3-user', 'e3-ecommerce', 'e4-wine', 'e4-heart', 'e5-mamiferos', 'e5-titanic' ], 'Algorithms': [ 'NaiveBayes', 'IBk', 'J48', 'RandomForest', 'LMT', 'XGBoost', 'SVM', 'LGBM', 'Bagging', 'AdaBoost', 'KStar', 'M5P', 'MLP', 'HC', 'E-M' ], 'Performance (Error)': [ [ # MNIST -ok '30.34%', '3.09%', '10.67%', '3.51%', '5.70%', '2.05%', '2.61%', '2.26%', '5.07%', '11.74%', '89.80%', '47.31%', '0%', '44.96%', '88.65%' ], [ # Fashion-MNIST -ok '36.72%', '14.35%', '18.27%', '11.92%', '13.62%', '8.58%', '9.47%', '9.50%', '12.20%', '19.00%', '90%', '46.78%', '0.52%', '51.45%', '90%' ], [ # e1-Spiral -ok '29.125%', '0.38%', '2.25%', '1.75%', '2.375%', '3.12%', '1.88%', '3.12%', '0%', '4.37%', '5.51%', '1.43%', '0%', '49.75%', '72.50%' ], [ # e1-Android -ok '8.1317%', '7.7285%', '4.7491%', '4.5475%', 
'4.3683%', '4.03%', '4.37%', '3.47%', '6.38%', '5.60%', '8.94%', '7.61%', '1.67%', '49.98%', '38.95%' ], [ '8.1%', '6.65%', '5.60%', '4.90%', '4.60%', '4.00%', '3.75%', '4.25%', '3.50%', '4.75%', '4.36%', '2.85%', '3.92%', '48.60%', '49.25%' ], [ # e2-chinese -ok '27.1589%', '12.8911%', '34.9186%', '7.5094%', '10.6383%', '7.50%', '6.25%', '5.63%', '16.25%', '34.38%', '0%', '1.21%', '0%', '87.36%', '78.22%' ], [ '0%', '4.8571%', '0%', '0%', '0%', '0.1429%', '2.14%', '0%', '0%', '0%', '0%', '0%', '0%', '0.39%', '0%', '79.14%', '4.57%' ], [ # e3-ecommerce -ok '11.37%', '11.15%', '2.39%', '2.07%', '2.42%', '0.90%', '8.80%', '0.70%', '10.35%', '2.85%', '0.02%', '7.56%', '3.96%', '22.11%', '41.49%' ], [ # e4-wine -ok '44.96%', '35.21%', '38.59%', '29.89%', '39.65%', '48.95%', '56.56%', '46.85%', '43.94%', '50.99%', '39.23%', '50.82%', '36.51%', '57.34%', '77.98%' ], [ # e4-heart -ok '43.51%', '46.61%', '35.82%', '37.20%', '35.88%', '45.71%', '34.51%', '45.73%', '44.16%', '46.1%', '46.15%', '64.18%', '49.22%', '88.1962%', '69.94%' ], [ # e5-mamiferos -ok '0%', '0%', '0%', '0%', '0%', '0%', '0%', '0%', '0%', '0%', '0%', '0%', '1.57%', '0%', '0.20%', '31.20%', '44.80%' ], [ # e5-titanic -ok '21.3244%', '22.7834%', '22.5589%', '28.3951%', '19.7531%', '16.76%', '27.56%', '14.85%', '7.48%', '10.79%', '27.18%', '61.62%', '26.76%', '38.16%', '38.50%' ] ] } # Convert the data into a DataFrame datasets = data['Datasets'] algorithms = data['Algorithms'] performance_data = data['Performance (Error)'] # Create a list of dictionaries for each dataset rows = [] for dataset, performance in zip(datasets, performance_data): row = {'Dataset': dataset} row.update({alg: perf for alg, perf in zip(algorithms, performance)}) rows.append(row) # Create the DataFrame df = pd.DataFrame(rows) # Convert string percentages to floats for alg in algorithms: df[alg] = df[alg].str.replace(',', '.').str.rstrip('%').astype(float) / 100 # Calculate the ranking of each algorithm for each dataset 
rankings_matrix = df[algorithms].rank(axis=1, method='min', ascending=True) # Format the results formatted_results = df[algorithms].copy() for col in formatted_results.columns: formatted_results[col] = formatted_results[col].round(3).astype(str) + " (" + rankings_matrix[col].astype(int).astype(str) + ")" # Add a row for the sum of ranks and average of ranks sum_ranks = rankings_matrix.sum().round(3).rename('Sum Ranks') average_ranks = rankings_matrix.mean().round(3).rename('Average Ranks') # Add the rows to the formatted DataFrame using concat formatted_results = pd.concat([formatted_results, sum_ranks.to_frame().T, average_ranks.to_frame().T]) # Add the 'Dataset' column to the formatted DataFrame formatted_results.insert(0, 'Dataset', df['Dataset'].tolist() + ['Sum Ranks', 'Average Ranks']) # Display the table print("Error Table (%) with Ranking:") print(formatted_results) # Save the formatted table as an image fig, ax = plt.subplots(figsize=(14, 8)) ax.axis('tight') ax.axis('off') table = ax.table(cellText=formatted_results.values, colLabels=formatted_results.columns, cellLoc='center', loc='center') table.auto_set_font_size(False) table.set_fontsize(12) table.scale(2.5, 2.5) plt.subplots_adjust(left=0.2, bottom=0.2, right=0.8, top=1, wspace=0.2, hspace=0.2) plt.savefig('table_with_rankings.png', format="png", bbox_inches="tight", dpi=300) plt.show() print("Table saved as 'table_with_rankings.png'") # Perform the Friedman Test friedman_stat, p_value = friedmanchisquare(*rankings_matrix.T.values) print(f"Friedman test statistic: {friedman_stat}, p-value = {p_value}") # Convert the accuracy matrix into a NumPy array for the critical difference diagram scores = df[algorithms].values classifiers = df[algorithms].columns.tolist() print("Algorithms:", classifiers) print("Errors:", scores) # Set the figure size before plotting plt.figure(figsize=(16, 12)) # Adjust the figure size as needed # Generate the critical difference diagram plot_critical_difference( scores, 
classifiers, lower_better=True, test='wilcoxon', # or nemenyi correction='holm', # or bonferroni or none ) # Get the current axes ax = plt.gca() # Adjust font size and rotation of x-axis labels for label in ax.get_xticklabels(): label.set_fontsize(14) label.set_rotation(45) label.set_horizontalalignment('right') # Increase padding between labels and axis ax.tick_params(axis='x', which='major', pad=20) # Adjust margins to provide more space for labels plt.subplots_adjust(bottom=0.35) # Optionally adjust y-axis label font size ax.tick_params(axis='y', labelsize=12) # Save and display the plot plt.savefig('critical_difference_diagram.png', format="png", bbox_inches="tight", dpi=300) plt.show()
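If you want to build intuition for what `friedmanchisquare` reports before running the full script, here is a minimal sketch on a made-up error matrix (3 hypothetical algorithms across 4 datasets; the numbers are invented for illustration, not taken from the tables above):

```python
import numpy as np
from scipy.stats import friedmanchisquare

# Hypothetical error rates: rows = datasets, columns = algorithms A, B, C.
# A is consistently best, C consistently worst.
errors = np.array([
    [0.10, 0.20, 0.30],
    [0.12, 0.25, 0.28],
    [0.08, 0.22, 0.35],
    [0.11, 0.19, 0.33],
])

# friedmanchisquare takes one sample per algorithm (its measurements across datasets)
stat, p = friedmanchisquare(*errors.T)
print(f"statistic={stat:.3f}, p-value={p:.4f}")
```

Because the per-dataset rankings are perfectly consistent here, the statistic is at its maximum for this shape (8.0) and the p-value falls below 0.05, so we would reject the hypothesis that all three algorithms perform the same.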
Using accuracy instead of error rate
The code gist already includes a full working script that uses accuracy as the performance metric.
If you’re already using error percentages in the `data` variable, you can add this last line inside the conversion loop to turn the values into accuracy:
```python
# Convert string percentages to floats
for alg in algorithms:
    df[alg] = df[alg].str.replace(",", ".").str.rstrip("%").astype(float) / 100
    # Convert error rate to accuracy
    df[alg] = 1 - df[alg]
```
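As a quick sanity check, here is that same conversion applied to a tiny made-up DataFrame (the algorithm names and values are purely illustrative):

```python
import pandas as pd

# Hypothetical toy data: error rates stored as strings with comma decimal separators
df = pd.DataFrame({"SVM": ["12,5%", "8,0%"], "RF": ["10,0%", "9,5%"]})
algorithms = ["SVM", "RF"]

for alg in algorithms:
    # "12,5%" -> "12.5" -> 0.125
    df[alg] = df[alg].str.replace(",", ".").str.rstrip("%").astype(float) / 100
    # Error rate -> accuracy
    df[alg] = 1 - df[alg]

print(df)
```

An error rate of 12,5% becomes an accuracy of 0.875, and so on for the other cells.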
Then change the rankings_matrix calculation to set rank’s ascending value to False:

```python
# Calculate the ranking of each algorithm for each dataset (higher accuracy = better rank)
rankings_matrix = df[algorithms].rank(axis=1, method="min", ascending=False)
```
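To see the effect of ascending=False, here is the ranking on a small made-up accuracy table (names and numbers are illustrative only): the highest accuracy in each row receives rank 1.

```python
import pandas as pd

# Hypothetical accuracies: one row per dataset, one column per algorithm
df = pd.DataFrame({"SVM": [0.90, 0.85], "RF": [0.88, 0.91], "kNN": [0.80, 0.79]})
algorithms = ["SVM", "RF", "kNN"]

# ascending=False: the highest accuracy in each row gets rank 1
rankings_matrix = df[algorithms].rank(axis=1, method="min", ascending=False)
print(rankings_matrix)
```

With ascending=True (the default used for error rates), the same call would instead give rank 1 to the lowest value in each row.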
Finally, when plotting the Critical Difference diagram, change the lower_better value to False:

```python
plot_critical_difference(
    scores,
    classifiers,
    lower_better=False,  # False for accuracy (higher is better)
    test="wilcoxon",
    correction="holm",
)
```
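If you want to run the Friedman test itself before plotting, SciPy’s friedmanchisquare takes one sequence of scores per algorithm. The values below are made up for illustration:

```python
from scipy.stats import friedmanchisquare

# Hypothetical error rates of three algorithms on the same five datasets
alg_a = [0.10, 0.12, 0.08, 0.15, 0.11]
alg_b = [0.20, 0.25, 0.18, 0.22, 0.19]
alg_c = [0.15, 0.16, 0.14, 0.18, 0.17]

stat, p_value = friedmanchisquare(alg_a, alg_b, alg_c)
print(f"statistic={stat:.3f}, p-value={p_value:.4f}")
```

A p-value below 0.05 suggests that at least one algorithm performs significantly differently, which is when the Critical Difference diagram becomes meaningful.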
Conclusion
I hope this guide helps someone out 🙂 If you have any suggestions or questions, feel free to leave a comment or reach out, and I’ll do my best to get back to you!
Original article: Comparing Machine Learning Algorithms Using Friedman Test and Critical Difference Diagrams in Python