Automatic Reporting in Python (3 Part Series)
1 Automatic Reporting in Python – Part 1: From Planning to Hello World
2 Automatic Reporting in Python – Part 2: From Hello World to Real Insights
3 Automatic Reporting in Python – Part 3: Packaging It Up
As outlined in my previous posts (Part I and Part II available on this fine website), the goal of this project is to make an automatic reporting tool.
In this series of guides, the outcome I’m shooting for is a single HTML page that allows me to interrogate and compare the output of machine learning models.
At the conclusion of the previous tutorial, the reporting tool was actually showing some genuine use! It could accept a number of .csv summary files from machine learning models and output a single .html page that presented the information in a form that was… functional.
There are three main features that I’d like to add to the tool for now:
- The report looks incredibly dull. Readability counts! We need to improve the a e s t h e t i c s of the report.
- It’s hard to dig into the tables – currently it’s just 100 rows presented with no tools to search. Some basic search functionality would be grouse.
- The datasets that we want to run the report with are currently hard-coded into the script. This needs to be split out to make the tool slightly more flexible.
A brief comment
Compared to the previous posts, this post is much more of an exploratory, learning experience – this post represents a neophyte’s attempt to build a working tool, rather than a beautiful depiction of all that is possible. If you can see a much better approach for anything listed in this post, please feel free to share it with myself and all the other readers!
But without further ado, let’s take a crack at improving the aesthetics.
Step Seven – Improving the Aesthetics
Those of you with an understanding of HTML pages might know what comes next: Cascading Style Sheets, known more commonly as CSS.
Taking your first steps down a path
As indicated above, the challenge in the context of this tutorial is that this is an enormous field that I personally am not actually particularly well-versed in, having only dabbled and hacked in this space. However, I am familiar with learning new things.
So, if this is your first real introduction to CSS, let me take you down the same path I would recommend in learning any new tech:
- Do some background reading on the fundamentals of CSS. Hit up Wikipedia. If you’re keen, hit up the CSS standard! Never be afraid to Google “simple introduction to [topic].”
- As you read and explore, make note of potential good resources of future information. I would encourage everyone to take a look for an “Awesome List” relevant to your topic – and in this case, the Awesome CSS List is here.
- With a basic understanding of what the hell is going on under your belt, play around to your heart’s content with local files and implement as much as you can in local scratch files. In this case, create small (or large?) HTML pages and figure out how to structure the CSS neatly. I find this really helps to get to know some of the practical realities and challenges of working with the language before I start diving into frameworks.
Being a bit more mercenary
For self improvement, I find nothing beats spending the time working up solutions from scratch. Of course, if you’re trying to hack together a solution for a business need, it may not behoove you to spend hours and days coming up with an elegant CSS framework from the beginning.
Instead, we may like to quickly jump off of someone else’s CSS framework. Fortunately, there are a number of these available, with a great number of them focusing on being ‘minimal’. A quick Google search should get you started down this path.
Integration
How is the CSS file to be integrated into our report? There are a few options that are typically at play:
- Download the CSS file and keep it as an external file. The advantage is that we have our .html and .css files neatly separated, and we have full control over both; the far more significant disadvantage is that now if we want to move our report around, we have to drag a bunch of .css files around with it.
- Use a Content Delivery Network (CDN) copy of the CSS file. Most frameworks will offer a CDN link for their file: this is essentially a link to an efficient, readily available copy of the data. The advantage is that you can get CSS up and going in your page just by dropping a single link in the <head> section of your .html, no mussin’ and fussin’ with local files. The disadvantage is that you don’t have control of the file, and an internet connection is required.
- A slightly more complex option is to have local copies of the CSS file, and then write them into the .html file. This could probably be done relatively easily and sustainably if we got clever with our templating. The advantage is that we’d have our report in a single file, and it wouldn’t require an internet connection to use; the disadvantage is that it is going to require a bit more effort to get set up. (This is a commonly used approach when creating standalone versions of interactive pages. Export a Jupyter Notebook to .html, inspect the file, and you’ll find all the CSS and JavaScript magic packaged up in the <head> section.)
At this early stage of prototyping, I prefer to use CDNs if possible. The advantage of being able to swap CSS frameworks just by changing a single line of code and not having to bother with local files is worth the cost of not being able to play with and edit the framework. Optimisation (in the form of being able to automatically integrate the CSS into the .html file) can come a little later.
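For a rough idea of what that later optimisation might look like, here is a minimal sketch that writes CSS text straight into a <style> block in the page. It uses the standard library’s string.Template rather than the project’s Jinja2 templates, and the PAGE skeleton and the build_standalone_page function are hypothetical names made up for this illustration.

```python
from string import Template

# Hypothetical page skeleton. The real project renders templates/report.html
# with Jinja2; string.Template is used here only to keep the sketch stdlib-only.
PAGE = Template("""<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="UTF-8">
<title>$title</title>
<style>
$css
</style>
</head>
<body>
$body
</body>
</html>
""")


def build_standalone_page(title, body, css_texts):
    """Return a complete HTML page with every CSS string inlined in <head>."""
    css = "\n".join(css_texts)
    return PAGE.substitute(title=title, css=css, body=body)


# In practice each CSS string could be read from a local copy of the
# framework, e.g. pathlib.Path("milligram.min.css").read_text().
page = build_standalone_page(
    "Model Report",
    "<h1>Model Report</h1>",
    ["body { color: #333; }"],
)
print(page)
```

The resulting file carries its styling with it, at the cost of a larger .html and a manual step whenever the framework is updated.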
To start with, I’m going to use Milligram, a lightweight little framework. To get started using the CDN method, I simply follow the instructions Milligram provides for its CDN links. In our templates/report.html file, I’ll add the requisite links into the <head> section:
report.html
<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="UTF-8">
<title>{{ title }}</title>
<link rel="stylesheet" href="//fonts.googleapis.com/css?family=Roboto:300,300italic,700,700italic">
<link rel="stylesheet" href="//cdn.rawgit.com/necolas/normalize.css/master/normalize.css">
<link rel="stylesheet" href="//cdn.rawgit.com/milligram/milligram/master/dist/milligram.min.css">
</head>
<!-- body section continues below... -->
And all of a sudden, our plain, early 90’s looking webpage has been transformed into something a bit more pleasing to the eye:
But we note there’s something still not quite right here – primarily, why does the page (and the table in particular) always take up the whole width of the window? Why doesn’t this look right on mobile?
It turns out just adding a bunch of .css files isn’t quite enough. We need to make sure the layout of our .html pages matches what’s expected by the .css layout.
HTML layouts
Like most topics in this space, the layout of your HTML page is a reasonably intuitive concept, while simultaneously being a problem that you can spend years diving into. What makes it a bit more challenging is that despite there being a number of somewhat fragmentary explanations, I’ve struggled to find a simple and/or holistic introduction to the field (although this explanation is currently my favourite gentle introduction, and the Mozilla guide appears to be quite thorough).
For brevity, I’m going to leave most of the further reading to you, the reader (sorry!), and instead focus on what Milligram expects.
If we inspect the code of the Milligram page, we’ll see that within the <body>, the HTML of the site is structured roughly as:
<body>
<main class="wrapper">
<header class="header">
<section class="container">
<section class="container">
Now, based on some of the readings in the above links, and assuming that this is the structure that Milligram is expecting, we can apply the same structure to our own report, giving us something like the following:
report.html
<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="UTF-8">
<!-- Note the addition of the viewport! -->
<meta name="viewport" content="width=device-width, initial-scale=1.0, minimal-ui">
<title>{{ title }}</title>
<link rel="stylesheet" href="//fonts.googleapis.com/css?family=Roboto:300,300italic,700,700italic">
<link rel="stylesheet" href="//cdn.rawgit.com/necolas/normalize.css/master/normalize.css">
<link rel="stylesheet" href="//cdn.rawgit.com/milligram/milligram/master/dist/milligram.min.css">
</head>
<body>
<main class="wrapper">
<header class="header">
<section class="container">
<h1>{{ title }}</h1>
<p>This report was automatically generated.</p>
</section>
</header>
{% for section in sections %}
{{ section }}
{% endfor %}
</main>
</body>
</html>
summary_section.html
<section class="container" id="summary">
<h2>Quick summary</h2>
<h3>Accuracy</h3>
{% for model_results in model_results_list %}
<p><em>{{ model_results.model_name }}</em> analysed <em>{{ model_results.number_of_images }} image(s)</em>, achieving an
accuracy of <em>{{ "{:.2%}".format(model_results.accuracy) }}.</em></p>
{% endfor %}
<h3>Trouble spots</h3>
{% for model_results in model_results_list %}
<p><em>{{ model_results.model_name }}</em> misidentified <em>{{ model_results.number_misidentified }} image(s)</em>.</p>
{% endfor %}
<p><em>{{ number_misidentified }}</em> misidentified image(s) were common to all models.</p>
</section>
table_section.html
<section class="container" id="{{ model }}">
<h2>{{ model }} - Model Results</h2>
<p>Results for each image as predicted by model <i>'{{ model }}'</i>, as captured in file <i>'{{ dataset }}'</i>.</p>
{{ table }}
</section>
We had already structured this report to be a collection of largely independent sections – we were even using this terminology! – so it’s not a huge drama to add the <section> tags to the system.
Proof that this works
Run autoreporting.py and inspect the report – try it both at full-screen and simulating a mobile screen.
Progress! That’s the benefit of using a well-made responsive layout.
GitHub status
Oooft. That was a lot of background reading for not a huge amount of code. All the same, the project should look like this.
Step Eight – Making Our Tables Interactive
Okay. So now we have our tables, and they look pretty good – the challenge is that they’re not interactive. For instance, it would be wonderful to have the functionality to filter by a category, or drill down to a specific image.
Now, as with exploring CSS, we have a couple of options. Certainly, we can explore the option of creating all of this ourselves – there are plenty of examples around, and they’re not terribly difficult. But, if we’re being pragmatic (or pressed by business needs!) we can probably find a pre-built package of what we need.
After a bit of googling, I came across DataTables – this is a plugin for jQuery, a very common JavaScript library. DataTables looks like it covers most of the functionality we need, and provides a wealth of extensions and plugins for any of the functionality we don’t have. All in all, a promising candidate.
Implementing DataTables
Fortunately, it turns out that implementing DataTables is relatively straightforward. From the front page of the DataTables site, we can see that the general principles are that we need to:
- Load the jQuery .js file first. This isn’t spelled out explicitly, but DataTables is a jQuery plugin.
- Load the DataTables .js and .css files.
- Finally, call the DataTables function, pointing it at the HTML id of the table we want to add the functionality to.
That’s all relatively straightforward, with just some very solvable wrinkles:
- Our tables don’t have id attributes to refer to.
- We need a way to call the DataTables function and point it at the id of the table, in a way that fits with our templating system.
Let’s address these one by one.
Importing the relevant files
Before we get to our wrinkles, let’s hit our basics. As we imported our CSS files from CDNs previously, we’ll do the same for jQuery and DataTables.
Let’s update the <head> section of templates/report.html and add the links:
report.html
<!-- More above! -->
<head>
<meta charset="UTF-8">
<meta name="viewport" content="width=device-width, initial-scale=1.0">
<title>{{ title }}</title>
<link rel="stylesheet" href="//fonts.googleapis.com/css?family=Roboto:300,300italic,700,700italic">
<link rel="stylesheet" href="//cdn.rawgit.com/necolas/normalize.css/master/normalize.css">
<link rel="stylesheet" href="//cdn.rawgit.com/milligram/milligram/master/dist/milligram.min.css">
<link rel="stylesheet" href="//cdn.datatables.net/1.10.19/css/jquery.dataTables.min.css">
<script src="https://code.jquery.com/jquery-3.3.1.slim.min.js"></script>
<script src="//cdn.datatables.net/1.10.19/js/jquery.dataTables.min.js"></script>
</head>
<!-- More below! -->
Adding id attributes to the tables
We need to add an id attribute to the <table> objects in our report. How can we do this?
Well, let’s work backwards from the templates/table_section.html file. Within that file, we note that we insert our fully-formed HTML tables via the {{ table }} insert.
The {{ table }} insert is generated in autoreporting.py when we call the get_results_df_as_html method of the ModelResults class. This method takes the pandas DataFrame and converts it into a string of HTML using the DataFrame.to_html function.
If we inspect the docs for that function, we see that there’s an optional argument table_id. Ah yeah, cool man! If we pass the model name as that argument, the HTML table will be generated with the id that we want. The ModelResults class already has model_name as an attribute, so we can include that:
class ModelResults:
    # ...
    def get_results_df_as_html(self):
        """
        Return the results DataFrame as an HTML object.
        :return: String of HTML.
        """
        html = self.df_results.to_html(table_id=self.model_name)
        return html
You can run autoreporting.py and inspect the tables that are generated to confirm that they do indeed have the model name as an id.
Easy! Just a matter of tracing it back from the end result of HTML to the actual source within the code.
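One wrinkle worth flagging: the model name is used as-is, which works for names like VGG19 but could trip up the '#id' selector if a name ever contained spaces or dots. Here is a minimal sketch of one way to guard against that, assuming a hypothetical helper called make_table_id (it is not part of the project). The same cleaned-up value would need to be used both for table_id and in the selector.

```python
import re


def make_table_id(model_name):
    """Turn a model name into a string that is safe as an HTML id and in a '#id' selector."""
    # Replace each run of characters other than letters, digits, '_' and '-' with '-'.
    safe = re.sub(r"[^A-Za-z0-9_-]+", "-", model_name).strip("-")
    # Prefix ids that are empty or start with a digit, because a selector
    # such as '#3DNet' is not valid CSS.
    if not safe or safe[0].isdigit():
        safe = "table-" + safe
    return safe


print(make_table_id("VGG19"))           # VGG19
print(make_table_id("MobileNet v2.1"))  # MobileNet-v2-1
print(make_table_id("3DNet"))           # table-3DNet
```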
Calling the DataTables function
We need to call the DataTables function listed above, pointing it at the appropriate id that we’ve just generated. The actual code to call the function is pretty straightforward. The question is: where do we put it?
This challenge has a couple of bounding constraints on it.
- We need to call the function to generate the DataTable using the model name – something like this:
$(document).ready( function () {
$('#VGG19').DataTable();
} );
- Traditionally, JavaScript is placed in the <head> section, although this is a somewhat controversial discussion. Note that this is tradition – it will actually run anywhere.
This is a challenge because of the structuring of our template. We want to place the JavaScript call for DataTables in <head>, which is in templates/report.html. Right now, when report.html is rendered under the main() function in autoreporting.py, it doesn’t know anything about the model names: the render() call only has arguments for title, the overall title of the report, and sections, a list of pre-rendered strings of HTML ready to be inserted into the document. We just need to modify autoreporting.py to pass in the model names and tweak report.html accordingly. We tweak our files thusly:
autoreporting.py
def main():
    # ...
    # Produce and write the report to file
    with open("outputs/report.html", "w") as f:
        f.write(base_template.render(
            title=title,
            sections=sections,
            model_results_list=[vgg19_results, mobilenet_results]
        ))
report.html
<head>
<!-- Lots of calls above... -->
<script>
{% for model_results in model_results_list %}
$(document).ready(function() {
$('#{{ model_results.model_name }}').DataTable();
} );
{% endfor %}
</script>
</head>
Bingo-bango: when we render report.html, the calls to enable the DataTables functionality for each existing table are included. Nice!
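To make the rendered result a bit more concrete, here is a rough Python sketch of the kind of text the template loop produces. The datatables_init_script function is a hypothetical helper, not part of the project, and unlike the template above it wraps all of the calls in a single ready handler rather than one per table.

```python
def datatables_init_script(table_ids):
    """Build a <script> block that enables DataTables on each table id."""
    calls = "\n".join(
        f"    $('#{table_id}').DataTable();" for table_id in table_ids
    )
    return (
        "<script>\n"
        "$(document).ready(function () {\n"
        f"{calls}\n"
        "});\n"
        "</script>"
    )


script = datatables_init_script(["VGG19", "MobileNet"])
print(script)
```

Either form should behave the same in the browser; the single handler is just a little less repetitive when there are many tables.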
And now, if we run autoreporting.py and inspect the output, we get something like this:
We can now order our tables, search for categories, filter by image names – we have some rich functionality available, with more available via extensions.
This is a huge advantage for a report! Imagine if you only wanted to check out the common themes of incorrect image categorisations, or rapidly narrow down on a specific image.
GitHub status
Your repo should look a little something like this.
Step Nine – Packaging it up
AKA Step the Last.
The good news: we have the functionality we want and need. We can take a .csv file or two and punch out an interactive report.
The bad news: we’ve hardcoded it to two files, VGG19_results.csv and MobileNet_results.csv, which limits the functionality.
The final step for this exploration is therefore to turn this hard-coded script into a tool that can be called from the command line. We want to be able to call the script with an arbitrary number of .csv files and have the report spat out. So if we called our script and specified the relevant .csv files, we’d get a report successfully written to the /outputs folder – looking a little like this on the command line:
$ python autoreporting.py VGG19_results.csv MobileNet_results.csv
Successfully wrote "report.html" to folder "outputs".
This can be accomplished by utilising command line arguments – or, to put it rather simply, the words that follow the call to Python. (In the example above, the first argument is autoreporting.py, our script. The second and third command line arguments are VGG19_results.csv and MobileNet_results.csv, respectively.) We have a couple of main ways we can approach this:
- We can crunch the arguments manually, using sys.argv. There’s absolutely nothing wrong with this approach; sys.argv is really quite simple to use.
- We can use a parser like argparse, primarily to assist in generating help and error messages.
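For comparison, the manual route might look something like the sketch below. The filepaths_from_argv function is a hypothetical name made up for this illustration; it takes an argv-style list so it can be tried without touching the real command line.

```python
def filepaths_from_argv(argv):
    """Return the results filepaths from an argv-style list (script name first)."""
    filepaths = argv[1:]  # argv[0] is the script name itself
    if not filepaths:
        # Exit with a short usage message, much as argparse would do for us.
        raise SystemExit(f"usage: {argv[0]} RESULTS_CSV [RESULTS_CSV ...]")
    return filepaths


# In the real script this would be called as filepaths_from_argv(sys.argv),
# after an "import sys" at the top of the file.
example = filepaths_from_argv(
    ["autoreporting.py", "VGG19_results.csv", "MobileNet_results.csv"]
)
print(example)  # ['VGG19_results.csv', 'MobileNet_results.csv']
```

Even in this tiny version, the usage message and the empty-input check have to be written by hand, which is the part argparse takes off our plate.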
Because I’ve not used argparse before, I’m interested in giving it a go and testing it for these purposes.
Implementing argparse
Almost everything we want to do with argparse is handled within the main() function in autoreporting.py. To make this work, we’re going to:
- Define and parse the arguments we’re interested in (specifically, filepaths to the results .csv files), using argparse;
- Convert these filepaths into ModelResults objects that we can use to generate our reports;
- Adapt our existing code to output reports using these ModelResults objects.
So, first of all, we need to make sure that argparse is imported. It’s been part of the standard library since Python 3.2.
autoreporting.py
import argparse
import os  # needed further down for the os.path filename handling
# ...
Next, within main(), we’ll define the parser – we’re saying how we want the command line arguments to be interpreted. This code is adapted pretty quick smart from the demo in the argparse docs. We actually only have one argument, by how argparse defines it – just the filepaths to our results .csv files. The key thing to note is that we set the nargs argument to "+", indicating that we can have an undefined number of arguments of this kind, but we do need at least one.
When we call parser.parse_args(), all the arguments are neatly returned as a Namespace object that makes the inputs very easy to access, as we’ll see in the following steps.
# ...
def main():
    """
    Entry point for the script.
    Render a template and write it to file.
    :return:
    """
    # Define and parse our arguments
    parser = argparse.ArgumentParser(description="Convert results .csv files into an interactive report.")
    parser.add_argument(
        "results_filepaths",
        nargs="+",
        help="Path(s) to results file(s) with filename(s) '<model_name>_results.csv'."
    )
    args = parser.parse_args()
From args, the Namespace object, we can pull out the filepaths and use them to generate ModelResults objects.
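To poke at the Namespace without touching the real command line, parse_args() also accepts an explicit list of strings. The snippet below is a small standalone sketch using the same argument definition as above; the filenames are just placeholders.

```python
import argparse

parser = argparse.ArgumentParser(
    description="Convert results .csv files into an interactive report."
)
parser.add_argument(
    "results_filepaths",
    nargs="+",
    help="Path(s) to results file(s).",
)

# Passing a list to parse_args() stands in for the real command line.
args = parser.parse_args(["VGG19_results.csv", "MobileNet_results.csv"])
print(args.results_filepaths)  # ['VGG19_results.csv', 'MobileNet_results.csv']
```

Because nargs is "+", calling the parser with no filepaths at all makes argparse print a usage message and exit rather than returning an empty list.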
args.results_filepaths holds a list of our filepaths, which we have indicated should point at filenames in the format <model_name>_results.csv. We use the os.path module to manipulate each filepath, extract the model name, and generate the ModelResults object, adding it into a list called model_results as we go.
This filename manipulation can look a little tricky, but inspect the doccies of os.path and you’ll see it’s mostly clever string manipulation. os.path is full of very, very useful functions that can save you a lot of time with common path manipulations, and help your code to work cross-platform!
    # Create the model_results list, which holds the relevant information
    model_results = []
    for results_filepath in args.results_filepaths:
        results_root_name = os.path.splitext(os.path.basename(results_filepath))[0]
        model_name = results_root_name.split("_results")[0]
        model_results.append(
            ModelResults(model_name, results_filepath))
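Tracing the filename manipulation one step at a time may help. The path below is a made-up example, and the intermediate variable names are only for illustration.

```python
import os.path

results_filepath = "data/VGG19_results.csv"  # hypothetical input path

filename = os.path.basename(results_filepath)   # 'VGG19_results.csv'
root, extension = os.path.splitext(filename)    # ('VGG19_results', '.csv')
model_name = root.split("_results")[0]          # 'VGG19'

print(filename, root, extension, model_name)

# A file that doesn't follow the naming convention keeps its whole root name.
other_root = os.path.splitext(os.path.basename("data/scores.csv"))[0]
print(other_root.split("_results")[0])          # 'scores'
```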
The logic for the set intersection – how we figure out which images are common across all results files – has been changed to account for the fact that we now have an arbitrary number of ModelResults objects in a list.
To make this work, we quickly extract the misidentified_images property of each object using a list comprehension, and then calculate the intersection of sets based on this resulting list. (Note that we have to use a leading asterisk (*) when we call set.intersection() so that each member of the list gets passed in as an individual argument.)
    # Create some more content to be published as part of this analysis
    title = "Model Report"
    misidentified_images = [set(results.misidentified_images) for results in model_results]
    number_misidentified = len(set.intersection(*misidentified_images))
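Here is the unpacking trick on its own, with made-up image names standing in for the real misidentified_images data.

```python
# Hypothetical sets of misidentified images, one per model.
misidentified_per_model = [
    {"cat_01.jpg", "dog_07.jpg", "bird_03.jpg"},
    {"dog_07.jpg", "bird_03.jpg"},
    {"bird_03.jpg", "fish_09.jpg"},
]

# The asterisk unpacks the list, so this is equivalent to
# set.intersection(first_set, second_set, third_set).
common = set.intersection(*misidentified_per_model)
print(common)       # {'bird_03.jpg'}
print(len(common))  # 1
```

Note that set.intersection() needs at least one set to work with: unpacking an empty list raises a TypeError. In the script that case shouldn’t arise, since nargs="+" guarantees at least one results file.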
Everything below this point is relatively consistent with our previous version, but now we’re taking advantage of the fact that we have our ModelResults objects already packed up into the model_results list.
    # Produce our section blocks
    sections = list()
    sections.append(summary_section_template.render(
        model_results_list=model_results,
        number_misidentified=number_misidentified
    ))
    for model_result in model_results:
        sections.append(table_section_template.render(
            model=model_result.model_name,
            dataset=model_result.dataset,
            table=model_result.get_results_df_as_html())
        )
    # Produce and write the report to file
    with open("outputs/report.html", "w") as f:
        f.write(base_template.render(
            title=title,
            sections=sections,
            model_results_list=model_results
        ))
    print('Successfully wrote "report.html" to folder "outputs".')
Oooft! With all of the explanations, this looks a little complex. However, when you compare this code to the previous commit, you’ll see there’s not a great deal that’s actually significantly different here – we’ve really kept the core principles the same and just played with the packaging a bit.
GitHub status
Your project should look a little like this.
The End?
At the very start of the first post, I indicated that the goal of this project was to create an automatic HTML reporting tool, where the outcome was a single stand-alone HTML file, with info and interactivity.
Well, it’s done! We’ve got a tool that can accept an arbitrary number of standard results files, and spit out a report that crunches them into an interactive format.
Take a breather, push your chair away from your desk, and pat yourself on the back. We’ve done what we set out to do!
Does that mean we’re done? That depends, really.
What comes next?
At this point, we have a tool that works for a very narrow use-case, and assumes perfect inputs and perfect operation from the user. Now, if you’re using a tool like this just for yourself, and the inputs to the tool are quite consistent, then that could in fact be perfectly satisfactory – so no more work to be done, you’ve got something that’s fit for purpose.
But of course, there are any number of ways we can work to extend and harden this tool. As I was writing this tutorial, I made notes on some of these. Running loosely from simpler to more complex, here are a few notes and ideas:
- Could we add an optional command-line argument with argparse, with a sensible default, that allows us to specify the title of the report?
- Our .csv inputs need to be named in a perfectly consistent format. How can we restructure our inputs so that we can define the model name and not have it read from the filename?
- How can we make this a command-line script that can be run anywhere on our machine – not just in the folder the script is in? If our data is generated and stored elsewhere, it would certainly be more useful to be able to call autoreport on the terminal, rather than trace back to where the script is stored, for instance.
- In its current form, we’re analysing images – can we add functionality to show the images we’re analysing? This would be great for generating hypotheses for why a model failed.
- Our reports need an internet connection each time they’re opened. As part of the templating process, could we pull down the JavaScript and CSS files and embed them into our files?
This is a tiny sliver of the possible extensions – and this is not even to mention that there’s plenty of refactoring and tidying to be done across the project. The job of improving your work is never done!
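As a starting point for the first idea on that list, here is a minimal sketch of an optional title argument. The --title flag and the build_parser function are assumptions made for this illustration; they are not in the project as it stands.

```python
import argparse


def build_parser():
    """Build the parser with a hypothetical optional --title argument."""
    parser = argparse.ArgumentParser(
        description="Convert results .csv files into an interactive report."
    )
    parser.add_argument(
        "results_filepaths",
        nargs="+",
        help="Path(s) to results file(s).",
    )
    parser.add_argument(
        "--title",
        default="Model Report",  # used when the flag is left off
        help="Title shown at the top of the report.",
    )
    return parser


# Explicit lists stand in for the real command line.
custom = build_parser().parse_args(
    ["--title", "Weekly Model Report", "VGG19_results.csv"]
)
default = build_parser().parse_args(["VGG19_results.csv"])
print(custom.title)   # Weekly Model Report
print(default.title)  # Model Report
```

With something like this in place, the hard-coded title in main() could be swapped for args.title.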
A thankyou and a call to action
This has been the first tutorial of this scope I’ve ever written. I’ve had a lot of fun doing so, and to paraphrase Sigur Rós, this has been a good beginning.
But I’m really keen to hear what parts of this you, the tutorial-reader, enjoyed and what parts were challenging or obscure. Feel free to drop a comment or send me a message on what worked and what didn’t.
See you next time!