Deep Learning Journal (5 Part Series)
1 Using Python, NodeJS, Angular, and MongoDB to Create a Machine Learning System
2 Distributing Machine Learning Jobs
3 Training a Toxic Comment Detector
4 Preparing a Small Server for a Neural Network Webservice
5 Creating a Neural Network Webservice
I’ve started designing a system to manage data analysis tools I build.
- An illegitimate REST interface
- Interface for existing Python scripts
- Process for creating micro-services from Python scripts
- Interface for creating machine learning jobs to be picked up by free machines
- Manage a job queue for work machines to systematically tackle machine learning jobs
- Data storage and access
- Results access and job metadata
- A way to visualize results
I’ve landed on a fairly complicated process for handling the above. I’ve tried cutting frameworks from the stack, as I know it’ll be a nightmare to maintain, but I’m not seeing what to cut.
- Node for creating RESTful interfaces between the HQ Machine and the Worker Nodes
- Node on the workers to ping the HQ machine periodically to see if there are jobs to run
- MongoDB on the HQ Machine to store the job results data, paths to datasets, and possibly primary data
- Angular to interact with the HQ Node for the job creation and results viewing UI
- ngx-datatables for viewing tabular results
- ngx-charts for viewing job results (e.g., visualizing variance and linearity)
- Python for access to all the latest awesome ML frameworks
- python-shell (npm) for creating an interface between Node and Python
Utilizing all Machines in the House
Machine learning is a new world for me. But, it’s pretty dern cool. I like making machines do the hard stuff while I’m off doing other work. It makes me feel extra productive. Like, “I created that machine, so any work it does I get credit for. And! The work I did while it was doing its work.” This is the reason I own two 3D-printers.
I’m noticing there is a possibility of utilizing old computers I have lying around the house for the same effect. The plan is to abstract a neural network script, install it on all the computers lying about, and create an HQ Computer where I can create sets of hyperparameters to pass to the Worker Nodes throughout the house.
Why? Glad I asked for you. I feel guilty there are computers around the house going unused. There’s an old AMD desktop with a GTX 1060 in it, a 2013 MacBook Pro (my son’s), and my 2015 MacBook Pro. These don’t see much use anymore, since my employer has provided an iMac to work on. They need to earn their keep.
How? Again, glad I asked for you. I’ll create a system to make deep-learning jobs from hyperparameter sets and send them to these idle machines, trying to get them to solve problems while I’m working on paying the bills. This plays to the power of neural networks: they need little manual tweaking. You simply provide them with hyperparameters and let them run.
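To make the idea concrete, here’s a minimal sketch of expanding a hyperparameter set into discrete jobs for the queue. The field names (`learning_rate`, `batch_size`, `jobId`) are hypothetical, not the project’s actual schema:

```python
# Sketch: expand a hyperparameter grid into one job per combination.
# Field names here are hypothetical, not the project's actual schema.
from itertools import product

def make_jobs(grid):
    """Return one job dict per combination of hyperparameter values."""
    keys = list(grid)
    jobs = []
    for job_id, values in enumerate(product(*grid.values())):
        jobs.append({"jobId": job_id, "hyperparameters": dict(zip(keys, values))})
    return jobs

grid = {
    "learning_rate": [0.1, 0.01, 0.001],
    "batch_size": [32, 64],
}

jobs = make_jobs(grid)
print(len(jobs))  # 3 * 2 = 6 jobs
```

Each job then becomes one unit of work a Worker Node can claim independently.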
Here are the napkin-doodles:
```
+-Local------------------------------------------------------+
|                                                            |
|    ____                  ____      Each machine runs       |
|   |""|                  |""|       Node and Express        |
| HQ|__|               #1 |__|       server, creating        |
|  [ ==.]`)               [ ==.]`)   routes to Python        |
|  ====== 0               ====== 0   scripts using           |
|  The HQ machine runs     ____      stdin and stdout        |
|  Node and Express, but  |""|                               |
|  the routes are for  #2 |__|                               |
|  storing results in a   [ ==.]`)                           |
|  database.              ====== 0                           |
|                          ____                              |
|                         |""|                               |
|                      #3 |__|       Worker                  |
|                         [ ==.]`)   Nodes                   |
|                         ====== 0                           |
|                                                            |
+------------------------------------------------------------+
```
```
+-Local------------------------------------------------------+
|    Each worker Node checks                  Workers        |
|     ____   with HQ on a set interval         ____          |
|    |""|    for jobs to run                  |""|           |
|  HQ|__| <--------------------------+     #1 |__|           |
|   [ ==.]`)                               [ ==.]`)          |
|   ====== 0                               ====== 0          |
|    ^  |                                      ____          |
|    |  |                              #2     |""|           |
|    |  +------------------------------------>|__|           |
|    |    If there is a job, the           [ ==.]`)          |
|    |    Worker will send a GET           ====== 0          |
|    |    request for the job                  ____          |
|    |    parameters                          |""|           |
|    |                                 #3     |__|           |
|    +-------------------------------------[ ==.]`)          |
|   Once completed, the Worker updates HQ  ====== 0          |
|   with the job results.                                    |
+------------------------------------------------------------+
```
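The check-in loop each Worker runs boils down to something like this sketch. A plain dict stands in for the HTTP calls to HQ, and the function names (`check_for_job`, `get_job_parameters`, `post_results`) are hypothetical, not the project’s API:

```python
# Sketch of the Worker's polling loop. A plain dict stands in for the
# HQ's HTTP endpoints; all function names here are hypothetical.
hq = {"queue": [{"jobId": 1, "hyperparameters": {"lr": 0.01}}], "results": []}

def check_for_job():
    # Worker pings HQ on a set interval to see if there are jobs to run.
    return bool(hq["queue"])

def get_job_parameters():
    # If there is a job, the Worker requests its parameters.
    return hq["queue"].pop(0)

def run_job(job):
    # Placeholder for actually training a network with these hyperparameters.
    return {"jobId": job["jobId"], "loss": 0.42}

def post_results(result):
    # Once completed, the Worker updates HQ with the job results.
    hq["results"].append(result)

while check_for_job():
    job = get_job_parameters()
    post_results(run_job(job))
```

In the real system each of those functions would be an HTTP request against the HQ Node, fired on a timer rather than a tight loop.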
Worker Nodes
The Worker Nodes code is pretty straightforward. It uses Node, Express, and python-shell to create a bastardized REST interface for simple interactions with the HQ Node controlling the job queue.
Node Side
Here’s the proof-of-concept NodeJS code.
```javascript
var express = require('express');
var bodyParser = require('body-parser');
var pythonRunner = require('./preprocessing-services/python-runner');
var app = express();
const port = 3000;

app.use(bodyParser.json())

// Python script runner interface
app.post('/scripts/run', (req, res) => {
  try {
    let pythonJob = req.body;
    pythonRunner.scriptRun(pythonJob)
      .then((response) => {
        res.send(response);
      })
      .catch((err) => {
        // The try/catch below won't see promise rejections, so handle them here.
        res.send(err);
      });
  } catch (err) {
    res.send(err);
  }
});

app.listen(port, () => {
  console.log(`Started on port ${port}`);
});
```
The above code is a dead-simple NodeJS server using Express. It uses the body-parser middleware to parse incoming JSON bodies into objects. The pythonJob object looks something like this (real path names have been changed to help protect their anonymity).
```json
{
  "scriptsPath": "/Users/hinky-dink/dl-principal/python-scripts/",
  "scriptName": "union.py",
  "jobParameters": {
    "dataFileName": "",
    "dataPath": "/Users/hinky-dink/bit-dl/data/lot-data/wine_encoded/",
    "writePath": "/Users/hinky-dink/bit-dl/data/lot-data/wine_encoded/",
    "execution": {
      "dataFileOne": "wine_2017_encoded.csv",
      "dataFileTwo": "wine_2018_encoded.csv",
      "outputFilename": "wine_17-18.csv"
    }
  }
}
```
Each of these attributes will be passed to the Python shell in order to execute data_prep.py. They are passed to the shell as system arguments.
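On the receiving end, the script sees those four values in `sys.argv`. A stripped-down sketch of the handoff, with `sys.argv` assigned directly here only to simulate what python-shell does when it spawns the script:

```python
# Sketch: how a job's parameters arrive in a Python script as system
# arguments. sys.argv is assigned here only to simulate python-shell's call.
import sys
import json

sys.argv = [
    "union.py",
    "",                        # dataFileName
    "/tmp/data/",              # dataPath (hypothetical path)
    "/tmp/data/",              # writePath (hypothetical path)
    '{"dataFileOne": "wine_2017_encoded.csv"}',  # execution, as a JSON string
]

filename = sys.argv[1]
filepath = sys.argv[2]
write_path = sys.argv[3]
execution = json.loads(sys.argv[4])
print(execution["dataFileOne"])  # wine_2017_encoded.csv
```

The one non-obvious bit is the last argument: the nested `execution` object survives the shell boundary as a JSON string and gets `json.loads`’d back into a dict.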
Here’s the python-runner.js
```javascript
let {PythonShell} = require('python-shell')

var scriptRun = function(pythonJob){
  return new Promise((resolve, reject) => {
    console.log(pythonJob)
    try {
      let options = {
        mode: 'text',
        pythonOptions: ['-u'], // get print results in real-time
        scriptPath: pythonJob.scriptsPath,
        args: [
          pythonJob.jobParameters.dataFileName,
          pythonJob.jobParameters.dataPath,
          pythonJob.jobParameters.writePath,
          JSON.stringify(pythonJob.jobParameters.execution)
        ]
      };
      PythonShell.run(pythonJob.scriptName, options, function (err, results) {
        if (err) {
          // Throwing here wouldn't reach the outer try/catch; reject instead.
          reject(err);
          return;
        }
        try {
          let result = JSON.parse(results.pop());
          if (result) {
            resolve(result);
          } else {
            reject({'err': ''})
          }
        } catch (err) {
          reject({'error': 'Failed to parse Python script return object.'})
        }
      });
    } catch (err) {
      reject(err)
    }
  });
}

module.exports = {scriptRun}
```
Python Side
Here’s the Python script in the above example. It is meant to detect what type of data is in each table column. If a column is continuous, it leaves it alone (I’ll probably add a normalization option at some point); if it is categorical, it converts it to a dummy variable. It then saves this encoded data on the Worker Node side (right now). Lastly, it returns a JSON string back to the Node side.
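Before the full listing, the categorical-to-dummy conversion it describes boils down to pandas’ `get_dummies`. A toy illustration with made-up column names, using `drop_first=True` to avoid the dummy variable trap:

```python
# Toy illustration of dummy-encoding categorical columns with pandas.
# Column names are made up for the example.
import pandas as pd

df = pd.DataFrame({
    "cat_color": ["red", "blue", "red"],   # categorical -> dummied
    "con_price": [5.0, 7.5, 6.25],         # continuous -> left alone
})

# drop_first=True discards one dummy level per column to avoid
# the dummy variable trap (perfect multicollinearity).
encoded = pd.get_dummies(df, columns=["cat_color"], drop_first=True)
print(list(encoded.columns))  # ['con_price', 'cat_color_red']
```

The full script below does the same thing at scale, after first tagging each column with a `cat`/`con` prefix so it knows which columns to encode.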
<span>"""</span><span> Created on Mon Jun 11 21:12:10 2018 @author: cthomasbrittain </span><span>"""</span><span>import</span> <span>sys</span><span>import</span> <span>json</span><span># </span><span>filename</span> <span>=</span> <span>sys</span><span>.</span><span>argv</span><span>[</span><span>1</span><span>]</span><span>filepath</span> <span>=</span> <span>sys</span><span>.</span><span>argv</span><span>[</span><span>2</span><span>]</span><span>pathToWriteProcessedFile</span> <span>=</span> <span>sys</span><span>.</span><span>argv</span><span>[</span><span>3</span><span>]</span><span>request</span> <span>=</span> <span>sys</span><span>.</span><span>argv</span><span>[</span><span>4</span><span>]</span><span>request</span> <span>=</span> <span>json</span><span>.</span><span>loads</span><span>(</span><span>request</span><span>)</span><span>try</span><span>:</span><span>cols_to_remove</span> <span>=</span> <span>request</span><span>[</span><span>'</span><span>columnsToRemove</span><span>'</span><span>]</span><span>unreasonable_increase</span> <span>=</span> <span>request</span><span>[</span><span>'</span><span>unreasonableIncreaseThreshold</span><span>'</span><span>]</span><span>except</span><span>:</span><span># If columns aren't contained or no columns, exit nicely </span> <span>result</span> <span>=</span> <span>{</span><span>'</span><span>status</span><span>'</span><span>:</span> <span>400</span><span>,</span> <span>'</span><span>message</span><span>'</span><span>:</span> <span>'</span><span>Expected script parameters not found.</span><span>'</span><span>}</span><span>print</span><span>(</span><span>str</span><span>(</span><span>json</span><span>.</span><span>dumps</span><span>(</span><span>result</span><span>)))</span><span>quit</span><span>()</span><span>pathToData</span> <span>=</span> <span>filepath</span> <span>+</span> <span>filename</span><span># Clean Data -------------------------------------------------------------------- # 
------------------------------------------------------------------------------- </span><span># Importing data transformation libraries </span><span>import</span> <span>pandas</span> <span>as</span> <span>pd</span><span># The following method will do the following:a # 1. Add a prefix to columns based upon datatypes (cat and con) # 2. Convert all continuous variables to numeric (float64) # 3. Convert all categorical variables to objects # 4. Rename all columns with prefixes, convert to lower-case, and replace # spaces with underscores. # 5. Continuous blanks are replaced with 0 and categorical 'not collected' # This method will also detect manually assigned prefixes and adjust the # columns and data appropriately. # Prefix key: # a) con = continuous # b) cat = categorical # c) rem = removal (discards entire column) </span><span>def</span> <span>add_datatype_prefix</span><span>(</span><span>df</span><span>,</span> <span>date_to_cont</span> <span>=</span> <span>True</span><span>):</span><span>import</span> <span>pandas</span> <span>as</span> <span>pd</span><span># Get a list of current column names. </span> <span>column_names</span> <span>=</span> <span>list</span><span>(</span><span>df</span><span>.</span><span>columns</span><span>.</span><span>values</span><span>)</span><span># Encode each column based with a three letter prefix based upon assigned datatype. </span> <span># 1. con = continuous </span> <span># 2. 
cat = categorical </span><span>for</span> <span>name</span> <span>in</span> <span>column_names</span><span>:</span><span>if</span> <span>df</span><span>[</span><span>name</span><span>].</span><span>dtype</span> <span>==</span> <span>'</span><span>object</span><span>'</span><span>:</span><span>try</span><span>:</span><span>df</span><span>[</span><span>name</span><span>]</span> <span>=</span> <span>pd</span><span>.</span><span>to_datetime</span><span>(</span><span>df</span><span>[</span><span>name</span><span>])</span><span>if</span><span>(</span><span>date_to_cont</span><span>):</span><span>new_col_names</span> <span>=</span> <span>"</span><span>con_</span><span>"</span> <span>+</span> <span>name</span><span>.</span><span>lower</span><span>().</span><span>replace</span><span>(</span><span>"</span><span> </span><span>"</span><span>,</span> <span>"</span><span>_</span><span>"</span><span>).</span><span>replace</span><span>(</span><span>"</span><span>/</span><span>"</span><span>,</span> <span>"</span><span>_</span><span>"</span><span>)</span><span>df</span> <span>=</span> <span>df</span><span>.</span><span>rename</span><span>(</span><span>columns</span><span>=</span><span>{</span><span>name</span><span>:</span> <span>new_col_names</span><span>})</span><span>else</span><span>:</span><span>new_col_names</span> <span>=</span> <span>"</span><span>date_</span><span>"</span> <span>+</span> <span>name</span><span>.</span><span>lower</span><span>().</span><span>replace</span><span>(</span><span>"</span><span> </span><span>"</span><span>,</span> <span>"</span><span>_</span><span>"</span><span>).</span><span>replace</span><span>(</span><span>"</span><span>/</span><span>"</span><span>,</span> <span>"</span><span>_</span><span>"</span><span>)</span><span>df</span> <span>=</span> <span>df</span><span>.</span><span>rename</span><span>(</span><span>columns</span><span>=</span><span>{</span><span>name</span><span>:</span> <span>new_col_names</span><span>})</span><span>except</span> 
<span>ValueError</span><span>:</span><span>pass</span><span>column_names</span> <span>=</span> <span>list</span><span>(</span><span>df</span><span>.</span><span>columns</span><span>.</span><span>values</span><span>)</span><span>for</span> <span>name</span> <span>in</span> <span>column_names</span><span>:</span><span>if</span> <span>name</span><span>[</span><span>0</span><span>:</span><span>3</span><span>]</span> <span>==</span> <span>"</span><span>rem</span><span>"</span> <span>or</span> <span>"</span><span>con</span><span>"</span> <span>or</span> <span>"</span><span>cat</span><span>"</span> <span>or</span> <span>"</span><span>date</span><span>"</span><span>:</span><span>pass</span><span>if</span> <span>df</span><span>[</span><span>name</span><span>].</span><span>dtype</span> <span>==</span> <span>'</span><span>object</span><span>'</span><span>:</span><span>new_col_names</span> <span>=</span> <span>"</span><span>cat_</span><span>"</span> <span>+</span> <span>name</span><span>.</span><span>lower</span><span>().</span><span>replace</span><span>(</span><span>"</span><span> </span><span>"</span><span>,</span> <span>"</span><span>_</span><span>"</span><span>).</span><span>replace</span><span>(</span><span>"</span><span>/</span><span>"</span><span>,</span> <span>"</span><span>_</span><span>"</span><span>)</span><span>df</span> <span>=</span> <span>df</span><span>.</span><span>rename</span><span>(</span><span>columns</span><span>=</span><span>{</span><span>name</span><span>:</span> <span>new_col_names</span><span>})</span><span>elif</span> <span>df</span><span>[</span><span>name</span><span>].</span><span>dtype</span> <span>==</span> <span>'</span><span>float64</span><span>'</span> <span>or</span> <span>df</span><span>[</span><span>name</span><span>].</span><span>dtype</span> <span>==</span> <span>'</span><span>int64</span><span>'</span> <span>or</span> <span>df</span><span>[</span><span>name</span><span>].</span><span>dtype</span> <span>==</span> 
<span>'</span><span>datetime64[ns]</span><span>'</span><span>:</span><span>new_col_names</span> <span>=</span> <span>"</span><span>con_</span><span>"</span> <span>+</span> <span>name</span><span>.</span><span>lower</span><span>().</span><span>replace</span><span>(</span><span>"</span><span> </span><span>"</span><span>,</span> <span>"</span><span>_</span><span>"</span><span>).</span><span>replace</span><span>(</span><span>"</span><span>/</span><span>"</span><span>,</span> <span>"</span><span>_</span><span>"</span><span>)</span><span>df</span> <span>=</span> <span>df</span><span>.</span><span>rename</span><span>(</span><span>columns</span><span>=</span><span>{</span><span>name</span><span>:</span> <span>new_col_names</span><span>})</span><span>column_names</span> <span>=</span> <span>list</span><span>(</span><span>df</span><span>.</span><span>columns</span><span>.</span><span>values</span><span>)</span><span># Get lists of coolumns for conversion </span> <span>con_column_names</span> <span>=</span> <span>[]</span><span>cat_column_names</span> <span>=</span> <span>[]</span><span>rem_column_names</span> <span>=</span> <span>[]</span><span>date_column_names</span> <span>=</span> <span>[]</span><span>for</span> <span>name</span> <span>in</span> <span>column_names</span><span>:</span><span>if</span> <span>name</span><span>[</span><span>0</span><span>:</span><span>3</span><span>]</span> <span>==</span> <span>"</span><span>cat</span><span>"</span><span>:</span><span>cat_column_names</span><span>.</span><span>append</span><span>(</span><span>name</span><span>)</span><span>elif</span> <span>name</span><span>[</span><span>0</span><span>:</span><span>3</span><span>]</span> <span>==</span> <span>"</span><span>con</span><span>"</span><span>:</span><span>con_column_names</span><span>.</span><span>append</span><span>(</span><span>name</span><span>)</span><span>elif</span> <span>name</span><span>[</span><span>0</span><span>:</span><span>3</span><span>]</span> <span>==</span> 
<span>"</span><span>rem</span><span>"</span><span>:</span><span>rem_column_names</span><span>.</span><span>append</span><span>(</span><span>name</span><span>)</span><span>elif</span> <span>name</span><span>[</span><span>0</span><span>:</span><span>4</span><span>]</span> <span>==</span> <span>"</span><span>date</span><span>"</span><span>:</span><span>date_column_names</span><span>.</span><span>append</span><span>(</span><span>name</span><span>)</span><span># Make sure continuous variables are correct datatype. (Otherwise, they'll be dummied). </span> <span>for</span> <span>name</span> <span>in</span> <span>con_column_names</span><span>:</span><span>df</span><span>[</span><span>name</span><span>]</span> <span>=</span> <span>pd</span><span>.</span><span>to_numeric</span><span>(</span><span>df</span><span>[</span><span>name</span><span>],</span> <span>errors</span><span>=</span><span>'</span><span>coerce</span><span>'</span><span>)</span><span>df</span><span>[</span><span>name</span><span>]</span> <span>=</span> <span>df</span><span>[</span><span>name</span><span>].</span><span>fillna</span><span>(</span><span>value</span><span>=</span><span>0</span><span>)</span><span>for</span> <span>name</span> <span>in</span> <span>cat_column_names</span><span>:</span><span>df</span><span>[</span><span>name</span><span>]</span> <span>=</span> <span>df</span><span>[</span><span>name</span><span>].</span><span>apply</span><span>(</span><span>str</span><span>)</span><span>df</span><span>[</span><span>name</span><span>]</span> <span>=</span> <span>df</span><span>[</span><span>name</span><span>].</span><span>fillna</span><span>(</span><span>value</span><span>=</span><span>'</span><span>not_collected</span><span>'</span><span>)</span><span># Remove unwanted columns </span> <span>df</span> <span>=</span> <span>df</span><span>.</span><span>drop</span><span>(</span><span>columns</span><span>=</span><span>rem_column_names</span><span>,</span> 
<span>axis</span><span>=</span><span>1</span><span>)</span><span>return</span> <span>df</span><span># ------------------------------------------------------ # Encoding Categorical variables # ------------------------------------------------------ </span><span># The method below creates dummy variables from columns with # the prefix "cat". There is the argument to drop the first column # to avoid the Dummy Variable Trap. </span><span>def</span> <span>dummy_categorical</span><span>(</span><span>df</span><span>,</span> <span>drop_first</span> <span>=</span> <span>True</span><span>):</span><span># Get categorical data columns. </span> <span>columns</span> <span>=</span> <span>list</span><span>(</span><span>df</span><span>.</span><span>columns</span><span>.</span><span>values</span><span>)</span><span>columnsToEncode</span> <span>=</span> <span>columns</span><span>.</span><span>copy</span><span>()</span><span>for</span> <span>name</span> <span>in</span> <span>columns</span><span>:</span><span>if</span> <span>name</span><span>[</span><span>0</span><span>:</span><span>3</span><span>]</span> <span>!=</span> <span>'</span><span>cat</span><span>'</span><span>:</span><span>columnsToEncode</span><span>.</span><span>remove</span><span>(</span><span>name</span><span>)</span><span># if there are no columns to encode, return unmutated. 
</span> <span>if</span> <span>not</span> <span>columnsToEncode</span><span>:</span><span>return</span> <span>df</span><span># Encode categories </span> <span>for</span> <span>name</span> <span>in</span> <span>columnsToEncode</span><span>:</span><span>if</span> <span>name</span><span>[</span><span>0</span><span>:</span><span>3</span><span>]</span> <span>!=</span> <span>'</span><span>cat</span><span>'</span><span>:</span><span>continue</span><span>tmp</span> <span>=</span> <span>pd</span><span>.</span><span>get_dummies</span><span>(</span><span>df</span><span>[</span><span>name</span><span>],</span> <span>drop_first</span> <span>=</span> <span>drop_first</span><span>)</span><span>names</span> <span>=</span> <span>{}</span><span># Get a clean column name. </span> <span>clean_name</span> <span>=</span> <span>name</span><span>.</span><span>replace</span><span>(</span><span>"</span><span> </span><span>"</span><span>,</span> <span>"</span><span>_</span><span>"</span><span>).</span><span>replace</span><span>(</span><span>"</span><span>/</span><span>"</span><span>,</span> <span>"</span><span>_</span><span>"</span><span>).</span><span>lower</span><span>()</span><span># Get a dictionary for renaming the dummay variables in the scheme of old_col_name + response_string </span> <span>if</span> <span>clean_name</span><span>[</span><span>0</span><span>:</span><span>3</span><span>]</span> <span>==</span> <span>"</span><span>cat</span><span>"</span><span>:</span><span>for</span> <span>tmp_name</span> <span>in</span> <span>tmp</span><span>:</span><span>tmp_name</span> <span>=</span> <span>str</span><span>(</span><span>tmp_name</span><span>)</span><span>new_tmp_name</span> <span>=</span> <span>tmp_name</span><span>.</span><span>replace</span><span>(</span><span>"</span><span> </span><span>"</span><span>,</span> <span>"</span><span>_</span><span>"</span><span>).</span><span>replace</span><span>(</span><span>"</span><span>/</span><span>"</span><span>,</span> 
<span>"</span><span>_</span><span>"</span><span>).</span><span>lower</span><span>()</span><span>new_tmp_name</span> <span>=</span> <span>clean_name</span> <span>+</span> <span>"</span><span>_</span><span>"</span> <span>+</span> <span>new_tmp_name</span><span>names</span><span>[</span><span>tmp_name</span><span>]</span> <span>=</span> <span>new_tmp_name</span><span># Rename the dummy variable dataframe </span> <span>tmp</span> <span>=</span> <span>tmp</span><span>.</span><span>rename</span><span>(</span><span>columns</span><span>=</span><span>names</span><span>)</span><span># join the dummy variable back to original dataframe. </span> <span>df</span> <span>=</span> <span>df</span><span>.</span><span>join</span><span>(</span><span>tmp</span><span>)</span><span># Drop all old categorical columns </span> <span>df</span> <span>=</span> <span>df</span><span>.</span><span>drop</span><span>(</span><span>columns</span><span>=</span><span>columnsToEncode</span><span>,</span> <span>axis</span><span>=</span><span>1</span><span>)</span><span>return</span> <span>df</span><span># Read the file </span><span>df</span> <span>=</span> <span>pd</span><span>.</span><span>read_csv</span><span>(</span><span>pathToData</span><span>)</span><span># Drop columns such as unique IDs </span><span>try</span><span>:</span><span>df</span> <span>=</span> <span>df</span><span>.</span><span>drop</span><span>(</span><span>cols_to_remove</span><span>,</span> <span>axis</span><span>=</span><span>1</span><span>)</span><span>except</span><span>:</span><span># If columns aren't contained or no columns, exit nicely </span> <span>result</span> <span>=</span> <span>{</span><span>'</span><span>status</span><span>'</span><span>:</span> <span>404</span><span>,</span> <span>'</span><span>message</span><span>'</span><span>:</span> <span>'</span><span>Problem with columns to 
remove.</span><span>'</span><span>}</span><span>print</span><span>(</span><span>str</span><span>(</span><span>json</span><span>.</span><span>dumps</span><span>(</span><span>result</span><span>)))</span><span>quit</span><span>()</span><span># Get the number of columns before hot encoding </span><span>num_cols_before</span> <span>=</span> <span>df</span><span>.</span><span>shape</span><span>[</span><span>1</span><span>]</span><span># Encode the data. </span><span>df</span> <span>=</span> <span>add_datatype_prefix</span><span>(</span><span>df</span><span>)</span><span>df</span> <span>=</span> <span>dummy_categorical</span><span>(</span><span>df</span><span>)</span><span># Get the new dataframe shape. </span><span>num_cols_after</span> <span>=</span> <span>df</span><span>.</span><span>shape</span><span>[</span><span>1</span><span>]</span><span>percentage_increase</span> <span>=</span> <span>num_cols_after</span> <span>/</span> <span>num_cols_before</span><span>result</span> <span>=</span> <span>""</span><span>if</span> <span>percentage_increase</span> <span>></span> <span>unreasonable_increase</span><span>:</span><span>message</span> <span>=</span> <span>"</span><span>\"</span><span>error</span><span>\"</span><span>: </span><span>\"</span><span>Feature increase is greater than unreasonableIncreaseThreshold, most likely a unique id was included.</span><span>"</span><span>result</span> <span>=</span> <span>{</span><span>'</span><span>status</span><span>'</span><span>:</span> <span>400</span><span>,</span> <span>'</span><span>message</span><span>'</span><span>:</span> <span>message</span><span>}</span><span>else</span><span>:</span><span>filename</span> <span>=</span> <span>filename</span><span>.</span><span>replace</span><span>(</span><span>"</span><span>.csv</span><span>"</span><span>,</span> <span>""</span><span>)</span><span>import</span> <span>os</span><span>if</span> <span>not</span> 
<span>os</span><span>.</span><span>path</span><span>.</span><span>exists</span><span>(</span><span>pathToWriteProcessedFile</span><span>):</span><span>os</span><span>.</span><span>makedirs</span><span>(</span><span>pathToWriteProcessedFile</span><span>)</span><span>writeFile</span> <span>=</span> <span>pathToWriteProcessedFile</span> <span>+</span> <span>filename</span> <span>+</span> <span>"</span><span>_encoded.csv</span><span>"</span><span>df</span><span>.</span><span>to_csv</span><span>(</span><span>path_or_buf</span><span>=</span><span>writeFile</span><span>,</span> <span>sep</span><span>=</span><span>'</span><span>,</span><span>'</span><span>)</span><span># Process the results and return JSON results object </span> <span>result</span> <span>=</span> <span>{</span><span>'</span><span>status</span><span>'</span><span>:</span> <span>200</span><span>,</span> <span>'</span><span>message</span><span>'</span><span>:</span> <span>'</span><span>encoded data</span><span>'</span><span>,</span> <span>'</span><span>path</span><span>'</span><span>:</span> <span>writeFile</span><span>}</span><span>print</span><span>(</span><span>str</span><span>(</span><span>json</span><span>.</span><span>dumps</span><span>(</span><span>result</span><span>)))</span><span>"""</span><span> Created on Mon Jun 11 21:12:10 2018 @author: cthomasbrittain </span><span>"""</span> <span>import</span> <span>sys</span> <span>import</span> <span>json</span> <span># </span><span>filename</span> <span>=</span> <span>sys</span><span>.</span><span>argv</span><span>[</span><span>1</span><span>]</span> <span>filepath</span> <span>=</span> <span>sys</span><span>.</span><span>argv</span><span>[</span><span>2</span><span>]</span> <span>pathToWriteProcessedFile</span> <span>=</span> <span>sys</span><span>.</span><span>argv</span><span>[</span><span>3</span><span>]</span> <span>request</span> <span>=</span> <span>sys</span><span>.</span><span>argv</span><span>[</span><span>4</span><span>]</span> 
<span>request</span> <span>=</span> <span>json</span><span>.</span><span>loads</span><span>(</span><span>request</span><span>)</span> <span>try</span><span>:</span> <span>cols_to_remove</span> <span>=</span> <span>request</span><span>[</span><span>'</span><span>columnsToRemove</span><span>'</span><span>]</span> <span>unreasonable_increase</span> <span>=</span> <span>request</span><span>[</span><span>'</span><span>unreasonableIncreaseThreshold</span><span>'</span><span>]</span> <span>except</span><span>:</span> <span># If columns aren't contained or no columns, exit nicely </span> <span>result</span> <span>=</span> <span>{</span><span>'</span><span>status</span><span>'</span><span>:</span> <span>400</span><span>,</span> <span>'</span><span>message</span><span>'</span><span>:</span> <span>'</span><span>Expected script parameters not found.</span><span>'</span><span>}</span> <span>print</span><span>(</span><span>str</span><span>(</span><span>json</span><span>.</span><span>dumps</span><span>(</span><span>result</span><span>)))</span> <span>quit</span><span>()</span> <span>pathToData</span> <span>=</span> <span>filepath</span> <span>+</span> <span>filename</span> <span># Clean Data -------------------------------------------------------------------- # ------------------------------------------------------------------------------- </span> <span># Importing data transformation libraries </span><span>import</span> <span>pandas</span> <span>as</span> <span>pd</span> <span># The following method will do the following:a # 1. Add a prefix to columns based upon datatypes (cat and con) # 2. Convert all continuous variables to numeric (float64) # 3. Convert all categorical variables to objects # 4. Rename all columns with prefixes, convert to lower-case, and replace # spaces with underscores. # 5. Continuous blanks are replaced with 0 and categorical 'not collected' # This method will also detect manually assigned prefixes and adjust the # columns and data appropriately. 
# Prefix key: # a) con = continuous # b) cat = categorical # c) rem = removal (discards entire column) </span> <span>def</span> <span>add_datatype_prefix</span><span>(</span><span>df</span><span>,</span> <span>date_to_cont</span> <span>=</span> <span>True</span><span>):</span> <span>import</span> <span>pandas</span> <span>as</span> <span>pd</span> <span># Get a list of current column names. </span> <span>column_names</span> <span>=</span> <span>list</span><span>(</span><span>df</span><span>.</span><span>columns</span><span>.</span><span>values</span><span>)</span> <span># Encode each column based with a three letter prefix based upon assigned datatype. </span> <span># 1. con = continuous </span> <span># 2. cat = categorical </span> <span>for</span> <span>name</span> <span>in</span> <span>column_names</span><span>:</span> <span>if</span> <span>df</span><span>[</span><span>name</span><span>].</span><span>dtype</span> <span>==</span> <span>'</span><span>object</span><span>'</span><span>:</span> <span>try</span><span>:</span> <span>df</span><span>[</span><span>name</span><span>]</span> <span>=</span> <span>pd</span><span>.</span><span>to_datetime</span><span>(</span><span>df</span><span>[</span><span>name</span><span>])</span> <span>if</span><span>(</span><span>date_to_cont</span><span>):</span> <span>new_col_names</span> <span>=</span> <span>"</span><span>con_</span><span>"</span> <span>+</span> <span>name</span><span>.</span><span>lower</span><span>().</span><span>replace</span><span>(</span><span>"</span><span> </span><span>"</span><span>,</span> <span>"</span><span>_</span><span>"</span><span>).</span><span>replace</span><span>(</span><span>"</span><span>/</span><span>"</span><span>,</span> <span>"</span><span>_</span><span>"</span><span>)</span> <span>df</span> <span>=</span> <span>df</span><span>.</span><span>rename</span><span>(</span><span>columns</span><span>=</span><span>{</span><span>name</span><span>:</span> <span>new_col_names</span><span>})</span> 
<span>else</span><span>:</span> <span>new_col_names</span> <span>=</span> <span>"</span><span>date_</span><span>"</span> <span>+</span> <span>name</span><span>.</span><span>lower</span><span>().</span><span>replace</span><span>(</span><span>"</span><span> </span><span>"</span><span>,</span> <span>"</span><span>_</span><span>"</span><span>).</span><span>replace</span><span>(</span><span>"</span><span>/</span><span>"</span><span>,</span> <span>"</span><span>_</span><span>"</span><span>)</span> <span>df</span> <span>=</span> <span>df</span><span>.</span><span>rename</span><span>(</span><span>columns</span><span>=</span><span>{</span><span>name</span><span>:</span> <span>new_col_names</span><span>})</span> <span>except</span> <span>ValueError</span><span>:</span> <span>pass</span> <span>column_names</span> <span>=</span> <span>list</span><span>(</span><span>df</span><span>.</span><span>columns</span><span>.</span><span>values</span><span>)</span> <span>for</span> <span>name</span> <span>in</span> <span>column_names</span><span>:</span> <span>if</span> <span>name</span><span>[</span><span>0</span><span>:</span><span>3</span><span>]</span> <span>==</span> <span>"</span><span>rem</span><span>"</span> <span>or</span> <span>"</span><span>con</span><span>"</span> <span>or</span> <span>"</span><span>cat</span><span>"</span> <span>or</span> <span>"</span><span>date</span><span>"</span><span>:</span> <span>pass</span> <span>if</span> <span>df</span><span>[</span><span>name</span><span>].</span><span>dtype</span> <span>==</span> <span>'</span><span>object</span><span>'</span><span>:</span> <span>new_col_names</span> <span>=</span> <span>"</span><span>cat_</span><span>"</span> <span>+</span> <span>name</span><span>.</span><span>lower</span><span>().</span><span>replace</span><span>(</span><span>"</span><span> </span><span>"</span><span>,</span> 
<span>"</span><span>_</span><span>"</span><span>).</span><span>replace</span><span>(</span><span>"</span><span>/</span><span>"</span><span>,</span> <span>"</span><span>_</span><span>"</span><span>)</span> <span>df</span> <span>=</span> <span>df</span><span>.</span><span>rename</span><span>(</span><span>columns</span><span>=</span><span>{</span><span>name</span><span>:</span> <span>new_col_names</span><span>})</span> <span>elif</span> <span>df</span><span>[</span><span>name</span><span>].</span><span>dtype</span> <span>==</span> <span>'</span><span>float64</span><span>'</span> <span>or</span> <span>df</span><span>[</span><span>name</span><span>].</span><span>dtype</span> <span>==</span> <span>'</span><span>int64</span><span>'</span> <span>or</span> <span>df</span><span>[</span><span>name</span><span>].</span><span>dtype</span> <span>==</span> <span>'</span><span>datetime64[ns]</span><span>'</span><span>:</span> <span>new_col_names</span> <span>=</span> <span>"</span><span>con_</span><span>"</span> <span>+</span> <span>name</span><span>.</span><span>lower</span><span>().</span><span>replace</span><span>(</span><span>"</span><span> </span><span>"</span><span>,</span> <span>"</span><span>_</span><span>"</span><span>).</span><span>replace</span><span>(</span><span>"</span><span>/</span><span>"</span><span>,</span> <span>"</span><span>_</span><span>"</span><span>)</span> <span>df</span> <span>=</span> <span>df</span><span>.</span><span>rename</span><span>(</span><span>columns</span><span>=</span><span>{</span><span>name</span><span>:</span> <span>new_col_names</span><span>})</span> <span>column_names</span> <span>=</span> <span>list</span><span>(</span><span>df</span><span>.</span><span>columns</span><span>.</span><span>values</span><span>)</span> <span># Get lists of coolumns for conversion </span> <span>con_column_names</span> <span>=</span> <span>[]</span> <span>cat_column_names</span> <span>=</span> <span>[]</span> <span>rem_column_names</span> <span>=</span> 
<span>[]</span> <span>date_column_names</span> <span>=</span> <span>[]</span> <span>for</span> <span>name</span> <span>in</span> <span>column_names</span><span>:</span> <span>if</span> <span>name</span><span>[</span><span>0</span><span>:</span><span>3</span><span>]</span> <span>==</span> <span>"</span><span>cat</span><span>"</span><span>:</span> <span>cat_column_names</span><span>.</span><span>append</span><span>(</span><span>name</span><span>)</span> <span>elif</span> <span>name</span><span>[</span><span>0</span><span>:</span><span>3</span><span>]</span> <span>==</span> <span>"</span><span>con</span><span>"</span><span>:</span> <span>con_column_names</span><span>.</span><span>append</span><span>(</span><span>name</span><span>)</span> <span>elif</span> <span>name</span><span>[</span><span>0</span><span>:</span><span>3</span><span>]</span> <span>==</span> <span>"</span><span>rem</span><span>"</span><span>:</span> <span>rem_column_names</span><span>.</span><span>append</span><span>(</span><span>name</span><span>)</span> <span>elif</span> <span>name</span><span>[</span><span>0</span><span>:</span><span>4</span><span>]</span> <span>==</span> <span>"</span><span>date</span><span>"</span><span>:</span> <span>date_column_names</span><span>.</span><span>append</span><span>(</span><span>name</span><span>)</span> <span># Make sure continuous variables are correct datatype. (Otherwise, they'll be dummied). 
</span> <span>for</span> <span>name</span> <span>in</span> <span>con_column_names</span><span>:</span> <span>df</span><span>[</span><span>name</span><span>]</span> <span>=</span> <span>pd</span><span>.</span><span>to_numeric</span><span>(</span><span>df</span><span>[</span><span>name</span><span>],</span> <span>errors</span><span>=</span><span>'</span><span>coerce</span><span>'</span><span>)</span> <span>df</span><span>[</span><span>name</span><span>]</span> <span>=</span> <span>df</span><span>[</span><span>name</span><span>].</span><span>fillna</span><span>(</span><span>value</span><span>=</span><span>0</span><span>)</span> <span>for</span> <span>name</span> <span>in</span> <span>cat_column_names</span><span>:</span> <span>df</span><span>[</span><span>name</span><span>]</span> <span>=</span> <span>df</span><span>[</span><span>name</span><span>].</span><span>apply</span><span>(</span><span>str</span><span>)</span> <span>df</span><span>[</span><span>name</span><span>]</span> <span>=</span> <span>df</span><span>[</span><span>name</span><span>].</span><span>fillna</span><span>(</span><span>value</span><span>=</span><span>'</span><span>not_collected</span><span>'</span><span>)</span> <span># Remove unwanted columns </span> <span>df</span> <span>=</span> <span>df</span><span>.</span><span>drop</span><span>(</span><span>columns</span><span>=</span><span>rem_column_names</span><span>,</span> <span>axis</span><span>=</span><span>1</span><span>)</span> <span>return</span> <span>df</span> <span># ------------------------------------------------------ # Encoding Categorical variables # ------------------------------------------------------ </span> <span># The method below creates dummy variables from columns with # the prefix "cat". There is the argument to drop the first column # to avoid the Dummy Variable Trap. 
</span><span>def</span> <span>dummy_categorical</span><span>(</span><span>df</span><span>,</span> <span>drop_first</span> <span>=</span> <span>True</span><span>):</span> <span># Get categorical data columns. </span> <span>columns</span> <span>=</span> <span>list</span><span>(</span><span>df</span><span>.</span><span>columns</span><span>.</span><span>values</span><span>)</span> <span>columnsToEncode</span> <span>=</span> <span>columns</span><span>.</span><span>copy</span><span>()</span> <span>for</span> <span>name</span> <span>in</span> <span>columns</span><span>:</span> <span>if</span> <span>name</span><span>[</span><span>0</span><span>:</span><span>3</span><span>]</span> <span>!=</span> <span>'</span><span>cat</span><span>'</span><span>:</span> <span>columnsToEncode</span><span>.</span><span>remove</span><span>(</span><span>name</span><span>)</span> <span># if there are no columns to encode, return unmutated. </span> <span>if</span> <span>not</span> <span>columnsToEncode</span><span>:</span> <span>return</span> <span>df</span> <span># Encode categories </span> <span>for</span> <span>name</span> <span>in</span> <span>columnsToEncode</span><span>:</span> <span>if</span> <span>name</span><span>[</span><span>0</span><span>:</span><span>3</span><span>]</span> <span>!=</span> <span>'</span><span>cat</span><span>'</span><span>:</span> <span>continue</span> <span>tmp</span> <span>=</span> <span>pd</span><span>.</span><span>get_dummies</span><span>(</span><span>df</span><span>[</span><span>name</span><span>],</span> <span>drop_first</span> <span>=</span> <span>drop_first</span><span>)</span> <span>names</span> <span>=</span> <span>{}</span> <span># Get a clean column name. 
</span> <span>clean_name</span> <span>=</span> <span>name</span><span>.</span><span>replace</span><span>(</span><span>"</span><span> </span><span>"</span><span>,</span> <span>"</span><span>_</span><span>"</span><span>).</span><span>replace</span><span>(</span><span>"</span><span>/</span><span>"</span><span>,</span> <span>"</span><span>_</span><span>"</span><span>).</span><span>lower</span><span>()</span> <span># Get a dictionary for renaming the dummay variables in the scheme of old_col_name + response_string </span> <span>if</span> <span>clean_name</span><span>[</span><span>0</span><span>:</span><span>3</span><span>]</span> <span>==</span> <span>"</span><span>cat</span><span>"</span><span>:</span> <span>for</span> <span>tmp_name</span> <span>in</span> <span>tmp</span><span>:</span> <span>tmp_name</span> <span>=</span> <span>str</span><span>(</span><span>tmp_name</span><span>)</span> <span>new_tmp_name</span> <span>=</span> <span>tmp_name</span><span>.</span><span>replace</span><span>(</span><span>"</span><span> </span><span>"</span><span>,</span> <span>"</span><span>_</span><span>"</span><span>).</span><span>replace</span><span>(</span><span>"</span><span>/</span><span>"</span><span>,</span> <span>"</span><span>_</span><span>"</span><span>).</span><span>lower</span><span>()</span> <span>new_tmp_name</span> <span>=</span> <span>clean_name</span> <span>+</span> <span>"</span><span>_</span><span>"</span> <span>+</span> <span>new_tmp_name</span> <span>names</span><span>[</span><span>tmp_name</span><span>]</span> <span>=</span> <span>new_tmp_name</span> <span># Rename the dummy variable dataframe </span> <span>tmp</span> <span>=</span> <span>tmp</span><span>.</span><span>rename</span><span>(</span><span>columns</span><span>=</span><span>names</span><span>)</span> <span># join the dummy variable back to original dataframe. 
</span> <span>df</span> <span>=</span> <span>df</span><span>.</span><span>join</span><span>(</span><span>tmp</span><span>)</span> <span># Drop all old categorical columns </span> <span>df</span> <span>=</span> <span>df</span><span>.</span><span>drop</span><span>(</span><span>columns</span><span>=</span><span>columnsToEncode</span><span>,</span> <span>axis</span><span>=</span><span>1</span><span>)</span> <span>return</span> <span>df</span> <span># Read the file </span><span>df</span> <span>=</span> <span>pd</span><span>.</span><span>read_csv</span><span>(</span><span>pathToData</span><span>)</span> <span># Drop columns such as unique IDs </span><span>try</span><span>:</span> <span>df</span> <span>=</span> <span>df</span><span>.</span><span>drop</span><span>(</span><span>cols_to_remove</span><span>,</span> <span>axis</span><span>=</span><span>1</span><span>)</span> <span>except</span><span>:</span> <span># If columns aren't contained or no columns, exit nicely </span> <span>result</span> <span>=</span> <span>{</span><span>'</span><span>status</span><span>'</span><span>:</span> <span>404</span><span>,</span> <span>'</span><span>message</span><span>'</span><span>:</span> <span>'</span><span>Problem with columns to remove.</span><span>'</span><span>}</span> <span>print</span><span>(</span><span>str</span><span>(</span><span>json</span><span>.</span><span>dumps</span><span>(</span><span>result</span><span>)))</span> <span>quit</span><span>()</span> <span># Get the number of columns before hot encoding </span><span>num_cols_before</span> <span>=</span> <span>df</span><span>.</span><span>shape</span><span>[</span><span>1</span><span>]</span> <span># Encode the data. </span><span>df</span> <span>=</span> <span>add_datatype_prefix</span><span>(</span><span>df</span><span>)</span> <span>df</span> <span>=</span> <span>dummy_categorical</span><span>(</span><span>df</span><span>)</span> <span># Get the new dataframe shape. 
</span><span>num_cols_after</span> <span>=</span> <span>df</span><span>.</span><span>shape</span><span>[</span><span>1</span><span>]</span> <span>percentage_increase</span> <span>=</span> <span>num_cols_after</span> <span>/</span> <span>num_cols_before</span> <span>result</span> <span>=</span> <span>""</span> <span>if</span> <span>percentage_increase</span> <span>></span> <span>unreasonable_increase</span><span>:</span> <span>message</span> <span>=</span> <span>"</span><span>\"</span><span>error</span><span>\"</span><span>: </span><span>\"</span><span>Feature increase is greater than unreasonableIncreaseThreshold, most likely a unique id was included.</span><span>"</span> <span>result</span> <span>=</span> <span>{</span><span>'</span><span>status</span><span>'</span><span>:</span> <span>400</span><span>,</span> <span>'</span><span>message</span><span>'</span><span>:</span> <span>message</span><span>}</span> <span>else</span><span>:</span> <span>filename</span> <span>=</span> <span>filename</span><span>.</span><span>replace</span><span>(</span><span>"</span><span>.csv</span><span>"</span><span>,</span> <span>""</span><span>)</span> <span>import</span> <span>os</span> <span>if</span> <span>not</span> <span>os</span><span>.</span><span>path</span><span>.</span><span>exists</span><span>(</span><span>pathToWriteProcessedFile</span><span>):</span> <span>os</span><span>.</span><span>makedirs</span><span>(</span><span>pathToWriteProcessedFile</span><span>)</span> <span>writeFile</span> <span>=</span> <span>pathToWriteProcessedFile</span> <span>+</span> <span>filename</span> <span>+</span> <span>"</span><span>_encoded.csv</span><span>"</span> <span>df</span><span>.</span><span>to_csv</span><span>(</span><span>path_or_buf</span><span>=</span><span>writeFile</span><span>,</span> <span>sep</span><span>=</span><span>'</span><span>,</span><span>'</span><span>)</span> <span># Process the results and return JSON results object </span> <span>result</span> <span>=</span> 
Here's the first of those Python scripts. It takes a CSV, prefixes columns by datatype, one-hot encodes the categorical columns, and reports back with a JSON status object:

```python
"""
Created on Mon Jun 11 21:12:10 2018

@author: cthomasbrittain
"""
import sys
import json

filename = sys.argv[1]
filepath = sys.argv[2]
pathToWriteProcessedFile = sys.argv[3]
request = sys.argv[4]
request = json.loads(request)

try:
    cols_to_remove = request['columnsToRemove']
    unreasonable_increase = request['unreasonableIncreaseThreshold']
except KeyError:
    # If expected parameters aren't in the request, exit nicely.
    result = {'status': 400, 'message': 'Expected script parameters not found.'}
    print(str(json.dumps(result)))
    quit()

pathToData = filepath + filename

# Clean Data --------------------------------------------------------------------
# -------------------------------------------------------------------------------

# Importing data transformation libraries
import pandas as pd

# The following method will do the following:
# 1. Add a prefix to columns based upon datatypes (cat and con)
# 2. Convert all continuous variables to numeric (float64)
# 3. Convert all categorical variables to objects
# 4. Rename all columns with prefixes, convert to lower-case, and replace
#    spaces with underscores.
# 5. Continuous blanks are replaced with 0 and categorical with 'not_collected'

# This method will also detect manually assigned prefixes and adjust the
# columns and data appropriately.

# Prefix key:
# a) con = continuous
# b) cat = categorical
# c) rem = removal (discards entire column)

def add_datatype_prefix(df, date_to_cont=True):
    # Get a list of current column names.
    column_names = list(df.columns.values)

    # Try to parse object columns as dates; prefix them "con_" (or "date_").
    for name in column_names:
        if df[name].dtype == 'object':
            try:
                df[name] = pd.to_datetime(df[name])
                if date_to_cont:
                    new_col_name = "con_" + name.lower().replace(" ", "_").replace("/", "_")
                else:
                    new_col_name = "date_" + name.lower().replace(" ", "_").replace("/", "_")
                df = df.rename(columns={name: new_col_name})
            except ValueError:
                pass

    column_names = list(df.columns.values)

    # Encode each remaining column with a prefix based upon its datatype,
    # skipping columns that were manually prefixed.
    for name in column_names:
        if name[0:3] in ("rem", "con", "cat") or name[0:4] == "date":
            continue
        if df[name].dtype == 'object':
            new_col_name = "cat_" + name.lower().replace(" ", "_").replace("/", "_")
            df = df.rename(columns={name: new_col_name})
        elif df[name].dtype in ('float64', 'int64', 'datetime64[ns]'):
            new_col_name = "con_" + name.lower().replace(" ", "_").replace("/", "_")
            df = df.rename(columns={name: new_col_name})

    column_names = list(df.columns.values)

    # Get lists of columns for conversion.
    con_column_names = []
    cat_column_names = []
    rem_column_names = []
    date_column_names = []

    for name in column_names:
        if name[0:3] == "cat":
            cat_column_names.append(name)
        elif name[0:3] == "con":
            con_column_names.append(name)
        elif name[0:3] == "rem":
            rem_column_names.append(name)
        elif name[0:4] == "date":
            date_column_names.append(name)

    # Make sure continuous variables are the correct datatype.
    # (Otherwise, they'll be dummied.)
    for name in con_column_names:
        df[name] = pd.to_numeric(df[name], errors='coerce')
        df[name] = df[name].fillna(value=0)

    for name in cat_column_names:
        df[name] = df[name].apply(str)
        df[name] = df[name].fillna(value='not_collected')

    # Remove unwanted columns.
    df = df.drop(columns=rem_column_names)
    return df


# ------------------------------------------------------
# Encoding Categorical variables
# ------------------------------------------------------

# The method below creates dummy variables from columns with
# the prefix "cat". There is the argument to drop the first column
# to avoid the Dummy Variable Trap.
def dummy_categorical(df, drop_first=True):
    # Get categorical data columns.
    columns = list(df.columns.values)
    columnsToEncode = [name for name in columns if name[0:3] == 'cat']

    # If there are no columns to encode, return unmutated.
    if not columnsToEncode:
        return df

    # Encode categories.
    for name in columnsToEncode:
        tmp = pd.get_dummies(df[name], drop_first=drop_first)
        names = {}

        # Get a clean column name.
        clean_name = name.replace(" ", "_").replace("/", "_").lower()

        # Build a dictionary for renaming the dummy variables in the scheme
        # old_col_name + "_" + response_string.
        for tmp_name in tmp:
            tmp_name = str(tmp_name)
            new_tmp_name = tmp_name.replace(" ", "_").replace("/", "_").lower()
            names[tmp_name] = clean_name + "_" + new_tmp_name

        # Rename the dummy variable dataframe.
        tmp = tmp.rename(columns=names)

        # Join the dummy variables back to the original dataframe.
        df = df.join(tmp)

    # Drop all old categorical columns.
    df = df.drop(columns=columnsToEncode)
    return df


# Read the file.
df = pd.read_csv(pathToData)

# Drop columns such as unique IDs.
try:
    df = df.drop(cols_to_remove, axis=1)
except KeyError:
    # If columns aren't contained or no columns, exit nicely.
    result = {'status': 404, 'message': 'Problem with columns to remove.'}
    print(str(json.dumps(result)))
    quit()

# Get the number of columns before hot encoding.
num_cols_before = df.shape[1]

# Encode the data.
df = add_datatype_prefix(df)
df = dummy_categorical(df)

# Get the new dataframe shape; the ratio of columns after to columns
# before flags runaway encoding.
num_cols_after = df.shape[1]
percentage_increase = num_cols_after / num_cols_before

if percentage_increase > unreasonable_increase:
    message = ('Feature increase is greater than unreasonableIncreaseThreshold; '
               'most likely a unique id was included.')
    result = {'status': 400, 'message': message}
else:
    filename = filename.replace(".csv", "")
    import os
    if not os.path.exists(pathToWriteProcessedFile):
        os.makedirs(pathToWriteProcessedFile)
    writeFile = pathToWriteProcessedFile + filename + "_encoded.csv"
    df.to_csv(path_or_buf=writeFile, sep=',')

    # Process the results into a JSON results object.
    result = {'status': 200, 'message': 'encoded data', 'path': writeFile}

# Return the JSON results object on stdout.
print(str(json.dumps(result)))
```
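The renaming scheme in `dummy_categorical` can be hard to picture from the loop alone. Here's a minimal sketch of what one pass does, using a made-up `cat_color` column (the example data is mine, not from the project):

```python
import pandas as pd

# A tiny frame with one categorical and one continuous column,
# mirroring the "cat_"/"con_" prefix scheme described above.
df = pd.DataFrame({
    'cat_color': ['red', 'blue', 'red'],
    'con_price': [1.0, 2.0, 3.0],
})

# Dummy the categorical column, dropping the first level to avoid
# the Dummy Variable Trap, then rename in the old_name + response scheme.
tmp = pd.get_dummies(df['cat_color'], drop_first=True)
tmp = tmp.rename(columns={c: 'cat_color_' + str(c) for c in tmp})
df = df.join(tmp).drop(columns=['cat_color'])

print(list(df.columns))  # ['con_price', 'cat_color_red']
```

With `drop_first=True`, the `blue` level is dropped and a single `cat_color_red` indicator remains; a row that is neither is implicitly `blue`.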
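Because the script is driven entirely by `sys.argv` and replies with one JSON object on stdout, any caller — python-shell on the Node side, or a plain subprocess — just passes four positional arguments. A minimal sketch of that contract; the paths and the `encode_data.py` filename are hypothetical:

```python
import json

# Hypothetical request matching the keys the script expects.
request = {
    'columnsToRemove': ['id'],
    'unreasonableIncreaseThreshold': 3,
}

# argv[1..4]: filename, directory of the file, output directory, JSON request.
args = ['data.csv', '/tmp/uploads/', '/tmp/processed/', json.dumps(request)]

# The caller would run, e.g.:
#   python encode_data.py data.csv /tmp/uploads/ /tmp/processed/ '{"columnsToRemove": ...}'
# and parse the single JSON line the script prints, such as:
reply = '{"status": 200, "message": "encoded data", "path": "/tmp/processed/data_encoded.csv"}'
result = json.loads(reply)
print(result['status'])  # 200
```

The Node side only needs `result['status']` to decide success, and `result['path']` to find the encoded file.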
That’s the premise. I’ll be adding more services in a series of follow-up articles.