How to make Python code concurrent with 3 lines

I was inspired by @rpalo’s quest to uncover gems in Python’s standard library:


defaultdicts: Never Check If a Key is Present Again! (Ryan Palo, Nov 18 ’18)

I decided to share one of my favorite tricks in Python’s standard library through an example. The entire code runs on Python 3.2+ without external packages.

The initial problem

Let’s say you have a thousand URLs to process/download/examine, so you need to issue as many HTTP GET calls and retrieve the body of each response.

This is a way to do it:

import http.client
import socket

def get_it(url):
    try:
        # always set a timeout when you connect to an external server
        connection = http.client.HTTPSConnection(url, timeout=2)

        connection.request("GET", "/")

        response = connection.getresponse()

        return response.read()
    except socket.timeout:
        # in a real world scenario you would probably do stuff if the
        # socket goes into timeout
        pass

urls = [
    "www.google.com",
    "www.youtube.com",
    "www.wikipedia.org",
    "www.reddit.com",
    "www.httpbin.org"
] * 200

for url in urls:
    get_it(url)


(I wouldn’t use the standard library as an HTTP client but for the purpose of this post it’s okay)

As you can see there’s no magic here. Python iterates over 1000 URLs and requests each of them, one after the other.

On my computer this script uses about 2% of the CPU and spends most of its time waiting for I/O:

$ time python io_bound_serial.py
20.67s user 5.37s system 855.03s real 24292kB mem


It runs for roughly 14 minutes. We can do better.

Show me the trick!

from concurrent.futures import ThreadPoolExecutor as PoolExecutor
import http.client
import socket

def get_it(url):
    try:
        # always set a timeout when you connect to an external server
        connection = http.client.HTTPSConnection(url, timeout=2)

        connection.request("GET", "/")

        response = connection.getresponse()

        return response.read()
    except socket.timeout:
        # in a real world scenario you would probably do stuff if the
        # socket goes into timeout
        pass

urls = [
    "www.google.com",
    "www.youtube.com",
    "www.wikipedia.org",
    "www.reddit.com",
    "www.httpbin.org"
] * 200

with PoolExecutor(max_workers=4) as executor:
    for _ in executor.map(get_it, urls):
        pass


Let’s see what changed:

# import a new API to create a thread pool
from concurrent.futures import ThreadPoolExecutor as PoolExecutor

# create a thread pool of 4 threads
with PoolExecutor(max_workers=4) as executor:

    # distribute the 1000 URLs among 4 threads in the pool
    # _ is the body of each page that I'm ignoring right now
    for _ in executor.map(get_it, urls):
        pass


So, with 3 lines of code, we turned a slow serial task into a concurrent one that takes a little under 5 minutes:

$ time python io_bound_threads.py
21.40s user 6.10s system 294.07s real 31784kB mem


We went from 855.03s to 294.07s, a 2.9x speedup!
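The loop above throws the page bodies away. If you want to keep them, executor.map yields its results in the same order as the input, so they can be paired back with their URLs. Here is a minimal sketch of that idea. fake_get_it and fetch_all are hypothetical names, and fake_get_it sleeps instead of making a real HTTP request so the example runs offline:

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fake_get_it(url):
    # Hypothetical stand-in for get_it(): sleeps instead of doing a real
    # HTTP request, so the example runs without network access.
    time.sleep(0.01)
    return ("body of " + url).encode()


def fetch_all(urls, max_workers=4):
    # executor.map() yields results in the same order as `urls`,
    # regardless of which thread finishes first, so zip() pairs them up.
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        return list(zip(urls, executor.map(fake_get_it, urls)))


if __name__ == "__main__":
    for url, body in fetch_all(["www.google.com", "www.wikipedia.org"]):
        print(url, len(body))
```

A list of (url, body) pairs is used rather than a dict because the URL list in this post contains duplicates.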

Wait, there’s more

The great thing about this new API is that you can substitute

from concurrent.futures import ThreadPoolExecutor as PoolExecutor


with

from concurrent.futures import ProcessPoolExecutor as PoolExecutor


to tell Python to use processes instead of threads. Out of curiosity, let’s see what happens to the running time:

$ time python io_bound_processes.py
22.19s user 6.03s system 270.28s real 23324kB mem


That’s about 24 seconds less than the threaded version, so not much different. Keep in mind that these are unscientific experiments and I’m using the computer while these scripts run.
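Since both executors expose the same interface, the choice between threads and processes can also be passed in as a parameter. The sketch below assumes a hypothetical simulated_io function that sleeps in place of the HTTP call; run_all is also an illustrative name. One thing to keep in mind with processes: on platforms that start workers with the "spawn" method (Windows, and macOS by default since Python 3.8), the pool must be created under an if __name__ == "__main__": guard and the worker function must live at module level.

```python
import time
from concurrent.futures import ProcessPoolExecutor, ThreadPoolExecutor


def simulated_io(n):
    # Hypothetical stand-in for get_it(): waits briefly, then returns a value.
    # Defined at module level so a process pool can pickle a reference to it.
    time.sleep(0.01)
    return n * 2


def run_all(executor_cls, items, max_workers=4):
    # Works with either executor class, since both expose the same map() API.
    with executor_cls(max_workers=max_workers) as executor:
        return list(executor.map(simulated_io, items))


if __name__ == "__main__":
    # The guard keeps worker processes from re-running this block when they
    # import the main module (needed with the "spawn" start method).
    print(run_all(ThreadPoolExecutor, range(8)))
    print(run_all(ProcessPoolExecutor, range(8)))
```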

Bonus content

My computer has 4 cores. Let’s see what happens to the threaded version as the number of worker threads increases:

# 6 threads
20.48s user 5.19s system 155.92s real 35876kB mem
# 8 threads
23.48s user 5.55s system 178.29s real 40472kB mem
# 16 threads
23.77s user 5.44s system 119.69s real 58928kB mem
# 32 threads
21.88s user 4.81s system 119.26s real 96136kB mem


Three things to notice: RAM usage obviously increases, we hit a wall around 16 threads, and at 16 threads we’re more than 7x faster than the serial version.
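If you want to play with the number of workers without hitting real servers, the experiment can be approximated offline. This is only a sketch under a strong assumption: simulated_request (a hypothetical name) replaces each HTTP call with a fixed 50 ms sleep, so the numbers it prints say nothing about real network behavior.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def simulated_request(_):
    # Hypothetical stand-in for an HTTP call: 50 ms spent "waiting for I/O".
    time.sleep(0.05)


def time_pool(max_workers, n_tasks=32):
    # Return the wall-clock seconds needed to run n_tasks on the pool.
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        list(executor.map(simulated_request, range(n_tasks)))
    return time.perf_counter() - start


if __name__ == "__main__":
    for workers in (1, 4, 8, 16, 32):
        print("{:>2} threads: {:.2f}s".format(workers, time_pool(workers)))
```

With a pure sleep the elapsed time keeps shrinking until there are as many workers as tasks. The plateau in the real measurements above presumably comes from factors this sketch does not model, such as the network and the remote servers.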

If you don’t recognize time’s output, it’s because I’ve aliased it like this:

alias time='gtime -f '\''%Us user %Ss system %es real %MkB mem -- %C'\'''


where gtime is installed with brew install gnu-time.
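If gtime isn’t available, roughly the same figures can be collected from inside Python on Unix-like systems. The following is a sketch, not what was used for the numbers in this post: measure is a hypothetical helper built on the standard resource module (which does not exist on Windows).

```python
import resource
import time


def measure(func, *args):
    # Run func(*args) and report figures similar to the gtime format above.
    # Note: the CPU and memory numbers cover the whole process so far,
    # not just the call to func.
    start = time.perf_counter()
    func(*args)
    real = time.perf_counter() - start
    usage = resource.getrusage(resource.RUSAGE_SELF)
    return {
        "user": usage.ru_utime,      # CPU seconds spent in user mode
        "system": usage.ru_stime,    # CPU seconds spent in kernel mode
        "real": real,                # wall-clock seconds
        "max_rss": usage.ru_maxrss,  # peak memory: kB on Linux, bytes on macOS
    }


if __name__ == "__main__":
    print(measure(time.sleep, 0.1))
```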

Conclusions

I think ThreadPoolExecutor and ProcessPoolExecutor are super cool additions to Python’s standard library. You could do almost everything they do with the “older” threading and multiprocessing modules plus FIFO queues, but this API is so much better.
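For comparison, here is roughly what the “older” style looks like with threading and a FIFO queue. It is one possible sketch among many, with hypothetical names (simulated_io, run_with_threads) and a sleep in place of the HTTP call:

```python
import queue
import threading
import time


def simulated_io(n):
    # Hypothetical stand-in for get_it(): waits briefly, then returns a value.
    time.sleep(0.01)
    return n * 2


def run_with_threads(items, num_workers=4):
    tasks = queue.Queue()
    results = {}
    lock = threading.Lock()

    def worker():
        while True:
            item = tasks.get()
            if item is None:  # sentinel: no more work for this thread
                break
            index, value = item
            result = simulated_io(value)
            with lock:
                results[index] = result

    threads = [threading.Thread(target=worker) for _ in range(num_workers)]
    for t in threads:
        t.start()
    for index, value in enumerate(items):
        tasks.put((index, value))
    for _ in threads:
        tasks.put(None)  # one sentinel per worker
    for t in threads:
        t.join()
    # Rebuild the results in input order, as executor.map() would.
    return [results[i] for i in range(len(results))]


if __name__ == "__main__":
    print(run_with_threads(range(8)))
```

All the bookkeeping here (queue, sentinels, lock, joins, result ordering) is what the executor’s with block and map call take care of.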

