How to make Python code concurrent with 3 lines

Published 7 years ago

Inspired by @rpalo’s quest to uncover gems in Python’s standard library (see his post “defaultdicts: Never Check If a Key is Present Again!”, Ryan Palo, Nov 18 ’18), I decided to share one of my favorite tricks in Python’s standard library through an example. The entire code runs on Python 3.2+ without external packages.

The initial problem

Let’s say you have a thousand URLs to process, download or examine, so you need to issue that many HTTP GET calls and retrieve the body of each response.

This is a way to do it:

import http.client
import socket

def get_it(url):
    try:
        # always set a timeout when you connect to an external server
        connection = http.client.HTTPSConnection(url, timeout=2)

        connection.request("GET", "/")

        response = connection.getresponse()

        return response.read()
    except socket.timeout:
        # in a real world scenario you would probably do stuff if the
        # socket goes into timeout
        pass

urls = [
    "www.google.com",
    "www.youtube.com",
    "www.wikipedia.org",
    "www.reddit.com",
    "www.httpbin.org"
] * 200

for url in urls:
    get_it(url)

(I wouldn’t use the standard library as an HTTP client, but for the purpose of this post it’s okay.)

As you can see, there’s no magic here. Python iterates over the 1000 URLs and calls each of them, one after the other.

On my computer this script occupies 2% of the CPU and spends most of its time waiting for I/O:

$ time python io_bound_serial.py
20.67s user 5.37s system 855.03s real 24292kB mem

It runs for roughly 14 minutes. We can do better.
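
The gap between CPU time and wall-clock time in the output above is the usual signature of an I/O-bound task. Here is a minimal sketch of the same measurement from inside Python, assuming time.sleep as a stand-in for a slow network call (simulated_request is a hypothetical name, not part of the original script):

```python
import time

def simulated_request(delay=0.05):
    # hypothetical stand-in for a blocking HTTP call: the thread just waits
    time.sleep(delay)

wall_start = time.perf_counter()   # wall-clock time
cpu_start = time.process_time()    # CPU time consumed by this process

for _ in range(5):
    simulated_request()

wall_elapsed = time.perf_counter() - wall_start
cpu_elapsed = time.process_time() - cpu_start

print(f"wall: {wall_elapsed:.3f}s  cpu: {cpu_elapsed:.3f}s")
```

When the wall-clock figure dwarfs the CPU figure, the program is mostly waiting, and that waiting is what a pool of workers can overlap.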

Show me the trick!

from concurrent.futures import ThreadPoolExecutor as PoolExecutor
import http.client
import socket

def get_it(url):
    try:
        # always set a timeout when you connect to an external server
        connection = http.client.HTTPSConnection(url, timeout=2)

        connection.request("GET", "/")

        response = connection.getresponse()

        return response.read()
    except socket.timeout:
        # in a real world scenario you would probably do stuff if the
        # socket goes into timeout
        pass

urls = [
    "www.google.com",
    "www.youtube.com",
    "www.wikipedia.org",
    "www.reddit.com",
    "www.httpbin.org"
] * 200

with PoolExecutor(max_workers=4) as executor:
    for _ in executor.map(get_it, urls):
        pass

Let’s see what changed:

# import a new API to create a thread pool
from concurrent.futures import ThreadPoolExecutor as PoolExecutor

# create a thread pool of 4 threads
with PoolExecutor(max_workers=4) as executor:

    # distribute the 1000 URLs among 4 threads in the pool
    # _ is the body of each page that I'm ignoring right now
    for _ in executor.map(get_it, urls):
        pass
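
If you do want the page bodies instead of discarding them, executor.map hands results back in the order of the inputs, regardless of which worker finishes first. A small sketch of that behavior, assuming a hypothetical fake_get that sleeps instead of touching the network (the hosts and delays are made up for illustration):

```python
from concurrent.futures import ThreadPoolExecutor
import time

# hypothetical hosts and delays, chosen so the first task finishes last
DELAYS = {"slow.example": 0.05, "medium.example": 0.02, "fast.example": 0.0}

def fake_get(host):
    # stand-in for get_it: sleeps instead of making an HTTP request
    time.sleep(DELAYS[host])
    return f"body of {host}"

with ThreadPoolExecutor(max_workers=3) as executor:
    # map yields results in input order, not in completion order
    bodies = list(executor.map(fake_get, DELAYS))

print(bodies)
```

This means the results can be paired back with the original list of URLs using zip, with no extra bookkeeping.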

So, with 3 lines of code, we turned a slow serial task into a concurrent one that takes a little short of 5 minutes:

$ time python io_bound_threads.py
21.40s user 6.10s system 294.07s real 31784kB mem

We went from 855.03s to 294.07s, a 2.9x speedup!
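
The get_it function above silently returns None when a socket times out. If you need to know which hosts failed, one possible approach is to use submit together with as_completed and keep a mapping from each future back to its input. This is only a sketch: flaky_get and the host names are hypothetical, and the timeout is raised by hand so the example runs offline.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed
import socket

def flaky_get(host):
    # hypothetical stand-in for get_it that lets the timeout propagate
    if host.startswith("bad"):
        raise socket.timeout(f"{host} timed out")
    return f"body of {host}"

hosts = ["good.example", "bad.example", "fine.example"]
results, failures = {}, {}

with ThreadPoolExecutor(max_workers=4) as executor:
    # remember which host each future belongs to
    future_to_host = {executor.submit(flaky_get, host): host for host in hosts}

    # as_completed yields futures as soon as they finish, in any order
    for future in as_completed(future_to_host):
        host = future_to_host[future]
        try:
            results[host] = future.result()  # re-raises the worker's exception
        except socket.timeout as exc:
            failures[host] = exc

print(sorted(results), sorted(failures))
```

Compared with executor.map, this costs a few more lines but lets you handle each outcome as soon as it is available.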

Wait, there’s more

The great thing about this new API is that you can substitute

from concurrent.futures import ThreadPoolExecutor as PoolExecutor

with

from concurrent.futures import ProcessPoolExecutor as PoolExecutor

to tell Python to use processes instead of threads. Out of curiosity, let’s see what happens to the running time:

$ time python io_bound_processes.py
22.19s user 6.03s system 270.28s real 23324kB mem

That’s about 24 seconds less than the threaded version, so not much different. Keep in mind that these are unscientific experiments and I’m using the computer while these scripts run.

Bonus content

My computer has 4 cores. Let’s see what happens to the threaded version when we increase the number of worker threads:

# 6 threads
20.48s user 5.19s system 155.92s real 35876kB mem
# 8 threads
23.48s user 5.55s system 178.29s real 40472kB mem
# 16 threads
23.77s user 5.44s system 119.69s real 58928kB mem
# 32 threads
21.88s user 4.81s system 119.26s real 96136kB mem

Three things to notice: memory usage obviously increases, we hit a wall around 16 threads, and at 16 threads we’re more than 7x faster than the serial version.
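
To try this kind of comparison without hitting real servers, here is a rough sketch that times a pool of different sizes against a simulated request. The names simulated_request and timed_run are hypothetical, and time.sleep stands in for network latency, so the numbers will not show the wall that real bandwidth and remote servers impose.

```python
from concurrent.futures import ThreadPoolExecutor
import time

def simulated_request(_):
    # hypothetical stand-in for one HTTP call: 20 ms of pure waiting
    time.sleep(0.02)

def timed_run(workers, tasks=16):
    # run `tasks` simulated requests on a pool of `workers` threads
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=workers) as executor:
        list(executor.map(simulated_request, range(tasks)))
    return time.perf_counter() - start

timings = {workers: timed_run(workers) for workers in (1, 4, 8)}

for workers, elapsed in timings.items():
    print(f"{workers} threads: {elapsed:.2f}s")
```

With pure sleeping the elapsed time keeps shrinking as threads are added; the plateau in the measurements above comes from the real network and the servers on the other end.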

If you don’t recognize time’s output, it’s because I’ve aliased it like this:

time='gtime -f '\''%Us user %Ss system %es real %MkB mem -- %C'\'

where gtime is installed with brew install gnu-time.

Conclusions

I think ThreadPoolExecutor and ProcessPoolExecutor are super cool additions to Python’s standard library. You could have done almost everything they do with the “older” threading and multiprocessing modules and FIFO queues, but this API is so much better.
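
For comparison, this is roughly what the same thread pool looks like when built by hand with threading and queue. It is a sketch under simplifying assumptions: fake_get is a hypothetical stand-in for get_it that never fails, and a None sentinel tells each worker to stop.

```python
import queue
import threading

def fake_get(host):
    # hypothetical stand-in for get_it, with no network access
    return f"body of {host}"

def worker(tasks, results):
    # pull hosts from the task queue until the None sentinel arrives
    while True:
        host = tasks.get()
        if host is None:
            break
        results.put((host, fake_get(host)))

hosts = ["a.example", "b.example", "c.example"]
tasks, results = queue.Queue(), queue.Queue()

threads = [threading.Thread(target=worker, args=(tasks, results)) for _ in range(4)]
for thread in threads:
    thread.start()

for host in hosts:
    tasks.put(host)
for _ in threads:
    tasks.put(None)  # one sentinel per worker

for thread in threads:
    thread.join()

# every host produced exactly one result, so drain that many items
bodies = dict(results.get() for _ in hosts)
print(bodies)
```

Queues, sentinels, start and join are all bookkeeping that the executor’s with block and map call take care of for you.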

Original article: How to make Python code concurrent with 3 lines
