Python Parallel Programming in Practice: Parallelizing Python

Reposted

mob64ca14157da7 2023-10-10 10:04:52


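Python is one of the most popular languages for data processing and data science in general. Its ecosystem provides many libraries and frameworks that facilitate high-performance computing. Parallel programming in Python, however, can be quite tricky.

In this tutorial, we'll look at why parallelism is hard, especially in the Python context. We will cover the following:

Why parallelism is tricky in Python (hint: it's because of the GIL, the global interpreter lock).
Threads vs. processes: different ways of achieving parallelism. When should you use one over the other?
Parallel vs. concurrent: why in some cases we can settle for concurrency rather than parallelism.
Building a simple but practical example using the various techniques discussed.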

The Global Interpreter Lock

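The Global Interpreter Lock (GIL) is one of the most controversial subjects in the Python world. In CPython, the most popular implementation of Python, the GIL is a mutex that makes the interpreter thread-safe. The GIL makes it easy to integrate with external libraries that are not thread-safe, and it makes non-parallel code faster. This comes at a cost, though. Because of the GIL, we can't achieve true parallelism via multithreading: two different native threads of the same process can't run Python code at the same time.

Things are not that bad, though, and here's why: work that happens outside the GIL's reach is free to run in parallel. This category includes long-running tasks such as I/O and, fortunately, libraries such as numpy.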

Threads vs. Processes

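So Python is not truly multithreaded. But what is a thread? Let's take a step back and put things in perspective.

A process is a basic operating system abstraction. It is a program in execution, in other words, code that is running. Multiple processes are always running on a computer, and they execute in parallel.

A process can have multiple threads. They execute the same code that belongs to the parent process. Ideally they run in parallel, but not necessarily. Processes alone aren't enough because applications need to stay responsive and listen for user actions while, for example, updating the display and saving a file.

If that's still a bit unclear, here's a cheat sheet: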

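Processes	Threads
Processes don't share memory	Threads share memory
Spawning/switching processes is expensive	Spawning/switching threads is less expensive
Processes require more resources	Threads require fewer resources (they are sometimes called lightweight processes)
No memory synchronisation needed	You need to use synchronisation mechanisms to make sure the data is handled correctly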

There is no one recipe that fits everything. Which one to choose depends greatly on the context and on the task you're trying to achieve.
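The last row of the cheat sheet is the one that bites in practice. As a minimal sketch (the counter, `NUM_THREADS`, `INCREMENTS`, and both function names are made up for this illustration), here are several threads updating one shared variable, with a `threading.Lock` guarding each update:

```python
import threading

NUM_THREADS = 4        # illustrative values, not from the article
INCREMENTS = 10_000

counter = 0
counter_lock = threading.Lock()


def add_many():
    """Increment the shared counter; the lock makes each update atomic."""
    global counter
    for _ in range(INCREMENTS):
        with counter_lock:
            counter += 1


def run_counter_demo():
    """Reset the counter, run the threads, and return the final value."""
    global counter
    counter = 0
    threads = [threading.Thread(target=add_many) for _ in range(NUM_THREADS)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return counter


if __name__ == "__main__":
    print("Final counter:", run_counter_demo())
```

All the threads see the same `counter` because they live in one process. Separate processes would each get their own copy and would need an explicit channel, such as a `multiprocessing.Queue`, to share results.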

Parallel vs. Concurrent

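Now we'll go one step further and dive into concurrency. Concurrency is often mistaken for parallelism, but they are not the same thing. Concurrency means scheduling independent pieces of code to run in a cooperative manner: while one piece of code is waiting on an I/O operation, a different but independent part of the code runs in the meantime.

In Python, we can get lightweight concurrent behaviour via greenlets. From a parallelization point of view, using threads or greenlets is equivalent, because neither of them runs in parallel. Greenlets are even cheaper to create than threads. Because of that, greenlets are heavily used for performing a huge number of simple I/O tasks, like the ones usually found in networking and web servers.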
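To make "cooperative scheduling" concrete without installing anything, here is a toy round-robin scheduler built on plain generators. This is not how greenlets are implemented; it only illustrates the idea that each task runs until it voluntarily gives up control. The `task` and `run_round_robin` helpers are invented for this sketch.

```python
from collections import deque


def task(name, steps):
    """A cooperative task: does one step of work, then yields control."""
    for step in range(steps):
        yield f"{name}: step {step}"


def run_round_robin(tasks):
    """Run generator-based tasks in turn until all of them are finished."""
    log = []
    ready = deque(tasks)
    while ready:
        current = ready.popleft()
        try:
            log.append(next(current))   # run until the task's next yield
            ready.append(current)       # not finished: back of the queue
        except StopIteration:
            pass                        # finished: drop it
    return log


if __name__ == "__main__":
    for line in run_round_robin([task("A", 2), task("B", 3)]):
        print(line)
```

Only one task runs at any moment, and the interleaving is the same on every run, which is the property that distinguishes cooperative concurrency from preemptive threads.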

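Now that we know the difference between threads and processes, and between parallel and concurrent, we can illustrate how different tasks behave under each paradigm. Here's the plan: we will run, multiple times, one task that spends its time outside the GIL and one that runs inside it. We'll run each of them serially, with threads, and with processes. Let's define the tasks: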

import os
import time
import threading
import multiprocessing

NUM_WORKERS = 4

def only_sleep():
    """ Do nothing, wait for a timer to expire """
    print("PID: %s, Process Name: %s, Thread Name: %s" % (
        os.getpid(),
        multiprocessing.current_process().name,
        threading.current_thread().name)
    )
    time.sleep(1)


def crunch_numbers():
    """ Do some computations """
    print("PID: %s, Process Name: %s, Thread Name: %s" % (
        os.getpid(),
        multiprocessing.current_process().name,
        threading.current_thread().name)
    )
    x = 0
    while x < 10000000:
        x += 1

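We've created two tasks. Both of them are long-running, but only crunch_numbers actively performs computations. Let's run only_sleep serially, multithreaded, and using multiple processes, and compare the results: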

# Run tasks serially
start_time = time.time()
for _ in range(NUM_WORKERS):
    only_sleep()
end_time = time.time()

print("Serial time=", end_time - start_time)

# Run tasks using threads
start_time = time.time()
threads = [threading.Thread(target=only_sleep) for _ in range(NUM_WORKERS)]
[thread.start() for thread in threads]
[thread.join() for thread in threads]
end_time = time.time()

print("Threads time=", end_time - start_time)

# Run tasks using processes
start_time = time.time()
# Pass the function itself; writing only_sleep() would call it here in the parent
processes = [multiprocessing.Process(target=only_sleep) for _ in range(NUM_WORKERS)]
[process.start() for process in processes]
[process.join() for process in processes]
end_time = time.time()

print("Parallel time=", end_time - start_time)

Here's the output I got (yours should be similar, although the PIDs and times will vary a bit):

PID: 95726, Process Name: MainProcess, Thread Name: MainThread
PID: 95726, Process Name: MainProcess, Thread Name: MainThread
PID: 95726, Process Name: MainProcess, Thread Name: MainThread
PID: 95726, Process Name: MainProcess, Thread Name: MainThread
Serial time= 4.018089056015015

PID: 95726, Process Name: MainProcess, Thread Name: Thread-1
PID: 95726, Process Name: MainProcess, Thread Name: Thread-2
PID: 95726, Process Name: MainProcess, Thread Name: Thread-3
PID: 95726, Process Name: MainProcess, Thread Name: Thread-4
Threads time= 1.0047411918640137

PID: 95728, Process Name: Process-1, Thread Name: MainThread
PID: 95729, Process Name: Process-2, Thread Name: MainThread
PID: 95730, Process Name: Process-3, Thread Name: MainThread
PID: 95731, Process Name: Process-4, Thread Name: MainThread
Parallel time= 1.014023780822754

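Here are some observations:

With the serial approach, things are pretty obvious. We run the tasks one after the other. All four runs are executed by the same thread of the same process.
Using processes, we cut the execution time down to a quarter of the original time, simply because the tasks run in parallel. Notice how each task is performed in a different process and on the MainThread of that process.
Using threads, we take advantage of the fact that the tasks can run concurrently. The execution time is also cut down to a quarter, even though nothing runs in parallel. Here's how that goes: we spawn the first thread and it starts waiting for the timer to expire. We pause its execution, letting it wait for the timer, and in the meantime we spawn the second thread. We repeat this for all the threads. At some point the timer of the first thread expires, so we switch execution to it and it terminates. The same happens for the second thread and all the others. In the end, the result is as if things had run in parallel. You'll also notice that the four threads branch out from, and live inside, the same process: MainProcess.
You may even notice that the threaded approach is quicker than the truly parallel one. That's due to the overhead of spawning processes. As noted previously, spawning and switching processes is an expensive operation.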
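The thread result can be reproduced in a more compact form. The sketch below is an illustration with made-up names (`nap`, `time_threads`) and a shorter 0.2-second sleep; it relies on the fact that `time.sleep` releases the GIL while waiting, so the waits overlap.

```python
import threading
import time


def nap(seconds):
    """Stand-in for an I/O wait; time.sleep releases the GIL."""
    time.sleep(seconds)


def time_threads(num_threads, seconds):
    """Start num_threads sleeping threads and return the wall-clock time."""
    start = time.perf_counter()
    threads = [threading.Thread(target=nap, args=(seconds,))
               for _ in range(num_threads)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return time.perf_counter() - start


if __name__ == "__main__":
    elapsed = time_threads(4, 0.2)
    print("4 threads sleeping 0.2s each took %.2fs" % elapsed)
```

Run serially, the four sleeps would add up to about 0.8 seconds; with threads the total should stay close to a single sleep.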

Let's run the same routine, this time with the crunch_numbers task:

start_time = time.time()
for _ in range(NUM_WORKERS):
    crunch_numbers()
end_time = time.time()

print("Serial time=", end_time - start_time)

start_time = time.time()
threads = [threading.Thread(target=crunch_numbers) for _ in range(NUM_WORKERS)]
[thread.start() for thread in threads]
[thread.join() for thread in threads]
end_time = time.time()

print("Threads time=", end_time - start_time)


start_time = time.time()
processes = [multiprocessing.Process(target=crunch_numbers) for _ in range(NUM_WORKERS)]
[process.start() for process in processes]
[process.join() for process in processes]
end_time = time.time()

print("Parallel time=", end_time - start_time)

Here's the output I got:

PID: 96285, Process Name: MainProcess, Thread Name: MainThread
PID: 96285, Process Name: MainProcess, Thread Name: MainThread
PID: 96285, Process Name: MainProcess, Thread Name: MainThread
PID: 96285, Process Name: MainProcess, Thread Name: MainThread
Serial time= 2.705625057220459
PID: 96285, Process Name: MainProcess, Thread Name: Thread-1
PID: 96285, Process Name: MainProcess, Thread Name: Thread-2
PID: 96285, Process Name: MainProcess, Thread Name: Thread-3
PID: 96285, Process Name: MainProcess, Thread Name: Thread-4
Threads time= 2.6961309909820557
PID: 96289, Process Name: Process-1, Thread Name: MainThread
PID: 96290, Process Name: Process-2, Thread Name: MainThread
PID: 96291, Process Name: Process-3, Thread Name: MainThread
PID: 96292, Process Name: Process-4, Thread Name: MainThread
Parallel time= 0.8014059066772461

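The main difference here is in the result of the multithreaded approach. This time it performs very similarly to the serial approach, and here's why: since the task performs computations and Python doesn't do real parallelism with threads, the threads basically run one after the other, yielding execution to one another until they all finish.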

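The Python Parallel/Concurrent Programming Ecosystem

Python has rich APIs for parallel and concurrent programming. This tutorial covers the most popular ones, but whatever you need in this domain, there's probably already something out there that can help you achieve your goal.

In the next section, we'll build a practical application in several forms, using all of the libraries presented. Without further ado, here are the modules/libraries we're going to cover:

threading: the standard way of working with threads in Python. It is a higher-level API wrapper over the functionality exposed by the _thread module, which is a low-level interface over the operating system's thread implementation.
concurrent.futures: a module of the standard library that provides an even higher-level abstraction layer over threads. The threads are modelled as asynchronous tasks.
multiprocessing: similar to the threading module, offering a very similar interface but using processes instead of threads.
gevent and greenlets: greenlets, also called micro-threads, are units of execution that can be scheduled cooperatively and can perform tasks concurrently without much overhead.
celery: a high-level distributed task queue. Tasks are queued and executed concurrently using various paradigms such as multiprocessing or gevent.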

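Building a Practical Application

Knowing the theory is nice, but the best way to learn is to build something practical, right? In this section, we're going to build a classic type of application, going through all the different paradigms.

Let's build an application that checks the uptime of websites. There are a lot of such solutions out there, the most well-known probably being Jetpack Monitor and Uptime Robot. The purpose of these apps is to notify you when your website is down so that you can take action quickly. Here's how they work:

The application goes through a list of website URLs very frequently and checks whether those websites are up.
Every website should be checked every 5-10 minutes so that downtime doesn't go unnoticed for long.
Instead of a classic HTTP GET request, it performs a HEAD request so that it doesn't affect your traffic significantly.
If the HTTP status is in the danger ranges (400+, 500+), the owner is notified.
The owner is notified by email, text message, or push notification.

This is why a parallel/concurrent approach to the problem is essential. As the list of websites grows, going through it serially won't guarantee that every website is checked every five minutes or so. A website could be down for hours without the owner being notified.

Let's start by writing some utilities: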

# utils.py

import time
import logging
import requests


class WebsiteDownException(Exception):
    pass


def ping_website(address, timeout=20):
    """
    Check if a website is down. A website is considered down 
    if either the status_code >= 400 or if the timeout expires
    
    Throw a WebsiteDownException if any of the website down conditions are met
    """
    try:
        response = requests.head(address, timeout=timeout)
        if response.status_code >= 400:
            logging.warning("Website %s returned status_code=%s" % (address, response.status_code))
            raise WebsiteDownException()
    except requests.exceptions.RequestException:
        logging.warning("Timeout expired for website %s" % address)
        raise WebsiteDownException()
        

def notify_owner(address):
    """ 
    Send the owner of the address a notification that their website is down 
    
    For now, we're just going to sleep for 0.5 seconds but this is where 
    you would send an email, push notification or text-message
    """
    logging.info("Notifying the owner of %s website" % address)
    time.sleep(0.5)
    

def check_website(address):
    """
    Utility function: check if a website is down, if so, notify the user
    """
    try:
        ping_website(address)
    except WebsiteDownException:
        notify_owner(address)
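One practical detail: by default Python's root logger only shows messages at WARNING level and above, so the `logging.info` call in `notify_owner` would stay silent. A minimal way to surface it, assuming you want everything printed to the console, is to configure logging once at startup. The format string and the example URL below are placeholders chosen for this sketch.

```python
import logging

# force=True (Python 3.8+) replaces any handlers that were set up earlier.
logging.basicConfig(
    level=logging.INFO,
    format="%(levelname)s:%(name)s:%(message)s",
    force=True,
)

# Passing the argument separately lets logging skip the string formatting
# when the message would be filtered out anyway.
logging.info("Notifying the owner of %s website", "http://example.com")
```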

We'll also need a list of websites to try our system out on. Create your own list or use mine:

# websites.py

WEBSITE_LIST = [
    'http://envato.com',
    'http://amazon.co.uk',
    'http://amazon.com',
    'http://facebook.com',
    'http://google.com',
    'http://google.fr',
    'http://google.es',
    'http://google.co.uk',
    'http://internet.org',
    'http://gmail.com',
    'http://stackoverflow.com',
    'http://github.com',
    'http://heroku.com',
    'http://really-cool-available-domain.com',
    'http://djangoproject.com',
    'http://rubyonrails.org',
    'http://basecamp.com',
    'http://trello.com',
    'http://yiiframework.com',
    'http://shopify.com',
    'http://another-really-interesting-domain.co',
    'http://airbnb.com',
    'http://instagram.com',
    'http://snapchat.com',
    'http://youtube.com',
    'http://baidu.com',
    'http://yahoo.com',
    'http://live.com',
    'http://linkedin.com',
    'http://yandex.ru',
    'http://netflix.com',
    'http://wordpress.com',
    'http://bing.com',
]

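Normally, you'd keep this list in a database together with the owners' contact information so that you can reach them. Since that is not the main topic of this tutorial, and for the sake of simplicity, we're just going to use this Python list.

If you paid really close attention, you might have noticed two really long domains in the list that are not valid websites (I hope nobody has bought them by the time you're reading this to prove me wrong!). I added these two domains to make sure some websites are down on every run. Also, let's name our app UptimeSquirrel.

Serial Approach

First, let's try the serial approach and see how well it performs. We'll consider this the baseline.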

# serial_squirrel.py

import time

from utils import check_website
from websites import WEBSITE_LIST


start_time = time.time()

for address in WEBSITE_LIST:
    check_website(address)
        
end_time = time.time()        

print("Time for SerialSquirrel: %ssecs" % (end_time - start_time))

# WARNING:root:Timeout expired for website http://really-cool-available-domain.com
# WARNING:root:Timeout expired for website http://another-really-interesting-domain.co
# WARNING:root:Website http://bing.com returned status_code=405
# Time for SerialSquirrel: 15.881232261657715secs
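The benchmark gives a rough feel for how far the serial approach scales. As a back-of-the-envelope sketch (the helper name and the assumption that every check costs about the same are mine, not the article's), divide the check window by the average time per site:

```python
import math


def max_sites_per_window(sweep_seconds, sites_checked, window_seconds, workers=1):
    """Estimate how many sites fit in one check window.

    Assumes each check takes the average time observed in the sweep and
    that `workers` checks can overlap perfectly, which is optimistic.
    """
    seconds_per_site = sweep_seconds / sites_checked
    return math.floor(window_seconds * workers / seconds_per_site)


if __name__ == "__main__":
    # 33 sites took about 15.88 s serially; the window is five minutes.
    print(max_sites_per_window(15.88, 33, 300))
```

With the numbers from this run, that works out to roughly 600 sites per five-minute window, and a single unreachable site (up to 20 seconds with the default timeout in ping_website) eats a large share of that budget.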

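Threading Approach

We're going to get a bit more creative with the threaded implementation. We're using a queue to hold the addresses and creating worker threads that take them off the queue and process them. We then wait for the queue to be empty, meaning that all the addresses have been processed by our worker threads.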

# threaded_squirrel.py

import time
from queue import Queue
from threading import Thread

from utils import check_website
from websites import WEBSITE_LIST

NUM_WORKERS = 4
task_queue = Queue()

def worker():
    # Constantly check the queue for addresses
    while True:
        address = task_queue.get()
        check_website(address)
        
        # Mark the processed task as done
        task_queue.task_done()

start_time = time.time()
        
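# Create the worker threads. They are daemon threads so that the endless
# worker loops don't keep the program alive after the queue is drained.
threads = [Thread(target=worker, daemon=True) for _ in range(NUM_WORKERS)]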

# Add the websites to the task queue
[task_queue.put(item) for item in WEBSITE_LIST]

# Start all the workers
[thread.start() for thread in threads]

# Wait for all the tasks in the queue to be processed
task_queue.join()

        
end_time = time.time()        

print("Time for ThreadedSquirrel: %ssecs" % (end_time - start_time))

# WARNING:root:Timeout expired for website http://really-cool-available-domain.com
# WARNING:root:Timeout expired for website http://another-really-interesting-domain.co
# WARNING:root:Website http://bing.com returned status_code=405
# Time for ThreadedSquirrel: 3.110753059387207secs
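Workers that loop forever never return on their own. One common way to shut them down cleanly is to push one sentinel value per worker onto the queue. The sketch below uses a stand-in `fake_check` instead of `check_website` so that it runs without network access; `SENTINEL`, `fake_check`, and `run_workers` are names invented for this example.

```python
from queue import Queue
from threading import Thread

SENTINEL = None  # hypothetical marker meaning "no more work"


def fake_check(address):
    """Stand-in for check_website; just reports the address length."""
    return address, len(address)


def worker(task_queue, results):
    """Process addresses until the sentinel arrives, then exit."""
    while True:
        address = task_queue.get()
        if address is SENTINEL:
            task_queue.task_done()
            break
        results.append(fake_check(address))  # list.append is thread-safe in CPython
        task_queue.task_done()


def run_workers(addresses, num_workers=4):
    """Check all addresses with a pool of threads and return the results."""
    task_queue = Queue()
    results = []
    threads = [Thread(target=worker, args=(task_queue, results))
               for _ in range(num_workers)]
    for thread in threads:
        thread.start()
    for address in addresses:
        task_queue.put(address)
    for _ in threads:
        task_queue.put(SENTINEL)   # one sentinel per worker
    for thread in threads:
        thread.join()              # returns once every worker has exited
    return results


if __name__ == "__main__":
    print(sorted(run_workers(["http://a.example", "http://bb.example"])))
```

Because every worker exits by itself, the threads can be joined directly and no thread is left blocked on an empty queue.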

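concurrent.futures

As stated previously, concurrent.futures is a high-level API for using threads. The approach we're taking here uses a ThreadPoolExecutor. We submit tasks to the pool and get back futures, which are results that will be available to us in the future. Of course, we can wait for all the futures to become actual results.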

# future_squirrel.py

import time
import concurrent.futures

from utils import check_website
from websites import WEBSITE_LIST

NUM_WORKERS = 4

start_time = time.time()

with concurrent.futures.ThreadPoolExecutor(max_workers=NUM_WORKERS) as executor:
    futures = {executor.submit(check_website, address) for address in WEBSITE_LIST}
    concurrent.futures.wait(futures)

end_time = time.time()        

print("Time for FutureSquirrel: %ssecs" % (end_time - start_time))

# WARNING:root:Timeout expired for website http://really-cool-available-domain.com
# WARNING:root:Timeout expired for website http://another-really-interesting-domain.co
# WARNING:root:Website http://bing.com returned status_code=405
# Time for FutureSquirrel: 1.812899112701416secs
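`concurrent.futures.wait` only tells us that everything finished. When the outcome of each task matters, `as_completed` hands back futures as they finish, and `future.result()` returns the value or re-raises the exception from the worker. Here is a sketch with a stand-in function; `fake_check`, `check_all`, and the `.invalid` address are invented so that the example runs offline.

```python
import concurrent.futures


def fake_check(address):
    """Stand-in for check_website: fail for addresses ending in '.invalid'."""
    if address.endswith(".invalid"):
        raise ValueError("website down: %s" % address)
    return "up"


def check_all(addresses, num_workers=4):
    """Return a dict mapping each address to 'up' or 'down'."""
    status = {}
    with concurrent.futures.ThreadPoolExecutor(max_workers=num_workers) as executor:
        # Remember which address each future belongs to.
        future_to_address = {
            executor.submit(fake_check, address): address
            for address in addresses
        }
        for future in concurrent.futures.as_completed(future_to_address):
            address = future_to_address[future]
            try:
                status[address] = future.result()
            except ValueError:
                status[address] = "down"
    return status


if __name__ == "__main__":
    print(check_all(["http://example.com", "http://nope.invalid"]))
```

Keeping a dictionary from future to address is a common way to recover the input that produced each result, since futures complete in no particular order.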

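Multiprocessing Approach

The multiprocessing library provides an almost drop-in replacement API for the threading library. In this case, we're going to take an approach more similar to the concurrent.futures one. We set up a multiprocessing.Pool and submit tasks to it by mapping a function over the list of addresses (think of the classic Python map function).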

# multiprocessing_squirrel.py

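import time
import multiprocessing

from utils import check_website
from websites import WEBSITE_LIST

NUM_WORKERS = 4

# The guard keeps worker processes from re-running this block when they
# import the module (required with the "spawn" start method).
if __name__ == "__main__":
    start_time = time.time()

    with multiprocessing.Pool(processes=NUM_WORKERS) as pool:
        results = pool.map_async(check_website, WEBSITE_LIST)
        results.wait()

    end_time = time.time()

    print("Time for MultiProcessingSquirrel: %ssecs" % (end_time - start_time))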

# WARNING:root:Timeout expired for website http://really-cool-available-domain.com
# WARNING:root:Timeout expired for website http://another-really-interesting-domain.co
# WARNING:root:Website http://bing.com returned status_code=405
# Time for MultiProcessingSquirrel: 2.8224599361419678secs

Gevent

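Gevent is a popular alternative for achieving massive concurrency. There are a few things you need to know before using it:

Code performed concurrently by greenlets is deterministic. As opposed to the other alternatives presented, this paradigm guarantees that any two identical runs will always produce the same results in the same order.
You need to monkey patch standard functions so that they cooperate with gevent. Here's what that means. Normally, a socket operation is blocking: we wait for the operation to finish. In a multithreaded environment, the scheduler would simply switch to another thread while the first one is waiting for I/O. Since we're not in a multithreaded environment, gevent patches the standard functions so that they become non-blocking and return control to the gevent scheduler.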
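If the term is new, monkey patching just means replacing an attribute of a module or class at runtime. The toy example below swaps `time.sleep` for a function that records the call instead of blocking, then restores the original. It only illustrates the mechanism; gevent's real patches are far more involved, and `run_patched`, `recording_sleep`, and `slow_task` are names invented for this sketch.

```python
import time


def run_patched(func):
    """Call func() with time.sleep replaced, and return the recorded delays."""
    recorded = []

    def recording_sleep(seconds):
        recorded.append(seconds)   # note the request instead of blocking

    original_sleep = time.sleep
    time.sleep = recording_sleep   # the monkey patch
    try:
        func()
    finally:
        time.sleep = original_sleep  # always restore the real function
    return recorded


def slow_task():
    """Would block for six seconds if time.sleep were not patched."""
    time.sleep(5)
    time.sleep(1)


if __name__ == "__main__":
    print(run_patched(slow_task))
```

Note that `slow_task` is unchanged: it looks up `time.sleep` when it runs, so it picks up whatever the module attribute points to at that moment.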

To install gevent, run: pip install gevent

Here's how to use gevent to perform our task with a gevent.pool.Pool:

# green_squirrel.py

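# Monkey-patch the socket module before anything else imports it,
# so that the HTTP requests made through requests become cooperative
from gevent import monkey
monkey.patch_socket()

import time
from gevent.pool import Pool

from utils import check_website
from websites import WEBSITE_LIST

# Note that you can spawn many workers with gevent since the cost of creating and switching is very low
NUM_WORKERS = 4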

start_time = time.time()

pool = Pool(NUM_WORKERS)
for address in WEBSITE_LIST:
    pool.spawn(check_website, address)

# Wait for stuff to finish
pool.join()
        
end_time = time.time()        

print("Time for GreenSquirrel: %ssecs" % (end_time - start_time))
# Time for GreenSquirrel: 3.8395519256591797secs

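Celery

Celery is an approach that differs considerably from what we've seen so far. It is battle-tested in very complex and high-performance environments. Setting up Celery requires a bit more tinkering than all of the solutions above.

First, we need to install Celery:

pip install celery

Tasks are the central concept in a Celery project. Everything you want to run inside Celery needs to be a task. Celery offers great flexibility for running tasks: you can run them synchronously or asynchronously, in real time or on a schedule, on the same machine or on multiple machines, and using threads, processes, Eventlet, or gevent.

The setup is slightly more complex. Celery uses other services for sending and receiving messages. These messages are usually tasks or the results of tasks. We're going to use Redis for this purpose in this tutorial. Redis is a good choice because it's really easy to install and configure, and it's quite possible that you already use it in your application for other purposes, such as caching and pub/sub.

You can install Redis by following the instructions on the Redis Quick Start page. Don't forget to install the redis Python library, pip install redis, and the bundle needed for using Redis with Celery: pip install celery[redis].

Start the Redis server like this: $ redis-server

To start building things with Celery, we first need to create a Celery application. After that, Celery needs to know what kinds of tasks it might execute. To achieve that, we need to register tasks with the Celery application. We'll do this using the @app.task decorator: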

# celery_squirrel.py

import time
from utils import check_website
from websites import WEBSITE_LIST
from celery import Celery
from celery.result import ResultSet

app = Celery('celery_squirrel',
             broker='redis://localhost:6379/0',
             backend='redis://localhost:6379/0')

@app.task
def check_website_task(address):
    return check_website(address)

if __name__ == "__main__":
    start_time = time.time()

    # Using `delay` runs the task async
    rs = ResultSet([check_website_task.delay(address) for address in WEBSITE_LIST])
    
    # Wait for the tasks to finish
    rs.get()

    end_time = time.time()

    print("CelerySquirrel:", end_time - start_time)
    # CelerySquirrel: 2.4979639053344727

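If nothing is happening, don't panic. Remember, Celery is a service, and we need to run it. So far, we've only placed the tasks in Redis; we haven't started Celery to execute them. To do that, run this command in the folder where the code resides:

celery -A celery_squirrel worker --loglevel=debug --concurrency=4

Now rerun the Python script and see what happens. One thing to pay attention to: notice how we passed the Redis address to our Celery application twice. The broker parameter specifies where tasks are passed to Celery, and backend is where Celery puts the results so that we can use them in our app. If we don't specify a result backend, there's no way for us to know when a task was processed or what its result was.

Also, be aware that the logs now go to the standard output of the Celery process, so make sure to check them in the appropriate terminal.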