Python multithreaded CSV reading: multithreaded reads and writes in Python

Reposted

lgmyxbjfu 2023-06-30 12:59:58


Contents

Introduction
Comparison
Baseline
_thread
Thread class

Lock
Queue

multiprocessing.dummy
Thread pool (recommended)

Progress bar

References

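Introduction

Multiprocessing (Process): multiprocessing

Advantage: runs in parallel on multiple CPU cores
Disadvantages: uses the most resources; fewer processes than threads can be started
Use case: CPU-bound work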

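Multithreading (Thread): threading

Advantages: lighter-weight than processes, uses fewer resources
Disadvantages:

Compared with processes: because of the GIL, only one thread executes Python bytecode at a time, so threads cannot use multiple CPUs in parallel. Threads still speed up programs that wait on IO, because the GIL is released during blocking IO.
Compared with coroutines: the number of threads that can be started is limited, each thread uses memory, and there is thread-switching overhead

Use case: IO-bound work with a modest number of concurrent tasks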

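Coroutines (Coroutine): asyncio

Advantages: lowest memory overhead; a very large number of coroutines can be started
Disadvantages: fewer supported libraries; more complex to implement
Use case: IO-bound work with a very large number of tasks

IO means input/output. It covers file IO and network IO, for example reading and writing files, database access, and network requests (web crawlers).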
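To make the IO-bound claim concrete, here is a minimal sketch that compares sequential and threaded execution of a simulated IO call. The names fake_io, run_sequential and run_threaded are hypothetical, and time.sleep merely stands in for real file or network IO; actual speedups depend on the workload.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fake_io(seconds=0.1):
    """Hypothetical stand-in for an IO call; sleeping releases the GIL."""
    time.sleep(seconds)
    return seconds


def run_sequential(tasks):
    """Run the simulated IO calls one after another; return elapsed seconds."""
    start = time.perf_counter()
    for _ in range(tasks):
        fake_io()
    return time.perf_counter() - start


def run_threaded(tasks):
    """Run the same calls in a thread pool; return elapsed seconds."""
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=tasks) as executor:
        # list() forces all results to be collected before timing stops
        list(executor.map(lambda _: fake_io(), range(tasks)))
    return time.perf_counter() - start


if __name__ == '__main__':
    print('sequential:', run_sequential(5))
    print('threaded:  ', run_threaded(5))
```

Because the waits overlap, the threaded run should take roughly as long as one call rather than the sum of all five.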

Goals for a convenient multithreading approach:

Fast
Returns values
Keeps data synchronized

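Comparison

Approach	Advantages	Disadvantages	Time (s)
Baseline			33.05
_thread	1. Runs in the background 2. Suits GUI programs	1. The program must keep running 2. Return values are hard to obtain	142.75
Thread class		1. Getting return values is somewhat awkward 2. Data synchronization requires Lock or Queue	29.22
multiprocessing.dummy	1. Easy to start 2. Has return values 3. Data synchronization	Arguments must be collected first, so the code is structured a little differently	28.81
Thread pool	1. Easy to start 2. Has return values 3. Data synchronization	Arguments must be collected first, so the code is structured a little differently	30.09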

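Baseline

A simple file write stands in for an IO operation. The later listings import this function with from tool import benchmark, which suggests the code is saved as tool.py.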

def benchmark(n):
    """Baseline function for the threading comparison: write n million lines to n.txt."""
    i = 0
    with open('{}.txt'.format(n), 'w') as f:
        for i in range(n * 1000000):
            f.write(str(i) + '\n')
    return i


if __name__ == '__main__':
    from timeit import timeit


    def f():
        for n in range(10):
            print(benchmark(n))


    print(timeit(f, number=1))

_thread

import _thread

from tool import benchmark


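def f():
    for n in range(10):
        # start_new_thread returns the new thread's identifier, not benchmark's result
        print(_thread.start_new_thread(benchmark, (n,)))


if __name__ == '__main__':
    f()
    # Keep the main thread alive, otherwise the program may exit before the workers finish.
    # Note that this busy loop also competes with the workers for the GIL.
    while True:
        pass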

Disadvantages:

The program must keep running
Return values are hard to obtain
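Both drawbacks can be worked around with the low-level module itself. The sketch below assumes one lock per worker as a completion signal and a shared dict for return values; worker and run_all are hypothetical names, and the sleep stands in for benchmark.

```python
import _thread
import time


def worker(n, results, done):
    """Store a result for n, then signal completion by releasing the lock."""
    try:
        time.sleep(0.05)      # stand-in for real IO work
        results[n] = n * n
    finally:
        done.release()        # a _thread lock may be released by another thread


def run_all(count):
    """Start count workers and block until every one has finished."""
    results = {}
    locks = []
    for n in range(count):
        done = _thread.allocate_lock()
        done.acquire()        # held until the worker releases it
        _thread.start_new_thread(worker, (n, results, done))
        locks.append(done)
    for done in locks:
        done.acquire()        # blocks without spinning until the worker is done
    return results


if __name__ == '__main__':
    print(run_all(4))
```

Waiting on the locks replaces the endless busy loop, so the main thread sleeps instead of competing for the GIL.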

Thread class

import threading

from tool import benchmark


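class MyThread(threading.Thread):
    """Thread subclass whose join() returns the target's return value."""

    _return = None  # stays None if no target was given or the target raised

    def run(self):
        # _target, _args and _kwargs are private attributes of threading.Thread
        if self._target is not None:
            self._return = self._target(*self._args, **self._kwargs)

    def join(self, timeout=None):
        super().join(timeout)
        return self._return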


def f():
    threads = []
    for n in range(10):
        threads.append(MyThread(target=benchmark, args=(n,)))
    for thread in threads:
        thread.start()
    for thread in threads:
        print(thread.join())


if __name__ == '__main__':
    from timeit import timeit

    print(timeit(f, number=1))

Disadvantages:

Getting return values is somewhat awkward
Data synchronization requires Lock or Queue

Lock

import time
import threading
from threading import Thread, Lock

lock = Lock()


class Account:
    def __init__(self, balance):
        self.balance = balance


def draw(account, amount):
    with lock:
        if account.balance >= amount:
            time.sleep(0.1)
            print(threading.current_thread().name, 'withdrawal succeeded')
            account.balance -= amount
            print(threading.current_thread().name, 'balance', account.balance)
        else:
            print(threading.current_thread().name, 'withdrawal failed: insufficient balance')


if __name__ == '__main__':
    account = Account(1000)
    ta = Thread(target=draw, args=(account, 800), name='ta')
    tb = Thread(target=draw, args=(account, 800), name='tb')
    ta.start()
    tb.start()
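The lock in the listing above makes the balance check and the deduction a single atomic step, so two withdrawals of 800 cannot both succeed against a balance of 1000. A variant sketch, with the lock owned by the account and the outcome returned rather than printed, might look like this; SafeAccount and run_withdrawals are hypothetical names introduced for illustration.

```python
import threading


class SafeAccount:
    """Hypothetical account that owns its own lock."""

    def __init__(self, balance):
        self.balance = balance
        self._lock = threading.Lock()

    def withdraw(self, amount):
        """Return True if the withdrawal succeeded, False otherwise."""
        with self._lock:  # check and update happen as one atomic step
            if self.balance >= amount:
                self.balance -= amount
                return True
            return False


def run_withdrawals(balance, amounts):
    """Withdraw each amount in its own thread; return (successes, final balance)."""
    account = SafeAccount(balance)
    outcomes = []

    def task(amount):
        # list.append is thread-safe in CPython
        outcomes.append(account.withdraw(amount))

    threads = [threading.Thread(target=task, args=(a,)) for a in amounts]
    for t in threads:
        t.start()
    for t in threads:
        t.join()  # wait for every withdrawal before reading the balance
    return sum(outcomes), account.balance


if __name__ == '__main__':
    print(run_withdrawals(1000, [800, 800]))
```

Exactly one of the two threads succeeds; which one depends on scheduling.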

Queue

import threading
from queue import Queue

from tool import benchmark


def f(queue):
    n = queue.get()
    print(benchmark(n))


if __name__ == '__main__':
    queue = Queue()
    for n in range(10):
        queue.put(n)

    for n in range(10):
        thread = threading.Thread(target=f, args=(queue,))
        thread.start()

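Written this way, the data is not synchronized: each thread prints its own return value when it finishes, so the output order is not deterministic and the results are never collected in one place.

Time: 26.44 s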
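One common way to collect results in a predictable order is to pair a task queue with a result queue and tag every task with its index. The following is a sketch under that assumption, not the original author's code; square is a hypothetical stand-in for benchmark, and None is used as a sentinel that tells a worker to stop.

```python
import threading
from queue import Queue


def square(n):
    """Hypothetical stand-in for benchmark(n)."""
    return n * n


def worker(tasks, results):
    """Pull (index, n) items until the None sentinel arrives."""
    while True:
        item = tasks.get()
        if item is None:
            tasks.task_done()
            break
        index, n = item
        results.put((index, square(n)))
        tasks.task_done()


def run(inputs, num_workers=4):
    """Process inputs with num_workers threads; return results in input order."""
    tasks, results = Queue(), Queue()
    threads = [threading.Thread(target=worker, args=(tasks, results))
               for _ in range(num_workers)]
    for t in threads:
        t.start()
    for index, n in enumerate(inputs):
        tasks.put((index, n))
    for _ in threads:
        tasks.put(None)           # one sentinel per worker
    for t in threads:
        t.join()
    # all workers have exited, so qsize() is exact here
    collected = [results.get() for _ in range(results.qsize())]
    return [value for _, value in sorted(collected)]


if __name__ == '__main__':
    print(run(range(10)))
```

A fixed number of workers also avoids starting one thread per task, which the listing above does.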

multiprocessing.dummy

from multiprocessing.dummy import Pool

from tool import benchmark


def f():
    n_list = [n for n in range(10)]
    pool = Pool(processes=8)
    results = pool.map(benchmark, n_list)
    pool.close()
    pool.join()
    print(results)


if __name__ == '__main__':
    from timeit import timeit

    print(timeit(f, number=1))
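multiprocessing.dummy offers the multiprocessing API backed by threads, so its Pool can also be used as a context manager, and starmap covers functions that take several arguments. A brief sketch, with scale and run as hypothetical names:

```python
from multiprocessing.dummy import Pool  # thread-backed Pool with the multiprocessing API


def scale(x, factor):
    """Hypothetical two-argument task."""
    return x * factor


def run():
    """Return the results of a one-argument map and a two-argument starmap."""
    with Pool(4) as pool:
        # map blocks until every result is ready and keeps input order
        squares = pool.map(lambda n: n * n, range(5))
        # starmap unpacks each tuple into positional arguments
        scaled = pool.starmap(scale, [(1, 10), (2, 10), (3, 10)])
    return squares, scaled


if __name__ == '__main__':
    print(run())
```

A lambda is acceptable here because the thread-backed pool does not need to pickle the function, unlike a process pool.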

Thread pool (recommended)

from concurrent.futures import ThreadPoolExecutor

from tool import benchmark


def f():
    n_list = [n for n in range(10)]
    with ThreadPoolExecutor() as executor:
        results = list(executor.map(benchmark, n_list))
        print(results)


if __name__ == '__main__':
    from timeit import timeit

    print(timeit(f, number=1))
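Since the title mentions CSV files, the same thread pool pattern can be sketched for reading several CSV files at once. The function names and the temporary files written by demo are assumptions made for illustration; real code would pass its own list of paths.

```python
import csv
import os
import tempfile
from concurrent.futures import ThreadPoolExecutor


def read_csv_rows(path):
    """Read one CSV file and return its rows as lists of strings."""
    with open(path, newline='', encoding='utf-8') as f:
        return list(csv.reader(f))


def read_all(paths, max_workers=4):
    """Read several CSV files concurrently; results keep the order of paths."""
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        return list(executor.map(read_csv_rows, paths))


def demo():
    """Write three small temporary CSV files, then read them back in threads."""
    with tempfile.TemporaryDirectory() as folder:
        paths = []
        for n in range(3):
            path = os.path.join(folder, '{}.csv'.format(n))
            with open(path, 'w', newline='', encoding='utf-8') as f:
                csv.writer(f).writerows([['id', 'value'], [n, n * 10]])
            paths.append(path)
        return read_all(paths)


if __name__ == '__main__':
    for rows in demo():
        print(rows)
```

Threads help most when the time is spent waiting on the disk or network; parsing very large files is CPU-bound and may gain little.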

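To pass several arguments, give executor.map one iterable per parameter. To supply only some of the parameters, wrap the call in a lambda, for example: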

import time

from concurrent.futures import ThreadPoolExecutor


def f(x=1, y=2):
    time.sleep(1)
    return x * y


x_list = [1, 2, 3]
y_list = [4, 5, 6]

with ThreadPoolExecutor() as executor:
    results = list(executor.map(f, x_list, y_list))
    print(results)  # [4, 10, 18]
    results = list(executor.map(lambda y: f(y=y), y_list))
    print(results)  # [4, 5, 6]
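functools.partial from the standard library is an alternative to the lambda wrapper when some arguments are fixed in advance. A short sketch reusing the same f, with the sleep omitted so it runs instantly; run is a hypothetical helper.

```python
from concurrent.futures import ThreadPoolExecutor
from functools import partial


def f(x=1, y=2):
    return x * y


def run():
    """Map f over lists while one argument is held fixed by partial."""
    x_list = [1, 2, 3]
    y_list = [4, 5, 6]
    with ThreadPoolExecutor() as executor:
        # fix x positionally: each call is f(10, y)
        fixed_x = list(executor.map(partial(f, 10), y_list))
        # fix y by keyword: each call is f(x, y=3)
        fixed_y = list(executor.map(partial(f, y=3), x_list))
    return fixed_x, fixed_y


if __name__ == '__main__':
    print(run())
```

Unlike a lambda, a partial object can be pickled when the wrapped function is importable, which matters if the code is later moved to a process pool.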

Progress bar

from concurrent.futures import ThreadPoolExecutor

from tool import benchmark


def f():
    n_list = [n for n in range(10)]
    with ThreadPoolExecutor() as executor:
        results = list(executor.map(benchmark, n_list))
        print(results)


if __name__ == '__main__':
    from timeit import timeit

    print(timeit(f, number=1))
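The listing above is the same as the thread pool example and reports no progress while it runs. Third-party packages such as tqdm are often used for this; as a dependency-free sketch, progress can also be reported with submit and as_completed from the standard library. The names work and run_with_progress are hypothetical, and work stands in for benchmark.

```python
import sys
import time
from concurrent.futures import ThreadPoolExecutor, as_completed


def work(n):
    """Hypothetical stand-in for benchmark(n)."""
    time.sleep(0.01 * n)
    return n * n


def run_with_progress(inputs, out=sys.stderr):
    """Run work on every input, printing a counter as tasks finish."""
    inputs = list(inputs)
    results = [None] * len(inputs)
    with ThreadPoolExecutor() as executor:
        # map each future back to the position of its input
        futures = {executor.submit(work, n): i for i, n in enumerate(inputs)}
        for done, future in enumerate(as_completed(futures), start=1):
            results[futures[future]] = future.result()
            print('\r{}/{} done'.format(done, len(inputs)),
                  end='', file=out, flush=True)
    print(file=out)  # finish the progress line
    return results


if __name__ == '__main__':
    print(run_with_progress(range(10)))
```

as_completed yields futures in completion order, so the counter advances as soon as any task finishes, while the index mapping keeps the returned list in input order.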
