Abstract:
Unix/Linux provides a fork() system call, which is quite special. An ordinary function is called once and returns once, but fork() is called once and returns twice: the operating system automatically copies the current process (the parent) into a new process (the child), and then returns in both the parent and the child. The child always gets a return value of 0, while the parent gets the child's PID. The reason is that a parent can fork many children, so it has to record each child's PID, whereas a child only needs to call getppid() to obtain its parent's PID.
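The behavior can be seen with a minimal sketch (it only runs on Unix/Linux and macOS, since os.fork() does not exist on Windows):
import os

print('Process (%s) start...' % os.getpid())
pid = os.fork()  # called once, returns twice
if pid == 0:
    # In the child process: fork() returned 0, getppid() gives the parent's ID
    print('I am child process (%s) and my parent is %s.' % (os.getpid(), os.getppid()))
else:
    # In the parent process: fork() returned the new child's ID
    print('I (%s) just created a child process (%s).' % (os.getpid(), pid))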
Contents:
- Review of the Previous Post
- Python Multiprocessing
- Multiprocessing Lock
- Multiprocessing Semaphore
- Multiprocessing Event
- Multiprocessing Queue and Pipe
- Multiprocessing Pool
- Python Multiprocessing Data-Comparison Test
Main Text:
1. Review of the Previous Post
1.1 Preface
The previous post, 《Python 多线程是多鸡肋》 ("Python Multithreading Is Mostly Useless"), found that multithreading did not achieve concurrency in any real sense, so this post reruns that data-comparison test with multiple processes and analyzes the results.
2. Python Multiprocessing
2.1 Explanation
Python implements multiprocessing through the multiprocessing module. If you plan to write a multi-process server program, Unix/Linux is undoubtedly the right choice. But Windows has no fork() call; does that mean multi-process Python programs cannot be written on Windows? Since Python is cross-platform, it should naturally provide cross-platform multi-process support as well, and the multiprocessing module is exactly that: a cross-platform multi-process module.
The multiprocessing module provides a Process class to represent a process object. The following example starts a child process and waits for it to finish:
# -*- coding:utf-8 -*-
from multiprocessing import Process
import os

# Code to run in the child process
def run_proc(name):
    print('Run child process %s (%s)...' % (name, os.getpid()))

if __name__ == '__main__':
    print('Parent process %s.' % os.getpid())
    p = Process(target=run_proc, args=('test',))
    print('Process will start.')
    p.start()
    p.join()
    print('Process end.')
Output:
Parent process 928.
Process will start.
Run child process test (929)...
Process end.
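Here p.start() launches the child process and p.join() blocks until the child has finished; join() is the usual way to synchronize the parent with its children.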
三. Multiprocessing Lock
当多个进程需要访问共享资源的时候,Lock可以用来避免访问的冲突。主要用到了lock.acquire() 和lock.release()
# -*- coding:utf-8 -*-
import multiprocessing

# Acquire the lock via the with statement
def worker_with(lock, f):
    with lock:
        fs = open(f, "a+")
        fs.write('Lock acquired via with\n')
        fs.close()

# Acquire and release the lock explicitly
def worker_no_with(lock, f):
    lock.acquire()
    try:
        fs = open(f, "a+")
        fs.write('Lock acquired directly\n')
        fs.close()
    finally:
        lock.release()
if __name__ == "__main__":
f = "file.txt"
lock = multiprocessing.Lock()
w = multiprocessing.Process(target=worker_with, args=(lock, f))
nw = multiprocessing.Process(target=worker_no_with, args=(lock, f))
w.start()
nw.start()
w.join()
nw.join()
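After both processes have finished, file.txt should contain both lines; their order depends on which process happens to acquire the lock first.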
4. Multiprocessing Semaphore
A Semaphore limits how many processes can access a shared resource at the same time, for example the maximum number of connections in a connection pool.
# -*- coding:utf-8 -*-
import multiprocessing
import time
# Each worker holds the semaphore while it sleeps for i seconds
def worker(s, i):
    s.acquire()
    print(multiprocessing.current_process().name + " acquire")
    time.sleep(i)
    print(multiprocessing.current_process().name + " release")
    s.release()
if __name__ == "__main__":
s = multiprocessing.Semaphore(2)
for i in range(5):
p = multiprocessing.Process(target=worker, args=(s,i*2))
p.start()
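Since the semaphore is created with a count of 2, at most two workers can hold it at the same time; a third "acquire" line only appears after one of the running workers prints "release".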
5. Multiprocessing Event
An Event is used for simple synchronization and signaling between processes: some processes wait for the event, and another process sets it.
# -*- coding:utf-8 -*-
import multiprocessing
import time
def wait_for_event(e):
    """Wait for the event to be set before doing anything"""
    print('wait_for_event: starting')
    e.wait()
    print('wait_for_event: e.is_set()->' + str(e.is_set()))

def wait_for_event_timeout(e, t):
    """Wait t seconds and then timeout"""
    print('wait_for_event_timeout: starting')
    e.wait(t)
    print('wait_for_event_timeout: e.is_set()->' + str(e.is_set()))
if __name__ == '__main__':
e = multiprocessing.Event()
w1 = multiprocessing.Process(name='block',
target=wait_for_event,
args=(e,))
w1.start()
w2 = multiprocessing.Process(name='non-block',
target=wait_for_event_timeout,
args=(e, 2))
w2.start()
time.sleep(3)
e.set()
    print('main: event is set')
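In this run, wait_for_event_timeout gives up after 2 seconds and prints e.is_set()->False, while wait_for_event stays blocked until e.set() is called at the 3-second mark and then prints e.is_set()->True.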
6. Multiprocessing Queue and Pipe
Python's multiprocessing module wraps the underlying OS mechanisms and offers several ways to exchange data between processes, including Queue and Pipe. The following example uses a Queue: one child process writes values into it while another reads them back out.
# -*- coding:utf-8 -*-
from multiprocessing import Process, Queue
import os, time, random

# Code run by the writer process:
def write(q):
    for value in ['A', 'B', 'C']:
        print('Put %s to queue...' % value)
        q.put(value)
        time.sleep(random.random())

# Code run by the reader process:
def read(q):
    while True:
        value = q.get(True)
        print('Get %s from queue.' % value)

if __name__ == '__main__':
    # The parent creates the Queue and passes it to both children:
    q = Queue()
    pw = Process(target=write, args=(q,))
    pr = Process(target=read, args=(q,))
    # Start the writer process pw:
    pw.start()
    # Start the reader process pr:
    pr.start()
    # Wait for pw to finish:
    pw.join()
    # pr runs an infinite loop and can never finish on its own; terminate it:
    pr.terminate()
Output:
Put A to queue...
Get A from queue.
Put B to queue...
Get B from queue.
Put C to queue...
Get C from queue.
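The example above only covers Queue. For completeness, here is a minimal Pipe sketch (not from the original post): Pipe() returns two connection objects representing the two ends of a duplex channel, one of which is handed to the child process:
# -*- coding:utf-8 -*-
from multiprocessing import Process, Pipe

# Send a few values through one end of the pipe, then close it
def sender(conn):
    for value in ['A', 'B', 'C']:
        conn.send(value)
    conn.close()

if __name__ == '__main__':
    # Pipe() returns the two ends of a duplex connection
    parent_conn, child_conn = Pipe()
    p = Process(target=sender, args=(child_conn,))
    p.start()
    for _ in range(3):
        print('Get %s from pipe.' % parent_conn.recv())
    p.join()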
7. Multiprocessing Pool
If you need to start a large number of child processes, you can create them in batches with a process pool:
# -*- coding:utf-8 -*-
from multiprocessing import Pool
import os, time, random

# A task that sleeps for a random amount of time
def long_time_task(name):
    print('Run task %s (%s)...' % (name, os.getpid()))
    start = time.time()
    time.sleep(random.random() * 3)
    end = time.time()
    print('Task %s runs %0.2f seconds.' % (name, (end - start)))

if __name__ == '__main__':
    print('Parent process %s.' % os.getpid())
    p = Pool()
    for i in range(5):
        p.apply_async(long_time_task, args=(i,))
    print('Waiting for all subprocesses done...')
    p.close()
    p.join()
    print('All subprocesses done.')
Output:
Parent process 669.
Waiting for all subprocesses done...
Run task 0 (671)...
Run task 1 (672)...
Run task 2 (673)...
Run task 3 (674)...
Task 2 runs 0.14 seconds.
Run task 4 (673)...
Task 1 runs 0.27 seconds.
Task 3 runs 0.86 seconds.
Task 0 runs 1.41 seconds.
Task 4 runs 1.91 seconds.
All subprocesses done.
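Note that Pool() defaults to one worker per CPU core (apparently four here, since only PIDs 671-674 appear and task 4 has to wait for a free worker), and close() must be called before join() so that no new tasks can be submitted.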
8. Python Multiprocessing Data-Comparison Test
Here the multi-threaded data-comparison method from the previous post is rewritten to compare the data with multiple processes:
# -*- coding:utf-8 -*-
import multiprocessing
import TestCase
import CommonVariable

def test_data(excel_index):
    pool = multiprocessing.Pool(processes=CommonVariable.multiprocess_number)
    result = []
    # Submit one comparison task per worker process
    for i in range(CommonVariable.multiprocess_number):
        result.append(pool.apply_async(
            TestCase.compare_data,
            (CommonVariable.result_excel[excel_index + i][0],
             CommonVariable.result_excel[excel_index + i][1])))
    pool.close()
    pool.join()
    # Collect failing comparisons; "all_pass" results are dropped
    test_result = ""
    for r in result:
        value = r.get()
        if value != "all_pass":
            test_result += value
    return test_result

if __name__ == '__main__':
    print(test_data(0))
Conclusion:
On Windows, the multi-process run took 6302.33 seconds, an improvement over the single-threaded 8023.14 seconds, but still far short of the goal. On Unix/Linux, the multiprocessing module wraps the fork() call, so we do not need to care about fork()'s details. Since Windows has no fork call, multiprocessing has to "simulate" the effect of fork: every Python object in the parent process must be serialized with pickle and passed to the child process. So if a multiprocessing call fails on Windows, first check whether pickling failed. Running the same test on a Mac took 3102.12 seconds, a substantial reduction.
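To illustrate the pickle point, here is a hypothetical minimal sketch (assuming Python 3, where get_context('spawn') selects the Windows-style start method even on other platforms): the target of a Process must be picklable, so a module-level function works but a lambda fails:
# -*- coding:utf-8 -*-
import multiprocessing

# A module-level function can be pickled by reference
def picklable_task(name):
    print('Hello from %s' % name)

if __name__ == '__main__':
    # Use the spawn start method, as on Windows
    ctx = multiprocessing.get_context('spawn')
    p = ctx.Process(target=picklable_task, args=('child',))
    p.start()  # works: the target is picklable
    p.join()
    try:
        bad = ctx.Process(target=lambda: None)
        bad.start()  # raises: a lambda cannot be pickled
    except Exception as e:
        print('Pickling failed: %r' % e)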