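A Classroom Note-Taking Assistant Based on PaddleSpeech and OpenCV
Project Overview
In short, this project is a course note-taking assistant built on PaddleSpeech and OpenCV. It converts a course video into a PDF of key frames and transcribes the speech into text.
The demo uses a lecture video by Andrew Ng. You can substitute your own video for testing and set the recognition language to Chinese or English to match the course.
To use it, fork the project, upload your own video, set the relevant paths and the video language (Chinese or English), and download the exported files once the run finishes.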
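Background
- 1. Lectures and training sessions are often recorded for people who could not attend or who want to review later. The recordings tend to be long, viewers cannot tell which segments matter, and notes taken in class cannot be matched automatically to the video content.
- 2. Existing video note-taking tools mostly address this by re-editing the video: the user adds notes at the corresponding segments. That editing is time-consuming and requires some video-editing skill.
This raises a question: can image processing and AI techniques extract the key information in a course video (key frames, speech, subtitles) and turn it into a classroom note-taking assistant?
That is the background and motivation for the project.
The answer is yes, and there are many ways to do it. Here a lecture video by Andrew Ng serves as the test and demonstration.
The original video runs a little over 7 minutes and produces the PDF file below.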
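Approach
The approach is easy to understand and implement. The simple flowchart below illustrates it.
To generate the PDF notes, a parameter sets the sampling interval: frames are sampled at that interval and each sampled frame is compared with the previous sampled frame.
If the difference between the two exceeds a threshold, the current frame is recorded; otherwise it is skipped and processing moves on to the next sampled frame.
Once all sampled frames have been visited, the recorded frames are saved in order as a PDF file, which serves as the course notes for the video.
For the speech, PaddleSpeech extracts and organizes the spoken text.
Audio segments are likewise cut out and recognized, and the results are collected into a course summary that annotates and complements the notes.
Note: in practice, speech recognition ran into problems with choosing between Chinese and English and with mixed Chinese-English speech,
so later versions redesigned the recognition function and matched the Chinese and English models accordingly.
The original design is largely implemented, but two areas still need work: machine translation and inference speed.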
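The frame-filtering idea described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the project's implementation: frames are plain lists of pixel intensities, and `relative_change`, `keep_changed_frames`, and the threshold value are hypothetical names and numbers chosen for the example.

```python
def relative_change(prev, curr):
    """Sum of absolute pixel differences, relative to the previous frame's total intensity."""
    total = sum(prev)
    if total == 0:
        return float("inf") if any(curr) else 0.0
    return sum(abs(c - p) for p, c in zip(prev, curr)) / total


def keep_changed_frames(frames, threshold):
    """Return the indices of frames that differ from the preceding frame by more than `threshold`."""
    kept = []
    for i in range(1, len(frames)):
        if relative_change(frames[i - 1], frames[i]) > threshold:
            kept.append(i)
    return kept


# Toy "video": slide A is shown for three samples, then slide B for two.
slide_a = [10, 10, 200, 200]
slide_b = [200, 200, 10, 10]
frames = [slide_a, slide_a, slide_a, slide_b, slide_b]
print(keep_changed_frames(frames, threshold=0.05))  # [3]
```

Only the sample where the slide switches is kept; repeated samples of the same slide fall below the threshold and are skipped.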
Video-to-PDF Processing
This part consists of the following steps:
Import the required libraries
import argparse
import os

import cv2
import numpy as np
from matplotlib import pyplot as plt
from PIL import Image
Define the combining function
def combine_imgs_pdf(folder_path, pdf_file_path):
    """
    Merge all images in a folder into a single PDF.
    Args:
        folder_path (str): source folder
        pdf_file_path (str): output path
    """
    def frame_index(name):
        # Sort by the number in the file name so frame_10 comes after frame_2.
        digits = "".join(ch for ch in name if ch.isdigit())
        return int(digits) if digits else 0

    files = [f for f in os.listdir(folder_path)
             if f.lower().endswith((".png", ".jpg"))]
    files.sort(key=frame_index)
    # Pillow cannot write RGBA images to PDF, so convert every page to RGB.
    pages = [Image.open(os.path.join(folder_path, f)).convert("RGB") for f in files]
    # Create the output directory if it does not exist yet.
    os.makedirs(os.path.dirname(pdf_file_path) or ".", exist_ok=True)
    pages[0].save(pdf_file_path, "pdf", save_all=True, append_images=pages[1:])
    print("convet success")
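def compare(a, b):
    # Relative change between two frames. The frames are uint8 arrays, so
    # a - b wraps around modulo 256 instead of going negative.
    return np.abs(np.sum(a - b)) / np.sum(b)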
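One caveat with `compare`: OpenCV returns frames as unsigned 8-bit arrays, and NumPy's unsigned subtraction wraps around modulo 256 rather than producing negative numbers. The sketch below emulates that behaviour with plain Python integers to show how the wrapped sum differs from a sum of absolute differences. `wrapped_diff_sum` and `abs_diff_sum` are hypothetical helpers for illustration only.

```python
def wrapped_diff_sum(a, b):
    """Emulate summing (a - b) for uint8 values: each difference wraps modulo 256."""
    return sum((x - y) % 256 for x, y in zip(a, b))


def abs_diff_sum(a, b):
    """Sum of absolute differences, computed with signed integers."""
    return sum(abs(x - y) for x, y in zip(a, b))


frame_a = [100, 100, 100]
frame_b = [101, 100, 99]   # one pixel brighter by 1, one darker by 1

print(wrapped_diff_sum(frame_a, frame_b))  # 256: -1 wraps to 255, +1 stays 1
print(abs_diff_sum(frame_a, frame_b))      # 2
```

With real arrays, one way to avoid the wrap-around would be to cast both frames to a signed type (for example `a.astype(np.int32)`) before subtracting. Doing so changes the scale of the metric, so the threshold would need retuning.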
Define the frame-extraction function
def get_frames(
    i_path,
    o_path,
    sample_interval_or_num_samples,
    is_n_sample=False,
    start=None,
    end=None,
    x_min=None,
    x_max=None,
    y_min=None,
    y_max=None,
    sensitivity=None,
):
    """
    This function is used to get the slides from the video.
    @param i_path: the input video path
    @param o_path: the output pics path
    @param sample_interval_or_num_samples: the sampling interval (seconds) or the number of pictures one expects to get
    (it is the upper bound because there will be a lot of them filtered out later, ~300 is a safe number)
    @param is_n_sample: whether sample_interval_or_num_samples is the number of samples or not, boolean, default is False
    @param start: the start second of the video, int, default is None, if None, it will be 0
    @param end: the end second of the video, int, default is None, if None, it will be the last second of the video
    @param x_min: the first row of the bounding box, int, default is None, if None, it will be set to 0
    @param x_max: the last row (exclusive) of the bounding box, int, default is None, if None, it will be set to the height of the frame
    @param y_min: the first column of the bounding box, int, default is None, if None, it will be set to 0
    @param y_max: the last column (exclusive) of the bounding box, int, default is None, if None, it will be set to the width of the frame
    @param sensitivity: the sensitivity used to filter repeated slides, float, default is None. If it's None, a safe sensitivity is chosen automatically, but it is usually too low, so you will probably need to filter pics manually later
    """
    vidcap = cv2.VideoCapture(i_path)
    success, image = vidcap.read()
    if not success:
        raise IOError("Cannot read video: {}".format(i_path))
    # Note: x indexes rows (image.shape[0]) and y indexes columns (image.shape[1]).
    if x_min is None:
        x_min = 0
    if y_min is None:
        y_min = 0
    if x_max is None:
        x_max = image.shape[0]
    if y_max is None:
        y_max = image.shape[1]
    frames = vidcap.get(cv2.CAP_PROP_FRAME_COUNT)
    if not os.path.exists(o_path):
        os.makedirs(o_path)
    fps = vidcap.get(cv2.CAP_PROP_FPS)
    if start is None:
        start = 0
    if end is None:
        end = frames / fps
    images = []
    if is_n_sample:
        # Spread the requested number of samples over the selected time range.
        sample_interval = (end - start) / sample_interval_or_num_samples
    else:
        sample_interval = sample_interval_or_num_samples
    step = max(1, int(sample_interval * fps))  # range() needs a non-zero step
    for i in range(int(start * fps), int(end * fps) + 1, step):
        vidcap.set(cv2.CAP_PROP_POS_FRAMES, i)
        success, image = vidcap.read()
        if not success:
            # Seeking to or past the last frame returns no image; stop sampling.
            break
        images.append(image[x_min:x_max, y_min:y_max, :])
    vidcap.release()
    if sensitivity is None:
        diff = [compare(images[i], images[i + 1]) for i in range(len(images) - 1)]
        # Use the left edge of the most populated histogram bin as the threshold.
        c = plt.hist(diff, bins=30)
        sensitivity = c[1][np.argmax(c[0])]
        print("Find sensitivity: ", sensitivity)
    # Keep a frame when it differs enough from the previous sampled frame.
    # images[0] is compared with itself, so the first sample is never kept.
    old = images[0]
    filtered_images = []
    for new in images:
        if np.abs(np.sum(new - old)) > (sensitivity + 0.001) * np.sum(old):
            filtered_images.append(new)
        old = new
    # save
    for i, frame in enumerate(filtered_images):
        cv2.imwrite(os.path.join(o_path, "frame_{}.jpg".format(i)), frame)
    return images, filtered_images
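To make the sampling loop in `get_frames` concrete, the sketch below computes which frame indices are visited for a given frame rate and interval. It is an illustration only: `sample_frame_indices` is a hypothetical helper, and the 30 fps, 13500-frame clip is an invented example rather than the properties of the demo video.

```python
def sample_frame_indices(fps, n_frames, interval_s, start_s=0.0, end_s=None):
    """Frame indices visited when sampling every `interval_s` seconds."""
    if end_s is None:
        end_s = n_frames / fps
    step = max(1, int(interval_s * fps))          # never step by 0 frames
    last = min(int(end_s * fps), n_frames - 1)    # stay inside the video
    return list(range(int(start_s * fps), last + 1, step))


# A 7.5-minute clip at 30 fps, sampled every 10 seconds.
indices = sample_frame_indices(fps=30, n_frames=13500, interval_s=10)
print(len(indices))   # 45
print(indices[:3])    # [0, 300, 600]
```

Each returned index is what would be passed to `vidcap.set(cv2.CAP_PROP_POS_FRAMES, i)` before reading a frame. Clamping the last index to `n_frames - 1` avoids seeking past the end of the video.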
Run the program
i_path = "/home/aistudio/work/What_is_a_Neural_Network.mp4"
o_path = "/home/aistudio/work/output/"
sample_interval_or_num_samples = 10  # sample one frame every 10 seconds
is_n_sample = False  # the value above is an interval, not a sample count
sensitivity = None  # choose the threshold automatically
pdf_file = "/home/aistudio/work/pdf/course.pdf"
get_frames(i_path,
           o_path,
           sample_interval_or_num_samples,
           is_n_sample=is_n_sample,
           start=None,
           end=None,
           x_min=None,
           x_max=None,
           y_min=None,
           y_max=None,
           sensitivity=sensitivity)
combine_imgs_pdf(o_path, pdf_file)
Find sensitivity: 0.0033964624095421756
convet success
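Running the code above converts the Andrew Ng lecture video into a PDF stored at /home/aistudio/work/pdf/course.pdf.
You can download and preview it. To process another video, fork the project, upload the video, and update the path.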
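The `Find sensitivity` value in the output comes from a histogram of the differences between consecutive sampled frames: `get_frames` takes the left edge of the most populated bin as the threshold. Presumably the reasoning is that most sampled pairs show the same slide, so the most common difference represents "no real change". The sketch below reproduces that selection in plain Python; `most_common_bin_edge` is a hypothetical helper, the difference values are invented, and the binning only approximates what `plt.hist` does.

```python
def most_common_bin_edge(values, bins=30):
    """Left edge of the most populated equal-width histogram bin."""
    lo, hi = min(values), max(values)
    if lo == hi:
        return lo
    width = (hi - lo) / bins
    counts = [0] * bins
    for v in values:
        # The maximum value belongs to the last bin.
        idx = min(int((v - lo) / width), bins - 1)
        counts[idx] += 1
    return lo + counts.index(max(counts)) * width


# Mostly tiny changes (same slide) plus a few large ones (slide switches).
diffs = [0.003, 0.004, 0.003, 0.005, 0.004, 0.30, 0.45, 0.003]
print(most_common_bin_edge(diffs, bins=3))  # 0.003
```

Because the threshold sits at the low end of the "no change" cluster, it tends to let near-duplicate frames through, which matches the docstring's warning that the automatic value is usually too low.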
Audio-to-Text Processing
Install dependencies
!pip install paddlespeech -i https://pypi.tuna.tsinghua.edu.cn/simple
!pip install ffmpy
!pip install moviepy
Extract audio from the video
import os
import moviepy.editor as mp
# Use a 16 kHz sample rate to match PaddleSpeech
def extract_audio(videos_file_path):
    my_clip = mp.VideoFileClip(videos_file_path, audio_fps=16000)
    # splitext handles ".mp4" and ".MP4" alike
    p = os.path.splitext(videos_file_path)[0]
    new_path = p + '_video.wav'
    my_clip.audio.write_audiofile(new_path, fps=16000)
    return new_path
import warnings
warnings.filterwarnings("ignore")
path = extract_audio('/home/aistudio/work/course.mp4')
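The output file name in `extract_audio` is derived from the video path, and the extension's case matters when splitting on a literal string. The sketch below shows a case-insensitive way to derive it with `os.path.splitext`, next to what splitting on `'.MP4'` yields for a lowercase extension. `wav_path_for` and the `/data/...` paths are hypothetical, used only for illustration.

```python
import os


def wav_path_for(video_path, suffix="_video.wav"):
    """Replace the video extension, whatever its case, with `suffix`."""
    stem, _ext = os.path.splitext(video_path)
    return stem + suffix


print(wav_path_for("/data/course.mp4"))   # /data/course_video.wav
print(wav_path_for("/data/course.MP4"))   # /data/course_video.wav
# Splitting on the upper-case extension leaves a lower-case one in place:
print("/data/course.mp4".split(".MP4")[0] + "_video.wav")  # /data/course.mp4_video.wav
```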
Process the audio file
Define the audio slicing function
!pip install pydub
import re
import os
from pydub import AudioSegment
def get_second_part_wav(main_wav_path, start_time, end_time, part_wav_path):
    """
    Slice an audio file and export one part; times are in seconds.
    :param main_wav_path: path of the source audio file
    :param start_time: start of the slice
    :param end_time: end of the slice
    :param part_wav_path: path of the sliced audio file
    :return:
    """
    start_time = int(start_time * 1000)  # pydub slices in milliseconds
    end_time = int(end_time * 1000)
    sound = AudioSegment.from_wav(main_wav_path)  # the input is a WAV file
    word = sound[start_time:end_time]
    word.export(part_wav_path, format="wav")
wav_path = "/home/aistudio/work/course.wav"  # audio to slice
part_path = "/home/aistudio/work/course_100.wav"  # sliced audio
s = 0  # slice start (seconds)
e = 100  # slice end (seconds)
# Slice the audio
get_second_part_wav(wav_path, s, e, part_path)
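For uncompressed PCM WAV files, the same slice can also be taken with Python's standard `wave` module, which avoids the pydub dependency. This is an alternative sketch, not what the project uses: `slice_wav` is a hypothetical helper, and it assumes the input is a plain PCM WAV file.

```python
import wave


def slice_wav(src_path, start_s, end_s, dst_path):
    """Copy the audio between start_s and end_s (seconds) into a new WAV file."""
    with wave.open(src_path, "rb") as src:
        rate = src.getframerate()
        start_frame = int(start_s * rate)
        n_frames = max(0, int(end_s * rate) - start_frame)
        src.setpos(min(start_frame, src.getnframes()))
        frames = src.readframes(n_frames)   # returns fewer frames at end of file
        params = src.getparams()
    with wave.open(dst_path, "wb") as dst:
        dst.setparams(params)      # same channels, sample width and rate
        dst.writeframes(frames)    # header frame count is corrected on close
```

A call such as `slice_wav(wav_path, 0, 100, part_path)` would mirror the pydub example above, with times again given in seconds.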
Speech recognition check
The speech recognition model is defined and called below for Chinese and for English.
# Run speech recognition
import warnings
import paddle
warnings.filterwarnings("ignore")
model_list = ['conformer_wenetspeech-zh-16k', 'conformer_online_wenetspeech-zh-16k',
              'conformer_u2pp_online_wenetspeech-zh-16k', 'conformer_online_multicn-zh-16k',
              'conformer_aishell-zh-16k', 'conformer_online_aishell-zh-16k',
              'transformer_librispeech-en-16k', 'deepspeech2online_wenetspeech-zh-16k',
              'deepspeech2offline_aishell-zh-16k', 'deepspeech2online_aishell-zh-16k', 'deepspeech2offline_librispeech-en-16k']
from paddlespeech.cli.asr.infer import ASRExecutor
print('Loading model...')
asr_executor = ASRExecutor()
def asr_model(filePath, lang_opt='zh'):
    if lang_opt == 'zh':
        return asr_executor(
            lang=lang_opt,  # zh/en
            sample_rate=16000,
            config=None,  # Set `config` and `ckpt_path` to None to use pretrained model.
            ckpt_path=None,
            audio_file=filePath,
            force_yes=True,
            device=paddle.get_device())
    elif lang_opt == 'en':
        modelName = 'transformer_librispeech'
        return asr_executor(
            model=modelName,
            lang=lang_opt,  # zh/en
            sample_rate=16000,
            config=None,  # Set `config` and `ckpt_path` to None to use pretrained model.
            ckpt_path=None,
            audio_file=filePath,
            force_yes=True,
            device=paddle.get_device())
    else:
        raise ValueError("lang_opt must be 'zh' or 'en'")
Loading model...
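The two branches of `asr_model` differ only in whether a `model` name is passed. One way to keep the per-language choice in a single place is a small lookup table, sketched below under stated assumptions: `ASR_SETTINGS` and `build_asr_kwargs` are hypothetical names, and `None` stands for "leave the `model` argument out", as the Chinese branch does.

```python
# Per-language settings. "model": None means "use the executor's default model",
# mirroring how the Chinese branch omits the `model` argument.
ASR_SETTINGS = {
    "zh": {"model": None},
    "en": {"model": "transformer_librispeech"},
}


def build_asr_kwargs(audio_file, lang="zh", sample_rate=16000):
    """Assemble keyword arguments for a recognizer call; reject unknown languages."""
    if lang not in ASR_SETTINGS:
        raise ValueError(f"unsupported language: {lang!r}")
    kwargs = {"audio_file": audio_file, "lang": lang, "sample_rate": sample_rate}
    model = ASR_SETTINGS[lang]["model"]
    if model is not None:
        kwargs["model"] = model
    return kwargs


print(build_asr_kwargs("clip.wav", lang="en"))
```

The resulting dictionary could be expanded into the `asr_executor(...)` call together with the remaining arguments (`config`, `ckpt_path`, `force_yes`, `device`), so adding a language means adding one table entry rather than another branch.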
Recognize the speech as English
# Run recognition
wav_file = "/home/aistudio/work/course_c20s.wav"
asr_model(wav_file,lang_opt='en')
[2022-12-21 23:04:23,162] [ WARNING] - The sample rate of the input file is not 16000.
The program will resample the wav file to 16000.
If the result does not meet your expectations,
Please input the 16k 16 bit 1 channel wav file.
"the term deep learning refers to training mirinetwork sometimes very large near in networks so what exactly is in yoter in thisviial lest each other give you some the ba contuitions less not to a halussing price prediction example let' say you have a dayta said with six houses so you know the siz"
Recognize the speech as Chinese
# Run recognition
wav_file = "/home/aistudio/work/course_c20s.wav"
asr_model(wav_file,lang_opt='zh')
2022-12-21 23:05:09.076 | INFO | paddlespeech.s2t.modules.embedding:__init__:150 - max len: 5000
[2022-12-21 23:05:10,578] [ WARNING] - The sample rate of the input file is not 16000.
The program will resample the wav file to 16000.
If the result does not meet your expectations,
Please input the 16k 16 bit 1 channel wav file.
W1221 23:05:11.369760 289 kernel_factory.cc:130] The cudnn kernel for [depthwise_conv2d] is not registered.
'这枪击责任以我的人还可以认着尔有什么的贝斯跟极为深似莱斯达特塞还会逮着三以的'
Batch audio processing and recognition
Define the batch audio splitting function
!pip install auditok
import csv
import os
import warnings

import auditok
import librosa
import moviepy.editor as mp
import paddle
import soundfile
warnings.filterwarnings('ignore')
# The input type is audio
def qiefen(path, ty='audio', mmin_dur=1, mmax_dur=100000, mmax_silence=1, menergy_threshold=55):
    audio_file = path
    audio_regions = auditok.split(
        audio_file,
        min_dur=mmin_dur,  # minimum duration of a valid audio event in seconds
        max_dur=mmax_dur,  # maximum duration of an event
        # maximum duration of tolerated continuous silence within an event
        max_silence=mmax_silence,
        energy_threshold=menergy_threshold  # threshold of detection
    )
    # Output folder: change/<ty>/<input file name without extension>
    file_pre = os.path.splitext(os.path.basename(audio_file))[0]
    out_dir = 'change' + '/' + ty + '/' + file_pre
    os.makedirs(out_dir, exist_ok=True)
    for i, r in enumerate(audio_regions):
        # Regions returned by `split` have 'start' and 'end' metadata fields
        print(
            "Region {i}: {r.meta.start:.3f}s -- {r.meta.end:.3f}s".format(i=i, r=r))
        # Zero-pad the index to three digits so the files sort in order
        s = '000000' + str(i)
        file_save = out_dir + '/' + \
            s[-3:] + '-' + '{meta.start:.3f}-{meta.end:.3f}' + '.wav'
        filename = r.save(file_save)
        print("region saved as: {}".format(filename))
    return out_dir
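The parameters passed to `auditok.split` describe an energy-based segmentation: loud stretches become regions, short pauses are tolerated inside a region, and very short regions are discarded. The sketch below is a deliberately simplified illustration of that idea on a list of per-window energies. It is not auditok's algorithm, and `split_on_silence` and the sample values are hypothetical.

```python
def split_on_silence(energies, window_s, threshold, min_dur, max_silence):
    """Group consecutive loud windows into (start_s, end_s) regions.

    A region ends once more than `max_silence` seconds of quiet windows
    follow it; regions shorter than `min_dur` seconds are dropped.
    """
    regions = []
    start = None        # index of the first window of the current region
    last_loud = None    # index of the most recent loud window
    for i, energy in enumerate(energies):
        if energy >= threshold:
            if start is None:
                start = i
            last_loud = i
        elif start is not None and (i - last_loud) * window_s > max_silence:
            regions.append((start, last_loud + 1))
            start = None
    if start is not None:
        regions.append((start, last_loud + 1))
    return [(s * window_s, e * window_s) for s, e in regions
            if (e - s) * window_s >= min_dur]


# Ten half-second windows: speech with one short pause, a long pause, then a blip.
energies = [10, 60, 62, 10, 61, 10, 10, 10, 63, 10]
print(split_on_silence(energies, window_s=0.5, threshold=55,
                       min_dur=1.0, max_silence=0.5))  # [(0.5, 2.5)]
```

The half-second pause at window 3 stays inside the first region, while the isolated blip at window 8 is dropped for being shorter than `min_dur`. Lowering `max_silence` or `min_dur`, as the batch call does with 0.5, yields more and shorter segments.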
Define the audio-to-text routine
# Speech to text
asr_executor = ASRExecutor()
def audio2txt(path, lang_opt='en'):
    # List all files under path
    print(f"path: {path}")
    filelist = os.listdir(path)
    # Read the files in order of their three-digit index prefix
    filelist.sort(key=lambda x: int(os.path.splitext(x)[0][:3]))
    # Recognize each file and collect the text
    words = []
    for file in filelist:
        print(path + '/' + file)
        text = asr_model(path + '/' + file, lang_opt=lang_opt)
        words.append(text)
    return words
# Save
import csv
def txt2csv(txt_all):
    # newline='' stops the csv module from writing blank lines on Windows
    with open('result.csv', 'w', encoding='utf-8', newline='') as f:
        f_csv = csv.writer(f)
        for row in txt_all:
            f_csv.writerow([row])
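The segment file names written by `qiefen` encode the index and the start and end times, so the transcript could carry timestamps that link each line back to the video. The sketch below parses that naming pattern and writes timestamped rows to an in-memory CSV. `parse_region_name` and `rows_with_times` are hypothetical helpers, and the file names and transcripts are invented for the example.

```python
import csv
import io
import os


def parse_region_name(filename):
    """Split '003-12.500-17.250.wav' into (3, 12.5, 17.25)."""
    stem = os.path.splitext(filename)[0]
    index, start, end = stem.split("-")
    return int(index), float(start), float(end)


def rows_with_times(filenames, texts):
    """Pair each transcript with the index and time span encoded in its file name."""
    return [parse_region_name(name) + (text,) for name, text in zip(filenames, texts)]


# Hypothetical file names and transcripts, written to an in-memory CSV.
names = ["000-0.000-4.200.wav", "001-4.900-9.750.wav"]
texts = ["first segment", "second segment"]
buffer = io.StringIO()
writer = csv.writer(buffer, lineterminator="\n")
writer.writerow(["index", "start_s", "end_s", "text"])
writer.writerows(rows_with_times(names, texts))
print(buffer.getvalue())
```

Writing to a real file instead of `io.StringIO` would only require replacing the buffer with `open('result.csv', 'w', encoding='utf-8', newline='')`.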
Batch speech-to-text
import warnings
warnings.filterwarnings('ignore')
# Replace with your own recording
source_path = '/home/aistudio/work/course.wav'
# Split the audio
path = qiefen(path=source_path, ty='audio',
              mmin_dur=0.5, mmax_dur=100000, mmax_silence=0.5, menergy_threshold=55)
# Speech to text (requires a GPU)
txt_all = audio2txt(path)
# Write to CSV
txt2csv(txt_all)
The generated audio files are shown below:
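Summary
Using a lecture video by Andrew Ng as the example, this project processes the video and the audio separately
with OpenCV, PaddleSpeech, and a few other libraries,
producing course notes saved as a PDF and a speech transcript saved as a CSV.
Measured against expectations, the current video processing adds redundant intermediate frames when the lecturer writes by hand, because consecutive frames keep changing,
and English speech recognition is not very accurate. A dedicated model may need to be trained, or separate English recognition and English-to-Chinese translation models used for inference.
The CSV of English transcripts could also be machine-translated, but only once the recognized text is reasonably good.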