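My project needs to insert comments into docx files automatically, preferably with Python. Python has the python-docx library, but so far it does not support inserting comments. Someone did raise the question in the python-docx project, though, and the author, scanny, gave some guidance there.
The general idea: unzipping a docx file yields a number of files and folders. Comparing the unzipped contents of a document with and without a comment shows that inserting the first comment adds a new file, word/comments.xml, and modifies word/_rels/document.xml.rels and word/document.xml. Inserting further comments only modifies word/comments.xml and word/document.xml. So once the change patterns in document.xml.rels, comments.xml, and document.xml are understood, comment insertion can be automated.
You can try this yourself: rename a docx file to .zip, unzip it, edit the files inside by hand, zip them back up, and rename the result back to .docx.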
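The rename, unzip, edit, re-zip round trip can also be done in memory with the standard zipfile module. The following is only a minimal sketch of that idea; replace_member and read_member are hypothetical helper names of mine, not part of python-docx or of the script further down.

```python
import io
from zipfile import ZIP_DEFLATED, ZipFile


def read_member(docx_bytes, member):
    """Return the raw bytes of one file inside a docx (zip) archive."""
    with ZipFile(io.BytesIO(docx_bytes)) as src:
        return src.read(member)


def replace_member(docx_bytes, member, new_data):
    """Return a copy of the archive in which `member` holds `new_data`."""
    out = io.BytesIO()
    with ZipFile(io.BytesIO(docx_bytes)) as src, \
            ZipFile(out, "w", ZIP_DEFLATED) as dst:
        for info in src.infolist():
            # Copy every member unchanged except the one being replaced.
            if info.filename == member:
                dst.writestr(info.filename, new_data)
            else:
                dst.writestr(info.filename, src.read(info.filename))
    return out.getvalue()
```

For a real file, the input would be the bytes read from the .docx with open(path, 'rb'), and the returned bytes would be written to a new .docx.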
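[Figure: unzipped file structure of a docx without comments]
[Figure: unzipped file structure of a docx with a comment inserted]
The most obvious difference is the new word/comments.xml file. Beyond that, the contents of word/_rels/document.xml.rels and word/document.xml also change.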
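The same comparison can be scripted instead of done by eye. This is a rough sketch using only the standard library; diff_members is a hypothetical name, and it simply compares raw bytes per archive member. Depending on which application saved the file, other parts (for example [Content_Types].xml) may show up as changed too, so it is worth looking at the full output for your own documents.

```python
import io
from zipfile import ZipFile


def diff_members(before_bytes, after_bytes):
    """Return (added, removed, changed) member names between two docx archives."""
    with ZipFile(io.BytesIO(before_bytes)) as a, \
            ZipFile(io.BytesIO(after_bytes)) as b:
        before = {name: a.read(name) for name in a.namelist()}
        after = {name: b.read(name) for name in b.namelist()}
    added = sorted(set(after) - set(before))
    removed = sorted(set(before) - set(after))
    # A member counts as changed when it exists in both but its bytes differ.
    changed = sorted(n for n in set(before) & set(after) if before[n] != after[n])
    return added, removed, changed
```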
First, compare the contents of word/_rels/document.xml.rels.
Before inserting a comment:
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<Relationships xmlns="http://schemas.openxmlformats.org/package/2006/relationships"><Relationship Id="rId5" Type="http://schemas.openxmlformats.org/officeDocument/2006/relationships/fontTable" Target="fontTable.xml"/><Relationship Id="rId4" Type="http://schemas.openxmlformats.org/officeDocument/2006/relationships/customXml" Target="../customXml/item1.xml"/><Relationship Id="rId3" Type="http://schemas.openxmlformats.org/officeDocument/2006/relationships/theme" Target="theme/theme1.xml"/><Relationship Id="rId2" Type="http://schemas.openxmlformats.org/officeDocument/2006/relationships/settings" Target="settings.xml"/><Relationship Id="rId1" Type="http://schemas.openxmlformats.org/officeDocument/2006/relationships/styles" Target="styles.xml"/></Relationships>
After inserting a comment (a Relationship of type comments pointing at comments.xml has been added, and the other ids were renumbered):
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<Relationships xmlns="http://schemas.openxmlformats.org/package/2006/relationships"><Relationship Id="rId6" Type="http://schemas.openxmlformats.org/officeDocument/2006/relationships/fontTable" Target="fontTable.xml"/><Relationship Id="rId5" Type="http://schemas.openxmlformats.org/officeDocument/2006/relationships/customXml" Target="../customXml/item1.xml"/><Relationship Id="rId4" Type="http://schemas.openxmlformats.org/officeDocument/2006/relationships/theme" Target="theme/theme1.xml"/><Relationship Id="rId3" Type="http://schemas.openxmlformats.org/officeDocument/2006/relationships/comments" Target="comments.xml"/><Relationship Id="rId2" Type="http://schemas.openxmlformats.org/officeDocument/2006/relationships/settings" Target="settings.xml"/><Relationship Id="rId1" Type="http://schemas.openxmlformats.org/officeDocument/2006/relationships/styles" Target="styles.xml"/></Relationships>
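The rels change boils down to "add one Relationship of type comments if none exists yet". Here is a hedged sketch of that step with xml.etree.ElementTree instead of string slicing. ensure_comments_relationship is a hypothetical helper, and choosing the highest existing rId plus one is my own assumption; the editor that produced the sample above renumbered the existing ids instead.

```python
import re
import xml.etree.ElementTree as ET

REL_NS = "http://schemas.openxmlformats.org/package/2006/relationships"
COMMENTS_TYPE = ("http://schemas.openxmlformats.org/officeDocument/2006/"
                 "relationships/comments")
XML_DECL = '<?xml version="1.0" encoding="UTF-8" standalone="yes"?>\n'


def ensure_comments_relationship(rels_xml):
    """Return document.xml.rels as a str that contains a comments relationship."""
    ET.register_namespace("", REL_NS)  # serialize without an ns0: prefix
    root = ET.fromstring(rels_xml)
    rels = root.findall("{%s}Relationship" % REL_NS)
    if not any(rel.get("Type") == COMMENTS_TYPE for rel in rels):
        used = []
        for rel in rels:
            match = re.fullmatch(r"rId(\d+)", rel.get("Id", ""))
            if match:
                used.append(int(match.group(1)))
        # Assumption: any unused id is acceptable, so take max + 1.
        new_id = "rId%d" % (max(used, default=0) + 1)
        ET.SubElement(root, "{%s}Relationship" % REL_NS,
                      {"Id": new_id, "Type": COMMENTS_TYPE, "Target": "comments.xml"})
    return XML_DECL + ET.tostring(root, encoding="unicode")
```

Calling it a second time on its own output leaves the relationships unchanged, which makes it safe to run for every inserted comment.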
Next, compare the contents of word/document.xml.
Before inserting:
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<w:document xmlns:wpc="http://schemas.microsoft.com/office/word/2010/wordprocessingCanvas" xmlns:mc="http://schemas.openxmlformats.org/markup-compatibility/2006" xmlns:o="urn:schemas-microsoft-com:office:office" xmlns:r="http://schemas.openxmlformats.org/officeDocument/2006/relationships" xmlns:m="http://schemas.openxmlformats.org/officeDocument/2006/math" xmlns:v="urn:schemas-microsoft-com:vml" xmlns:wp14="http://schemas.microsoft.com/office/word/2010/wordprocessingDrawing" xmlns:wp="http://schemas.openxmlformats.org/drawingml/2006/wordprocessingDrawing" xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main" xmlns:w14="http://schemas.microsoft.com/office/word/2010/wordml" xmlns:w10="urn:schemas-microsoft-com:office:word" xmlns:w15="http://schemas.microsoft.com/office/word/2012/wordml" xmlns:wpg="http://schemas.microsoft.com/office/word/2010/wordprocessingGroup" xmlns:wpi="http://schemas.microsoft.com/office/word/2010/wordprocessingInk" xmlns:wne="http://schemas.microsoft.com/office/word/2006/wordml" xmlns:wps="http://schemas.microsoft.com/office/word/2010/wordprocessingShape" xmlns:wpsCustomData="http://www.wps.cn/officeDocument/2013/wpsCustomData" mc:Ignorable="w14 w15 wp14"><w:body><w:p><w:r><w:t>这是一段文本,等待插入批注</w:t></w:r><w:bookmarkStart w:id="0" w:name="_GoBack"/><w:bookmarkEnd w:id="0"/></w:p><w:sectPr><w:pgSz w:w="11906" w:h="16838"/><w:pgMar w:top="1440" w:right="1800" w:bottom="1440" w:left="1800" w:header="851" w:footer="992" w:gutter="0"/><w:cols w:space="425" w:num="1"/><w:docGrid w:type="lines" w:linePitch="312" w:charSpace="0"/></w:sectPr></w:body></w:document>
After inserting (the original run is split in three, and the annotated text is wrapped in w:commentRangeStart / w:commentRangeEnd followed by a run holding w:commentReference, all with the same w:id):
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<w:document xmlns:wpc="http://schemas.microsoft.com/office/word/2010/wordprocessingCanvas" xmlns:mc="http://schemas.openxmlformats.org/markup-compatibility/2006" xmlns:o="urn:schemas-microsoft-com:office:office" xmlns:r="http://schemas.openxmlformats.org/officeDocument/2006/relationships" xmlns:m="http://schemas.openxmlformats.org/officeDocument/2006/math" xmlns:v="urn:schemas-microsoft-com:vml" xmlns:wp14="http://schemas.microsoft.com/office/word/2010/wordprocessingDrawing" xmlns:wp="http://schemas.openxmlformats.org/drawingml/2006/wordprocessingDrawing" xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main" xmlns:w14="http://schemas.microsoft.com/office/word/2010/wordml" xmlns:w10="urn:schemas-microsoft-com:office:word" xmlns:w15="http://schemas.microsoft.com/office/word/2012/wordml" xmlns:wpg="http://schemas.microsoft.com/office/word/2010/wordprocessingGroup" xmlns:wpi="http://schemas.microsoft.com/office/word/2010/wordprocessingInk" xmlns:wne="http://schemas.microsoft.com/office/word/2006/wordml" xmlns:wps="http://schemas.microsoft.com/office/word/2010/wordprocessingShape" xmlns:wpsCustomData="http://www.wps.cn/officeDocument/2013/wpsCustomData" mc:Ignorable="w14 w15 wp14"><w:body><w:p><w:r><w:t>这是一段</w:t></w:r><w:commentRangeStart w:id="0"/><w:r><w:t>文本</w:t></w:r><w:commentRangeEnd w:id="0"/><w:r><w:commentReference w:id="0"/></w:r><w:r><w:t>,等待插入批注</w:t></w:r><w:bookmarkStart w:id="0" w:name="_GoBack"/><w:bookmarkEnd w:id="0"/></w:p><w:sectPr><w:pgSz w:w="11906" w:h="16838"/><w:pgMar w:top="1440" w:right="1800" w:bottom="1440" w:left="1800" w:header="851" w:footer="992" w:gutter="0"/><w:cols w:space="425" w:num="1"/><w:docGrid w:type="lines" w:linePitch="312" w:charSpace="0"/></w:sectPr></w:body></w:document>
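To make the document.xml pattern concrete, here is a sketch that performs the same run split on a parsed paragraph element rather than on the raw string. It is not the approach used by the script below. anchor_comment and w are hypothetical helpers, and the sketch assumes the target text sits entirely inside a single w:r/w:t, which real documents do not guarantee. Note also that ElementTree drops namespace declarations that no element uses when serializing, which can matter for files whose mc:Ignorable attribute refers to those prefixes, so treat this as an illustration of the structure only.

```python
import copy
import xml.etree.ElementTree as ET

W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
XML_SPACE = "{http://www.w3.org/XML/1998/namespace}space"
ET.register_namespace("w", W_NS)


def w(tag):
    """Qualify a tag or attribute name with the WordprocessingML namespace."""
    return "{%s}%s" % (W_NS, tag)


def anchor_comment(paragraph, target, comment_id):
    """Wrap the first occurrence of `target` found inside one run of `paragraph`.

    Returns True if the markers were inserted, False if no run contains the text.
    """
    if not target:
        return False
    cid = str(comment_id)
    for index, run in enumerate(list(paragraph)):
        text_el = run.find(w("t")) if run.tag == w("r") else None
        if text_el is None or target not in (text_el.text or ""):
            continue
        before, after = text_el.text.split(target, 1)

        def piece(text):
            # Copy the run so its properties (w:rPr) are kept, then swap the text.
            new_run = copy.deepcopy(run)
            t = new_run.find(w("t"))
            t.text = text
            t.set(XML_SPACE, "preserve")  # keep leading/trailing spaces
            return new_run

        ref_run = ET.Element(w("r"))
        ET.SubElement(ref_run, w("commentReference"), {w("id"): cid})
        nodes = []
        if before:
            nodes.append(piece(before))
        nodes.append(ET.Element(w("commentRangeStart"), {w("id"): cid}))
        nodes.append(piece(target))
        nodes.append(ET.Element(w("commentRangeEnd"), {w("id"): cid}))
        nodes.append(ref_run)
        if after:
            nodes.append(piece(after))

        # Swap the original run for the new sequence at the same position.
        paragraph.remove(run)
        for offset, node in enumerate(nodes):
            paragraph.insert(index + offset, node)
        return True
    return False
```

Because the markers are separate elements rather than part of the text, running this again on one of the resulting runs with a different id would nest or chain comment ranges instead of corrupting the markup.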
Finally, compare word/comments.xml with one comment and with two comments.
With one comment:
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<w:comments xmlns:wpc="http://schemas.microsoft.com/office/word/2010/wordprocessingCanvas" xmlns:mc="http://schemas.openxmlformats.org/markup-compatibility/2006" xmlns:o="urn:schemas-microsoft-com:office:office" xmlns:r="http://schemas.openxmlformats.org/officeDocument/2006/relationships" xmlns:m="http://schemas.openxmlformats.org/officeDocument/2006/math" xmlns:v="urn:schemas-microsoft-com:vml" xmlns:wp14="http://schemas.microsoft.com/office/word/2010/wordprocessingDrawing" xmlns:wp="http://schemas.openxmlformats.org/drawingml/2006/wordprocessingDrawing" xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main" xmlns:w14="http://schemas.microsoft.com/office/word/2010/wordml" xmlns:w15="http://schemas.microsoft.com/office/word/2012/wordml" xmlns:w10="urn:schemas-microsoft-com:office:word" xmlns:wpg="http://schemas.microsoft.com/office/word/2010/wordprocessingGroup" xmlns:wpi="http://schemas.microsoft.com/office/word/2010/wordprocessingInk" xmlns:wne="http://schemas.microsoft.com/office/word/2006/wordml" xmlns:wps="http://schemas.microsoft.com/office/word/2010/wordprocessingShape" xmlns:wpsCustomData="http://www.wps.cn/officeDocument/2013/wpsCustomData" mc:Ignorable="w14 w15 wp14"><w:comment w:id="0" w:author="guochuanxiang" w:date="2019-03-14T14:46:32Z" w:initials="g"><w:p><w:pPr><w:pStyle w:val="2"/></w:pPr><w:r><w:t>这是一个批注</w:t></w:r></w:p></w:comment></w:comments>
With two comments:
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<w:comments xmlns:wpc="http://schemas.microsoft.com/office/word/2010/wordprocessingCanvas" xmlns:mc="http://schemas.openxmlformats.org/markup-compatibility/2006" xmlns:o="urn:schemas-microsoft-com:office:office" xmlns:r="http://schemas.openxmlformats.org/officeDocument/2006/relationships" xmlns:m="http://schemas.openxmlformats.org/officeDocument/2006/math" xmlns:v="urn:schemas-microsoft-com:vml" xmlns:wp14="http://schemas.microsoft.com/office/word/2010/wordprocessingDrawing" xmlns:wp="http://schemas.openxmlformats.org/drawingml/2006/wordprocessingDrawing" xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main" xmlns:w14="http://schemas.microsoft.com/office/word/2010/wordml" xmlns:w15="http://schemas.microsoft.com/office/word/2012/wordml" xmlns:w10="urn:schemas-microsoft-com:office:word" xmlns:wpg="http://schemas.microsoft.com/office/word/2010/wordprocessingGroup" xmlns:wpi="http://schemas.microsoft.com/office/word/2010/wordprocessingInk" xmlns:wne="http://schemas.microsoft.com/office/word/2006/wordml" xmlns:wps="http://schemas.microsoft.com/office/word/2010/wordprocessingShape" xmlns:wpsCustomData="http://www.wps.cn/officeDocument/2013/wpsCustomData" mc:Ignorable="w14 w15 wp14"><w:comment w:id="0" w:author="guochuanxiang" w:date="2019-03-14T14:46:32Z" w:initials="g"><w:p><w:pPr><w:pStyle w:val="2"/></w:pPr><w:r><w:t>这是一个批注</w:t></w:r></w:p></w:comment><w:comment w:id="1" w:author="guochuanxiang" w:date="2019-03-14T14:52:47Z" w:initials="g"><w:p><w:pPr><w:pStyle w:val="2"/></w:pPr><w:r><w:t>这是第二个批注</w:t></w:r></w:p></w:comment></w:comments>
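The only difference is that the second comment appends one more w:comment element with the next w:id. You can try this yourself to see it.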
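The comments.xml pattern, "append one w:comment whose w:id is the largest existing id plus one", can be sketched with ElementTree as follows. append_comment is a hypothetical helper. Unlike the samples above it leaves out the w:pPr/w:pStyle element, since the style id "2" is specific to that sample file, and it stamps the current UTC time as w:date. As with any ElementTree round trip, unused namespace declarations on the root are not preserved on serialization, so this is meant to show the structure rather than to be a drop-in writer.

```python
import xml.etree.ElementTree as ET
from datetime import datetime, timezone

W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
ET.register_namespace("w", W_NS)


def w(tag):
    """Qualify a tag or attribute name with the WordprocessingML namespace."""
    return "{%s}%s" % (W_NS, tag)


def append_comment(comments_root, text, author, initials=""):
    """Append a w:comment to a parsed w:comments root and return its id."""
    ids = [int(c.get(w("id"))) for c in comments_root.findall(w("comment"))]
    new_id = max(ids, default=-1) + 1  # the first comment gets id 0
    comment = ET.SubElement(comments_root, w("comment"), {
        w("id"): str(new_id),
        w("author"): author,
        w("date"): datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ"),
        w("initials"): initials,
    })
    run = ET.SubElement(ET.SubElement(comment, w("p")), w("r"))
    # ElementTree escapes &, < and > in the text when serializing.
    ET.SubElement(run, w("t")).text = text
    return new_id
```

The returned id is the value that has to be used for the matching w:commentRangeStart, w:commentRangeEnd, and w:commentReference in document.xml.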
Without further ado, here is the implementation.
Run: python3 [code.py, the script file name] [path to the docx file] [text to annotate] [comment text]
Example: python3 insert_comments.py /Users/guochuanxiang/Desktop/comments.docx 文本 批注
# coding: utf-8
import os
import re
import shutil
import sys
from xml.sax.saxutils import escape
from zipfile import ZIP_DEFLATED, ZipFile


def write_comments(comments_file_content, comments):  # comments: [annotated text, comment text, comment id]
    comments_id = comments[2]
    print('generate comments.xml content....')
    tmp = '<w:comment w:id="{}" w:author="guochuanxiang" w:date="2019-03-13T15:10:06Z" w:initials="g"><w:p><w:pPr><w:pStyle w:val="2"/></w:pPr><w:r><w:t>{}</w:t></w:r></w:p></w:comment></w:comments>'.format(comments_id, escape(comments[1]))
    # Replace the closing </w:comments> tag with the new comment plus the closing tag
    end = comments_file_content.rindex('</w:comments>')
    content_comments = comments_file_content[:end] + tmp
    return content_comments


def write_document(document_file_content, comments):
    comments_id = comments[2]
    text = escape(comments[0])  # text in document.xml is XML-escaped
    print('generate document.xml content....')
    # Close the current run, wrap the annotated text in comment markers, then reopen a run
    tmp = '</w:t></w:r><w:commentRangeStart w:id="{0}"/><w:r><w:rPr><w:rFonts w:hint="eastAsia"/></w:rPr><w:t>{1}</w:t></w:r><w:commentRangeEnd w:id="{0}"/><w:r><w:commentReference w:id="{0}"/></w:r><w:r><w:rPr><w:rFonts w:hint="eastAsia"/></w:rPr><w:t>'.format(comments_id, text)
    # Only the first occurrence of the text is replaced
    content_document = document_file_content.replace(text, tmp, 1)
    return content_document


def write_rel(rel_file_content, comments):
    if rel_file_content.find('comments.xml') == -1:
        print('not find comments.xml')
        # Use the highest existing rId + 1 so the new id cannot collide
        used_ids = [int(i) for i in re.findall(r'Id="rId(\d+)"', rel_file_content)]
        new_id = 'rId{}'.format(max(used_ids, default=0) + 1)
        end = rel_file_content.rindex('</Relationships>')
        content_rel = rel_file_content[:end] + '<Relationship Id="{}" Type="http://schemas.openxmlformats.org/officeDocument/2006/relationships/comments" Target="comments.xml"/></Relationships>'.format(new_id)
        print(content_rel)
        return content_rel
    else:
        print('get comments.xml in rels file')
        return rel_file_content


def run(file_path='/Users/guochuanxiang/Desktop/test.docx', comments=('内容', '批注1')):
    comments = list(comments)  # copy, so neither the default nor the caller's list is mutated
    doc_file = open(file_path, 'rb')
    doc = ZipFile(doc_file)
    doc.extractall()  # extract into the current working directory
    print('extracting....')
    file_name = doc.namelist()  # names of all files in the archive
    if 'word/comments.xml' not in file_name:
        print('create comments.xml')
        comments_file = '<?xml version="1.0" encoding="UTF-8" standalone="yes"?>\n<w:comments xmlns:wpc="http://schemas.microsoft.com/office/word/2010/wordprocessingCanvas" xmlns:mc="http://schemas.openxmlformats.org/markup-compatibility/2006" xmlns:o="urn:schemas-microsoft-com:office:office" xmlns:r="http://schemas.openxmlformats.org/officeDocument/2006/relationships" xmlns:m="http://schemas.openxmlformats.org/officeDocument/2006/math" xmlns:v="urn:schemas-microsoft-com:vml" xmlns:wp14="http://schemas.microsoft.com/office/word/2010/wordprocessingDrawing" xmlns:wp="http://schemas.openxmlformats.org/drawingml/2006/wordprocessingDrawing" xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main" xmlns:w14="http://schemas.microsoft.com/office/word/2010/wordml" xmlns:w15="http://schemas.microsoft.com/office/word/2012/wordml" xmlns:w10="urn:schemas-microsoft-com:office:word" xmlns:wpg="http://schemas.microsoft.com/office/word/2010/wordprocessingGroup" xmlns:wpi="http://schemas.microsoft.com/office/word/2010/wordprocessingInk" xmlns:wne="http://schemas.microsoft.com/office/word/2006/wordml" xmlns:wps="http://schemas.microsoft.com/office/word/2010/wordprocessingShape" xmlns:wpsCustomData="http://www.wps.cn/officeDocument/2013/wpsCustomData" mc:Ignorable="w14 w15 wp14"></w:comments>'
        comments.append(0)
    else:
        comments_file = doc.read('word/comments.xml').decode('utf-8')  # read comments.xml
        comment_id = re.compile(r'(?<=w:id=")\d+')  # find all existing comment ids
        # Compare the ids as integers (a string max would rank "9" above "10")
        comment_id = max((int(i) for i in comment_id.findall(comments_file)), default=-1) + 1
        comments.append(comment_id)
    document_file = doc.read('word/document.xml').decode('utf-8')  # read document.xml
    rel_file = doc.read('word/_rels/document.xml.rels').decode('utf-8')  # read the rels file
    doc.close()
    doc_file.close()
    comments_g = write_comments(comments_file, comments)  # comments.xml with the new comment
    document = write_document(document_file, comments)  # document.xml with the comment markers
    rel = write_rel(rel_file, comments)  # rels file with the comments relationship
    print('get all content')
    print('writing document.xml.rels...')
    r_f = open('word/_rels/document.xml.rels', 'w', encoding='utf-8')
    r_f.write(rel)
    r_f.close()
    print('done')
    print('writing comments.xml...')
    c_f = open('word/comments.xml', 'w', encoding='utf-8')  # write the updated comments.xml
    c_f.write(comments_g)
    c_f.close()
    print('done')
    print('writing document.xml....')  # write the updated document.xml
    d_f = open('word/document.xml', 'w', encoding='utf-8')
    d_f.write(document)
    d_f.close()
    print('done')
    os.remove(file_path)  # delete the original docx
    print('create commented docx....')
    new_file = ZipFile(file_path, mode='w', compression=ZIP_DEFLATED)  # create a new, empty docx
    try:
        if 'word/comments.xml' not in file_name:
            print('add {}'.format('word/comments.xml'))
            new_file.write('word/comments.xml')
        for name in file_name:
            if os.path.isfile(name):
                print('add {}'.format(name))
                new_file.write(name)  # zip the file back into the docx
    finally:
        print('closing')
        new_file.close()
    # Remove everything that was extracted, including the new comments.xml
    for name in {n.split('/')[0] for n in file_name}:
        if os.path.isfile(name):
            os.remove(name)
        elif os.path.isdir(name):
            shutil.rmtree(name)
    print('done')


if __name__ == '__main__':
    file_path = sys.argv[1]
    text = sys.argv[2]
    comment = sys.argv[3]
    comments = [text, comment]
    print(comments)
    run(file_path, comments)
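Summary: according to scanny, python-docx provides ways to insert content into the XML, but I have not used that module, so I did not dig into how to do this with python-docx. The approach above has limitations. Because the target is located by plain string replacement in document.xml, commenting the same passage more than once may cause problems. Using python-docx's insertion methods might solve this, so it is worth a try.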