import pandas as pd  # kept only for the commented-out to_csv export attempt in step 5

1. Traverse all file names under the folder and get each file's path

import os
from os import path

# Collect the full path of every file under the given directory
def scaner_file(url):
    # List every entry under the given path
    files = os.listdir(url)
    file_list = []
    for f in files:
        # Join the directory and the file name into a full path
        real_url = path.join(url, f)
        # Save the path to the list
        file_list.append(real_url)
    return file_list
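
To check the result, call the function on the folder holding the papers; the listing in the comment below is only illustrative, based on the files used later in this post:

print(scaner_file('./papers'))
# e.g. ['./papers/1.docx', './papers/2.docx', './papers/3.docx', './papers/key_words.docx']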

2.1 Read the content of a Word file from its path and split it into sentences

import docx
# pip install python-docx
# Read the document at the given path
def read_data(url):
    file = docx.Document(url)
    # Read the document content paragraph by paragraph
    data = []
    for para in file.paragraphs:
        data.append(para.text)
    # Only the first paragraph (the abstract) is kept, split into sentences on '.'
    data = data[0].split('.')
    return data
read_data("./papers/3.docx")
['The Anti-Atlas Mountains constitute a Late Proterozoic suture zone produced by northward subduction of oceanic lithosphere culminating in the Pan-African orogeny',
 ' Southward migration of thrust slices associated with the destruction of the fbrearc terrane resulted in the uplift and erosion of previously deposited basin sediments',
 ' These sediments were subsequently reincorporated into collisional basin deposits of the Tiddiline Formation',
 'The Tiddiline Formation consists of coarsening-upwards sequences of maroon siltstones, sandstones and intraforma- tional conglomerates',
 ' These rocks unconfbrmably overlie metamorphosed volcaniclastic rocks of the relict fbrearc basin and accretionary terrane',
 ' Syn- and post-deposlional deformation has resulted in folding about gently-plunging fold axes',
 ' Folds were subsequently cut by strike-slip faults that strike at a high angle to the basin axis',
 ' Deformation of the Tiddiline Formation is attributed to transpressional suturing of the relict fbrearc terrane to the West African Craton to the south',
 ' Collisional basins of the Anti-Atlas Mountains serve as ancient analogs fbr the destruction of fbrearc basins in an oblique- convergent margin setting, such as those of the Western Pacific region',
 '']
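
Note the empty string at the end of the list above: splitting on '.' leaves an empty trailing fragment and would also cut inside decimal numbers such as 3.5. A minimal sketch of a slightly more careful split, assuming sentences end with a period followed by whitespace or the end of the paragraph:

import re

def split_sentences(paragraph):
    # Split on a period followed by whitespace or the end of the string, then drop empty fragments
    parts = re.split(r'\.(?:\s+|$)', paragraph)
    return [p.strip() for p in parts if p.strip()]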

2.2 Read the abstract from a paper's PDF file by path and split it into sentences (not used)

Unfortunately, the papers I needed to process are not in a uniform format, so this module was never used; instead I first copied the paragraphs to search (the abstracts) into Word files and then ran the pipeline on those.

A pdfplumber.PDF object has two main attributes, .metadata and .pages.
.metadata is a dictionary of information about the PDF.
.pages is a list of page objects.

Each pdfplumber.Page object has several main attributes:
.page_number  the page number
.width  the page width
.height  the page height
.objects/.chars/.lines/.rects  each of these is a list of dictionaries, one per object on the page, describing the position of lines, characters, rectangles and so on.

.extract_text()  extracts the text of the page, collating all character objects into a single string
.extract_words()  returns all words together with their related information
.extract_tables()  extracts the tables on the page
.to_image()  returns an instance of the PageImage class, used for visual debugging
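
A minimal sketch of these attributes in use; the file name example.pdf is only a placeholder:

import pdfplumber

with pdfplumber.open('example.pdf') as pdf:        # placeholder path
    print(pdf.metadata)                            # dict of information about the PDF
    page = pdf.pages[0]
    print(page.page_number, page.width, page.height)
    print((page.extract_text() or '')[:100])       # first 100 characters of the page text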

import pdfplumber
# pip install pdfplumber
def read_pdf(url):
    with pdfplumber.open(url) as pdf:
        # Take the first page of the PDF
        first_page = pdf.pages[0]
        text = first_page.extract_text()  # <class 'str'>
        start = text.find('ABSTRACT')
        end = text.find('INTRODUCTION')
        # Slice the string: skip 'ABSTRACT' plus the separator after it, keep only the content
        data = text[start + 10:end]
        # Remove the '\n' characters left over from the page layout
        data = data.replace('\n', '')
        # Split into sentences
        data = data.split('.')
        return data
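
Usage would be a single call like the one below; the path is only a placeholder. Note that str.find() returns -1 when 'ABSTRACT' or 'INTRODUCTION' is missing from the first page, so the slice quietly degrades on papers with a different layout, which is why the Word route above was used instead.

abstract_sentences = read_pdf('./papers/example.pdf')  # placeholder path
print(abstract_sentences[:2])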

3. Get the keywords

import docx
# pip install python-docx
# Read the keyword document at the given path
def read_keywords(url):
    file = docx.Document(url)
    # Read the document paragraph by paragraph, one keyword per paragraph
    data = []
    for para in file.paragraphs:
        data.append(para.text)
    return data
read_keywords('./papers/key_words.docx')
['Continental rifts',
 'Nascent ocean basins',
 'Property value',
 'Intraplate continental margins',
 'Intracratonic basins ',
 'Continental platforms ',
 'Active ocean basins',
 'Oceanic islands',
 'seamounts',
 'aseismic ridges',
 'and plateaus',
 'Dormant ocean basins ',
 'Transtensional basins',
 'Transpressional basins',
 'Transrotational basins',
 'Trenches ',
 'Trench-slope basins',
 'Forearc basins',
 'Intraarc basins',
 'Backarc basins',
 'Retroforeland basins',
 'Remnant ocean basins',
 'Proforeland basins ',
 'Wedgetop basins',
 'Hinterland basins',
 'Aulacogens',
 'Impactogens',
 'Collisional broken foreland',
 'Halokinetic basins ',
 'Bolide basins',
 'Successor basins',
 'Shelf-slope-rise configuration',
 'Transform configuration ',
 'Embankment configuration',
 'Oceanic intraarc basins ',
 'Continental intraarc basins',
 'Oceanic backarc basins',
 'Continental backarc',
 'Retroarc foreland basins ',
 'Collisional retroforeland',
 'Broken-retroforeland']
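
Several of the keywords above carry trailing spaces (e.g. 'Intracratonic basins '). A minimal cleanup sketch that strips the whitespace and drops empty entries before matching:

keywords = read_keywords('./papers/key_words.docx')
keywords = [k.strip() for k in keywords if k.strip()]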

4. Process the data: filter the sentences that contain keywords and save them

def handle_data(url):
    # Read the document at the given path
    data = read_data(url)
    print('Currently querying url =', url)
    # Load the keywords
    keywords = read_keywords('./papers/key_words.docx')
    ret_list = []
    # Keep the sentences that contain a keyword
    for i in data:
        for j in keywords:
            # Strip stray whitespace from the keyword and ignore case
            if i.casefold().find(j.strip().casefold()) != -1:
                # Append to the list with a period and a newline at the end
                ret_list.append(i + '.\n')
                print(j, 'original sentence:', i)
                break
    return ret_list
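
handle_data returns the matched sentences ready to be written out; the run in step 5 shows, for example, that ./papers/1.docx yields at least one match:

matches = handle_data('./papers/1.docx')
print(len(matches))  # number of sentences that contained a keyword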

5. Main function

# Collect every file path under ./papers (see step 1), then loop over them
urls = scaner_file('./papers')
for url in urls:
    # Check whether the file contains any keyword
    sentences = handle_data(url)
    if sentences == []:
        print('Searched file:', url, '- no keywords found')
    else:
        # Slice off the '.docx' extension: url[0:-5]
        # (an earlier attempt exported via pandas to_csv, hence the import at the top)
        f = open(url[0:-5] + '.txt', 'w', encoding='utf-8')
        f.writelines(sentences)  # write every matched sentence to the file
        f.close()
Currently querying url = ./papers/1.docx
Retroarc foreland basins  original sentence:  The source regions for retroarc foreland basins generally, and the Magallanes-Austral Basin specifically, can be broadly divided into (1) the magmatic arc, (2) the fold-and-thrust belt, and (3) sources around the periphery of foreland flexural subsidence
Currently querying url = ./papers/2.docx
Searched file: ./papers/2.docx - no keywords found
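
One caveat: scaner_file('./papers') also returns key_words.docx itself and any .txt files written on a previous run, so in practice it helps to restrict the loop to the paper documents. A minimal sketch, assuming the papers are all the .docx files other than key_words.docx:

urls = [u for u in scaner_file('./papers')
        if u.endswith('.docx') and not u.endswith('key_words.docx')]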