Text Duplicate-Checking Algorithm in Java: Paper Duplicate-Checking Source Code

Reposted

IT独行侠客 2024-06-19 23:03:58

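Project GitHub repository link

Course this assignment belongs to	Software Engineering
Assignment requirements	Individual project
Goal of this assignment	Implement a paper duplicate-checking function and test how the project runs and performs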

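I. PSP Table

PSP2.1	Personal Software Process Stages	Estimated time (min)	Actual time (min)
Planning	Planning	60	120
Estimate	Estimate how long the task will take	800	1800
Development	Development	300	240
Analysis	Requirements analysis (including learning new technologies)	240	600
Design Spec	Write the design document	60	60
Design Review	Design review	60	60
Coding Standard	Coding standard (define suitable conventions for the current development)	60	60
Design	Detailed design	120	240
Coding	Coding	120	360
Code Review	Code review	60	60
Test	Testing (self-testing, fixing code, committing changes)	120	60
Reporting	Reporting	60	60
Test Report	Test report	60	30
Size Measurement	Measure the workload	30	30
Postmortem & Process Improvement Plan	Postmortem and process improvement plan	60	30
	Total	1410	1800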

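II. Design and Implementation of the Computation Module Interface

This project uses the HanLP library to segment text into words and computes the similarity of two texts with cosine similarity.

Module interfaces

[Figure: module interfaces]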

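1. txtOperation class

Function	Description
List<String> Read(String path)	Reads the contents of a file
void Write(String path, float result)	Writes the result to a file
int[] Counter(List merge)	Counts how often each word occurs in the text

2. WordsCut class

Function	Description
List<String> splitWords(String s1)	Segments the text into words
List<String> Merge(List<String> s1, List<String> s2)	Merges the two lists that hold the segmented words

3. CosineSimilarity class

Function	Description
float getSimilarity(int[] number1, int[] number2)	Computes and returns the cosine of the angle between the two vectors

4. main class

Function	Description
void main(String[] args)	Calls the project's interfaces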

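Module design

First, the Read function of the txtOperation class loads the two texts. The splitWords function of the WordsCut class then segments each text into words, and the two segmentation results are merged into their union. For each text, the frequency of every word in the union is counted, and these counts become the components of that text's vector. Finally, the function in the CosineSimilarity class computes the cosine of the angle between the two vectors. Because the cosine approaches 1 as the angle between the vectors approaches 0°, a result closer to 1 means the two texts are more similar.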
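The vector-building step described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the project's actual code: the class name FrequencyVectors and the methods union and count are hypothetical, and the token lists are assumed to be already segmented (for example by HanLP).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper for illustration; not part of the original project.
class FrequencyVectors {

    // Union of both token lists in first-seen order, so that both
    // vectors use the same dimension order.
    static List<String> union(List<String> a, List<String> b) {
        Set<String> vocab = new LinkedHashSet<>(a);
        vocab.addAll(b);
        return new ArrayList<>(vocab);
    }

    // Counts how often each vocabulary word occurs in the token list.
    static int[] count(List<String> vocab, List<String> tokens) {
        int[] vector = new int[vocab.size()];
        for (String token : tokens) {
            int index = vocab.indexOf(token);
            if (index >= 0) {
                vector[index]++;
            }
        }
        return vector;
    }

    public static void main(String[] args) {
        List<String> doc1 = Arrays.asList("today", "weather", "sunny", "weather");
        List<String> doc2 = Arrays.asList("today", "weather", "cloudy");
        List<String> vocab = union(doc1, doc2);
        System.out.println(vocab);
        System.out.println(Arrays.toString(count(vocab, doc1)));
        System.out.println(Arrays.toString(count(vocab, doc2)));
    }
}
```

With the sample tokens above, the shared vocabulary is [today, weather, sunny, cloudy] and the two vectors are [1, 2, 1, 0] and [1, 1, 0, 1]. The indexOf lookup is linear in the vocabulary size, so for long texts a word-to-index map would likely be a better fit.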

Code for computing the cosine similarity

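public static float getSimilarity(int[] number1, int[] number2) {
    if (number1.length != number2.length) {
        throw new IllegalArgumentException("Vectors must have the same length");
    }
    // Accumulate in double so that large counts cannot overflow int
    double square1 = 0, square2 = 0, product = 0;
    for (int i = 0; i < number1.length; i++) {
        // Dot product of the two vectors
        product += (double) number1[i] * number2[i];
        // Sum of squares of each vector
        square1 += (double) number1[i] * number1[i];
        square2 += (double) number2[i] * number2[i];
    }
    // A zero vector has no direction; return 0 instead of NaN
    if (square1 == 0 || square2 == 0) {
        return 0f;
    }
    // Cosine of the angle between the two vectors
    return (float) (product / (Math.sqrt(square1) * Math.sqrt(square2)));
}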

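III. Performance Improvements to the Computation Module Interface

Overview

[Figure: performance overview]

Memory

[Figure: memory usage]

Analysis

The three largest memory consumers are the int arrays and the two word-segmentation classes. The int arrays hold the vectors that represent the two texts. Because the number of segmented words grows with the amount of text, longer texts cost more. Reducing repeated reads of the int arrays can lower that overhead to some extent.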
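One possible direction, offered only as a sketch and not as what the project does, is to skip the merged list and the dense int arrays and compute the cosine directly from word-count maps. The class SparseCosine and its methods are hypothetical names. Whether this actually uses less memory than the arrays would need to be confirmed with the profiler, since hash maps carry their own per-entry overhead.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical alternative for illustration; not the original implementation.
class SparseCosine {

    // One pass over the tokens builds a word -> count map.
    static Map<String, Integer> countWords(List<String> tokens) {
        Map<String, Integer> counts = new HashMap<>();
        for (String token : tokens) {
            counts.merge(token, 1, Integer::sum);
        }
        return counts;
    }

    // Sum of squared counts, i.e. the squared length of the vector.
    static double squaredNorm(Map<String, Integer> counts) {
        double sum = 0;
        for (int c : counts.values()) {
            sum += (double) c * c;
        }
        return sum;
    }

    static double similarity(List<String> tokens1, List<String> tokens2) {
        Map<String, Integer> a = countWords(tokens1);
        Map<String, Integer> b = countWords(tokens2);
        // Only words present in both texts contribute to the dot product.
        double product = 0;
        for (Map.Entry<String, Integer> e : a.entrySet()) {
            Integer other = b.get(e.getKey());
            if (other != null) {
                product += (double) e.getValue() * other;
            }
        }
        double norms = Math.sqrt(squaredNorm(a)) * Math.sqrt(squaredNorm(b));
        // An empty text has no direction; report 0 rather than NaN.
        return norms == 0 ? 0 : product / norms;
    }

    public static void main(String[] args) {
        List<String> doc1 = Arrays.asList("today", "weather", "sunny", "weather");
        List<String> doc2 = Arrays.asList("today", "weather", "cloudy");
        System.out.println(similarity(doc1, doc2));
    }
}
```

Each token list is scanned once, and words that appear in only one text never take part in the dot product.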

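IV. Unit Tests for the Computation Module

1. txtOperation class tests

Reading and writing are tested by opening a valid file, a path in the wrong format, and a path that does not exist. The Counter function is tested by counting word frequencies over manually entered strings.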

package test;

import com.Words.txtOperation;
import org.junit.Test;

import java.io.IOException;
import java.io.Writer;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

import static org.junit.Assert.*;

public class txtOperationTest extends txtOperation {

    @Test
    public void read() throws IOException {
        // Normal read
        List<String> t1=Read("F:\\program\\c\\software\\txt\\1.txt");
        System.out.println(t1);
        // Read a file that does not exist
        List<String> t2=Read("F:\\program\\c\\software\\txt\\111.txt");
        // Read a path in the wrong format
        List<String> t3=Read("F:\\program\\c\\software\\txt\\");
    }

    @Test
    public void write() throws IOException {
        Write("F:\\program\\c\\software\\txt\\3.txt",(float)1.11);
    }

    @Test
    public void counter() {
        List<String> t1=new ArrayList<>();
        t1.add("aa");    t1.add("aa");    t1.add("ab");    t1.add("bb");
        List<String> t2=new ArrayList<>();
        t2.add("aa");    t2.add("a");    t2.add("ab");    t2.add("bb");
        int[] a=Counter(t1,t2);
        System.out.println(Arrays.toString(a));
    }
}
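The hard-coded F:\ paths tie these tests to one machine. A portable variant could create its files in the system temporary directory instead. The sketch below is a stand-alone illustration using java.nio.file.Files; the class TempFileRoundTrip and its methods are hypothetical and do not call the project's Read or Write functions, and UTF-8 is an assumed encoding.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.NoSuchFileException;
import java.nio.file.Path;
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-alone check; it does not use the project's txtOperation class.
class TempFileRoundTrip {

    // Writes the lines to a fresh temporary file and reads them back.
    static List<String> roundTrip(List<String> lines) throws IOException {
        Path file = Files.createTempFile("dupcheck", ".txt");
        try {
            Files.write(file, lines, StandardCharsets.UTF_8);
            return Files.readAllLines(file, StandardCharsets.UTF_8);
        } finally {
            Files.deleteIfExists(file);
        }
    }

    // Returns true if reading a missing file fails with NoSuchFileException.
    static boolean missingFileIsReported() throws IOException {
        Path dir = Files.createTempDirectory("dupcheck");
        try {
            Files.readAllLines(dir.resolve("missing.txt"), StandardCharsets.UTF_8);
            return false;
        } catch (NoSuchFileException e) {
            return true;
        } finally {
            Files.deleteIfExists(dir);
        }
    }

    public static void main(String[] args) throws IOException {
        System.out.println(roundTrip(Arrays.asList("first line", "second line")));
        System.out.println(missingFileIsReported());
    }
}
```

The same pattern would carry over to the project's own functions by passing the temporary file's path as a string, which is left here as an assumption about how Read and Write accept paths.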

2. WordsCut class tests

Manually entered strings are used to check that the segmentation function splits text into words correctly.

package test;

import com.Words.WordsCut;
import org.junit.Test;

import java.util.ArrayList;
import java.util.List;

import static org.junit.Assert.*;

public class WordsCutTest extends WordsCut {

    @Test
    public void testSplitWords() {
        String t="这是一个用于测试中文分词函数的语句。";
        System.out.println(splitWords(t));
        t="This is a test aim to test the SplitWords Function of English.";
        System.out.println(splitWords(t));
    }

    @Test
    public void testmerge() {
        String t1="这是第一个字符串";
        List<String>s1=splitWords(t1);
        String t2="This is the second string";
        List<String>s2=splitWords(t2);
        System.out.println(Merge(s1,s2));
    }
}

3. CosineSimilarity class tests

Manually entered vectors are used to check that the function computes the cosine correctly.

package test;

import com.Words.CosineSimilarity;
import org.junit.Test;

import static org.junit.Assert.*;

public class CosineSimilarityTest extends CosineSimilarity {

    @Test
    public void testGetSimilarity() {
        // Compute and print the cosine for several pairs of vectors
        int t1[]={1,0};
        int t2[]={1,0};
        System.out.println(getSimilarity(t1,t2));
        int t3[]={1,0};
        int t4[]={0,1};
        System.out.println(getSimilarity(t3,t4));
        int t5[]={1,1};
        int t6[]={2,3};
        System.out.println(getSimilarity(t5,t6));
        int t7[]={1,5,5};
        int t8[]={1,6,2};
        System.out.println(getSimilarity(t7,t8));
        int t9[]={7,2,8,9};
        int t0[]={1,0,5,6};
        System.out.println(getSimilarity(t9,t0));
    }
}
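The test above only prints the results, so a wrong value would not make it fail. A sketch of checks with explicit expected values follows. It is stand-alone: the class CosineChecks, the local cosine method (which mirrors the formula used by getSimilarity), and the expectClose helper are all hypothetical. In a JUnit 4 test the same comparison could be written with assertEquals(expected, actual, delta).

```java
// Hypothetical stand-alone check; cosine() mirrors the formula used by getSimilarity.
class CosineChecks {

    static double cosine(int[] a, int[] b) {
        double dot = 0, sqA = 0, sqB = 0;
        for (int i = 0; i < a.length; i++) {
            dot += (double) a[i] * b[i];
            sqA += (double) a[i] * a[i];
            sqB += (double) b[i] * b[i];
        }
        return dot / (Math.sqrt(sqA) * Math.sqrt(sqB));
    }

    // Fails loudly when the actual value is not within 1e-6 of the expected one.
    static void expectClose(double expected, double actual) {
        if (Math.abs(expected - actual) > 1e-6) {
            throw new AssertionError("expected " + expected + " but got " + actual);
        }
    }

    public static void main(String[] args) {
        // Identical vectors: angle 0, cosine 1
        expectClose(1.0, cosine(new int[]{1, 0}, new int[]{1, 0}));
        // Orthogonal vectors: angle 90 degrees, cosine 0
        expectClose(0.0, cosine(new int[]{1, 0}, new int[]{0, 1}));
        // (1,1) and (2,3): dot product 5, lengths sqrt(2) and sqrt(13)
        expectClose(5 / Math.sqrt(26), cosine(new int[]{1, 1}, new int[]{2, 3}));
        System.out.println("all checks passed");
    }
}
```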

Test results

[Figure: test results]

Test coverage

[Figure: test coverage]

V. Exception Handling in the Computation Module

1. Path to a txt file that does not exist

        // Open the file at path; try-with-resources closes the reader when done
        try (BufferedReader r = new BufferedReader(
                new InputStreamReader(new FileInputStream(path)))) {
            String line = r.readLine();
            // Read each line in turn and add it to the list
            while (line != null) {
                article.add(line);
                line = r.readLine();
            }
        } catch (IOException e) {
            System.out.println("Failed to open the file; please check that the file path is correct");
        }

Output:

[Figure: program output]
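For reference, the fragment above can be placed in a complete method along the following lines. This is a sketch under stated assumptions rather than the project's Read function: the class SafeReader and method readLines are hypothetical names, UTF-8 is an assumed encoding, and the check for a regular file is an added step meant to cover the case where the path points to a directory.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration; not the project's Read function.
class SafeReader {

    // Returns the file's lines, or an empty list when the path cannot be read.
    static List<String> readLines(String path) {
        List<String> article = new ArrayList<>();
        Path file = Paths.get(path);
        // Reject missing paths and directories before opening anything.
        if (!Files.isRegularFile(file)) {
            System.out.println("Not a readable file: " + path);
            return article;
        }
        // try-with-resources closes the reader even if readLine() throws.
        try (BufferedReader reader = Files.newBufferedReader(file, StandardCharsets.UTF_8)) {
            String line;
            while ((line = reader.readLine()) != null) {
                article.add(line);
            }
        } catch (IOException e) {
            System.out.println("Failed to read file: " + e.getMessage());
        }
        return article;
    }

    public static void main(String[] args) {
        // A directory is not a regular file, so this prints a message and an empty list.
        System.out.println(readLines(System.getProperty("java.io.tmpdir")));
    }
}
```

Returning an empty list keeps the caller running after a bad path. Whether the program should instead stop with an error is a design choice the original text does not settle.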