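Spring Boot 2.x: Crawling an IP Proxy Pool and Storing It in a Database
Overview
Once a crawler moves beyond the basics, an IP proxy pool is the first thing it needs: a single proxy that sends requests too frequently gets banned, so we keep a pool of proxies to send requests through.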
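To make the idea of "a pool to send requests through" concrete, here is a minimal sketch of round-robin rotation over a fixed list of proxies. The class name ProxyRotator and the "host:port" string format are assumptions made for illustration; they are not part of this project.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical helper: hands out proxies from a fixed list in round-robin order. */
class ProxyRotator {
    private final List<String> proxies;                      // entries in "host:port" form
    private final AtomicInteger cursor = new AtomicInteger(); // index of the next proxy to hand out

    ProxyRotator(List<String> proxies) {
        if (proxies == null || proxies.isEmpty()) {
            throw new IllegalArgumentException("proxy list must not be empty");
        }
        this.proxies = new ArrayList<>(proxies);              // defensive copy
    }

    /** Returns the next proxy, wrapping around at the end of the list. */
    String next() {
        // floorMod keeps the index non-negative even if the counter overflows
        int index = Math.floorMod(cursor.getAndIncrement(), proxies.size());
        return proxies.get(index);
    }
}
```

Each call to next() returns a different entry until the list wraps around, so consecutive requests are spread across the pool instead of hammering one proxy.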
Tech stack
- HttpClient
- Spring Boot 2.3.1
- JDK 1.8
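Quickly creating the Spring Boot project
Visit https://start.spring.io/ to generate a starter project.
We will be making requests to HTTP endpoints, so add the Web dependency.
Click Generate to download the project as a zip archive.
Importing the Spring Boot project
After unzipping, copy the path of the extracted demo folder; you will need it for the import.
First open pom.xml in your IDE and add the snippet below inside the <project> element.
It optionally switches dependency downloads to the Aliyun mirror in China.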
<repositories>
<!-- Aliyun public repository; proxies Maven Central and JCenter -->
<repository>
<id>aliyun</id>
<name>aliyun</name>
<url>https://maven.aliyun.com/repository/public</url>
<releases>
<enabled>true</enabled>
</releases>
<snapshots>
<enabled>false</enabled>
</snapshots>
</repository>
<!-- Aliyun proxy of the official Spring repository -->
<repository>
<id>spring-milestones</id>
<name>Spring Milestones</name>
<url>https://maven.aliyun.com/repository/spring</url>
<releases>
<enabled>true</enabled>
</releases>
<snapshots>
<enabled>false</enabled>
</snapshots>
</repository>
</repositories>
<pluginRepositories>
<!-- Aliyun proxy of the Spring plugin repository -->
<pluginRepository>
<id>spring-plugin</id>
<name>spring-plugin</name>
<url>https://maven.aliyun.com/repository/spring-plugin</url>
<releases>
<enabled>true</enabled>
</releases>
<snapshots>
<enabled>false</enabled>
</snapshots>
</pluginRepository>
</pluginRepositories>
The import steps are as follows:
In IDEA, click File -> Open -> paste the path of the project folder -> select pom.xml and double-click it -> Open as Project, then wait for Maven to finish loading.
The pom.xml file
<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 https://maven.apache.org/xsd/maven-4.0.0.xsd">
<modelVersion>4.0.0</modelVersion>
<parent>
<groupId>org.springframework.boot</groupId>
<artifactId>spring-boot-starter-parent</artifactId>
<version>2.3.1.RELEASE</version>
<relativePath/> <!-- lookup parent from repository -->
</parent>
<groupId>com.github.gleans</groupId>
<artifactId>SpringBoot-ProxyPool</artifactId>
<version>0.0.1-SNAPSHOT</version>
<name>SpringBoot-ProxyPool</name>
<description>Demo project for Spring Boot</description>
<properties>
<java.version>1.8</java.version>
<httpclient.version>4.5.12</httpclient.version>
<jsonp.version>1.13.1</jsonp.version>
<knife4j.version>2.0.3</knife4j.version>
<lombok.version>1.18.12</lombok.version>
<mysql.version>8.0.19</mysql.version>
</properties>
<dependencies>
<dependency>
<groupId>org.apache.httpcomponents</groupId>
<artifactId>httpclient</artifactId>
<version>${httpclient.version}</version>
</dependency>
<dependency>
<groupId>org.springframework.boot</groupId>
<artifactId>spring-boot-starter-web</artifactId>
</dependency>
<dependency>
<groupId>org.projectlombok</groupId>
<artifactId>lombok</artifactId>
<version>${lombok.version}</version>
<scope>provided</scope>
</dependency>
<dependency>
<groupId>org.springframework.boot</groupId>
<artifactId>spring-boot-starter-test</artifactId>
<scope>test</scope>
<exclusions>
<exclusion>
<groupId>org.junit.vintage</groupId>
<artifactId>junit-vintage-engine</artifactId>
</exclusion>
</exclusions>
</dependency>
<dependency>
<groupId>com.github.xiaoymin</groupId>
<artifactId>knife4j-spring-boot-starter</artifactId>
<version>${knife4j.version}</version>
</dependency>
<dependency>
<groupId>org.jsoup</groupId>
<artifactId>jsoup</artifactId>
<version>${jsonp.version}</version>
</dependency>
<dependency>
<groupId>mysql</groupId>
<artifactId>mysql-connector-java</artifactId>
<version>${mysql.version}</version>
</dependency>
<dependency>
<groupId>org.springframework.boot</groupId>
<artifactId>spring-boot-starter-data-jpa</artifactId>
</dependency>
<dependency>
<groupId>org.springframework.boot</groupId>
<artifactId>spring-boot-starter-thymeleaf</artifactId>
</dependency>
</dependencies>
<build>
<plugins>
<plugin>
<groupId>org.springframework.boot</groupId>
<artifactId>spring-boot-maven-plugin</artifactId>
</plugin>
</plugins>
</build>
<repositories>
<!-- Aliyun public repository; proxies Maven Central and JCenter -->
<repository>
<id>aliyun</id>
<name>aliyun</name>
<url>https://maven.aliyun.com/repository/public</url>
<releases>
<enabled>true</enabled>
</releases>
<snapshots>
<enabled>false</enabled>
</snapshots>
</repository>
<!-- Aliyun proxy of the official Spring repository -->
<repository>
<id>spring-milestones</id>
<name>Spring Milestones</name>
<url>https://maven.aliyun.com/repository/spring</url>
<releases>
<enabled>true</enabled>
</releases>
<snapshots>
<enabled>false</enabled>
</snapshots>
</repository>
</repositories>
<pluginRepositories>
<!-- Aliyun proxy of the Spring plugin repository -->
<pluginRepository>
<id>spring-plugin</id>
<name>spring-plugin</name>
<url>https://maven.aliyun.com/repository/spring-plugin</url>
<releases>
<enabled>true</enabled>
</releases>
<snapshots>
<enabled>false</enabled>
</snapshots>
</pluginRepository>
</pluginRepositories>
</project>
Creating the IP entity class
package com.github.gleans.ekko.model;

import io.swagger.annotations.ApiModelProperty;
import lombok.Data;
import lombok.NoArgsConstructor;
import lombok.experimental.Accessors;

import javax.persistence.Entity;
import javax.persistence.Id;

@Data
@Entity(name = "ip_data")
@NoArgsConstructor
@Accessors(chain = true)
public class IPData {

    @Id
    @ApiModelProperty(value = "ID")
    private Long ipNo;

    @ApiModelProperty(value = "Country")
    private String country;

    @ApiModelProperty(value = "IP address")
    private String ipAddress;

    @ApiModelProperty(value = "Port")
    private Integer port;

    @ApiModelProperty(value = "Server address")
    private String serverAddress;

    @ApiModelProperty(value = "Anonymity")
    private String anonymous;

    @ApiModelProperty(value = "Type")
    private String type;

    @ApiModelProperty(value = "Speed")
    private String speed;

    @ApiModelProperty(value = "Connection time")
    private String connTime;

    @ApiModelProperty(value = "Survival time")
    private String survivalTime;

    @ApiModelProperty(value = "Verification time")
    private String postTime;
}
The main service class
IPServiceImpl.java
package com.github.gleans.ekko.service.impl;

import com.github.gleans.ekko.model.IPData;
import com.github.gleans.ekko.service.IPService;
import com.github.gleans.ekko.utils.HttpCustom;
import lombok.extern.slf4j.Slf4j;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.select.Elements;
import org.springframework.stereotype.Service;

import java.util.ArrayList;
import java.util.List;
import java.util.Objects;
import java.util.stream.Collectors;

@Slf4j
@Service
public class IPServiceImpl implements IPService {

    @Override
    public List<IPData> getIpList() {
        String html = HttpCustom.getIpStore("https://www.xicidaili.com/nn/1", null, null);
        // The request helper yields no usable body when the request fails
        if (html == null || html.isEmpty()) {
            return new ArrayList<>();
        }
        // Parse the HTML into a DOM tree
        Document document = Jsoup.parse(html);
        // Select the rows of the proxy table
        Elements trs = document.select("table[id=ip_list]").select("tbody").select("tr");
        return trs.stream()
                .map(tr -> {
                    Elements tds = tr.select("td");
                    // Skip the header row (it has no td cells) and any row that is too short
                    if (tds.size() < 7) {
                        return null;
                    }
                    String country = tds.get(0).text();
                    String ipAddress = tds.get(1).text();
                    Integer port = Integer.valueOf(tds.get(2).text());
                    String serverAddress = tds.get(3).text();
                    String anonymous = tds.get(4).text();
                    String ipType = tds.get(5).text();
                    // The speed is stored in the title attribute of the bar element
                    String speed = tds.get(6).select("div[class=bar]").attr("title");
                    return new IPData().setIpAddress(ipAddress)
                            .setPort(port).setType(ipType)
                            .setCountry(country).setSpeed(speed)
                            .setAnonymous(anonymous).setServerAddress(serverAddress);
                })
                .filter(Objects::nonNull)
                .collect(Collectors.toList());
    }
}
The request helper class
package com.github.gleans.ekko.utils;

import org.apache.http.HttpHost;
import org.apache.http.client.config.RequestConfig;
import org.apache.http.client.methods.CloseableHttpResponse;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;
import org.apache.http.util.EntityUtils;

import java.io.IOException;
import java.nio.charset.StandardCharsets;

public class HttpCustom {

    private static final int CONNECT_TIMEOUT = 3000;
    private static final int SOCKET_TIMEOUT = 3000;

    /**
     * Fetches a web page and returns its body.
     *
     * @param url  the page to request
     * @param ip   proxy host, or null to connect directly
     * @param port proxy port, or null to connect directly
     * @return the response body, or an empty string if the request fails or the status is not 200
     */
    public static String getIpStore(String url, String ip, Integer port) {
        RequestConfig.Builder configBuilder = RequestConfig
                .custom()
                .setConnectTimeout(CONNECT_TIMEOUT)
                .setSocketTimeout(SOCKET_TIMEOUT);
        // Route the request through a proxy only when both host and port are given
        if (ip != null && port != null) {
            HttpHost proxy = new HttpHost(ip, port);
            configBuilder.setProxy(proxy);
        }
        RequestConfig config = configBuilder.build();
        HttpGet httpGet = new HttpGet(url);
        httpGet.setConfig(config);
        // The Host header is not set by hand: HttpClient derives it from the URL
        httpGet.setHeader("Pragma", "no-cache");
        httpGet.setHeader("Connection", "keep-alive");
        httpGet.setHeader("Cache-Control", "no-cache");
        httpGet.setHeader("Upgrade-Insecure-Requests", "1");
        httpGet.setHeader("Accept-Language", "zh-CN,zh;q=0.8");
        httpGet.setHeader("Accept-Encoding", "gzip, deflate");
        httpGet.setHeader("Accept", "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8");
        httpGet.setHeader("User-Agent", "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.97 Safari/537.36");
        // try-with-resources closes the client and the response even when the request fails
        try (CloseableHttpClient httpClient = HttpClients.createDefault();
             CloseableHttpResponse httpResponse = httpClient.execute(httpGet)) {
            // Read the body only when the server answers 200 OK
            if (httpResponse.getStatusLine().getStatusCode() == 200) {
                return EntityUtils.toString(httpResponse.getEntity(), StandardCharsets.UTF_8);
            }
        } catch (IOException e) {
            // Fall through and return an empty body so callers never receive null
        }
        return "";
    }
}
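getIpStore accepts an optional proxy host and port, but free proxies are often dead by the time they are used. One cheap pre-check, sketched below with plain java.net rather than HttpClient, is to see whether a TCP connection to the proxy can be opened within a timeout. The class name ProxyProbe and its method are hypothetical, and a successful connect only shows the port is open, not that the proxy actually forwards traffic.

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Socket;

/** Hypothetical helper: checks whether a proxy endpoint accepts TCP connections. */
class ProxyProbe {

    /**
     * Returns true if a TCP connection to host:port succeeds within timeoutMillis.
     * This is only a liveness hint; it does not verify that the proxy relays requests.
     */
    static boolean isReachable(String host, int port, int timeoutMillis) {
        try (Socket socket = new Socket()) {
            socket.connect(new InetSocketAddress(host, port), timeoutMillis);
            return true;
        } catch (IOException | IllegalArgumentException e) {
            // IOException: refused, timed out or unresolved host
            // IllegalArgumentException: port outside the valid range
            return false;
        }
    }
}
```

A caller could run ProxyProbe.isReachable(ip, port, 3000) before handing the pair to getIpStore and skip entries that fail.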
application.yml configuration file
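spring:
  datasource:
    driver-class-name: com.mysql.cj.jdbc.Driver
    url: jdbc:mysql://127.0.0.1:3306/big-data?characterEncoding=utf-8
    username: root
    password: root
  jpa:
    open-in-view: true
    # The datasource is MySQL, so use a MySQL dialect rather than the H2 one
    database-platform: org.hibernate.dialect.MySQL8Dialect
    # Print the executed SQL statements to the log.
    show-sql: true
    # ddl-auto controls what happens at startup to the tables mapped to entity classes.
    # create is dangerous because it drops the tables and rebuilds them, so never use it in production.
    # Use it at most once in a test environment, to initialize the database schema.
    #   create:      on every run, creates missing tables and empties tables that already hold data
    #   create-drop: the tables are cleared each time the program exits
    #   update:      on every run, creates missing tables; existing data is kept and only updated (recommended)
    #   validate:    checks that entity fields match the database column types and fails on a mismatch
    hibernate.ddl-auto: update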
Frontend display
Tech stack
- vue
- element-ui
- html5
index.html
<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="UTF-8">
<title>Title</title>
<!-- Import the stylesheet -->
<link rel="stylesheet" href="https://unpkg.com/element-ui/lib/theme-chalk/index.css">
</head>
<body>
<div id="app">
<h1>{{ message }}</h1>
Re-crawl
<el-switch
v-model="isRefresh">
</el-switch>
<br>
<!-- <template>-->
<el-table
:data="tableData"
border
style="width: 100%">
<el-table-column
fixed
prop="ipAddress"
label="IP address"
width="150">
</el-table-column>
<el-table-column
prop="port"
label="Port"
width="120">
</el-table-column>
<el-table-column
prop="serverAddress"
label="Server address"
width="120">
</el-table-column>
<el-table-column
prop="speed"
label="Speed"
width="120">
</el-table-column>
<el-table-column
prop="type"
label="Request type"
width="300">
</el-table-column>
<el-table-column
prop="anonymous"
label="Anonymity"
width="120">
</el-table-column>
<el-table-column
label="Actions">
<template slot-scope="scope">
<el-button @click="handleClick(scope.row)" type="text" size="small">View</el-button>
<el-button type="text" size="small">Edit</el-button>
</template>
</el-table-column>
</el-table>
<!-- </template>-->
</div>
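<!-- Development build with helpful console warnings; pinned to Vue 2, which element-ui requires -->
<script src="https://cdn.jsdelivr.net/npm/vue@2/dist/vue.js"></script>
<!-- Import axios and the component library -->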
<script src="https://unpkg.com/axios/dist/axios.min.js"></script>
<script src="https://unpkg.com/element-ui/lib/index.js"></script>
<script type="text/javascript">
var app = new Vue({
el: '#app',
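methods: {
getTableData() {
let _this = this;
// Fetch the proxy list from the backend
axios.get('ip/list')
.then(function (response) {
console.log(response);
_this.tableData = response.data.data
})
.catch(function (error) {
console.log(error);
});
},
// Placeholder handler for the View button: logs the clicked row
handleClick(row) {
console.log(row);
}
},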
created() {
this.getTableData()
},
data: {
message: 'IP proxy pool',
tableData: [],
isRefresh: true
}
});
</script>
</body>
</html>
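The page reads response.data.data, which implies that the ip/list endpoint wraps the list in a JSON object with a data field. The controller is not shown in this post, so the following is only a sketch of what such a response envelope might look like. The class name ApiResult and the code and message fields are assumptions; only the data field is implied by the page.

```java
/** Hypothetical response envelope; the "data" field is the one index.html reads. */
class ApiResult<T> {
    private final int code;        // assumed status code field
    private final String message;  // assumed human-readable message field
    private final T data;          // payload, for example the list of proxies

    private ApiResult(int code, String message, T data) {
        this.code = code;
        this.message = message;
        this.data = data;
    }

    /** Wraps a successful payload. */
    static <T> ApiResult<T> ok(T data) {
        return new ApiResult<>(200, "success", data);
    }

    /** Wraps a failure with no payload. */
    static <T> ApiResult<T> fail(String message) {
        return new ApiResult<>(500, message, null);
    }

    // Public getters so that a JSON serializer can expose the three fields
    public int getCode() { return code; }
    public String getMessage() { return message; }
    public T getData() { return data; }
}
```

A controller method returning ApiResult.ok(ipService.getIpList()) would then serialize to an object whose data field holds the list, which is the shape the axios callback expects.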
Result
The page as it looks after the application starts (screenshot).
TODO
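- Persist the data to the database, so we do not keep hitting the source site (to be implemented)
- Caching, so we do not query the database on every request (to be implemented)
- De-duplicate database records and remove invalid entries (to be implemented)
- Make the page editable and add list querying (to be implemented)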
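As a starting point for the de-duplication item, the sketch below removes repeated proxies in memory before they would be saved, treating two entries as the same proxy when their IP address and port match. The Endpoint class is a simplified stand-in for IPData and, like ProxyDeduplicator itself, is hypothetical; a real implementation would more likely rely on a unique constraint in the database.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical helper: drops repeated proxies from a crawled batch. */
class ProxyDeduplicator {

    /** Minimal stand-in for IPData holding only the fields that identify a proxy. */
    static final class Endpoint {
        final String ipAddress;
        final int port;

        Endpoint(String ipAddress, int port) {
            this.ipAddress = ipAddress;
            this.port = port;
        }

        /** Identity of a proxy: its address and port joined as "ip:port". */
        String key() {
            return ipAddress + ":" + port;
        }
    }

    /** Keeps the first occurrence of each ip:port pair and preserves the crawl order. */
    static List<Endpoint> deduplicate(List<Endpoint> crawled) {
        Map<String, Endpoint> unique = new LinkedHashMap<>();
        for (Endpoint endpoint : crawled) {
            unique.putIfAbsent(endpoint.key(), endpoint);
        }
        return new ArrayList<>(unique.values());
    }
}
```

LinkedHashMap is used so the result keeps the order in which the proxies were crawled.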