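This small vertical crawler is built with the following:
- HttpClient
- Jsoup
- Multithreading
- JDBC
The approach is simple. Starting from the main function, here is a brief walk through the whole run. Step 1: collect the URLs to crawl. For the container I chose ConcurrentLinkedQueue, a non-blocking queue that relies on Unsafe (CAS operations) under the hood; what I want from it is its thread safety.
The main function looks like this: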
static String url = "http://www.qlu.edu.cn/38/list.htm";

// Add the URL tasks: list1.htm through list19.htm
public static ConcurrentLinkedQueue<String> add(ConcurrentLinkedQueue<String> queue) {
    String prefix = StringUtils.substringBefore(url, ".htm");
    for (int i = 1; i <= 19; i++) {
        queue.add(prefix + i + ".htm");
    }
    return queue;
}

public static void main(String[] args) throws IOException {
    ConcurrentLinkedQueue<String> queue = new ConcurrentLinkedQueue<>();
    queue.add(url);
    add(queue);
    // Download and parse with multiple threads
    TPoolForDownLoadRootUrl.downLoadRootTaskPool(queue);
}
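The URL-collection step can also be sketched on its own, using only the JDK. This is a minimal sketch rather than the code I actually ran: the class `ListPageUrls` and the method `buildQueue` are hypothetical names, and it assumes the paging pattern seen above (`list.htm`, then `list1.htm` through `list19.htm`).

```java
import java.util.concurrent.ConcurrentLinkedQueue;

// Hypothetical helper: builds the queue of list-page URLs with plain JDK string handling.
final class ListPageUrls {

    // Assumes firstPage ends with ".htm" and that page i lives at "<prefix><i>.htm".
    static ConcurrentLinkedQueue<String> buildQueue(String firstPage, int extraPages) {
        ConcurrentLinkedQueue<String> queue = new ConcurrentLinkedQueue<>();
        queue.add(firstPage);
        String prefix = firstPage.substring(0, firstPage.length() - ".htm".length());
        for (int i = 1; i <= extraPages; i++) {
            queue.add(prefix + i + ".htm");
        }
        return queue;
    }

    public static void main(String[] args) {
        ConcurrentLinkedQueue<String> queue =
                buildQueue("http://www.qlu.edu.cn/38/list.htm", 19);
        System.out.println(queue.size() + " list pages queued, first: " + queue.peek());
    }
}
```

Passing the page count as a parameter keeps the number 19 out of the loop, so only the call site changes when the site adds pages.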
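Step 2: hand the URL list to a thread pool.
The pool I use is newCachedThreadPool, which creates threads on demand according to the number of submitted tasks.
Inside the pool a few things happen. First, download the source HTML: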
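import java.io.IOException;
import org.apache.http.HttpEntity;
import org.apache.http.client.methods.CloseableHttpResponse;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;
import org.apache.http.util.EntityUtils;
import org.apache.log4j.Logger;

/**
 * Business logic for downloading HTML
 * @Author: Changwu
 * @Date: 2019/3/24 11:13
 */
public class downLoadHtml {
    public static Logger logger = Logger.getLogger(downLoadHtml.class);

    /**
     * Download the page source for the given URL
     * @param url the page to fetch
     * @return the response body decoded as UTF-8
     */
    public static String downLoadHtmlByUrl(String url) throws IOException {
        HttpGet httpGet = new HttpGet(url);
        // Set the request header
        httpGet.setHeader("User-Agent","Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3626.121 Safari/537.36");
        // try-with-resources closes the client and the response even if the request fails
        try (CloseableHttpClient httpClient = HttpClients.createDefault();
             CloseableHttpResponse response = httpClient.execute(httpGet)) {
            logger.info("Request " + url + " returned status code " + response.getStatusLine().getStatusCode());
            HttpEntity entity = response.getEntity();
            return EntityUtils.toString(entity, "utf-8");
        }
    }
}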
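Next, parse the root URL page. The goal is to get the URL of each news article's own page, because that is where the article body lives. The RootBean objects are populated while parsing: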
/**
 * Parse the source HTML, wrap each item in a first-level bean, and return the list
 *
 * @param sourceHtml HTML of a list page
 * @return one RootBean per news item
 */
public static List<RootBean> getRootBeanList(String sourceHtml) {
    LinkedList<RootBean> rootBeanList = new LinkedList<>();
    Document doc = Jsoup.parse(sourceHtml);
    Elements elements = doc.select("#wp_news_w6 ul li");
    String rootUrl = "http://www.qlu.edu.cn";
    // Compile once instead of once per list item
    Pattern compile = Pattern.compile("class=\"news_meta\">.*");
    for (Element element : elements) {
        RootBean rootBean = new RootBean();
        // Get the relative URL; it is joined with rootUrl below
        String href = element.child(0).child(0).attr("href");
        // Get the title text
        String title = element.text();
        String[] split = title.split("\\s+");
        // Populate the bean
        if (split.length >= 2) {
            String s = element.outerHtml();
            Matcher matcher = compile.matcher(s);
            if (matcher.find()) {
                String group = matcher.group(0);
                // 18 is the length of class="news_meta">
                String ss = StringUtils.substring(group, 18);
                ss = StringUtils.substringBefore(ss, "</span> </li>");
                rootBean.setPostTime(ss);
            }
        }
        rootBean.setTitle(split[0]);
        rootBean.setUrl(rootUrl + href);
        rootBeanList.add(rootBean);
    }
    return rootBeanList;
}
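The post-time extraction above depends on the offset 18, which is the length of `class="news_meta">`. A capturing group avoids counting characters by hand. The following is only a sketch of that idea, not what the crawler runs: `NewsMeta` and `postTime` are hypothetical names, the sample HTML is made up, and it assumes the date sits directly inside the `news_meta` span with no nested tags.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for pulling the post time out of one <li> of the list page.
final class NewsMeta {

    // Captures whatever sits between the opening news_meta tag and the next '<'.
    private static final Pattern POST_TIME =
            Pattern.compile("class=\"news_meta\">([^<]*)<");

    static Optional<String> postTime(String liHtml) {
        Matcher m = POST_TIME.matcher(liHtml);
        return m.find() ? Optional.of(m.group(1).trim()) : Optional.empty();
    }

    public static void main(String[] args) {
        // Made-up sample, shaped like the list items described above.
        String li = "<li><span><a href=\"/a.htm\">Title</a></span>"
                + "<span class=\"news_meta\">2019-03-24</span></li>";
        System.out.println(postTime(li).orElse("(no date)"));
    }
}
```

Returning an `Optional` makes the "no date found" case explicit, so the caller has to decide what to store when a list item has no `news_meta` span.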
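The second-level task is handled in a similar way. This is where regular expressions come in. I never studied them properly, so I was completely lost when I needed them today, but I slowly got there. The work here is mostly a matter of inspecting the source HTML and, based on its quirks, using the selectors jsoup provides to select, cut, and splice out the content we want, then wrapping it in a bean.
Why call it a vertical crawler? Because it is only suited to crawling my university's news. Look at the code below: there is no way around the cutting and splicing. The most annoying part is that out of 100 news items, 99 put the title inside one tag, and there is always one that puts it inside a different tag, at which point you have to go back and change the rule you just wrote.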
/**
 * Parse the article page and wrap the second-level task
 *
 * @param htmlSouce HTML of the article page
 * @param bean the first-level bean this article came from
 * @return one PojoBean per .arti_metas element found
 */
public static List<PojoBean> getPojoBeanByHtmlSource(String htmlSouce, RootBean bean) {
    LinkedList<PojoBean> list = new LinkedList<>();
    // Parse
    Document doc = Jsoup.parse(htmlSouce);
    // Body text: the same for every metadata element, so extract it once;
    // first() returns null when the selector matches nothing
    Element bodyElement = doc.select(".wp_articlecontent").first();
    String body = bodyElement == null ? null : bodyElement.text();
    // Metadata line (author, source, editor)
    Elements elements1 = doc.select(".arti_metas");
    for (Element element : elements1) {
        String text = element.text();
        // Editor
        Matcher matcher = Pattern.compile("(责任编辑:.*)").matcher(text);
        String editor = null;
        if (matcher.find()) {
            // Drop the 5-character label
            editor = StringUtils.substring(matcher.group(1), 5);
        }
        // Author
        matcher = Pattern.compile("(作者:.*出处)").matcher(text);
        String author = null;
        if (matcher.find()) {
            author = StringUtils.substring(matcher.group(1), 3);
            author = StringUtils.substringBefore(author, "出处");
        }
        // Source
        matcher = Pattern.compile("(出处:.*责任编辑)").matcher(text);
        String source = null;
        if (matcher.find()) {
            source = StringUtils.substring(matcher.group(1), 3);
            source = StringUtils.substringBefore(source, "责任编辑");
        }
        // Populate a fresh bean per element; reusing one instance would add
        // the same object to the list several times
        PojoBean pojoBean = new PojoBean();
        pojoBean.setAuthor(author);
        pojoBean.setBody(body);
        pojoBean.setEditor(editor);
        pojoBean.setSource(source);
        pojoBean.setUrl(bean.getUrl());
        pojoBean.setPostTime(bean.getPostTime());
        pojoBean.setTitle(bean.getTitle());
        list.add(pojoBean);
    }
    return list;
}
}
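The three label-based extractions above all follow the same pattern: find a label, take the text up to the next label. That can be folded into one helper with a lazy capturing group. This is a sketch under assumptions, not the crawler's code: `MetaFields` and `field` are hypothetical names, and the sample line uses English stand-in labels; for the page above the labels would be 作者:, 出处: and 责任编辑:.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: extracts "label ... nextLabel" fields from a one-line metadata string.
final class MetaFields {

    // Returns the trimmed text between `label` and `nextLabel` (or the end of the
    // text when nextLabel is null or does not occur), or null if `label` is missing.
    static String field(String text, String label, String nextLabel) {
        String stop = nextLabel == null
                ? "$"
                : "(?:" + Pattern.quote(nextLabel) + "|$)";
        // (.*?) is lazy, so it stops at the first place where `stop` can match.
        Matcher m = Pattern.compile(Pattern.quote(label) + "(.*?)" + stop).matcher(text);
        return m.find() ? m.group(1).trim() : null;
    }

    public static void main(String[] args) {
        // Made-up sample with English labels standing in for the page's own labels.
        String meta = "Author:Ann Source:Press Office Editor:Bob";
        System.out.println(field(meta, "Author:", "Source:"));
        System.out.println(field(meta, "Source:", "Editor:"));
        System.out.println(field(meta, "Editor:", null));
    }
}
```

`Pattern.quote` keeps any regex metacharacters in a label from being interpreted, and the `|$` alternative means a field is still returned when the following label is absent from a page.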
Persistence uses plain low-level JDBC:
/**
 * Persist a single pojo
 * @param pojo the article to insert
 */
public static void insertOnePojo(PojoBean pojo) throws ClassNotFoundException, SQLException {
    // Register the driver
    Class.forName("com.mysql.jdbc.Driver");
    String sql = "insert into qluspider (title,url,post_time,insert_time,author,source,editor,body) values (?,?,?,?,?,?,?,?)";
    // try-with-resources closes the statement and the connection even if the insert fails
    try (Connection connection = DriverManager.getConnection("jdbc:mysql://localhost:3306/spider", "root", "root");
         PreparedStatement ps = connection.prepareStatement(sql)) {
        // Fill in the SQL parameters
        ps.setString(1, pojo.getTitle());
        ps.setString(2, pojo.getUrl());
        // Convert the date string to a timestamp
        ps.setTimestamp(3, new java.sql.Timestamp(SpiderUtil.stringToDate(pojo.getPostTime()).getTime()));
        ps.setTimestamp(4, new java.sql.Timestamp(new Date().getTime()));
        ps.setString(5, pojo.getAuthor());
        ps.setString(6, pojo.getSource());
        ps.setString(7, pojo.getEditor());
        ps.setString(8, pojo.getBody());
        ps.executeUpdate();
    }
}
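`SpiderUtil.stringToDate` is not shown in this post. As a rough idea of what such a conversion can look like with `java.time`, here is a sketch. It is not the actual `SpiderUtil` code: the class `PostTimes` and the method `toTimestamp` are hypothetical, and it assumes the list pages print dates as `yyyy-MM-dd`, which I have not verified for every item.

```java
import java.sql.Timestamp;
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;

// Hypothetical stand-in for the date conversion used before the insert.
final class PostTimes {

    // Assumption: dates look like "2019-03-24" (ISO local date).
    static Timestamp toTimestamp(String postTime) {
        if (postTime == null || postTime.trim().isEmpty()) {
            return null; // no date was scraped for this item
        }
        LocalDate date = LocalDate.parse(postTime.trim(), DateTimeFormatter.ISO_LOCAL_DATE);
        // Midnight of that day in the JVM's default time zone.
        return Timestamp.valueOf(date.atStartOfDay());
    }

    public static void main(String[] args) {
        System.out.println(toTimestamp("2019-03-24"));
    }
}
```

When the helper returns null, the caller would probably want `ps.setNull(3, java.sql.Types.TIMESTAMP)` instead of `setTimestamp`, since a list item without a date would otherwise fail on the conversion.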
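The new URLs extracted from the list pages are what I call second-level tasks. The thread pool code that drives both levels: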
public static Logger logger = Logger.getLogger(TPoolForDownLoadRootUrl.class);

/**
 * Thread pool that downloads and parses the root URLs
 */
public static void downLoadRootTaskPool(ConcurrentLinkedQueue<String> queue) {
    ExecutorService executor = Executors.newCachedThreadPool();
    //ExecutorService executor = Executors.newFixedThreadPool(5);
    // Read the size once: worker threads poll the queue while this loop runs,
    // so re-checking queue.size() on every iteration would submit too few tasks
    int taskCount = queue.size();
    for (int i = 1; i <= taskCount; i++) {
        executor.execute(new Runnable() {
            @Override
            public void run() {
                try {
                    logger.info("Pool 1 started; about to download and parse a root task");
                    // Take a root task URL
                    String url = queue.poll();
                    logger.info("Root URL == " + url);
                    if (StringUtils.isNotBlank(url)) {
                        // Download the root HTML for this URL
                        String sourceHtml = downLoadHtml.downLoadHtmlByUrl(url);
                        // Parse every RootBean out of the root HTML
                        List<RootBean> rootBeanList = parseHtmlByJsoup.getRootBeanList(sourceHtml);
                        // Second-level tasks start here
                        for (RootBean rootBean : rootBeanList) {
                            logger.info(this + " entering second-level task");
                            String subUrl = rootBean.getUrl();
                            // Download the second-level HTML
                            String htmlSouce = downLoadHtml.downLoadHtmlByUrl(subUrl);
                            // Parse and wrap
                            List<PojoBean> pojoList = parseHtmlByJsoup.getPojoBeanByHtmlSource(htmlSouce, rootBean);
                            // Persist
                            logger.info(this + " about to persist the second-level task from " + subUrl);
                            Persistence.insertPojoListToDB(pojoList);
                            logger.info("Persistence finished.......");
                        }
                    }
                } catch (IOException e) {
                    logger.error("Download failed", e);
                }
            }
        });
    }
    // Stop accepting new tasks; tasks already submitted still run to completion
    executor.shutdown();
}
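An alternative to submitting one task per URL is to start a fixed number of workers that each keep polling the queue until it is empty. This bounds the number of concurrent requests against the site. The sketch below only illustrates that pattern and is not the crawler's code: `QueueDrainer`, `drain`, and the `handler` callback are hypothetical names, and the 10-minute timeout is an arbitrary choice.

```java
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Consumer;

// Hypothetical helper: drains a URL queue with a bounded number of worker threads.
final class QueueDrainer {

    // Returns how many URLs were handled without an exception.
    static int drain(ConcurrentLinkedQueue<String> queue, int workers, Consumer<String> handler) {
        ExecutorService pool = Executors.newFixedThreadPool(workers);
        AtomicInteger handled = new AtomicInteger();
        for (int i = 0; i < workers; i++) {
            pool.execute(() -> {
                String url;
                // poll() returns null once the queue is empty, which ends this worker.
                while ((url = queue.poll()) != null) {
                    try {
                        handler.accept(url); // download, parse, persist would go here
                        handled.incrementAndGet();
                    } catch (RuntimeException e) {
                        System.err.println("failed: " + url + " (" + e + ")");
                    }
                }
            });
        }
        pool.shutdown(); // no new tasks; the workers finish what is queued
        try {
            if (!pool.awaitTermination(10, TimeUnit.MINUTES)) {
                pool.shutdownNow();
            }
        } catch (InterruptedException e) {
            pool.shutdownNow();
            Thread.currentThread().interrupt();
        }
        return handled.get();
    }

    public static void main(String[] args) {
        ConcurrentLinkedQueue<String> queue = new ConcurrentLinkedQueue<>();
        for (int i = 0; i < 20; i++) {
            queue.add("page-" + i);
        }
        int done = drain(queue, 4, url -> System.out.println(Thread.currentThread().getName() + " " + url));
        System.out.println("handled " + done);
    }
}
```

Because each worker calls `poll()` itself, no URL is taken twice and no size bookkeeping is needed. Catching the exception per URL also means one bad page does not stop the worker from continuing with the rest of the queue.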