curl crawler in Java: fetching web page data with a Java crawler

Reposted

墨染青丝 2023-07-21 20:13:22


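Implementing a Web Crawler in Java

HttpClient

Introduction to crawlers
The crawling stages
Sending a GET request with HttpClient
Sending a POST request with HttpClient
HttpClient connection pool
Fetching HTTPS pages with HttpClient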

HttpClient

Introduction to crawlers

1. What is a crawler
A crawler is a program that fetches data from the internet and saves it locally.

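The crawling process:

1) Use a program to simulate a browser.
2) Send a request to the server.
3) The server responds with HTML.
4) Parse the useful data out of the page.
5) Parse the link addresses in the page.
6) Add the link addresses to the URL queue.
7) The crawler takes the next URL from the queue and goes back to step 2.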
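
The loop described above can be sketched with a queue and a set of already-seen URLs. This is a minimal illustration rather than a real crawler: the class name CrawlLoopSketch, the fakeWeb map that stands in for the network, and the regex-based link extraction are all assumptions made for the example.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Queue;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class CrawlLoopSketch {
    // Naive href extractor for illustration only; a real crawler would use an HTML parser.
    private static final Pattern HREF = Pattern.compile("href=\"([^\"]+)\"");

    // Hypothetical stand-in for the network: maps a URL to the HTML a server would return.
    public static Map<String, String> fakeWeb() {
        Map<String, String> web = new HashMap<>();
        web.put("http://example.com/a",
                "<a href=\"http://example.com/b\">b</a><a href=\"http://example.com/c\">c</a>");
        web.put("http://example.com/b", "<a href=\"http://example.com/a\">back to a</a>");
        web.put("http://example.com/c", "<p>no links here</p>");
        return web;
    }

    // Parse the link addresses out of a page.
    public static List<String> extractLinks(String html) {
        List<String> links = new ArrayList<>();
        Matcher matcher = HREF.matcher(html);
        while (matcher.find()) {
            links.add(matcher.group(1));
        }
        return links;
    }

    // Crawl breadth-first from the seed URL and return the URLs in visit order.
    public static List<String> crawl(String seed, Map<String, String> web) {
        Queue<String> queue = new ArrayDeque<>();
        Set<String> seen = new HashSet<>();
        List<String> visited = new ArrayList<>();
        queue.add(seed);
        seen.add(seed);
        while (!queue.isEmpty()) {
            String url = queue.poll();      // take the next URL from the queue
            String html = web.get(url);     // stands in for "send request, receive HTML"
            if (html == null) {
                continue;                   // nothing was returned for this URL
            }
            visited.add(url);
            for (String link : extractLinks(html)) {
                if (seen.add(link)) {       // enqueue only URLs not seen before
                    queue.add(link);
                }
            }
        }
        return visited;
    }

    public static void main(String[] args) {
        System.out.println(crawl("http://example.com/a", fakeWeb()));
    }
}
```

The seen set is what keeps the loop from running forever when pages link back to each other, as pages a and b do in the sample data.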

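The crawling stages

2. The crawling stages

1) Fetch the page.
   You can send requests with the URLConnection class provided by the Java API.
   The HttpClient toolkit is recommended instead. It is an open-source Apache project and can simulate a browser.
2) Parse the page.
   Use the Jsoup toolkit.
   It lets you parse HTML in much the same way as jQuery.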
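
For comparison with HttpClient, here is a minimal sketch of fetching a page with the JDK's own URLConnection. The class name UrlConnectionFetchSketch, the timeout values, the shortened User-Agent string, and the assumption that the page is UTF-8 encoded are illustrative choices, not part of the original article.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.URL;
import java.net.URLConnection;
import java.nio.charset.StandardCharsets;

public class UrlConnectionFetchSketch {

    // Fetch the content behind the given URL and return it as text.
    public static String fetch(String address) throws IOException {
        URLConnection conn = new URL(address).openConnection();
        // Assumed values: a short browser-like User-Agent and 5-second timeouts.
        conn.setRequestProperty("User-Agent", "Mozilla/5.0");
        conn.setConnectTimeout(5000);
        conn.setReadTimeout(5000);

        StringBuilder body = new StringBuilder();
        // Assumes the page is UTF-8; a fuller version would inspect the Content-Type header.
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(conn.getInputStream(), StandardCharsets.UTF_8))) {
            String line;
            while ((line = reader.readLine()) != null) {
                body.append(line).append('\n');
            }
        }
        return body.toString();
    }

    public static void main(String[] args) throws IOException {
        // Falls back to the URL used in the GET example of this article.
        String url = args.length > 0 ? args[0] : "http://yun.itheima.com/search?keys=Java";
        System.out.println(fetch(url));
    }
}
```

This works for simple pages, but cookies, redirect policy, connection pooling and form handling all have to be managed by hand, which is one reason a dedicated client such as HttpClient is usually more convenient.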

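Sending a GET request with HttpClient

Steps:
1) Create an HttpClient object. Use CloseableHttpClient, created through the HttpClients utility class.
2) Create an HttpGet object, which wraps the request URL.
3) Execute the request with the HttpClient.
4) Receive the server's response, which contains:
   the response headers
   the response body (HTML)
5) Close the connection.

1. Add the dependencies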

<dependencies>
    <!-- HttpClient -->
    <dependency>
        <groupId>org.apache.httpcomponents</groupId>
        <artifactId>httpclient</artifactId>
        <version>4.5.3</version>
    </dependency>
    <dependency>
        <groupId>junit</groupId>
        <artifactId>junit</artifactId>
        <version>4.12</version>
    </dependency>
    <!-- Logging -->
    <dependency>
        <groupId>org.slf4j</groupId>
        <artifactId>slf4j-log4j12</artifactId>
        <version>1.7.25</version>
    </dependency>
</dependencies>

2. Send a GET request with HttpClient

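import org.apache.http.HttpEntity;
import org.apache.http.StatusLine;
import org.apache.http.client.methods.CloseableHttpResponse;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;
import org.apache.http.util.EntityUtils;
import org.junit.Test;

public class HttpClientTest {
    @Test
    public void testGet() throws Exception {
        // 1. Create the client (the equivalent of opening a browser)
        CloseableHttpClient httpClient = HttpClients.createDefault();
        // 2. Set the URL to visit
        HttpGet get = new HttpGet("http://yun.itheima.com/search?keys=Java");
        // 3. Send the request and obtain the response
        CloseableHttpResponse response = httpClient.execute(get);
        // 4. Read the status line (protocol version, status code, reason phrase)
        StatusLine statusLine = response.getStatusLine();
        System.out.println(statusLine);
        // 5. Read the status code
        int statusCode = statusLine.getStatusCode();
        System.out.println(statusCode);
        // 6. Read the HTML body; UTF-8 is the fallback when the response declares no charset
        HttpEntity entity = response.getEntity();
        String html = EntityUtils.toString(entity, "UTF-8");
        System.out.println(html);
        // 7. Close the response and the client
        response.close();
        httpClient.close();
    }
}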

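Sending a POST request with HttpClient

Steps:
1) Create an HttpClient object.
2) Create an HttpPost object, which wraps a URL.
3) If there are parameters, wrap them in a form.
4) Execute the request with the HttpClient.
5) Receive the HTML returned by the server.
6) Close the connection.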

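    // Intended as another method of a test class such as HttpClientTest. In addition to the
    // imports of the GET example it needs java.util.ArrayList, java.util.List,
    // org.apache.http.NameValuePair, org.apache.http.client.entity.UrlEncodedFormEntity,
    // org.apache.http.client.methods.HttpPost and org.apache.http.message.BasicNameValuePair.
    @Test
    public void testPost() throws Exception {
        // 1. Create the HttpClient object
        CloseableHttpClient httpClient = HttpClients.createDefault();
        // 2. Create the POST object
        HttpPost post = new HttpPost("http://bbs.itheima.com/search.php");
        // 3. Wrap the parameters in a form
        List<NameValuePair> form = new ArrayList<>();
        form.add(new BasicNameValuePair("mod", "forum"));
        form.add(new BasicNameValuePair("searchid", "50"));
        form.add(new BasicNameValuePair("orderby", "lastpost"));
        form.add(new BasicNameValuePair("ascdesc", "desc"));
        form.add(new BasicNameValuePair("kw", "java"));
        // Encode the form as UTF-8 explicitly; without a charset the entity falls back to ISO-8859-1
        UrlEncodedFormEntity entity = new UrlEncodedFormEntity(form, "UTF-8");
        post.setEntity(entity);
        // 4. Send the request
        CloseableHttpResponse response = httpClient.execute(post);
        // 5. Receive the server's response
        HttpEntity resultEntity = response.getEntity();
        String html = EntityUtils.toString(resultEntity, "UTF-8");
        System.out.println(html);
        // 6. Close the response and the client
        response.close();
        httpClient.close();
    }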
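
To make it clearer what the form entity sends, the sketch below builds an application/x-www-form-urlencoded body by hand using only the JDK. It approximates what UrlEncodedFormEntity produces for the parameters above; the class name FormBodySketch and its encode helper are hypothetical and exist only for this illustration.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;
import java.util.LinkedHashMap;
import java.util.Map;

public class FormBodySketch {

    // Join the parameters as name=value pairs separated by '&', percent-encoding each part.
    public static String encode(Map<String, String> params) {
        StringBuilder body = new StringBuilder();
        for (Map.Entry<String, String> entry : params.entrySet()) {
            if (body.length() > 0) {
                body.append('&');
            }
            body.append(encodeComponent(entry.getKey()))
                .append('=')
                .append(encodeComponent(entry.getValue()));
        }
        return body.toString();
    }

    // URLEncoder applies form encoding: spaces become '+', reserved characters become %XX.
    private static String encodeComponent(String value) {
        try {
            return URLEncoder.encode(value, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // Every Java platform is required to support UTF-8.
            throw new IllegalStateException("UTF-8 is not available", e);
        }
    }

    public static void main(String[] args) {
        // Same parameters as the POST example; LinkedHashMap keeps insertion order.
        Map<String, String> form = new LinkedHashMap<>();
        form.put("mod", "forum");
        form.put("searchid", "50");
        form.put("orderby", "lastpost");
        form.put("ascdesc", "desc");
        form.put("kw", "java");
        System.out.println(encode(form));
    }
}
```

For the parameters of the POST example the body is a single line of name=value pairs joined by `&`, sent with the Content-Type application/x-www-form-urlencoded.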

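HttpClient connection pool

Steps:
1) Create a connection pool object. It should be a singleton within the application.
2) Use the HttpClients utility class to set the connection pool, and create the HttpClient object from that pool.
3) Send requests with the HttpClient.
4) Receive the data returned by the server.
5) Close the Response object. The HttpClient object does not need to be closed.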
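
The steps above could look roughly like this with HttpClient 4.5. It is a sketch under stated assumptions: the class name PooledClientSketch and the per-route limit of 20 are illustrative, and the total limit of 200 simply mirrors the value used in the HttpsUtils class later in this article.

```java
import org.apache.http.client.methods.CloseableHttpResponse;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;
import org.apache.http.impl.conn.PoolingHttpClientConnectionManager;
import org.apache.http.util.EntityUtils;

public class PooledClientSketch {

    // 1) A single connection pool shared by the whole application.
    public static final PoolingHttpClientConnectionManager POOL = createPool();

    // 2) A client built on top of the pool; it is reused for every request.
    public static final CloseableHttpClient CLIENT =
            HttpClients.custom().setConnectionManager(POOL).build();

    private static PoolingHttpClientConnectionManager createPool() {
        PoolingHttpClientConnectionManager pool = new PoolingHttpClientConnectionManager();
        pool.setMaxTotal(200);           // maximum connections in the pool
        pool.setDefaultMaxPerRoute(20);  // assumed limit per target host
        return pool;
    }

    // 3) and 4) Send a GET request and read the response body.
    public static String fetch(String url) throws Exception {
        // 5) Only the response is closed, which returns the connection to the pool.
        //    The client stays open so later requests can reuse pooled connections.
        try (CloseableHttpResponse response = CLIENT.execute(new HttpGet(url))) {
            return EntityUtils.toString(response.getEntity(), "UTF-8");
        }
    }

    public static void main(String[] args) throws Exception {
        // URL taken from the GET example of this article.
        System.out.println(fetch("http://yun.itheima.com/search?keys=Java"));
    }
}
```

Reading the entity completely before closing the response, as EntityUtils.toString does here, is what lets the underlying connection be reused instead of discarded.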

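Fetching HTTPS pages with HttpClient

In practice, a page such as the JD.com home page can be fetched, but the product data cannot.

First, add an HttpsUtils utility class: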

import org.apache.http.config.Registry;
import org.apache.http.config.RegistryBuilder;
import org.apache.http.conn.socket.ConnectionSocketFactory;
import org.apache.http.conn.socket.PlainConnectionSocketFactory;
import org.apache.http.conn.ssl.NoopHostnameVerifier;
import org.apache.http.conn.ssl.SSLConnectionSocketFactory;
import org.apache.http.conn.ssl.TrustStrategy;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;
import org.apache.http.impl.conn.PoolingHttpClientConnectionManager;
import org.apache.http.ssl.SSLContextBuilder;
import java.security.cert.CertificateException;
import java.security.cert.X509Certificate;

public class HttpsUtils {
    private static final String HTTP = "http";
    private static final String HTTPS = "https";
    private static SSLConnectionSocketFactory sslsf = null;
    private static PoolingHttpClientConnectionManager cm = null;
    private static SSLContextBuilder builder = null;
    static {
        try {
            builder = new SSLContextBuilder();
            // Trust every certificate without verifying the server's identity.
            // Warning: this removes TLS authentication, so use it only for crawling public pages.
            builder.loadTrustMaterial(null, new TrustStrategy() {
                @Override
                public boolean isTrusted(X509Certificate[] x509Certificates, String s) throws CertificateException {
                    return true;
                }
            });
            // NoopHostnameVerifier also switches off hostname verification.
            // Legacy protocols such as SSLv3 may be disabled by the JDK's security settings.
            sslsf = new SSLConnectionSocketFactory(builder.build(), new String[]{"SSLv2Hello", "SSLv3", "TLSv1", "TLSv1.2"}, null, NoopHostnameVerifier.INSTANCE);
            Registry<ConnectionSocketFactory> registry = RegistryBuilder.<ConnectionSocketFactory>create()
                    .register(HTTP, new PlainConnectionSocketFactory())
                    .register(HTTPS, sslsf)
                    .build();
            cm = new PoolingHttpClientConnectionManager(registry);
            cm.setMaxTotal(200); // maximum number of connections in the pool
        } catch (Exception e) {
            // Fail fast rather than continue with a null connection manager.
            throw new ExceptionInInitializerError(e);
        }
    }

    public static CloseableHttpClient getHttpClient() throws Exception {
        // The connection manager already carries the HTTPS socket factory through its registry.
        // It is marked as shared so that closing one client does not shut down the pool.
        CloseableHttpClient httpClient = HttpClients.custom()
                .setSSLSocketFactory(sslsf)
                .setConnectionManager(cm)
                .setConnectionManagerShared(true)
                .build();
        return httpClient;
    }

}

With this utility class you can crawl data from HTTPS pages. Note that you also need to add a User-Agent request header.

    @Test
    public void testHttps() throws Exception {
        // Create the HttpClient object
        CloseableHttpClient httpClient = HttpsUtils.getHttpClient();
        // Create the GET object
        HttpGet httpGet = new HttpGet("https://search.jd.com/Search?keyword=%E7%94%B5%E8%84%91&enc=utf-8&pvid=b1deb5e2163141b8bebbb6c0505a4fca");
        // Present a browser User-Agent
        httpGet.setHeader("User-Agent","Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.100 Safari/537.36");
        // Execute the request
        CloseableHttpResponse response = httpClient.execute(httpGet);
        // Receive the result
        HttpEntity entity = response.getEntity();
        String html = EntityUtils.toString(entity,"utf-8");
        // Print the result
        System.out.println(html);
        // Close the response; the pooled client stays open
        response.close();
    }
