I crawled a picture website with a Java crawler


I have been wanting to set up a website recently, but not a technical blog: sites like Blog Park (cnblogs) and CSDN are already enough for recording everyday problems. So what kind of website would be fun?

A picture website seems like a good choice, since such sites hold a lot of pictures (including, of course, xxx pictures...). Haha, really I just had time on my hands. Along the way, this post introduces the basic usage of a Java crawler.

1. First of all, there are two kinds of crawling targets. One is data returned by a dynamic interface (API) request. Here you parse the JSON, or whatever format comes back, to get the data you need.

2. The other is static HTML pages. Here you need to parse the data out of the HTML DOM nodes. Put simply, it works like a jQuery selector: the HTML is parsed into DOM nodes that you can query. Java has a ready-made class library for this (jsoup).

 

Take a look at the website I generated from the pictures I crawled (the code will be open-sourced in the near future; I just built it however I liked)

Original website: https://www.yeitu.com/meinv/ 

Generated site: http://91bt.online/

Note that the generated site is modified from https://github.com/WinterChenS/my-site

 

 

  

The required Maven dependencies are as follows. Search Maven for the version numbers.

    <!-- Web crawling -->
    <!-- HTTP requests -->
    <dependency>
        <groupId>org.apache.httpcomponents</groupId>
        <artifactId>httpclient</artifactId>
        <!-- add a <version> element here -->
    </dependency>

    <!-- HTML parsing -->
    <dependency>
        <groupId>org.jsoup</groupId>
        <artifactId>jsoup</artifactId>
        <!-- add a <version> element here -->
    </dependency>

 

Now for the usage.

① For example, crawl a static picture page: https://www.yeitu.com/meinv/xinggan/20180919_.html

In the browser we would normally use jQuery to get the title of this static page (look at the structure of the DOM nodes and use a selector to get it).

 

 

So how do we crawl this with code?

② Using an HttpClient utility class, write a GET request method that returns the page's HTML content as a string

       
       String url="https://www.yeitu.com/meinv/xinggan/20180919_14722.html";
       HttpGet get = new HttpGet(url);// Setting parameters Builder customReqConf = RequestConfig.custom(); customReqConf.setConnectTimeout(connTimeout); customReqConf.setSocketTimeout(socketTimeout); customReqConf.setConnectionRequestTimeout(requestTimeout); get.setConfig(customReqConf.build()); get.addHeader("Connection", "Close"); HttpResponse res; // implement Http request. if (url.startsWith("https")) { // implement Https request. client = createSSLInsecureClient(); res = client.execute(get); } else { // implement Http request. client = HttpClientUtil.client; res = client.execute(get); } return EntityUtils.toString(res.getEntity(), charset);
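The fragment above relies on parts of the utility class that the post does not show (HttpClientUtil, createSSLInsecureClient, the timeout fields). For readers who just want something that runs, here is a minimal sketch of the same "GET a page, return the HTML as a string" step. Note that it swaps in the JDK's built-in java.net.http.HttpClient (Java 11+) instead of Apache HttpClient, so it is not the author's utility. The class name PageFetcher, the timeout values, and the User-Agent string are assumptions made up for illustration.

```java
import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.charset.StandardCharsets;
import java.time.Duration;

// Hypothetical helper, not part of the original post.
class PageFetcher {

    // One shared client; handles both http and https URLs.
    private static final HttpClient CLIENT = HttpClient.newBuilder()
            .connectTimeout(Duration.ofSeconds(10))      // assumed value
            .followRedirects(HttpClient.Redirect.NORMAL)
            .build();

    // Build the GET request separately so it can be inspected without network access.
    static HttpRequest buildGet(String url) {
        return HttpRequest.newBuilder(URI.create(url))
                .timeout(Duration.ofSeconds(20))         // assumed value
                .header("User-Agent", "Mozilla/5.0 (compatible; demo-crawler)")
                .GET()
                .build();
    }

    // Send the request and return the page body decoded as UTF-8.
    static String fetchHtml(String url) throws IOException, InterruptedException {
        HttpResponse<String> res = CLIENT.send(
                buildGet(url),
                HttpResponse.BodyHandlers.ofString(StandardCharsets.UTF_8));
        if (res.statusCode() != 200) {
            throw new IOException("Unexpected HTTP status " + res.statusCode() + " for " + url);
        }
        return res.body();
    }
}
```

Calling PageFetcher.fetchHtml(url) would then give the HTML string that gets handed to jsoup in the next step. The "Connection: Close" header from the original is left out on purpose, because the JDK client does not allow setting that header.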

③ Then parse the HTML content

    Document documentInner = Jsoup.parse(htmlInner);

    // This is the jQuery selector from the picture:
    // $(".img_box").children("a").children("img").attr("alt");
    // The following is the jsoup equivalent:
    String firstAlt = documentInner.select(".img_box").select("a").select("img").attr("alt");

 

To summarize: you rewrite the jQuery selector in the corresponding jsoup form. In fact, almost nothing changes. It becomes clear once you try it yourself.
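To make the jQuery-to-jsoup mapping concrete, here is a small self-contained sketch that runs against an inline HTML string instead of the live site. The sample markup is invented for illustration and only mimics the .img_box structure mentioned above; the class and method names are hypothetical too. It also shows the selector written as one CSS string, where the > child combinator is closer to jQuery's children() than chained select() calls, which match descendants at any depth.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

// Hypothetical demo class, not part of the original post.
class SelectorDemo {

    // Made-up sample markup that imitates the structure of the picture page.
    static final String SAMPLE_HTML =
            "<div class=\"img_box\">"
          + "<a href=\"/next.html\"><img src=\"/uploads/sample.jpg\" alt=\"Sample title\"></a>"
          + "</div>";

    // jQuery: $(".img_box").children("a").children("img").attr("alt")
    static String firstAlt(String html) {
        Document doc = Jsoup.parse(html);
        // attr() returns "" when nothing matches, so no null check is needed
        return doc.select(".img_box > a > img").attr("alt");
    }

    // Resolve the image's src against the page URL to get an absolute link.
    static String firstImageUrl(String html, String baseUri) {
        Document doc = Jsoup.parse(html, baseUri);
        Element img = doc.selectFirst(".img_box > a > img");
        return img == null ? "" : img.absUrl("src");
    }

    public static void main(String[] args) {
        System.out.println(firstAlt(SAMPLE_HTML));
        System.out.println(firstImageUrl(SAMPLE_HTML, "https://www.yeitu.com/"));
    }
}
```

Passing the page URL as the base URI matters for a picture crawler, since src attributes are often relative and the absolute URL is what you need in order to download the image.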

Tags: Java JQuery Maven JSON

Posted on Fri, 27 Mar 2020 09:51:51 -0700 by aubeasty