Lagou crawler (single-threaded and multithreaded)

Lagou crawler

Crawling method

The first step is to work out, starting from the home page, how the page number appears in the URL.

The pattern is easy to spot: only the page number in the URL changes from one page to the next.
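As a small illustration of this pattern, the sketch below builds the listing URLs for one keyword. The helper name build_page_urls is hypothetical; the URL template is the one used in the crawler code later in this post.

```python
def build_page_urls(keyword, pages):
    """Return the listing URLs for pages 1..pages of one keyword."""
    template = 'https://www.lagou.com/zhaopin/{}/{}/'
    return [template.format(keyword, page) for page in range(1, pages + 1)]


urls = build_page_urls('dianlusheji', 3)
for u in urls:
    print(u)
```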

The following is the main listing page of Lagou.
I won't go into detail about using Inspect Element to find the XPath of each tag.

Things to watch out for

1. The site has anti-crawling measures, and its cookies change between requests
See https://www.cnblogs.com/kuba8/p/10808023.html for how to handle the changing cookies

2. The data contains spaces and line breaks, so it has to be cleaned with strip or replace
The following two idioms work well: the first removes duplicates, the second strips whitespace and drops empty entries

unique = list(set(lists))  # do not name the result "set", which would shadow the built-in
s = [x.strip() for x in list1 if x.strip() != '']
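To make the two idioms concrete, here is a small sketch on made-up sample values (the strings are invented for illustration and are not real crawl output). It also shows dict.fromkeys as an alternative to set() when the original order matters, since set() does not preserve it.

```python
raw = ['  3-5 years / Bachelor ', '\n', '', ' 1-3 years / College\n', '  3-5 years / Bachelor ']

# Strip whitespace and drop entries that are empty after stripping
cleaned = [x.strip() for x in raw if x.strip() != '']

# Remove duplicates; set() does not keep the original order
unique_unordered = list(set(cleaned))

# dict.fromkeys() removes duplicates while keeping first-seen order
unique_ordered = list(dict.fromkeys(cleaned))

print(cleaned)
print(unique_ordered)
```

Deduplicating is only safe for a column on its own. If the lists are later combined row by row, removing duplicates from one of them changes its length and breaks the alignment.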

Key examples

Cleaning the company benefits field

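# Company benefits
welfare = x.xpath('//*[@id="s_position_list"]/ul/li/div[2]/div[2]/text()')
# Drop empty entries and remove the curly quotes around each string (curly quotes assumed here)
welfare = [exp.replace('“', '').replace('”', '') for exp in welfare if exp.strip() != '']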

Building the pandas DataFrame

data = {'names':names, 'direction':dire, 'money':money, 'experience':experience, 'condition':condition,
        'company':company, 'welfare':welfare}
basic_data = pd.DataFrame.from_dict(data = data)
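One thing to keep in mind here: as far as I know, pandas raises a ValueError when the lists in the dictionary have different lengths, which can happen if cleaning drops an entry from one field but not the others. A minimal sketch of a check that could run before building the DataFrame is shown below. The helper name check_equal_lengths and the sample values are hypothetical and not part of the original program.

```python
def check_equal_lengths(data):
    """Raise ValueError if the lists in `data` differ in length."""
    lengths = {key: len(values) for key, values in data.items()}
    if len(set(lengths.values())) > 1:
        raise ValueError('Column lengths differ: {}'.format(lengths))
    return lengths


# Made-up sample values, only to show the call
sample = {'names': ['Engineer A', 'Engineer B'], 'money': ['10k-15k', '8k-12k']}
print(check_equal_lengths(sample))
```

Printing the lengths per field makes it easier to see which XPath expression returned too many or too few items.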

Single-threaded example

The pages are parsed with XPath.

To make the data easier to clean and save, the DataFrame from pandas is used as well.

import requests
import re
from requests.exceptions import  RequestException
from lxml import etree
from queue import Queue
import threading
import pandas as pd 
import time

def get_one_page(url):
    try:
        headers = {'User-Agent':'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.87 Safari/537.36'}
        # Work around Lagou's anti-crawling measures
        s = requests.Session()  # Create a session object
        s.get(url, headers=headers, timeout=3)  # Request the page once with the session to obtain cookies
        cookie = s.cookies  # Cookies obtained from this request
        response = s.post(url, headers=headers, cookies=cookie, timeout=3)  # Request again with those cookies to get the page text
        #response = requests.get(url, headers = headers)
        #response.encoding = response.apparent_encoding
        if response.status_code == 200:
            return response.text
            #return response.content.decode("utf8", "ignore")
        return None
    except RequestException:
        return None

def parse_one_page(html):
    x = etree.HTML(html)
    #Job title
    names = x.xpath('//*[@id="s_position_list"]/ul/li/div[1]/div[1]/div[1]/a/h3/text()')
    dire = x.xpath('//*[@id="s_position_list"]/ul/li/div[1]/div[1]/div[1]/a/span/em/text()')
    money = x.xpath('//*[@id="s_position_list"]/ul/li/div[1]/div[1]/div[2]/div/span/text()')
    experience = x.xpath('//*[@id="s_position_list"]/ul/li/div[1]/div[1]/div[2]/div/text()')
    #Crawler data cleaning 
    experience=[exp.strip() for exp in experience if exp.strip()!='']
    #Company conditions
    condition = x.xpath('//*[@id="s_position_list"]/ul/li/div[1]/div[2]/div[2]/text()')
    condition=[exp.strip() for exp in condition if exp.strip()!='']
    #Corporate name
    company = x.xpath('//*[@id="s_position_list"]/ul/li/div[1]/div[2]/div[1]/a/text()')
    # Company benefits
    welfare = x.xpath('//*[@id="s_position_list"]/ul/li/div[2]/div[2]/text()')
    # Drop empty entries and remove the curly quotes around each string (curly quotes assumed here)
    welfare = [exp.replace('“', '').replace('”', '') for exp in welfare if exp.strip() != '']
    # Storing the fields in a dictionary avoids looping over tuples and reading each field separately; it is one workable approach among others
    data = {'names':names, 'direction':dire, 'money':money, 'experience':experience, 'condition':condition,
            'company':company, 'welfare':welfare}
    basic_data = pd.DataFrame.from_dict(data = data)
    basic_data.to_csv(r'xxx.csv', index=False, mode='a', header=False)

def main(url, j):
    html = get_one_page(url)
    if html is None:
        print('Failed to fetch page', (j+1))
        return
    parse_one_page(html)
    print('Finished page', (j+1))

# The keyword is kept in a variable to prepare for the multithreaded version
i = 'dianlusheji'

if __name__=='__main__':
    for j in range(5):
        url = 'https://www.lagou.com/zhaopin/{}/{}/'.format(i, j+1)
        main(url, j)

Multithreaded example

This version is similar to the single-threaded one, but it uses a queue: each thread takes a keyword from the queue and crawls it, which speeds things up. In the end, about 3000 records took 80 seconds with three threads, and it could probably have been faster.

Only the code blocks that differ are shown below.
Everything else is basically the same as in the single-threaded version.

# Keywords to crawl; these are put into the queue
crawl_list = ['danpianji', 'dianlusheji', 'zidonghua', 'qianrushi', 'yingjian', 'Python']

The run method of the thread class

def run(self):
    # # Task start time
    # start_time = time.time()
    while True:
        if self.page_queue.empty():
            # # Task end time
            # end_time = time.time()
            # # Elapsed time
            # print(end_time - start_time)
            break  # Nothing left in the queue, so this thread stops
        print(self.name, 'About to fetch task from queue')
        # get() removes the item from the queue, so the same keyword is not crawled twice
        page = self.page_queue.get()
        print(self.name, 'Task taken from queue:', page)
        for j in range(30):
            url = 'https://www.lagou.com/zhaopin/{}/{}/'.format(page, j+1)
            main(url, j)
        print(self.name, 'Completed task:', page)
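Since the class definition itself is not shown, here is a minimal, self-contained sketch of how such a worker could be wired up with three threads. The class name CrawlThread, the handler parameter, and fake_crawl are hypothetical stand-ins; fake_crawl only records the keyword and sends no requests. The sketch also uses get_nowait() rather than checking empty() first, which is a swapped-in variation: it avoids a thread blocking forever if another thread takes the last item between the check and the get.

```python
import queue
import threading


class CrawlThread(threading.Thread):
    """Worker that pulls keywords from a shared queue until it is empty."""

    def __init__(self, name, page_queue, handler):
        super().__init__(name=name)
        self.page_queue = page_queue
        self.handler = handler  # stand-in for the real crawl function

    def run(self):
        while True:
            try:
                # Raises queue.Empty instead of blocking when nothing is left
                page = self.page_queue.get_nowait()
            except queue.Empty:
                break
            self.handler(page)


crawl_list = ['danpianji', 'dianlusheji', 'zidonghua', 'qianrushi', 'yingjian', 'Python']
page_queue = queue.Queue()
for keyword in crawl_list:
    page_queue.put(keyword)

done = []
lock = threading.Lock()


def fake_crawl(keyword):
    # Records the keyword instead of sending any network request
    with lock:
        done.append(keyword)


threads = [CrawlThread('thread-{}'.format(n + 1), page_queue, fake_crawl) for n in range(3)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(sorted(done))
```

In the real crawler, the handler would loop over the page numbers for the keyword and call main(url, j) as in the run method above.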

Crawled data

This is the data crawled by the single-threaded version: 5 pages, 75 records.
This is the data crawled by the multithreaded version.
The main problem here is the changing cookies. Once that is solved, the pages can be crawled directly. Of course, this could also be done with selenium.

That's about it. This program is suitable for personal testing, and it also serves as a summary of half a month of learning to write crawlers. Thank you!


Tags: Session network Windows encoding

Posted on Tue, 17 Mar 2020 01:08:14 -0700 by d-woo