Python Web Crawler Getting Started (crawl the last 7 days of weather and the highest/lowest temperatures)


Preface: the text and pictures in this article come from the Internet and are for study and communication only, with no commercial use. Copyright belongs to the original author; if you have any questions, please contact us promptly.
I learned Python over the last two days and wrote my own web-crawler example.

python version: 3.5
IDE : pycharm 5.0.4 
Packages can be installed from within PyCharm:
File -> Default Settings -> Default Project -> Project Interpreter
Select the Python version, then click the plus sign on the right to install the packages you need.

The site I chose is the Suzhou page of China Weather Network. I am going to grab the last 7 days' weather and the highest/lowest temperatures.
http://www.weather.com.cn/weather/101190401.shtml 

At the beginning of the program we added:

# coding: utf-8

This tells the interpreter that the .py file is UTF-8 encoded, so the source program can contain Chinese. (Note the exact form `# coding: utf-8` — a space before the colon makes the declaration invalid.)
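For example, with the declaration in place the source file can hold Chinese literals directly (in Python 3 the default source encoding is already UTF-8, so the line mainly matters for Python 2 or unusual editors):

```python
# coding: utf-8
# With this declaration the interpreter reads the source file as UTF-8,
# so Chinese strings and comments are valid in the .py file.
city = '苏州'  # Suzhou, the city whose forecast this crawler fetches
print(city)
```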

Package to reference:

import requests
import csv
import random
import time
import socket
import http.client
# import urllib.request
from bs4 import BeautifulSoup

requests: grabs the HTML source of a web page
csv: writes data to a CSV file
random: picks random numbers
time: time-related operations
socket and http.client: used here only for exception handling
BeautifulSoup: used instead of regular expressions to retrieve the contents of the corresponding tags in the source
urllib.request: another way to grab a page's HTML, but not as convenient as requests (it is what I started with)
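As a rough sketch of the urllib.request alternative mentioned above (assuming the same header dict as below; it is dependency-free but needs more boilerplate than requests):

```python
import urllib.request

def get_content_urllib(url, header, timeout=10):
    # Build a Request carrying the browser-like headers, then read and
    # decode the response body; requests does all of this in one call.
    req = urllib.request.Request(url, headers=header)
    with urllib.request.urlopen(req, timeout=timeout) as response:
        return response.read().decode('utf-8', errors='ignore')
```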

Get the html code from the web page:

def get_content(url, data=None):
    header={
        'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
        'Accept-Encoding': 'gzip, deflate, sdch',
        'Accept-Language': 'zh-CN,zh;q=0.8',
        'Connection': 'keep-alive',
        'User-Agent': 'Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/43.0.235'
    }
    timeout = random.choice(range(80, 180))
    while True:
        try:
            rep = requests.get(url, headers=header, timeout=timeout)
            rep.encoding = 'utf-8'
            # req = urllib.request.Request(url, data, header)
            # response = urllib.request.urlopen(req, timeout=timeout)
            # html1 = response.read().decode('UTF-8', errors='ignore')
            # response.close()
            break
        # except urllib.request.HTTPError as e:
        #         print( '1:', e)
        #         time.sleep(random.choice(range(5, 10)))
        #
        # except urllib.request.URLError as e:
        #     print( '2:', e)
        #     time.sleep(random.choice(range(5, 10)))
        except socket.timeout as e:
            print( '3:', e)
            time.sleep(random.choice(range(8,15)))

        except socket.error as e:
            print( '4:', e)
            time.sleep(random.choice(range(20, 60)))

        except http.client.BadStatusLine as e:
            print( '5:', e)
            time.sleep(random.choice(range(30, 80)))

        except http.client.IncompleteRead as e:
            print( '6:', e)
            time.sleep(random.choice(range(5, 15)))

    return rep.text
    # return html_text

header is a parameter of requests.get used to simulate browser access.
The header can be obtained with Chrome's developer tools, as follows:
Open Chrome, press F12, and select Network.

Visit the website again, find the first network request, and view its headers.

timeout is the request timeout; a random value is chosen so that the site is less likely to identify the client as a crawler.
Then requests.get fetches the page source, and rep.encoding = 'utf-8' changes the encoding of the source to UTF-8 (so the Chinese parts are not garbled).
After the exception handling, the function returns rep.text.
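The retry loop inside get_content can be reduced to a small reusable pattern. A minimal sketch (fetch and the exception tuple are placeholders; the article's version sleeps 5-80 seconds depending on the error type):

```python
import random
import time

def fetch_with_retry(fetch, retriable=(TimeoutError, ConnectionError), max_tries=5):
    # Keep calling fetch() until it succeeds; after each retriable failure,
    # sleep a short random interval so requests do not look machine-timed.
    for attempt in range(max_tries):
        try:
            return fetch()
        except retriable as e:
            print(attempt + 1, ':', e)
            time.sleep(random.uniform(0.0, 0.2))  # article uses seconds-long sleeps
    raise RuntimeError('all retries failed')
```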

Get the fields we need in html:
Here we mainly use BeautifulSoup
BeautifulSoup Document http://www.crummy.com/software/BeautifulSoup/bs4/doc/

First, use the developer tools to look at the page source and find the location of the required fields.
The fields we need are all inside the ul of the div with id="7d": the date is in each li's h1, the weather conditions are in each li's first p tag, and the highest and lowest temperatures are in each li's span and i tags.
Thanks to Joey_Ko for pointing out that in the evening there is no highest temperature, so a judgement has to be made.
The code is as follows:

def get_data(html_text):
    final = []
    bs = BeautifulSoup(html_text, "html.parser")  # Create BeautifulSoup object
    body = bs.body # Get body part
    data = body.find('div', {'id': '7d'})  # Found div with id of 7d
    ul = data.find('ul')  # Get ul part
    li = ul.find_all('li')  # Get all the li

    for day in li: # Traverse through the contents of each li tag
        temp = []
        date = day.find('h1').string  # Find Date
        temp.append(date)  # Add to temp
        inf = day.find_all('p')  # Find all p tags in li
        temp.append(inf[0].string)  # Add the content of the first p tag (weather conditions) to temp
        if inf[1].find('span') is None:
            temperature_highest = None # The weather forecast may not have the highest temperature of the day (that's the case in the evening), so you need to add a judgment to output the lowest temperature
        else:
            temperature_highest = inf[1].find('span').string  # Find the highest temperature
            temperature_highest = temperature_highest.replace('℃', '')  # In the evening the site changes and a ℃ symbol follows the highest temperature; remove it
        temperature_lowest = inf[1].find('i').string  # Find the lowest temperature
        temperature_lowest = temperature_lowest.replace('℃', '')  # The lowest temperature is followed by a ℃ symbol; remove it
        temp.append(temperature_highest)   # Add maximum temperature to temp
        temp.append(temperature_lowest)   # Add minimum temperature to temp
        final.append(temp)   # Add temp to final

    return final
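To see the parsing logic in isolation, the same BeautifulSoup calls can be run against a hand-written HTML fragment that mimics the structure of the 7-day div (the markup below is a simplified mock, not the real page):

```python
from bs4 import BeautifulSoup

# Simplified mock of the forecast markup: the second li, like the real
# page in the evening, has no span (no highest temperature).
mock_html = '''
<body><div id="7d"><ul>
  <li><h1>28 (today)</h1><p>Sunny</p><p><span>10</span>/<i>2℃</i></p></li>
  <li><h1>29 (tomorrow)</h1><p>Cloudy</p><p><i>3℃</i></p></li>
</ul></div></body>
'''

bs = BeautifulSoup(mock_html, 'html.parser')
rows = []
for day in bs.body.find('div', {'id': '7d'}).find('ul').find_all('li'):
    date = day.find('h1').string
    inf = day.find_all('p')
    span = inf[1].find('span')
    high = span.string if span is not None else None  # evening pages omit the high
    low = inf[1].find('i').string.replace('℃', '')
    rows.append([date, inf[0].string, high, low])
print(rows)
```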

Write to a CSV file:
After the data is fetched, we write it to a file with the following code:

def write_data(data, name):
    # newline='' prevents csv from inserting blank lines between rows on Windows
    with open(name, 'a', errors='ignore', newline='') as f:
        f_csv = csv.writer(f)
        f_csv.writerows(data)
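A quick round-trip check of the CSV step (self-contained, using a temporary directory so nothing in the working directory is touched):

```python
import csv
import os
import tempfile

def write_data(data, name):
    # Append rows to a CSV file; newline='' stops extra blank lines on Windows.
    with open(name, 'a', errors='ignore', newline='') as f:
        csv.writer(f).writerows(data)

tmp = os.path.join(tempfile.mkdtemp(), 'weather.csv')
write_data([['28 (today)', 'Sunny', '10', '2']], tmp)
with open(tmp, newline='') as f:
    rows = list(csv.reader(f))
print(rows)
```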

Main function:

if __name__ == '__main__':
    url = 'http://www.weather.com.cn/weather/101190401.shtml'
    html = get_content(url)
    result = get_data(html)
    write_data(result, 'weather.csv')

Then run it:
The generated weather.csv file is as follows:

To summarize, there are roughly 3 steps to grab content from a web page:
1. Simulate browser access to get html source code
2. Get the contents of the specified tags (here via BeautifulSoup; regular expressions also work)
3. Write the obtained content to a file

I am new to Python crawlers, so there may be some mistakes in my understanding; criticism and corrections are welcome. Thank you!


Posted on Thu, 28 Nov 2019 17:47:10 -0800 by dessolator