Python crawler framework Scrapy: crawling Douban Top 250

Scrapy

The official introduction is

An open source and collaborative framework for extracting the data you need from websites.

In a fast, simple, yet extensible way.

In other words:

An open-source, collaborative framework for extracting the required data from websites in a fast, simple, and extensible way.

 

Environmental preparation

The environment and tools used in this project are as follows.

  • python3
  • scrapy
  • mongodb

The installation of Python 3 and Scrapy is not covered here.

MongoDB is a NoSQL (non-relational) database used here to store the crawled data. Official download address: https://www.mongodb.com/download-center/community?jmp=docs

For a MongoDB graphical management tool, NoSQL Manager is recommended.

Project creation

That's right, let's again pick the easy target and crawl the simplest one: Douban Movie Top 250.

This site is one that almost everyone who learns web scraping crawls at some point, and it is very representative, so without further ado, let's start the project.

Creating a Scrapy project needs to be done on the command line.

Switch to the working directory and enter the command scrapy startproject douban

Once the project is created successfully, open it with PyCharm and take a look at the directory structure.

We find that the spiders folder of the project, where the crawler is supposed to live, contains only a single __init__.py file. Where is the crawler?

Don't worry: we still need to run one more command to generate a basic crawler. Open cmd and switch to the spiders directory inside the project folder.

Enter scrapy genspider douban_spider https://movie.douban.com/top250

The crawler is created successfully, as shown below.

Now open the project and analyze the directory structure:

douban/                      project folder
    spiders/                 crawler folder
        __init__.py
        douban_spider.py     crawler file
    __init__.py
    items.py                 defines the data structure of the items (the fields of the content we crawl)
    middlewares.py           middleware
    pipelines.py             defines how items are processed (data cleaning, storage, etc.; must be enabled in settings)
    settings.py              project settings file with global settings (user agent, concurrency, download delay, etc.)
scrapy.cfg                   project configuration file (contains some default configuration information)

So far, our project has been successfully created.

 

Determine content

After creating the project, the next step is to determine what we want to crawl before we can start writing our items.py file.

First open the target page for analysis.

What do we need in the web page?

  • Movie ranking number
  • Movie title
  • Movie actors, year, and genre information
  • Movie star rating
  • Number of reviewers
  • Movie introduction (short description)

Now we can write the items.py file based on this content.

The items.py code is as follows:

# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# https://docs.scrapy.org/en/latest/topics/items.html

import scrapy


class DoubanItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()

    serial_number = scrapy.Field()  # ranking number
    movie_name = scrapy.Field()     # movie title
    introduce = scrapy.Field()      # basic information (director, actors, year, genre)
    star = scrapy.Field()           # star rating
    evaluate = scrapy.Field()       # number of reviewers
    describe = scrapy.Field()       # short description of the movie

 

 

Content extraction: writing the spider file

After determining the content, we write the spider file.

The douban_spider.py file for the testing phase is as follows:

# -*- coding: utf-8 -*-
import scrapy


class DoubanSpiderSpider(scrapy.Spider):
    # crawler name
    name = 'douban_spider'
    # allowed domains: all crawled urls must belong to this domain
    allowed_domains = ['movie.douban.com']
    # start url
    start_urls = ['https://movie.douban.com/top250/']

    def parse(self, response):
        print(response.text)  # print the response content

Now let's run our crawler to see whether we can get any information.

Open a command window, cd into the project directory, and enter the command scrapy crawl douban_spider

douban_spider is the name of the crawler, which is defined in the douban_spider.py file.

The run output is as follows.

We can see crawler log output and response information, but none of the movie information we want. What should we do now?

Anyone who has learned a little about crawlers knows that checking the USER_AGENT is the simplest anti-crawler mechanism, so we need to set the user agent of our crawler.

Where do we find a user agent? The simplest way is to search for one online, or we can copy our own user agent from the browser's developer tools.

For example, in Chrome press F12, open the Network panel, click any request, and copy the User-Agent value from the request headers.

Open the settings.py file, find USER_AGENT, and modify it as follows:

# Crawl responsibly by identifying yourself (and your website) on the user-agent
#USER_AGENT = 'douban (+http://www.yourdomain.com)'
USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.80 Safari/537.36'

Then open the command window again, cd into the project directory, and run scrapy crawl douban_spider

This time we can see the movie information we want.

Still, running the spider from the command line every time is inconvenient, so we can add a main.py startup file to the project as follows.

main.py writes code:

from scrapy import cmdline
cmdline.execute('scrapy crawl douban_spider'.split())

Run it and you get the same result as running from the command line.

The next step is to process the data, extract the information we want, and continue to write spider.py file.

For data extraction we use XPath. First, inspect the elements of the target page: each page of the Top 250 list contains 25 movies, and each movie is an li element in a list.

There are many ways to write an XPath. We can inspect the elements and write the XPath ourselves, or copy an element's XPath directly from Chrome.

For example, using an XPath browser plug-in to locate the elements we need, we first find the node for each movie.

As shown in the figure, we can write //ol[@class='grid_view']/li/div[@class='item'] to locate each movie. In fact we could simply write //ol/li, but it is better to be precise. Basic XPath syntax is as follows:

Expression    Description
nodename      Selects all child nodes of the named node.
/             Selects from the root node.
//            Selects matching nodes anywhere in the document, regardless of their position.
.             Selects the current node.
..            Selects the parent of the current node.
@             Selects attributes.
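If you want to check an XPath expression before putting it into the spider, you can try it in scrapy shell, or use Scrapy's Selector class directly. A minimal sketch (the HTML snippet below is made up to mimic the structure of the list page):

from scrapy.selector import Selector

# made-up snippet mimicking the structure of the Top 250 list page
html = """
<ol class="grid_view">
  <li><div class="item"><em>1</em><span class="title">The Shawshank Redemption</span></div></li>
</ol>
"""

sel = Selector(text=html)
for movie in sel.xpath("//ol[@class='grid_view']/li/div[@class='item']"):
    print(movie.xpath(".//em/text()").get())                    # ranking, e.g. '1'
    print(movie.xpath(".//span[@class='title']/text()").get())  # title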

In the same way, we can find the XPath for the movie ranking, title, review count, and so on. Next, we import the DoubanItem class defined in items.py and assign the extracted values to its fields.

The douban_spider.py code:

# -*- coding: utf-8 -*-
import scrapy
from douban.items import DoubanItem


class DoubanSpiderSpider(scrapy.Spider):
    # crawler name
    name = 'douban_spider'
    # allowed domains: all crawled urls must belong to this domain
    allowed_domains = ['movie.douban.com']
    # start url
    start_urls = ['https://movie.douban.com/top250/']

    # default parse method
    def parse(self, response):
        # note: when embedding xpath in a python string, mind single vs. double quotation marks
        movie_list = response.xpath("//ol[@class='grid_view']/li/div[@class='item']")
        for movie_item in movie_list:
            douban_item = DoubanItem()
            # text() at the end of an xpath expression extracts the text content of the matched node
            # get() returns the first matched value, getall() returns a list of all matches
            douban_item['serial_number'] = movie_item.xpath(".//em/text()").get()
            douban_item['movie_name'] = movie_item.xpath(".//span[@class='title']/text()").get()
            # the introduction spans several lines with irregular whitespace,
            # so grab all of it with getall() and clean it up afterwards
            content = movie_item.xpath(".//div[@class='bd']/p[1]/text()").getall()
            # strip the whitespace and join the lines
            content_introduce = ''
            for conitem in content:
                content_s = ''.join(conitem.split())
                content_introduce = content_introduce + content_s + '  '
            # assign the cleaned string
            douban_item['introduce'] = content_introduce
            douban_item['star'] = movie_item.xpath(".//span[@class='rating_num']/text()").get()
            douban_item['evaluate'] = movie_item.xpath(".//div[@class='star']/span[4]/text()").get()
            douban_item['describe'] = movie_item.xpath(".//div[@class='bd']/p[2]/span/text()").get()
            # yield the item, otherwise the pipeline in pipelines.py will never receive the data
            yield douban_item

        # follow the link to the next page so it gets parsed as well
        next_linkend = response.xpath("//span[@class='next']/a/@href").get()
        # check whether a next page exists
        if next_linkend:
            next_link = 'https://movie.douban.com/top250/' + next_linkend
            # yield a new request to the scheduler, with the parse method above as the callback
            yield scrapy.Request(next_link, callback=self.parse)
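A quick illustration of the get()/getall() difference mentioned in the comments above: get() returns only the first matched value (or None if nothing matches), while getall() returns a list of all matches. A small sketch with a made-up snippet:

from scrapy.selector import Selector

sel = Selector(text="<p>line one<br>line two</p>")
print(sel.xpath("//p/text()").get())     # 'line one'  -- first match only
print(sel.xpath("//p/text()").getall())  # ['line one', 'line two']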

Data storage

We can save the data directly to a JSON or CSV file with a single command, as follows.

Open the command line and cd into the project directory.

Enter scrapy crawl douban_spider -o test.json to get a JSON file.

Enter scrapy crawl douban_spider -o test.csv to get a CSV file.

The CSV file can be opened and browsed directly in Excel, but you may find garbled characters. Open the file with Notepad++, change the encoding, save it, and then open it in Excel again.
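Alternatively, the garbled characters can usually be avoided at the source by setting the export encoding in settings.py; a minimal sketch using Scrapy's FEED_EXPORT_ENCODING setting:

# settings.py
# 'utf-8-sig' adds a BOM so that Excel detects the encoding correctly
FEED_EXPORT_ENCODING = 'utf-8-sig'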

Store in database

Next we write pipelines.py to store the data in MongoDB.

Note that we need to uncomment ITEM_PIPELINES in settings.py so that pipelines.py is actually used; see the snippet below.
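For reference, after uncommenting, the relevant block in settings.py should look roughly like this (assuming the project keeps the generated name douban and the pipeline class DoubanPipeline):

# settings.py
ITEM_PIPELINES = {
    'douban.pipelines.DoubanPipeline': 300,  # lower number = higher priority
}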

pipelines.py code:

# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://docs.scrapy.org/en/latest/topics/item-pipeline.html

import pymongo

# connect to the local database (a remote database works the same way)
myclient = pymongo.MongoClient("mongodb://localhost:27017/")
# database name
mydb = myclient["douban"]
# collection name
mysheet = mydb["movie"]

class DoubanPipeline(object):
    # item here is what the spider just yielded
    def process_item(self, item, spider):
        data = dict(item)
        # insert the data into the collection
        mysheet.insert_one(data)
        return item

After running main.py, the data is stored in the database, and we can open the database to view it. A quick way to check from Python is shown below.
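A short pymongo snippet can verify the stored data without a graphical tool (a sketch, assuming the database and collection names used above):

import pymongo

client = pymongo.MongoClient("mongodb://localhost:27017/")
collection = client["douban"]["movie"]
print(collection.count_documents({}))               # 250 if the whole list was crawled
print(collection.find_one({"serial_number": "1"}))  # the top-ranked movie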

At this point, our crawler project is essentially complete.

 

Crawler disguise

  • IP proxy middleware
  • User-Agent middleware

Using an IP proxy requires purchasing proxy servers first; the basic idea is sketched below.
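For completeness, the core of an IP proxy middleware is just setting request.meta['proxy'] in process_request; a minimal sketch with a placeholder address (not a working proxy), which would also need to be enabled in DOWNLOADER_MIDDLEWARES:

class my_proxy(object):
    # downloader middleware sketch: route every request through a proxy server
    def process_request(self, request, spider):
        # placeholder address -- replace it with a proxy you actually have
        request.meta['proxy'] = 'http://127.0.0.1:8000'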

Here, let's implement the User-Agent middleware.

Edit middlewares.py and append our own class at the end (add import random at the top of the file):

class my_useragent(object):
    # downloader middleware: pick a random user agent for each request
    def process_request(self, request, spider):
        USER_AGENT_LIST = [
            "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; AcooBrowser; .NET CLR 1.1.4322; .NET CLR 2.0.50727)",
            "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.0; Acoo Browser; SLCC1; .NET CLR 2.0.50727; Media Center PC 5.0; .NET CLR 3.0.04506)",
            "Mozilla/4.0 (compatible; MSIE 7.0; AOL 9.5; AOLBuild 4337.35; Windows NT 5.1; .NET CLR 1.1.4322; .NET CLR 2.0.50727)",
            "Mozilla/5.0 (Windows; U; MSIE 9.0; Windows NT 9.0; en-US)",
            "Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Win64; x64; Trident/5.0; .NET CLR 3.5.30729; .NET CLR 3.0.30729; .NET CLR 2.0.50727; Media Center PC 6.0)",
            "Mozilla/5.0 (compatible; MSIE 8.0; Windows NT 6.0; Trident/4.0; WOW64; Trident/4.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; .NET CLR 1.0.3705; .NET CLR 1.1.4322)",
            "Mozilla/4.0 (compatible; MSIE 7.0b; Windows NT 5.2; .NET CLR 1.1.4322; .NET CLR 2.0.50727; InfoPath.2; .NET CLR 3.0.04506.30)",
            "Mozilla/5.0 (Windows; U; Windows NT 5.1; zh-CN) AppleWebKit/523.15 (KHTML, like Gecko, Safari/419.3) Arora/0.3 (Change: 287 c9dfb30)",
            "Mozilla/5.0 (X11; U; Linux; en-US) AppleWebKit/527+ (KHTML, like Gecko, Safari/419.3) Arora/0.6",
            "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.2pre) Gecko/20070215 K-Ninja/2.1.1",
            "Mozilla/5.0 (Windows; U; Windows NT 5.1; zh-CN; rv:1.9) Gecko/20080705 Firefox/3.0 Kapiko/3.0",
            "Mozilla/5.0 (X11; Linux i686; U;) Gecko/20070322 Kazehakase/0.4.5",
            "Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.0.8) Gecko Fedora/1.9.0.8-1.fc10 Kazehakase/0.5.6",
            "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/535.11 (KHTML, like Gecko) Chrome/17.0.963.56 Safari/535.11",
            "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_3) AppleWebKit/535.20 (KHTML, like Gecko) Chrome/19.0.1036.7 Safari/535.20",
            "Opera/9.80 (Macintosh; Intel Mac OS X 10.6.8; U; fr) Presto/2.9.168 Version/11.52",
        ]
        # choose a random user agent and set it on the request headers
        agent = random.choice(USER_AGENT_LIST)
        request.headers['User-Agent'] = agent

Then go to settings.py, enable DOWNLOADER_MIDDLEWARES, and point it at the class we just created, as follows.
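The corresponding block in settings.py should look roughly like this (assuming the class is kept as my_useragent in the douban project's middlewares.py):

# settings.py
DOWNLOADER_MIDDLEWARES = {
    'douban.middlewares.my_useragent': 543,
}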

Then run main.py and it's OK.
