Web Crawler Explanation - Scrapy Framework Crawler - Scrapy Usage

XPath expressions

//x selects the specified tag at any depth, e.g. //div selects all div tags
/x selects the specified tag one level down
/@x gets the value of the specified attribute; attributes can be chained, e.g. @id, @src
[@attribute="value"] selects tags whose attribute equals the given value, e.g. tags whose class equals a given name; conditions can be chained
/text() gets the tag's text content
[x] retrieves the element at the given index from a result set
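
As a rough sketch of these rules with scrapy.Selector (the HTML fragment and variable names below are made up for illustration, not taken from the target site):

from scrapy.selector import Selector

# Made-up HTML fragment, used only to illustrate the XPath rules above
html = '''
<div class="showlist">
    <p class="name"><a href="/item/1">First product</a></p>
    <p class="name"><a href="/item/2">Second product</a></p>
</div>
'''
sel = Selector(text=html)

print(sel.xpath('//p').extract())                              # //p finds all p tags at any depth
print(sel.xpath('//p[@class="name"]/a/text()').extract())      # text content of the matched links
print(sel.xpath('//p[@class="name"]/a/@href').extract())       # value of the href attribute
print(sel.xpath('//p[@class="name"]/a/text()')[0].extract())   # [0] picks one element by index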

1. Apply a regular expression to the result of an XPath expression by chaining .re('pattern') at the end

xpath('//div[@class="showlist"]/li//img')[0].re(r'alt="(\w+)')

2. Apply a regular expression inside the selector rule itself
[re:test(@attribute, "pattern")] matches tags whose attribute matches the regular expression

xpath('//div[re:test(@class, "showlist")]').extract()
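
Both techniques can be tried with scrapy.Selector; a minimal sketch, assuming a made-up HTML fragment:

from scrapy.selector import Selector

# Made-up HTML fragment for illustration only
html = '<div class="showlist"><li><img alt="book1" src="a.jpg"></li></div>'
sel = Selector(text=html)

# 1. Filter the XPath result further by chaining .re('pattern')
print(sel.xpath('//div[@class="showlist"]/li//img').re(r'alt="(\w+)'))

# 2. Use re:test() inside the XPath expression itself, then extract()
print(sel.xpath('//div[re:test(@class, "showlist")]').extract())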

Using Scrapy to get the product titles, links, and comment counts from an e-commerce website

Analyzing the source code

Step 1: Write the items.py container file

We already know what we want to get: the product title, the product link, and the number of comments.

Create containers in items.py to receive the data captured by the crawler

The container class that holds the data captured by the crawler must inherit scrapy.Item

scrapy.Field() defines the fields of the container; each field receives one kind of data captured by the crawler

# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# http://doc.scrapy.org/en/latest/topics/items.html

import scrapy

# items.py receives the data captured by the crawler; it acts as the container file.

class AdcItem(scrapy.Item):    # Container class holding the data captured by the crawler
    # define the fields for your item here like:
    # name = scrapy.Field()
    title = scrapy.Field()      # Receives the product titles captured by the crawler
    link = scrapy.Field()       # Receives the product links captured by the crawler
    comment = scrapy.Field()    # Receives the comment counts captured by the crawler

Step 2: Write the pach.py crawler file

A spider class must inherit scrapy.Spider

name sets the crawler name
allowed_domains sets the domains the crawler is allowed to crawl
start_urls sets the URLs to crawl
The parse(response) callback receives the response, which is the fetched HTML data object
xpath() filters the response; its parameter is an XPath expression
extract() gets the data out of the HTML data object
yield item returns the populated container object to pipelines.py

# -*- coding: utf-8 -*-
import scrapy
from adc.items import AdcItem  #Import the AdcItem class, container class in items.py

class PachSpider(scrapy.Spider):                 # A spider class must inherit scrapy.Spider
    name = 'pach'                                # Set the crawler name
    allowed_domains = ['search.dangdang.com']    # Domains the crawler is allowed to crawl
    start_urls = ['http://category.dangdang.com/pg1-cid4008149.html']    # URLs to crawl

    def parse(self, response):                   #parse callback function
        item = AdcItem()                         # Instantiate the container object
        item['title'] = response.xpath('//p[@class="name"]/a/text()').extract()    # XPath filter; assign the extracted titles to the title field
        # print(item['title'])
        item['link'] = response.xpath('//p[@class="name"]/a/@href').extract()      # XPath filter; assign the extracted links to the link field
        # print(item['link'])
        item['comment'] = response.xpath('//p[@class="star"]//a/text()').extract() # XPath filter; assign the extracted comment counts to the comment field
        # print(item['comment'])
        yield item   # Return the populated container object to pipelines.py
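
Before running the full spider, the XPath expressions above can be tried interactively in the Scrapy shell against the start URL; a quick sketch of such a session:

scrapy shell "http://category.dangdang.com/pg1-cid4008149.html"
>>> response.xpath('//p[@class="name"]/a/text()').extract()
>>> response.xpath('//p[@class="name"]/a/@href').extract()
>>> response.xpath('//p[@class="star"]//a/text()').extract()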

The robots protocol

Note: if the target website's robots.txt disallows crawling, the spider will not be able to crawl it, because Scrapy obeys the robots protocol by default. To ignore the protocol, change the setting in settings.py.

Find the ROBOTSTXT_OBEY variable in settings.py: False means do not obey the robots protocol, True means obey it.

# Obey robots.txt rules
ROBOTSTXT_OBEY = False   # Do not obey the robots protocol

Step 3: Write the pipelines.py data processing file

For the data processing class in pipelines.py to take effect, it must be registered in the ITEM_PIPELINES variable in the settings.py settings file.

# Configure item pipelines
# See http://scrapy.readthedocs.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
   'adc.pipelines.AdcPipeline': 300,  # Register the adc.pipelines.AdcPipeline class. The number (0-1000) sets the execution order: pipelines with lower numbers run earlier.
}
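
If more than one pipeline were registered, the one with the lower number would run first. A hypothetical example (OtherPipeline is an invented name, only for illustration):

ITEM_PIPELINES = {
   'adc.pipelines.AdcPipeline': 300,    # Lower number: runs first
   'adc.pipelines.OtherPipeline': 800,  # Hypothetical second pipeline: runs after AdcPipeline
}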

After registration, the data processing classes in pipelines.py will work.

A data processing class inherits object
process_item(item, spider) is the data processing function; it receives the item yielded by the crawler.

# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: http://doc.scrapy.org/en/latest/topics/item-pipeline.html

class AdcPipeline(object):                      # A data processing class; it inherits object
    def process_item(self, item, spider):       # process_item receives each item yielded by the crawler
        for i in range(0, len(item['title'])):  # item['field name'] returns the corresponding list of data
            title = item['title'][i]
            print(title)
            link = item['link'][i]
            print(link)
            comment = item['comment'][i]
            print(comment)
        return item
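
Because title, link and comment are parallel lists of equal length, the same loop could also be written with zip(); this is only an equivalent sketch, not part of the original code:

class AdcPipeline(object):
    def process_item(self, item, spider):
        # zip() pairs the parallel title/link/comment lists element by element
        for title, link, comment in zip(item['title'], item['link'], item['comment']):
            print(title, link, comment)
        return item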

Final step: run the crawler

Execute the crawler: scrapy crawl pach --nolog


You can see that the data we need is already available.
