Python crawler: the Scrapy framework

Installation

The urllib library is better suited to writing single crawler scripts, while Scrapy is better suited to full crawler projects.

Steps:

  1. Switch pip to a domestic mirror first, since the default source is slow to reach from China (see the example after this list). Reference: https://www.jb51.net/article/159167.htm
  2. Upgrade pip: python -m pip install --upgrade pip
  3. pip install wheel
  4. pip install lxml
  5. pip install Twisted
  6. pip install scrapy
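For reference, one common domestic mirror is the Tsinghua mirror (any mirror from the article referenced in step 1 works); you can either configure it globally or pass it for a single install:

pip config set global.index-url https://pypi.tuna.tsinghua.edu.cn/simple
pip install scrapy -i https://pypi.tuna.tsinghua.edu.cn/simple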

Common commands

Core directory

  1. New project: scrapy startproject MCQ
  2. Run a standalone spider file (not a whole project): given a spider file such as the gg.py sketch below, enter the command scrapy runspider gg.py
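A minimal standalone spider file (a hypothetical gg.py, not taken from the original post) could look like this:

import scrapy

class GgSpider(scrapy.Spider):
    name = "gg"
    start_urls = ["http://www.baidu.com"]

    def parse(self, response):
        # print the page title just to show the spider ran
        print(response.xpath("/html/head/title/text()").extract())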

  3. Get a setting: cd into the project, then for example scrapy settings --get BOT_NAME

  4. Interactive crawling: scrapy shell http://www.baidu.com, then use Python code in the shell

  5. Scrapy version information: scrapy version

  6. Fetch and view in the browser: scrapy view http://news.1152.com downloads the page and opens it locally

  7. Test local hardware performance: scrapy bench reports roughly how many pages can be crawled per minute

  8. Create spider files from a template: scrapy genspider -l lists the available templates

    Choose basic: scrapy genspider -t basic haha baidu.com (note: fill in the bare domain to crawl here, without a www/edu prefix)

  9. Check that a spider file is valid: scrapy check haha

  10. Run a spider inside the project: scrapy crawl haha
    To suppress the log output: scrapy crawl haha --nolog

  11. List the available spiders in the current project: scrapy list

  12. Fetch a URL with a specific spider:
    F:\scrapy project\MCQ> scrapy parse --spider=haha http://www.baidu.com

XPath expression

XPath and Regular Simple Contrast:

  1. XPath expressions are a little more efficient
  2. Regular expressions are a little more powerful
  3. Generally speaking, XPath is preferred; for problems XPath cannot solve, fall back to regular expressions.

/ extracts layer by layer

text() extracts the text inside a tag

To extract the title: /html/head/title/text()

//tag name: extracts all tags with that name

For example, to extract all div tags: //div

//tag name[@attribute='attribute value']: extracts the tags whose attribute equals that value

@attribute takes the value of an attribute
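These notations can be tried quickly with Scrapy's Selector (a minimal sketch on a made-up HTML snippet, not from the original post):

from scrapy.selector import Selector

html = "<html><head><title>Demo</title></head><body><div class='a'>one</div><div class='b'>two</div></body></html>"
sel = Selector(text=html)

print(sel.xpath("/html/head/title/text()").get())    # layer by layer: 'Demo'
print(sel.xpath("//div/text()").getall())            # all div tags: ['one', 'two']
print(sel.xpath("//div[@class='b']/text()").get())   # attribute filter: 'two'
print(sel.xpath("//div/@class").getall())            # @ takes attribute values: ['a', 'b']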

Using Scrapy as a product crawler (Dangdang)

New crawler project: F:\scrapy project> scrapy startproject dangdang

F:\scrapy project> cd dangdang

F:\scrapy project\dangdang> scrapy genspider -t basic dd dangdang.com

Modify items.py:

# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# https://docs.scrapy.org/en/latest/topics/items.html

import scrapy


class DangdangItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    title = scrapy.Field()    # product title
    link = scrapy.Field()     # product link
    comment = scrapy.Field()  # number of comments on the product

Let's turn the page and analyze two links:

http://category.dangdang.com/pg2-cid4008154.html

http://category.dangdang.com/pg3-cid4008154.html

You can find the initial link: http://category.dangdang.com/pg1-cid4008154.html

Analyzing the page source, you can start from name="itemlist-title": searching for it gives exactly 48 matches, which is the number of products on one page.

Similarly, Ctrl+F on the comment links finds exactly 48 records.
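A quick way to confirm this (assuming the page still has the structure described in the original post) is the interactive shell:

scrapy shell http://category.dangdang.com/pg1-cid4008154.html
>>> len(response.xpath("//a[@name='itemlist-title']"))   # should be 48 if the structure is unchanged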

dd.py:

# -*- coding: utf-8 -*-
import scrapy
from dangdang.items import DangdangItem
from scrapy.http import Request
class DdSpider(scrapy.Spider):
    name = 'dd'
    allowed_domains = ['dangdang.com']
    start_urls = ['http://category.dangdang.com/pg1-cid4008154.html']

    def parse(self, response):
        item = DangdangItem()
        item["title"] = response.xpath("//a[@name='itemlist-title']/@title").extract()
        item["link"] = response.xpath("//a[@name='itemlist-title']/@href").extract()
        item["comment"] = response.xpath("//a[@name='itemlist-review']/text()").extract()
        # print(item["title"])
        yield item
        for i in range(2, 11):  # crawl pages 2 to 10
            url = 'http://category.dangdang.com/pg' + str(i) + '-cid4008154.html'
            yield Request(url, callback=self.parse)

For the Request used in dd.py:

url: the URL to request and process next
callback: specifies which function handles the Response returned by that request.

First, set ROBOTSTXT_OBEY = False in settings.py:

settings.py:

# -*- coding: utf-8 -*-

# Scrapy settings for dangdang project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
#     https://docs.scrapy.org/en/latest/topics/settings.html
#     https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
#     https://docs.scrapy.org/en/latest/topics/spider-middleware.html

BOT_NAME = 'dangdang'

SPIDER_MODULES = ['dangdang.spiders']
NEWSPIDER_MODULE = 'dangdang.spiders'


# Crawl responsibly by identifying yourself (and your website) on the user-agent
#USER_AGENT = 'dangdang (+http://www.yourdomain.com)'

# Obey robots.txt rules
ROBOTSTXT_OBEY = False

# Configure maximum concurrent requests performed by Scrapy (default: 16)
#CONCURRENT_REQUESTS = 32

# Configure a delay for requests for the same website (default: 0)
# See https://docs.scrapy.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
#DOWNLOAD_DELAY = 3
# The download delay setting will honor only one of:
#CONCURRENT_REQUESTS_PER_DOMAIN = 16
#CONCURRENT_REQUESTS_PER_IP = 16

# Disable cookies (enabled by default)
#COOKIES_ENABLED = False

# Disable Telnet Console (enabled by default)
#TELNETCONSOLE_ENABLED = False

# Override the default request headers:
#DEFAULT_REQUEST_HEADERS = {
#   'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
#   'Accept-Language': 'en',
#}

# Enable or disable spider middlewares
# See https://docs.scrapy.org/en/latest/topics/spider-middleware.html
#SPIDER_MIDDLEWARES = {
#    'dangdang.middlewares.DangdangSpiderMiddleware': 543,
#}

# Enable or disable downloader middlewares
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
#DOWNLOADER_MIDDLEWARES = {
#    'dangdang.middlewares.DangdangDownloaderMiddleware': 543,
#}

# Enable or disable extensions
# See https://docs.scrapy.org/en/latest/topics/extensions.html
#EXTENSIONS = {
#    'scrapy.extensions.telnet.TelnetConsole': None,
#}

# Configure item pipelines
# See https://docs.scrapy.org/en/latest/topics/item-pipeline.html

ITEM_PIPELINES = {
   'dangdang.pipelines.DangdangPipeline': 300,
}
# Enable and configure the AutoThrottle extension (disabled by default)
# See https://docs.scrapy.org/en/latest/topics/autothrottle.html
#AUTOTHROTTLE_ENABLED = True
# The initial download delay
#AUTOTHROTTLE_START_DELAY = 5
# The maximum download delay to be set in case of high latencies
#AUTOTHROTTLE_MAX_DELAY = 60
# The average number of requests Scrapy should be sending in parallel to
# each remote server
#AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
# Enable showing throttling stats for every response received:
#AUTOTHROTTLE_DEBUG = False

# Enable and configure HTTP caching (disabled by default)
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
#HTTPCACHE_ENABLED = True
#HTTPCACHE_EXPIRATION_SECS = 0
#HTTPCACHE_DIR = 'httpcache'
#HTTPCACHE_IGNORE_HTTP_CODES = []
#HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'

Run it: F:\scrapy project\dangdang> scrapy crawl dd --nolog

With the item pipeline enabled in settings.py (the ITEM_PIPELINES block above), write pipelines.py:

pipelines.py:

# -*- coding: utf-8 -*-
import pymysql
# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://docs.scrapy.org/en/latest/topics/item-pipeline.html


class DangdangPipeline(object):
    def process_item(self, item, spider):
        # charset is specified so that Chinese titles are stored correctly
        conn = pymysql.connect(host='127.0.0.1', user="root", passwd="123456",
                               db="dangdang", charset="utf8mb4")
        cursor = conn.cursor()
        for i in range(len(item["title"])):
            title = item["title"][i]
            link = item["link"][i]
            comment = item["comment"][i]
            # parameterized query, so quotes in titles do not break the SQL
            sql = "insert into goods(title,link,comment) values(%s,%s,%s)"
            try:
                cursor.execute(sql, (title, link, comment))
                conn.commit()
            except Exception as e:
                print(e)
        conn.close()
        return item

Log in to MySQL and create the database: mysql> create database dangdang;

mysql> use dangdang

mysql> create table goods(id int(32) auto_increment primary key,title varchar(100),link varchar(100) unique,comment varchar(100));

Finally run scrapy crawl dd --nolog

48 items per page × 10 pages = 480 records: the crawl succeeded!
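To double-check, count the rows in MySQL:

mysql> select count(*) from goods;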

The complete project source code is on my GitHub.

Scrapy simulated login

Take http://edu.iqianyue.com/ as an example. We do not crawl any content here, we only simulate the login, so there is no need to write items.py.

Click login and use Fiddler to find the real login address: http://edu.iqianyue.com/index_user_login

Modify login.py:

# -*- coding: utf-8 -*-
import scrapy
from scrapy.http import FormRequest, Request


class LoginSpider(scrapy.Spider):
    name = 'login'
    allowed_domains = ['iqianyue.com']
    start_urls = ['http://iqianyue.com/']
    header={"User-Agent":"Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:68.0) Gecko/20100101 Firefox/68.0"}
    # Override start_requests(); Scrapy invokes it by default to make the first requests
    def start_requests(self):
        # Request the login page first, then enter the callback parse()
        return [Request("http://edu.iqianyue.com/index_user_login",meta={"cookiejar":1},callback=self.parse)]
    def parse(self, response):
        # POST data to submit; at this point the form has no captcha field
        data={
            "number":"swineherd",
            "passwd":"123",
        }
        print("Log in...")
        #Log in through ForRequest.from_response()
        return FormRequest.from_response(response,
                                          #Setting cookie information
                                          meta={"cookiejar":response.meta["cookiejar"]},
                                          #Setting headers information to simulate browsers
                                          headers=self.header,
                                          #Setting the data in the post form
                                          formdata=data,
                                          #Setting callback function
                                          callback=self.next,
                                          )
    def next(self,response):
        data=response.body
        fp=open("a.html","wb")
        fp.write(data)
        fp.close()
        print(response.xpath("/html/head/title/text()").extract())
        #Access after login
        yield Request("http://edu.iqianyue.com/index_user_index",callback=self.next2,meta={"cookiejar":1})
    def next2(self,response):
        data=response.body
        fp=open("b.html","wb")
        fp.write(data)
        fp.close()
        print(response.xpath("/html/head/title/text()").extract())

Scrapy news crawler in practice

Goal: crawl all the news on the Baidu News home page

F:\> cd scrapy project

F:\scrapy project> scrapy startproject baidunews

F:\scrapy project> cd baidunews

F:\scrapy project\baidunews> scrapy genspider -t basic n1 baidu.com

Packet-capture analysis

Find the JSON responses:

Open one in IDLE to inspect its contents.

Ctrl+F in the home page source:

Scroll the home page down to trigger loading of every news section, then look in Fiddler for the JS responses that hold the URLs, titles, and so on (not every JS file is useful).

The news data is not all in one JS file; there are others as well, so look through Fiddler carefully!

http://news.baidu.com/widget?id=LocalNews&ajax=json&t=1566824493194

http://news.baidu.com/widget?id=civilnews&t=1566824634139

http://news.baidu.com/widget?id=InternationalNews&t=1566824931323

http://news.baidu.com/widget?id=EnterNews&t=1566824931341

http://news.baidu.com/widget?id=SportNews&t=1566824931358

http://news.baidu.com/widget?id=FinanceNews&t=1566824931376

http://news.baidu.com/widget?id=TechNews&t=1566824931407

http://news.baidu.com/widget?id=MilitaryNews&t=1566824931439

http://news.baidu.com/widget?id=InternetNews&t=1566824931456

http://news.baidu.com/widget?id=DiscoveryNews&t=1566824931473

http://news.baidu.com/widget?id=LadyNews&t=1566824931490

http://news.baidu.com/widget?id=HealthNews&t=1566824931506

http://news.baidu.com/widget?id=PicWall&t=1566824931522

You can see that what actually determines which news section is returned is the id value after widget?.

Write a script to extract the id:
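The original post does not include the script, but a minimal sketch (assuming the captured widget URLs are pasted into a list) could be:

import re

urls = [
    "http://news.baidu.com/widget?id=LocalNews&ajax=json&t=1566824493194",
    "http://news.baidu.com/widget?id=civilnews&t=1566824634139",
    # ... the remaining captured widget URLs
]
# pull the value of the id parameter out of each URL
ids = [re.search(r"id=([^&]+)", u).group(1) for u in urls]
print(ids)  # e.g. ['LocalNews', 'civilnews']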

The JSON returned for different widget ids also stores the article URLs under different keys (some responses use "m_url", others use "url"), which the spider below handles with two patterns:

items.py:

# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# https://docs.scrapy.org/en/latest/topics/items.html

import scrapy


class BaidunewsItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    title = scrapy.Field()
    link = scrapy.Field()
    content = scrapy.Field()

n1.py:

# -*- coding: utf-8 -*-
import scrapy
from baidunews.items import BaidunewsItem  # imported from the project's core directory
from scrapy.http import Request
import re
import time
class N1Spider(scrapy.Spider):
    name = 'n1'
    allowed_domains = ['baidu.com']
    start_urls = ["http://news.baidu.com/widget?id=LocalNews&ajax=json"]
    allid=['LocalNews', 'civilnews', 'InternationalNews', 'EnterNews', 'SportNews', 'FinanceNews', 'TechNews', 'MilitaryNews', 'InternetNews', 'DiscoveryNews', 'LadyNews', 'HealthNews', 'PicWall']
    allurl=[]
    for k in range(len(allid)):
        thisurl="http://news.baidu.com/widget?id="+allid[k]+"&ajax=json"
        allurl.append(thisurl)

    def parse(self, response):
        while True:  # re-crawl every five minutes
            for m in range(len(self.allurl)):
                yield Request(self.allurl[m], callback=self.next)
                time.sleep(300)  # unit: seconds
    cnt=0
    def next(self,response):
        print("The first" + str(self.cnt) + "Columns")
        self.cnt+=1
        data=response.body.decode("utf-8","ignore")
        pat1='"m_url":"(.*?)"'
        pat2='"url":"(.*?)"'
        url1=re.compile(pat1,re.S).findall(data)
        url2=re.compile(pat2,re.S).findall(data)
        if(len(url1)!=0):
            url=url1
        else :
            url=url2
        for i in range(len(url)):
            thisurl = re.sub("\\\/", "/", url[i])  # un-escape the \/ sequences in the JSON
            print(thisurl)
            yield Request(thisurl,callback=self.next2)
    def next2(self,response):
        item=BaidunewsItem()
        item["link"]=response.url
        item["title"]=response.xpath("/html/head/title/text()")
        item["content"]=response.body
        print(item)
        yield item

Enable the item pipeline in settings.py:

Set ROBOTSTXT_OBEY to False, then scrapy crawl n1 --nolog will run the crawler.

Scrapy Douban login crawler

Add this to settings.py:

USER_AGENT = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_8_3) AppleWebKit/536.5 (KHTML, like Gecko) Chrome/19.0.1084.54 Safari/536.5'

Refer to this blog for the difference between scrapy.http.FormRequest and scrapy.http.FormRequest.from_response: https://blog.csdn.net/qq_33472765/article/details/80958820. In short, FormRequest posts formdata to a URL you give it, while FormRequest.from_response fills in a form found in an existing response (picking up its hidden fields and action URL).

d1.py:

# -*- coding: utf-8 -*-
import scrapy
from scrapy.http import Request, FormRequest


class D1Spider(scrapy.Spider):
    name = 'd1'
    allowed_domains = ['douban.com']
    # start_urls = ['http://douban.com/']
    headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:68.0) Gecko/20100101 Firefox/68.0"}

    def start_requests(self):
        # First climb the login page once, and then enter the callback function parse()
        print("Start:")
        return [Request("https://accounts.douban.com/passport/login",meta={"cookiejar":1},callback=self.login)]

    def login(self, response):
        # Captcha check: the XPath was left unfinished in the original, so the line is
        # commented out here (Douban now uses a slider captcha anyway; see the note below)
        # captcha = response.xpath("//")
        data = {
            "ck": "",
            "name": "***",
            "password": "***",
            "remember": "false",
            "ticket": ""
        }
        print("In landing...")
        return FormRequest(url="https://accounts.douban.com/j/mobile/login/basic",
                                         # Setting cookie information
                                         meta={"cookiejar": response.meta["cookiejar"]},
                                         # Setting headers information to simulate browsers
                                         headers=self.headers,
                                         # Setting the data in the post form
                                         formdata=data,
                                         # Setting callback function
                                         callback=self.next,
                                         )
    def next(self,response):
        #Jump to the Personal Center
        yield Request("https://www.douban.com/people/202921494/",meta={"cookiejar":1},callback=self.next2)
    def next2(self, response):
        title = response.xpath("/html/head/title/text()").extract()
        print(title)

Douban now uses a slider captcha; as a beginner I will not deal with that for now.

Using XPath expressions with urllib

First install the lxml module (pip install lxml), then parse the page data into an element tree with lxml's etree:

import urllib.request
from lxml import etree
data=urllib.request.urlopen("http://www.baidu.com").read().decode("utf-8","ignore")
treedata=etree.HTML(data)
title=treedata.xpath("//title/text()")
print(title)
