The official introduction to Scrapy reads:
An open source and collaborative framework for extracting the data you need from websites. In a fast, simple, yet extensible way.
The environment and tools used in this project are as follows.
- Python 3 and Scrapy (their installation is not covered here)
- MongoDB, a NoSQL (non-relational) database used to store the data. Official download page: https://www.mongodb.com/download-center/community?jmp=docs
- NoSQL Manager, the recommended graphical management tool for MongoDB
As usual, let's pick an easy target and crawl the simplest one: the Douban Movie Top 250.
Almost everyone who learns to write crawlers ends up crawling this site. It is very representative, so without further ado, let's begin the project.
A Scrapy project is created from the command line.
Switch to the working directory and enter the command scrapy startproject douban
Once the project is created successfully, open it in PyCharm and first look at the directory structure.
We find only one file in the project's spiders folder. This is where the crawlers live, so why is there nothing but __init__.py?
Don't worry. We need one more command to create a basic crawler. Open cmd and switch to the spiders directory inside the project folder.
Enter scrapy genspider douban_spider https://movie.douban.com/top250
The crawler is now created.
Then we open the project and go through the directory structure:
- spiders: the crawler folder
- items.py: defines the data structure of the items (that is, the fields of the content we crawl)
- pipelines.py: defines how items are processed, such as data cleaning (the pipelines option must be enabled in settings.py)
- settings.py: the project settings file, which defines global settings (such as the user agent, request concurrency, download delay, and so on)
- scrapy.cfg: the project configuration file (contains some default configuration information)
So far, our project has been successfully created.
After creating the project, the next step is to determine what we want to crawl before we can start writing our items.py file.
First open the target page for analysis.
What do we need in the web page?
- Rank number
- Movie title
- Basic information: director, cast, year, and genre
- Rating score
- Number of reviewers
- One-line description of the film
Now we can write items.py based on these fields. The items.py code is as follows:
    # -*- coding: utf-8 -*-

    # Define here the models for your scraped items
    #
    # See documentation in:
    # https://docs.scrapy.org/en/latest/topics/items.html

    import scrapy


    class DoubanItem(scrapy.Item):
        # define the fields for your item here like:
        # name = scrapy.Field()
        serial_number = scrapy.Field()  # ranking
        movie_name = scrapy.Field()     # movie title
        introduce = scrapy.Field()      # basic information about the film
        star = scrapy.Field()           # rating score
        evaluate = scrapy.Field()       # number of reviewers
        describe = scrapy.Field()       # brief description of the film
Writing the spider file for content extraction
Once the fields are decided, we write the spider file.
The spider file for the test phase is as follows:
    # -*- coding: utf-8 -*-
    import scrapy


    class DoubanSpiderSpider(scrapy.Spider):
        # Crawler name
        name = 'douban_spider'
        # Allowed domain: every crawled URL must belong to it
        allowed_domains = ['movie.douban.com']
        # Start URL
        start_urls = ['https://movie.douban.com/top250/']

        def parse(self, response):
            print(response.text)  # print the response body
Then we run the crawler to see whether we can get any information yet.
Open a command window, cd to the project directory, and enter the command scrapy crawl douban_spider
douban_spider is the name of the crawler, as set by the name attribute in the spider file.
Run it and look at the output.
The output contains crawler information and response information, but not the movie information we want. What should we do now?
Anyone who has learned a little about crawlers knows that a crawler usually needs to change its USER_AGENT. Checking the user agent is the simplest anti-crawler mechanism, so we need to change our crawler's user agent too.
Where can we find a user agent string? The simplest way is to search for one on Baidu, or we can copy our own from the browser's developer tools.
For example, in Chrome, press F12, select any requested resource, and copy its User-Agent header. The details are not repeated here.
Open the settings.py file, find USER_AGENT, and modify it as follows:
    # Crawl responsibly by identifying yourself (and your website) on the user-agent
    #USER_AGENT = 'douban (+http://www.yourdomain.com)'
    USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.80 Safari/537.36'
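To make it concrete what this setting changes, here is a minimal sketch using only the standard library. It is not how Scrapy works internally; it only shows that the user agent is an ordinary HTTP request header. The helper name build_request is hypothetical, and the request is built but never sent.

```python
from urllib.request import Request

# Browser-like user agent string copied from the settings.py example above.
USER_AGENT = (
    "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/75.0.3770.80 Safari/537.36"
)


def build_request(url, user_agent=None):
    """Build (but do not send) a request, optionally with a custom User-Agent."""
    headers = {"User-Agent": user_agent} if user_agent else {}
    return Request(url, headers=headers)


plain = build_request("https://movie.douban.com/top250/")
custom = build_request("https://movie.douban.com/top250/", USER_AGENT)

# urllib stores header names in capitalized form, hence "User-agent" here.
print(plain.get_header("User-agent"))   # no explicit header has been set
print(custom.get_header("User-agent"))  # the browser-like string
```

A site that filters on this header sees the browser-like string instead of a default that identifies the client as a script, which is why the response changes.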
Then we open the command window again, cd to the project directory, and enter the command scrapy crawl douban_spider
This time the output contains the movie information we want.
Running the spider from the command line every time is inconvenient, so we can add a main.py startup file to the project.
The code in main.py:
    from scrapy import cmdline

    cmdline.execute('scrapy crawl douban_spider'.split())
Run it, and you get the same result as running the command from the command line.
The next step is to process the data and extract the information we want, so we continue writing the spider file.
For data extraction we use XPath. First inspect the elements of the target page: each page of the Top 250 holds 25 movies, and each movie is one li list item.
There are several ways to write an XPath expression. We can inspect the elements and write the expression by hand, or let Chrome copy the XPath of an element directly.
For example, using an XPath browser plug-in to find the elements we need, we first locate each movie.
As shown in the figure, we can write //ol[@class='grid_view']/li/div[@class='item'] to locate each movie. We could simply write //ol/li, but it is better to be precise. The basic XPath syntax is as follows:
| Expression | Description |
|---|---|
| nodename | Selects all child nodes of this node. |
| / | Selects from the root node. |
| // | Selects nodes in the document from the current node that match the selection, regardless of their location. |
| . | Selects the current node. |
| .. | Selects the parent of the current node. |
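The expressions in this table can be tried out without running the crawler. The sketch below uses the standard library's xml.etree.ElementTree on a small, hypothetical snippet of markup that imitates the list structure of the page. ElementTree supports only a subset of XPath: there is no text() step (the .text attribute is used instead) and searches start with ".//" rather than "//", so this illustrates the syntax and is not a replacement for Scrapy's selectors.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified markup that mimics the list structure described above.
SAMPLE = """
<html><body>
  <ol class="grid_view">
    <li><div class="item"><em>1</em><span class="title">Movie A</span></div></li>
    <li><div class="item"><em>2</em><span class="title">Movie B</span></div></li>
  </ol>
</body></html>
"""

root = ET.fromstring(SAMPLE.strip())

# ".//" searches all descendants of the current node (here: the root).
items = root.findall(".//ol[@class='grid_view']/li/div[@class='item']")

movies = []
for item in items:
    # Starting with "." keeps each lookup relative to the current movie node.
    rank = item.find(".//em").text
    title = item.find(".//span[@class='title']").text
    movies.append((rank, title))

# ".." steps up to the parent: every div.item sits inside an <li>.
parent_tags = [parent.tag for parent in root.findall(".//div[@class='item']/..")]

print(movies)
print(parent_tags)
```

The relative form matters inside the loop: without the leading ".", an expression would be evaluated against the whole document rather than the current movie.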
In the same way, we can find the XPath of the movie ranking, name, number of reviews, and so on. Next, we import the DoubanItem class written in items.py and assign the extracted values to its fields.
The spider file code:
    # -*- coding: utf-8 -*-
    import scrapy
    from douban.items import DoubanItem


    class DoubanSpiderSpider(scrapy.Spider):
        # Crawler name
        name = 'douban_spider'
        # Allowed domain: every crawled URL must belong to it
        allowed_domains = ['movie.douban.com']
        # Start URL
        start_urls = ['https://movie.douban.com/top250/']

        # Default parsing method
        def parse(self, response):
            # The XPath uses single quotes inside, so wrap it in double quotes in Python
            movie_list = response.xpath("//ol[@class='grid_view']/li/div[@class='item']")
            for movie_item in movie_list:
                douban_item = DoubanItem()
                # text() at the end of an XPath selects the text of the matched node.
                # get() returns the first match; getall() returns all matches as a list.
                douban_item['serial_number'] = movie_item.xpath(".//em/text()").get()
                douban_item['movie_name'] = movie_item.xpath(".//span[@class='title']/text()").get()
                # The introduction is irregular and spans several lines:
                # fetch every fragment with getall(), then clean them up.
                content = movie_item.xpath(".//div[@class='bd']/p/text()").getall()
                content_introduce = ''
                for con_item in content:
                    # Strip all whitespace inside the fragment
                    content_s = ''.join(con_item.split())
                    content_introduce = content_introduce + content_s + ' '
                douban_item['introduce'] = content_introduce
                douban_item['star'] = movie_item.xpath(".//span[@class='rating_num']/text()").get()
                douban_item['evaluate'] = movie_item.xpath(".//div[@class='star']/span/text()").get()
                douban_item['describe'] = movie_item.xpath(".//div[@class='bd']/p/span/text()").get()
                # Yield the item, otherwise pipelines.py never receives the data
                yield douban_item

            # Turn the page automatically: look for the link to the next page
            next_linkend = response.xpath("//span[@class='next']/a/@href").get()
            # Follow the link only if it exists
            if next_linkend:
                next_link = 'https://movie.douban.com/top250/' + next_linkend
                # Yield the request to the scheduler, with parse (the extraction
                # method above) as the callback
                yield scrapy.Request(next_link, callback=self.parse)
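Two steps in the spider are plain string handling and can be checked on their own: cleaning the multi-line introduction and building the next-page URL. The sketch below uses only the standard library. The function names and the sample values are hypothetical, shaped like what getall() and get() might return. Unlike the loop in the spider, clean_fragments keeps a single space between words instead of removing all whitespace, which may suit text containing Latin names better.

```python
from urllib.parse import urljoin


def clean_fragments(fragments):
    """Collapse whitespace inside each text fragment and join the non-empty ones."""
    cleaned = (" ".join(fragment.split()) for fragment in fragments)
    return " ".join(part for part in cleaned if part)


def build_next_url(current_url, href):
    """Resolve a relative 'next page' href against the current page URL."""
    return urljoin(current_url, href) if href else None


# Hypothetical values shaped like what getall() and get() might return.
raw_intro = [
    "\n      Director: Someone   Starring: Somebody\n",
    "\n      1994 / USA / Drama\n    ",
]
intro = clean_fragments(raw_intro)
next_url = build_next_url("https://movie.douban.com/top250/", "?start=25&filter=")
last_page = build_next_url("https://movie.douban.com/top250/", None)

print(intro)
print(next_url)
print(last_page)
```

Resolving the href with urljoin avoids hard-coding the base URL. Scrapy responses offer response.urljoin() for the same purpose.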
We can also save the data directly to a JSON or CSV file from the command line.
In a command window, cd to the project directory.
Enter the command scrapy crawl douban_spider -o test.json to get a JSON file
Enter the command scrapy crawl douban_spider -o test.csv to get a CSV file
The CSV file can be opened directly in Excel, but the text will appear garbled. Open the file in Notepad++, change the encoding, save it, and then open it in Excel again.
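The garbled text is an encoding problem: Excel tends not to recognize a UTF-8 CSV file unless it starts with a byte order mark (BOM). The following standard-library sketch writes a CSV file with the "utf-8-sig" codec, which adds that mark. The sample row and the temporary file path are made up for illustration. Scrapy's FEED_EXPORT_ENCODING setting is the likely place to apply the same idea in the project, but treat that as an assumption to check against your Scrapy version.

```python
import csv
import os
import tempfile

# Hypothetical sample row using two of the item fields defined in items.py.
rows = [
    {"serial_number": "1", "movie_name": "示例电影"},
]

path = os.path.join(tempfile.mkdtemp(), "test.csv")

# "utf-8-sig" writes a byte order mark first, which Excel uses to detect UTF-8.
with open(path, "w", encoding="utf-8-sig", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["serial_number", "movie_name"])
    writer.writeheader()
    writer.writerows(rows)

# Check that the file really starts with the three BOM bytes.
with open(path, "rb") as f:
    first_bytes = f.read(3)

# Reading with the same codec strips the mark again.
with open(path, encoding="utf-8-sig", newline="") as f:
    read_back = list(csv.DictReader(f))

print(first_bytes == b"\xef\xbb\xbf")
print(read_back)
```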
Store in database
Next we write pipelines.py to store the data in MongoDB.
Note that we need to uncomment the ITEM_PIPELINES setting in settings.py, otherwise pipelines.py will not run.
    # -*- coding: utf-8 -*-

    # Define your item pipelines here
    #
    # Don't forget to add your pipeline to the ITEM_PIPELINES setting
    # See: https://docs.scrapy.org/en/latest/topics/item-pipeline.html

    import pymongo

    # Connect to the local database (a remote one works the same way)
    myclient = pymongo.MongoClient("mongodb://localhost:27017/")
    # Database name
    mydb = myclient["douban"]
    # Collection (table) name
    mysheet = mydb["movie"]


    class DoubanPipeline(object):
        # item is the object just yielded by the spider
        def process_item(self, item, spider):
            data = dict(item)
            # Insert the record; insert_one() replaces the deprecated insert()
            mysheet.insert_one(data)
            return item
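For reference, the ITEM_PIPELINES setting mentioned before the pipeline code would look roughly like the fragment below once uncommented. It assumes the project package is named douban and the pipeline class is DoubanPipeline, as in the code above. The number 300 is the priority used in the generated settings template; lower values run earlier.

```python
# settings.py (fragment)
# Assumes the project package is "douban" and the class is DoubanPipeline.
ITEM_PIPELINES = {
    "douban.pipelines.DoubanPipeline": 300,
}
```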
Now run main.py and the data is stored in the database. We can open the database to view it.
At this point, our crawler project can be considered complete.
Going one step further, two kinds of middleware are commonly used to disguise a crawler:
- IP proxy middleware
- User-agent middleware
An IP proxy requires buying a proxy server before it can be used.
Let's try the user-agent middleware.
Edit middlewares.py and add our own class at the end (with import random at the top of the file):
    import random


    class my_useragent(object):
        def process_request(self, request, spider):
            USER_AGENT_LIST = [
                "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; AcooBrowser; .NET CLR 1.1.4322; .NET CLR 2.0.50727)",
                "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.0; Acoo Browser; SLCC1; .NET CLR 2.0.50727; Media Center PC 5.0; .NET CLR 3.0.04506)",
                "Mozilla/4.0 (compatible; MSIE 7.0; AOL 9.5; AOLBuild 4337.35; Windows NT 5.1; .NET CLR 1.1.4322; .NET CLR 2.0.50727)",
                "Mozilla/5.0 (Windows; U; MSIE 9.0; Windows NT 9.0; en-US)",
                "Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Win64; x64; Trident/5.0; .NET CLR 3.5.30729; .NET CLR 3.0.30729; .NET CLR 2.0.50727; Media Center PC 6.0)",
                "Mozilla/5.0 (compatible; MSIE 8.0; Windows NT 6.0; Trident/4.0; WOW64; Trident/4.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; .NET CLR 1.0.3705; .NET CLR 1.1.4322)",
                "Mozilla/4.0 (compatible; MSIE 7.0b; Windows NT 5.2; .NET CLR 1.1.4322; .NET CLR 2.0.50727; InfoPath.2; .NET CLR 3.0.04506.30)",
                "Mozilla/5.0 (Windows; U; Windows NT 5.1; zh-CN) AppleWebKit/523.15 (KHTML, like Gecko, Safari/419.3) Arora/0.3 (Change: 287 c9dfb30)",
                "Mozilla/5.0 (X11; U; Linux; en-US) AppleWebKit/527+ (KHTML, like Gecko, Safari/419.3) Arora/0.6",
                "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:22.214.171.124pre) Gecko/20070215 K-Ninja/2.1.1",
                "Mozilla/5.0 (Windows; U; Windows NT 5.1; zh-CN; rv:1.9) Gecko/20080705 Firefox/3.0 Kapiko/3.0",
                "Mozilla/5.0 (X11; Linux i686; U;) Gecko/20070322 Kazehakase/0.4.5",
                "Mozilla/5.0 (X11; U; Linux i686; en-US; rv:126.96.36.199) Gecko Fedora/188.8.131.52-1.fc10 Kazehakase/0.5.6",
                "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/535.11 (KHTML, like Gecko) Chrome/17.0.963.56 Safari/535.11",
                "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_3) AppleWebKit/535.20 (KHTML, like Gecko) Chrome/19.0.1036.7 Safari/535.20",
                "Opera/9.80 (Macintosh; Intel Mac OS X 10.6.8; U; fr) Presto/2.9.168 Version/11.52",
            ]
            # Pick a random user agent for each request
            agent = random.choice(USER_AGENT_LIST)
            # The HTTP header name is 'User-Agent', with a hyphen
            request.headers['User-Agent'] = agent
Then go to settings.py, uncomment the DOWNLOADER_MIDDLEWARES setting, and change its entry to the class we just created.
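As a rough sketch, the resulting setting could look like the fragment below. The dotted path assumes the project package is named douban and the class is my_useragent in middlewares.py, as above. The priority 543 is the value found in the generated settings template and is kept here as an assumption rather than a requirement.

```python
# settings.py (fragment)
# Assumes the project package is "douban" and the class lives in middlewares.py.
DOWNLOADER_MIDDLEWARES = {
    "douban.middlewares.my_useragent": 543,
}
```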
Then run main.py and it's OK.