Crawling website data with Scrapy, currently a fairly mainstream crawler framework, is also quite simple.
1. After the project is created, change the value of ROBOTSTXT_OBEY to False in settings.py. Otherwise, Scrapy follows the robots.txt protocol by default, and you may not be able to crawl any data.
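In the settings.py that Scrapy generates, this is a one-line change:

```python
# settings.py
# Ignore robots.txt rules so the spider is not blocked from fetching pages
ROBOTSTXT_OBEY = False
```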
2. Write your crawler in the spider file. You can use XPath or CSS selectors to parse the data. Once all the data you need is parsed out, declare your fields in the items file:
```python
import scrapy


class JobspiderItem(scrapy.Item):
    zwmc = scrapy.Field()
    zwxz = scrapy.Field()
    zpyq = scrapy.Field()
    gwyq = scrapy.Field()
```
3. Then, in the spider file, import the field class declared in items, create an item object, assign the values, and finally don't forget to yield the item:
```python
item = JobspiderItem()
item['zwmc'] = zwmc
item['zwxz'] = money
item['zpyq'] = zpyq
item['gwyq'] = gzyq
yield item
```
4. The next step is to save it to a MySQL database:
1. Import the item class and the pymysql module in the pipelines.py file:
```python
import pymysql

from .items import JobspiderItem
```

(pipelines.py and items.py sit in the same package, so a single-dot relative import is what you want here; the double-dot form fails when Scrapy runs the project.)
2. Then connect to and write to the database. Here I created the MySQL database and the data table directly in MySQL, not in the code:
```python
class JobspiderPipeline(object):
    def __init__(self):
        # 1. Establish the database connection
        self.connect = pymysql.connect(
            host='localhost',  # connect to the local database
            port=3306,         # port number of the MySQL server
            user='root',       # database user name
            passwd='123456',   # local database password
            db='job51',        # database name
            charset='utf8'     # encoding format
        )
        # 2. Create a cursor to operate on the table
        self.cursor = self.connect.cursor()

    def process_item(self, item, spider):
        # 3. Write the item data into the database (synchronously by default).
        #    Passing the values as a parameter tuple lets pymysql escape them,
        #    avoiding SQL injection and quoting bugs.
        insert_sql = "INSERT INTO job(zwmc, zwxz, zpyq, gwyq) VALUES (%s, %s, %s, %s)"
        self.cursor.execute(
            insert_sql,
            (item['zwmc'], item['zwxz'], item['zpyq'], item['gwyq'])
        )
        # 4. Commit the operation
        self.connect.commit()
        return item

    def close_spider(self, spider):
        self.cursor.close()
        self.connect.close()
```
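The parameter-tuple form of cursor.execute is the key detail: building the SQL with string formatting breaks on values containing quotes and opens the door to SQL injection. The same pattern can be sketched with the stdlib sqlite3 module, which needs no MySQL server (sqlite3 uses ? placeholders where pymysql uses %s; the table and values below are made up for illustration):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE job (zwmc TEXT, zwxz TEXT, zpyq TEXT, gwyq TEXT)")

# A value with an apostrophe would break a string-formatted INSERT
item = {'zwmc': "O'Brien's team", 'zwxz': '10k-15k',
        'zpyq': '3 years experience', 'gwyq': 'Python'}

# The driver escapes each parameter itself, so quotes are handled safely
cur.execute(
    "INSERT INTO job (zwmc, zwxz, zpyq, gwyq) VALUES (?, ?, ?, ?)",
    (item['zwmc'], item['zwxz'], item['zpyq'], item['gwyq']),
)
conn.commit()

saved = cur.execute("SELECT zwmc FROM job").fetchone()[0]
```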
5. Finally, go to the settings.py file and uncomment the ITEM_PIPELINES setting.
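Assuming the project is named jobspider (the name used in the import paths above), the uncommented block looks like:

```python
# settings.py
ITEM_PIPELINES = {
    # Lower numbers run earlier; 300 is the value Scrapy's template uses
    'jobspider.pipelines.JobspiderPipeline': 300,
}
```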