Storing data crawled with the Scrapy framework in a MySQL database

Scrapy is currently one of the more mainstream frameworks for crawling website data, and it is also quite simple to use.
1. After the project is created, change the value of ROBOTSTXT_OBEY to False in settings.py. Otherwise Scrapy obeys the robots.txt protocol by default, and you may not be able to crawl any data.
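
In a newly generated project this setting already exists in settings.py; it only needs to be switched from True to False:

ROBOTSTXT_OBEY = False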
2. Start writing your spider in the spider file. You can use XPath or CSS selectors to parse the data. Once the data is parsed, declare your fields in items.py

import scrapy
class JobspiderItem(scrapy.Item):
    zwmc = scrapy.Field()  # job title
    zwxz = scrapy.Field()  # salary
    zpyq = scrapy.Field()  # recruiting requirements
    gwyq = scrapy.Field()  # job requirements

3. Then, in the spider file, first import the item class declared in items.py, create an item object, assign the parsed values to its fields, and finally don't forget to yield the item

from jobspider.items import JobspiderItem

# zwmc, money, zpyq and gzyq hold the values parsed above with XPath/CSS selectors
item = JobspiderItem()
item['zwmc'] = zwmc
item['zwxz'] = money
item['zpyq'] = zpyq
item['gwyq'] = gzyq
yield item
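
Putting it together, the spider's parse callback would look roughly like the sketch below. The spider name, the start URL and every XPath expression here are placeholders, since they depend entirely on the site being crawled:

import scrapy
from jobspider.items import JobspiderItem

class Job51Spider(scrapy.Spider):
    name = 'job'                                # placeholder spider name
    start_urls = ['https://example.com/jobs']   # placeholder start URL

    def parse(self, response):
        # One node per job posting; the XPath expressions are placeholders
        for job in response.xpath('//div[@class="job"]'):
            item = JobspiderItem()
            item['zwmc'] = job.xpath('.//h3/text()').get()
            item['zwxz'] = job.xpath('.//span[@class="salary"]/text()').get()
            item['zpyq'] = job.xpath('.//p[@class="zpyq"]/text()').get()
            item['gwyq'] = job.xpath('.//p[@class="gwyq"]/text()').get()
            yield item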

4. The next step is to save the data into the MySQL database:
1. Import the item class and the pymysql module in the pipelines.py file

from jobspider.items import JobspiderItem
import pymysql

2. Then connect to the database and write to it. Here I created the MySQL database and data table directly in MySQL beforehand rather than in the code (a possible schema is sketched after the pipeline code below)

class JobspiderPipeline(object):
    def __init__(self):
        # 1. Establish the database connection
        self.connect = pymysql.connect(
            # localhost connects to the local database
            host='localhost',
            # Port of the MySQL server
            port=3306,
            # Database user name
            user='root',
            # Password of the local database
            passwd='123456',
            # Name of the database (not the table)
            db='job51',
            # Encoding
            charset='utf8'
        )
        # 2. Create a cursor to operate on the table
        self.cursor = self.connect.cursor()

    def process_item(self, item, spider):
        # 3. Insert the item data into the database; the write is synchronous by default.
        #    Use parameterized placeholders so quotes in the data do not break the SQL.
        insert_sql = "INSERT INTO job(zwmc, zwxz, zpyq, gwyq) VALUES (%s, %s, %s, %s)"
        self.cursor.execute(insert_sql, (item['zwmc'], item['zwxz'], item['zpyq'], item['gwyq']))

        # 4. Commit the transaction and pass the item on
        self.connect.commit()
        return item

    def close_spider(self, spider):
        self.cursor.close()
        self.connect.close()
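
For reference, the job51 database and the job table used above were created directly in MySQL beforehand. A schema along these lines would match the INSERT statement; the column types and lengths are assumptions and should be adjusted to the scraped data:

CREATE DATABASE IF NOT EXISTS job51 DEFAULT CHARACTER SET utf8;

USE job51;

CREATE TABLE IF NOT EXISTS job (
    id INT AUTO_INCREMENT PRIMARY KEY,
    zwmc VARCHAR(255),  -- job title
    zwxz VARCHAR(100),  -- salary
    zpyq TEXT,          -- recruiting requirements
    gwyq TEXT           -- job requirements
);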

5. Finally, go to the settings.py file and uncomment the ITEM_PIPELINES setting so that Scrapy actually runs the pipeline
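
This is the block that is commented out by default in the generated settings.py; the dotted path assumes the project module is called jobspider, and 300 is the usual default priority:

ITEM_PIPELINES = {
    'jobspider.pipelines.JobspiderPipeline': 300,
}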
