Python Crawler Novice Tutorial: Scraping Mobile App Data with pyspider

1. Mobile App Data - Preface

This post continues practicing pyspider. I recently looked up some usage tips for the framework and found the documentation hard to follow, though that hasn't been an obstacle to using it so far. I expect to write about five tutorials on this framework. Today's tutorial adds image handling, so pay attention to that part.

2. Mobile App Data - Page Analysis

The site I want to crawl is http://www.liqucn.com/rj/new/. Looking it over, it has roughly 20,000 pages with 9 items per page, so about 180,000 records in total. That's a nice dataset to grab for later data analysis, and also good practice for optimizing database writes.

The site has essentially no anti-crawling measures, so it can be crawled freely; just keep the concurrency modest, since we don't want to put too much pressure on someone else's server.

Analyzing the pages shows that pagination is driven by the URL, which keeps things simple: we first read the total page count from the first page, then generate all the page URLs in a batch.

http://www.liqucn.com/rj/new/?page=1
http://www.liqucn.com/rj/new/?page=2
http://www.liqucn.com/rj/new/?page=3
http://www.liqucn.com/rj/new/?page=4

Code to get the total page count:

from pyspider.libs.base_handler import *


class Handler(BaseHandler):
    crawl_config = {
    }

    @every(minutes=24 * 60)
    def on_start(self):
        self.crawl('http://www.liqucn.com/rj/new/?page=1', callback=self.index_page)

    @config(age=10 * 24 * 60 * 60)
    def index_page(self, response):
        # The ".current" element holds the last page number
        total = int(response.doc(".current").text())
        for page in range(1, total + 1):
            self.crawl('http://www.liqucn.com/rj/new/?page={}'.format(page), callback=self.detail_page)
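
If the ".current" element ever contains stray characters, int() will raise. A small defensive variant (my own sketch, not from the original post) pulls out the first run of digits instead:

import re

def parse_total(text):
    # extract the first run of digits; fall back to a single page
    m = re.search(r"\d+", text or "")
    return int(m.group()) if m else 1

# inside index_page:
# total = parse_total(response.doc(".current").text())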

Next, a summary adapted from the official documentation (originally a Chinese translation), kept here as a constant reminder.

Code analysis:

The def on_start(self) method is the entry point; it runs when you click the run button on the web console.

self.crawl(url, callback=self.index_page) calls the API to generate a new crawl task,
            which is added to the queue of pages to fetch.

The def index_page(self, response) method receives a Response object.
            response.doc is a pyquery object, a jQuery-like selector (see the pyquery sketch after this list).

def detail_page(self, response) returns a result set (here, a list of dicts).
            By default the result is written to the resultdb database (SQLite, unless another database is specified at startup). You can also override
            the on_result(self, result) method to choose where results are saved.
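
As a quick illustration of the pyquery style (a standalone sketch, independent of pyspider; the HTML snippet here is made up):

from pyquery import PyQuery as pq

# a tiny made-up document, selected jQuery-style
doc = pq('<ul><li class="item"><a href="/a">App A</a></li>'
         '<li class="item"><a href="/b">App B</a></li></ul>')
for li in doc('.item').items():
    print(li('a').text(), li('a').attr('href'))
# -> App A /a
#    App B /b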

More knowledge:
@every(minutes=24*60, seconds=0) tells the scheduler to run the on_start method once a day.
@config(age=10*24*60*60) tells the scheduler that the request expires after 10 days;
    within those 10 days the same request is ignored. The parameter can also be set via self.crawl(url, age=10*24*60*60) or in crawl_config.
@config(priority=2) sets the priority: the bigger the number, the earlier the task runs (see the combined sketch below).
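
A minimal handler sketch combining these decorators (the URL is only a placeholder):

from pyspider.libs.base_handler import *


class DecoratorDemo(BaseHandler):

    @every(minutes=24 * 60)  # run on_start once a day
    def on_start(self):
        self.crawl('http://example.com/', callback=self.index_page)

    @config(age=10 * 24 * 60 * 60, priority=2)  # fresh for 10 days; higher priority runs earlier
    def index_page(self, response):
        return {"title": response.doc('title').text()}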

The paginated URLs have all been added to the crawl queue. Now we analyze the data on each page; this is implemented in the detail_page function.

    @config(priority=2)
    def detail_page(self, response):
        docs = response.doc(".tip_blist li").items()
        dicts = []
        for item in docs:
            title = item(".tip_list>span>a").text()
            pubdate = item(".tip_list>i:eq(0)").text()
            info = item(".tip_list>i:eq(1)").text()
            # App category: the text after ": "
            category = info.split(": ")[1]
            # App size: the text after "/", or "0MB" when missing
            size = info.split("/")
            if len(size) == 2:
                size = size[1]
            else:
                size = "0MB"
            app_type = item("p").text()
            mobile_type = item("h3>a").text()

            # Set up the logo image download; the save= dict is handed
            # to the callback as response.save
            img_url = item(".tip_list>a>img").attr("src")
            # File name is everything after the last "/"
            filename = img_url[img_url.rindex("/") + 1:]
            self.crawl(img_url, callback=self.save_img,
                       save={"filename": filename}, validate_cert=False)

            # Collect the record for this app
            dicts.append({
                "title": title,
                "pubdate": pubdate,
                "category": category,
                "size": size,
                "app_type": app_type,
                "mobile_type": mobile_type
            })
        return dicts

The data is returned as one list per page. We override on_result to save it into MongoDB. Before that, let's write the MongoDB connection code.

import os
import json

import pymongo
import pandas as pd

DATABASE_IP = '127.0.0.1'
DATABASE_PORT = 27017
DATABASE_NAME = 'sun'
DIR_PATH = './images'  # folder for downloaded logo images; pick any path (the original post does not show this constant)

client = pymongo.MongoClient(DATABASE_IP, DATABASE_PORT)
db = client.sun
db.authenticate("dba", "dba")  # removed in newer pymongo; see the note below
collection = db.liqu  # collection we will insert data into
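
Note that Database.authenticate was removed in pymongo 4.x. On a recent pymongo, pass the credentials to the client instead (a sketch using the same dba/dba credentials as above):

client = pymongo.MongoClient(DATABASE_IP, DATABASE_PORT,
                             username="dba", password="dba",
                             authSource=DATABASE_NAME)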

Data storage:

    def on_result(self, result):
        if result:
            self.save_to_mongo(result)

    def save_to_mongo(self, result):
        # round-trip through pandas/json to normalize the list of dicts
        df = pd.DataFrame(result)
        content = json.loads(df.T.to_json()).values()
        if collection.insert_many(content):
            print('Stored in MongoDB successfully')
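
The pandas round-trip mainly normalizes the values into plain JSON types. Since detail_page already returns a list of plain dicts, a simpler variant (my own sketch) inserts them directly:

    def save_to_mongo(self, result):
        # result is the list of dicts returned by detail_page
        if isinstance(result, list) and result:
            collection.insert_many(result)
            print('Stored in MongoDB successfully')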

With that, we have completed most of the work. The final step is to download and save the images, and then we're done!

3. Mobile App Data - Picture Storage

Downloading an image simply means saving the network image bytes to a path on disk. Remember that the save={"filename": ...} dict we passed to self.crawl comes back here as response.save.

    def save_img(self, response):
        content = response.content
        file_name = response.save["filename"]
        # Create the folder if it does not exist
        if not os.path.exists(DIR_PATH):
            os.makedirs(DIR_PATH)

        file_path = DIR_PATH + "/" + file_name

        with open(file_path, "wb") as f:
            f.write(content)
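
A failed download would otherwise write a broken file. A slightly more defensive variant (my own sketch, not from the original post) skips non-200 responses:

    def save_img(self, response):
        # skip failed or empty downloads
        if response.status_code != 200 or not response.content:
            return
        file_name = response.save["filename"]
        os.makedirs(DIR_PATH, exist_ok=True)  # no error if the folder already exists
        with open(os.path.join(DIR_PATH, file_name), "wb") as f:
            f.write(response.content)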

With the tasks complete and saving in place, adjust the crawler's fetch rate, click run, and watch the data roll in.
