Focused crawling of Taobao products

A focused crawler for comparing Taobao product prices

Function Description:

1. Objective: fetch Taobao search result pages and extract each product's name and price.

2. Key points to understand: Taobao's search URL interface and how pagination is handled.
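
The pagination scheme can be sketched as follows. This is a minimal illustration that assumes, as the program later in this post does, that the keyword goes in the `q` parameter and that `s` is the offset of the first item on a page, advancing by 44 per page. The helper name `build_search_urls` and the constant `PAGE_SIZE` are hypothetical and not part of the original program.

```python
from urllib.parse import urlencode

PAGE_SIZE = 44  # assumed number of items per result page (the offset step used in this post)


def build_search_urls(goods, depth):
    """Return one search URL per result page for the given keyword."""
    base = 'https://s.taobao.com/search'
    urls = []
    for page in range(depth):
        # 'q' is the search keyword, 's' is the index of the first item on the page
        query = urlencode({'q': goods, 's': PAGE_SIZE * page})
        urls.append(base + '?' + query)
    return urls


for url in build_search_urls('bag', 3):
    print(url)
```

Using `urlencode` also takes care of keywords that contain spaces or non-ASCII characters.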

 

Technical approach: requests + re
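
To show what the `re` half of this approach does, here is a small sketch run against a made-up string. The `sample` text is hypothetical and only imitates the `"raw_title"` and `"view_price"` fields that the program in this post searches for. The real page content may differ.

```python
import re

# Hypothetical fragment shaped like the data embedded in a search result page
sample = ('{"raw_title":"Canvas backpack","view_price":"59.00"},'
          '{"raw_title":"Leather bag","view_price":"128.50"}')

# The capture group (...) makes findall return only the value between the quotes
prices = re.findall(r'"view_price":"([\d.]*)"', sample)
titles = re.findall(r'"raw_title":"(.*?)"', sample)

pairs = list(zip(prices, titles))
print(pairs)
```

The non-greedy `.*?` stops at the first closing quote, so each title is matched separately instead of swallowing the rest of the page.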

Program structure design:

1. Submit the product search request and fetch the result pages in a loop.

2. For each page, extract the product name and price.

3. Print the results to the screen.

 

Important: Taobao requires login authentication before it serves search pages, so each request must carry the cookies of a logged-in session and a browser User-Agent header.
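
A cookie string copied from the browser has to be turned into a dictionary before it can be passed to `requests.get(..., cookies=...)`. The sketch below shows one way to do that. The function name `parse_cookie_header` and the demo values are placeholders, and a real string would come from your own logged-in session.

```python
def parse_cookie_header(raw):
    """Turn 'a=1; b=2' into {'a': '1', 'b': '2'}."""
    cookies = {}
    for part in raw.split(';'):
        part = part.strip()
        if not part or '=' not in part:
            continue  # skip empty or malformed fragments
        name, value = part.split('=', 1)  # split once: values may contain '='
        cookies[name] = value
    return cookies


# Placeholder values only, not a working session
demo = 'thw=cn; token=abc%3D%3D; x=e=1&p=2'
print(parse_cookie_header(demo))
```

Splitting on the first `=` only matters because cookie values often contain `=` themselves.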

import requests
import re

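def getHTMLText(url,kv,cookies):
    # Fetch a page with the given headers and cookies; return "" if the request fails
    try:
        r=requests.get(url,headers = kv,cookies = cookies,timeout = 30)
        r.raise_for_status()
        r.encoding = r.apparent_encoding
        return r.text
    except requests.RequestException:
        return ""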

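def parsePage(ilt,html):
    # Extract prices and titles from the page and append [price, title] rows to ilt.
    # The capture groups return the quoted values directly, so eval() is not needed.
    plt = re.findall(r'"view_price":"([\d.]*)"',html)
    tlt = re.findall(r'"raw_title":"(.*?)"',html)
    # zip stops at the shorter list, so a mismatch in counts cannot raise an error
    for price,title in zip(plt,tlt):
        ilt.append([price,title])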





def printGoodsList(ilt):
    # Print a numbered table of the collected [price, title] rows
    tplt = "{:4}\t{:8}\t{:16}"
    print(tplt.format("No.","Price","Product name"))
    count = 0
    for g in ilt:
        count = count + 1
        print(tplt.format(count,g[0],g[1]))


def main():
    goods = 'A bag'
    depth = 3
    start_url = 'https://s.taobao.com/search?&q=' + goods
    coo = 'thw=cn; t=254ecf83ad9b49c70d383c71e214fab2;cna=kbBgFFO2dU8CAXFzKR/6/wq5; tg=0;enc=xWaBwIc%2BqZfhPca6P6g4cz34emAsVK3LjzRsT%2FkMfk5Ja31%2BmjMxGvBDJ%2B82Q2pJLJ83dUH5lBPAw%2BpI53L4%2BQ%3D%3D; hng=CN%7Czh-CN%7CCNY%7C156; x=e%3D1%26p%3D*%26s%3D0%26c%3D0%26f%3D0%26g%3D0%26t%3D0; tracknick=xxp158125132; lgc=xxp158125132; _cc_=URm48syIZQ%3D%3D; uc3=vt3=F8dByR1TOm%2BGeyLp6rE%3D&id2=VWeYZoAnISaO&nk2=G5htj%2BHk8f8ST03J&lg2=W5iHLLyFOGW7aA%3D%3D; mt=ci=94_1; v=0; cookie2=1794e2e0aad3f137168e0a64b250dced; _tb_token_=e33e0373befbe; isg=BEtLnz2UkxogtcghrE2mANjZ2u8_4F80dJ1mYL1IJwrh3Gs-RbDvsul6spyXJ7da'
    cookies = {}
    for line in coo.split(';'):
        name,value=line.strip().split('=',1)
        cookies[name]=value
    kv = {'User-Agent':'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.67 Safari/537.36'}
    infoList = []
    for i in range(depth):
        try:
            url = start_url+'&s=' + str(44*i)
            html = getHTMLText(url,kv,cookies)
            parsePage(infoList,html)
        except:
            continue
    printGoodsList(infoList)


main()
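
Since the goal is price comparison, the collected rows can also be ordered by price before printing. The following is only a sketch: `sort_by_price` is a hypothetical helper, not part of the program above, and it assumes each row is a `[price, title]` pair with the price stored as a string, as `parsePage` produces.

```python
def sort_by_price(items):
    """Return [price, title] rows ordered from cheapest to most expensive."""
    def price_key(row):
        try:
            return float(row[0])
        except ValueError:
            return float('inf')  # rows with an unparsable price go last
    return sorted(items, key=price_key)


# Made-up rows for illustration
rows = [['128.50', 'Leather bag'], ['59.00', 'Canvas backpack'], ['9.90', 'Coin purse']]
for price, title in sort_by_price(rows):
    print(price, title)
```

Converting to `float` matters here: sorting the raw strings would place '128.50' before '59.00'.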


Posted on Sun, 01 Dec 2019 09:50:14 -0800 by krraleigh