A Python crawler for Movie Paradise: a beginner training project

 

Preface:

This article is very simple and easy to follow; even with zero background it can be mastered quickly. If you have any questions, leave a comment and I will reply as soon as possible. The code for this article is available on GitHub.

1. The importance of crawlers:

If the Internet is compared to a spider web, then a crawler (spider) is a spider crawling around that web. A web spider finds pages through their link addresses: it starts from some page of a site (usually the home page), reads its content, extracts the other link addresses it contains, then follows those links to the next pages, and repeats the cycle until every page of the site has been fetched.
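The loop just described can be sketched as a breadth-first traversal. The `site` dict below is a made-up stand-in for a real website; an actual spider would fetch each page over HTTP and extract links from the HTML.

```python
from collections import deque

def crawl(start, get_links):
    """Breadth-first crawl: start from one page, collect its links,
    then visit each newly discovered page until nothing new remains."""
    seen = {start}
    queue = deque([start])
    order = []
    while queue:
        page = queue.popleft()
        order.append(page)
        for link in get_links(page):
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return order

# A toy site: each page maps to the pages it links to (hypothetical paths).
site = {
    "/index": ["/movies", "/about"],
    "/movies": ["/movies/1", "/movies/2"],
    "/about": [],
    "/movies/1": ["/index"],  # cycles are handled by the `seen` set
    "/movies/2": [],
}
print(crawl("/index", lambda p: site.get(p, [])))
# → ['/index', '/movies', '/about', '/movies/1', '/movies/2']
```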


Here are some netizens' replies on why they learned to write crawlers: 1. Before buying a house in Beijing, nobody wants prices to suddenly soar. Lianjia's analysis of housing prices covers only a small part of the data, far from enough for my needs. So one evening I spent a few hours writing a crawler that pulled down all the residential-complex listings in Beijing together with the full historical transaction records of every complex.

2. My partner is a salesperson at an Internet company who has to collect all kinds of business contact information and then make calls. So I wrote a collection script that grabbed a batch of data for her to use, while her colleagues kept searching for the same data by hand until midnight every day.

2. Practice: crawl the Movie Paradise movie details pages

1. Analyze the page and get the detail-page URLs from the first list page

Start from Movie Paradise's latest-movies section. You can see that the URL of the first page is www.ygdy8.net/html/gndy/d... , the second page is www.ygdy8.net/html/gndy/d... , and pages 3 and 4 follow the same pattern.

from lxml import etree
import requests


url = 'http://www.ygdy8.net/html/gndy/dyzz/list_23_1.html'

headers = {
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/69.0.3497.100 Safari/537.36',
}

response = requests.get(url, headers=headers)

# response.text decodes with the encoding requests guesses, and here the
# guess is wrong, producing garbled output. Use response.content instead
# and decode it ourselves with the page's actual encoding.
# print(response.text)
# print(response.content.decode('gbk'))
print(response.content.decode(encoding="gbk", errors="ignore"))
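A quick illustration of why the explicit decode matters. The byte string below merely simulates what a GBK-encoded server response looks like; the site itself declares a GBK-family charset.

```python
# Simulate the GBK-encoded bytes a page like this one sends.
raw = "电影天堂".encode("gbk")

# Decoding with the wrong codec does not raise an error -- it silently
# produces mojibake...
wrong = raw.decode("latin-1")
# ...while the correct codec recovers the original text.
right = raw.decode("gbk")

print(wrong != "电影天堂")  # True: the bad guess garbled the text
print(right)                # 电影天堂
```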

Taking the first page as an example, the printed data looks like this:

 


 

By analyzing the HTML source of Movie Paradise, we can see that each table tag corresponds to one movie.

 


 

Get the detail URL of each movie through XPath:

# `text` is the decoded page source from the previous snippet
html = etree.HTML(text)
detail_urls = html.xpath("//table[@class='tbspan']//a/@href")
for detail_url in detail_urls:
    print(detail_url)  # still relative; the domain is prepended later
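The XPath above can be sanity-checked against a tiny stand-in for the list page before running it on the live site. The hrefs in this snippet are invented; real ones point at dated detail pages.

```python
from lxml import etree

# A minimal stand-in for the list page: two movie tables, as described above.
snippet = """
<html><body>
  <table class="tbspan"><tr><td><a href="/html/gndy/dyzz/1.html">A</a></td></tr></table>
  <table class="tbspan"><tr><td><a href="/html/gndy/dyzz/2.html">B</a></td></tr></table>
</body></html>
"""
html = etree.HTML(snippet)
hrefs = html.xpath("//table[@class='tbspan']//a/@href")
print(hrefs)  # ['/html/gndy/dyzz/1.html', '/html/gndy/dyzz/2.html']
```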

The results are as follows:

 

2. Organize the code and build the list-page URLs for the first 7 pages

from lxml import etree
import requests

# domain name
BASE_DOMAIN = 'http://www.ygdy8.net'
# url = 'http://www.ygdy8.net/html/gndy/dyzz/list_23_1.html'

HEADERS = {
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/69.0.3497.100 Safari/537.36',
}

def spider():
    base_url = 'http://www.ygdy8.net/html/gndy/dyzz/list_23_{}.html'
    for x in range(1, 8):  # pages 1 through 7
        url = base_url.format(x)
        print(url)  # the movie-list URL for each page, e.g. http://www.ygdy8.net/html/gndy/dyzz/list_23_1.html

if __name__ == '__main__':
    spider()

3. Collect the detail-page URL of each movie

def get_detail_urls(url):
    response = requests.get(url, headers=HEADERS)

    # response.text decodes with the encoding requests guesses, which here
    # produces garbled output; decode response.content with GBK instead.
    # print(response.text)
    # print(response.content.decode('gbk'))
    # print(response.content.decode(encoding="gbk", errors="ignore"))
    text = response.content.decode(encoding="gbk", errors="ignore")

    # Get the details url of each movie through xpath
    html = etree.HTML(text)
    detail_urls = html.xpath("//table[@class='tbspan']//a/@href")

    # Prefix the domain to every relative url. Note that in Python 3,
    # map() returns a lazy iterator, not a list. It is equivalent to:
    # def abc(url):
    #     return BASE_DOMAIN + url
    # for index, detail_url in enumerate(detail_urls):
    #     detail_urls[index] = abc(detail_url)
    detail_urls = map(lambda url: BASE_DOMAIN + url, detail_urls)

    return detail_urls
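String concatenation works here because the site's hrefs are root-relative, but `urllib.parse.urljoin` is a more robust way to build absolute URLs. The detail path in this example is made up for illustration.

```python
from urllib.parse import urljoin

BASE_DOMAIN = 'http://www.ygdy8.net'

# urljoin copes with root-relative paths and already-absolute URLs alike,
# where plain concatenation would produce a malformed address.
absolute = urljoin(BASE_DOMAIN, '/html/gndy/dyzz/20200426/99999.html')
external = urljoin(BASE_DOMAIN, 'http://example.com/other.html')

print(absolute)  # http://www.ygdy8.net/html/gndy/dyzz/20200426/99999.html
print(external)  # http://example.com/other.html
```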

4. Extract the data from the movie details page

 

 

# Analysis details page
def parse_detail_page(url):
    movie = {}
    response = requests.get(url,headers = HEADERS)
    text = response.content.decode('gbk', errors='ignore')
    html = etree.HTML(text)
    # title = html.xpath("//div[@class='title_all']//font[@color='#07519a']")  # first attempt, revised below

    # This prints element objects such as [<Element font at 0x10cb42c8>, <Element font at 0x10cb42308>]
    # print(title)

    # To display their content, the elements have to be serialized:
    # for x in title:
    #     print(etree.tostring(x, encoding='utf-8').decode('utf-8'))

    # We actually want the text itself, so add /text() to the expression:
    title = html.xpath("//div[@class='title_all']//font[@color='#07519a']/text()")[0]
    movie['title'] = title

    zoomE = html.xpath("//div[@id='Zoom']")[0]  # a common ancestor container that simplifies the lookups below
    imgs = zoomE.xpath(".//img/@src")  # the poster and screenshot images
    cover = imgs[0]
    if len(imgs) > 1:
        screenshot = imgs[1]
        movie['screenshot'] = screenshot
    # print(cover)
    movie['cover'] = cover

    infos = zoomE.xpath(".//text()")

    for index, info in enumerate(infos):
        # The labels below are the literal field names on the page, in
        # Chinese and padded with full-width spaces, e.g. "◎年　　代" = year.
        if info.startswith('◎年　　代'):
            info = info.replace('◎年　　代', '').strip()  # strip removes surrounding whitespace
            movie['year'] = info
        elif info.startswith('◎产　　地'):  # country of origin
            info = info.replace('◎产　　地', '').strip()
            movie['country'] = info
        elif info.startswith('◎类　　别'):  # category
            info = info.replace('◎类　　别', '').strip()
            movie['category'] = info
        elif info.startswith('◎豆瓣评分'):  # Douban rating
            info = info.replace('◎豆瓣评分', '').strip()
            movie['douban_rating'] = info
        elif info.startswith('◎片　　长'):  # duration
            info = info.replace('◎片　　长', '').strip()
            movie['duration'] = info
        elif info.startswith('◎导　　演'):  # director
            info = info.replace('◎导　　演', '').strip()
            movie['director'] = info
        elif info.startswith('◎主　　演'):  # starring
            actors = []
            actor = info.replace('◎主　　演', '').strip()
            actors.append(actor)
            # There are several leading actors, each in its own text node on
            # this site, so walk the following nodes one by one.
            for x in range(index + 1, len(infos)):
                actor = infos[x].strip()
                if actor.startswith('◎'):  # reached the next label, stop
                    break
                actors.append(actor)
            movie['actor'] = actors
        elif info.startswith('◎简　　介'):  # synopsis
            profile = ''
            for x in range(index + 1, len(infos)):
                if infos[x].startswith('◎获奖情况'):
                    break
                profile += infos[x].strip()
            movie['profile'] = profile
            # print(movie)
        elif info.startswith('◎获奖情况'):  # awards
            awards = []
            for x in range(index + 1, len(infos)):
                if infos[x].startswith('【下载地址】'):  # "[Download address]"
                    break
                award = infos[x].strip()
                awards.append(award)
            movie['awards'] = awards
            # print(awards)

    download_url = html.xpath("//td[@bgcolor='#fdfddf']/a/@href")[0]
    movie['download_url'] = download_url
    return movie

The code above extracts every field of the movie. To make it easy for readers to compare against the page format, the author saved the page's HTML as "movie.html" while writing this article and put it on GitHub.
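For completeness, here is a sketch of how the pieces above fit together. The two callables are passed in as parameters so the flow can be exercised with stubs instead of live HTTP; the stub URLs and return values below are invented.

```python
def run_spider(get_detail_urls, parse_detail_page, pages=range(1, 8)):
    """Walk the list pages, then parse every movie's detail page.
    The two callables are the functions defined earlier in the article."""
    base_url = 'http://www.ygdy8.net/html/gndy/dyzz/list_23_{}.html'
    movies = []
    for x in pages:
        for detail_url in get_detail_urls(base_url.format(x)):
            movies.append(parse_detail_page(detail_url))
    return movies

# Dry run with stubs: one list page that "links" to two detail pages.
fake_detail_urls = lambda url: [url + '#m1', url + '#m2']
fake_parse = lambda url: {'download_url': url}
movies = run_spider(fake_detail_urls, fake_parse, pages=[1])
print(len(movies))  # 2
```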

Final results:

Summary: whether you are learning Python for employment or as a hobby, remember that project development experience is always the core.

The text and images in this article come from the Internet and my own notes. They are for learning and exchange only and have no commercial use; copyright belongs to the original authors. If there is any problem, please contact us promptly so it can be handled.

Tags: Python encoding github Mac

Posted on Sun, 26 Apr 2020 02:50:34 -0700 by sdotsen