Where to store the crawled data

Last time, we took QQ music as an example, mainly learning how to request data with parameters (get request) and the use of Request Headers, and finally crawled to the data of Jay Chou's song information.

So here comes a new question: how to save the crawled data?

You may come up with a solution: copy and paste the crawled data one by one, and then save it as an Excel file. Such a scheme is good for storing more than ten pieces of data, but when we crawl to more than a few hundred pieces of data, such a scheme is obviously not desirable.

At this stage, we have learned to acquire data, analyze data and extract data. The only difference is to store data, which is also an indispensable step in the whole process of crawler.

Therefore, the core content of this section is the correct way to store data. Later, we will still take QQ music as an example to store the data of Jay Chou's song information we crawled to in the last pass.

How to store data

In fact, there are two commonly used ways to store data: storing it as a csv format file and storing it as an Excel file (not a copy and paste one).

I guess at this time, you will want to ask what is "csv" and what is the difference between it and Excel file?

Earlier, I mentioned that json is a special string. In fact, csv is also a format of string file. The syntax of organizing data is to add a separator between strings -- a line break between lines, and a comma between characters in the same line.

It can be opened with any text editor (such as Notepad), Excel or save as csv format (because excel supports csv format file).

Run the following three lines of code, and you'll be able to see what a csv is.

file=open('test.csv','a+')
#Create the test.csv file to append the read-write mode
file.write('Captain America,Iron Man,Spider-Man')
#Write test.csv file
file.close()
#Close file

Download the csv file we just wrote to the local computer and open it in Notepad. You will see:

Open in Excel, as follows:

Did you find out? The comma in the csv file can be used to separate the peer strings.

Why do you want a separator? Because if we don't add it, the data will be piled together, which will be disorderly and inconvenient for us to extract and search later. It's also a way to organize data in a regular way.

In addition, using csv format to store data, reading and writing is more convenient, easy to implement, and the file will be smaller than excel file. But the csv file lacks many functions of Excel file itself, such as not embedding images and charts, not generating formulas.

As for Excel files, I don't need to say that you also know that they are spreadsheets. It has a special format for saving files, namely xls and xlsx (the file format of Excel2003 is xls, and that of Excel2007 and later is xlsx).

Well, you know csv and Excel files, we can continue to learn the basic knowledge of data storage - how to write and read data from csv format files and Excel files.

Basic knowledge of data storage

The modules that need to be referenced are different when they are stored as csv format files and Excel files. We need to use csv module to operate csv files, and openpyxl module to operate Excel files.

Rest assured, neither module is complex.

Basic knowledge: csv write and read

OK. Now please follow my pace. Let's find out how to write data to csv format file.

First, we will refer to the csv module. Because Python comes with its own csv module, we can reference it without installation.

Are you confused? Clearly, we can use the open function to write the previous csv writing. Why should we refer to the csv module first? Answer: directly use the module written by others, which is more concise and powerful than the open() function. You can feel it later. So, why not?

import csv
#Reference the csv module.
csv_file = open('demo.csv','w',newline='',encoding='utf-8')
#To create a csv file, we first call the open() function, passing in parameters: filename "demo.csv", write mode "w", newline = '', encoding='utf-8 '.

Then we have to create a new csv file called "demo.csv.".

"w" is the writer, that is, the file write mode. It will write the newly added content in the form of overwriting the original content.

Please attach a document reading and writing mode table. You don't need to memorize it and then check the table when you don't know what mode to use.

The reason for adding the newline = '' parameter is that you can avoid double the line spacing in the csv file (that is, you can avoid blank lines between rows in the table). Adding encoding = 'utf-8' can avoid error or miscoding caused by coding problems.

After creating the csv file, we use the csv.writer() function to create a writer object.

import csv
#Reference the csv module.
csv_file = open('demo.csv','w',newline='',encoding='utf-8')
#Call the open() function to open the csv file, passing in parameters: filename "demo.csv", write mode "w", newline = '', encoding='utf-8 '.
writer = csv.writer(csv_file)
# Create a writer object with the csv.writer() function.

How to write new content to the csv file? The answer is -- call the writerow() method of the writer object.

writer.writerow(['Film','Bean score'])
#With the writerow() function, you can write a line of text "movie" and "Douban score" in the csv file.

Note: the writerow() function needs to put the list parameter, so we have to write the content to be written into the list. It's like "movie", "Douban score".

We try to write the names of two movies and their corresponding Douban scores. Finally, we close the file and finish writing the csv file.

import csv
#Reference the csv module.
csv_file = open('demo.csv','w',newline='',encoding='utf-8')
#Call the open() function to open the csv file, passing in parameters: filename "demo.csv", write mode "w", newline = '', encoding='utf-8 '.
writer = csv.writer(csv_file)
# Create a writer object with the csv.writer() function.
writer.writerow(['Film','Bean score'])
#Call the writerow() method of the writer object to write a line of text "movie" and "Douban score" in the csv file.
writer.writerow(['Galaxy escort','8.0'])
#Write a line of text "Galactic escort" and "8.0" in the csv file.
writer.writerow(['The Avengers','8.1'])
#Write a line of text "Avenger alliance" and "8.1" in the csv file.
csv_file.close()
#After writing, closing the file is done!

After running the code, a file named "demo.csv" is created. Open this file in Excel or notepad and you will see——

[practice time] please type the code written in the above csv file. After all, code can't be read without knocking. Knock it up! Tip: first, introduce the csv module, open the csv file with the open() function, and don't forget to add the newline = '' parameter; then use the csv.writer() function to create a writer object, and then call the writerow() method to write content to the csv file.

The knowledge of writing data with csv module is clear to us. Next, we can continue to learn how to read the data of the csv file.

Take the "demo.csv" file you just created as an example. You can run the following code first to see what results will be read.

import csv
csv_file=open('demo.csv','r',newline='',encoding='utf-8')
reader=csv.reader(csv_file)
for row in reader:
    print(row)
csv_file.close()

Did you print out what you just wrote to the "demo.csv" file? If yes, we have successfully read it!

Now, let's look at the code that just read the "demo.csv" file line by line. Please read the comments carefully.

import csv
csv_file = open('demo.csv','r',newline='',encoding='utf-8')
reader = csv.reader(csv_file)
for row in reader:
    print(row)

Lines 1 and 2: import the csv module. Open the "demo.csv" file with open(), where "r" is read mode and "newline = '" is to avoid double line spacing. encoding='utf-8 'can avoid error or garbled code caused by coding problems.

Line 3: create a reader object with the csv.reader() function.

Lines 4 and 5: loop through each line of the reader object with for. If you print row, you can read the contents of the "demo.csv" file.

Now, type the code above that reads the "demo.csv" file, and don't be lazy.

Great, we have done the writing and reading of csv files!

Add: there are many functions and methods in the csv module itself. Attach the official document link of the csv module:

https://yiyibooks.cn/xx/python_352/library/csv.html#module-csv

We don't need to write down all these functions and methods, as long as we learn to query when we need to use them.

Next, as long as we do the writing and reading of Excel files, it is no longer difficult to store data.

Basic knowledge: Excel writing and reading

Before we start to talk about the writing and reading of Excel files, we need to know a little about the basic concepts of EXCEL documents (it's time to test your understanding of Excel to the point of ๑❛ᴗ❛๑۶).

[question and answer session] what is the meaning of workbook, worksheet and cell in Excel?

An Excel document is also called a workbook. Each workbook can have multiple worksheets. The currently opened worksheets are also called active tables.

There are rows and columns in each worksheet, and the cell that a particular row intersects with a column is called cell. For example, the grid that intersects column A and row 1 in the figure above can be directly expressed as cell A1.

Clear the basic concept of Excel, we can say how openpyxl module operates Excel files. Write then read as usual.

Note: we need to install the openpyxl module in advance. The terminal of the course has been installed. If you want to operate locally, you need to install it locally. (installation method: window computer: input command at terminal: pip install openpyxl, press enter; mac computer: input command at terminal: pip3 install openpyxl, press enter)

After the openpyxl module is installed, it should be referenced first, and then a new workbook can be created through the openpyxl.Workbook() function. The code is as follows:

import openpyxl 
#Reference openpyxl.
wb = openpyxl.Workbook()
#Using the openpyxl.Workbook() function to create a new workbook object is to create a new empty Excel file.

After you create a new workbook, you still have to get the worksheet. Otherwise, the program will not know which worksheet to write the content into.

sheet = wb.active
#wb.active is to get the activity table of this workbook, usually the first worksheet.
sheet.title = 'new title'
#You can rename the sheet with. Title. Now the name of the first sheet will be changed from the default "sheet1" to "new title".

After adding a worksheet, we can manipulate cells and write to them.

sheet['A1'] = 'Universal universe' 
#Assign "Marvel Universe" to cell A1 of the first worksheet, that is, write "Marvel Universe" to cell A1.

To write content to a cell, just locate to a specific cell, such as A1 (according to Excel coordinates, A1 represents the cell intersecting the first row of the first column), and then assign a value to the cell.

If we want to write a line to the worksheet, we need to use the append function.

row = ['Captain America','Iron Man','Spider-Man']
#Write a list of the line we want to write and assign it to row.
sheet.append(row)
#Use sheet.append() to add this line of text to the table.

What if we want to write more than one line, but more than one line at a time? Please take 10 seconds to think about it.

Have you come up with a result? (tip: use for loop and click again to get the answer)

rows = [['Captain America','Iron Man','Spider-Man'],['yes','Man Wei','universe', 'Classic','character']]
#First write the multi line content to be written into a list, then put it into a large list and assign it to rows.
for i in rows:
    sheet.append(i)
#Traverse rows, and add the traversal content to the table at the same time, so that multiple rows can be written.
print(rows)
#Print rows

After successful writing, we must remember to save this Excel file, otherwise we will write it for nothing!

wb.save('Marvel.xlsx')
#Save the new Excel file and name it "Marvel.xlsx"

In this way, we can write the code written in Excel file and run the code.

import openpyxl 
wb=openpyxl.Workbook() 
sheet=wb.active
sheet.title='new title'
sheet['A1'] = 'Universal universe'
rows= [['Captain America','Iron Man','Spider-Man'],['yes','Man Wei','universe', 'Classic','character']]
for i in rows:
    sheet.append(i)
print(rows)
wb.save('Marvel.xlsx')

Here, please type the code written in Excel above. You can replace the text to be written with the content you want to write.

Congratulations, Excel file writing has learned! There should be applause here!

Next, let's finish reading the Excel file, the last basic knowledge point of storing data.

Please run the code, let's read the "Marvel.xlsx" file just written.

import openpyxl 
#Code written:
wb = openpyxl.Workbook() 
sheet = wb.active
sheet.title = 'new title'
sheet['A1'] = 'Universal universe'
rows = [['Captain America','Iron Man','Spider-Man','Raytheon'],['yes','Man Wei','universe', 'Classic','character']]
for i in rows:
    sheet.append(i)
print(rows)
wb.save('Marvel.xlsx')

#Code read:
wb = openpyxl.load_workbook('Marvel.xlsx')
sheet = wb['new title']
sheetname = wb.sheetnames
print(sheetname)
A1_cell = sheet['A1']
A1_value = A1_cell.value
print(A1_value)

'new title' printed by the program is the name of the worksheet; "Marvel Universe" is the text we just wrote into cell A1.

Line by line look at the code for reading Excel file:

import openpyxl 
#Code written:
wb = openpyxl.Workbook() 
sheet = wb.active
sheet.title = 'new title'
sheet['A1'] = 'Universal universe'
rows = [['Captain America','Iron Man','Spider-Man','Raytheon'],['yes','Man Wei','universe', 'Classic','character']]
for i in rows:
    sheet.append(i)
print(rows)
wb.save('Marvel.xlsx')

#Code read:
wb = openpyxl.load_workbook('Marvel.xlsx')
sheet = wb['new title']
sheetname = wb.sheetnames
print(sheetname)
A1_cell = sheet['A1']
A1_value = A1_cell.value
print(A1_value)

Line 14: call the openpyxl. Load ﹣ workbook() function to open the "Marvel.xlsx" file.

Line 15: get the worksheet named "new title" in the "Marvel.xlsx" workbook.

Sixteenth, seventeenth line of code: sheetnames is used to get the names of all the sheets in the workbook. If you don't know how many worksheets there are in the workbook, you can print out the names of the worksheets.

Line 18-20 Code: assign cell A1 in the "new title" worksheet to cell A1, and then use the cell value attribute to print the value of cell A1.

After learning how to write Excel, reading Excel is still relatively simple, right?

If you are interested in openpyxl module and want to have a deeper understanding, it is recommended to read the official document of openpyxl module:

https://openpyxl.readthedocs.io/en/stable/

We're done with the basics of storing data. Then we enter into the practical operation project - to store Jay Chou's song information.

First, we have to choose how to store the data. Due to the limited space, we only choose to use openpyxl module to store Excel files for demonstration.

Project: store Jay Chou's song information

We have climbed to the data of Jay Chou's song information in the last pass, so as long as we add the code to store data on the basis of the code in the last pass, we will complete the whole process of crawling [obtain data → analyze data → extract data → store data]!

The code of the last close is as follows:

import requests

url = 'https://c.y.qq.com/soso/fcgi-bin/client_search_cp'
for x in range(5):

    params = {
        'ct': '24',
        'qqmusic_ver': '1298',
        'new_json': '1',
        'remoteplace': 'txt.yqq.song',
        'searchid': '64405487069162918',
        't': '0',
        'aggr': '1',
        'cr': '1',
        'catZhida': '1',
        'lossless': '0',
        'flag_qc': '0',
        'p': str(x + 1),
        'n': '20',
        'w': 'Jay Chou',
        'g_tk': '5381',
        'loginUin': '0',
        'hostUin': '0',
        'format': 'json',
        'inCharset': 'utf8',
        'outCharset': 'utf-8',
        'notice': '0',
        'platform': 'yqq.json',
        'needNewCode': '0'
    }
    # Encapsulate parameters as dictionaries
    res_music = requests.get(url, params=params)
    # Call the get method to download the list
    json_music = res_music.json()
    # Use the json() method to turn the response object into a list / dictionary
    list_music = json_music['data']['song']['list']
    # Take dictionaries layer by layer to get the list of songs
    for music in list_music:
        # List music is a list, and music is its element
        print(music['name'])
        # Find the song name with name as the key
        print('Album:' + music['album']['name'])
        # Find album name
        print('Playback time:' + str(music['interval']) + 'second')
        # Find playback time
        print('Play link: https://y.qq.com/n/yqq/song/' + music['mid'] + '.html\n\n')
        # Find playback link

According to the steps of writing Excel files, we can first add the code of importing openpyxl module, creating workbooks and getting Worksheets on the code of the last pass.

import requests,openpyxl
#Import the openpyxl module on the last code.
wb = openpyxl.Workbook()
#Create Workbook
sheet = wb.active
#Get activity sheet for Workbook
sheet.title = 'song'
#The worksheet is renamed song.

Since we want to store it as an Excel file, we need to add a header first. For example, if we want to store the song name, album, play time and play link now, we can write "song name", "album", "play time" and "play link" in cells A1, B1, C1 and D1 respectively.

sheet['A1'] ='Song name'
sheet['B1'] ='Album of origin'
sheet['C1'] ='Duration of play'
sheet['D1'] ='Playback link'

Next, use the append() function to write all song names, albums, playback duration and playback links to Excel files one by one.

Finally, save this file, and the data will be stored by us all.

The complete code is as follows (please focus on the annotated code):

import requests,openpyxl
wb=openpyxl.Workbook()  
#Create Workbook
sheet=wb.active 
#Get activity sheet for Workbook
sheet.title='restaurants' 
#Worksheet rename

sheet['A1'] ='Song name'     #Add header and assign value to cell A1
sheet['B1'] ='Album of origin'   #Add header and assign value to cell B1
sheet['C1'] ='Duration of play'   #Add header and assign value to cell C1
sheet['D1'] ='Playback link'   #Add header, assign value to cell D1

url = 'https://c.y.qq.com/soso/fcgi-bin/client_search_cp'
for x in range(5):
    params = {
        'ct': '24',
        'qqmusic_ver': '1298',
        'new_json': '1',
        'remoteplace': 'txt.yqq.song',
        'searchid': '64405487069162918',
        't': '0',
        'aggr': '1',
        'cr': '1',
        'catZhida': '1',
        'lossless': '0',
        'flag_qc': '0',
        'p': str(x + 1),
        'n': '20',
        'w': 'Jay Chou',
        'g_tk': '5381',
        'loginUin': '0',
        'hostUin': '0',
        'format': 'json',
        'inCharset': 'utf8',
        'outCharset': 'utf-8',
        'notice': '0',
        'platform': 'yqq.json',
        'needNewCode': '0'
    }

    res_music = requests.get(url, params=params)
    json_music = res_music.json()
    list_music = json_music['data']['song']['list']
    for music in list_music:
        name = music['name']
        # Use name as the key to find the song name and assign the song name to name
        album = music['album']['name']
        # Find the album name and assign it to album
        time = music['interval']
        # Find the playing time and assign the time to time
        link = 'https://y.qq.com/n/yqq/song/' + str(music['mid']) + '.html\n\n'
        # Find play link and assign link to link
        sheet.append([name,album,time,link])
        # Write name, album, time and link as a list, and use append function to write multiple lines to Excel
        print('Song name:' + name + '\n' + 'Album of origin:' + album +'\n' + 'Duration of play:' + str(time) + '\n' + 'Playback link:'+ link)
        
wb.save('Jay.xlsx')            
#Finally, save and name the Excel file

Run the code and the "Jay.xlsx" file will be created. Open this file to see the stored data.

Next is a simple review (the students who just chose to skip the csv module can also have a look at the review content of the csv module, after all, review the old and learn the new).

#Code written by csv:

import csv
csv_file = open('demo.csv','w',newline='')
writer = csv.writer(csv_file)
writer.writerow(['Film','Bean score'])
csv_file.close()

#Code read by csv:

import csv
csv_file = open('demo.csv','r',newline='')
reader=csv.reader(csv_file)
for row in reader:
    print(row)

#Code written by Excel:

import openpyxl 
wb = openpyxl.Workbook() 
sheet = wb.active
sheet.title ='new title'
sheet['A1'] = 'Universal universe'
rows = [['Captain America','Iron Man','Spider-Man','Raytheon'],['yes','Man Wei','universe', 'Classic','character']]
for i in rows:
    sheet.append(i)
print(rows)
wb.save('Marvel.xlsx')

#Code read by Excel:

import openpyxl
wb = openpyxl.load_workbook('Marvel.xlsx')
sheet = wb['new title']
sheetname = wb.sheetnames
print(sheetname)
A1_value = sheet['A1'].value
print(A1_value)

In this connection, we refer to the csv module and openpyxl module. These two modules are written by predecessors, we can use them to realize the operation of data storage.

It is with the modules written by predecessors that we can write a new code that can realize the storage function without any trouble.

This kind of knowledge precipitation created by predecessors, which is shared to us in a selfless way, is called open source spirit. In the programming world, this spirit is the most precious wealth.

It is also because of open source that we can appreciate and obtain the spiritual wealth of our predecessors. For example, with the help of other functions and methods of openpyxl module, we can even realize automatic office (automatic processing of complicated data Excel files, etc.), freeing our hands.

Of course, this requires learning more about this module. However, this is not the focus of this course. Maybe we will set up an automatic office course in the future.

The purpose of saying this is to let you understand the greatness of the open source spirit, and the deeper meaning - to completely reduce repetitive work and improve efficiency.

Finally, I hope that we can not only be beneficiaries of the open source spirit, but also be advocates and practitioners of the open source spirit.

Published 60 original articles, won praise 7, visited 6350
Private letter follow

Tags: Excel encoding JSON REST

Posted on Thu, 06 Feb 2020 00:24:18 -0800 by grayscale2005.