Downloading QQ Music song files has also been implemented, but that part of the code is not publicly available for copyright reasons. Please use this project for learning and exchange only, and support licensed music.



Project introduction

  • I needed music information for a project but could not find a satisfactory music corpus online for a long time, so I wrote a QQ Music crawler with Scrapy.
  • Since I only need Chinese songs, I used this crawler to collect information on 490,000+ songs by the top 6400 mainland, Hong Kong and Taiwan singers on QQ Music. The resource is also shared on Baidu cloud (for learning and communication only, not for commercial use; in case of infringement, please contact me for deletion).
  • QQ Music's song information is populated dynamically with JavaScript. Although singer and song information is requested in plain text with GET, the developers have added redundant information to the URL request parameters and encrypted some of them, so most of the effort went into understanding and analyzing the URLs.

Running Environment


Usage

Enter the project root directory and run the following command:

scrapy crawl qqmusic

Song Format

The crawled song information is stored in a music file in the root directory, where each line represents a song and is stored in json format.

Field description of the song:

  • singer_name: The name of the singer, in an array, because a song may be sung by multiple singers
  • song_name: Song name
  • subtitle: the subtitle of the song
  • album_name: album name
  • singer_id: singer id, in array form
  • singer_mid: mid of the singer, in array form
  • song_time_public: Song release time
  • song_type: song type
  • language: song language
  • song_id: song ID
  • song_mid: song mid
  • song_url: URL for song playback
  • lyric: lyrics
  • hot_comments: Highlighted comments on the song, in array form (only highlighted comments are crawled; some less popular songs have latest comments but no highlighted ones). If there are no highlighted comments, the value is set to "null"
    • comment_name: The nickname of the reviewer
    • comment_text: Comment content
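
As a rough illustration of how the output described above might be consumed, the sketch below reads a file with one JSON object per line and extracts the highlighted comment texts. It assumes each line is valid JSON and that an absent set of highlighted comments appears either as the string "null" or as an empty value; the helper names iter_songs and hot_comment_texts are hypothetical and not part of the project.

```python
import json


def iter_songs(path):
    """Yield one song dict per non-empty line of a JSON-lines file."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:
                yield json.loads(line)


def hot_comment_texts(song):
    """Return comment_text values; a missing or "null" hot_comments gives []."""
    comments = song.get("hot_comments")
    if not comments or comments == "null":
        return []
    return [c["comment_text"] for c in comments]
```

For example, `for song in iter_songs("music"): print(song["song_name"], len(hot_comment_texts(song)))` would list each song with its number of highlighted comments, assuming the output file is named music as described above.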

Example of a crawled record, for "Say or don't cry (with May Axin)":

	'singer_name': ['Jay Chou'],
	'song_name': 'Say or don't cry ( with May Axin)',
	'subtitle': '',
	'album_name': 'Say or don't cry ( with May Axin)',
	'singer_id': [4558],
	'singer_mid': ['0025NhlN2yWrP4'],
	'song_time_public': '2019-09-16',
	'song_type': 0,
	'language': 0,
	'song_id': 237773700,
	'song_mid': '001qvvgF38HVc4',
	'song_url': '',
	'hot_comments': [{
		'comment_name': 'Cohen',
		'comment_text': '<Say or Don't Cry is a work the listener is looking forward to. The gradual introduction of Piano Prelude and subsequent strings set a warm tone for the whole song.'
	}, {
		'comment_name': 'No one on the ferry',
		'comment_text': '"I stick to my music, I like my music, who calls me Jay Chou."Perhaps he likes it because he is so dedicated to music that he is fascinated by this musical genius.'
	}, {
		'comment_name': 'Hard candy',
		'comment_text': 'For many people in the late 80s and 90s, youth may just fade away. I still vaguely remember the smell of seven miles in the summer, and the cool-looking boy singing Simple Love.'
	}, {
		'comment_name': 'Gardenia ink',
		'comment_text': 'Jay Style song heavy return, long-awaited song.Fang Wenshan once again wrote his own lyrics and got lost in the prelude.The most surprising thing is Axin's dedication.'
	}, {
		'comment_name': 'Sendai Sailing Height',
		'comment_text': 'Boxes bought by my brother in junior high school piled up into tapes on Jay's cover.'
	}, {
		'comment_name': 'Rose Boy',
		'comment_text': 'This year, Jeren is 40 years old, but I still have that personalized boy in my mind who says, "Can I sing more and talk less?"....'
	}, {
		'comment_name': 'Snail..',
		'comment_text': 'I think there must be a song in your youth that belongs to Jay Yes.Lie in your heart, turn over occasionally, and just meet that mood, and some new emotions and new perceptions arise.'
	}, {
		'comment_name': 'This user has been blocked',
		'comment_text': 'Jay Chou is a generation of youth. He remembers that the song Blue and White Porcelain listened to at the party in 2008 suddenly opened in Mausetton, and there would be so good songs in the world.'
	}, {
		'comment_name': 'Tamarix chinensis TAO',
		'comment_text': 'Popularity top Jay Chou❗💎💖✨🌈'
	}, {
		'comment_name': 'Fingering fragrance Zhang Daxian z',
		'comment_text': 'The once proud young man, It's also the year of confusion.'
	}, {
		'comment_name': 'Fingering fragrance Zhang Daxian z',
		'comment_text': '40 Jarren, the old man, sat there, Looking at the past with affection'
	}, {
		'comment_name': 'Fingering fragrance Zhang Daxian z',
		'comment_text': 'It's hard to forget that when you first met you, under Big Ben in Red Dust Host, you laughed Sweet like a rainbow-colored maltose on the horizon.'
	}, {
		'comment_name': 'Tamarix chinensis TAO',
		'comment_text': 'Hand in hand, step by step, step by step, step by step, look at the sky Look at the stars, one two three four connected'
	}, {
		'comment_name': '\u2062',
		'comment_text': 'Jay Chou I QQ The only hero in music'
	}, {
		'comment_name': 'Fingering fragrance Zhang Daxian z',
		'comment_text': 'Youth never dies! The Oriental palace opens its window, the roll of rosemary of dreams, the world of blue and white porcelains is matchless'
	}],
	'lyric': 'Say or don't cry ( with May Axin) - Jay Chou (Jay Chou)\n Word: Fang Wenshan\n Song: Jay Chou\n Jay Chou:\n No contact with later life\n I always listen to others\n What's wrong with you? What's wrong with you?\n It's me who can't let go\n Stay in the corner when there are many people\n In case someone asks me\n What's wrong with you? You bowed your head\n Protect me without complaining\n The phone starts hiding and never tells me\n Not accustomed to living alone\n Leave me and have a good time\n Fear of disturbing me who wants freedom\n At this point you still care\n What do people think of me\n Desperately explaining that it's not my fault that you're leaving\n Looking at your sad words but not saying\n You'll smile, let go and cry to let me go\n Ashin:\n The phone starts hiding and never tells me\n Not accustomed to living alone\n Leave me and have a good time\n Fear of disturbing me who wants freedom\n At this point you still care\n What do people think of me\n Desperately explaining that it's not my fault that you're leaving\n In:\n Looking at your sad words but not saying\n You'll smile, let go and cry to let me go\n Jay Chou:\n You have nothing but to fuel my dream\n Ashin:\n How long has it been painful\n Jay Chou:\n How long has passed\n In:\n Still looking for a reason to wait for me'


General logic of the crawler

  • Crawl the specified number of singers first
  • Get the song list of each singer based on the singer's id (the song list contains some information about each song, but not the lyrics or comments)
  • Get the lyrics of a song based on the song id
  • Get the comments on a song based on the song id
  • Write the songs to the output file
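
The steps above can be sketched as a plain loop. This is only an outline under stated assumptions: the four fetch_* callables are hypothetical stand-ins for the real requests (which the project issues through Scrapy), and the record fields follow the field description earlier in this document.

```python
import json


def crawl(fetch_singers, fetch_songs, fetch_lyric, fetch_comments, out):
    """Run the singer -> song list -> lyric -> comments -> write pipeline.

    All four fetch_* arguments are hypothetical callables supplied by the
    caller; `out` is any writable text file object.
    Returns the number of songs written.
    """
    count = 0
    for singer in fetch_singers():
        for song in fetch_songs(singer["singer_mid"]):
            record = dict(song)  # song list already carries the basic fields
            record["lyric"] = fetch_lyric(song["song_id"], song["song_mid"])
            record["hot_comments"] = fetch_comments(song["song_id"])
            # One JSON object per line, keeping Chinese text readable.
            out.write(json.dumps(record, ensure_ascii=False) + "\n")
            count += 1
    return count
```

In the actual Scrapy spider each step would be a callback that yields further requests rather than a nested loop, but the order of the steps is the same.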

Corpus Sharing

This resource is only for learning and communication, not for commercial use. In case of infringement, please contact for deletion.

Corpus name | Corpus address | Corpus description
490,000+ song information | Baidu Disk [Extraction Code: uokb] | Contains information on 490,000+ songs by the top 6400 mainland, Hong Kong and Taiwan singers on QQ Music, including song information, lyrics, highlighted comments, etc.

Resolve the url of QQ music

QQ Music's singer and song information is populated dynamically with JavaScript, so it cannot be obtained by crawling the HTML page and parsing its content. Instead, the format of the URLs requested by the page has to be worked out.

url parsing of singer list

Open the QQ Music singer page and, in the browser's developer tools, find the URL that requests the singer list, as follows:

You can see that the url carries many parameters, including g_tk, loginUin, hostUin, format, inCharset, outCharset, notice, platform, needNewCode, data

Put the URL into Postman and try removing the parameters one by one to find the useful ones. In the end, only the data parameter actually affects the request. Removing the other parameters gives the simplified URL as follows:

Comparing the simplified URL with the singer page shows that the data parameter contains several actual request parameters:

  • area: The area of the singer (mainland, Hong Kong, Taiwan, Europe, America, etc.). -100: All, 200: Mainland, 2: Hong Kong and Taiwan, 5: Europe and America, 4: Japan, 3: Korea, 6: Other
  • genre: Singer style (pop, hip-hop, etc.). -100: All, 1: Pop, 6: Hip-hop, 2: Rock, 4: Electronic, 3: Folk, 8: R&B, 10: Folk, 9: Light Music, 5: Jazz, 14: Classical, 25: Country, 20: Blues
  • cur_page: Page number of the current artist list
  • index: (cur_page - 1) * page_size (index is the starting index of the current page, page_size is the number of artists per page)

Holding the area and genre variables fixed and comparing the URLs that request pages 1, 2 and 3 of the singer list reveals the pattern in index and cur_page.

In the following three URLs (the ** markers are added manually for ease of description), the number marked with ** after index is the variable index, and the number marked with ** after cur_page is the variable cur_page. The singer page shows 80 singers per page, so when requesting page n of the singer list, cur_page = n and index = 80(n-1).

**0**%2C%22cur_page%22%3A**1**%7D%7D%7D
**80**%2C%22cur_page%22%3A**2**%7D%7D%7D
**160**%2C%22cur_page%22%3A**3**%7D%7D%7D

From the above analysis, you can get the url format of the list of requested singers as follows:

singer_list_url = "{area}%2C%22sex%22%3A-100%2C%22genre%22%3A{genre}%2C%22index%22%3A-100%2C%22sin%22%3A{index}%2C%22cur_page%22%3A{cur_page}%7D%7D%7D"
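
A small sketch of the paging rule and of what the percent-encoded template decodes to. Only the tail of the template shown above is used, because the base of the URL is not reproduced in this document; the names SINGER_LIST_TAIL, singer_page_params and singer_list_tail are hypothetical.

```python
from urllib.parse import unquote

# Tail of the singer-list template exactly as given above (base URL elided).
SINGER_LIST_TAIL = (
    "{area}%2C%22sex%22%3A-100%2C%22genre%22%3A{genre}"
    "%2C%22index%22%3A-100%2C%22sin%22%3A{index}"
    "%2C%22cur_page%22%3A{cur_page}%7D%7D%7D"
)


def singer_page_params(page, page_size=80):
    """Return index and cur_page for a 1-based singer page number."""
    if page < 1:
        raise ValueError("page numbers start at 1")
    return {"index": page_size * (page - 1), "cur_page": page}


def singer_list_tail(area, genre, page, page_size=80):
    """Fill the template tail for one page of the singer list."""
    return SINGER_LIST_TAIL.format(
        area=area, genre=genre, **singer_page_params(page, page_size)
    )


if __name__ == "__main__":
    # Decoding shows the JSON fragment hidden in the data parameter.
    print(unquote(singer_list_tail(area=200, genre=-100, page=2)))
```

Decoding the tail with unquote makes it visible that the starting offset computed as index is carried in the field named sin, while the field literally named index is fixed at -100 in the template.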

Song list url parsing

Similarly, find the URL that requests the song list:

Filter out the useless parameters to get a simplified url:

Parameters hidden in data:

  • singerMid: mid of the singer
  • num: equal to page_size, which represents the number of songs per page
  • begin: page * page_size (begin is the starting index of the current page)

From the above analysis, you can get the url format of the list of requested songs as follows:

song_list_url = "{singer_mid}%22%2C%22begin%22%3A{begin}%2C%22num%22%3A{num}%7D%2C%22module%22%3A%22musichall.song_list_server%22%7D%7D"
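
The song-list paging can be sketched the same way. Here the page counter is assumed to start at 0, which is what begin = page * page_size implies; SONG_LIST_TAIL is the template tail shown above (the base URL is not reproduced here), and song_list_tails is a hypothetical helper.

```python
from urllib.parse import unquote

# Tail of the song-list template exactly as given above (base URL elided).
SONG_LIST_TAIL = (
    "{singer_mid}%22%2C%22begin%22%3A{begin}%2C%22num%22%3A{num}"
    "%7D%2C%22module%22%3A%22musichall.song_list_server%22%7D%7D"
)


def song_list_tails(singer_mid, page_num, page_size):
    """Yield the filled template tail for each of page_num pages (0-based)."""
    for page in range(page_num):
        yield SONG_LIST_TAIL.format(
            singer_mid=singer_mid,
            begin=page * page_size,  # starting index of this page
            num=page_size,           # songs per page
        )


if __name__ == "__main__":
    for tail in song_list_tails("0025NhlN2yWrP4", page_num=2, page_size=30):
        print(unquote(tail))
```

The page_num and page_size arguments correspond to the SONG_PAGE_NUM and SONG_PAGE_SIZE parameters described later; the value 30 in the usage lines is only an illustration.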

Request Lyric url parsing

The response to the song list URL does not include the lyrics, so the lyrics must be obtained through an additional request. The URL that requests the lyrics is as follows:

Sending this URL from Postman does not return a correct response. After some investigation, it turns out that the request only works when a Referer header is added:


The url of the requested lyrics is simplified to:

It is easy to see that:

  • musicid: song_id
  • "004RDW5Q2ol2jj" in referer means song_mid

The URL to get the requested lyrics is as follows, and lyric_url needs to take the referer header with it:

lyric_url = "{song_id}&format=json"

referer = "{song_mid}.html"
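
A minimal sketch of attaching the Referer header, using only the standard library. The two base strings are passed in by the caller because the full URLs are not reproduced in this document; the example.invalid addresses in the usage lines are placeholders, not the real endpoints, and build_lyric_request is a hypothetical helper.

```python
from urllib.request import Request


def build_lyric_request(lyric_base, referer_base, song_id, song_mid):
    """Build a lyric request that carries the song page as its Referer.

    lyric_base and referer_base are the URL prefixes that precede
    {song_id} and {song_mid} in the templates above.
    """
    url = f"{lyric_base}{song_id}&format=json"
    referer = f"{referer_base}{song_mid}.html"
    return Request(url, headers={"Referer": referer})


if __name__ == "__main__":
    req = build_lyric_request(
        "https://example.invalid/lyric?musicid=",  # placeholder prefix
        "https://example.invalid/song/",           # placeholder prefix
        song_id=237773700,
        song_mid="001qvvgF38HVc4",
    )
    print(req.full_url)
    print(req.get_header("Referer"))
```

Inside a Scrapy spider the same idea would be expressed by passing headers={"Referer": referer} when constructing the scrapy.Request.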

Song review url parsing

The URLs for finding song reviews are as follows:

After simplifying the parameters, the following URLs are obtained:

Parameter description:

  • topid: song_id of the song
  • pagenum: Page number of the "Latest Comments"
  • pagesize: Number of "Latest Comments" per page

Note: pagenum and pagesize only affect the "Latest Comments" results. They do not affect the "Highlighted Comments", and there is no parameter that controls which "Highlighted Comments" are returned.

The url format for requesting song comments is as follows:

comment_url = '{song_id}&cmd=8&pagenum={pagenum}&pagesize={pagesize}'

Song url

The url format of the song is as follows: {song_mid}.html

Lyrics Analysis

Take Jay Chou's "Say or don't cry (with May Axin)" as an example.

The lyrics obtained from the lyric_url request are in the following format, which looks cluttered and contains characters

Get neat lyrics from regular expressions (separated by \n between each Lyric sentence):

Say or Don't Cry (with Axin) - Jay Chou\n
 Word: Fang Wenshan\n
 Song: Jay Chou\n
 Jay Chou: \n
 Without contact with later life\n
 I always hear people say\n
 What's wrong with you?\n
 I'm the one who can't get rid of it\n
 Stay in the corner when there are a lot of people\n
 Just in case someone asks me\n
 What happened to you when you lowered your head\n
 Protect me without complaining\n
 The phone starts hiding and never says to me\n
 Not accustomed to living alone\n
 Leave me and I'll have a good time \n
 Fear of disturbing me who wants freedom\n
 At this point you still care about \n
 What do people think of me\n
 Desperately explaining that it was not my fault that you were going to go\n
 Looking at your sad words but not saying\n
 You'll smile and let go and cry to let me go\n
 Ashin: \n
 The phone starts hiding and never says to me\n
 Not accustomed to living alone\n
 Leave me and I'll have a good time \n
 Fear of disturbing me who wants freedom\n
 At this point you still care about \n
 What do people think of me\n
 Desperately explaining that it was not my fault that you were going to go\n
 Combined: \n
 Looking at your sad words but not saying\n
 You'll smile and let go and cry to let me go\n
 Jay Chou: \n
 You don't have anything but cheer on my dream\n
 Ashin: \n
 How long has it been painful\n
 Jay Chou: \n
 How long has passed\n
 Combined: \n
 Still looking for a reason to wait for me
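
The raw response format is not reproduced in this document, so the following is only a sketch of the kind of regular-expression cleanup described above. It assumes, purely for illustration, that the raw lyric text contains HTML character references (such as &#10; for a line break) and bracketed tags such as timestamps; the function name clean_lyric and both assumptions are hypothetical and may not match the project's actual code.

```python
import html
import re

# Matches one bracketed tag such as [00:05.10] or [ti:Title] (assumed format).
TAG_RE = re.compile(r"\[[^\]]*\]")


def clean_lyric(raw):
    """Decode character references, drop bracketed tags, join lines with \\n."""
    text = html.unescape(raw)   # e.g. "&#58;" -> ":", "&#10;" -> newline
    text = TAG_RE.sub("", text)
    lines = (line.strip() for line in text.splitlines())
    return "\n".join(line for line in lines if line)


if __name__ == "__main__":
    raw = "[ti&#58;Demo]&#10;[00&#58;00.00]Line one&#10;[00&#58;05.10]Line two"
    print(clean_lyric(raw))
```

Decoding the character references before removing the tags matters in this sketch, because the colons inside the tags are themselves encoded.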

Parameter description for

  • DOWNLOAD_DELAY: The interval between requests
  • ROBOTSTXT_OBEY: Compliance with the crawler protocol of the website
  • SINGER_PAGE_NUM: Number of artist pages crawled
  • SINGER_PAGE_SIZE: Number of singers per page
  • SONG_PAGE_NUM: Number of pages crawled by each singer's song
  • SONG_PAGE_SIZE: Number of songs per page per singer
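
For orientation, a settings fragment using these parameters might look like the sketch below. DOWNLOAD_DELAY and ROBOTSTXT_OBEY are standard Scrapy settings; the other four are this project's own parameters. All values are illustrative assumptions, except that 80 singers per page comes from the URL analysis above and 80 pages of 80 singers matches the 6400 singers mentioned in the introduction.

```python
# Sketch of a Scrapy settings fragment; values are illustrative assumptions.

DOWNLOAD_DELAY = 1        # seconds between requests
ROBOTSTXT_OBEY = True     # whether to honour the site's robots.txt (assumed)

SINGER_PAGE_NUM = 80      # number of singer-list pages to crawl
SINGER_PAGE_SIZE = 80     # singers per page, as observed on the singer page

SONG_PAGE_NUM = 3         # song-list pages per singer (illustrative)
SONG_PAGE_SIZE = 30       # songs per page per singer (illustrative)

# Upper bounds implied by the values above.
MAX_SINGERS = SINGER_PAGE_NUM * SINGER_PAGE_SIZE
MAX_SONGS_PER_SINGER = SONG_PAGE_NUM * SONG_PAGE_SIZE
```

The two derived constants are not settings the project defines; they are included only to show how the page counts and page sizes bound the size of the crawl.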

Preventing the crawler from being banned

  • Use a user agent pool and take turns selecting one entry as the user agent
  • Set the download delay DOWNLOAD_DELAY to 1 or greater. (At first I was worried that QQ Music would detect the crawler, so I set DOWNLOAD_DELAY to 1, that is, one request per second. The crawl was too slow, so I changed it to 0.1 and finally to 0, and found that QQ Music has no anti-crawling measures for the lyric URL.)
  • Use an IP pool to change the IP address dynamically
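
The first measure above, rotating user agents, could be sketched as a Scrapy downloader middleware. This is an illustration, not the project's actual code: the class name RotateUserAgentMiddleware is hypothetical, and only two entries of the pool are repeated here for brevity. It relies only on the process_request(request, spider) hook that Scrapy calls on downloader middlewares.

```python
import itertools

# Two entries from the user agent pool, repeated here for brevity.
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/22.0.1207.1 Safari/537.1",
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/536.5 (KHTML, like Gecko) Chrome/19.0.1084.9 Safari/536.5",
]


class RotateUserAgentMiddleware:
    """Assign user agents from the pool in round-robin order."""

    def __init__(self, user_agents=USER_AGENTS):
        self._cycle = itertools.cycle(user_agents)

    def process_request(self, request, spider):
        # Overwrite the header on every outgoing request.
        request.headers["User-Agent"] = next(self._cycle)
        return None  # let Scrapy continue handling the request
```

To take effect, such a class would have to be registered in the DOWNLOADER_MIDDLEWARES setting under the project's own module path.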

The user agent pool is as follows:

"Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/22.0.1207.1 Safari/537.1",
"Mozilla/5.0 (X11; CrOS i686 2268.111.0) AppleWebKit/536.11 (KHTML, like Gecko) Chrome/20.0.1132.57 Safari/536.11",
"Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.6 (KHTML, like Gecko) Chrome/20.0.1092.0 Safari/536.6",
"Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.6 (KHTML, like Gecko) Chrome/20.0.1090.0 Safari/536.6",
"Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/ Safari/537.1",
"Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/536.5 (KHTML, like Gecko) Chrome/19.0.1084.9 Safari/536.5",
"Mozilla/5.0 (Windows NT 6.0) AppleWebKit/536.5 (KHTML, like Gecko) Chrome/19.0.1084.36 Safari/536.5",
"Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",
"Mozilla/5.0 (Windows NT 5.1) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",
"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_8_0) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",
"Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1062.0 Safari/536.3",
"Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1062.0 Safari/536.3",
"Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",
"Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",
"Mozilla/5.0 (Windows NT 6.1) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",
"Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.0 Safari/536.3",
"Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/535.24 (KHTML, like Gecko) Chrome/19.0.1055.1 Safari/535.24",
"Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/535.24 (KHTML, like Gecko) Chrome/19.0.1055.1 Safari/535.24"

Future Work

  • Analyze how QQ Music's song files are requested
  • Korean and English lyrics contain special symbols, some of which are not yet unescaped correctly
  • Improve the settings so that users have more flexibility to crawl songs from different regions and styles

Unsolved problem

The program runs normally, but when the crawl ends (the log shows 'finish_reason': 'finished', indicating that the crawl task is complete), an error is reported. The results are not affected, but the cause of the error is not yet known.

2019-12-21 15:08:53 [scrapy.core.engine] INFO: Closing spider (finished)
2019-12-21 15:08:53 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/exception_count': 8,
 'downloader/exception_type_count/twisted.internet.error.TimeoutError': 8,
 'downloader/request_bytes': 682269284,
 'downloader/request_count': 1067470,
 'downloader/request_method_count/GET': 1067470,
 'downloader/response_bytes': 1563129445,
 'downloader/response_count': 1067462,
 'downloader/response_status_count/200': 1067460,
 'downloader/response_status_count/404': 2,
 'dupefilter/filtered': 30476,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2019, 12, 21, 7, 8, 53, 816618),
 'item_scraped_count': 489860,
 'log_count/DEBUG': 1557332,
 'log_count/ERROR': 1,
 'log_count/INFO': 229,
 'request_depth_max': 3,
 'response_received_count': 1067462,
 'retry/count': 8,
 'retry/reason_count/twisted.internet.error.TimeoutError': 8,
 'scheduler/dequeued': 1067468,
 'scheduler/dequeued/memory': 1067468,
 'scheduler/enqueued': 1067468,
 'scheduler/enqueued/memory': 1067468,
 'start_time': datetime.datetime(2019, 12, 20, 14, 59, 43, 749919)}
2019-12-21 15:08:53 [scrapy.core.engine] INFO: Spider closed (finished)
2019-12-21 15:08:53 [scrapy.utils.signal] ERROR: Error caught on signal handler: <bound method MemoryUsage.engine_stopped of <scrapy.extensions.memusage.MemoryUsage object at 0x000001EFCF7E8630>>
Traceback (most recent call last):
  File "c:\users\administrator\.conda\envs\ppy36\lib\site-packages\twisted\internet\", line 151, in maybeDeferred
    result = f(*args, **kw)
  File "c:\users\administrator\.conda\envs\ppy36\lib\site-packages\pydispatch\", line 54, in robustApply
    return receiver(*arguments, **named)
  File "c:\users\administrator\.conda\envs\ppy36\lib\site-packages\scrapy\extensions\", line 70, in engine_stopped
    for tsk in self.tasks:
AttributeError: 'MemoryUsage' object has no attribute 'tasks'


