Python Scrapy QQ Music Crawler: Music Download, Song Information, Lyrics, and Hot Comments

QQ Music Crawler (with scrapy)/QQ Music Spider


UPDATE 2019.12.23

Downloading QQ Music audio files has been implemented, but that part of the code is not public for copyright reasons. Please use this project for learning and exchange only, and support licensed music.

 

 

Project introduction

  • I needed music data for a project but could not find a satisfactory music corpus online, so I wrote a QQ Music crawler with scrapy.
  • Since I only need Chinese songs, I used the crawler to collect information on 490,000+ songs by the top 6,400 mainland, Hong Kong, and Taiwan singers on QQ Music. The resource is also shared on Baidu Cloud (for learning and exchange only, not for commercial use; if it infringes your rights, please get in touch and it will be deleted).
  • QQ Music populates song information dynamically with JavaScript. Although singer and song information is requested in plain text via GET, the developers added redundant information to the URL request parameters and encoded some of them, so most of the effort went into understanding and analyzing the URLs.

Running Environment

scrapy==1.5.1

Usage method

Enter the project root directory and run the following command: scrapy crawl qqmusic

Song Format

The crawled song information is stored in the file music in the project root directory. Each line is one song, stored in JSON format.

Field description of the song:

  • singer_name: names of the singers, as an array, because a song may be sung by several singers
  • song_name: song name
  • subtitle: subtitle of the song
  • album_name: album name
  • singer_id: singer ids, as an array
  • singer_mid: singer mids, as an array
  • song_time_public: song release date
  • song_type: song type
  • language: song language
  • song_id: song id
  • song_mid: song mid
  • song_url: URL of the song's playback page
  • lyric: lyrics
  • hot_comments: hot comments on the song, as an array. Only hot comments are crawled; some less popular songs have recent comments but no hot comments, and in that case the field is set to "null"
    • comment_name: nickname of the commenter
    • comment_text: comment content
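As a rough illustration of how the output could be consumed, the sketch below reads the file line by line. It assumes each line is valid JSON and that the file is named music as described above; the helper names iter_songs and hot_comment_texts are made up for this example and are not part of the project.

```python
import json


def iter_songs(path="music"):
    """Yield one song dict per non-empty line of the output file."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:
                yield json.loads(line)


def hot_comment_texts(song):
    """Return the comment texts, treating a missing or "null" field as empty."""
    comments = song.get("hot_comments")
    if not comments or comments == "null":
        return []
    return [c["comment_text"] for c in comments]
```

Checking for "null" explicitly matters because the field holds that string, not an empty array, when a song has no hot comments.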

Example: the crawled record for Say or don't cry (with Xiao Axin):

{
	'singer_name': ['Jay Chou'],
	'song_name': 'Say or don't cry ( with May Axin)',
	'subtitle': '',
	'album_name': 'Say or don't cry ( with May Axin)',
	'singer_id': [4558],
	'singer_mid': ['0025NhlN2yWrP4'],
	'song_time_public': '2019-09-16',
	'song_type': 0,
	'language': 0,
	'song_id': 237773700,
	'song_mid': '001qvvgF38HVc4',
	'song_url': 'https://y.qq.com/n/yqq/song/001qvvgF38HVc4.html',
	'hot_comments': [{
		'comment_name': 'Cohen',
		'comment_text': '<Say or Don't Cry is a work the listener is looking forward to. The gradual introduction of Piano Prelude and subsequent strings set a warm tone for the whole song.I don't know if you have had a good look at the background of this song, "This is about「Appointment」and「Complete」Love songs, the whole song takes piano as the main story line, strings weave Lyric scenes, creating an atmosphere of love movies.Still in Weekly style, Jarren always holds the emotional tone of each song and can always impress his listeners with the finest lyrics and compilations.\\n\\n Of course, this song not only satisfies the audience's expectations, but also gives us some special features. We believe that the addition of the second paragraph of Axin will surprise many listeners.For me, my favorite piano piece of the whole song is warm, pure and touching. Perhaps the first time I listen to it, I don't fully understand what Jarren wants to tell us about a love story, but we can also feel the emotional part from simple lyrics and clean melodies.'
	}, {
		'comment_name': 'No one on the ferry',
		'comment_text': '"I stick to my music, I like my music, who calls me Jay Chou."Perhaps he likes it because he is so dedicated to music that he is fascinated by this musical genius.\\n I believe that a lot of people have him in their youth, from teenagers who had nothing but talent and were unknown to today's modern pop king, Jay Chou accompanied us through our youth with a song, saying that life is like a retrograde journey, how confused and discouraged, but he can always find himself in his songs.This year is not going to be old anymore. Life is still a Jay Chou spinning around on the sidewalk.\\n And this "Say or Don't Cry" is more like a convention, "You will smile and let go without crying to let me go", and we are not far away to make an appointment, this is Jay Chou, that is, youth.\\n'
	}, {
		'comment_name': 'Hard candy',
		'comment_text': 'For many people in the late 80s and 90s, youth may just fade away. I still vaguely remember the smell of seven miles in the summer, and the cool-looking boy singing Simple Love.\\n Listen quietly once, or that familiar style, that familiar Jay. Maybe Jarren's voice has changed, but he still carries on that touch in the song.\\n like Jay I can't forget how many nights singles loop, how many times I toss and turn around.His songs often resonate with you at some point in time.\\n Say no, don't cry, but how many people still burst into tears, either because 40-year-old Jay finally made a new song, or because some of the past, some moved, really can't go back...'
	}, {
		'comment_name': 'Gardenia ink',
		'comment_text': 'Jay Style song heavy return, long-awaited song.Fang Wenshan once again wrote his own lyrics and got lost in the prelude.The most surprising thing is Axin's dedication.One playing the piano and one playing the guitar are immortal partnerships.The melody of about 1 minute and 40 seconds is "Suddenly miss you, where are you?".The melody of 1 minute 47 seconds is "Say Good Happiness", and then enter Axin part. These exquisites can always be hidden in Jeremy's songs.Our youth is full.Slowly piano voice tells of sad love.There is a kind of dedication called silence, without words.There is a love called let go, you can live happily.One year after the last new song was released, how many young people are 40 years old Zhou Dong, and whether the familiar melody still rings in their ears.From the first solo album Jay>Today, we are passionate, and Zhou Dong is constantly surprising.Jay Chou - a charm that represents the singer of the times.Say no, don't you cry?'
	}, {
		'comment_name': 'Sendai Sailing Height',
		'comment_text': 'Boxes bought by my brother in junior high school piled up into tapes on Jay's cover.\\n Tape into the radio, tape turns, every pause, every unique style, every sound in your ear.From ignorance to maturity, they all like Jay Chou's annual ring.\\n In this tape rotating ring, in the list of songs played, in the joy and sadness of life mood, the same thing is Jay Chou's song.\\n His songs have wrapped up my whole youth, spent most of my student life, and accompanied me on countless lonely nights.\\n Perhaps the most perfect interpretation of his song is to listen to it all.\\n Jay Chou is my belief and strength.\\n Now my brother is working, and my school life is fleeting.\\n Jay Chou's love and fear is to hear the melody of his music melt into my heart, and go with it.'
	}, {
		'comment_name': 'Rose Boy',
		'comment_text': 'This year, Jeren is 40 years old, but I still have that personalized boy in my mind who says, "Can I sing more and talk less?"....<Say or Don't Cry is Jarren's heart. Maybe he won't sing as fast in the future as he used to, and he's also worried about whether his songs still meet fans'requirements.,But don't cry.\\n I believe that for many fans, Jarren has already given a complete music ocean, no matter sad or happy, he always seems to have songs matching you, always taking care of his emotions, even today when all kinds of songs are in full blossom, every night still likes to fall asleep in his songs.\\n No crying, no matter in the past or in the future, we still want to let your songs accompany us, accompany the whole youth, accompany for a lifetime.'
	}, {
		'comment_name': 'Snail..',
		'comment_text': 'I think there must be a song in your youth that belongs to Jay Yes.Lie in your heart, turn over occasionally, and just meet that mood, and some new emotions and new perceptions arise.A youth, a week of Jay Ron, listen when depressed, whether gentle or hard, can always sing full of vitality.Good music, talk.\n It seems that every time he is struggling, confused and helpless, he can be slowly cured in his music world, and music and Jay Chou are a complete encounter.\n Youth is hard to keep and summer is over.Life may be different, but cheering on him when he needs us is enough. Youth is youthful.\n At any time, anywhere, I want all the fans to look back and see the same Jay Chou.'
	}, {
		'comment_name': 'This user has been blocked',
		'comment_text': 'Jay Chou is a generation of youth. He remembers that the song Blue and White Porcelain listened to at the party in 2008 suddenly opened in Mausetton, and there would be so good songs in the world.After that, I fell in love with all his songs and spent five years with me in my dormitory listening at night when I bought my own tape to school.This time we have a new song from Zhou Youth. I have been waiting for it for a few days, staying up late, but I think all these waiting are worth it.'
	}, {
		'comment_name': 'Tamarix chinensis TAO',
		'comment_text': 'Popularity top Jay Chou❗💎💖✨🌈\\n Strength singer Jay Chou❗💎💖✨🌈\\n Musical genius Jay Chou❗💎💖✨🌈\\n Asian King Jay Chou❗💎💖✨🌈\\n Family Perfect Jay Chou❗💎💖✨🌈\\n Magic Master Jay Chou❗️💎💖✨🌈\\n Chinese King Jay Chou❗️💎💖✨🌈\\n Unrivalled Jay Chou❗️💎💖✨🌈\\n Milk Tea Fairy Jay Chou❗️💎💖✨🌈\\n Guard the best Jay Chou in the world💖💖💖'
	}, {
		'comment_name': 'Fingering fragrance Zhang Daxian z',
		'comment_text': 'The once proud young man,\\n It's also the year of confusion.\\n From a shy boy who always likes to hide his face under a duck-tongue cap,\\n Now the little public act of scattering dog food when jokes and humorous sentences come back frequently,\\n The new album arrives,\\n It is the greatest fortune in life to meet Jerome in this age.\\n His work illuminates your way and accompanies you through the long night,\\n Your support and inclusiveness helped him change from being unknown to being seen.\\n Maybe one day you're busy living. He falls into silence and gradually no longer intersects.\xa0,\\n Hope that as you get older you can still recall the crazy pursuit of youthfulness for one person.\\n His name is\\n Week!Jet!Lun!Love for life JAY\\n Thank you for your music\\n It gives me the motivation to learn guitar all the time.\\n Thank you for your music\\n Every night I spend with you,\\n Thank you for your music\\n Accompanying me on the basketball court every day,\\n Youth has you, so good, JAY !'
	}, {
		'comment_name': 'Fingering fragrance Zhang Daxian z',
		'comment_text': '40 Jarren, the old man, sat there,\\n Looking at the past with affection,\\n Full of eyes are the images of Van Tecy, who was born at the age of 22......\\n As the most successful and influential singer and musician in Chinese music,\\n15 Golden Music Award, Eighth Greater China Sales Champion, Fourth World Music Award WMA,25 Global Creatives, The Fast Company>The world's top 100 creative people, the third largest song download in the world in 2010, the first Chinese Hollywood theme song in history, Asian King, one of the world's top 10 ghost musicians,\\n This is Jay Chou, who is a belief, a genius, an era, a legend, a light for the Chinese people, a memory of the 80s, a youth of the 90s, and witnesses the growth of the 00s.\\n Thank you.\\n The trajectory you have traveled is a memory of your youth,\\n Salute, Jay Chou!\\n Forever small public act, forever love!'
	}, {
		'comment_name': 'Fingering fragrance Zhang Daxian z',
		'comment_text': 'It's hard to forget that when you first met you, under Big Ben in Red Dust Host, you laughed Sweet like a rainbow-colored maltose on the horizon. They all said that love came too fast like a tornado. You and I were close to each other, met and fell in love with each other. I often watched you in Silence and told you that Say Don't Cry.Forever; in every Sunny Day, I will be on the school basketball court "Wait for you to finish class"; My girl, you are like I "Can't Tell Secret", I want to give you the romantic like "White Balloon". I want to make a "Dandelion Agreement" with you, and then hold your hand and lift up that "Handwritten Before" with your guitar, so I will always take you to see "The Longest Telegraph"Shadow, to see the most beautiful Star Sunny every night, until always; My girl, you are like the Princess Mermaid in the sea, I would like to be your prince, and I love you all the time❤Always by your side.😘'
	}, {
		'comment_name': 'Tamarix chinensis TAO',
		'comment_text': 'Hand in hand, step by step, step by step, step by step, look at the sky\\n Look at the stars, one two three four connected\\n Jay Chou's "Say" trilogy will officially end on September 16, 2019\\n Don't miss it. It's a mistake for a lifetime!'
	}, {
		'comment_name': '\u2062',
		'comment_text': 'Jay Chou\\n I QQ The only hero in music\\n You said you wanted to listen to your mother and drink Grandpa's tea\\n What you say is most beautiful is to hide from the rain under the eaves of your house\\n I mentioned a few lines to describe who you are\\n The deciduous leaves of Champs Elysee don't feel a little hard to follow\\n Time is the antidote and the one I'm taking right now\\n My eighteen years of youth are proud of your presence\\n As long as you're still singing for us and still facing adversity with a smile\\n Wind will not blow far away my youth will not grow old\\n Some people say that you are no longer as talented as you were when Jiang Lang was born\\n That's the pinnacle and brilliance they envy you\\n Say goodbye to crying tonight\\n Thank you for being with me throughout my youth\\n❤️❤️❤️'
	}, {
		'comment_name': 'Fingering fragrance Zhang Daxian z',
		'comment_text': 'Youth never dies!The Oriental palace opens its window, the roll of rosemary of dreams, the world of blue and white porcelains is matchless, but the love letters turn yellow, the order of brochure orchid pavilion, the orphaned drinking of daughters red, the dance of spring and autumn evening is not central, the garden party has a brilliant seven-mile fragrance, Miss-mother in the West box, caressed the chord, played the seventh chapter of the night, the same tune is sentimental.Maple leaves are falling, the sea of flowers is quiet and evening is winding into the country, Ninja stars are sunny, Capricorn is unusual, terraces smell rice, sweet heart rain is shining sunny day, General's roof is dazed, cowards will be eliminated, I do not deserve the look of Zhou Grand Dragon Warriors, go away in the name of my father without excuse, bid farewell to Hodgkin's armor, wear gold armor, play with double truncators, and drift all the way north across thousands of mountains and rivers.Come, go thousands of miles away to the daisy terrace, tornadoes and blue storms, sing from all sides, the memory of the war's end, the track of the scale, ride the time machine, the end of the world goes back to the past, simple love has been stranded in the Flight Diary of love.'
	}],
	'lyric': 'Say or don't cry ( with May Axin) - Jay Chou (Jay Chou)\\n Word: Fang Wenshan\\n Song: Jay Chou\\n Jay Chou:\\n No contact with later life\\n I always listen to others\\n What's wrong with you? What's wrong with you?\\n It's me who can't let go\\n Stay in the corner when there are many people\\n In case someone asks me\\n What's wrong with you? You bowed your head\\n Protect me without complaining\\n The phone starts hiding and never tells me\\n Not accustomed to living alone\\n Leave me and have a good time\\n Fear of disturbing me who wants freedom\\n At this point you still care\\n What do people think of me\\n Desperately explaining that it's not my fault that you're leaving\\n Looking at your sad words but not saying\\n You'll smile, let go and cry to let me go\\n Ashin:\\n The phone starts hiding and never tells me\\n Not accustomed to living alone\\n Leave me and have a good time\\n Fear of disturbing me who wants freedom\\n At this point you still care\\n What do people think of me\\n Desperately explaining that it's not my fault that you're leaving\\n In:\\n Looking at your sad words but not saying\\n You'll smile, let go and cry to let me go\\n Jay Chou:\\n You have nothing but to fuel my dream\\n Ashin:\\n How long has it been painful\\n Jay Chou:\\n How long has passed\\n In:\\n Still looking for a reason to wait for me'
}

  

General logic of the crawler

  • Crawl the specified number of singers first
  • Get a list of songs for each singer based on the singer's id (the list of Songs contains some information about the song, but does not include lyrics and comments)
  • Get lyric information of a song based on the song id
  • Get commentary information about a song based on the song id
  • Write songs to files
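The five steps above can be sketched as a plain Python loop. This is only an outline of the control flow, not the project's actual spider: every fetch_* argument and write_song are hypothetical callables supplied by the caller, and the song list is looked up by singer_mid because that is the identifier the song list URL takes.

```python
def crawl(fetch_singers, fetch_songs, fetch_lyric, fetch_comments, write_song):
    """Run the crawl steps in order using caller-supplied callables."""
    for singer in fetch_singers():                       # step 1: singers
        for song in fetch_songs(singer["singer_mid"]):   # step 2: song list
            song["lyric"] = fetch_lyric(song["song_id"])             # step 3
            song["hot_comments"] = fetch_comments(song["song_id"])   # step 4
            write_song(song)                                         # step 5
```

In a Scrapy spider each step would normally be a separate callback chained through requests rather than a nested loop, but the order of the steps is the same.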

Corpus Sharing

This resource is only for learning and communication, not for commercial use. In case of infringement, please contact for deletion.

Corpus name: 490,000+ song information
Corpus address: Baidu Disk [Extraction Code: uokb]
Corpus description: Contains information on 490,000+ songs by the top 6,400 mainland, Hong Kong, and Taiwan singers on QQ Music, including song information, lyrics, hot comments, etc.

Resolve the url of QQ music

QQ Music populates singer and song information dynamically with JavaScript, so song information cannot be obtained by crawling the HTML page and parsing its content. Instead, the format of the URLs requested by the JavaScript has to be analyzed.

url parsing of singer list

Open the QQ Music singer page and, in the browser's developer tools, find the URL that requests the singer list:

https://u.y.qq.com/cgi-bin/musicu.fcg?-=getUCGI9574303950614538&g_tk=5381&loginUin=0&hostUin=0&format=json&inCharset=utf8&outCharset=utf-8&notice=0&platform=yqq.json&needNewCode=0&data=%7B%22comm%22%3A%7B%22ct%22%3A24%2C%22cv%22%3A0%7D%2C%22singerList%22%3A%7B%22module%22%3A%22Music.SingerListServer%22%2C%22method%22%3A%22get_singer_list%22%2C%22param%22%3A%7B%22area%22%3A-100%2C%22sex%22%3A-100%2C%22genre%22%3A-100%2C%22index%22%3A-100%2C%22sin%22%3A0%2C%22cur_page%22%3A1%7D%7D%7D

You can see that the url carries many parameters, including g_tk, loginUin, hostUin, format, inCharset, outCharset, notice, platform, needNewCode, data

Put the URL in Postman and remove the parameters one at a time to find out which ones matter. It turns out that only the data parameter has any effect on the request. Removing the other parameters gives the following simplified URL:

https://u.y.qq.com/cgi-bin/musicu.fcg?data=%7B%22comm%22%3A%7B%22ct%22%3A24%2C%22cv%22%3A0%7D%2C%22singerList%22%3A%7B%22module%22%3A%22Music.SingerListServer%22%2C%22method%22%3A%22get_singer_list%22%2C%22param%22%3A%7B%22area%22%3A-100%2C%22sex%22%3A-100%2C%22genre%22%3A-100%2C%22index%22%3A-100%2C%22sin%22%3A0%2C%22cur_page%22%3A1%7D%7D%7D

In conjunction with the singer page, a closer analysis of the simplified url above reveals that the data parameter implies a lot of actual request parameters:

  • area: the singer's region (mainland, Hong Kong and Taiwan, Europe and America, etc.). -100: All, 200: Mainland, 2: Hong Kong and Taiwan, 5: Europe and America, 4: Japan, 3: Korea, 6: Other
  • genre: singer style (pop, hip-hop, etc.). -100: All, 1: Pop, 6: Hip-hop, 2: Rock, 4: Electronic, 3: Folk, 8: R&B, 10: Folk, 9: Light Music, 5: Jazz, 14: Classical, 25: Country, 20: Blues
  • cur_page: page number of the current singer list, starting from 1
  • sin: (cur_page - 1) * page_size, the starting index of the current page, where page_size is the number of singers per page. This is the value that fills the {index} placeholder in the URL template below; the literal index parameter in the URL stays at -100

Holding area and genre fixed and comparing the URLs that request pages 1, 2, and 3 of the singer list reveals the pattern in sin and cur_page.

In the following three URLs (the ** markers are added only for ease of description), the number marked with ** after sin is the starting index, and the number marked with ** after cur_page is the page number. The singer page shows 80 singers per page, so requesting page n uses cur_page=n and sin=80(n-1).

https://u.y.qq.com/cgi-bin/musicu.fcg?data=%7B%22comm%22%3A%7B%22ct%22%3A24%2C%22cv%22%3A0%7D%2C%22singerList%22%3A%7B%22module%22%3A%22Music.SingerListServer%22%2C%22method%22%3A%22get_singer_list%22%2C%22param%22%3A%7B%22area%22%3A-100%2C%22sex%22%3A-100%2C%22genre%22%3A-100%2C%22index%22%3A-100%2C%22sin%22%3A**0**%2C%22cur_page%22%3A**1**%7D%7D%7D

https://u.y.qq.com/cgi-bin/musicu.fcg?data=%7B%22comm%22%3A%7B%22ct%22%3A24%2C%22cv%22%3A0%7D%2C%22singerList%22%3A%7B%22module%22%3A%22Music.SingerListServer%22%2C%22method%22%3A%22get_singer_list%22%2C%22param%22%3A%7B%22area%22%3A-100%2C%22sex%22%3A-100%2C%22genre%22%3A-100%2C%22index%22%3A-100%2C%22sin%22%3A**80**%2C%22cur_page%22%3A**2**%7D%7D%7D

https://u.y.qq.com/cgi-bin/musicu.fcg?data=%7B%22comm%22%3A%7B%22ct%22%3A24%2C%22cv%22%3A0%7D%2C%22singerList%22%3A%7B%22module%22%3A%22Music.SingerListServer%22%2C%22method%22%3A%22get_singer_list%22%2C%22param%22%3A%7B%22area%22%3A-100%2C%22sex%22%3A-100%2C%22genre%22%3A-100%2C%22index%22%3A-100%2C%22sin%22%3A**160**%2C%22cur_page%22%3A**3**%7D%7D%7D

From the above analysis, you can get the url format of the list of requested singers as follows:

singer_list_url = "https://u.y.qq.com/cgi-bin/musicu.fcg?data=%7B%22comm%22%3A%7B%22ct%22%3A24%2C%22cv%22%3A0%7D%2C%22singerList%22%3A%7B%22module%22%3A%22Music.SingerListServer%22%2C%22method%22%3A%22get_singer_list%22%2C%22param%22%3A%7B%22area%22%3A{area}%2C%22sex%22%3A-100%2C%22genre%22%3A{genre}%2C%22index%22%3A-100%2C%22sin%22%3A{index}%2C%22cur_page%22%3A{cur_page}%7D%7D%7D"
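As an alternative to filling the pre-encoded template by hand, the same URL can be produced by building the data payload as a dict and percent-encoding it. This is a sketch under that assumption, not necessarily how the project does it; the function name and the BASE_URL constant are chosen for this example.

```python
import json
from urllib.parse import quote

BASE_URL = "https://u.y.qq.com/cgi-bin/musicu.fcg"


def singer_list_url(cur_page, page_size=80, area=-100, genre=-100):
    """Build the singer list URL for a 1-based page number."""
    payload = {
        "comm": {"ct": 24, "cv": 0},
        "singerList": {
            "module": "Music.SingerListServer",
            "method": "get_singer_list",
            "param": {
                "area": area,
                "sex": -100,
                "genre": genre,
                "index": -100,
                # starting index of the page: 0, 80, 160, ...
                "sin": (cur_page - 1) * page_size,
                "cur_page": cur_page,
            },
        },
    }
    # Compact separators reproduce the spacing-free JSON seen in the captured URLs.
    data = json.dumps(payload, separators=(",", ":"))
    return BASE_URL + "?data=" + quote(data, safe="")
```

Keeping the payload as a dict makes it easier to see which fields vary (area, genre, sin, cur_page) than reading the percent-encoded string.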

Song list url parsing

In the same way, find the URL that requests a singer's song list:

https://u.y.qq.com/cgi-bin/musicu.fcg?-=getSingerSong8235365887193979&g_tk=5381&loginUin=0&hostUin=0&format=json&inCharset=utf8&outCharset=utf-8&notice=0&platform=yqq.json&needNewCode=0&data=%7B%22comm%22%3A%7B%22ct%22%3A24%2C%22cv%22%3A0%7D%2C%22singerSongList%22%3A%7B%22method%22%3A%22GetSingerSongList%22%2C%22param%22%3A%7B%22order%22%3A1%2C%22singerMid%22%3A%22004Be55m1SJaLk%22%2C%22begin%22%3A0%2C%22num%22%3A10%7D%2C%22module%22%3A%22musichall.song_list_server%22%7D%7D

Filter out the useless parameters to get a simplified url:

https://u.y.qq.com/cgi-bin/musicu.fcg?data=%7B%22comm%22%3A%7B%22ct%22%3A24%2C%22cv%22%3A0%7D%2C%22singerSongList%22%3A%7B%22method%22%3A%22GetSingerSongList%22%2C%22param%22%3A%7B%22order%22%3A1%2C%22singerMid%22%3A%22004Be55m1SJaLk%22%2C%22begin%22%3A0%2C%22num%22%3A10%7D%2C%22module%22%3A%22musichall.song_list_server%22%7D%7D

Parameters hidden in data:

  • singerMid: mid of the singer
  • num: equal to page_size, the number of songs per page
  • begin: page * page_size, the starting index of the current page (page counts from 0, so the first page has begin=0)

From the above analysis, you can get the url format of the list of requested songs as follows:

song_list_url = "https://u.y.qq.com/cgi-bin/musicu.fcg?data=%7B%22comm%22%3A%7B%22ct%22%3A24%2C%22cv%22%3A0%7D%2C%22singerSongList%22%3A%7B%22method%22%3A%22GetSingerSongList%22%2C%22param%22%3A%7B%22order%22%3A1%2C%22singerMid%22%3A%22{singer_mid}%22%2C%22begin%22%3A{begin}%2C%22num%22%3A{num}%7D%2C%22module%22%3A%22musichall.song_list_server%22%7D%7D"
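A short sketch of paging through a singer's songs with this template follows. The generator name song_list_urls and its arguments are hypothetical; the template string is the one given above, split across lines only for readability.

```python
SONG_LIST_URL = (
    "https://u.y.qq.com/cgi-bin/musicu.fcg?data="
    "%7B%22comm%22%3A%7B%22ct%22%3A24%2C%22cv%22%3A0%7D%2C"
    "%22singerSongList%22%3A%7B%22method%22%3A%22GetSingerSongList%22%2C"
    "%22param%22%3A%7B%22order%22%3A1%2C%22singerMid%22%3A%22{singer_mid}%22%2C"
    "%22begin%22%3A{begin}%2C%22num%22%3A{num}%7D%2C"
    "%22module%22%3A%22musichall.song_list_server%22%7D%7D"
)


def song_list_urls(singer_mid, page_num, page_size):
    """Yield one song list URL per page, with pages counted from 0."""
    for page in range(page_num):
        yield SONG_LIST_URL.format(
            singer_mid=singer_mid,
            begin=page * page_size,  # starting index of this page
            num=page_size,
        )
```

For example, list(song_list_urls("004Be55m1SJaLk", 3, 10)) gives three URLs with begin set to 0, 10, and 20.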

Request Lyric url parsing

The response from the song list URL does not include lyrics, so the lyrics have to be fetched with a separate request. The URL that requests the lyrics is:

https://c.y.qq.com/lyric/fcgi-bin/fcg_query_lyric_yqq.fcg?nobase64=1&musicid=105648715&-=jsonp1&g_tk=5381&loginUin=0&hostUin=0&format=json&inCharset=utf8&outCharset=utf-8&notice=0&platform=yqq.json&needNewCode=0

Sending this URL from Postman does not return a correct response. After some investigation, it turns out that the request must carry a referer header to work properly:

referer:https://y.qq.com/n/yqq/song/004RDW5Q2ol2jj.html

The url of the requested lyrics is simplified to:

https://c.y.qq.com/lyric/fcgi-bin/fcg_query_lyric_yqq.fcg?musicid=105648715&format=json

From this it is easy to see:

  • musicid: song_id
  • "004RDW5Q2ol2jj" in referer means song_mid

The URL format for requesting lyrics is as follows. A request to lyric_url must carry the referer header:

lyric_url = "https://c.y.qq.com/lyric/fcgi-bin/fcg_query_lyric_yqq.fcg?nobase64=1&musicid={song_id}&format=json"

referer = "https://y.qq.com/n/yqq/song/{song_mid}.html"
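The sketch below shows one way to attach the referer header using only the standard library. It builds the request object without sending it; the function name build_lyric_request is made up for this example. In Scrapy the equivalent would be passing a headers dict to scrapy.Request.

```python
from urllib.request import Request

LYRIC_URL = (
    "https://c.y.qq.com/lyric/fcgi-bin/fcg_query_lyric_yqq.fcg"
    "?nobase64=1&musicid={song_id}&format=json"
)
REFERER = "https://y.qq.com/n/yqq/song/{song_mid}.html"


def build_lyric_request(song_id, song_mid):
    """Return a Request for a song's lyrics with the required Referer header."""
    return Request(
        LYRIC_URL.format(song_id=song_id),
        headers={"Referer": REFERER.format(song_mid=song_mid)},
    )
```

The request can then be sent with urllib.request.urlopen(build_lyric_request(105648715, "004RDW5Q2ol2jj")).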

Song review url parsing

The URL that requests song comments is as follows:

https://c.y.qq.com/base/fcgi-bin/fcg_global_comment_h5.fcg?g_tk=5381&loginUin=0&hostUin=0&format=json&inCharset=utf8&outCharset=GB2312&notice=0&platform=yqq.json&needNewCode=0&cid=205360772&reqtype=2&biztype=1&topid=105648715&cmd=8&needmusiccrit=0&pagenum=0&pagesize=25&lasthotcommentid=&domain=qq.com&ct=24&cv=10101010

After removing the unnecessary parameters, the following URL is obtained:

https://c.y.qq.com/base/fcgi-bin/fcg_global_comment_h5.fcg?biztype=1&topid=105648715&cmd=8&pagenum=0&pagesize=25

Parameter description:

  • topid: song_id of the song
  • pagenum: page number of the "Latest Comments"
  • pagesize: number of "Latest Comments" per page

Note: pagenum and pagesize only affect the "Latest Comments" in the response. They do not affect the "Highlighted Comments", and no parameter was found that controls which "Highlighted Comments" are returned.

The url format for requesting song comments is as follows:

comment_url = 'https://c.y.qq.com/base/fcgi-bin/fcg_global_comment_h5.fcg?biztype=1&topid={song_id}&cmd=8&pagenum={pagenum}&pagesize={pagesize}'
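For completeness, the same URL can be assembled from a parameter dict with urlencode. This is only a sketch; the function name and the default values (taken from the captured URL above) are assumptions of this example.

```python
from urllib.parse import urlencode

COMMENT_BASE = "https://c.y.qq.com/base/fcgi-bin/fcg_global_comment_h5.fcg"


def comment_url(song_id, pagenum=0, pagesize=25):
    """Build the comment URL for a song id."""
    params = {
        "biztype": 1,
        "topid": song_id,
        "cmd": 8,
        "pagenum": pagenum,    # only affects the latest comments
        "pagesize": pagesize,  # only affects the latest comments
    }
    return COMMENT_BASE + "?" + urlencode(params)
```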

Song url

The url format of the song is as follows:

https://y.qq.com/n/yqq/song/{song_mid}.html

Lyrics Analysis

Take Jay Chou's Say or don't cry (with Xiao Axin) as an example.

The lyrics returned by the lyric_url request are in the following format, which looks cluttered: each line carries a timestamp tag, and leftover character-entity codes are mixed into the text.

[ti : Say not to cry (With   Shaoxin)] 
 [ar : Jay Chou] 
 [al : Say not to cry (With   Shaoxin)] 

[by :] 
 [offset : 0] 
 [00 : 00 . 00] Say no crying (with   Axin)   -   Jay   Chou ) 

[00 : 14 . 94] Words: Fang Wenshan 
 [00 : 19 . 09] Song: Jay Chou 
 [00 : 23 . 24] Jay Chou: 

[00 : 26 . 51] No contact   Later life 
 [00 : 29 . 76] I've always heard people say 
 [00 : 32 . 85] What's wrong with you   What's wrong with you? #10;
[00 : 36 . 02] It's me who can't leave it, 
 [00 : 39 . 20] When there are more people,   just stay in the corner 
 [00 : 42 . 34] for fear of being asked about me 

[00 : 45 . 37] What's wrong with you   You look down 
 [00 : 48 . 30] Protect me without complaining at all 
 [00 : 51 . 82] The phone starts to dodge   Never tell me 

[00 : 54 . 98] is not accustomed to living alone 
 [00 : 58 . 11] After leaving me   let me live well 
 [01 : 01 &\#46; 32] for fear of disturbing me who wants freedom 

[01 : 04 . 39] It's all this time   you still care about #10; [01 : 07 . 56] how other people see me; [01 : 10 . 77] struggling to explain #32; it's not my fault   you're going to go #10;
[01 : 15 . 55] Look at you sadly   Retain words but don't say 
 [01 : 28 . 14] You will smile and let go   Say no cry and let me go 
 [01 : 52 . 13] Ashin: 

[01 : 54 . 95] The phone starts to dodge   Never say to me 
 [01 : 58 . 17] Not accustomed to a person's life 
 [02 : 01 . 26] Leave me   Make me live well 

[02 : 04 . 41] Fear of disturbing me who wants freedom 
 [02 : 07 . 62] At this time   You still care about 
 [02 : 10 . 62] What other people think of me 

[02 : 13 . 90] is desperately explaining   it's not my fault   it's your going 
 [02 : 18 . 51] together: 
 [02 : 18 . 71] Watch you sad   leave words unseen 

[02 : 31 . 28] You'll smile and let go   Say no crying and let me go 
 [02 : 50 &\#46; 54] Jay Chou: 
 [02 : 53 . 38] You have nothing   but you're still fueling my dream 

[03 : 04 . 99] Ashin: 
 [03 : 05 . 92] How long has it been a heartbreak? #10; [03 : 09 . 83] Jay Chou: 
 [03 : 10 . 02] How long has it been? #10;
[03 : 12 . 58] Combined: 
 [03 : 12 . 77] is still looking for a reason to wait for me

Cleaning this up with regular expressions gives tidy lyrics (lyric lines are separated by \n):

Say or Don't Cry (with Axin) - Jay Chou\n
 Word: Fang Wenshan\n
 Song: Jay Chou\n
 Jay Chou: \n
 Without contact with later life\n
 I always hear people say\n
 What's wrong with you?\n
 I'm the one who can't get rid of it\n
 Stay in the corner when there are a lot of people\n
 Just in case someone asks me\n
 What happened to you when you lowered your head\n
 Protect me without complaining\n
 The phone starts hiding and never says to me\n
 Not accustomed to living alone\n
 Leave me and I'll have a good time \n
 Fear of disturbing me who wants freedom\n
 At this point you still care about \n
 What do people think of me\n
 Desperately explaining that it was not my fault that you were going to go\n
 Looking at your sad words but not saying\n
 You'll smile and let go and cry to let me go\n
 Ashin: \n
 The phone starts hiding and never says to me\n
 Not accustomed to living alone\n
 Leave me and I'll have a good time \n
 Fear of disturbing me who wants freedom\n
 At this point you still care about \n
 What do people think of me\n
 Desperately explaining that it was not my fault that you were going to go\n
 Combined: \n
 Looking at your sad words but not saying\n
 You'll smile and let go and cry to let me go\n
 Jay Chou: \n
 You don't have anything but cheer on my dream\n
 Ashin: \n
 How long has it been painful\n
 Jay Chou: \n
 How long has passed\n
 Combined: \n
 Still looking for a reason to wait for me
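The cleaning step might look like the sketch below. It is not the project's actual code: it assumes the raw response encodes punctuation and line breaks as numeric HTML entities (such as &#58; for a colon and &#10; for a newline, which is what the leftover #10; fragments in the sample suggest), and the name clean_lyric is made up for this example.

```python
import html
import re

# Matches any bracketed tag such as [ti:...], [offset:0] or [00:14.94].
TAG_RE = re.compile(r"\[[^\]]*\]")


def clean_lyric(raw):
    """Decode entities, drop bracketed tags, and join lyric lines with newlines."""
    text = html.unescape(raw)
    lines = []
    for line in text.splitlines():
        line = TAG_RE.sub("", line).strip()
        if line:  # metadata-only lines become empty and are skipped
            lines.append(line)
    return "\n".join(lines)
```

Decoding entities before stripping tags matters here, because the brackets only contain a literal colon or dot after decoding and line breaks only exist once &#10; has been turned into a newline.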

Parameter description for settings.py

  • DOWNLOAD_DELAY: interval between requests, in seconds
  • ROBOTSTXT_OBEY: whether to obey the website's robots.txt crawler rules
  • SINGER_PAGE_NUM: number of singer pages to crawl
  • SINGER_PAGE_SIZE: number of singers per page
  • SONG_PAGE_NUM: number of song pages to crawl for each singer
  • SONG_PAGE_SIZE: number of songs per page for each singer
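A settings.py fragment using these names might look like the following. The values are illustrative assumptions, not the project's actual configuration: 80 pages of 80 singers matches the 6,400 singers mentioned earlier, while the song paging values and the ROBOTSTXT_OBEY choice are placeholders.

```python
# Scrapy built-in settings
DOWNLOAD_DELAY = 1        # seconds between requests (assumed value)
ROBOTSTXT_OBEY = True     # assumed value; set according to your own policy

# Project-specific settings named in this post (values are placeholders)
SINGER_PAGE_NUM = 80      # singer pages to crawl
SINGER_PAGE_SIZE = 80     # singers per page, as shown on the singer page
SONG_PAGE_NUM = 10        # song pages to crawl per singer
SONG_PAGE_SIZE = 10       # songs per page per singer
```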

Avoid the crawler being banned

  • Use a user agent pool and rotate through it so that requests use different user agents
  • Set the download delay DOWNLOAD_DELAY to 1 or greater. (At first I was afraid QQ Music would detect the crawler, so I set DOWNLOAD_DELAY to 1, i.e. one request per second. That made crawling too slow, so I lowered it to 0.1 and finally to 0, and found that QQ Music has no anti-crawling measures on the URLs for lyric information.)
  • Use an IP pool to change the IP address dynamically
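The first measure can be sketched as a Scrapy downloader middleware. This is a minimal illustration, not the project's own middleware: the class name is hypothetical, the two user agents are copied from the pool in this post, and round-robin rotation via itertools.cycle is one possible reading of "taking turns".

```python
import itertools

USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/22.0.1207.1 Safari/537.1",
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/536.5 (KHTML, like Gecko) Chrome/19.0.1084.9 Safari/536.5",
]


class RotateUserAgentMiddleware:
    """Downloader middleware that assigns user agents in round-robin order."""

    def __init__(self, user_agents=USER_AGENTS):
        self._pool = itertools.cycle(user_agents)

    def process_request(self, request, spider):
        # Overwrite the header, then return None so Scrapy keeps processing.
        request.headers["User-Agent"] = next(self._pool)
        return None
```

To take effect, a middleware like this has to be registered under DOWNLOADER_MIDDLEWARES in settings.py.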

The user agent pool is as follows:

"Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/22.0.1207.1 Safari/537.1",
"Mozilla/5.0 (X11; CrOS i686 2268.111.0) AppleWebKit/536.11 (KHTML, like Gecko) Chrome/20.0.1132.57 Safari/536.11",
"Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.6 (KHTML, like Gecko) Chrome/20.0.1092.0 Safari/536.6",
"Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.6 (KHTML, like Gecko) Chrome/20.0.1090.0 Safari/536.6",
"Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/19.77.34.5 Safari/537.1",
"Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/536.5 (KHTML, like Gecko) Chrome/19.0.1084.9 Safari/536.5",
"Mozilla/5.0 (Windows NT 6.0) AppleWebKit/536.5 (KHTML, like Gecko) Chrome/19.0.1084.36 Safari/536.5",
"Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",
"Mozilla/5.0 (Windows NT 5.1) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",
"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_8_0) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",
"Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1062.0 Safari/536.3",
"Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1062.0 Safari/536.3",
"Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",
"Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",
"Mozilla/5.0 (Windows NT 6.1) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",
"Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.0 Safari/536.3",
"Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/535.24 (KHTML, like Gecko) Chrome/19.0.1055.1 Safari/535.24",
"Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/535.24 (KHTML, like Gecko) Chrome/19.0.1055.1 Safari/535.24"

Future Work

  • Work out how QQ Music's song files are requested
  • Korean and English lyrics contain special symbols, some of which are not escaped correctly
  • Improve the settings so users can crawl songs from different regions and styles more flexibly

Unsolved problem

The program runs normally, but when the crawl ends (the log shows 'finish_reason': 'finished', indicating that the crawl task is complete) an error is reported. The results are not affected, but the cause of the error is not yet known.

2019-12-21 15:08:53 [scrapy.core.engine] INFO: Closing spider (finished)
2019-12-21 15:08:53 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/exception_count': 8,
 'downloader/exception_type_count/twisted.internet.error.TimeoutError': 8,
 'downloader/request_bytes': 682269284,
 'downloader/request_count': 1067470,
 'downloader/request_method_count/GET': 1067470,
 'downloader/response_bytes': 1563129445,
 'downloader/response_count': 1067462,
 'downloader/response_status_count/200': 1067460,
 'downloader/response_status_count/404': 2,
 'dupefilter/filtered': 30476,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2019, 12, 21, 7, 8, 53, 816618),
 'item_scraped_count': 489860,
 'log_count/DEBUG': 1557332,
 'log_count/ERROR': 1,
 'log_count/INFO': 229,
 'request_depth_max': 3,
 'response_received_count': 1067462,
 'retry/count': 8,
 'retry/reason_count/twisted.internet.error.TimeoutError': 8,
 'scheduler/dequeued': 1067468,
 'scheduler/dequeued/memory': 1067468,
 'scheduler/enqueued': 1067468,
 'scheduler/enqueued/memory': 1067468,
 'start_time': datetime.datetime(2019, 12, 20, 14, 59, 43, 749919)}
2019-12-21 15:08:53 [scrapy.core.engine] INFO: Spider closed (finished)
2019-12-21 15:08:53 [scrapy.utils.signal] ERROR: Error caught on signal handler: <bound method MemoryUsage.engine_stopped of <scrapy.extensions.memusage.MemoryUsage object at 0x000001EFCF7E8630>>
Traceback (most recent call last):
  File "c:\users\administrator\.conda\envs\ppy36\lib\site-packages\twisted\internet\defer.py", line 151, in maybeDeferred
    result = f(*args, **kw)
  File "c:\users\administrator\.conda\envs\ppy36\lib\site-packages\pydispatch\robustapply.py", line 54, in robustApply
    return receiver(*arguments, **named)
  File "c:\users\administrator\.conda\envs\ppy36\lib\site-packages\scrapy\extensions\memusage.py", line 70, in engine_stopped
    for tsk in self.tasks:
AttributeError: 'MemoryUsage' object has no attribute 'tasks'

GitHub: https://github.com/yangjianxin1/QQMusicSpider


Posted on Wed, 08 Jan 2020 21:09:23 -0800 by asgerhallas