Xiami Music Crawler

https://www.xiami.com/ is the website I crawled this time.

This site separates its front end from its back end. Sites of this type are actually easy to crawl: once you find the right API and make a successful request, you get the data you want directly.

Let's take the song "Green" as an example, https://www.xiami.com/song/mTrNQf7d590, and analyze the API that serves its comments.

  1. Request analysis: find out the API address, the request parameters, the request method, and which headers the request needs to carry.

    These are the parameters carried by this request.

    _q: {"objectId":"mTrNQf7d590","objectType":"song","pagingVO":{"page":1,"pageSize":20}}
    _s: a05126e10a02e9702e790e47c27d2002

    Let's start the analysis (your _s will differ from mine; that's normal).
    _q: Plaintext. objectId appears in the page URL, so it is easy to get; objectType is fixed to song here; pagingVO holds the page number and the number of items per page. So _q is easy to build.
    _s: A 32-character string that looks random. The first thing that comes to mind is MD5, whose hex digest is 32 characters long, so _s is very likely an MD5 hash. But a hash of what? MD5 is irreversible, so the back end cannot recover the input from the hash. It can only hash the data sent by the front end in the same way and compare the result with the hash the front end sent. This works as a check unless someone else learns how the input is assembled. _s differs between songs and between comment pages, so the hashed data must identify the request. Many back ends include a timestamp in the hashed data, but then the front end has to send that timestamp along. No timestamp is sent here, so my guess is that the hashed data is probably the value of _q.

    Visiting https://www.xiami.com/song/mTrNQf7d590 only once makes it hard to spot the rule. Refresh the page and find the comment API request again to see whether the parameters differ. As long as you do nothing else, the request parameters are exactly the same as the first time.

    _q: {"objectId":"mTrNQf7d590","objectType":"song","pagingVO":{"page":1,"pageSize":20}}
    _s: a05126e10a02e9702e790e47c27d2002

    No matter how many times you refresh, the request parameters stay the same. Time keeps moving but _s does not change, so a timestamp can be ruled out as part of the hashed data. And since _q and _s both stay unchanged, it becomes more likely that _q is an input to _s.

    So far we have only looked at the first page of comments for this song. Let's load the second page and see how the request parameters change: scroll to the bottom of the page, click to load more comments, and find the comment API request again.

    _q: {"objectId":"mTrNQf7d590","objectType":"song","pagingVO":{"page":2,"pageSize":20}}
    _s: f82616e410172d3e76b9b48561a0d257

    The value of _q has changed, and the value of _s has changed with it. We can confirm that _q is one of the inputs to _s (there may be others).

  2. Cracking how _s is generated

    Finding where the JS computes the value is hard. A page load requests many JS files, each one is large, and the code may be obfuscated, with variables named a, b, c, d, e that tell you nothing.

    One way I look for the location is to check which JS file sends the request and from where in that file, then slowly trace backwards from there. This is just one approach.

    Another approach is to search for keywords and locate the code through them.

    Here, I finally found the location by searching for the keyword _q. Once it is found, set breakpoints and debug the JS with Chrome DevTools. Here is what that JS does.

    _q: we have to generate the string ourselves, according to the request we want to make.
    _s comes from this line of code: a.split("_")[0] + "_xmMain_" + e._url + "_" + t
     a: The value of xm_sg_tk in the cookies
     e._url: The path of the request URL. For example, the comment request goes to https://www.xiami.com/api/comment/getCommentList,
        so e._url is /api/comment/getCommentList
     t: The value of _q.
    Finally, the concatenated string is hashed with MD5, and the result is the value of _s.

    Note:

    For a above, the cookie is xm_sg_tk. Before sending an API request, visit the home page first to obtain this cookie.

    _q is clearly a JSON string. You can generate it from a dictionary with json.dumps, but be careful: the key order of the dictionary determines the order in the output string, so a different key order produces a different string and therefore a different _s.

  3. After a successful request, you get the response data.
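
The steps above can be sketched as a short offline example. This is only an illustration of the concatenate-then-MD5 rule described in step 2: the helper names build_q and compute_s are hypothetical, and FAKE_COOKIE is a made-up value standing in for a real xm_sg_tk cookie, so the printed hashes will not match the ones observed in the browser.

```python
import hashlib
import json


def build_q(payload: dict) -> str:
    # json.dumps inserts spaces after "," and ":" by default; compact
    # separators reproduce the space-free form of _q shown above.
    return json.dumps(payload, separators=(",", ":"))


def compute_s(cookie_value: str, api_path: str, q: str = "") -> str:
    # Only the part of xm_sg_tk before the first "_" is used.
    token = cookie_value.split("_")[0]
    raw = token + "_xmMain_" + api_path + "_" + q
    return hashlib.md5(raw.encode("utf-8")).hexdigest()


# Made-up cookie value for illustration; a real one comes from the home page.
FAKE_COOKIE = "0123456789abcdef_1568000000000"
API_PATH = "/api/comment/getCommentList"


def comment_q(page: int) -> str:
    return build_q({
        "objectId": "mTrNQf7d590",
        "objectType": "song",
        "pagingVO": {"page": page, "pageSize": 20},
    })


s_page1 = compute_s(FAKE_COOKIE, API_PATH, comment_q(1))
s_page2 = compute_s(FAKE_COOKIE, API_PATH, comment_q(2))
print(comment_q(1))
print(s_page1)
print(s_page2)
```

With the same cookie, path and _q the result never changes, and changing only the page number changes it, which matches what was observed while refreshing and paging.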

Here I analyzed the comment API; the other APIs are signed the same way. Only the value of _q and the corresponding URL path differ, so adjust them for whatever other data you want to crawl.
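
As a rough sketch of that idea, the helper below builds the query parameters for any endpoint from its path and an optional payload. The function name signed_params and the cookie value are assumptions made for illustration; the two paths are the ones used in the class in this post, and an endpoint without a payload is signed with an empty _q and sends only _s, as that class does.

```python
import hashlib
import json
from typing import Optional
from urllib.parse import urlencode


def signed_params(cookie_value: str, api_path: str,
                  payload: Optional[dict] = None) -> dict:
    """Return the query parameters for one API call (hypothetical helper)."""
    token = cookie_value.split("_")[0]
    # No payload means an empty _q, which still takes part in the hash.
    q = json.dumps(payload, separators=(",", ":")) if payload is not None else ""
    raw = f"{token}_xmMain_{api_path}_{q}"
    s = hashlib.md5(raw.encode("utf-8")).hexdigest()
    return {"_q": q, "_s": s} if q else {"_s": s}


# Made-up cookie value for illustration only.
FAKE_COOKIE = "0123456789abcdef_1568000000000"

billboard = signed_params(
    FAKE_COOKIE, "/api/billboard/getBillboardDetail", {"billboardId": "332"}
)
daily = signed_params(FAKE_COOKIE, "/api/recommend/getDailySongs")

# urlencode shows the query string a GET request would carry.
print(urlencode(billboard))
print(urlencode(daily))
```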

import requests, pprint
from fake_useragent import UserAgent
from hashlib import md5


class XiaMi:
    ua = UserAgent()
    DOMAIN = "https://www.xiami.com"

    # API interface addresses
    # Daily Music Recommendation
    APIDailySongs = "/api/recommend/getDailySongs"
    # Billboard Music
    APIBillboardDetail = "/api/billboard/getBillboardDetail"
    # All rankings
    APIBillboardALL = "/api/billboard/getBillboards"
    # Song Details
    APISongDetails = "/api/song/getPlayInfo"

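    def __init__(self):
        self.session = requests.Session()
        self.headers = {
            "user-agent": self.ua.random
        }
        # Send the random user agent with every request made through the session
        self.session.headers.update(self.headers)
        # Visit the home page first so the session receives the xm_sg_tk cookie
        self.session.get(self.DOMAIN)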

    def _get_api_url(self, api):
        return self.DOMAIN + api

    # Get 30 songs recommended daily
    def get_daily_songs(self):
        url = self._get_api_url(self.APIDailySongs)
        params = {
            "_s": self._get_params__s(self.APIDailySongs)
        }
        result = self.session.get(url=url, params=params).json()
        return self._dispose(result)

    # Get the songs on a Xiami Music billboard
    def get_billboard_song(self, billboard_id: int = 0):
        '''
        :param billboard_id: Rankings of various types
        :return: Billboard Music Data
        '''
        if not hasattr(self, "billboard_dict"):
            self._get_billboard_dict_map()

        assert hasattr(self, "billboard_dict"), "failed to fetch billboard_dict"
        pprint.pprint(self.billboard_dict)
        if billboard_id == 0:
            billboard_id = input("Enter a billboard ID to fetch its ranking: ")
        assert billboard_id in self.billboard_dict, "billboard_id error"

        url = self._get_api_url(self.APIBillboardDetail)
        _q = '{\"billboardId\":\"%s\"}' % billboard_id
        params = {
            "_q": _q,
            "_s": self._get_params__s(self.APIBillboardDetail, _q)
        }
        result = self.session.get(url=url, params=params).json()
        return self._dispose(result)

    # Generate a dictionary map corresponding to the rankings
    def _get_billboard_dict_map(self):
        billboard_dict = {}
        billboards_info = self.get_billboard_all()
        try:
            if billboards_info["code"] == "SUCCESS":
                xiamiBillboards_list = billboards_info["result"]["data"]["xiamiBillboards"]
                for xiamiBillboards in xiamiBillboards_list:
                    for xiamiBillboard in xiamiBillboards:
                        id = xiamiBillboard["billboardId"]
                        name = xiamiBillboard["name"]
                        billboard_dict[id] = name
                self.billboard_dict = billboard_dict
        except Exception:
            pass

    # Get all the ranking information
    def get_billboard_all(self):
        url = self._get_api_url(self.APIBillboardALL)
        params = {
            "_s": self._get_params__s(self.APIBillboardALL)
        }
        result = self.session.get(url=url, params=params).json()
        # Return the data: _get_billboard_dict_map reads this return value
        return self._dispose(result)

    # Get song details
    def get_song_details(self, *song_ids) -> dict:
        '''
        :param song_ids: Song IDs; one or more may be passed
        :return: Details of the song
        '''
        assert len(song_ids) != 0, "Parameters cannot be null"

        for song_id in song_ids:
            if not isinstance(song_id, int):
                raise Exception("Each parameter must be integer")

        url = self._get_api_url(self.APISongDetails)
        _q = "{\"songIds\":%s}" % list(song_ids)
        params = {
            "_q": _q,
            "_s": self._get_params__s(self.APISongDetails, _q)
        }
        result = self.session.get(url=url, params=params).json()
        return self._dispose(result)

    # Get the download address of the song
    def get_song_download_url(self, *song_ids):
        download_url_dict = {}
        song_details = self.get_song_details(*song_ids)
        songPlayInfos = song_details["result"]["data"]["songPlayInfos"]
        for songPlayInfo in songPlayInfos:
            song_download_url = songPlayInfo["playInfos"][0]["listenFile"] or songPlayInfo["playInfos"][1]["listenFile"]
            song_id = songPlayInfo["songId"]
            download_url_dict[song_id] = song_download_url

        print("The download address of the song is:", download_url_dict)
        return download_url_dict

    # Process the crawled data; here I just print it
    def _dispose(self, data):
        pprint.pprint(data)
        return data

    # Compute the _s signature string
    def _get_params__s(self, api: str, _q: str = "") -> str:
        '''
        :param api: API path, e.g. /api/song/getPlayInfo
        :param _q:  The _q string to include in the signature
        :return: MD5 hex digest used as _s
        '''
        xm_sg_tk = self._get_xm_sg_tk()
        data = xm_sg_tk + "_xmMain_" + api + "_" + _q
        return md5(bytes(data, encoding="utf-8")).hexdigest()

    # Get the xm_sg_tk cookie value used as input to the signature
    def _get_xm_sg_tk(self) -> str:
        xm_sg_tk = self.session.cookies.get("xm_sg_tk", None)
        assert xm_sg_tk is not None, "failed to get the xm_sg_tk cookie"
        return xm_sg_tk.split("_")[0]

    def test(self):
        # self.get_daily_songs()
        # self._get_xm_sg_tk()
        # self.get_billboard_song(332)
        # self.get_billboard_all()
        # self.get_song_details(1813243760)
        # self.get_song_download_url(1813243760)

        pass


if __name__ == '__main__':
    xm = XiaMi()
    xm.test()
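
get_song_download_url above indexes playInfos[0] and playInfos[1] directly, which fails if a song has fewer than two entries. A more defensive variant might look like the sketch below. The function name extract_listen_files is hypothetical, and SAMPLE is a made-up response that only mirrors the keys the class reads (result, data, songPlayInfos, playInfos, listenFile, songId); it is not real API output.

```python
from typing import Dict, Optional


def extract_listen_files(response: dict) -> Dict[int, Optional[str]]:
    """Map each songId to its first non-empty listenFile, or None."""
    urls: Dict[int, Optional[str]] = {}
    infos = response.get("result", {}).get("data", {}).get("songPlayInfos", [])
    for info in infos:
        # Take the first play info that actually carries a URL.
        listen_file = next(
            (p["listenFile"] for p in info.get("playInfos", []) if p.get("listenFile")),
            None,
        )
        urls[info["songId"]] = listen_file
    return urls


# Made-up response for illustration; the URL is a placeholder.
SAMPLE = {
    "code": "SUCCESS",
    "result": {"data": {"songPlayInfos": [
        {"songId": 1, "playInfos": [
            {"listenFile": ""},
            {"listenFile": "https://example.com/a.mp3"},
        ]},
        {"songId": 2, "playInfos": []},
    ]}},
}

print(extract_listen_files(SAMPLE))
```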

Tags: Python Session JSON encoding

Posted on Mon, 09 Sep 2019 22:44:44 -0700 by Stole