How to assign different DB to different crawler items instead of just using db0

1. Background

By default, Redis creates 16 logical databases, db0 through db15. When you write a scrapy-redis distributed crawler, db0 is used by default to store the deduplication fingerprints, the request queue, and the item data. In practice we rarely have only one crawler project, and putting everything into a single database quickly becomes confusing, so it is worth assigning a different DB to each crawler project.

2. Environment

  • System: Windows 7
  • scrapy-redis
  • Redis 3.0.5
  • Python 3.6.1

3. Analysis

  • First, let's look at the scrapy-redis source code to see where the db can be set.
  • Step 1: `.\Lib\site-packages\scrapy_redis\scheduler.py`
    @classmethod
    def from_settings(cls, settings):
        kwargs = {
            'persist': settings.getbool('SCHEDULER_PERSIST'),
            'flush_on_start': settings.getbool('SCHEDULER_FLUSH_ON_START'),
            'idle_before_close': settings.getint('SCHEDULER_IDLE_BEFORE_CLOSE'),
        }

        # If these values are missing, it means we want to use the defaults.
        optional = {
            # TODO: Use custom prefixes for this settings to note that are
            # specific to scrapy-redis.
            'queue_key': 'SCHEDULER_QUEUE_KEY',
            'queue_cls': 'SCHEDULER_QUEUE_CLASS',
            'dupefilter_key': 'SCHEDULER_DUPEFILTER_KEY',
            # We use the default setting name to keep compatibility.
            'dupefilter_cls': 'DUPEFILTER_CLASS',
            'serializer': 'SCHEDULER_SERIALIZER',
        }
        for name, setting_name in optional.items():
            val = settings.get(setting_name)
            if val:
                kwargs[name] = val

        # Support serializer as a path to a module.
        if isinstance(kwargs.get('serializer'), six.string_types):
            kwargs['serializer'] = importlib.import_module(kwargs['serializer'])

        # Initialize redis server
        server = connection.from_settings(settings)
        # Ensure the connection is working.
        server.ping()

        return cls(server=server, **kwargs)
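The "optional settings" pattern used by `from_settings` above can be mirrored with a small stdlib sketch: only settings that are actually present end up in `kwargs`, so the scheduler keeps its defaults for anything missing. The `settings` dict below is a hypothetical stand-in for Scrapy's `Settings` object.

```python
# Hypothetical settings; SCHEDULER_SERIALIZER is intentionally absent,
# so the scheduler's default serializer is kept.
settings = {
    'SCHEDULER_PERSIST': True,
    'SCHEDULER_QUEUE_KEY': '%(spider)s:requests',
}

# Map of constructor kwarg name -> setting name, as in the code above.
optional = {
    'queue_key': 'SCHEDULER_QUEUE_KEY',
    'serializer': 'SCHEDULER_SERIALIZER',
}

kwargs = {'persist': settings.get('SCHEDULER_PERSIST', False)}
for name, setting_name in optional.items():
    val = settings.get(setting_name)
    if val:  # only present, non-empty settings override the defaults
        kwargs[name] = val

print(kwargs)
# {'persist': True, 'queue_key': '%(spider)s:requests'}
```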

It calls `from_settings` in `connection.py` to initialize the Redis server:

  • Step 2: `.\Lib\site-packages\scrapy_redis\connection.py`
# Backwards compatible alias.
from_settings = get_redis_from_settings

def get_redis_from_settings(settings):
    """Returns a redis client instance from given Scrapy settings object.

    This function uses ``get_client`` to instantiate the client and uses
    ``defaults.REDIS_PARAMS`` global as defaults values for the parameters. You
    can override them using the ``REDIS_PARAMS`` setting.

    Parameters
    ----------
    settings : Settings
        A scrapy settings object. See the supported settings below.

    Returns
    -------
    server
        Redis client instance.

    Other Parameters
    ----------------
    REDIS_URL : str, optional
        Server connection URL.
    REDIS_HOST : str, optional
        Server host.
    REDIS_PORT : str, optional
        Server port.
    REDIS_ENCODING : str, optional
        Data encoding.
    REDIS_PARAMS : dict, optional
        Additional client parameters.

    """
    params = defaults.REDIS_PARAMS.copy()

    # The key point is here: custom parameters from the REDIS_PARAMS setting are merged in
    params.update(settings.getdict('REDIS_PARAMS'))
    # XXX: Deprecate REDIS_* settings.
    for source, dest in SETTINGS_PARAMS_MAP.items():
        val = settings.get(source)
        if val:
            params[dest] = val

    # Allow ``redis_cls`` to be a path to a class.
    if isinstance(params.get('redis_cls'), six.string_types):
        params['redis_cls'] = load_object(params['redis_cls'])

    return get_redis(**params)
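The merge order in the listing above (library defaults, then `REDIS_PARAMS`, then the legacy `REDIS_*` settings) can be illustrated with plain dicts; all concrete values here are hypothetical:

```python
# Library defaults (stand-in for defaults.REDIS_PARAMS).
defaults = {'socket_timeout': 30, 'db': 0}

# Hypothetical Scrapy settings for this sketch.
settings = {
    'REDIS_PARAMS': {'db': 2, 'password': 'secret'},
    'REDIS_HOST': '192.168.1.99',   # legacy setting, mapped to 'host'
}
settings_params_map = {'REDIS_HOST': 'host', 'REDIS_PORT': 'port'}

params = defaults.copy()
params.update(settings.get('REDIS_PARAMS', {}))   # REDIS_PARAMS overrides defaults
for source, dest in settings_params_map.items():  # legacy REDIS_* overrides both
    val = settings.get(source)
    if val:
        params[dest] = val

print(params)
# {'socket_timeout': 30, 'db': 2, 'password': 'secret', 'host': '192.168.1.99'}
```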

As the code above shows, the parameters are read from the `REDIS_PARAMS` setting and then passed to `get_redis(**params)` to initialize the Redis server, as follows:

def get_redis(**kwargs):
    """Returns a redis client instance.

    Parameters
    ----------
    redis_cls : class, optional
        Defaults to ``redis.StrictRedis``.
    url : str, optional
        If given, ``redis_cls.from_url`` is used to instantiate the class.
    **kwargs
        Extra parameters to be passed to the ``redis_cls`` class.

    Returns
    -------
    server
        Redis client instance.

    """
    redis_cls = kwargs.pop('redis_cls', defaults.REDIS_CLS)
    url = kwargs.pop('url', None)
    if url:
        return redis_cls.from_url(url, **kwargs)
    else:
        return redis_cls(**kwargs)
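Since `get_redis` prefers `url` when one is given, an alternative to `REDIS_PARAMS` is to set `REDIS_URL` and encode the database in the URL's trailing path segment (redis-py's `from_url` accepts `redis://[:password@]host:port/db`). A small stdlib sketch of how such a URL carries the db number; the helper name and URLs are for illustration only:

```python
from urllib.parse import urlparse

def db_from_redis_url(url):
    """Extract the database number from a redis:// URL (0 if absent)."""
    path = urlparse(url).path.lstrip('/')
    return int(path) if path else 0  # redis defaults to db0

print(db_from_redis_url('redis://:redisPasswordTest123456@192.168.1.99:6379/2'))  # 2
print(db_from_redis_url('redis://192.168.1.99:6379'))  # 0
```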

4. Method

  • The analysis above makes this simple: just configure the `REDIS_PARAMS` setting for the crawler.
  • The password can be set in the same way.
# Specify using db2
class MySpider(RedisSpider):
    """Spider that reads urls from redis queue (myspider:start_urls)."""
    name = 'xxxx'
    redis_key = 'xxxx:start_urls'

    # ......
    custom_settings = {
        'LOG_LEVEL': 'DEBUG',
        'DOWNLOAD_DELAY': 0,

        # Specify the connection parameters of redis database
        'REDIS_HOST': '192.168.1.99',
        'REDIS_PORT': 6379,

        # Specify the redis link password and which database to use
        'REDIS_PARAMS' : {
            'password': 'redisPasswordTest123456',
            'db': 2
        },
    }
  • The effect: the spider's keys now appear under db2 instead of db0. (screenshot omitted)

  • Note:
    After changing the database, you must also specify that database when pushing start_urls and when dumping data from Redis to MongoDB, for example:

import redis

# Create a Redis connection against db2 (db takes an int, not a string;
# redis_Host is defined elsewhere)
rediscli = redis.Redis(host=redis_Host, port=6379, db=2)


Posted on Sun, 03 May 2020 03:16:03 -0700 by TheFilmGod