Scrapy configuration parameters (settings.py)

Import Configuration

How do you gracefully import configuration parameters from settings.py in a Scrapy project? You can't simply write from scrapy import settings or from scrapy.settings import XXX; neither of those does what you want.

Scrapy provides a way to import settings: the from_crawler class method.

@classmethod
def from_crawler(cls, crawler):
    server = crawler.settings.get('SERVER')
    # FIXME: for now, stats are only supported from this constructor
    return cls(server)

Then you simply receive these parameters in __init__:

def __init__(self, server):
    self.server = server
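
Putting the two pieces together, here is a minimal sketch of an item pipeline that reads a setting this way. SERVER is just an illustrative custom key taken from the snippet above (not a built-in Scrapy setting), and the pipeline itself is hypothetical:

# pipelines.py -- illustrative sketch, not an official component
class ServerPipeline:
    def __init__(self, server):
        self.server = server

    @classmethod
    def from_crawler(cls, crawler):
        # crawler.settings gives access to everything defined in settings.py
        server = crawler.settings.get('SERVER')  # hypothetical custom setting
        return cls(server)

    def process_item(self, item, spider):
        # just log where the item would go; a real pipeline would send it somewhere
        spider.logger.debug('would send item to %s', self.server)
        return item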

This pattern also appears in the source code of some official components, although it looks a bit redundant:

@classmethod
def from_settings(cls, settings):
    server = settings.get('SERVER')
    return cls(server)

@classmethod
def from_crawler(cls, crawler):
    # FIXME: for now, stats are only supported from this constructor
    return cls.from_settings(crawler.settings)

Note that not every class can use this class method. Only components such as extensions, middlewares, signal managers, and item pipelines can import configuration this way. In your own spiders or standalone files, import the settings like this instead:

from scrapy.utils.project import get_project_settings
settings = get_project_settings()

Here settings works like a dictionary containing all the configuration from settings.py (it is actually a scrapy.settings.Settings object, but it supports dict-style access).
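
For example, a minimal sketch of reading a few values in a standalone script run from inside the project (SERVER is again just an illustrative custom key):

from scrapy.utils.project import get_project_settings

settings = get_project_settings()
delay = settings.getfloat('DOWNLOAD_DELAY')           # typed accessors: getbool, getint, getfloat, getlist
obey_robots = settings.getbool('ROBOTSTXT_OBEY')
server = settings.get('SERVER', 'http://localhost')   # hypothetical custom key, with a default
print(delay, obey_robots, server)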

Main configuration parameters

Scrapy has a great many settings; here are some of the more common ones (a short example of overriding a few of them per spider follows the list):

  • CONCURRENT_ITEMS: Maximum number of items processed concurrently in the item pipelines (per response)
  • CONCURRENT_REQUESTS: Maximum number of concurrent requests the Scrapy downloader will perform
  • DOWNLOAD_DELAY: Delay between requests to the same website, in seconds. By default the actual delay is a random value between 0.5*DOWNLOAD_DELAY and 1.5*DOWNLOAD_DELAY; you can make it fixed via RANDOMIZE_DOWNLOAD_DELAY (the default True means random). Whether "the same website" means a domain name or an IP is determined by the value of CONCURRENT_REQUESTS_PER_IP.
  • CONCURRENT_REQUESTS_PER_DOMAIN: Maximum concurrency for a single domain name
  • CONCURRENT_REQUESTS_PER_IP: Maximum concurrency for a single IP; if non-zero, CONCURRENT_REQUESTS_PER_DOMAIN is ignored, and "the same website" for DOWNLOAD_DELAY refers to an IP
  • DEFAULT_ITEM_CLASS: Default class used to instantiate items in the scrapy shell, default scrapy.item.Item
  • DEPTH_LIMIT: Maximum depth of crawl
  • DEPTH_PRIORITY: Positive value is breadth first (BFO), negative value is depth first (DFO), calculation formula: request.priority = request.priority - (depth * DEPTH_PRIORITY)
  • COOKIES_ENABLED: Enable cookie middleware, also known as automatic cookie management
  • COOKIES_DEBUG: Log the cookies sent in requests and the Set-Cookie headers received in responses
  • DOWNLOADER_MIDDLEWARES: Dictionary of downloader middlewares and their priorities
  • DEFAULT_REQUEST_HEADERS: Default header for Scrapy HTTP requests
  • DUPEFILTER_CLASS: Class used to detect and filter duplicate requests; can be swapped out, for example for a Bloom-filter-based implementation, instead of the default
  • LOG_ENABLED: Enable logging
  • LOG_FILE: Log file path, default is None
  • LOG_FORMAT: Log formatting expression
  • LOG_DATEFORMAT: Date/time formatting expression used for the asctime placeholder in LOG_FORMAT
  • LOG_LEVEL: Minimum log level, default DEBUG, available: CRITICAL, ERROR, WARNING, INFO, DEBUG
  • LOG_STDOUT: Whether all standard output (and errors) will be redirected to the log, e.g. print will also be logged
  • LOG_SHORT_NAMES: If True, the log will only contain the root path; if False, the component responsible for the log output will be displayed
  • LOGSTATS_INTERVAL: The interval between the printouts of each statistical record
  • MEMDEBUG_ENABLED: Enable memory debugging
  • REDIRECT_MAX_TIMES: Maximum number of times a request can be redirected
  • REDIRECT_PRIORITY_ADJUST: Adjusts the priority of redirection requests to be higher when positive
  • RETRY_PRIORITY_ADJUST: Adjust the priority of retry requests
  • ROBOTSTXT_OBEY: Whether to obey robots.txt rules
  • SCRAPER_SLOT_MAX_ACTIVE_SIZE: The soft limit (in bytes) on the response data being processed, and if the total size of all the responses being processed is greater than this value, Scrapy will not process new requests.
  • SPIDER_MIDDLEWARES: Dictionary of spider middlewares and their priorities
  • USER_AGENT: User-Agent used by default
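
These project-wide values can also be overridden for a single spider through the custom_settings class attribute. A minimal sketch (the spider name and start URL are placeholders):

import scrapy

class ExampleSpider(scrapy.Spider):
    name = 'example'                       # placeholder spider name
    start_urls = ['https://example.com']   # placeholder URL

    # per-spider overrides of the settings discussed above
    custom_settings = {
        'DOWNLOAD_DELAY': 1.0,
        'CONCURRENT_REQUESTS_PER_DOMAIN': 4,
        'ROBOTSTXT_OBEY': True,
    }

    def parse(self, response):
        self.logger.info('crawled %s', response.url)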

I'm also a beginner; I haven't used Scrapy systematically, only for a few small practice projects, so if there are any errors, please point them out.

With this many settings, you can't look them all up every time. A handy trick is to edit the settings.py template that the scrapy startproject command copies into every new project, so that the comments and parameters below are saved in that file. Then, whenever you create a new project, you only need to check which parameters in settings.py actually need changing. If you use Anaconda, the template lives under Anaconda\Lib\site-packages\scrapy\templates\project\module.

Annotated notes for most of the configurations in settings.py:

# Project (bot) name
BOT_NAME = '$project_name'

SPIDER_MODULES = ['$project_name.spiders']
NEWSPIDER_MODULE = '$project_name.spiders'

# Maximum number of concurrent items (per response) processed in parallel in the item processor (also known as the item pipeline), default 100.
#CONCURRENT_ITEMS = 100

# Maximum number of concurrent (i.e., simultaneous) requests the Scrapy downloader will perform, default 16
CONCURRENT_REQUESTS = 8

# The amount of time (in seconds) the downloader should wait before downloading consecutive pages from the same website.
# This can be used to throttle the crawl speed and avoid hitting servers too hard. Fractional values are supported.
# By default, Scrapy does not wait a fixed amount of time between requests, but uses a random interval between 0.5 * DOWNLOAD_DELAY and 1.5 * DOWNLOAD_DELAY.
#DOWNLOAD_DELAY = 0

# Maximum number of concurrent (i.e., simultaneous) requests that will be performed against any single domain, default 8
#CONCURRENT_REQUESTS_PER_DOMAIN = 16

# Maximum number of concurrent (i.e., simultaneous) requests that will be performed against any single IP, default 0.
# If non-zero, CONCURRENT_REQUESTS_PER_DOMAIN is ignored, i.e. limits are applied per IP rather than per domain name. DOWNLOAD_DELAY is then also applied per IP
#CONCURRENT_REQUESTS_PER_IP = 16

# Default class that will be used to instantiate items in the Scrapy shell
#DEFAULT_ITEM_CLASS = 'scrapy.item.Item'

# Maximum crawl depth allowed for any site. If zero, no limit is imposed
#DEPTH_LIMIT = 0

# The sign of DEPTH_PRIORITY determines depth-first or breadth-first crawling: positive values mean breadth-first (BFO), negative values mean depth-first (DFO)
# Formula: request.priority = request.priority - (depth * DEPTH_PRIORITY)
#DEPTH_PRIORITY = 0
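
# For example, the Scrapy documentation suggests the following combination for breadth-first (FIFO) crawling:
#DEPTH_PRIORITY = 1
#SCHEDULER_DISK_QUEUE = 'scrapy.squeues.PickleFifoDiskQueue'
#SCHEDULER_MEMORY_QUEUE = 'scrapy.squeues.FifoMemoryQueue'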

# Whether the cookies middleware is enabled (i.e. automatic cookie handling)
COOKIES_ENABLED = False

# If enabled, Scrapy records all cookies sent in the request (i.e. cookie headers) and all cookies received in the response (i.e. Set-Cookie headers)
#COOKIES_DEBUG = False

# Whether to collect verbose depth statistics. If enabled, the number of requests at each depth is collected in the stats
#DEPTH_STATS_VERBOSE = False

# Whether DNS memory caching is enabled
#DNSCACHE_ENABLED = True

# DNS Memory Cache Size
#DNSCACHE_SIZE = 10000

# Timeout for processing DNS queries, in seconds. Floating-point values are supported
#DNS_TIMEOUT = 60

# Downloader for crawling
#DOWNLOADER = 'scrapy.core.downloader.Downloader'

# Disable Telnet Console (enabled by default)
#TELNETCONSOLE_ENABLED = False

# A dictionary of the downloader middlewares enabled in your project, and their orders
#DOWNLOADER_MIDDLEWARES = {}

# Default headers for Scrapy HTTP requests. They are populated by DefaultHeadersMiddleware
DEFAULT_REQUEST_HEADERS = {
}

# A dictionary of the downloader middlewares enabled by default in Scrapy, and their orders. Low values are closer to the engine, high values are closer to the downloader.
# Do not modify this setting; modify DOWNLOADER_MIDDLEWARES instead
#DOWNLOADER_MIDDLEWARES_BASE = {
#     'scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware': 100,
#     'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware': 300,
#     'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware': 350,
#     'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware': 400,
#     'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': 500,
#     'scrapy.downloadermiddlewares.retry.RetryMiddleware': 550,
#     'scrapy.downloadermiddlewares.ajaxcrawl.AjaxCrawlMiddleware': 560,
#     'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware': 580,
#     'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 590,
#     'scrapy.downloadermiddlewares.redirect.RedirectMiddleware': 600,
#     'scrapy.downloadermiddlewares.cookies.CookiesMiddleware': 700,
#     'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware': 750,
#     'scrapy.downloadermiddlewares.stats.DownloaderStats': 850,
#     'scrapy.downloadermiddlewares.httpcache.HttpCacheMiddleware': 900,
# }
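
# Illustrative example (the custom module path is a placeholder): enable your own downloader middleware
# and disable a built-in one by assigning None to it in DOWNLOADER_MIDDLEWARES:
#DOWNLOADER_MIDDLEWARES = {
#    'myproject.middlewares.CustomProxyMiddleware': 543,
#    'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': None,
#}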

# Whether to enable downloader statistics collection
#DOWNLOADER_STATS = True

# A dictionary containing request download handlers enabled in the project
#DOWNLOAD_HANDLERS = {}

# Default dictionary containing request download handlers
# If you want to disable FTP handlers, set DOWNLOAD_HANDLERS = {'ftp': None}
#DOWNLOAD_HANDLERS_BASE = {
#     'file': 'scrapy.core.downloader.handlers.file.FileDownloadHandler',
#     'http': 'scrapy.core.downloader.handlers.http.HTTPDownloadHandler',
#     'https': 'scrapy.core.downloader.handlers.http.HTTPDownloadHandler',
#     's3': 'scrapy.core.downloader.handlers.s3.S3DownloadHandler',
#     'ftp': 'scrapy.core.downloader.handlers.ftp.FTPDownloadHandler',
# }

# Timeout for downloading programs in seconds
#DOWNLOAD_TIMEOUT = 180

# Maximum response size (in bytes, default 1024 MB) the downloader will download; 0 means no limit
#DOWNLOAD_MAXSIZE = 1073741824

# Response size (in bytes, default 32 MB) above which the downloader starts issuing warnings
#DOWNLOAD_WARNSIZE = 33554432

# Whether to raise a ResponseFailed([_DataLoss]) error when the declared Content-Length does not match the content sent by the server.
# If False, you can instead check for 'dataloss' in response.flags and handle it in your spider
#DOWNLOAD_FAIL_ON_DATALOSS = True
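# Example check inside a spider callback (sketch):
#   if 'dataloss' in response.flags:
#       self.logger.warning('partial response received: %s', response.url)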

# Classes for detecting and filtering duplicate requests
#DUPEFILTER_CLASS = 'scrapy.dupefilters.RFPDupeFilter'

# By default, RFPDupeFilter only records the first duplicate request. Set DUPEFILTER_DEBUG to True and it will record all duplicate requests.
#DUPEFILTER_DEBUG = False

# A dictionary containing the extensions enabled in your project and their order
#EXTENSIONS = {}

# Dictionary of the extensions available by default in Scrapy, and their orders
#EXTENSIONS_BASE = {
#     'scrapy.extensions.corestats.CoreStats': 0,
#     'scrapy.extensions.telnet.TelnetConsole': 0,
#     'scrapy.extensions.memusage.MemoryUsage': 0,
#     'scrapy.extensions.memdebug.MemoryDebugger': 0,
#     'scrapy.extensions.closespider.CloseSpider': 0,
#     'scrapy.extensions.feedexport.FeedExporter': 0,
#     'scrapy.extensions.logstats.LogStats': 0,
#     'scrapy.extensions.spiderstate.SpiderState': 0,
#     'scrapy.extensions.throttle.AutoThrottle': 0,
# }

# A dictionary containing the item pipelines to use and their orders. Values are arbitrary, but it is customary to define them in the 0-1000 range. Lower values run before higher values
#ITEM_PIPELINES = {}
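# Illustrative example (the pipeline path is a placeholder for one defined in your project):
#ITEM_PIPELINES = {
#    'myproject.pipelines.MyProjectPipeline': 300,
#}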

# Whether logging is enabled
#LOG_ENABLED = True

# Encoding for logging
#LOG_ENCODING = 'utf-8'

# File name used to record output
#LOG_FILE = None

# String used to format log messages
#LOG_FORMAT = '%(asctime)s [%(name)s] %(levelname)s: %(message)s'

# String used to format date/time, to change asctime placeholder in LOG_FORMAT
#LOG_DATEFORMAT = '%Y-%m-%d %H:%M:%S'

# Classes for formatting log messages for different operations
#LOG_FORMATTER = "scrapy.logformatter.LogFormatter"

# Minimum record level, available: CRITICAL, ERROR, WARNING, INFO, DEBUG
#LOG_LEVEL = 'DEBUG'

# If True, all standard output (and errors) will be redirected to the log; print output, for example, will also be logged
#LOG_STDOUT = False

# If True, the log will contain only the root path; if False, the component responsible for the log output will be displayed
#LOG_SHORT_NAMES = False

# Interval between printouts of each statistical record in seconds
#LOGSTATS_INTERVAL = 60.0

# Whether memory debugging is enabled
#MEMDEBUG_ENABLED = False

# When memory debugging is enabled, the memory report is sent to the specified e-mail addresses if this setting is not empty; otherwise it is written to the log.
# For example: MEMDEBUG_NOTIFY = ['user@example.com']
#MEMDEBUG_NOTIFY = []

# Whether the memory usage extension is enabled. This extension tracks the peak memory used by the process (writing it to the stats).
# It can also optionally shut down the Scrapy process when a memory limit is exceeded, and notify you by e-mail when that happens
#MEMUSAGE_ENABLED = True

# Maximum amount of memory (in megabytes) allowed before shutting down Scrapy; if zero, no check is performed
#MEMUSAGE_LIMIT_MB = 0

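# Interval, in seconds, at which the memory usage extension checks the current memory usage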
#MEMUSAGE_CHECK_INTERVAL_SECONDS = 60.0

# E-mail list to notify if memory limit has been reached
#MEMUSAGE_NOTIFY_MAIL = False

# Maximum amount of memory (in megabytes) allowed before sending a warning e-mail notification. If zero, no warning is issued
#MEMUSAGE_WARNING_MB = 0

# Module where new spiders created with the genspider command will be placed
#NEWSPIDER_MODULE = ""

# If enabled, Scrapy will wait for a random time (between 0.5 * DOWNLOAD_DELAY and 1.5 * DOWNLOAD_DELAY) while getting requests from the same site
#RANDOMIZE_DOWNLOAD_DELAY = True

# Maximum limit on the size of the Twisted Reactor thread pool.This is a common multipurpose thread pool used by various Scrapy components.
# Thread DNS resolver, BlockingFeedStorage, S3FilesStore are just a few examples.
# Increase this value if you are running into problems caused by insufficient blocking IO.
#REACTOR_THREADPOOL_MAXSIZE = 10

# Defines the maximum number of times a request can be redirected. Beyond this maximum, the response is returned as-is
#REDIRECT_MAX_TIMES = 20

# Adjust the priority of a redirected request relative to the original request; a positive value means higher priority
#REDIRECT_PRIORITY_ADJUST = 2

# Adjust the priority of a retried request relative to the original request; a negative value means lower priority
#RETRY_PRIORITY_ADJUST = -1

# Whether to obey robots.txt rules
ROBOTSTXT_OBEY = False

# Parser backend for parsing robots.txt files
#ROBOTSTXT_PARSER = 'scrapy.robotstxt.ProtegoRobotParser'

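# User agent string used for matching in the robots.txt file; if None, the User-Agent header sent with the request or the USER_AGENT setting is used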
#ROBOTSTXT_USER_AGENT = None

# The scheduler used for crawling
#SCHEDULER = 'scrapy.core.scheduler.Scheduler'

# If set to True, debugging information about the request scheduler is logged
#SCHEDULER_DEBUG = False

# The type of disk queue the scheduler will use. Other available types: scrapy.squeues.PickleFifoDiskQueue,
# scrapy.squeues.MarshalFifoDiskQueue, scrapy.squeues.MarshalLifoDiskQueue
#SCHEDULER_DISK_QUEUE = 'scrapy.squeues.PickleLifoDiskQueue'

# The type of in-memory queue used by the scheduler. Another available type: scrapy.squeues.FifoMemoryQueue
#SCHEDULER_MEMORY_QUEUE = 'scrapy.squeues.LifoMemoryQueue'

# The type of priority queue used by the scheduler. Another available type is scrapy.pqueues.DownloaderAwarePriorityQueue
#SCHEDULER_PRIORITY_QUEUE = 'scrapy.pqueues.ScrapyPriorityQueue'

# Soft limit (in bytes) for the response data being processed.
# If the total size of all the responses being processed is greater than this value, Scrapy will not process new requests
#SCRAPER_SLOT_MAX_ACTIVE_SIZE = 5_000_000

# A dictionary containing spider contracts enabled in your project for testing spiders
#SPIDER_CONTRACTS = {}

# Dictionary of the spider contracts enabled by default in Scrapy
#SPIDER_CONTRACTS_BASE = {
#     'scrapy.contracts.default.UrlContract' : 1,
#     'scrapy.contracts.default.ReturnsContract': 2,
#     'scrapy.contracts.default.ScrapesContract': 3,
# }

# Classes that will be used to load spiders
#SPIDER_LOADER_CLASS = 'scrapy.spiderloader.SpiderLoader'

# A dictionary of the spider middlewares enabled in your project, and their orders
#SPIDER_MIDDLEWARES = {}

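# A dictionary of the spider middlewares enabled by default in Scrapy, and their orders. Do not modify this setting; modify SPIDER_MIDDLEWARES instead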
#SPIDER_MIDDLEWARES_BASE = {
#     'scrapy.spidermiddlewares.httperror.HttpErrorMiddleware': 50,
#     'scrapy.spidermiddlewares.offsite.OffsiteMiddleware': 500,
#     'scrapy.spidermiddlewares.referer.RefererMiddleware': 700,
#     'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware': 800,
#     'scrapy.spidermiddlewares.depth.DepthMiddleware': 900,
# }

# A list of modules where Scrapy will look for spiders
#SPIDER_MODULES = []

# Class used to collect statistics
#STATS_CLASS = 'scrapy.statscollectors.MemoryStatsCollector'

# When the spider is finished, dump the Scrapy statistics into the Scrapy log
#STATS_DUMP = True

# List of e-mail addresses to send Scrapy statistics to after a spider finishes scraping
#STATSMAILER_RCPTS = []

# Specifies whether the telnet console will be enabled
#TELNETCONSOLE_ENABLED = True

# Port range for the telnet console. If set to None or 0, a dynamically assigned port is used
#TELNETCONSOLE_PORT = [6023, 6073]

# The directory in which to look for templates when using the startproject command to create a new project and the genspider command to create a new Spider
#TEMPLATES_DIR = "templates"

# Maximum URL length allowed for crawled URLs
#URLLENGTH_LIMIT = 2083

# Default User-Agent used when crawling
#USER_AGENT = "Scrapy/VERSION (+https://scrapy.org)"
