- The ROBOTSTXT_OBEY variable in settings.py is best set to False; the default is True. Without this change, many sites whose robots.txt forbids spiders cannot be crawled.
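For example, the relevant line in settings.py:

    # settings.py
    # stop Scrapy from honoring robots.txt rules that would otherwise block the crawl
    ROBOTSTXT_OBEY = False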
A failed fetch returns a 403, for example:
    [scrapy.core.engine] DEBUG: Crawled (403) <GET https://www.deepsc.net/> (referer: None)
    [scrapy.spidermiddlewares.httperror] INFO: Ignoring response <403 https://www.deepsc.net/>: HTTP status code is not handled or not allowed
Cause: the target server refused the request. Beginners who have not yet configured a User-Agent (UA) are very likely to run into this. It can also mean the IP was banned for requesting too frequently (see the throttling sketch at the end of this section). First, let's configure the UA.
Add the following class to middlewares.py:
    import random

    class RandomUserAgentMiddleware():
        def __init__(self):
            # pool of User-Agent strings to rotate through
            self.uas = [
                "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/535.11 (KHTML, like Gecko) Chrome/17.0.963.56 Safari/535.11",
                "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_3) AppleWebKit/535.20 (KHTML, like Gecko) Chrome/19.0.1036.7 Safari/535.20",
                "Opera/9.80 (Macintosh; Intel Mac OS X 10.6.8; U; fr) Presto/2.9.168 Version/11.52"
            ]

        def process_request(self, request, spider):
            # pick a random UA for every outgoing request
            # (note: random.choice, not Random.choice -- the module name is lowercase)
            request.headers['User-Agent'] = random.choice(self.uas)
Then, in settings.py, find the DOWNLOADER_MIDDLEWARES = {} field and add 'projectname.middlewares.RandomUserAgentMiddleware': 543 to it, replacing projectname with your project's package name, as shown below. That's all it takes.
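A minimal sketch of the result (myproject is a stand-in for your actual package name):

    # settings.py
    DOWNLOADER_MIDDLEWARES = {
        'myproject.middlewares.RandomUserAgentMiddleware': 543,
        # optionally disable Scrapy's built-in UA middleware (priority 500);
        # at priority 543 ours runs after it anyway and overwrites the header
        'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': None,
    }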
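If the 403s come from the frequency-based ban mentioned earlier rather than a missing UA, slowing the crawl down usually helps. A sketch using Scrapy's built-in settings (the exact values are assumptions; tune them for your target site):

    # settings.py
    DOWNLOAD_DELAY = 2            # wait 2 seconds between requests to the same site
    AUTOTHROTTLE_ENABLED = True   # let Scrapy adapt the delay to server responsiveness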