
Scrapy Beginner Study Notes (Problems Encountered)

December 20, 2020 • Data Collection and Data Analysis (python)

  • The ROBOTSTXT_OBEY variable in settings.py is best set to False; it defaults to True. If you don't change it, many sites whose robots.txt forbids spiders cannot be crawled.
  • The crawl fails with a 403 response, e.g.:

     [scrapy.core.engine] DEBUG: Crawled (403) <GET https://www.deepsc.net/> (referer: None)
     [scrapy.spidermiddlewares.httperror] INFO: Ignoring response <403 https://www.deepsc.net/>: HTTP status code is not handled or not allowed

    Cause: the target server refused the request. Beginners who have not yet configured a User-Agent are very likely to run into this; it can also mean your IP was banned for requesting too frequently. Let's start with configuring the UA.
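The ROBOTSTXT_OBEY change from the first bullet is a single line in settings.py, shown here as a minimal fragment:

```python
# settings.py
# Scrapy obeys robots.txt by default; disable this to crawl sites whose
# robots.txt forbids spiders (check the site's terms before doing so)
ROBOTSTXT_OBEY = False
```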

Add a class in middlewares.py:

import random


class RandomUserAgentMiddleware:
    """Downloader middleware that sets a random User-Agent on each request."""

    def __init__(self):
        self.uas = [
            "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/535.11 (KHTML, like Gecko) Chrome/17.0.963.56 Safari/535.11",
            "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_3) AppleWebKit/535.20 (KHTML, like Gecko) Chrome/19.0.1036.7 Safari/535.20",
            "Opera/9.80 (Macintosh; Intel Mac OS X 10.6.8; U; fr) Presto/2.9.168 Version/11.52"
        ]

    def process_request(self, request, spider):
        # Overwrite the User-Agent header before the request is sent
        request.headers['User-Agent'] = random.choice(self.uas)

Then, in settings.py, find the DOWNLOADER_MIDDLEWARES = {} setting and add 'project_name.middlewares.RandomUserAgentMiddleware': 543 to it (replace project_name with your project's name).
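For concreteness, the settings.py entry looks like this (project_name is a placeholder for your actual project package):

```python
# settings.py
# Order 543 runs after Scrapy's built-in UserAgentMiddleware (order 500),
# so the random UA overwrites the default one on every request.
DOWNLOADER_MIDDLEWARES = {
    'project_name.middlewares.RandomUserAgentMiddleware': 543,
}
```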

Last Modified: January 4, 2021