Scrapy is an application framework written for crawling websites and extracting structured data.

1. Installation

pip install scrapy
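
To confirm the install worked, the CLI can print the installed version:

scrapy version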

2. Usage

  1. Create a scrapy project

    cmd: scrapy startproject <project-name>
  2. Create the spider file

    • cd into the spiders folder and create the spider file there

    • Generate the spider file

      • The URL is usually given without the http:// scheme

      • # scrapy genspider <spider-name> <page-to-crawl>
        scrapy genspider baidu www.baidu.com
        

        - The generated file:

        ```python
        import scrapy


        class BaiduSpider(scrapy.Spider):
            # Name of the spider: the value used when running it
            name = "baidu"
            # Domains the spider is allowed to visit
            allowed_domains = ["www.baidu.com"]
            # Start URLs: the first addresses requested
            # start_urls is allowed_domains with http:// added in front and / appended at the end
            start_urls = ["http://www.baidu.com"]

            # Runs after the start_urls have been requested; response is the final returned object,
            # roughly equivalent to response = urllib.request.urlopen()
            # or response = requests.get()
            def parse(self, response):
                pass
        ```
  3. Run the spider

    • scrapy crawl <spider-name>

    • scrapy crawl baidu
      

      - The robots.txt "gentlemen's agreement": https://www.baidu.com/robots.txt

      ```python
      # Comment out ROBOTSTXT_OBEY = True in settings.py
      ```

3. response attributes and methods

  1. response.text: the response body as a string
  2. response.body: the response body as bytes
  3. response.xpath: run an XPath query directly against the response
  4. .extract(): extract the data values of the Selector objects returned by xpath (called on the SelectorList)
  5. .extract_first(): get the first item of the Selector list
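
A minimal sketch tying these calls together (a hypothetical DemoSpider reusing the Baidu start URL from section 2; the //title query is only illustrative):

```python
import scrapy


class DemoSpider(scrapy.Spider):
    name = "demo"
    start_urls = ["http://www.baidu.com"]

    def parse(self, response):
        html_text = response.text    # the response as a str
        html_bytes = response.body   # the response as bytes

        # xpath() returns a SelectorList
        titles = response.xpath('//title/text()')

        print(titles.extract())        # data values of every matched selector
        print(titles.extract_first())  # data value of the first one, or None if nothing matched
```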

4. scrapy shell

An interactive shell for spiders. It is convenient for debugging: instead of running scrapy crawl <name> every time, you press Enter in the console and immediately get the data you want to test.

4.1 Usage

Install ipython

pip install ipython

The ipython terminal provides smart auto-completion, highlighted output, and other features.

Running it

Open cmd and type scrapy shell …; once the output finishes it drops into ipython automatically.

  1. scrapy shell www.baidu.com
  2. scrapy shell http://www.baidu.com
  3. scrapy shell 'http://www.baidu.com'
  4. scrapy shell 'www.baidu.com'

Syntax

The response object

  1. response.body
  2. response.text
  3. response.url
  4. response.status

Parsing the response

  1. response.xpath() [most commonly used]
    • Query elements with an XPath expression; returns a SelectorList
  2. response.css()
    • Query elements with a CSS selector; returns a SelectorList
    • Get text content: response.css('#su::text').extract_first()
    • Get an attribute: response.css('#su::attr(value)').extract_first()
  3. The Selector object (what xpath() returns is a SelectorList)
    • extract()
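
A short shell session exercising these calls (run against Baidu as above; it assumes the page still has its `<input id="su">` search button):

```python
# scrapy shell www.baidu.com
response.status                                              # e.g. 200
response.url                                                 # final URL after any redirects
response.xpath('//input[@id="su"]/@value').extract_first()   # button label via XPath
response.css('#su::attr(value)').extract_first()             # same value via CSS
```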

5. CrawlSpider

Key point: this feature is for extracting the pagination links at the bottom of a page, not the entry links of individual items; those can be grabbed with a plain XPath query.

Example: 读书网 (dushu.com)

  1. It inherits from scrapy.Spider
  2. CrawlSpider lets you define rules for parsing the HTML and extracting the links that match them. If you need to follow links, i.e. after crawling a page you extract its links and crawl those as well, CrawlSpider is a very good fit.

scrapy shell example

# Start scrapy shell
scrapy shell https://www.dushu.com/book/1188.html
# Import LinkExtractor
from scrapy.linkextractors import LinkExtractor
# allow specifies the link pattern (a regex)
link = LinkExtractor(allow=r'/book/1188_\d+\.html')
# restrict_xpaths variant
# link1 = LinkExtractor(restrict_xpaths=r'//div[@class="pages"]/a/@href')
# restrict_css variant
# link2 = LinkExtractor(restrict_css='.x')
# Get the URLs matching the rule
link.extract_links(response)

5.1 Link extraction rules

scrapy.linkextractors.LinkExtractor(
    allow = (),
    deny = (),
    restrict_xpaths = (),
    restrict_css = ()
)

5.2 Writing a rule

links = LinkExtractor(allow=(r''))

5.3 Extracting the links

link.extract_links(res)
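
Inside a CrawlSpider these pieces are wired together through Rule. A minimal sketch reusing the pagination pattern from the shell example above (the full project appears in example 7.5 below):

```python
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


class PagesSpider(CrawlSpider):
    name = "pages"
    allowed_domains = ["www.dushu.com"]
    start_urls = ["https://www.dushu.com/book/1188_1.html"]

    # Follow every pagination link that matches the regex and hand each page to parse_item
    rules = (
        Rule(LinkExtractor(allow=r"/book/1188_\d+\.html"), callback="parse_item", follow=True),
    )

    def parse_item(self, response):
        # parse each listing page here
        pass
```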

6. Logging

Log levels:

  1. CRITICAL: severe errors
  2. ERROR: ordinary errors
  3. WARNING: warnings
  4. INFO: general information
  5. DEBUG: debugging information

The default is DEBUG.

settings.py configuration

LOG_FILE = 'xxx.log'  # must have a .log suffix

LOG_LEVEL = '<level>'  # not recommended to set; once LOG_FILE is configured, console output changes to the WARNING level
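
A minimal settings.py sketch (the filename is only an example) that routes the crawl log to a file:

```python
# settings.py
# Write the full crawl log to a file instead of the console
LOG_FILE = 'spider.log'
# Optionally raise the threshold so only warnings and above are recorded
# LOG_LEVEL = 'WARNING'
```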

7. Examples

7.1 58同城 (58.com)

Setup

scrapy startproject scrapy_58tc
cd ...../spiders
scrapy genspider tc https://cn.58.com/sou/?key=%E5%89%8D%E7%AB%AF%E5%BC%80%E5%8F%91

Code

import scrapy

# Grab one simple piece of data, mainly to get familiar with the response API
class TcSpider(scrapy.Spider):
    name = "tc"
    allowed_domains = ["cn.58.com"]
    start_urls = ["https://cn.58.com/sou/?key=%E5%89%8D%E7%AB%AF%E5%BC%80%E5%8F%91"]

    def parse(self, response):
        # as a string
        content = response.text
        # as bytes (this overwrites content)
        content = response.body
        # XPath syntax: response.xpath
        span = response.xpath('//div[@id="filter"]/div[@class="tabs"]/a[@class="select"]/span')[0]
        # span == <Selector query='//div[@id="filter"]/div[@class="tabs"]/a[@class="select"]/span' data='<span>全部</span>'>
        # Extract the selector's data value: <span>全部</span>
        # (a single Selector uses extract(); extract_first() belongs to SelectorList)
        print(span.extract())

Run

scrapy crawl tc

7.2 汽车之家 (Autohome)

Setup

scrapy startproject scrapy_carhome
cd ......./spiders
scrapy genspider car https://car.autohome.com.cn/price/brand-15.html

Code

import scrapy

# Grab the model names and prices from this page
class CarSpider(scrapy.Spider):
    name = "car"
    allowed_domains = ["car.autohome.com.cn"]
    start_urls = ["https://car.autohome.com.cn/price/brand-15.html"]

    def parse(self, response):
        print('=================================')
        name_list = response.xpath('//div[@class="tab-content fn-visible"]/div[@id="brandtab-1"]//div[@class="list-cont"]//div[@class="list-cont-main"]/div[@class="main-title"]/a/text()')
        price_list = response.xpath('//div[@class="main-lever"]//span[@class="font-arial"]/text()')
        for i in range(len(name_list)):
            name = name_list[i].extract()
            price = price_list[i].extract()
            print(name, price)

7.3 当当网 (Dangdang)

Concepts used: yield, wrapping data in items, multiple pipelines, multi-page downloads

  1. Create the project

    scrapy startproject scrapy_dangdangnet
    cd ..../spiders
    scrapy genspider dang https://category.dangdang.com/cp01.01.02.00.00.00.html
  2. Project code

    • dang.py

      1. The class-level attributes base_url and page define the URL used for multi-page queries
      2. The first half of parse extracts the data on the current page
      3. The code after yield book implements the multi-page crawl
      import scrapy
      from scrapy_dangdangnet.items import ScrapyDangdangnetItem


      class DangSpider(scrapy.Spider):
          name = "dang"
          # For multi-page downloads allowed_domains may need a wider scope; normally just the bare domain
          allowed_domains = ["category.dangdang.com"]
          start_urls = ["https://category.dangdang.com/cp01.01.02.00.00.00.html"]

          base_url = 'https://category.dangdang.com/pg'
          page = 1

          def parse(self, response):
              # quick anti-crawling test
              # print('=======================')
              # pipelines: persist / download the data
              # items: define the data structure

              # src = //ul[@id="component_59"]/li/a/img/@src
              # alt = //ul[@id="component_59"]/li/a/img/@alt
              # price = //ul[@id="component_59"]/li/p[@class="price"]/span[@class="search_now_price"]/text()
              # Every Selector object can call xpath again
              li_list = response.xpath('//ul[@id="component_59"]/li')
              for li in li_list:
                  # lazy-loaded src
                  # src = li.xpath('./a/img/@src').extract_first()
                  src = li.xpath('./a/img/@data-original').extract_first()
                  # The first image's src is the real address; the rest are lazy-loaded via
                  # data-original, so data-original is None for the first image
                  if src:
                      src = src
                  else:
                      src = li.xpath('./a/img/@src').extract_first()
                  name = li.xpath('./a/img/@alt').extract_first()
                  price = li.xpath('./p[@class="price"]/span[@class="search_now_price"]/text()').extract_first()
                  # print(src, name, price)
                  # Import ScrapyDangdangnetItem from scrapy_dangdangnet.items to wrap the data
                  book = ScrapyDangdangnetItem(src=src, name=name, price=price)
                  # Hand each item to the pipelines as soon as it is built
                  yield book

              # Crawl further pages
              # https://category.dangdang.com/pg4-cp01.01.02.00.00.00.html
              if self.page < 100:
                  self.page += 1

                  url = self.base_url + str(self.page) + '-cp01.01.02.00.00.00.html'
                  # scrapy.Request is scrapy's GET request
                  # url is the request address
                  # callback is the function to run; note that parse must not be written with parentheses
                  yield scrapy.Request(url=url, callback=self.parse)
    • items.py

      1. The .py file defining the data structure of what is scraped; nothing special here
      # Define here the models for your scraped items
      #
      # See documentation in:
      # https://docs.scrapy.org/en/latest/topics/items.html

      import scrapy

      class ScrapyDangdangnetItem(scrapy.Item):
          # define the fields for your item here like:
          # name = scrapy.Field()
          # In plain terms: what fields the scraped data has
          # image
          src = scrapy.Field()
          # name
          name = scrapy.Field()
          # price
          price = scrapy.Field()
    • pipelines.py

      1. The pipeline that writes the output file
      2. The open_spider/close_spider(self, spider) methods are added by hand so the pipeline can read and write efficiently
      3. DangDangDownloadPipeline is a custom second pipeline; it uses urllib.request to download the images locally
      # Define your item pipelines here
      #
      # Don't forget to add your pipeline to the ITEM_PIPELINES setting
      # See: https://docs.scrapy.org/en/latest/topics/item-pipeline.html


      # useful for handling different item types with a single interface
      from itemadapter import ItemAdapter


      # To use a pipeline it must be enabled (uncommented) in settings.py:
      # ITEM_PIPELINES = {
      #     "scrapy_dangdangnet.pipelines.ScrapyDangdangnetPipeline": 300,
      # }
      class ScrapyDangdangnetPipeline:
          # Runs before the spider starts
          def open_spider(self, spider):
              self.fp = open('book.json', 'w', encoding='utf-8')

          # item is the book object handed over by yield
          def process_item(self, item, spider):
              # The pattern below is not recommended: it opens the file far too often

              # write() needs a str, not an object
              # 'w' mode would overwrite the data on every call
              # with open('book.json', 'a', encoding='utf-8') as fp:
              #     fp.write(str(item))

              self.fp.write(str(item))
              return item

          # Runs after the spider finishes
          def close_spider(self, spider):
              self.fp.close()


      import urllib.request


      # A second pipeline; enable it in settings.py:
      # "scrapy_dangdangnet.pipelines.DangDangDownloadPipeline": 301,
      class DangDangDownloadPipeline:
          def process_item(self, item, spider):
              url = 'http:' + item.get('src')
              filename = './books/' + item.get('name') + '.jpg'

              urllib.request.urlretrieve(url=url, filename=filename)
              return item

    • settings.py

      1. The project's configuration file

      2. Commenting out ROBOTSTXT_OBEY = True means ignoring the robots.txt "gentlemen's agreement" and crawling anyway

      3. Enable the pipelines:

        ITEM_PIPELINES = {
            # The value is the priority, in the range 1-1000; the smaller the value, the higher the priority
            "scrapy_dangdangnet.pipelines.ScrapyDangdangnetPipeline": 300,
            # DangDangDownloadPipeline enables the second pipeline
            "scrapy_dangdangnet.pipelines.DangDangDownloadPipeline": 301,
        }

      # Scrapy settings for scrapy_dangdangnet project
      #
      # For simplicity, this file contains only settings considered important or
      # commonly used. You can find more settings consulting the documentation:
      #
      # https://docs.scrapy.org/en/latest/topics/settings.html
      # https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
      # https://docs.scrapy.org/en/latest/topics/spider-middleware.html

      BOT_NAME = "scrapy_dangdangnet"

      SPIDER_MODULES = ["scrapy_dangdangnet.spiders"]
      NEWSPIDER_MODULE = "scrapy_dangdangnet.spiders"

      # Crawl responsibly by identifying yourself (and your website) on the user-agent
      # USER_AGENT = "scrapy_dangdangnet (+http://www.yourdomain.com)"

      # Obey robots.txt rules
      ROBOTSTXT_OBEY = True

      # Configure maximum concurrent requests performed by Scrapy (default: 16)
      # CONCURRENT_REQUESTS = 32

      # Configure a delay for requests for the same website (default: 0)
      # See https://docs.scrapy.org/en/latest/topics/settings.html#download-delay
      # See also autothrottle settings and docs
      # DOWNLOAD_DELAY = 3
      # The download delay setting will honor only one of:
      # CONCURRENT_REQUESTS_PER_DOMAIN = 16
      # CONCURRENT_REQUESTS_PER_IP = 16

      # Disable cookies (enabled by default)
      # COOKIES_ENABLED = False

      # Disable Telnet Console (enabled by default)
      # TELNETCONSOLE_ENABLED = False

      # Override the default request headers:
      # DEFAULT_REQUEST_HEADERS = {
      # "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
      # "Accept-Language": "en",
      # }

      # Enable or disable spider middlewares
      # See https://docs.scrapy.org/en/latest/topics/spider-middleware.html
      # SPIDER_MIDDLEWARES = {
      # "scrapy_dangdangnet.middlewares.ScrapyDangdangnetSpiderMiddleware": 543,
      # }

      # Enable or disable downloader middlewares
      # See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
      # DOWNLOADER_MIDDLEWARES = {
      # "scrapy_dangdangnet.middlewares.ScrapyDangdangnetDownloaderMiddleware": 543,
      # }

      # Enable or disable extensions
      # See https://docs.scrapy.org/en/latest/topics/extensions.html
      # EXTENSIONS = {
      # "scrapy.extensions.telnet.TelnetConsole": None,
      # }

      # Configure item pipelines
      # See https://docs.scrapy.org/en/latest/topics/item-pipeline.html
      ITEM_PIPELINES = {
          # The value is the priority, 1-1000; the smaller the value, the higher the priority
          "scrapy_dangdangnet.pipelines.ScrapyDangdangnetPipeline": 300,
          # DangDangDownloadPipeline: the second pipeline
          "scrapy_dangdangnet.pipelines.DangDangDownloadPipeline": 301,
      }

      # Enable and configure the AutoThrottle extension (disabled by default)
      # See https://docs.scrapy.org/en/latest/topics/autothrottle.html
      # AUTOTHROTTLE_ENABLED = True
      # The initial download delay
      # AUTOTHROTTLE_START_DELAY = 5
      # The maximum download delay to be set in case of high latencies
      # AUTOTHROTTLE_MAX_DELAY = 60
      # The average number of requests Scrapy should be sending in parallel to
      # each remote server
      # AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
      # Enable showing throttling stats for every response received:
      # AUTOTHROTTLE_DEBUG = False

      # Enable and configure HTTP caching (disabled by default)
      # See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
      # HTTPCACHE_ENABLED = True
      # HTTPCACHE_EXPIRATION_SECS = 0
      # HTTPCACHE_DIR = "httpcache"
      # HTTPCACHE_IGNORE_HTTP_CODES = []
      # HTTPCACHE_STORAGE = "scrapy.extensions.httpcache.FilesystemCacheStorage"

      # Set settings whose default value is deprecated to a future-proof value
      REQUEST_FINGERPRINTER_IMPLEMENTATION = "2.7"
      TWISTED_REACTOR = "twisted.internet.asyncioreactor.AsyncioSelectorReactor"
      FEED_EXPORT_ENCODING = "utf-8"

  3. Run the project

    scrapy crawl dang

7.4 电影天堂 (dy.dytt8.net)

Concept used: one item holding data from multiple page levels

  1. Create the project

    scrapy startproject scrapy_movie
    cd ......./spiders
    scrapy genspider mv https://dy.dytt8.net/html/gndy/dyzz/index.html
  2. Project code

    • mv.py

      import scrapy
      from scrapy_movie.items import ScrapyMovieItem


      class MvSpider(scrapy.Spider):
          name = "mv"
          allowed_domains = ["dy.dytt8.net"]
          start_urls = ["https://dy.dytt8.net/html/gndy/dyzz/index.html"]

          def parse(self, response):
              print('=========running==============')
              # From the first level, grab the name and the href leading to the second level
              # href = //div[@class="co_content8"]/ul/table[@class="tbspan"]//td/b/a/@href
              # name = //div[@class="co_content8"]/ul/table[@class="tbspan"]//td/b/a/text()
              a_list = response.xpath('//div[@class="co_content8"]/ul//table[@class="tbspan"]//a')
              for a in a_list:
                  # name on the first page and the link to follow
                  name = a.xpath('./text()').extract_first()
                  href = a.xpath('./@href').extract_first()

                  # address of the second-level page
                  url = 'https://dy.dytt8.net' + href
                  # go to the second-level page, passing name along via meta
                  yield scrapy.Request(url=url, callback=self.parse_second, meta={'name': name})

          def parse_second(self, response):
              # the span tag cannot be matched here
              # tags such as tbody cannot be matched either
              # if no data comes back, double-check the xpath expression
              src = response.xpath('//div[@id="Zoom"]//img/@src').extract_first()
              name = response.meta['name']
              movie = ScrapyMovieItem(name=name, src=src)
              yield movie
    • items.py

      # Define here the models for your scraped items
      #
      # See documentation in:
      # https://docs.scrapy.org/en/latest/topics/items.html

      import scrapy


      class ScrapyMovieItem(scrapy.Item):
          # define the fields for your item here like:
          # name = scrapy.Field()
          # movie name
          name = scrapy.Field()
          # movie poster
          src = scrapy.Field()
    • pipelines.py

      # Define your item pipelines here
      #
      # Don't forget to add your pipeline to the ITEM_PIPELINES setting
      # See: https://docs.scrapy.org/en/latest/topics/item-pipeline.html


      # useful for handling different item types with a single interface
      from itemadapter import ItemAdapter


      class ScrapyMoviePipeline:

          def open_spider(self, spider):
              self.fp = open('movie.json', 'w', encoding='utf-8')
              self.fp.write('[')

          def process_item(self, item, spider):
              self.fp.write(str(item))
              self.fp.write(',')
              return item

          def close_spider(self, spider):
              self.fp.write(']')
              self.fp.close()
    • settings.py

      # Scrapy settings for scrapy_movie project
      #
      # For simplicity, this file contains only settings considered important or
      # commonly used. You can find more settings consulting the documentation:
      #
      # https://docs.scrapy.org/en/latest/topics/settings.html
      # https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
      # https://docs.scrapy.org/en/latest/topics/spider-middleware.html

      BOT_NAME = "scrapy_movie"

      SPIDER_MODULES = ["scrapy_movie.spiders"]
      NEWSPIDER_MODULE = "scrapy_movie.spiders"


      # Crawl responsibly by identifying yourself (and your website) on the user-agent
      #USER_AGENT = "scrapy_movie (+http://www.yourdomain.com)"

      # Obey robots.txt rules
      # ROBOTSTXT_OBEY = True

      # Configure maximum concurrent requests performed by Scrapy (default: 16)
      #CONCURRENT_REQUESTS = 32

      # Configure a delay for requests for the same website (default: 0)
      # See https://docs.scrapy.org/en/latest/topics/settings.html#download-delay
      # See also autothrottle settings and docs
      #DOWNLOAD_DELAY = 3
      # The download delay setting will honor only one of:
      #CONCURRENT_REQUESTS_PER_DOMAIN = 16
      #CONCURRENT_REQUESTS_PER_IP = 16

      # Disable cookies (enabled by default)
      #COOKIES_ENABLED = False

      # Disable Telnet Console (enabled by default)
      #TELNETCONSOLE_ENABLED = False

      # Override the default request headers:
      #DEFAULT_REQUEST_HEADERS = {
      # "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
      # "Accept-Language": "en",
      #}

      # Enable or disable spider middlewares
      # See https://docs.scrapy.org/en/latest/topics/spider-middleware.html
      #SPIDER_MIDDLEWARES = {
      # "scrapy_movie.middlewares.ScrapyMovieSpiderMiddleware": 543,
      #}

      # Enable or disable downloader middlewares
      # See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
      #DOWNLOADER_MIDDLEWARES = {
      # "scrapy_movie.middlewares.ScrapyMovieDownloaderMiddleware": 543,
      #}

      # Enable or disable extensions
      # See https://docs.scrapy.org/en/latest/topics/extensions.html
      #EXTENSIONS = {
      # "scrapy.extensions.telnet.TelnetConsole": None,
      #}

      # Configure item pipelines
      # See https://docs.scrapy.org/en/latest/topics/item-pipeline.html
      ITEM_PIPELINES = {
          "scrapy_movie.pipelines.ScrapyMoviePipeline": 300,
      }

      # Enable and configure the AutoThrottle extension (disabled by default)
      # See https://docs.scrapy.org/en/latest/topics/autothrottle.html
      #AUTOTHROTTLE_ENABLED = True
      # The initial download delay
      #AUTOTHROTTLE_START_DELAY = 5
      # The maximum download delay to be set in case of high latencies
      #AUTOTHROTTLE_MAX_DELAY = 60
      # The average number of requests Scrapy should be sending in parallel to
      # each remote server
      #AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
      # Enable showing throttling stats for every response received:
      #AUTOTHROTTLE_DEBUG = False

      # Enable and configure HTTP caching (disabled by default)
      # See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
      #HTTPCACHE_ENABLED = True
      #HTTPCACHE_EXPIRATION_SECS = 0
      #HTTPCACHE_DIR = "httpcache"
      #HTTPCACHE_IGNORE_HTTP_CODES = []
      #HTTPCACHE_STORAGE = "scrapy.extensions.httpcache.FilesystemCacheStorage"

      # Set settings whose default value is deprecated to a future-proof value
      REQUEST_FINGERPRINTER_IMPLEMENTATION = "2.7"
      TWISTED_REACTOR = "twisted.internet.asyncioreactor.AsyncioSelectorReactor"
      FEED_EXPORT_ENCODING = "utf-8"

  3. Run the project

    scrapy crawl mv

7.5 读书网 (dushu.com): saving the data to a database

  1. Create the project

    scrapy startproject scrapy_dushu
    cd ..../spiders
    # Generate the crawler class using the crawl template
    scrapy genspider -t crawl read https://www.dushu.com/book/1188.html
  2. Differences from before

    # Generate the crawler class using the crawl template
    scrapy genspider -t crawl read https://www.dushu.com/book/1188.html
    # The base class is different
    class ReadSpider(CrawlSpider)
    # New: the rules attribute
    rules = (Rule(LinkExtractor(allow=r"Items/"), callback="parse_item", follow=True),)
  3. Grabbing the data straight into JSON

    1. read.py

      import scrapy
      from scrapy.linkextractors import LinkExtractor
      from scrapy.spiders import CrawlSpider, Rule

      from scrapy_dushu.items import ScrapyDushuItem


      class ReadSpider(CrawlSpider):
          name = "read"
          allowed_domains = ["www.dushu.com"]
          # CrawlSpider pitfall: this start URL is not matched by the link rule, so that page would be skipped without being parsed
          # start_urls = ["https://www.dushu.com/book/1188.html"]
          # workaround: start from the _1 page, which the rule does match
          start_urls = ["https://www.dushu.com/book/1188_1.html"]
          # allow matches the pagination pattern
          # follow keeps tracking page links: even if only pages 1-13 are visible now, follow=True will reach all later pages
          rules = (Rule(LinkExtractor(allow=r"/book/1188_\d+\.html"), callback="parse_item", follow=True),)

          def parse_item(self, response):
              # print('----------------------------------------------------------')
              img_list = response.xpath('//div[@class="bookslist"]//img')

              for img in img_list:
                  name = img.xpath('./@alt').extract_first()
                  src = img.xpath('./@data-original').extract_first()
                  print(name, src)
                  book = ScrapyDushuItem(name=name, src=src)
                  yield book

    2. settings.py

      # Scrapy settings for scrapy_dushu project
      #
      # For simplicity, this file contains only settings considered important or
      # commonly used. You can find more settings consulting the documentation:
      #
      # https://docs.scrapy.org/en/latest/topics/settings.html
      # https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
      # https://docs.scrapy.org/en/latest/topics/spider-middleware.html

      BOT_NAME = "scrapy_dushu"

      SPIDER_MODULES = ["scrapy_dushu.spiders"]
      NEWSPIDER_MODULE = "scrapy_dushu.spiders"

      # Crawl responsibly by identifying yourself (and your website) on the user-agent
      # USER_AGENT = "scrapy_dushu (+http://www.yourdomain.com)"

      # Obey robots.txt rules
      # ROBOTSTXT_OBEY = True

      # Configure maximum concurrent requests performed by Scrapy (default: 16)
      # CONCURRENT_REQUESTS = 32

      # Configure a delay for requests for the same website (default: 0)
      # See https://docs.scrapy.org/en/latest/topics/settings.html#download-delay
      # See also autothrottle settings and docs
      # DOWNLOAD_DELAY = 3
      # The download delay setting will honor only one of:
      # CONCURRENT_REQUESTS_PER_DOMAIN = 16
      # CONCURRENT_REQUESTS_PER_IP = 16

      # Disable cookies (enabled by default)
      # COOKIES_ENABLED = False

      # Disable Telnet Console (enabled by default)
      # TELNETCONSOLE_ENABLED = False

      # Override the default request headers:
      # DEFAULT_REQUEST_HEADERS = {
      # "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
      # "Accept-Language": "en",
      # }

      # Enable or disable spider middlewares
      # See https://docs.scrapy.org/en/latest/topics/spider-middleware.html
      # SPIDER_MIDDLEWARES = {
      # "scrapy_dushu.middlewares.ScrapyDushuSpiderMiddleware": 543,
      # }

      # Enable or disable downloader middlewares
      # See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
      # DOWNLOADER_MIDDLEWARES = {
      # "scrapy_dushu.middlewares.ScrapyDushuDownloaderMiddleware": 543,
      # }

      # Enable or disable extensions
      # See https://docs.scrapy.org/en/latest/topics/extensions.html
      # EXTENSIONS = {
      # "scrapy.extensions.telnet.TelnetConsole": None,
      # }

      # Configure item pipelines
      # See https://docs.scrapy.org/en/latest/topics/item-pipeline.html
      ITEM_PIPELINES = {
          "scrapy_dushu.pipelines.ScrapyDushuPipeline": 300,
      }

      # Enable and configure the AutoThrottle extension (disabled by default)
      # See https://docs.scrapy.org/en/latest/topics/autothrottle.html
      # AUTOTHROTTLE_ENABLED = True
      # The initial download delay
      # AUTOTHROTTLE_START_DELAY = 5
      # The maximum download delay to be set in case of high latencies
      # AUTOTHROTTLE_MAX_DELAY = 60
      # The average number of requests Scrapy should be sending in parallel to
      # each remote server
      # AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
      # Enable showing throttling stats for every response received:
      # AUTOTHROTTLE_DEBUG = False

      # Enable and configure HTTP caching (disabled by default)
      # See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
      # HTTPCACHE_ENABLED = True
      # HTTPCACHE_EXPIRATION_SECS = 0
      # HTTPCACHE_DIR = "httpcache"
      # HTTPCACHE_IGNORE_HTTP_CODES = []
      # HTTPCACHE_STORAGE = "scrapy.extensions.httpcache.FilesystemCacheStorage"

      # Set settings whose default value is deprecated to a future-proof value
      REQUEST_FINGERPRINTER_IMPLEMENTATION = "2.7"
      TWISTED_REACTOR = "twisted.internet.asyncioreactor.AsyncioSelectorReactor"
      FEED_EXPORT_ENCODING = "utf-8"

    3. pipelines.py

      # Define your item pipelines here
      #
      # Don't forget to add your pipeline to the ITEM_PIPELINES setting
      # See: https://docs.scrapy.org/en/latest/topics/item-pipeline.html


      # useful for handling different item types with a single interface
      from itemadapter import ItemAdapter


      class ScrapyDushuPipeline:
          def open_spider(self, spider):
              self.fp = open('book.json', 'w', encoding='utf-8')

          def process_item(self, item, spider):
              self.fp.write(str(item))
              return item

          def close_spider(self, spider):
              self.fp.close()

    4. items.py

      # Define here the models for your scraped items
      #
      # See documentation in:
      # https://docs.scrapy.org/en/latest/topics/items.html

      import scrapy


      class ScrapyDushuItem(scrapy.Item):
          # define the fields for your item here like:
          # name
          name = scrapy.Field()
          # image path
          src = scrapy.Field()

  4. Saving the data to the database

    1. Create the table

      CREATE TABLE `book` (
          `id` int(11) NOT NULL AUTO_INCREMENT,
          `name` varchar(255) DEFAULT NULL,
          `src` varchar(255) DEFAULT NULL,
          PRIMARY KEY (`id`)
      ) ENGINE=InnoDB DEFAULT CHARSET=utf8;
    2. Configure the database in settings.py

      # Database connection settings
      DB_HOST = '127.0.0.1'
      DB_PORT = 3306
      DB_USER = 'root'
      DB_PASSWORD = 'root'
      DB_NAME = 'dushuwang'
      # Do not put a '-' in the charset: it is 'utf8', not 'utf-8'
      DB_CHARSET = 'utf8'
    3. pipelines.py

      # Add a new pipeline
      class MysqlPipeline:
          def open_spider(self, spider):
              self.fp = open('book.json', 'w', encoding='utf-8')

          def process_item(self, item, spider):
              self.fp.write(str(item))
              return item

          def close_spider(self, spider):
              self.fp.close()
    4. Register the pipeline in settings.py

      ITEM_PIPELINES = {
          "scrapy_dushu.pipelines.ScrapyDushuPipeline": 300,
          # MysqlPipeline
          "scrapy_dushu.pipelines.MysqlPipeline": 301,
      }
    5. In pipelines.py, before MysqlPipeline, add the import that loads the settings.py parameters

      # Load the settings.py file
      from scrapy.utils.project import get_project_settings
    6. Install pymysql

      # used to connect to the database
      pip install pymysql
    7. Update the MysqlPipeline class in pipelines.py

      # Import pymysql
      import pymysql


      class MysqlPipeline:
          def open_spider(self, spider):
              settings = get_project_settings()
              self.host = settings['DB_HOST']
              self.port = settings['DB_PORT']
              self.user = settings['DB_USER']
              self.password = settings['DB_PASSWORD']
              self.name = settings['DB_NAME']
              self.charset = settings['DB_CHARSET']

              self.connect()

          def connect(self):
              self.conn = pymysql.connect(
                  host=self.host,
                  port=self.port,
                  user=self.user,
                  password=self.password,
                  db=self.name,
                  charset=self.charset
              )
              self.cursor = self.conn.cursor()

          def process_item(self, item, spider):
              sql = 'insert into book(name,src) values("{}","{}")'.format(item['name'], item['src'])
              self.cursor.execute(sql)
              self.conn.commit()
              return item

          def close_spider(self, spider):
              self.cursor.close()
              self.conn.close()
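
      The string-formatted SQL above is fine for a demo, but it breaks on names containing a double quote and is open to SQL injection. A safer sketch of process_item (same table and fields; not part of the original project) passes the values as parameters so pymysql escapes them:

      def process_item(self, item, spider):
          # %s placeholders: pymysql escapes the values, so quotes in names are safe
          sql = 'insert into book(name, src) values (%s, %s)'
          self.cursor.execute(sql, (item['name'], item['src']))
          self.conn.commit()
          return item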
  5. Run

    scrapy crawl read

7.6 百度翻译 (Baidu Translate): POST request

  1. Create the project

    scrapy startproject scrapy_fanyi
    cd ..../spiders
    scrapy genspider fanyi https://fanyi.baidu.com/sug
  2. Edit fanyi.py

    import scrapy
    import json


    class FanyiSpider(scrapy.Spider):
        name = "fanyi"
        allowed_domains = ["fanyi.baidu.com"]

        # A POST request without parameters is meaningless, so start_urls and the default parse are dropped
        # start_urls = ["https://fanyi.baidu.com/sug"]

        # def parse(self, response):
        #     pass
        def start_requests(self):
            url = 'https://fanyi.baidu.com/sug'

            data = {
                'kw': 'final'
            }
            yield scrapy.FormRequest(url=url, formdata=data, callback=self.parse_second)

        def parse_second(self, response):
            content = response.text
            # Printing content directly is unreadable, so decode the JSON
            obj = json.loads(content)
            print(obj)
  3. Run it

    scrapy crawl fanyi