Scrapy is an application framework written for crawling websites and extracting structured data.

1. Installation

pip install scrapy
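
To confirm the install worked, the CLI can print the installed version:

scrapy version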

2. Usage

  1. Create a scrapy project

    cmd: scrapy startproject <project-name>
  2. Create the spider file

    • cd into the spiders folder and create the spider file there

    • Generate the spider file

      • The URL is usually given without the http:// scheme

      • # scrapy genspider <spider-name> <page-to-crawl>
        scrapy genspider baidu www.baidu.com
        

        - The generated file:

        ```python
        import scrapy


        class BaiduSpider(scrapy.Spider):
            # Name of the spider: the value used when running it
            name = "baidu"
            # Domains the spider is allowed to visit
            allowed_domains = ["www.baidu.com"]
            # Start URLs: the first addresses requested
            # start_urls is allowed_domains with http:// added in front and / appended at the end
            start_urls = ["http://www.baidu.com"]

            # Runs after the start_urls have been requested; response is the final returned object,
            # roughly equivalent to response = urllib.request.urlopen()
            # or response = requests.get()
            def parse(self, response):
                pass
        ```
  3. Run the spider

    • scrapy crawl <spider-name>

    • scrapy crawl baidu
      

      - The robots.txt "gentlemen's agreement": https://www.baidu.com/robots.txt

      ```python
      # Comment out ROBOTSTXT_OBEY = True in settings.py
      ```

3. response attributes and methods

  1. response.text: the response body as a string
  2. response.body: the response body as bytes
  3. response.xpath: run an XPath query directly against the response
  4. .extract(): extract the data values of the Selector objects returned by xpath (called on the SelectorList)
  5. .extract_first(): get the first item of the Selector list
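
A minimal sketch tying these calls together (a hypothetical DemoSpider reusing the Baidu start URL from section 2; the //title query is only illustrative):

```python
import scrapy


class DemoSpider(scrapy.Spider):
    name = "demo"
    start_urls = ["http://www.baidu.com"]

    def parse(self, response):
        html_text = response.text    # the response as a str
        html_bytes = response.body   # the response as bytes

        # xpath() returns a SelectorList
        titles = response.xpath('//title/text()')

        print(titles.extract())        # data values of every matched selector
        print(titles.extract_first())  # data value of the first one, or None if nothing matched
```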

4. scrapy shell

An interactive shell for spiders. It is convenient for debugging: instead of running scrapy crawl <name> every time, you press Enter in the console and immediately get the data you want to test.

4.1 Usage

Install ipython

pip install ipython

The ipython terminal provides smart auto-completion, highlighted output, and other features.

Running it

Open cmd and type scrapy shell …; once the output finishes it drops into ipython automatically.

  1. scrapy shell www.baidu.com
  2. scrapy shell http://www.baidu.com
  3. scrapy shell 'http://www.baidu.com'
  4. scrapy shell 'www.baidu.com'

Syntax

The response object

  1. response.body
  2. response.text
  3. response.url
  4. response.status

Parsing the response

  1. response.xpath() [most commonly used]
    • Query elements with an XPath expression; returns a SelectorList
  2. response.css()
    • Query elements with a CSS selector; returns a SelectorList
    • Get text content: response.css('#su::text').extract_first()
    • Get an attribute: response.css('#su::attr(value)').extract_first()
  3. The Selector object (what xpath() returns is a SelectorList)
    • extract()
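
A short shell session exercising these calls (run against Baidu as above; it assumes the page still has its `<input id="su">` search button):

```python
# scrapy shell www.baidu.com
response.status                                              # e.g. 200
response.url                                                 # final URL after any redirects
response.xpath('//input[@id="su"]/@value').extract_first()   # button label via XPath
response.css('#su::attr(value)').extract_first()             # same value via CSS
```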

5. CrawlSpider

Key point: this feature is for extracting the pagination links at the bottom of a page, not the entry links of individual items; those can be grabbed with a plain XPath query.

Example: 读书网 (dushu.com)

  1. It inherits from scrapy.Spider
  2. CrawlSpider lets you define rules for parsing the HTML and extracting the links that match them. If you need to follow links, i.e. after crawling a page you extract its links and crawl those as well, CrawlSpider is a very good fit.

scrapy shell example

# Start scrapy shell
scrapy shell https://www.dushu.com/book/1188.html
# Import LinkExtractor
from scrapy.linkextractors import LinkExtractor
# allow specifies the link pattern (a regex)
link = LinkExtractor(allow=r'/book/1188_\d+\.html')
# restrict_xpaths variant
# link1 = LinkExtractor(restrict_xpaths=r'//div[@class="pages"]/a/@href')
# restrict_css variant
# link2 = LinkExtractor(restrict_css='.x')
# Get the URLs matching the rule
link.extract_links(response)

5.1 Link extraction rules

scrapy.linkextractors.LinkExtractor(
    allow = (),
    deny = (),
    restrict_xpaths = (),
    restrict_css = ()
)

5.2 Writing a rule

links = LinkExtractor(allow=(r''))

5.3 Extracting the links

link.extract_links(res)
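
Inside a CrawlSpider these pieces are wired together through Rule. A minimal sketch reusing the pagination pattern from the shell example above (the full project appears in example 7.5 below):

```python
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


class PagesSpider(CrawlSpider):
    name = "pages"
    allowed_domains = ["www.dushu.com"]
    start_urls = ["https://www.dushu.com/book/1188_1.html"]

    # Follow every pagination link that matches the regex and hand each page to parse_item
    rules = (
        Rule(LinkExtractor(allow=r"/book/1188_\d+\.html"), callback="parse_item", follow=True),
    )

    def parse_item(self, response):
        # parse each listing page here
        pass
```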

6. Logging

Log levels:

  1. CRITICAL: severe errors
  2. ERROR: ordinary errors
  3. WARNING: warnings
  4. INFO: general information
  5. DEBUG: debugging information

The default is DEBUG.

settings.py configuration

LOG_FILE = 'xxx.log'  # must have a .log suffix

LOG_LEVEL = '<level>'  # not recommended to set; once LOG_FILE is configured, console output changes to the WARNING level
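
A minimal settings.py sketch (the filename is only an example) that routes the crawl log to a file:

```python
# settings.py
# Write the full crawl log to a file instead of the console
LOG_FILE = 'spider.log'
# Optionally raise the threshold so only warnings and above are recorded
# LOG_LEVEL = 'WARNING'
```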

7. Examples

7.1 58同城 (58.com)

Setup

scrapy startproject scrapy_58tc
cd ...../spiders
scrapy genspider tc https://cn.58.com/sou/?key=%E5%89%8D%E7%AB%AF%E5%BC%80%E5%8F%91

Code

import scrapy

# Grab one simple piece of data, mainly to get familiar with the response API
class TcSpider(scrapy.Spider):
    name = "tc"
    allowed_domains = ["cn.58.com"]
    start_urls = ["https://cn.58.com/sou/?key=%E5%89%8D%E7%AB%AF%E5%BC%80%E5%8F%91"]

    def parse(self, response):
        # as a string
        content = response.text
        # as bytes (this overwrites content)
        content = response.body
        # XPath syntax: response.xpath
        span = response.xpath('//div[@id="filter"]/div[@class="tabs"]/a[@class="select"]/span')[0]
        # span == <Selector query='//div[@id="filter"]/div[@class="tabs"]/a[@class="select"]/span' data='<span>全部</span>'>
        # Extract the selector's data value: <span>全部</span>
        # (a single Selector uses extract(); extract_first() belongs to SelectorList)
        print(span.extract())

Run

scrapy crawl tc

7.2 汽车之家 (Autohome)

Setup

scrapy startproject scrapy_carhome
cd ......./spiders
scrapy genspider car https://car.autohome.com.cn/price/brand-15.html

Code

import scrapy

# Grab the model names and prices from this page
class CarSpider(scrapy.Spider):
    name = "car"
    allowed_domains = ["car.autohome.com.cn"]
    start_urls = ["https://car.autohome.com.cn/price/brand-15.html"]

    def parse(self, response):
        print('=================================')
        name_list = response.xpath('//div[@class="tab-content fn-visible"]/div[@id="brandtab-1"]//div[@class="list-cont"]//div[@class="list-cont-main"]/div[@class="main-title"]/a/text()')
        price_list = response.xpath('//div[@class="main-lever"]//span[@class="font-arial"]/text()')
        for i in range(len(name_list)):
            name = name_list[i].extract()
            price = price_list[i].extract()
            print(name, price)

7.3 当当网 (Dangdang)

Concepts used: yield, wrapping data in items, multiple pipelines, multi-page downloads

  1. Create the project

    scrapy startproject scrapy_dangdangnet
    cd ..../spiders
    scrapy genspider dang https://category.dangdang.com/cp01.01.02.00.00.00.html
  2. Project code

    • dang.py

      1. The class-level attributes base_url and page define the URL used for multi-page queries
      2. The first half of parse extracts the data on the current page
      3. The code after yield book implements the multi-page crawl
      import scrapy
      from scrapy_dangdangnet.items import ScrapyDangdangnetItem


      class DangSpider(scrapy.Spider):
          name = "dang"
          # For multi-page downloads allowed_domains may need a wider scope; normally just the bare domain
          allowed_domains = ["category.dangdang.com"]
          start_urls = ["https://category.dangdang.com/cp01.01.02.00.00.00.html"]

          base_url = 'https://category.dangdang.com/pg'
          page = 1

          def parse(self, response):
              # quick anti-crawling test
              # print('=======================')
              # pipelines: persist / download the data
              # items: define the data structure

              # src = //ul[@id="component_59"]/li/a/img/@src
              # alt = //ul[@id="component_59"]/li/a/img/@alt
              # price = //ul[@id="component_59"]/li/p[@class="price"]/span[@class="search_now_price"]/text()
              # Every Selector object can call xpath again
              li_list = response.xpath('//ul[@id="component_59"]/li')
              for li in li_list:
                  # lazy-loaded src
                  # src = li.xpath('./a/img/@src').extract_first()
                  src = li.xpath('./a/img/@data-original').extract_first()
                  # The first image's src is the real address; the rest are lazy-loaded via
                  # data-original, so data-original is None for the first image
                  if src:
                      src = src
                  else:
                      src = li.xpath('./a/img/@src').extract_first()
                  name = li.xpath('./a/img/@alt').extract_first()
                  price = li.xpath('./p[@class="price"]/span[@class="search_now_price"]/text()').extract_first()
                  # print(src, name, price)
                  # Import ScrapyDangdangnetItem from scrapy_dangdangnet.items to wrap the data
                  book = ScrapyDangdangnetItem(src=src, name=name, price=price)
                  # Hand each item to the pipelines as soon as it is built
                  yield book

              # Crawl further pages
              # https://category.dangdang.com/pg4-cp01.01.02.00.00.00.html
              if self.page < 100:
                  self.page += 1

                  url = self.base_url + str(self.page) + '-cp01.01.02.00.00.00.html'
                  # scrapy.Request is scrapy's GET request
                  # url is the request address
                  # callback is the function to run; note that parse must not be written with parentheses
                  yield scrapy.Request(url=url, callback=self.parse)
    • items.py

      1. The .py file defining the data structure of what is scraped; nothing special here
      # Define here the models for your scraped items
      #
      # See documentation in:
      # https://docs.scrapy.org/en/latest/topics/items.html

      import scrapy

      class ScrapyDangdangnetItem(scrapy.Item):
          # define the fields for your item here like:
          # name = scrapy.Field()
          # In plain terms: what fields the scraped data has
          # image
          src = scrapy.Field()
          # name
          name = scrapy.Field()
          # price
          price = scrapy.Field()
    • pipelines.py

      1. The pipeline that writes the output file
      2. The open_spider/close_spider(self, spider) methods are added by hand so the pipeline can read and write efficiently
      3. DangDangDownloadPipeline is a custom second pipeline; it uses urllib.request to download the images locally
      # Define your item pipelines here
      #
      # Don't forget to add your pipeline to the ITEM_PIPELINES setting
      # See: https://docs.scrapy.org/en/latest/topics/item-pipeline.html


      # useful for handling different item types with a single interface
      from itemadapter import ItemAdapter


      # To use a pipeline it must be enabled (uncommented) in settings.py:
      # ITEM_PIPELINES = {
      #     "scrapy_dangdangnet.pipelines.ScrapyDangdangnetPipeline": 300,
      # }
      class ScrapyDangdangnetPipeline:
          # Runs before the spider starts
          def open_spider(self, spider):
              self.fp = open('book.json', 'w', encoding='utf-8')

          # item is the book object handed over by yield
          def process_item(self, item, spider):
              # The pattern below is not recommended: it opens the file far too often

              # write() needs a str, not an object
              # 'w' mode would overwrite the data on every call
              # with open('book.json', 'a', encoding='utf-8') as fp:
              #     fp.write(str(item))

              self.fp.write(str(item))
              return item

          # Runs after the spider finishes
          def close_spider(self, spider):
              self.fp.close()


      import urllib.request


      # A second pipeline; enable it in settings.py:
      # "scrapy_dangdangnet.pipelines.DangDangDownloadPipeline": 301,
      class DangDangDownloadPipeline:
          def process_item(self, item, spider):
              url = 'http:' + item.get('src')
              filename = './books/' + item.get('name') + '.jpg'

              urllib.request.urlretrieve(url=url, filename=filename)
              return item

    • settings.py

      1. The project's configuration file

      2. Commenting out ROBOTSTXT_OBEY = True means ignoring the robots.txt "gentlemen's agreement" and crawling anyway

      3. Enable the pipelines:

        ITEM_PIPELINES = {
            # The value is the priority, in the range 1-1000; the smaller the value, the higher the priority
            "scrapy_dangdangnet.pipelines.ScrapyDangdangnetPipeline": 300,
            # DangDangDownloadPipeline enables the second pipeline
            "scrapy_dangdangnet.pipelines.DangDangDownloadPipeline": 301,
        }

      # Scrapy settings for scrapy_dangdangnet project
      #
      # For simplicity, this file contains only settings considered important or
      # commonly used. You can find more settings consulting the documentation:
      #
      # https://docs.scrapy.org/en/latest/topics/settings.html
      # https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
      # https://docs.scrapy.org/en/latest/topics/spider-middleware.html

      BOT_NAME = "scrapy_dangdangnet"

      SPIDER_MODULES = ["scrapy_dangdangnet.spiders"]
      NEWSPIDER_MODULE = "scrapy_dangdangnet.spiders"

      # Crawl responsibly by identifying yourself (and your website) on the user-agent
      # USER_AGENT = "scrapy_dangdangnet (+http://www.yourdomain.com)"

      # Obey robots.txt rules
      ROBOTSTXT_OBEY = True

      # Configure maximum concurrent requests performed by Scrapy (default: 16)
      # CONCURRENT_REQUESTS = 32

      # Configure a delay for requests for the same website (default: 0)
      # See https://docs.scrapy.org/en/latest/topics/settings.html#download-delay
      # See also autothrottle settings and docs
      # DOWNLOAD_DELAY = 3
      # The download delay setting will honor only one of:
      # CONCURRENT_REQUESTS_PER_DOMAIN = 16
      # CONCURRENT_REQUESTS_PER_IP = 16

      # Disable cookies (enabled by default)
      # COOKIES_ENABLED = False

      # Disable Telnet Console (enabled by default)
      # TELNETCONSOLE_ENABLED = False

      # Override the default request headers:
      # DEFAULT_REQUEST_HEADERS = {
      # "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
      # "Accept-Language": "en",
      # }

      # Enable or disable spider middlewares
      # See https://docs.scrapy.org/en/latest/topics/spider-middleware.html
      # SPIDER_MIDDLEWARES = {
      # "scrapy_dangdangnet.middlewares.ScrapyDangdangnetSpiderMiddleware": 543,
      # }

      # Enable or disable downloader middlewares
      # See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
      # DOWNLOADER_MIDDLEWARES = {
      # "scrapy_dangdangnet.middlewares.ScrapyDangdangnetDownloaderMiddleware": 543,
      # }

      # Enable or disable extensions
      # See https://docs.scrapy.org/en/latest/topics/extensions.html
      # EXTENSIONS = {
      # "scrapy.extensions.telnet.TelnetConsole": None,
      # }

      # Configure item pipelines
      # See https://docs.scrapy.org/en/latest/topics/item-pipeline.html
      ITEM_PIPELINES = {
          # The value is the priority, 1-1000; the smaller the value, the higher the priority
          "scrapy_dangdangnet.pipelines.ScrapyDangdangnetPipeline": 300,
          # DangDangDownloadPipeline: the second pipeline
          "scrapy_dangdangnet.pipelines.DangDangDownloadPipeline": 301,
      }

      # Enable and configure the AutoThrottle extension (disabled by default)
      # See https://docs.scrapy.org/en/latest/topics/autothrottle.html
      # AUTOTHROTTLE_ENABLED = True
      # The initial download delay
      # AUTOTHROTTLE_START_DELAY = 5
      # The maximum download delay to be set in case of high latencies
      # AUTOTHROTTLE_MAX_DELAY = 60
      # The average number of requests Scrapy should be sending in parallel to
      # each remote server
      # AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
      # Enable showing throttling stats for every response received:
      # AUTOTHROTTLE_DEBUG = False

      # Enable and configure HTTP caching (disabled by default)
      # See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
      # HTTPCACHE_ENABLED = True
      # HTTPCACHE_EXPIRATION_SECS = 0
      # HTTPCACHE_DIR = "httpcache"
      # HTTPCACHE_IGNORE_HTTP_CODES = []
      # HTTPCACHE_STORAGE = "scrapy.extensions.httpcache.FilesystemCacheStorage"

      # Set settings whose default value is deprecated to a future-proof value
      REQUEST_FINGERPRINTER_IMPLEMENTATION = "2.7"
      TWISTED_REACTOR = "twisted.internet.asyncioreactor.AsyncioSelectorReactor"
      FEED_EXPORT_ENCODING = "utf-8"

  3. Run the project

    scrapy crawl dang

7.4 电影天堂 (dy.dytt8.net)

Concept used: one item holding data from multiple page levels

  1. Create the project

    scrapy startproject scrapy_movie
    cd ......./spiders
    scrapy genspider mv https://dy.dytt8.net/html/gndy/dyzz/index.html
  2. Project code

    • mv.py

      import scrapy
      from scrapy_movie.items import ScrapyMovieItem


      class MvSpider(scrapy.Spider):
          name = "mv"
          allowed_domains = ["dy.dytt8.net"]
          start_urls = ["https://dy.dytt8.net/html/gndy/dyzz/index.html"]

          def parse(self, response):
              print('=========running==============')
              # From the first level, grab the name and the href leading to the second level
              # href = //div[@class="co_content8"]/ul/table[@class="tbspan"]//td/b/a/@href
              # name = //div[@class="co_content8"]/ul/table[@class="tbspan"]//td/b/a/text()
              a_list = response.xpath('//div[@class="co_content8"]/ul//table[@class="tbspan"]//a')
              for a in a_list:
                  # name on the first page and the link to follow
                  name = a.xpath('./text()').extract_first()
                  href = a.xpath('./@href').extract_first()

                  # address of the second-level page
                  url = 'https://dy.dytt8.net' + href
                  # go to the second-level page, passing name along via meta
                  yield scrapy.Request(url=url, callback=self.parse_second, meta={'name': name})

          def parse_second(self, response):
              # the span tag cannot be matched here
              # tags such as tbody cannot be matched either
              # if no data comes back, double-check the xpath expression
              src = response.xpath('//div[@id="Zoom"]//img/@src').extract_first()
              name = response.meta['name']
              movie = ScrapyMovieItem(name=name, src=src)
              yield movie
    • items.py

      # Define here the models for your scraped items
      #
      # See documentation in:
      # https://docs.scrapy.org/en/latest/topics/items.html

      import scrapy


      class ScrapyMovieItem(scrapy.Item):
          # define the fields for your item here like:
          # name = scrapy.Field()
          # movie name
          name = scrapy.Field()
          # movie poster
          src = scrapy.Field()
    • pipelines.py

      # Define your item pipelines here
      #
      # Don't forget to add your pipeline to the ITEM_PIPELINES setting
      # See: https://docs.scrapy.org/en/latest/topics/item-pipeline.html


      # useful for handling different item types with a single interface
      from itemadapter import ItemAdapter


      class ScrapyMoviePipeline:

          def open_spider(self, spider):
              self.fp = open('movie.json', 'w', encoding='utf-8')
              self.fp.write('[')

          def process_item(self, item, spider):
              self.fp.write(str(item))
              self.fp.write(',')
              return item

          def close_spider(self, spider):
              self.fp.write(']')
              self.fp.close()
    • settings.py

      # Scrapy settings for scrapy_movie project
      #
      # For simplicity, this file contains only settings considered important or
      # commonly used. You can find more settings consulting the documentation:
      #
      # https://docs.scrapy.org/en/latest/topics/settings.html
      # https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
      # https://docs.scrapy.org/en/latest/topics/spider-middleware.html

      BOT_NAME = "scrapy_movie"

      SPIDER_MODULES = ["scrapy_movie.spiders"]
      NEWSPIDER_MODULE = "scrapy_movie.spiders"


      # Crawl responsibly by identifying yourself (and your website) on the user-agent
      #USER_AGENT = "scrapy_movie (+http://www.yourdomain.com)"

      # Obey robots.txt rules
      # ROBOTSTXT_OBEY = True

      # Configure maximum concurrent requests performed by Scrapy (default: 16)
      #CONCURRENT_REQUESTS = 32

      # Configure a delay for requests for the same website (default: 0)
      # See https://docs.scrapy.org/en/latest/topics/settings.html#download-delay
      # See also autothrottle settings and docs
      #DOWNLOAD_DELAY = 3
      # The download delay setting will honor only one of:
      #CONCURRENT_REQUESTS_PER_DOMAIN = 16
      #CONCURRENT_REQUESTS_PER_IP = 16

      # Disable cookies (enabled by default)
      #COOKIES_ENABLED = False

      # Disable Telnet Console (enabled by default)
      #TELNETCONSOLE_ENABLED = False

      # Override the default request headers:
      #DEFAULT_REQUEST_HEADERS = {
      # "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
      # "Accept-Language": "en",
      #}

      # Enable or disable spider middlewares
      # See https://docs.scrapy.org/en/latest/topics/spider-middleware.html
      #SPIDER_MIDDLEWARES = {
      # "scrapy_movie.middlewares.ScrapyMovieSpiderMiddleware": 543,
      #}

      # Enable or disable downloader middlewares
      # See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
      #DOWNLOADER_MIDDLEWARES = {
      # "scrapy_movie.middlewares.ScrapyMovieDownloaderMiddleware": 543,
      #}

      # Enable or disable extensions
      # See https://docs.scrapy.org/en/latest/topics/extensions.html
      #EXTENSIONS = {
      # "scrapy.extensions.telnet.TelnetConsole": None,
      #}

      # Configure item pipelines
      # See https://docs.scrapy.org/en/latest/topics/item-pipeline.html
      ITEM_PIPELINES = {
          "scrapy_movie.pipelines.ScrapyMoviePipeline": 300,
      }

      # Enable and configure the AutoThrottle extension (disabled by default)
      # See https://docs.scrapy.org/en/latest/topics/autothrottle.html
      #AUTOTHROTTLE_ENABLED = True
      # The initial download delay
      #AUTOTHROTTLE_START_DELAY = 5
      # The maximum download delay to be set in case of high latencies
      #AUTOTHROTTLE_MAX_DELAY = 60
      # The average number of requests Scrapy should be sending in parallel to
      # each remote server
      #AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
      # Enable showing throttling stats for every response received:
      #AUTOTHROTTLE_DEBUG = False

      # Enable and configure HTTP caching (disabled by default)
      # See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
      #HTTPCACHE_ENABLED = True
      #HTTPCACHE_EXPIRATION_SECS = 0
      #HTTPCACHE_DIR = "httpcache"
      #HTTPCACHE_IGNORE_HTTP_CODES = []
      #HTTPCACHE_STORAGE = "scrapy.extensions.httpcache.FilesystemCacheStorage"

      # Set settings whose default value is deprecated to a future-proof value
      REQUEST_FINGERPRINTER_IMPLEMENTATION = "2.7"
      TWISTED_REACTOR = "twisted.internet.asyncioreactor.AsyncioSelectorReactor"
      FEED_EXPORT_ENCODING = "utf-8"

  3. Run the project

    scrapy crawl mv

7.5 读书网 (dushu.com): saving the data to a database

  1. Create the project

    scrapy startproject scrapy_dushu
    cd ..../spiders
    # Generate the crawler class using the crawl template
    scrapy genspider -t crawl read https://www.dushu.com/book/1188.html
  2. Differences from before

    # Generate the crawler class using the crawl template
    scrapy genspider -t crawl read https://www.dushu.com/book/1188.html
    # The base class is different
    class ReadSpider(CrawlSpider)
    # New: the rules attribute
    rules = (Rule(LinkExtractor(allow=r"Items/"), callback="parse_item", follow=True),)
  3. Grabbing the data straight into JSON

    1. read.py

      import scrapy
      from scrapy.linkextractors import LinkExtractor
      from scrapy.spiders import CrawlSpider, Rule

      from scrapy_dushu.items import ScrapyDushuItem


      class ReadSpider(CrawlSpider):
          name = "read"
          allowed_domains = ["www.dushu.com"]
          # CrawlSpider pitfall: this start URL is not matched by the link rule, so that page would be skipped without being parsed
          # start_urls = ["https://www.dushu.com/book/1188.html"]
          # workaround: start from the _1 page, which the rule does match
          start_urls = ["https://www.dushu.com/book/1188_1.html"]
          # allow matches the pagination pattern
          # follow keeps tracking page links: even if only pages 1-13 are visible now, follow=True will reach all later pages
          rules = (Rule(LinkExtractor(allow=r"/book/1188_\d+\.html"), callback="parse_item", follow=True),)

          def parse_item(self, response):
              # print('----------------------------------------------------------')
              img_list = response.xpath('//div[@class="bookslist"]//img')

              for img in img_list:
                  name = img.xpath('./@alt').extract_first()
                  src = img.xpath('./@data-original').extract_first()
                  print(name, src)
                  book = ScrapyDushuItem(name=name, src=src)
                  yield book

    2. settings.py

      # Scrapy settings for scrapy_dushu project
      #
      # For simplicity, this file contains only settings considered important or
      # commonly used. You can find more settings consulting the documentation:
      #
      # https://docs.scrapy.org/en/latest/topics/settings.html
      # https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
      # https://docs.scrapy.org/en/latest/topics/spider-middleware.html

      BOT_NAME = "scrapy_dushu"

      SPIDER_MODULES = ["scrapy_dushu.spiders"]
      NEWSPIDER_MODULE = "scrapy_dushu.spiders"

      # Crawl responsibly by identifying yourself (and your website) on the user-agent
      # USER_AGENT = "scrapy_dushu (+http://www.yourdomain.com)"

      # Obey robots.txt rules
      # ROBOTSTXT_OBEY = True

      # Configure maximum concurrent requests performed by Scrapy (default: 16)
      # CONCURRENT_REQUESTS = 32

      # Configure a delay for requests for the same website (default: 0)
      # See https://docs.scrapy.org/en/latest/topics/settings.html#download-delay
      # See also autothrottle settings and docs
      # DOWNLOAD_DELAY = 3
      # The download delay setting will honor only one of:
      # CONCURRENT_REQUESTS_PER_DOMAIN = 16
      # CONCURRENT_REQUESTS_PER_IP = 16

      # Disable cookies (enabled by default)
      # COOKIES_ENABLED = False

      # Disable Telnet Console (enabled by default)
      # TELNETCONSOLE_ENABLED = False

      # Override the default request headers:
      # DEFAULT_REQUEST_HEADERS = {
      # "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
      # "Accept-Language": "en",
      # }

      # Enable or disable spider middlewares
      # See https://docs.scrapy.org/en/latest/topics/spider-middleware.html
      # SPIDER_MIDDLEWARES = {
      # "scrapy_dushu.middlewares.ScrapyDushuSpiderMiddleware": 543,
      # }

      # Enable or disable downloader middlewares
      # See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
      # DOWNLOADER_MIDDLEWARES = {
      # "scrapy_dushu.middlewares.ScrapyDushuDownloaderMiddleware": 543,
      # }

      # Enable or disable extensions
      # See https://docs.scrapy.org/en/latest/topics/extensions.html
      # EXTENSIONS = {
      # "scrapy.extensions.telnet.TelnetConsole": None,
      # }

      # Configure item pipelines
      # See https://docs.scrapy.org/en/latest/topics/item-pipeline.html
      ITEM_PIPELINES = {
          "scrapy_dushu.pipelines.ScrapyDushuPipeline": 300,
      }

      # Enable and configure the AutoThrottle extension (disabled by default)
      # See https://docs.scrapy.org/en/latest/topics/autothrottle.html
      # AUTOTHROTTLE_ENABLED = True
      # The initial download delay
      # AUTOTHROTTLE_START_DELAY = 5
      # The maximum download delay to be set in case of high latencies
      # AUTOTHROTTLE_MAX_DELAY = 60
      # The average number of requests Scrapy should be sending in parallel to
      # each remote server
      # AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
      # Enable showing throttling stats for every response received:
      # AUTOTHROTTLE_DEBUG = False

      # Enable and configure HTTP caching (disabled by default)
      # See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
      # HTTPCACHE_ENABLED = True
      # HTTPCACHE_EXPIRATION_SECS = 0
      # HTTPCACHE_DIR = "httpcache"
      # HTTPCACHE_IGNORE_HTTP_CODES = []
      # HTTPCACHE_STORAGE = "scrapy.extensions.httpcache.FilesystemCacheStorage"

      # Set settings whose default value is deprecated to a future-proof value
      REQUEST_FINGERPRINTER_IMPLEMENTATION = "2.7"
      TWISTED_REACTOR = "twisted.internet.asyncioreactor.AsyncioSelectorReactor"
      FEED_EXPORT_ENCODING = "utf-8"

    3. pipelines.py

      # Define your item pipelines here
      #
      # Don't forget to add your pipeline to the ITEM_PIPELINES setting
      # See: https://docs.scrapy.org/en/latest/topics/item-pipeline.html


      # useful for handling different item types with a single interface
      from itemadapter import ItemAdapter


      class ScrapyDushuPipeline:
          def open_spider(self, spider):
              self.fp = open('book.json', 'w', encoding='utf-8')

          def process_item(self, item, spider):
              self.fp.write(str(item))
              return item

          def close_spider(self, spider):
              self.fp.close()

    4. items.py

      # Define here the models for your scraped items
      #
      # See documentation in:
      # https://docs.scrapy.org/en/latest/topics/items.html

      import scrapy


      class ScrapyDushuItem(scrapy.Item):
          # define the fields for your item here like:
          # name
          name = scrapy.Field()
          # image path
          src = scrapy.Field()

  4. Saving the data to the database

    1. Create the table

      CREATE TABLE `book` (
          `id` int(11) NOT NULL AUTO_INCREMENT,
          `name` varchar(255) DEFAULT NULL,
          `src` varchar(255) DEFAULT NULL,
          PRIMARY KEY (`id`)
      ) ENGINE=InnoDB DEFAULT CHARSET=utf8;
    2. Configure the database in settings.py

      # Database connection settings
      DB_HOST = '127.0.0.1'
      DB_PORT = 3306
      DB_USER = 'root'
      DB_PASSWORD = 'root'
      DB_NAME = 'dushuwang'
      # Do not put a '-' in the charset: it is 'utf8', not 'utf-8'
      DB_CHARSET = 'utf8'
    3. pipelines.py

      # Add a new pipeline
      class MysqlPipeline:
          def open_spider(self, spider):
              self.fp = open('book.json', 'w', encoding='utf-8')

          def process_item(self, item, spider):
              self.fp.write(str(item))
              return item

          def close_spider(self, spider):
              self.fp.close()
    4. Register the pipeline in settings.py

      ITEM_PIPELINES = {
          "scrapy_dushu.pipelines.ScrapyDushuPipeline": 300,
          # MysqlPipeline
          "scrapy_dushu.pipelines.MysqlPipeline": 301,
      }
    5. In pipelines.py, before MysqlPipeline, add the import that loads the settings.py parameters

      # Load the settings.py file
      from scrapy.utils.project import get_project_settings
    6. Install pymysql

      # used to connect to the database
      pip install pymysql
    7. Update the MysqlPipeline class in pipelines.py

      # Import pymysql
      import pymysql


      class MysqlPipeline:
          def open_spider(self, spider):
              settings = get_project_settings()
              self.host = settings['DB_HOST']
              self.port = settings['DB_PORT']
              self.user = settings['DB_USER']
              self.password = settings['DB_PASSWORD']
              self.name = settings['DB_NAME']
              self.charset = settings['DB_CHARSET']

              self.connect()

          def connect(self):
              self.conn = pymysql.connect(
                  host=self.host,
                  port=self.port,
                  user=self.user,
                  password=self.password,
                  db=self.name,
                  charset=self.charset
              )
              self.cursor = self.conn.cursor()

          def process_item(self, item, spider):
              sql = 'insert into book(name,src) values("{}","{}")'.format(item['name'], item['src'])
              self.cursor.execute(sql)
              self.conn.commit()
              return item

          def close_spider(self, spider):
              self.cursor.close()
              self.conn.close()
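
      The string-formatted SQL above is fine for a demo, but it breaks on names containing a double quote and is open to SQL injection. A safer sketch of process_item (same table and fields; not part of the original project) passes the values as parameters so pymysql escapes them:

      def process_item(self, item, spider):
          # %s placeholders: pymysql escapes the values, so quotes in names are safe
          sql = 'insert into book(name, src) values (%s, %s)'
          self.cursor.execute(sql, (item['name'], item['src']))
          self.conn.commit()
          return item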
  5. Run

    scrapy crawl read

7.6 百度翻译 (Baidu Translate): POST request

  1. Create the project

    scrapy startproject scrapy_fanyi
    cd ..../spiders
    scrapy genspider fanyi https://fanyi.baidu.com/sug
  2. Edit fanyi.py

    import scrapy
    import json


    class FanyiSpider(scrapy.Spider):
        name = "fanyi"
        allowed_domains = ["fanyi.baidu.com"]

        # A POST request without parameters is meaningless, so start_urls and the default parse are dropped
        # start_urls = ["https://fanyi.baidu.com/sug"]

        # def parse(self, response):
        #     pass
        def start_requests(self):
            url = 'https://fanyi.baidu.com/sug'

            data = {
                'kw': 'final'
            }
            yield scrapy.FormRequest(url=url, formdata=data, callback=self.parse_second)

        def parse_second(self, response):
            content = response.text
            # Printing content directly is unreadable, so decode the JSON
            obj = json.loads(content)
            print(obj)
  3. Run it

    scrapy crawl fanyi