scrapy
An application framework written for crawling websites and extracting structured data
1. Installation
pip install scrapy
2. Usage
Create a scrapy project
cmd: scrapy startproject <project-name>
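For orientation, this is roughly the directory tree that scrapy startproject generates (shown for a hypothetical project named myproject); the items.py, pipelines.py and settings.py files referenced throughout the case studies below live here:
```
myproject/
    scrapy.cfg            # deploy configuration
    myproject/
        __init__.py
        items.py          # item (data structure) definitions
        middlewares.py    # spider / downloader middlewares
        pipelines.py      # item pipelines
        settings.py       # project settings
        spiders/          # spider files are created in here
            __init__.py
```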
Create a spider file
Go into the spiders folder and create the spider file there
The target URL is usually written without the http:// scheme
# scrapy genspider <spider-name> <site-to-crawl>
scrapy genspider baidu www.baidu.com
- The generated spider file:
```python
import scrapy


class BaiduSpider(scrapy.Spider):
    # name of the spider, used when running it
    name = "baidu"
    # domains the spider is allowed to visit
    allowed_domains = ["www.baidu.com"]
    # initial urls, the first addresses requested
    # start_urls prepends http:// to allowed_domains and appends a trailing /
    start_urls = ["http://www.baidu.com"]

    # called after the start_urls have been fetched; response is the final response object
    # roughly equivalent to response = urllib.request.urlopen()
    # or response = requests.get()
    def parse(self, response):
        pass
```
Run the spider
scrapy crawl <spider-name>
scrapy crawl baidu
- robots.txt (the "gentlemen's agreement"): https://www.baidu.com/robots.txt
```python
# comment out ROBOTSTXT_OBEY = True in settings.py to ignore it
```
3. response attributes and methods
- response.text extracts the response as a string
- response.body extracts the raw binary data
- response.xpath() parses the response content directly with an XPath expression
- response.xpath(...).extract() extracts the data attribute values of the returned selector objects
- response.xpath(...).extract_first() gets the first item of the selector list
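A minimal parse() sketch tying these together; the XPath and the su id are assumptions based on Baidu's homepage, purely for illustration:
```python
import scrapy


class BaiduSpider(scrapy.Spider):
    name = "baidu"
    allowed_domains = ["www.baidu.com"]
    start_urls = ["http://www.baidu.com"]

    def parse(self, response):
        text = response.text   # whole response as a str
        raw = response.body    # whole response as bytes
        # xpath returns a selector list; extract_first() takes the first data value
        button = response.xpath('//input[@id="su"]/@value').extract_first()
        print(len(text), len(raw), button)
```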
四、scrapy shell
spider交互终端,对于业务调试比较方便,不需要每次去执行scrapy crawl name,只需要在控制台回车即可得到想要测试的数据
4.1 使用
安装ipython
pip install ipython |
ipython终端提供只能自动补全,高亮输出及其他特性
应用
直接进入cmd后输入scrapy shell …回车输出结束后会自动进入ipython
- scrapy shell www.baidu.com
- scrapy shell http://www.baidu.com
- scrapy shell ‘http://www.baidu.com‘
- scrapy shell ‘www.baidu.com‘
Syntax
The response object
- response.body
- response.text
- response.url
- response.status
Parsing the response
- response.xpath() [most common]
  - queries elements with an XPath expression; returns a selector list
- response.css()
  - queries elements with a CSS selector; returns a selector list
  - get text content: response.css('#su::text').extract_first()
  - get an attribute: response.css('#su::attr(value)').extract_first()
- selector objects (what the xpath method returns is a selector list)
  - extract()
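A short ipython session showing those calls; the su id is Baidu's search button, used here as an assumed example (actual output may differ):
```python
# scrapy shell www.baidu.com
>>> response.status
200
>>> response.xpath('//input[@id="su"]/@value').extract_first()
'百度一下'
>>> response.css('#su::attr(value)').extract_first()
'百度一下'
```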
5. CrawlSpider
Key point: this feature is for extracting things like the page-number links at the bottom of a listing page, not the entry link of one specific item (those you can grab directly with XPath)
Case study: dushu.com
- inherits from scrapy.Spider
- CrawlSpider lets you define rules for parsing the html and extracts the links that match those rules; when you need to follow links, i.e. crawl a page, extract more links from it and crawl those too, CrawlSpider is a very good fit
scrapy shell example (see the sketch after this list)
# start scrapy shell
5.1 Link-extraction rules
scrapy.linkextractors.LinkExtractor(...)
5.2 Writing a rule
links = LinkExtractor(allow=(r''))
5.3 Extracting the links
links.extract_links(res)
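A minimal sketch of such a session, assuming a paginated listing page was fetched; the allow regex is an example pattern in the style of the dushu.com urls used later:
```python
# scrapy shell https://www.dushu.com/book/1188.html
from scrapy.linkextractors import LinkExtractor

# keep only links whose url matches the allow regex
links = LinkExtractor(allow=r'/book/1188_\d+\.html')
# extract_links takes a response object and returns a list of Link objects
for link in links.extract_links(response):
    print(link.url)
```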
6. Logging
Log levels:
- CRITICAL severe errors
- ERROR ordinary errors
- WARNING warnings
- INFO general information
- DEBUG debugging information
The default is DEBUG
Configured in settings.py:
LOG_FILE: 'xxx.log' // give it a .log suffix
LOG_LEVEL: '<level>' // usually not worth setting; once LOG_FILE is set, the console output effectively drops to warning level
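A minimal settings.py snippet reflecting the above (the file name is an example):
```python
# settings.py
# write the full log to a file instead of the console
LOG_FILE = 'spider.log'
# filter what gets logged at all; DEBUG is the default
LOG_LEVEL = 'WARNING'
```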
7. Case studies
7.1 58.com
Setup
scrapy startproject scrapy_58tc
Code
import scrapy
Run
scrapy crawl tc
7.2 Autohome
Setup
scrapy startproject scrapy_carhome
Code
import scrapy
7.3 Dangdang
Concepts used: yield, wrapping data for pipelines, multiple download pipelines, multi-page downloads
Create the project
scrapy startproject dangdangnet
cd ..../spiders
scrapy genspider dang https://category.dangdang.com/cp01.01.02.00.00.00.html
Project code
dang.py
- the class-level base_url and page define the url pattern for the multi-page crawl
- the first half of parse holds the extraction logic for the current page
- the code after yield book implements the multi-page crawl
```python
import scrapy
from scrapy_dangdangnet.items import ScrapyDangdangnetItem


class DangSpider(scrapy.Spider):
    name = "dang"
    # for multi-page downloads allowed_domains must cover every page; usually just the domain
    allowed_domains = ["category.dangdang.com"]
    start_urls = ["https://category.dangdang.com/cp01.01.02.00.00.00.html"]
    base_url = 'https://category.dangdang.com/pg'
    page = 1

    def parse(self, response):
        # anti-scraping test
        # print('=======================')
        # pipelines download the data
        # items define the data structure
        # src = //ul[@id="component_59"]/li/a/img/@src
        # alt = //ul[@id="component_59"]/li/a/img/@alt
        # price = //ul[@id="component_59"]/li/p[@class="price"]/span[@class="search_now_price"]/text()
        # every selector object can call xpath again
        li_list = response.xpath('//ul[@id="component_59"]/li')
        for li in li_list:
            # lazy-loaded images keep the real url in data-original
            src = li.xpath('./a/img/@data-original').extract_first()
            # the first image's src is the real address; the rest are lazy-loaded via
            # data-original, so fall back to src when data-original is None
            if not src:
                src = li.xpath('./a/img/@src').extract_first()
            name = li.xpath('./a/img/@alt').extract_first()
            price = li.xpath('./p[@class="price"]/span[@class="search_now_price"]/text()').extract_first()
            # print(src, name, price)
            # wrap the data with ScrapyDangdangnetItem from scrapy_dangdangnet.items
            book = ScrapyDangdangnetItem(src=src, name=name, price=price)
            # hand each item to the pipelines as soon as it is built
            yield book

        # crawl the remaining pages
        # https://category.dangdang.com/pg4-cp01.01.02.00.00.00.html
        if self.page < 100:
            self.page += 1
            url = self.base_url + str(self.page) + '-cp01.01.02.00.00.00.html'
            # scrapy.Request is scrapy's GET request
            # url is the address to request
            # callback is the function to run on the response; note: no parentheses after parse
            yield scrapy.Request(url=url, callback=self.parse)
```
items.py
- the py file defining the scraped data structure; self-explanatory
```python
# Define here the models for your scraped items
#
# See documentation in:
# https://docs.scrapy.org/en/latest/topics/items.html
import scrapy


class ScrapyDangdangnetItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    # in plain terms: what fields does the scraped data have
    # image
    src = scrapy.Field()
    # name
    name = scrapy.Field()
    # price
    price = scrapy.Field()
```
pipelines.py
- the pipeline output file
- the open_spider/close_spider(self, spider) methods are added by hand so the pipeline keeps one file handle open instead of reopening the file per item
- DangDangDownloadPipeline is a custom second pipeline; it pulls in urllib.request to download the images locally
```python
# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://docs.scrapy.org/en/latest/topics/item-pipeline.html

# useful for handling different item types with a single interface
from itemadapter import ItemAdapter
import urllib.request

# to use a pipeline, uncomment it in settings.py:
# ITEM_PIPELINES = {
#     "scrapy_dangdangnet.pipelines.ScrapyDangdangnetPipeline": 300,
# }


class ScrapyDangdangnetPipeline:
    # runs once before the spider starts
    def open_spider(self, spider):
        self.fp = open('book.json', 'w', encoding='utf-8')

    # item is the book object passed in by yield
    def process_item(self, item, spider):
        # the pattern below is not recommended: it reopens the file for every item
        # write() needs a str, not an object
        # and 'w' mode would overwrite the data each time
        # with open('book.json', 'a', encoding='utf-8') as fp:
        #     fp.write(str(item))
        self.fp.write(str(item))
        return item

    # runs once after the spider finishes
    def close_spider(self, spider):
        self.fp.close()


# enabling a second pipeline:
# add it to ITEM_PIPELINES in settings.py:
# "scrapy_dangdangnet.pipelines.DangDangDownloadPipeline": 301,
class DangDangDownloadPipeline:
    def process_item(self, item, spider):
        url = 'http:' + item.get('src')
        # note: the ./books/ directory must already exist
        filename = './books/' + item.get('name') + '.jpg'
        urllib.request.urlretrieve(url=url, filename=filename)
        return item
```
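As an alternative, Scrapy ships a built-in ImagesPipeline that handles downloading, retries and directory creation itself; a minimal sketch, assuming the item keeps its url in src, IMAGES_STORE is set in settings.py and Pillow is installed:
```python
import scrapy
from scrapy.pipelines.images import ImagesPipeline


# register in settings.py with e.g.
# IMAGES_STORE = './books'
# ITEM_PIPELINES = {"scrapy_dangdangnet.pipelines.DangDangImagesPipeline": 301}
class DangDangImagesPipeline(ImagesPipeline):
    def get_media_requests(self, item, info):
        # every request yielded here is downloaded into IMAGES_STORE
        yield scrapy.Request('http:' + item['src'])
```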
settings.py
The configuration file
Commenting out ROBOTSTXT_OBEY = True means tearing up the "gentlemen's agreement" and crawling anyway
Enable the pipelines
ITEM_PIPELINES = {  # the value is a priority, 1-1000; the smaller the value, the higher the priority
    "scrapy_dangdangnet.pipelines.ScrapyDangdangnetPipeline": 300,
    # DangDangDownloadPipeline enables the second pipeline
    "scrapy_dangdangnet.pipelines.DangDangDownloadPipeline": 301,
}
```python
# Scrapy settings for scrapy_dangdangnet project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
# https://docs.scrapy.org/en/latest/topics/settings.html
# https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
# https://docs.scrapy.org/en/latest/topics/spider-middleware.html
BOT_NAME = "scrapy_dangdangnet"
SPIDER_MODULES = ["scrapy_dangdangnet.spiders"]
NEWSPIDER_MODULE = "scrapy_dangdangnet.spiders"
# Crawl responsibly by identifying yourself (and your website) on the user-agent
# USER_AGENT = "scrapy_dangdangnet (+http://www.yourdomain.com)"
# Obey robots.txt rules
ROBOTSTXT_OBEY = True
# Configure maximum concurrent requests performed by Scrapy (default: 16)
# CONCURRENT_REQUESTS = 32
# Configure a delay for requests for the same website (default: 0)
# See https://docs.scrapy.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
# DOWNLOAD_DELAY = 3
# The download delay setting will honor only one of:
# CONCURRENT_REQUESTS_PER_DOMAIN = 16
# CONCURRENT_REQUESTS_PER_IP = 16
# Disable cookies (enabled by default)
# COOKIES_ENABLED = False
# Disable Telnet Console (enabled by default)
# TELNETCONSOLE_ENABLED = False
# Override the default request headers:
# DEFAULT_REQUEST_HEADERS = {
# "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
# "Accept-Language": "en",
# }
# Enable or disable spider middlewares
# See https://docs.scrapy.org/en/latest/topics/spider-middleware.html
# SPIDER_MIDDLEWARES = {
# "scrapy_dangdangnet.middlewares.ScrapyDangdangnetSpiderMiddleware": 543,
# }
# Enable or disable downloader middlewares
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
# DOWNLOADER_MIDDLEWARES = {
# "scrapy_dangdangnet.middlewares.ScrapyDangdangnetDownloaderMiddleware": 543,
# }
# Enable or disable extensions
# See https://docs.scrapy.org/en/latest/topics/extensions.html
# EXTENSIONS = {
# "scrapy.extensions.telnet.TelnetConsole": None,
# }
# Configure item pipelines
# See https://docs.scrapy.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
    # the value is a priority, 1-1000; the smaller the value, the higher the priority
    "scrapy_dangdangnet.pipelines.ScrapyDangdangnetPipeline": 300,
    # DangDangDownloadPipeline enables the second pipeline
    "scrapy_dangdangnet.pipelines.DangDangDownloadPipeline": 301,
}
# Enable and configure the AutoThrottle extension (disabled by default)
# See https://docs.scrapy.org/en/latest/topics/autothrottle.html
# AUTOTHROTTLE_ENABLED = True
# The initial download delay
# AUTOTHROTTLE_START_DELAY = 5
# The maximum download delay to be set in case of high latencies
# AUTOTHROTTLE_MAX_DELAY = 60
# The average number of requests Scrapy should be sending in parallel to
# each remote server
# AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
# Enable showing throttling stats for every response received:
# AUTOTHROTTLE_DEBUG = False
# Enable and configure HTTP caching (disabled by default)
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
# HTTPCACHE_ENABLED = True
# HTTPCACHE_EXPIRATION_SECS = 0
# HTTPCACHE_DIR = "httpcache"
# HTTPCACHE_IGNORE_HTTP_CODES = []
# HTTPCACHE_STORAGE = "scrapy.extensions.httpcache.FilesystemCacheStorage"
# Set settings whose default value is deprecated to a future-proof value
REQUEST_FINGERPRINTER_IMPLEMENTATION = "2.7"
TWISTED_REACTOR = "twisted.internet.asyncioreactor.AsyncioSelectorReactor"
FEED_EXPORT_ENCODING = "utf-8"
```
Run the project
scrapy crawl dang
7.4 Movie Paradise (dytt8)
Concept used: one item carrying data from multiple page levels
Create the project
scrapy startproject scrapy_movie
cd ......./spiders
scrapy genspider mv https://dy.dytt8.net/html/gndy/dyzz/index.html
Project code
mv.py
```python
import scrapy
from scrapy_movie.items import ScrapyMovieItem


class MvSpider(scrapy.Spider):
    name = "mv"
    allowed_domains = ["dy.dytt8.net"]
    start_urls = ["https://dy.dytt8.net/html/gndy/dyzz/index.html"]

    def parse(self, response):
        print('=========running==============')
        # first level: get the name and the href leading to the second level
        # href = //div[@class="co_content8"]/ul/table[@class="tbspan"]//td/b/a/@href
        # name = //div[@class="co_content8"]/ul/table[@class="tbspan"]//td/b/a/text()
        a_list = response.xpath('//div[@class="co_content8"]/ul//table[@class="tbspan"]//a')
        for a in a_list:
            # name on the first page, plus the link to follow
            name = a.xpath('./text()').extract_first()
            href = a.xpath('./@href').extract_first()
            # address of the second-level page
            url = 'https://dy.dytt8.net' + href
            # enter the second level, passing name along via meta
            yield scrapy.Request(url=url, callback=self.parse_second, meta={'name': name})

    def parse_second(self, response):
        # the span tag is not recognized here
        # likewise tags such as tbody are not recognized
        # if you get no data back, check your xpath
        src = response.xpath('//div[@id="Zoom"]//img/@src').extract_first()
        name = response.meta['name']
        movie = ScrapyMovieItem(name=name, src=src)
        yield movie
```
items.py
```python
# Define here the models for your scraped items
#
# See documentation in:
# https://docs.scrapy.org/en/latest/topics/items.html
import scrapy


class ScrapyMovieItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    # movie name
    name = scrapy.Field()
    # movie poster
    src = scrapy.Field()
```
pipelines.py
```python
# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://docs.scrapy.org/en/latest/topics/item-pipeline.html

# useful for handling different item types with a single interface
from itemadapter import ItemAdapter


class ScrapyMoviePipeline:
    def open_spider(self, spider):
        self.fp = open('movie.json', 'w', encoding='utf-8')
        self.fp.write('[')

    def process_item(self, item, spider):
        self.fp.write(str(item))
        self.fp.write(',')
        return item

    def close_spider(self, spider):
        self.fp.write(']')
        self.fp.close()
```
settings.py
```python
# Scrapy settings for scrapy_movie project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
# https://docs.scrapy.org/en/latest/topics/settings.html
# https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
# https://docs.scrapy.org/en/latest/topics/spider-middleware.html
BOT_NAME = "scrapy_movie"
SPIDER_MODULES = ["scrapy_movie.spiders"]
NEWSPIDER_MODULE = "scrapy_movie.spiders"
# Crawl responsibly by identifying yourself (and your website) on the user-agent
#USER_AGENT = "scrapy_movie (+http://www.yourdomain.com)"
# Obey robots.txt rules
# ROBOTSTXT_OBEY = True
# Configure maximum concurrent requests performed by Scrapy (default: 16)
#CONCURRENT_REQUESTS = 32
# Configure a delay for requests for the same website (default: 0)
# See https://docs.scrapy.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
#DOWNLOAD_DELAY = 3
# The download delay setting will honor only one of:
#CONCURRENT_REQUESTS_PER_DOMAIN = 16
#CONCURRENT_REQUESTS_PER_IP = 16
# Disable cookies (enabled by default)
#COOKIES_ENABLED = False
# Disable Telnet Console (enabled by default)
#TELNETCONSOLE_ENABLED = False
# Override the default request headers:
#DEFAULT_REQUEST_HEADERS = {
# "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
# "Accept-Language": "en",
#}
# Enable or disable spider middlewares
# See https://docs.scrapy.org/en/latest/topics/spider-middleware.html
#SPIDER_MIDDLEWARES = {
# "scrapy_movie.middlewares.ScrapyMovieSpiderMiddleware": 543,
#}
# Enable or disable downloader middlewares
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
#DOWNLOADER_MIDDLEWARES = {
# "scrapy_movie.middlewares.ScrapyMovieDownloaderMiddleware": 543,
#}
# Enable or disable extensions
# See https://docs.scrapy.org/en/latest/topics/extensions.html
#EXTENSIONS = {
# "scrapy.extensions.telnet.TelnetConsole": None,
#}
# Configure item pipelines
# See https://docs.scrapy.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
    "scrapy_movie.pipelines.ScrapyMoviePipeline": 300,
}
# Enable and configure the AutoThrottle extension (disabled by default)
# See https://docs.scrapy.org/en/latest/topics/autothrottle.html
#AUTOTHROTTLE_ENABLED = True
# The initial download delay
#AUTOTHROTTLE_START_DELAY = 5
# The maximum download delay to be set in case of high latencies
#AUTOTHROTTLE_MAX_DELAY = 60
# The average number of requests Scrapy should be sending in parallel to
# each remote server
#AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
# Enable showing throttling stats for every response received:
#AUTOTHROTTLE_DEBUG = False
# Enable and configure HTTP caching (disabled by default)
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
#HTTPCACHE_ENABLED = True
#HTTPCACHE_EXPIRATION_SECS = 0
#HTTPCACHE_DIR = "httpcache"
#HTTPCACHE_IGNORE_HTTP_CODES = []
#HTTPCACHE_STORAGE = "scrapy.extensions.httpcache.FilesystemCacheStorage"
# Set settings whose default value is deprecated to a future-proof value
REQUEST_FINGERPRINTER_IMPLEMENTATION = "2.7"
TWISTED_REACTOR = "twisted.internet.asyncioreactor.AsyncioSelectorReactor"
FEED_EXPORT_ENCODING = "utf-8"
```
Run the project
scrapy crawl mv
7.5 dushu.com, with database storage
Create the project
scrapy startproject scrapy_dushu
cd ..../spiders
# create the spider class (note the -t crawl template)
scrapy genspider -t crawl read https://www.dushu.com/book/1188.html
Differences from before
# the base class differs
class ReadSpider(CrawlSpider)
# and a rules attribute is generated
rules = (Rule(LinkExtractor(allow=r"Items/"), callback="parse_item", follow=True),)
Approach: dump the data straight to json
read.py
```python
import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule
from scrapy_dushu.items import ScrapyDushuItem


class ReadSpider(CrawlSpider):
    name = "read"
    allowed_domains = ["www.dushu.com"]
    # CrawlSpider pitfall: this is not a url that the link rule parses,
    # so starting here would skip the first page
    # start_urls = ["https://www.dushu.com/book/1188.html"]
    # workaround: start from the _1 form that the rule does match
    start_urls = ["https://www.dushu.com/book/1188_1.html"]
    # allow matches the page-number pattern
    # follow=True keeps tracking pagination: even if only pages 1-13 are visible
    # right now, all later page numbers will be followed too
    rules = (Rule(LinkExtractor(allow=r"/book/1188_\d+\.html"), callback="parse_item", follow=True),)

    def parse_item(self, response):
        # print('----------------------------------------------------------')
        img_list = response.xpath('//div[@class="bookslist"]//img')
        for img in img_list:
            name = img.xpath('./@alt').extract_first()
            src = img.xpath('./@data-original').extract_first()
            print(name, src)
            book = ScrapyDushuItem(name=name, src=src)
            yield book
```
settings.py
```python
# Scrapy settings for scrapy_dushu project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
# https://docs.scrapy.org/en/latest/topics/settings.html
# https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
# https://docs.scrapy.org/en/latest/topics/spider-middleware.html
BOT_NAME = "scrapy_dushu"
SPIDER_MODULES = ["scrapy_dushu.spiders"]
NEWSPIDER_MODULE = "scrapy_dushu.spiders"
# Crawl responsibly by identifying yourself (and your website) on the user-agent
# USER_AGENT = "scrapy_dushu (+http://www.yourdomain.com)"
# Obey robots.txt rules
# ROBOTSTXT_OBEY = True
# Configure maximum concurrent requests performed by Scrapy (default: 16)
# CONCURRENT_REQUESTS = 32
# Configure a delay for requests for the same website (default: 0)
# See https://docs.scrapy.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
# DOWNLOAD_DELAY = 3
# The download delay setting will honor only one of:
# CONCURRENT_REQUESTS_PER_DOMAIN = 16
# CONCURRENT_REQUESTS_PER_IP = 16
# Disable cookies (enabled by default)
# COOKIES_ENABLED = False
# Disable Telnet Console (enabled by default)
# TELNETCONSOLE_ENABLED = False
# Override the default request headers:
# DEFAULT_REQUEST_HEADERS = {
# "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
# "Accept-Language": "en",
# }
# Enable or disable spider middlewares
# See https://docs.scrapy.org/en/latest/topics/spider-middleware.html
# SPIDER_MIDDLEWARES = {
# "scrapy_dushu.middlewares.ScrapyDushuSpiderMiddleware": 543,
# }
# Enable or disable downloader middlewares
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
# DOWNLOADER_MIDDLEWARES = {
# "scrapy_dushu.middlewares.ScrapyDushuDownloaderMiddleware": 543,
# }
# Enable or disable extensions
# See https://docs.scrapy.org/en/latest/topics/extensions.html
# EXTENSIONS = {
# "scrapy.extensions.telnet.TelnetConsole": None,
# }
# Configure item pipelines
# See https://docs.scrapy.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
    "scrapy_dushu.pipelines.ScrapyDushuPipeline": 300,
}
# Enable and configure the AutoThrottle extension (disabled by default)
# See https://docs.scrapy.org/en/latest/topics/autothrottle.html
# AUTOTHROTTLE_ENABLED = True
# The initial download delay
# AUTOTHROTTLE_START_DELAY = 5
# The maximum download delay to be set in case of high latencies
# AUTOTHROTTLE_MAX_DELAY = 60
# The average number of requests Scrapy should be sending in parallel to
# each remote server
# AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
# Enable showing throttling stats for every response received:
# AUTOTHROTTLE_DEBUG = False
# Enable and configure HTTP caching (disabled by default)
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
# HTTPCACHE_ENABLED = True
# HTTPCACHE_EXPIRATION_SECS = 0
# HTTPCACHE_DIR = "httpcache"
# HTTPCACHE_IGNORE_HTTP_CODES = []
# HTTPCACHE_STORAGE = "scrapy.extensions.httpcache.FilesystemCacheStorage"
# Set settings whose default value is deprecated to a future-proof value
REQUEST_FINGERPRINTER_IMPLEMENTATION = "2.7"
TWISTED_REACTOR = "twisted.internet.asyncioreactor.AsyncioSelectorReactor"
FEED_EXPORT_ENCODING = "utf-8"
```
pipelines.py
```python
# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://docs.scrapy.org/en/latest/topics/item-pipeline.html

# useful for handling different item types with a single interface
from itemadapter import ItemAdapter


class ScrapyDushuPipeline:
    def open_spider(self, spider):
        self.fp = open('book.json', 'w', encoding='utf-8')

    def process_item(self, item, spider):
        self.fp.write(str(item))
        return item

    def close_spider(self, spider):
        self.fp.close()
```
items.py
```python
# Define here the models for your scraped items
#
# See documentation in:
# https://docs.scrapy.org/en/latest/topics/items.html
import scrapy


class ScrapyDushuItem(scrapy.Item):
    # define the fields for your item here like:
    # name
    name = scrapy.Field()
    # image path
    src = scrapy.Field()
```
Storing the data in the database
Create the table
```sql
CREATE TABLE `book` (
    `id` int(11) NOT NULL AUTO_INCREMENT,
    `name` varchar(255) DEFAULT NULL,
    `src` varchar(255) DEFAULT NULL,
    PRIMARY KEY (`id`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8;
```
Configure the database in settings.py
```python
# database connection info
DB_HOST = '127.0.0.1'
DB_PORT = 3306
DB_USER = 'root'
DB_PASSWORD = 'root'
DB_NAME = 'dushuwang'
# do not put a '-' in the charset name
DB_CHARSET = 'utf8'
```
pipelines.py
```python
# add a new pipeline (to be filled in with the mysql logic below)
class MysqlPipeline:
    def open_spider(self, spider):
        self.fp = open('book.json', 'w', encoding='utf-8')

    def process_item(self, item, spider):
        self.fp.write(str(item))
        return item

    def close_spider(self, spider):
        self.fp.close()
```
Register the pipeline in settings.py
```python
ITEM_PIPELINES = {
    "scrapy_dushu.pipelines.ScrapyDushuPipeline": 300,
    # MysqlPipeline
    "scrapy_dushu.pipelines.MysqlPipeline": 301,
}
```
In pipelines.py, before the MysqlPipeline class, load the parameters from settings.py
```python
# load the settings.py file
from scrapy.utils.project import get_project_settings
```
Install pymysql
# used to connect to the database
pip install pymysql
Rewrite the MysqlPipeline in pipelines.py
```python
import pymysql


class MysqlPipeline:
    def open_spider(self, spider):
        settings = get_project_settings()
        self.host = settings['DB_HOST']
        self.port = settings['DB_PORT']
        self.user = settings['DB_USER']
        self.password = settings['DB_PASSWORD']
        self.name = settings['DB_NAME']
        self.charset = settings['DB_CHARSET']
        self.connect()

    def connect(self):
        self.conn = pymysql.connect(
            host=self.host,
            port=self.port,
            user=self.user,
            password=self.password,
            db=self.name,
            charset=self.charset
        )
        self.cursor = self.conn.cursor()

    def process_item(self, item, spider):
        # use a parameterized query so quotes in the data cannot break the sql
        sql = 'insert into book(name, src) values (%s, %s)'
        self.cursor.execute(sql, (item['name'], item['src']))
        self.conn.commit()
        return item

    def close_spider(self, spider):
        self.cursor.close()
        self.conn.close()
```
Run
scrapy crawl read
7.6 Baidu Translate (POST)
Create the project
scrapy startproject scrapy_fanyi
cd ..../spiders
scrapy genspider fanyi https://fanyi.baidu.com/sug
Edit fanyi.py
```python
import scrapy
import json


class FanyiSpider(scrapy.Spider):
    name = "fanyi"
    allowed_domains = ["fanyi.baidu.com"]
    # a POST request without its form data is meaningless,
    # so start_urls/parse are replaced by start_requests
    # start_urls = ["https://fanyi.baidu.com/sug"]
    # def parse(self, response):
    #     pass

    def start_requests(self):
        url = 'https://fanyi.baidu.com/sug'
        data = {
            'kw': 'final'
        }
        # FormRequest is scrapy's POST request
        yield scrapy.FormRequest(url=url, formdata=data, callback=self.parse_second)

    def parse_second(self, response):
        content = response.text
        # printing the raw response directly would show unreadable escapes
        obj = json.loads(content)
        print(obj)
```
Run directly
scrapy crawl fanyi