Scrapy

A simple guide to using the Python 3 Scrapy framework

Scrapy official website

Scrapy Chinese documentation site

Scrapy

Introduction

Scrapy is an application framework written for crawling websites and extracting structured data.
It can be used in a wide range of programs, including data mining, information processing, and archiving historical data.

Basic architecture

  • Let's start with a diagram of the overall architecture:
    (screenshot: scrapy_all)
The Scrapy workflow diagram shows the following components:
Scrapy Engine: the core of the framework, already implemented by Scrapy; it dispatches all instructions between the other components.
Scheduler: receives requests from the engine and queues them for dispatch.
Downloader: fetches the requested pages from the web.
Item Pipeline: stores and post-processes the scraped data.
Spiders: the crawlers we write ourselves.
In practice we only need to write the Spiders and the Item Pipeline.

Basic workflow

Everything starts from the Spider we write. At the very beginning there is no data,
so the Spider sends its initial requests. A request goes to the Engine, which recognises it as a request and hands it straight to the Scheduler without further processing.
Once the Scheduler has queued a request, it is passed on to the Downloader, which fetches it from the web; the Scheduler itself does no processing either.
The Downloader returns the downloaded data (the HTML source) to the Spider.
The Spider examines what it received:
if it produces new requests, they go back through the Engine to the Scheduler and crawling continues;
if it produces data (items), they go straight to the Item Pipeline for processing and storage.
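A minimal sketch of this loop (the names and selectors below are purely illustrative, not taken from the project built later): whatever parse() yields is routed by the engine, Requests go back to the Scheduler and items go to the Item Pipeline.

import scrapy

class DemoSpider(scrapy.Spider):
    name = "demo"                                   # illustrative name
    start_urls = ["http://example.com/page/1"]      # illustrative start URL

    def parse(self, response):
        # yielding a dict/Item -> the engine routes it to the Item Pipeline
        yield {"title": response.xpath("//title/text()").extract_first()}
        # yielding a Request -> the engine routes it to the Scheduler and crawling continues
        next_page = response.xpath('//a[@rel="next"]/@href').extract_first()
        if next_page:
            yield scrapy.Request(response.urljoin(next_page), callback=self.parse)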

Getting started

This tutorial walks you through the following tasks:
1. Create a Scrapy project
2. Define the Items to extract
3. Write a spider to crawl the site and extract Items
4. Write an Item Pipeline to store the extracted Items (the data)

Creating the project

scrapy startproject freebuf

(screenshot: startproject)
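The command generates the standard Scrapy project skeleton, roughly the layout shown in the screenshot:

freebuf/
    scrapy.cfg            # deploy configuration
    freebuf/
        __init__.py
        items.py          # Item definitions
        middlewares.py    # spider / downloader middlewares
        pipelines.py      # Item Pipelines
        settings.py       # project settings
        spiders/          # our spiders go in this package
            __init__.py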

Defining the Item

An Item is the container that holds the scraped data;
it is used much like a Python dict.
We define fields according to our own needs.
Here the goal is to scrape the url and title of FreeBuf articles,
so we define those two fields.

(screenshot: item1)
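A sketch of what items.py looks like for the two fields above (the version used later in the CrawlSpider section additionally defines img and img_path):

import scrapy

class FreebufItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    title = scrapy.Field()   # article title
    url = scrapy.Field()     # article url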

Writing the first spider

A Spider is a class the user writes to scrape data from a single website (or a group of sites).
scrapy.Spider is the base spider class.
name: identifies the Spider and must therefore be unique within the project.
start_urls: the list of URLs the Spider starts crawling from.
parse() is a method of the spider.
When it is called, the Response object generated for each downloaded initial URL is passed to it as its only argument.
The method is responsible for parsing the response data, extracting data (producing items), and generating Request objects for URLs that need further crawling.
Looking at FreeBuf, the URLs change like this:
http://www.freebuf.com/vuls/page/1  -- article list, page 1
http://www.freebuf.com/vuls/page/2  -- article list, page 2
http://www.freebuf.com/vuls/page/3  -- article list, page 3

(screenshot: freebuftest)

(screenshot: spider)
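The screenshots above showed the first spider. A reconstruction along those lines is below; the XPath selectors are taken from the CrawlSpider version shown further down, and the pagination logic matches the commented fragment in the crawling section, so treat it as a sketch rather than the exact original:

# -*- coding: utf-8 -*-
import scrapy
from freebuf.items import FreebufItem

class FreebuftestSpider(scrapy.Spider):
    name = 'freebuftest'                         # must be unique within the project
    allowed_domains = ['freebuf.com']
    url = 'http://www.freebuf.com/vuls/page/'
    offset = 1
    start_urls = [url + str(offset)]

    def parse(self, response):
        # every article entry lives under the same root node
        for each in response.xpath('//div[@class="news_inner news-list"]'):
            item = FreebufItem()
            item['title'] = each.xpath('.//div[@class="news-info"]/dl/dt/a/text()').extract()[0]
            item['url'] = each.xpath('.//div[@class="news-info"]/dl/dt/a/@href').extract()[0]
            yield item
        # keep requesting the next page (limited to the first few pages while testing)
        if self.offset <= 2:
            self.offset += 1
            yield scrapy.Request(url=self.url + str(self.offset), callback=self.parse)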

Analysis

Open the Chrome developer tools and pick out the fields we want to scrape.
All FreeBuf article cards share the same layout,
so we look for a common root node and match the data starting from that root.
XPath is used for the matching here; the Chrome extension XPath Helper is handy for testing expressions.
.   the current node
//  any position in the document
..  the parent of the current node (one level up)
@   selects an attribute
I won't explain XPath in detail here; see the official documentation:

XPath tutorial

(screenshot: analysis 1)

(screenshot: analysis 2)

Crawling

# if self.offset <= 2:
#     self.offset += 1
#     yield scrapy.Request(url=self.url + str(self.offset), callback=self.parse)
# The commented lines above send a new GET request for the next url (it joins the crawl queue) and handle it with the self.parse callback.
# A POST request can be sent with yield scrapy.FormRequest(url, formdata, callback).
scrapy crawl "<name of your spider>" -o 1.json
-o specifies the output path.
Without an Item Pipeline, the data is stored like this (screenshots below):

(screenshot: nouseitem)

(screenshot: spider1)

scrapy crawl "<name of your spider>"
With an Item Pipeline enabled, the data is stored like this (screenshots below; a minimal sketch of the pipeline follows them):

(screenshot: itempipeline)

(screenshot: itempipeline2)
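The pipeline used here is essentially the FreebufPipeline shown in full in the pipelines.py listing further down; stripped to its core it is just this (remember to enable it via ITEM_PIPELINES in settings.py):

import json

class FreebufPipeline(object):
    def __init__(self):
        self.filename = open("free.json", "w", encoding="utf-8")

    def process_item(self, item, spider):
        # write each item as one JSON object per line
        self.filename.write(json.dumps(dict(item), ensure_ascii=False) + ",\n")
        return item

    def close_spider(self, spider):
        self.filename.close()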

extract()

Scrapy Shell
The Scrapy shell is an interactive console that lets us try out and debug scraping code without starting a spider.
It is also handy for testing XPath or CSS expressions and seeing exactly what they return,
which makes extracting data from the crawled pages much easier.
scrapy shell "http://www.freebuf.com/vuls/page/1"
extract(): serializes the selected nodes to unicode strings and returns them as a list.

(screenshot: extract)
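A typical shell session against the page above might look like this (the selectors are the same ones the spider uses; the exact output depends on the live page):

scrapy shell "http://www.freebuf.com/vuls/page/1"
>>> node = response.xpath('//div[@class="news_inner news-list"]')
>>> node.xpath('.//div[@class="news-info"]/dl/dt/a/text()').extract()   # list of unicode titles
>>> node.xpath('.//div[@class="news-info"]/dl/dt/a/@href').extract()    # list of article urls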

CrawlSpiders

This part of the tutorial covers:
1. Using CrawlSpider
2. Using DOWNLOADER_MIDDLEWARES (downloader middlewares)
3. The parameters in the settings file
4. How to download images
5. If you also want the content behind the extracted links, send each extracted link out again
as a new request with a callback, and extract what you need inside that callback function.
The following command quickly generates the code for a CrawlSpider template:
scrapy genspider -t crawl tencent tencent.com
CrawlSpider is a subclass of Spider.
It defines a set of rules (Rule) that provide a convenient mechanism for following links, which makes it a better fit when the job is to extract links from crawled pages and keep following them.

settings.py

# -*- coding: utf-8 -*-
import random   # needed below for the randomised DOWNLOAD_DELAY

BOT_NAME = 'freebuf'

SPIDER_MODULES = ['freebuf.spiders']
NEWSPIDER_MODULE = 'freebuf.spiders'

USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/62.0.3202.94 Safari/537.36'

# pool of User-Agents for the RandomUserAgent downloader middleware
USER_AGENTS = [
    'Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.0)',
    'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.2)',
    'Opera/9.27 (Windows NT 5.2; U; zh-cn)',
    'Opera/8.0 (Macintosh; PPC Mac OS X; U; en)',
    'Mozilla/5.0 (Macintosh; PPC Mac OS X; U; en) Opera 8.0',
    'Mozilla/5.0 (Linux; U; Android 4.0.3; zh-cn; M032 Build/IML74K) AppleWebKit/534.30 (KHTML, like Gecko) Version/4.0 Mobile Safari/534.30',
    'Mozilla/5.0 (Windows; U; Windows NT 5.2) AppleWebKit/525.13 (KHTML, like Gecko) Chrome/0.2.149.27 Safari/525.13'
]

# pool of proxies for the RandomProxy downloader middleware
PROXIES = [
    {"ip_port": "61.135.217.7:80", "user_passwd": ""},
    {"ip_port": "221.10.159.234:1337", "user_passwd": ""},
    {"ip_port": "220.249.185.178:9999", "user_passwd": ""},
    {"ip_port": "116.23.137.56:9999", "user_passwd": ""},
    {"ip_port": "183.56.177.130:808", "user_passwd": ""},
    {"ip_port": "221.214.214.144:53281", "user_passwd": ""},
    {"ip_port": "61.155.164.109:3128", "user_passwd": ""},
    {"ip_port": "218.56.132.158:8080", "user_passwd": ""},
    {"ip_port": "113.200.159.155:9999", "user_passwd": ""},
    {"ip_port": "123.7.38.31:9999", "user_passwd": ""},
    {"ip_port": "61.163.39.70:9999", "user_passwd": ""}
]

# where the ImagesPipeline stores downloaded files (imported by pipelines.py; the path here is just an example)
IMAGES_STORE = "images"

# Obey robots.txt rules (whether to honour the robots.txt protocol)
# ROBOTSTXT_OBEY = True

# Configure maximum concurrent requests performed by Scrapy (default: 16)
#CONCURRENT_REQUESTS = 32

# Configure a delay for requests for the same website (default: 0)
# Download delay, in seconds. By default Scrapy does not wait a fixed amount between two
# requests but a random interval between 0.5 and 1.5 times DOWNLOAD_DELAY.
DOWNLOAD_DELAY = random.random() + 0.5

# The download delay setting will honor only one of:
#CONCURRENT_REQUESTS_PER_DOMAIN = 16
#CONCURRENT_REQUESTS_PER_IP = 16

# Disable cookies (enabled by default)
COOKIES_ENABLED = False   # disable cookies; some sites use cookies to detect crawlers
DOWNLOAD_TIMEOUT = 15     # lower the download timeout (default: 180)
RETRY_ENABLED = False     # disable retries
REDIRECT_ENABLED = False  # disable redirects

# Disable Telnet Console (enabled by default)
#TELNETCONSOLE_ENABLED = False

# Override the default request headers:
#DEFAULT_REQUEST_HEADERS = {
#   'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
#   'Accept-Language': 'en',
#}

# Enable or disable spider middlewares
# See http://scrapy.readthedocs.org/en/latest/topics/spider-middleware.html
#SPIDER_MIDDLEWARES = {
#    'freebuf.middlewares.FreebufSpiderMiddleware': 543,
#}

# Enable or disable downloader middlewares
# See http://scrapy.readthedocs.org/en/latest/topics/downloader-middleware.html
# A downloader middleware sits between the engine and the downloader; several can be loaded at once.
# While the engine passes a request to the downloader, a downloader middleware can modify the request
# (for example add HTTP headers or proxy information);
# while the downloader passes the response back to the engine, it can process the response
# (for example decompress gzip content).
DOWNLOADER_MIDDLEWARES = {
    'freebuf.middlewares.RandomUserAgent': 100,
    'freebuf.middlewares.RandomProxy': 101,
    # 'freebuf.middlewares.MyCustomDownloaderMiddleware': 543,
}

# Available log levels: CRITICAL, ERROR, WARNING, INFO, DEBUG
# CRITICAL - critical errors
# ERROR    - regular errors
# WARNING  - warning messages
# INFO     - informational messages
# DEBUG    - debugging messages
# Log file, created in the current directory (here lagou.log):
# LOG_FILE = "lagou.log"
# Record log messages of INFO level and above:
# LOG_LEVEL = "INFO"

# Enable or disable extensions
# See http://scrapy.readthedocs.org/en/latest/topics/extensions.html
#EXTENSIONS = {
#    'scrapy.extensions.telnet.TelnetConsole': None,
#}

# Configure item pipelines
# See http://scrapy.readthedocs.org/en/latest/topics/item-pipeline.html
# Pipelines; the lower the value, the higher the priority.
ITEM_PIPELINES = {
    'freebuf.pipelines.FreebufPipeline': 300,
    'freebuf.pipelines.ImagesPipeline': 301,
}

# Enable and configure the AutoThrottle extension (disabled by default)
# See http://doc.scrapy.org/en/latest/topics/autothrottle.html
#AUTOTHROTTLE_ENABLED = True
# The initial download delay
#AUTOTHROTTLE_START_DELAY = 5
# The maximum download delay to be set in case of high latencies
#AUTOTHROTTLE_MAX_DELAY = 60
# The average number of requests Scrapy should be sending in parallel to
# each remote server
#AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
# Enable showing throttling stats for every response received:
#AUTOTHROTTLE_DEBUG = False

# Enable and configure HTTP caching (disabled by default)
# See http://scrapy.readthedocs.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
#HTTPCACHE_ENABLED = True
#HTTPCACHE_EXPIRATION_SECS = 0
#HTTPCACHE_DIR = 'httpcache'
#HTTPCACHE_IGNORE_HTTP_CODES = []
#HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'

middlewares.py

# -*- coding: utf-8 -*-
import base64
import random

from .settings import USER_AGENTS
from .settings import PROXIES

# process_request(self, request, spider)
# is called for every request that passes through the downloader middleware.

class RandomUserAgent(object):
    """Pick a random User-Agent for every request."""
    def process_request(self, request, spider):
        useragent = random.choice(USER_AGENTS)
        # print(useragent)
        request.headers.setdefault("User-Agent", useragent)

class RandomProxy(object):
    """Pick a random proxy from PROXIES for every request."""
    def process_request(self, request, spider):
        proxy = random.choice(PROXIES)
        if not proxy['user_passwd']:
            # proxy without authentication
            request.meta['proxy'] = "http://" + proxy['ip_port']
        else:
            # base64-encode the credentials (bytes in, str out for the header value)
            base64_userpasswd = base64.b64encode(proxy['user_passwd'].encode()).decode()
            # format expected by the proxy server
            request.headers['Proxy-Authorization'] = 'Basic ' + base64_userpasswd
            request.meta['proxy'] = "http://" + proxy['ip_port']

pipelines.py

# -*- coding: utf-8 -*-
# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: http://doc.scrapy.org/en/latest/topics/item-pipeline.html
import json
import os

import scrapy
from scrapy.exceptions import DropItem
from scrapy.pipelines.images import ImagesPipeline

from .settings import IMAGES_STORE
# from scrapy.utils.project import get_project_settings

class FreebufPipeline(object):
    """Write every item to free.json, one JSON object per line."""
    def __init__(self):
        self.filename = open("free.json", "w", encoding='utf-8')

    def process_item(self, item, spider):
        text = json.dumps(dict(item), ensure_ascii=False) + ",\n"
        self.filename.write(text)
        return item

    def close_spider(self, spider):
        self.filename.close()

class ImagesPipeline(ImagesPipeline):   # subclasses (and shadows) scrapy's ImagesPipeline
    # IMAGES_STORE = get_project_settings().get("IMAGES_STORE")
    IMAGES_STORE = IMAGES_STORE

    def get_media_requests(self, item, info):
        image_url = item["img"]
        yield scrapy.Request(image_url)

    def item_completed(self, results, item, info):
        # Standard pattern: collect the paths of the successfully downloaded images
        # (see the ImagesPipeline source for the structure of `results`).
        image_paths = [x["path"] for ok, x in results if ok]
        if not image_paths:
            raise DropItem("Item contains no images")
        # rename the downloaded file to "<title>.jpg"
        os.rename(self.IMAGES_STORE + "/" + image_paths[0],
                  self.IMAGES_STORE + "/" + item["title"] + ".jpg")
        item["img_path"] = self.IMAGES_STORE + "/" + item["title"] + ".jpg"
        return item

# get_media_requests() generates one Request for every image link;
# its output becomes the `results` argument of item_completed().
# `results` is a list of tuples (success, image_info_or_failure).
# If success is True, image_info_or_failure is a dict with the keys url, path and checksum.

freebufcrawl.py

# -*- coding: utf-8 -*-
import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule
from freebuf.items import FreebufItem

class FreebufcrawlSpider(CrawlSpider):
    name = 'freebufcrawl'
    allowed_domains = ['freebuf.com']
    start_urls = ['http://www.freebuf.com/vuls/page/1']

    # LinkExtractor(allow=r'/page/\d+'):
    #   after the start_urls responses come back, links matching the regex are extracted
    #   from the HTML and returned as a list of matching link objects.
    # Rule(...):
    #   each extracted link is requested in turn and handled by the given callback.
    # follow=True:
    #   whether links extracted from those responses should themselves be followed,
    #   i.e. sent out again as new requests. If callback is None, follow defaults to True,
    #   otherwise it defaults to False.
    rules = (
        Rule(LinkExtractor(allow=r'/page/\d+'), callback='parse1_item', follow=True),
    )

    def parse1_item(self, response):
        print(response.url)
        for each in response.xpath('//div[@class="news_inner news-list"]'):
            i = FreebufItem()
            i['title'] = each.xpath('.//div[@class="news-info"]/dl/dt/a/text()').extract()[0]
            i['url'] = each.xpath('.//div[@class="news-info"]/dl/dt/a/@href').extract()[0]
            yield i

(screenshot: scrapygenjin)

items.py

import scrapy

class FreebufItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    title = scrapy.Field()
    url = scrapy.Field()
    img = scrapy.Field()        # url of the article's cover image
    img_path = scrapy.Field()   # local path the image is saved to
One thing to note:
Rule(page_lx, callback = 'parse', follow = True)
# do not use 'parse' as the callback name
CrawlSpider uses the parse method to implement its own logic; if you override parse, the crawl spider will break.

full is a sub-folder used to separate full-size images from thumbnails (if thumbnail generation is enabled). See the thumbnail generation section of the docs for details.

(screenshot: xiaoguotu)
(screenshot: free)

Simulating POST requests

This part of the tutorial covers:
1. Logging in with a POST request.

POST requests

yield scrapy.FormRequest(url, formdata, callback) can be used to send a POST request.
If the very first request the spider sends should already be a POST, override the spider's start_requests(self) method;
the urls in start_urls will then no longer be requested automatically.
Example below (a sketch of the same idea follows the screenshot); none of the other project files were changed, this only tests the login.

(screenshot: post2)
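A sketch of that idea, with a made-up login URL and form field names (replace them with whatever the target site actually uses):

# -*- coding: utf-8 -*-
import scrapy

class PostLoginSpider(scrapy.Spider):
    name = 'postlogin'
    # hypothetical login endpoint, for illustration only
    login_url = 'http://example.com/login'

    def start_requests(self):
        # because start_requests is overridden, start_urls is never used;
        # the very first request the spider sends is this POST
        yield scrapy.FormRequest(
            url=self.login_url,
            formdata={'username': 'your_name', 'password': 'your_password'},  # assumed field names
            callback=self.after_login,
        )

    def after_login(self, response):
        # the session cookies returned by the login are kept automatically
        print(response.status, response.url)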

Simulating a user login

Use the FormRequest.from_response() method to simulate a user login.

Websites often use <input type="hidden"> fields to pre-fill certain form values (such as session data or the authentication token on a login page).
When scraping such a page with Scrapy, if you want to keep those pre-filled values and only override fields like the username and password,
use the FormRequest.from_response() method.
Example below (a sketch follows the screenshot); none of the other project files were changed, this only tests the login.

(screenshot: post1)
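A sketch of the same login done with from_response(), again with a made-up URL and field names:

# -*- coding: utf-8 -*-
import scrapy

class FromResponseLoginSpider(scrapy.Spider):
    name = 'fromresponselogin'
    # hypothetical login page, for illustration only
    start_urls = ['http://example.com/login']

    def parse(self, response):
        # from_response() keeps the hidden / pre-filled fields of the login form
        # and only overrides the fields passed in formdata
        yield scrapy.FormRequest.from_response(
            response,
            formdata={'username': 'your_name', 'password': 'your_password'},  # assumed field names
            callback=self.after_login,
        )

    def after_login(self, response):
        # save the page we land on after logging in
        with open('deng.html', 'w', encoding='utf-8') as f:
            f.write(response.text)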

If nothing else works, log in in the browser, copy the session cookies, and use them directly to simulate the logged-in state:
# -*- coding: utf-8 -*-
import scrapy

class ExampleSpider(scrapy.Spider):
    name = 'toolscrawl'
    allowed_domains = ['t00ls.net']
    start_urls = ['https://www.t00ls.net/redirect-42663.html#lastpost']
    # cookies copied from a logged-in browser session
    cookies = {
        "discuz_fastpostrefresh": "0",
        "smile": "6D1",
        "td_cookie": "18412346722144123069412388952927",
        "UTH_cookietime": "2592000",
        "UTH_auth": "d9d711323ZIHT32212G123123rSpA3T3123fr6Iix4UiQsMJm4n4ZMSCfJVV7PPVavX%2B7Ed9nzhPJDgEPpAUCvrLEKzqeEou5103d8eT0bma6Rr4dMjHOTuQQ",
        "UTH_sid": "MmDXUm"
    }

    # override the spider's start_requests method so the very first request already carries the cookies
    def start_requests(self):
        for url in self.start_urls:
            yield scrapy.FormRequest(url, cookies=self.cookies, callback=self.parse_page)

    # handle the response
    def parse_page(self, response):
        print("===========" + response.url)
        with open("deng.html", "w") as filename:
            filename.write(response.body.decode('utf-8'))

(screenshot: post3)

(screenshot: tools1)

Saving to a database

MongoDB

  • items.py

    import scrapy

    class DoubanItem(scrapy.Item):
        # define the fields for your item here like:
        name = scrapy.Field()      # movie title
        pingjia = scrapy.Field()   # number of ratings
        xing = scrapy.Field()      # score
        jianjie = scrapy.Field()   # one-line quote / synopsis
        url = scrapy.Field()
        # images = scrapy.Field()
  • settings.py

    # -*- coding: utf-8 -*-
    # Scrapy settings for douban project
    #
    # For simplicity, this file contains only settings considered important or
    # commonly used. You can find more settings consulting the documentation:
    #
    # http://doc.scrapy.org/en/latest/topics/settings.html
    # http://scrapy.readthedocs.org/en/latest/topics/downloader-middleware.html
    # http://scrapy.readthedocs.org/en/latest/topics/spider-middleware.html
    BOT_NAME = 'douban'
    SPIDER_MODULES = ['douban.spiders']
    NEWSPIDER_MODULE = 'douban.spiders'
    MYSQL_HOST = '127.0.0.1'
    MYSQL_PORT = 3306
    MYSQL_DBNAME = 'douban'
    MYSQL_USER = 'root'
    MYSQL_PASSWORD = 'root'
    # MongoDB host (loopback address, 127.0.0.1)
    MONGODB_HOST = '127.0.0.1'
    # port number, 27017 by default
    MONGODB_PORT = 27017
    # database name
    MONGODB_DBNAME = 'DouBan'
    # collection that stores this crawl's data
    MONGODB_DOCNAME = 'DouBanMovies'
    # Crawl responsibly by identifying yourself (and your website) on the user-agent
    # USER_AGENT = 'douban (+http://www.yourdomain.com)'
    USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/61.0.3163.100 Safari/537.36'
    # Obey robots.txt rules
    # ROBOTSTXT_OBEY = True
    # Configure maximum concurrent requests performed by Scrapy (default: 16)
    #CONCURRENT_REQUESTS = 32
    # Configure a delay for requests for the same website (default: 0)
    # See http://scrapy.readthedocs.org/en/latest/topics/settings.html#download-delay
    # See also autothrottle settings and docs
    #DOWNLOAD_DELAY = 3
    # The download delay setting will honor only one of:
    #CONCURRENT_REQUESTS_PER_DOMAIN = 16
    #CONCURRENT_REQUESTS_PER_IP = 16
    # Disable cookies (enabled by default)
    #COOKIES_ENABLED = False
    # Disable Telnet Console (enabled by default)
    #TELNETCONSOLE_ENABLED = False
    # Override the default request headers:
    # DEFAULT_REQUEST_HEADERS = {
    # 'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    # 'Accept-Language': 'en',
    # }
    # Enable or disable spider middlewares
    # See http://scrapy.readthedocs.org/en/latest/topics/spider-middleware.html
    SPIDER_MIDDLEWARES = {
    'douban.middlewares.DoubanSpiderMiddleware': 543,
    }
    # Enable or disable downloader middlewares
    # See http://scrapy.readthedocs.org/en/latest/topics/downloader-middleware.html
    #DOWNLOADER_MIDDLEWARES = {
    # 'douban.middlewares.MyCustomDownloaderMiddleware': 543,
    #}
    # Enable or disable extensions
    # See http://scrapy.readthedocs.org/en/latest/topics/extensions.html
    #EXTENSIONS = {
    # 'scrapy.extensions.telnet.TelnetConsole': None,
    #}
    # Configure item pipelines
    # See http://scrapy.readthedocs.org/en/latest/topics/item-pipeline.html
    ITEM_PIPELINES = {
    'douban.pipelines.DoubanPipeline': 300,
    }
    # Enable and configure the AutoThrottle extension (disabled by default)
    # See http://doc.scrapy.org/en/latest/topics/autothrottle.html
    #AUTOTHROTTLE_ENABLED = True
    # The initial download delay
    #AUTOTHROTTLE_START_DELAY = 5
    # The maximum download delay to be set in case of high latencies
    #AUTOTHROTTLE_MAX_DELAY = 60
    # The average number of requests Scrapy should be sending in parallel to
    # each remote server
    #AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
    # Enable showing throttling stats for every response received:
    #AUTOTHROTTLE_DEBUG = False
    # Enable and configure HTTP caching (disabled by default)
    # See http://scrapy.readthedocs.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
    #HTTPCACHE_ENABLED = True
    #HTTPCACHE_EXPIRATION_SECS = 0
    #HTTPCACHE_DIR = 'httpcache'
    #HTTPCACHE_IGNORE_HTTP_CODES = []
    #HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'
  • doubancrawl.py

    # -*- coding: utf-8 -*-
    import scrapy
    from douban.items import DoubanItem

    class DoubancrawlSpider(scrapy.Spider):
        name = 'doubancrawl'
        allowed_domains = ['douban.com']
        offset = 0
        url = 'https://movie.douban.com/top250?start='
        start_urls = [url + str(offset)]

        def parse(self, response):
            for each in response.xpath('//div[@class="item"]/div[@class="info"]'):
                item = DoubanItem()
                item['name'] = each.xpath('.//a/span[1]/text()').extract()[0]
                item['url'] = each.xpath('.//a/@href').extract()[0]
                item['pingjia'] = each.xpath('.//div/span[4]/text()').extract()[0]
                item['xing'] = each.xpath('.//div/span[2]/text()').extract()[0]
                jianjie = each.xpath('.//p[@class="quote"]/span/text()').extract()
                # some entries have no quote at all
                if len(jianjie) == 0:
                    item['jianjie'] = []
                else:
                    item['jianjie'] = jianjie[0]
                yield item
            # the Top 250 pages run from start=0 to start=225 in steps of 25
            if self.offset < 225:
                self.offset += 25
                yield scrapy.Request(url=self.url + str(self.offset), callback=self.parse)
  • pipelines.py

    # -*- coding: utf-8 -*-
    import pymongo

    from .settings import MONGODB_HOST
    from .settings import MONGODB_PORT
    from .settings import MONGODB_DBNAME
    from .settings import MONGODB_DOCNAME

    class DoubanPipeline(object):
        def __init__(self):
            # read host, port and database name from settings
            host = MONGODB_HOST
            port = MONGODB_PORT
            dbname = MONGODB_DBNAME
            # client = pymongo.MongoClient('mongodb://lxhsec:root@localhost:27017/dbname')  # with auth
            client = pymongo.MongoClient(host=host, port=port)
            mdb = client[dbname]
            self.post = mdb[MONGODB_DOCNAME]

        def process_item(self, item, spider):
            data = dict(item)
            self.post.insert_one(data)
            return item

Using MongoDB

(screenshot: mongo)
To browse all the stored data you can use Robomongo, a MongoDB GUI client.

(screenshot: mongogui)
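To check the result from Python instead of a GUI, a quick pymongo query against the names configured in settings.py also works (a minimal sketch):

import pymongo

client = pymongo.MongoClient('mongodb://127.0.0.1:27017')
collection = client['DouBan']['DouBanMovies']   # database / collection names from settings.py
for doc in collection.find().limit(3):          # peek at the first few stored movies
    print(doc['name'], doc['xing'], doc['url'])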

MySQL

1. Create the database:
create database douban charset='utf8';
2. Create the table:
create table doubanmovie(
    id int not null primary key auto_increment,
    name varchar(20) unique not null,
    pingjia varchar(10),
    xing varchar(10),
    jianjie varchar(50),
    url varchar(50)
);
  • pipelines.py
    # -*- coding: utf-8 -*-
    import pymysql

    from .settings import MYSQL_HOST
    from .settings import MYSQL_PORT
    from .settings import MYSQL_DBNAME
    from .settings import MYSQL_USER
    from .settings import MYSQL_PASSWORD

    class DoubanPipeline(object):
        def __init__(self):
            # read host, port, credentials and database name from settings
            self.conn = pymysql.connect(host=MYSQL_HOST,
                                        port=MYSQL_PORT,
                                        user=MYSQL_USER,
                                        password=MYSQL_PASSWORD,
                                        database=MYSQL_DBNAME,
                                        charset='utf8')
            self.cs = self.conn.cursor()

        def process_item(self, item, spider):
            params = [item['name'], item['pingjia'], item['xing'], item['jianjie'], item['url']]
            self.cs.execute("insert into doubanmovie(name,pingjia,xing,jianjie,url) values(%s,%s,%s,%s,%s)", params)
            self.conn.commit()
            return item

        def close_spider(self, spider):
            self.cs.close()
            self.conn.close()
The other files are the same as in the MongoDB version.
I did not add de-duplication or any exception handling.
I was also a bit careless when creating the table
and initially forgot the id column.

(screenshot: mysql1)

Distributed crawling with scrapy-redis

To be continued.

These are my personal notes; for reference only.