Installing Scrapy and Getting Started

Installation

pip3.7 install Scrapy

Run the scrapy command to check that the installation succeeded:

J-pro:myproject will$ scrapy 
Scrapy 2.1.0 - project: myproject

Usage:
  scrapy <command> [options] [args]

Available commands:
  bench         Run quick benchmark test
  check         Check spider contracts
  crawl         Run a spider
  edit          Edit spider
  fetch         Fetch a URL using the Scrapy downloader
  genspider     Generate new spider using pre-defined templates
  list          List available spiders
  parse         Parse URL (using its spider) and print the results
  runspider     Run a self-contained spider (without creating a project)
  settings      Get settings values
  shell         Interactive scraping console
  startproject  Create new project
  version       Print Scrapy version
  view          Open URL in browser, as seen by Scrapy

Use "scrapy <command> -h" to see more info about a command
  • If you see the usage text above, Scrapy is installed correctly.
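
The same check can also be done from Python. The sketch below is a minimal example that uses only the standard library; the helper name is_installed is hypothetical and is not part of Scrapy. It reports whether a package can be found on the import path without actually importing it.

```python
import importlib.util


def is_installed(package_name):
    """Return True if a top-level package can be found on the import path."""
    return importlib.util.find_spec(package_name) is not None


if __name__ == "__main__":
    # "scrapy" is the import name of the package installed with pip above.
    print("scrapy installed:", is_installed("scrapy"))
```

If this prints False even though pip reported success, pip and the Python interpreter in use probably belong to different installations.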

Quick Start

Create a project with Scrapy:

scrapy startproject myproject

Change into the project and list its contents; the layout is as follows:

J-pro:myproject will$ ls -al
total 8
drwxr-xr-x   4 will  staff  128  6 11 23:47 .
drwxr-xr-x   3 will  staff   96  6 11 23:47 ..
drwxr-xr-x  10 will  staff  320  6 11 23:47 myproject // project package directory
-rw-r--r--   1 will  staff  261  6 11 23:18 scrapy.cfg // project configuration file
J-pro:myproject will$ cd myproject/
J-pro:myproject will$ ls -al
total 56
drwxr-xr-x  10 will  staff   320  6 11 23:47 .
drwxr-xr-x   4 will  staff   128  6 11 23:47 ..
-rw-r--r--   1 will  staff     0  6 11 23:03 __init__.py
drwxr-xr-x   5 will  staff   160  6 11 23:42 __pycache__
-rw-r--r--   1 will  staff  8407  6 11 23:47 items.json // scraped data as JSON (written by a crawl, not by startproject)
-rw-r--r--   1 will  staff   369  6 11 23:42 items.py // defines the structure of the data to extract
-rw-r--r--   1 will  staff  3587  6 11 23:18 middlewares.py // middlewares that hook into Scrapy's request/response processing
-rw-r--r--   1 will  staff   283  6 11 23:18 pipelines.py // post-processes the extracted items, e.g. saving them
-rw-r--r--   1 will  staff  3115  6 11 23:18 settings.py // project settings
drwxr-xr-x   6 will  staff   192  6 11 23:47 spiders // directory holding the spider code
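
To make the role of pipelines.py more concrete, here is a minimal sketch of an item pipeline. The class name CleanFieldsPipeline is hypothetical. process_item(self, item, spider) is the method Scrapy calls for each item, and a pipeline only runs once it is listed in ITEM_PIPELINES in settings.py. The sketch assumes that every list-valued field holds strings, which is what .extract() returns.

```python
class CleanFieldsPipeline:
    """Hypothetical pipeline: tidy up every list-of-strings field of an item."""

    def process_item(self, item, spider):
        for key in list(item.keys()):
            value = item[key]
            if isinstance(value, list):
                # Trim each string and drop entries that are only whitespace.
                item[key] = [v.strip() for v in value if v.strip()]
        return item


# To enable it, settings.py would need an entry like this (assumed module path):
# ITEM_PIPELINES = {"myproject.pipelines.CleanFieldsPipeline": 300}

if __name__ == "__main__":
    sample = {"title": ["  Hello  ", "\n"], "author": [" will "], "reply": ["12"]}
    print(CleanFieldsPipeline().process_item(sample, spider=None))
```

The number 300 is the pipeline's order value; pipelines with lower values run first.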

Hands-on Demo

Edit items.py and declare the fields to scrape:

# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# https://docs.scrapy.org/en/latest/topics/items.html

import scrapy


class DetailItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    title = scrapy.Field()
    author = scrapy.Field()
    reply = scrapy.Field()

Go into the spiders folder and create a spider file named myspider.py:

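import scrapy
from myproject.items import DetailItem


class MySpider(scrapy.Spider):
    """
    name: the attribute Scrapy uses to identify this spider; must be unique within the project
    allowed_domains: list of domains the spider may crawl; if unset, all domains are allowed
    start_urls: list of URLs the crawl starts from
    start_requests: by default reads the links in start_urls and turns each one into a Request
                    (historically via make_requests_from_url, which is deprecated);
                    overriding start_requests lets us generate links that follow
                    our own pattern instead
    parse: callback that processes a response and returns the extracted data and the URLs to follow
    log: writes a log message
    closed: called when the spider is closed
    """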
    # Spider name, used on the command line
    name = "spidertieba"
    # Domains the spider is allowed to crawl
    allowed_domains = ["baidu.com"]
    # URLs to start crawling from
    start_urls = [
        "http://tieba.baidu.com/f?kw=%E7%BD%91%E7%BB%9C%E7%88%AC%E8%99%AB&ie=utf-8",
    ]

    # Callback that extracts the data
    def parse(self, response):
        for line in response.xpath('//li[@class=" j_thread_list clearfix"]'):
            # Create an item object to hold the scraped fields
            item = DetailItem()
            # Extraction step: select the data with XPath; the exact expressions depend on the page structure
            item['title'] = line.xpath('.//div[contains(@class,"threadlist_title pull_left j_th_tit ")]/a/text()').extract()
            item['author'] = line.xpath('.//div[contains(@class,"threadlist_author pull_right")]//span[contains(@class,"frs-author-name-wrap")]/a/text()').extract()
            item['reply'] = line.xpath('.//div[contains(@class,"col2_left j_threadlist_li_left")]/span/text()').extract()
            yield item
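
The percent-encoded kw value in start_urls is the UTF-8 encoding of the forum keyword 网络爬虫 ("web crawler"). As a small sketch using only the standard library, the same URL can be built from a readable keyword; the helper name build_tieba_url is hypothetical.

```python
from urllib.parse import urlencode


def build_tieba_url(keyword):
    """Build the Tieba forum URL for a keyword, percent-encoding it as UTF-8."""
    query = urlencode({"kw": keyword, "ie": "utf-8"})
    return "http://tieba.baidu.com/f?" + query


if __name__ == "__main__":
    # Produces the same URL that is hard-coded in start_urls.
    print(build_tieba_url("网络爬虫"))
```

A helper like this could be called from start_requests to crawl several forums without hand-encoding each URL.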
  • That completes the code; next, run the spider.
scrapy crawl spidertieba -o items.json
  • scrapy crawl runs the crawl; spidertieba is the name defined in myspider.py. The -o option writes the scraped results to the given file.
  • Running the command produces output like this:
J-pro:myproject will$ scrapy crawl spidertieba -o items.json
2020-06-12 23:05:12 [scrapy.utils.log] INFO: Scrapy 2.1.0 started (bot: myproject)
2020-06-12 23:05:13 [scrapy.utils.log] INFO: Versions: lxml 4.5.1.0, libxml2 2.9.10, cssselect 1.1.0, parsel 1.6.0, w3lib 1.22.0, Twisted 20.3.0, Python 3.7.5 (default, Nov  1 2019, 02:16:32) - [Clang 11.0.0 (clang-1100.0.33.8)], pyOpenSSL 19.1.0 (OpenSSL 1.1.1g  21 Apr 2020), cryptography 2.9.2, Platform Darwin-18.7.0-x86_64-i386-64bit
2020-06-12 23:05:13 [scrapy.utils.log] DEBUG: Using reactor: twisted.internet.selectreactor.SelectReactor
2020-06-12 23:05:13 [scrapy.crawler] INFO: Overridden settings:
{'BOT_NAME': 'myproject',
 'NEWSPIDER_MODULE': 'myproject.spiders',
 'ROBOTSTXT_OBEY': True,
 'SPIDER_MODULES': ['myproject.spiders']}
2020-06-12 23:05:13 [scrapy.extensions.telnet] INFO: Telnet Password: b20d9ac1dc58b0eb
2020-06-12 23:05:13 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
 'scrapy.extensions.telnet.TelnetConsole',
 'scrapy.extensions.memusage.MemoryUsage',
 'scrapy.extensions.feedexport.FeedExporter',
 'scrapy.extensions.logstats.LogStats']
2020-06-12 23:05:13 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware',
 'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
 'scrapy.downloadermiddlewares.retry.RetryMiddleware',
 'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
 'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',

................................................................................................

At this point the specified output file, items.json, appears in the current directory; open it to see the scraped data.
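
Because every field in the spider is filled with .extract(), each value in items.json is a list of strings rather than a single string. The following sketch flattens each record for display, assuming the file is a JSON array of objects with title, author and reply keys. The helper names first_or_empty and summarize are hypothetical.

```python
import json
from pathlib import Path


def first_or_empty(values):
    """Return the first string of a list with whitespace stripped, or "" if empty."""
    return values[0].strip() if values else ""


def summarize(items):
    """Turn scraped records (one list per field) into flat dicts."""
    fields = ("title", "author", "reply")
    return [
        {key: first_or_empty(item.get(key, [])) for key in fields}
        for item in items
    ]


if __name__ == "__main__":
    path = Path("items.json")
    # Only read the file if the crawl has already produced it.
    if path.exists():
        with path.open(encoding="utf-8") as f:
            for row in summarize(json.load(f)):
                print(row)
```

Taking only the first element is a simplification; a field that matched several text nodes would need its values joined instead.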
