1. Scrapy installation
pip install scrapy
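To verify the installation, you can print the installed version:
scrapy version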
2. New project
scrapy startproject mySpider
(p2) PS C:\Users\livingbody\Desktop> scrapy startproject mySpider
New Scrapy project 'mySpider', using template directory 'C:\miniconda3\envs\p2\lib\site-packages\scrapy\templates\project', created in:
C:\Users\livingbody\Desktop\mySpider
You can start your first spider with:
cd mySpider
scrapy genspider example example.com
scrapy.cfg : the project's configuration file
mySpider/ : the project's Python module; your code will live here
mySpider/items.py : the project's item definitions
mySpider/pipelines.py : the project's pipelines file
mySpider/settings.py : the project's settings file
mySpider/spiders/ : the directory where the spider code is stored
3. Define goals
We plan to crawl the instructor names, titles, and personal information from https://www.goodfrom.com/channel/teacher.shtml.
An Item defines structured data fields to hold the crawled data. It works a bit like a Python dict, but provides some extra protection (such as rejecting misspelled field names) to reduce errors.
An Item is defined by creating a class that inherits from scrapy.Item and declaring class attributes of type scrapy.Field.
Next, create an ItcastItem class and build the item model.
import scrapy

class ItcastItem(scrapy.Item):
    # Fields for the data we plan to extract
    name = scrapy.Field()   # instructor name
    title = scrapy.Field()  # instructor title
    info = scrapy.Field()   # personal information
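The "extra protection" is easy to see in a quick session. A minimal sketch, run from the project root so mySpider is importable (the sample values are made up): assigning a declared field behaves like a dict, while an undeclared field raises a KeyError.
from mySpider.items import ItcastItem

item = ItcastItem()
item['name'] = 'Teacher Zhang'  # declared field: works like a dict entry (sample value)
print(item['name'])             # -> Teacher Zhang
item['phone'] = '12345'         # undeclared field: raises KeyError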
4. Crawling data
Create a spider named itcast in the mySpider/spiders directory, restricted to the goodfrom.com domain:
scrapy genspider itcast "goodfrom.com"
(p2) PS C:\Users\livingbody\Desktop\mySpider> tree /f
Folder PATH listing for volume Windows
Volume serial number is 6C51-3930
C:.
│ scrapy.cfg
│
└─mySpider
│ items.py
│ middlewares.py
│ pipelines.py
│ settings.py
│ __init__.py
│
├─spiders
│ │ itcast.py
│ │ __init__.py
│ │
│ └─__pycache__
│ __init__.cpython-39.pyc
│
└─__pycache__
settings.cpython-39.pyc
__init__.cpython-39.pyc
Scrapy generates the following template in mySpider/spiders/itcast.py:
import scrapy

class ItcastSpider(scrapy.Spider):
    name = 'itcast'
    allowed_domains = ['goodfrom.com']
    start_urls = ['http://goodfrom.com/']

    def parse(self, response):
        pass
Modify it as follows:
import scrapy

class ItcastSpider(scrapy.Spider):
    name = 'itcast'
    allowed_domains = ['goodfrom.com']
    start_urls = ["http://www.goodfrom.com/channel/teacher.shtml"]

    def parse(self, response):
        # Save the downloaded page so its structure can be inspected
        with open('teacher.html', 'w', encoding='utf-8') as f:
            f.write(response.text)
Run the spider:
scrapy crawl itcast
When the crawl completes, teacher.html is saved in the directory the command was run from.
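Before writing the extraction logic, it can help to test XPath expressions interactively with Scrapy's shell. A sketch of such a session (the li_txt selector anticipates the page structure assumed in the next step):
scrapy shell "http://www.goodfrom.com/channel/teacher.shtml"
>>> response.xpath("//div[@class='li_txt']/h3/text()").extract_first()
>>> response.xpath("//div[@class='li_txt']/h4/text()").extract_first()
>>> exit()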
5. Extract data
Now modify the parse() method to extract the teacher information with XPath selectors:
import scrapy
from mySpider.items import ItcastItem

class ItcastSpider(scrapy.Spider):
    name = "itcast"
    allowed_domains = ["goodfrom.com"]
    start_urls = ("http://www.goodfrom.com/channel/teacher.shtml",)

    def parse(self, response):
        # Collection of teacher information
        items = []
        for each in response.xpath("//div[@class='li_txt']"):
            # Wrap our data into an ItcastItem object
            item = ItcastItem()
            # extract() returns a list of strings
            name = each.xpath("h3/text()").extract()
            title = each.xpath("h4/text()").extract()
            info = each.xpath("p/text()").extract()
            # each XPath above returns a list containing one element
            item['name'] = name[0]
            item['title'] = title[0]
            item['info'] = info[0]
            items.append(item)
        # Return the final data directly
        return items
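Returning a list works, but idiomatic Scrapy yields each item as it is built, so the engine can process items as they arrive. A minimal sketch of the same parse() method rewritten this way:
    def parse(self, response):
        for each in response.xpath("//div[@class='li_txt']"):
            item = ItcastItem()
            # extract_first() returns a single string (or None) instead of a one-element list
            item['name'] = each.xpath("h3/text()").extract_first()
            item['title'] = each.xpath("h4/text()").extract_first()
            item['info'] = each.xpath("p/text()").extract_first()
            yield item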
6. Save data
Scrapy offers four simple ways to save the scraped data; the -o option exports the items to a file in the specified format:
# JSON format, Unicode-escaped by default
scrapy crawl itcast -o teachers.json
# JSON Lines format, Unicode-escaped by default
scrapy crawl itcast -o teachers.jsonl
# CSV (comma-separated values), can be opened in Excel
scrapy crawl itcast -o teachers.csv
# XML format
scrapy crawl itcast -o teachers.xml
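Items can also be saved through the project's pipeline file instead of the -o exports. A minimal sketch for mySpider/pipelines.py, writing one JSON line per item (the class name and output filename here are illustrative):
import json

class MySpiderPipeline:
    def open_spider(self, spider):
        # Open the output file once when the spider starts
        self.file = open('teachers_pipeline.jsonl', 'w', encoding='utf-8')

    def process_item(self, item, spider):
        # Serialize each item as one JSON line; keep non-ASCII text readable
        self.file.write(json.dumps(dict(item), ensure_ascii=False) + '\n')
        return item

    def close_spider(self, spider):
        self.file.close()
To enable the pipeline, register it in settings.py:
ITEM_PIPELINES = {
    'mySpider.pipelines.MySpiderPipeline': 300,
}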