Learning to crawl data with Scrapy

1. Scrapy installation

pip install scrapy
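To confirm the install worked, you can print the installed version:

scrapy version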

2. New project

scrapy startproject mySpider
(p2) PS C:\Users\livingbody\Desktop> scrapy startproject mySpider
New Scrapy project 'mySpider', using template directory 'C:\miniconda3\envs\p2\lib\site-packages\scrapy\templates\project', created in:
    C:\Users\livingbody\Desktop\mySpider

You can start your first spider with:
    cd mySpider
    scrapy genspider example example.com

scrapy.cfg : the project's configuration file

mySpider/ : the project's Python module; your code is imported from here

mySpider/items.py : the project's items file, where the Item models are defined

mySpider/pipelines.py : the project's pipelines file

mySpider/settings.py : the project's settings file

mySpider/spiders/ : the directory where the spider code is stored
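As a quick preview of what pipelines.py is for, here is a minimal sketch of a pipeline that writes every scraped item out as one line of JSON. The class name JsonWriterPipeline and the output filename are illustrative, not generated by Scrapy, and a pipeline only runs once it is registered under ITEM_PIPELINES in settings.py:

import json

class JsonWriterPipeline:
    def open_spider(self, spider):
        # called once when the spider starts: open the output file
        self.file = open('items.jl', 'w', encoding='utf-8')

    def close_spider(self, spider):
        # called once when the spider finishes: close the file
        self.file.close()

    def process_item(self, item, spider):
        # called for every item the spider yields: write one JSON line
        self.file.write(json.dumps(dict(item), ensure_ascii=False) + '\n')
        return item

To enable it, add this to settings.py (the number sets the pipeline's running order):

ITEM_PIPELINES = {'mySpider.pipelines.JsonWriterPipeline': 300}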

3. Define the goal

The plan is to crawl the instructor names, titles, and profile information from https://www.goodfrom.com/channel/teacher.shtml.

  1. Open items.py in the mySpider directory

  2. An Item defines structured data fields to hold the crawled data. It works a bit like a Python dict, but adds some extra protection against errors (illustrated after the code below).

  3. An Item is defined by creating a class that subclasses scrapy.Item and declaring class attributes of type scrapy.Field.

  4. Next, create an ItcastItem class and build the item model:

import scrapy

class ItcastItem(scrapy.Item):
    # one Field per attribute to scrape from each instructor entry
    name = scrapy.Field()
    title = scrapy.Field()
    info = scrapy.Field()
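The "extra protection" is easy to see: an Item behaves like a dict for declared fields, but assigning to an undeclared field raises a KeyError instead of silently creating it. A quick illustration (the value is made up, not scraped):

item = ItcastItem()
item['name'] = 'Teacher Zhang'   # fine: 'name' is a declared Field
print(item['name'])              # -> Teacher Zhang
item['nmae'] = 'typo'            # KeyError: ItcastItem does not support field: nmae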

4. Crawling data

Create a spider named itcast in the mySpider/spiders directory, restricted to crawling the goodfrom.com domain:

scrapy genspider itcast "goodfrom.com"
(p2) PS C:\Users\livingbody\Desktop\mySpider> tree /f
Folder PATH listing for volume Windows
Volume serial number is 6C51-3930
C:.
│   scrapy.cfg
└─mySpider
    │   items.py
    │   middlewares.py
    │   pipelines.py
    │   settings.py
    │   __init__.py
    ├─spiders
    │   │   itcast.py
    │   │   __init__.py
    │   │
    │   └─__pycache__
    │           __init__.cpython-39.pyc
    └─__pycache__
            settings.cpython-39.pyc
            __init__.cpython-39.pyc
Scrapy generated the following skeleton in mySpider/spiders/itcast.py:

import scrapy

class ItcastSpider(scrapy.Spider):
    name = 'itcast'
    allowed_domains = ['goodfrom.com']
    start_urls = ['http://goodfrom.com/']

    def parse(self, response):
        pass

Modify it as follows:

import scrapy

class ItcastSpider(scrapy.Spider):
    name = 'itcast'
    allowed_domains = ['goodfrom.com']
    start_urls = ["http://www.goodfrom.com/channel/teacher.shtml"]

    def parse(self, response):
        # save the raw page locally so its HTML structure can be inspected
        with open('teacher.html', 'w', encoding='utf-8') as f:
            f.write(response.text)
Then run the spider from the project root:

scrapy crawl itcast

When the crawl finishes, teacher.html is written to the current directory.

5. Extract data

Now update itcast.py to walk the page with XPath and fill in ItcastItem objects:

import scrapy
from mySpider.items import ItcastItem

class ItcastSpider(scrapy.Spider):
    name = "itcast"
    allowed_domains = ["goodfrom.com"]
    start_urls = ("http://www.goodfrom.com/channel/teacher.shtml",)
    def parse(self, response):
        #open("teacher.html","wb").write(response.body).close()
        # Collection of teacher information
        items = []

        for each in response.xpath("//div[@class='li_txt']"):
            # Wrap our data into an `ItcastItem` object
            item = ItcastItem()
            # extract() returns a list of the matched strings
            name = each.xpath("h3/text()").extract()
            title = each.xpath("h4/text()").extract()
            info = each.xpath("p/text()").extract()

            # xpath returns a list containing one element
            item['name'] = name[0]
            item['title'] = title[0]
            item['info'] = info[0]

            items.append(item)

        # Return the final data directly
        return items
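A note on style: returning a list works, but the more idiomatic Scrapy pattern is to yield each item as soon as it is built, which lets the engine stream results instead of holding them all in memory. A sketch of the same parse method rewritten that way, using get() (available in recent Scrapy versions; it returns the first match or None) in place of indexing into extract():

    def parse(self, response):
        for each in response.xpath("//div[@class='li_txt']"):
            item = ItcastItem()
            # get() returns the first matched string, or None if nothing
            # matched, so an empty result cannot raise an IndexError
            item['name'] = each.xpath("h3/text()").get()
            item['title'] = each.xpath("h4/text()").get()
            item['info'] = each.xpath("p/text()").get()
            yield item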

6. Save data

Scrapy offers four simple ways to save the scraped items: the -o option exports them in the format implied by the file extension. The commands are as follows:

# json format, default Unicode encoding
scrapy crawl itcast -o teachers.json

# json lines format, default Unicode encoding
scrapy crawl itcast -o teachers.jsonl

# csv format, comma-separated values, openable in Excel
scrapy crawl itcast -o teachers.csv

# xml format
scrapy crawl itcast -o teachers.xml
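If you would rather not pass -o on every run, recent Scrapy versions (2.1 and later) can declare the export feed in settings.py instead; a minimal sketch, assuming a UTF-8 JSON feed is wanted:

# in mySpider/settings.py (Scrapy 2.1+)
FEEDS = {
    'teachers.json': {
        'format': 'json',
        'encoding': 'utf8',  # write non-ASCII text directly instead of \uXXXX escapes
    },
}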