Web Crawler Notes

Tue, 2019 Oct 8

Preparation

Focus on how to put a crawler to practical use quickly.

Simple scraping tools

Web Scraper

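A simple browser extension that lets you select elements on a page and scrape them directly. The official tutorial is enough to get started.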

See also

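Basic tutorial
- 新媒体人必会的傻瓜式爬虫工具：上手 Web Scraper 的 5 个步骤 (A foolproof scraping tool every new-media worker should know: 5 steps to get started with Web Scraper)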

Scraping page content with libxml2

The example below uses libxml2 to extract the text and the URL of every link on a page.

import libxml2
import requests

# fetch the HTML
URL = 'http://example.com/'
r = requests.get(URL)
print(r.text)

# parse the HTML, then print the text and href of every <a> element
doc = libxml2.htmlParseDoc(r.text, 'utf8')
for it in doc.xpathEval('//*/a'):
    print(it.content)
    print(it.prop('href'))
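
If the libxml2 Python bindings are not available, roughly the same extraction can be sketched with the standard library's html.parser instead. This is a different technique from the XPath query above, not a drop-in replacement; the names LinkCollector and extract_links and the sample markup are made up for illustration.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect (text, href) pairs for every <a> element (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.links = []      # finished (text, href) pairs
        self._in_a = False   # True while inside an open <a>
        self._href = None    # href of the <a> currently open
        self._text = []      # text chunks seen inside that <a>

    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            self._in_a = True
            self._href = dict(attrs).get('href')  # None when href is missing
            self._text = []

    def handle_data(self, data):
        if self._in_a:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == 'a' and self._in_a:
            self.links.append((''.join(self._text).strip(), self._href))
            self._in_a = False


def extract_links(html):
    """Return a list of (text, href) pairs found in the given HTML string."""
    parser = LinkCollector()
    parser.feed(html)
    parser.close()
    return parser.links


# Sample markup (an assumption, not taken from a real page)
sample = '<html><body><p><a href="/about">About <b>us</b></a></p><a>no href</a></body></html>'
for text, href in extract_links(sample):
    print(text, href)
```

In practice the string passed to extract_links would be r.text from the request above.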

Using BeautifulSoup

Installation

pip install beautifulsoup4

Install under your home directory (per-user install)

pip install beautifulsoup4 --user

Fetch and parse a page. The example below lists every href.

import requests
from bs4 import BeautifulSoup

# fetch the HTML
URL = 'http://example.com/'
r = requests.get(url = URL)
print(r.text)

soup = BeautifulSoup(r.text, 'html.parser')

# print the href of every link
for tag in soup.find_all('a'):
    print(tag.get('href'))
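
The href values printed above can be relative paths, and tag.get('href') returns None for an <a> without that attribute. A minimal sketch of cleaning them up with the standard library's urljoin follows. The helper name absolute_links and the sample list are assumptions for illustration, not part of BeautifulSoup.

```python
from urllib.parse import urljoin


def absolute_links(base_url, hrefs):
    """Resolve hrefs against base_url, dropping empty values and duplicates."""
    seen = set()
    result = []
    for href in hrefs:
        if not href:  # skip None or '' (an <a> without a usable href)
            continue
        url = urljoin(base_url, href)
        if url not in seen:
            seen.add(url)
            result.append(url)
    return result


# Sample values as tag.get('href') might return them (made up for illustration)
hrefs = ['/about', 'news/today.html', 'https://www.iana.org/domains/example', None, '/about']
print(absolute_links('http://example.com/docs/', hrefs))
```

With the loop above, the input list could be built as [tag.get('href') for tag in soup.find_all('a')].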

Evaluating XPath from the command line

xmllint --xpath '/html/body/h3/text()' --html http://example.com | md5sum | awk '{ print $1 }'
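
The pipeline above reduces the extracted text to an MD5 digest, which is presumably meant for checking whether that part of the page has changed. The same idea can be sketched in Python. This sketch assumes the markup is well-formed XML, since xml.etree.ElementTree does not tolerate broken HTML the way xmllint --html does. The function name text_digest and the two sample pages are hypothetical, and the digest is not guaranteed to match the shell output byte for byte.

```python
import hashlib
import xml.etree.ElementTree as ET


def text_digest(markup, path):
    """Return the MD5 hex digest of the text of the first element matching path."""
    root = ET.fromstring(markup)              # root is the <html> element
    text = root.findtext(path, default='')    # path is relative to the root
    return hashlib.md5(text.encode('utf-8')).hexdigest()


# Two made-up versions of the same page
page_v1 = '<html><body><h3>Version 1</h3></body></html>'
page_v2 = '<html><body><h3>Version 2</h3></body></html>'

print(text_digest(page_v1, 'body/h3'))
# Comparing digests shows whether the <h3> text changed between versions
print(text_digest(page_v1, 'body/h3') == text_digest(page_v2, 'body/h3'))
```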

See also

20個網頁抓取工具快速抓取網站 (20 web scraping tools for quickly scraping websites)

Python Web Crawler