[Notes] Basic Usage of Requests and Beautiful Soup in Python 3


Published on April 1, 2022 in Notes

Requests

.get syntax

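import requests

target = "https://example.com"  # placeholder; must be a complete URL given as a str
req = requests.get(url=target)  # send the GET request
htmlOutput = req.text           # response body decoded to text
print(type(htmlOutput))         # <class 'str'>
# The raw HTML still needs tidying up before it is useful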
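A slightly more defensive version of the request above, as a minimal sketch. The helper name `fetch_html` and the 10-second timeout are assumptions for illustration, not something from these notes; `timeout` and `raise_for_status()` are regular Requests features.

```python
import requests


def fetch_html(url: str, timeout: float = 10.0) -> str:
    """Return the body of `url` as text (hypothetical helper)."""
    # Without a timeout, requests.get can wait indefinitely on a dead server.
    resp = requests.get(url, timeout=timeout)
    # Raise requests.HTTPError on 4xx/5xx instead of silently returning an error page.
    resp.raise_for_status()
    return resp.text
```

Calling `fetch_html("https://example.com")` (a placeholder URL) should then return a `str` that can be handed to Beautiful Soup.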

Beautiful Soup

Beautiful Soup extracts the parts of interest from a long HTML string; regular expressions can be used as well.
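One way regular expressions can come in, as a sketch: `find_all` (covered in the next section) accepts a compiled pattern as an attribute filter. The HTML string and the pattern here are made up for illustration.

```python
import re

from bs4 import BeautifulSoup

# Made-up input; only the /a/bcde/N.html shape mirrors the links used later in these notes.
html_string = """
<a href="/a/bcde/1.html">one</a>
<a href="/x/other.html">two</a>
<a href="/a/bcde/3.html">three</a>
"""

bf = BeautifulSoup(html_string, "html.parser")

# Keep only the <a> tags whose href matches the pattern.
matches = bf.find_all("a", href=re.compile(r"^/a/bcde/\d+\.html$"))
hrefs = [a["href"] for a in matches]
print(hrefs)
```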

.find_all syntax

Getting a single tag

from bs4 import BeautifulSoup

bf = BeautifulSoup(htmlString)  # htmlString: the HTML text as a str, e.g. req.text
texts = bf.find_all('div', class_ = 'Tiger')

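This finds the tags <div id="content" class="Tiger"> in htmlString and assigns them to texts. texts is a bs4.element.ResultSet, which is a subclass of list. If htmlString contains several matching tags, they appear in the list in document order.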

In texts = bf.find_all('div', class_ = 'Tiger'), class_ = is used instead of class = because class is a reserved keyword in Python.

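The fourth line (the find_all call) can also be written as find_all('div', id = 'content', class_ = 'Tiger'). In HTML, an id is unique and should be used only once per page, whereas a class can be reused freely. In CSS, id corresponds to "#" and class to ".". Use id for important, one-of-a-kind content boxes and class for local, repeated styling.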
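The filters above can be tried on a small inline document. This is only a sketch: the HTML is made up, with just the 'Tiger' class and 'content' id taken from the notes. The `attrs={...}` dict form is another regular way to filter on class without the trailing underscore.

```python
from bs4 import BeautifulSoup

# Made-up HTML for illustration.
html_string = """
<div id="content" class="Tiger">first</div>
<div class="Tiger">second</div>
<div class="Lion">third</div>
"""

bf = BeautifulSoup(html_string, "html.parser")

by_class = bf.find_all("div", class_="Tiger")                # class only
by_both = bf.find_all("div", id="content", class_="Tiger")   # id and class
by_attrs = bf.find_all("div", attrs={"class": "Tiger"})      # dict form, no keyword clash

print(isinstance(by_class, list))            # True: ResultSet subclasses list
print([t.get_text() for t in by_class])      # ['first', 'second'], in document order
print([t.get_text() for t in by_both])       # ['first']
```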

For more on id and class, see "The difference between using id and class in HTML and CSS".

GuessedAtParserWarning

/Users/tiger/Desktop/demo.py:6: GuessedAtParserWarning: No parser was explicitly specified, so I'm using the best available HTML parser for this system ("html.parser"). This usually isn't a problem, but if you run this code on another system, or in a different virtual environment, it may use a different parser and behave differently.

A parser is a tool for processing structured markup; here its main job is to analyze the HTML document. Using bf = BeautifulSoup(htmlString, features="html.parser") removes the warning.
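A quick way to check that naming the parser really silences the warning, as a sketch: promote `UserWarning` (the base class of `GuessedAtParserWarning`) to an error and build the soup inside that context. The input string is made up.

```python
import warnings

from bs4 import BeautifulSoup

html_string = "<p>hello</p>"  # made-up input

with warnings.catch_warnings():
    # Any UserWarning raised in this block becomes an exception.
    warnings.simplefilter("error", category=UserWarning)
    # Parser named explicitly, so no GuessedAtParserWarning is issued here.
    bf = BeautifulSoup(html_string, features="html.parser")

print(bf.p.string)
```

Other parser names such as "lxml" can be passed the same way, provided the corresponding package is installed.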

For more on html.parser, see "Simple HTML and XHTML parser".

Getting multiple tags

<!-- html String -->
<a href="/a/bcde/1.html">Some link</a>
<a href="/a/bcde/2.html">Some link</a>
<a href="/a/bcde/3.html">Some link</a>
<a href="/a/bcde/4.html">Some link</a>
<a href="/a/bcde/5.html">Some link</a>

from bs4 import BeautifulSoup

bf = BeautifulSoup(htmlString, features="html.parser")
htmlTags = bf.find_all('a')
for eachTagA in htmlTags:
    print(eachTagA.string, eachTagA.get('href'))
    # Some link /a/bcde/1.html

In the code above, htmlTags has the value [<a href="/a/bcde/1.html">Some link</a>, ...]. Each element of the list is a bs4.element.Tag, which is not a str, so some string functions cannot be applied to it directly.

htmlTags itself is a bs4.element.ResultSet. On each bs4.element.Tag inside it, .string extracts the content of the HTML tag and .get("attribute") extracts the value of that attribute.
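Two edge cases of `.string` and `.get` are worth seeing side by side. A sketch with made-up HTML: `.string` is `None` when a tag has more than one child, while `get_text()` still returns the combined text; `.get("href")` returns `None` when the attribute is missing.

```python
from bs4 import BeautifulSoup

# Made-up links: a plain one, one with a nested <b>, and one without href.
html_string = """
<a href="/a/bcde/1.html">link 1</a>
<a href="/a/bcde/2.html">link <b>2</b></a>
<a>no href</a>
"""

bf = BeautifulSoup(html_string, "html.parser")

# For each <a>: (.string, get_text(), href or None)
rows = [(a.string, a.get_text(), a.get("href")) for a in bf.find_all("a")]
for row in rows:
    print(row)
# ('link 1', 'link 1', '/a/bcde/1.html')
# (None, 'link 2', '/a/bcde/2.html')
# ('no href', 'no href', None)
```

When a plain str is needed, `get_text()` or `str(tag.string)` gives one, which sidesteps the Tag-versus-str difference noted above.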

References

W3Cschool Python3 tutorial

Notes

A site's robots.txt specifies which of its data may be crawled and which may not.
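The standard library can check a URL against such rules. A minimal sketch using `urllib.robotparser`, with made-up rules and placeholder example.com URLs so that no network access is needed:

```python
from urllib.robotparser import RobotFileParser

# Made-up robots.txt content for illustration.
rules = """
User-agent: *
Disallow: /private/
""".splitlines()

rp = RobotFileParser()
rp.parse(rules)  # feed the rules directly instead of downloading them

allowed = rp.can_fetch("*", "https://example.com/a/bcde/1.html")
blocked = rp.can_fetch("*", "https://example.com/private/x.html")
print(allowed, blocked)  # True False
```

Against a real site, `rp.set_url(...)` followed by `rp.read()` would download the live robots.txt instead.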

Tags: Python, Network, Request Get, BS Find All, HTML Id & Class, HTML Parser