from bs4 import BeautifulSoup

html = '''
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1"> Elsie </a>,
<a href="http://example.com/lacie" class="sister" id="link2"> Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>; and they lived at the bottom of a well.</p>
<p class="story"> ... </p>
'''
soup = BeautifulSoup(html, 'lxml')
print(soup.p)             # the first p node
print(soup.title.string)  # the text inside the title node
print(soup.title.name)    # the name of the node
# A node may have several attributes, such as id and class.
# Once a node is selected, attrs returns all of them.
print(soup.p.attrs)
print(soup.p.attrs['name'])
# Equivalent to the previous line
print(soup.p['name'])
print('*' * 100)
print(soup.prettify())
The output is as follows:

<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
The Dormouse's story
title
{'class': ['title'], 'name': 'dromouse'}
dromouse
dromouse
****************************************************************************************************
<html>
 <head>
  <title>
   The Dormouse's story
  </title>
 </head>
 <body>
  <p class="title" name="dromouse">
   <b>
    The Dormouse's story
   </b>
  </p>
  <p class="story">
   Once upon a time there were three little sisters; and their names were
   <a class="sister" href="http://example.com/elsie" id="link1">
    Elsie
   </a>
   ,
   <a class="sister" href="http://example.com/lacie" id="link2">
    Lacie
   </a>
   and
   <a class="sister" href="http://example.com/tillie" id="link3">
    Tillie
   </a>
   ; and they lived at the bottom of a well.
  </p>
  <p class="story">
   ...
  </p>
 </body>
</html>

attrs returns a dictionary that combines all the attributes of the selected node with their values. Getting the name attribute is then the same as reading a key from a dictionary: put the attribute name in square brackets. For example, attrs['name'] returns the name attribute, and soup.p['name'] is a shorthand for the same lookup.
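The dictionary behaviour of attrs described above can be sketched as follows. This is a minimal illustration, not part of the original example: the one-line document is a reduced, hypothetical version of the markup, and 'html.parser' is used instead of 'lxml' only so the sketch runs without an extra dependency.

```python
from bs4 import BeautifulSoup

# Hypothetical one-line document, reduced from the example above (assumption).
html = '<p class="title" name="dromouse"><b>The Dormouse\'s story</b></p>'
soup = BeautifulSoup(html, 'html.parser')  # built-in parser, chosen for portability

p = soup.p
attrs = p.attrs                  # a plain dict: attribute name -> value
name_via_attrs = attrs['name']   # ordinary dictionary lookup
name_via_tag = p['name']         # shorthand: index the Tag directly
classes = p['class']             # class can hold several values, so it is a list
missing = p.get('id')            # get() returns None rather than raising KeyError

print(name_via_attrs, name_via_tag, classes, missing)
```

Indexing the tag directly is shorter, while get() is the safer choice when an attribute may be absent.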
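Here we call the find_all() method and pass the name parameter with the value ul; in other words, we want to query all ul nodes. The result is a list of length 2, and each element is still of type bs4.element.Tag.

Because every element is a Tag, nested queries are still possible. With the same text, after querying all the ul nodes we can go on to query the li nodes inside each of them.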
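A nested query of this kind might look like the sketch below. The markup is a hypothetical stand-in with two ul nodes, since the text being queried is not reproduced here, and 'html.parser' is an assumption made so the sketch needs no extra parser.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Hypothetical markup containing two ul nodes (assumption, not the original text).
html = '''
<div class="panel">
  <ul class="list" id="list-1">
    <li class="element">Foo</li>
    <li class="element">Bar</li>
    <li class="element">Jay</li>
  </ul>
  <ul class="list list-small" id="list-2">
    <li class="element">Foo</li>
    <li class="element">Bar</li>
  </ul>
</div>
'''
soup = BeautifulSoup(html, 'html.parser')

# First level: every ul node in the document.
uls = soup.find_all(name='ul')
print(len(uls), all(isinstance(ul, Tag) for ul in uls))

# Second level: each result is a Tag, so find_all() can be called on it again.
items = [[li.get_text() for li in ul.find_all(name='li')] for ul in uls]
print(items)
```

Each inner list holds the li texts of one ul, which shows that the second find_all() call only searches inside the node it is called on.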
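text

The text parameter matches against the text of nodes. The value can be either a string or a compiled regular expression object, as in the following example: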
import re
from bs4 import BeautifulSoup

html = '''
<div class="panel">
<div class="panel-body">
<a>Hello, this is a link</a>
<a>Hello, this is a link, too</a>
</div>
</div>
'''
soup = BeautifulSoup(html, 'lxml')
# Return every text node whose content matches the regular expression
print(soup.find_all(text=re.compile('link')))
The output is as follows:
['Hello, this is a link', 'Hello, this is a link, too']
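The difference between passing a string and passing a regular expression can be sketched as below. Two assumptions apply: the sketch uses the string keyword, which Beautiful Soup 4.4.0 and later accept as the newer name for text, and it uses 'html.parser' so that no extra parser is required. The variable names are illustrative only.

```python
import re
from bs4 import BeautifulSoup

html = '''
<div class="panel-body">
<a>Hello, this is a link</a>
<a>Hello, this is a link, too</a>
</div>
'''
soup = BeautifulSoup(html, 'html.parser')

# A plain string must equal the whole text of a node.
exact = [str(s) for s in soup.find_all(string='Hello, this is a link')]

# A compiled pattern matches any text node in which the pattern is found.
partial = [str(s) for s in soup.find_all(string=re.compile('too'))]

# Combined with a tag name, the tags are returned instead of the text nodes.
tags = soup.find_all('a', string=re.compile('too'))

print(exact)
print(partial)
print(tags)
```

A plain string is therefore the stricter filter, while a regular expression is the usual choice when only part of the text is known.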