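Web Scraping Series: Using Beautiful Soup

April 3, 2019

Python

Beautiful Soup is a Python library for parsing HTML and XML that makes it easy to extract data from web pages. It spares you a lot of tedious extraction work and makes parsing more efficient. For scraping static web pages, Beautiful Soup is a good choice.

Start with the example below to get a first impression.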

from bs4 import BeautifulSoup

html = '''
<html> <head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">  Elsie </a>,
<a href="http://example.com/lacie" class="sister" id="link2"> Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a> ;
and they lived at the bottom of a well.</p>
<p class="story"> ... </p>
'''
soup = BeautifulSoup(html, 'lxml')
print(soup.p)               # the first <p> node
print(soup.title.string)    # the text inside <title>
print(soup.title.name)      # the node's tag name
# A node may have several attributes (id, class, ...); attrs returns all of them
print(soup.p.attrs)
# The next two lines are equivalent
print(soup.p.attrs['name'])
print(soup.p['name'])
print('*' * 100)
print(soup.prettify())

Output

<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
The Dormouse's story
title
{'class': ['title'], 'name': 'dromouse'}
dromouse
dromouse
****************************************************************************************************
<html>
 <head>
  <title>
   The Dormouse's story
  </title>
 </head>
 <body>
  <p class="title" name="dromouse">
   <b>
    The Dormouse's story
   </b>
  </p>
  <p class="story">
   Once upon a time there were three little sisters; and their names were
   <a class="sister" href="http://example.com/elsie" id="link1">
    Elsie
   </a>
   ,
   <a class="sister" href="http://example.com/lacie" id="link2">
    Lacie
   </a>
   and
   <a class="sister" href="http://example.com/tillie" id="link3">
    Tillie
   </a>
   ;
and they lived at the bottom of a well.
  </p>
  <p class="story">
   ...
  </p>
 </body>
</html>

attrs returns a dictionary that combines all of the selected node's attribute names and values. Getting the name attribute is therefore just a dictionary lookup: use square brackets with the attribute name, as in attrs['name']. Writing soup.p['name'] is a shortcut for the same lookup, which is why dromouse is printed twice.

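Notice anything in the output? The HTML we wrote is incomplete (the body and html tags are never closed), but BeautifulSoup(html, 'lxml') filled in the missing closing tags for us.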

Related Methods

Relational Selection

Children and Descendants

After selecting a node, use the contents attribute to get its direct children as a list.

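print(soup.p.contents)   # list of direct children
print(soup.p.children)   # an iterator over the same direct children
# children returns an iterator, so use a for loop to print each child
for i, child in enumerate(soup.p.children):
    print(i, child)
# descendants yields every descendant node (children, grandchildren, ...)
for i, child in enumerate(soup.p.descendants):
    print(i, child)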

The output is not listed here; run the code to see it for yourself.
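To make the difference between contents, children and descendants concrete, here is a minimal sketch. The HTML snippet is made up for illustration (it is not the article's document), and it uses Python's built-in 'html.parser' instead of 'lxml' so that it runs without extra dependencies.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Hypothetical snippet, used only for this illustration
html = '<p class="story">Once <a id="link1"><span>Elsie</span></a> and <a id="link2">Lacie</a></p>'
soup = BeautifulSoup(html, 'html.parser')
p = soup.p

direct = p.contents                  # list of direct children (strings and tags)
children = list(p.children)          # the same nodes, obtained through the iterator
descendants = list(p.descendants)    # every nested node, in document order

# Keep only the tags among the descendants (text nodes are skipped)
descendant_tags = [node.name for node in descendants if isinstance(node, Tag)]

print(len(direct))        # 4: 'Once ', <a>, ' and ', <a>
print(len(descendants))   # 7: also counts <span>, 'Elsie' and 'Lacie'
print(descendant_tags)    # ['a', 'span', 'a']
```

Text nodes count as children too, which is why the numbers are larger than the number of tags.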

Parent and Ancestors

parent = soup.a.parent      # the direct parent node
parents = soup.a.parents    # all ancestor nodes
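As a rough illustration of parent versus parents, the sketch below walks up from an a node. The HTML is a made-up snippet, and 'html.parser' is used instead of 'lxml' only to keep the example dependency-free.

```python
from bs4 import BeautifulSoup

# Hypothetical snippet, used only for this illustration
html = '<html><body><p class="story"><a id="link1">Elsie</a></p></body></html>'
soup = BeautifulSoup(html, 'html.parser')
a = soup.a

parent_name = a.parent.name                          # direct parent only
ancestor_names = [node.name for node in a.parents]   # walks all the way up

print(parent_name)      # p
print(ancestor_names)   # ['p', 'body', 'html', '[document]']
```

The last entry, '[document]', is the BeautifulSoup object itself, which sits at the top of the tree.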

Siblings

# Get the next sibling node
next_node = soup.a.next_sibling
# Get the previous sibling node
previous_node = soup.a.previous_sibling
# Get all following and all preceding sibling nodes
next_nodes = soup.a.next_siblings
previous_nodes = soup.a.previous_siblings
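One thing that tends to surprise people is that siblings include text nodes, not just tags. A minimal sketch, assuming a made-up snippet and the built-in 'html.parser':

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Hypothetical snippet, used only for this illustration
html = '<p>Once <a id="link1">Elsie</a>, <a id="link2">Lacie</a> and <a id="link3">Tillie</a>.</p>'
soup = BeautifulSoup(html, 'html.parser')
first = soup.a

next_node = first.next_sibling           # the text ', ' rather than the next <a>
previous_node = first.previous_sibling   # the text 'Once '

# Filter the following siblings down to tags only
later_ids = [node['id'] for node in first.next_siblings if isinstance(node, Tag)]

print(repr(str(next_node)))       # ', '
print(repr(str(previous_node)))   # 'Once '
print(later_ids)                  # ['link2', 'link3']
```

If only tags are wanted, filtering with isinstance as above (or using find_next_sibling(), covered later) avoids the text nodes.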

find_all()

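find_all(), as the name suggests, finds all elements that match the given conditions. Pass it attributes or text and it returns the matching elements. It is very powerful.
Its API is as follows:
find_all(name, attrs, recursive, text, **kwargs)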

name

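print(soup.find_all(name='ul'))
print(type(soup.find_all(name='ul')[0]))

Here we call find_all() with the name argument set to ul, meaning we want all ul nodes. The result is a list (of length 2 for the document used in this example), and each element is still of type bs4.element.Tag.
Because every element is a Tag, queries can be nested. Using the same document, after finding all ul nodes we go on to query the li nodes inside each of them: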

for ul in soup.find_all(name='ul'):
    print(ul.find_all(name='li'))
    for li in ul.find_all(name='li'):
        print(li.string)
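The HTML for the ul example is not shown in this post, so here is a self-contained sketch with an assumed document containing two ul nodes. The markup, class names and ids are illustrative guesses, and 'html.parser' is used in place of 'lxml' to avoid an extra dependency.

```python
from bs4 import BeautifulSoup

# Assumed document with two <ul> nodes; not taken from the article
html = '''
<div class="panel-body">
<ul class="list" id="list-1">
<li class="element">Foo</li>
<li class="element">Bar</li>
<li class="element">Jay</li>
</ul>
<ul class="list list-small" id="list-2">
<li class="element">Foo</li>
<li class="element">Bar</li>
</ul>
</div>
'''
soup = BeautifulSoup(html, 'html.parser')

uls = soup.find_all(name='ul')
# Nested query: for each <ul>, collect the text of its <li> nodes
items = [[li.get_text() for li in ul.find_all(name='li')] for ul in uls]

print(len(uls))   # 2
print(items)      # [['Foo', 'Bar', 'Jay'], ['Foo', 'Bar']]
```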

attrs

Besides querying by node name, we can also query by attributes.

print(soup.find_all(attrs={'id': 'list'}))
print(soup.find_all(attrs={'name': 'elements'}))
print(soup.find_all(attrs={'class': 'sister'}))

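For common attributes such as id and class, there is no need to go through attrs. For example, to find the node whose id is list-1, pass id directly as an argument. Using the same document, the query can be written another way:

print(soup.find_all(id='list-1'))
print(soup.find_all(class_='element'))

Passing id='list-1' directly finds the node whose id is list-1. For class, because class is a keyword in Python, a trailing underscore is required, i.e. class_='element'. The result is still a list of Tag objects.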
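The three ways of filtering by attribute can be compared side by side. The sketch below assumes a small made-up document (the name="elements" attribute and the ids are illustrative) and uses the built-in 'html.parser'.

```python
from bs4 import BeautifulSoup

# Assumed document; not taken from the article
html = '''
<ul class="list" id="list-1" name="elements">
<li class="element">Foo</li>
<li class="element">Bar</li>
<li class="element">Jay</li>
</ul>
<ul class="list list-small" id="list-2">
<li class="element">Foo</li>
<li class="element">Bar</li>
</ul>
'''
soup = BeautifulSoup(html, 'html.parser')

by_attrs = soup.find_all(attrs={'id': 'list-1'})     # explicit attrs dict
by_id = soup.find_all(id='list-1')                   # keyword shortcut
by_name = soup.find_all(attrs={'name': 'elements'})  # name= already means the tag name
by_class = soup.find_all(class_='element')           # class needs the underscore
lists = soup.find_all(class_='list')                 # matches any one of several classes

print(by_attrs == by_id)   # True
print(len(by_name))        # 1
print(len(by_class))       # 5
print(len(lists))          # 2
```

The HTML attribute called name has to go through attrs because the name keyword of find_all() is already taken by the tag name. Note also that class_='list' matches the second ul even though its class is "list list-small".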

text

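The text argument matches against a node's text. It accepts either a string or a compiled regular expression object (since Beautiful Soup 4.4.0 the same argument is also available under the name string). For example: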

import re
from bs4 import BeautifulSoup

html = '''
<div class="panel"> <div class="panel-body">
<a>Hello, this is a link</a>
<a>Hello, this is a link, too</a>
</div>
</div>
'''
soup = BeautifulSoup(html, 'lxml')
# Return every text node that contains 'link'
print(soup.find_all(text=re.compile('link')))

The output is:

['Hello, this is a link', 'Hello, this is a link, too']
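On newer versions of Beautiful Soup (4.4.0 and later, according to its documentation) the same filter is spelled string. The sketch below assumes such a version is installed and uses the built-in 'html.parser'. It also shows that combining a tag name with string returns the tags rather than the text nodes.

```python
import re
from bs4 import BeautifulSoup

html = '''
<div class="panel"> <div class="panel-body">
<a>Hello, this is a link</a>
<a>Hello, this is a link, too</a>
</div>
</div>
'''
soup = BeautifulSoup(html, 'html.parser')

# Text nodes matching a regular expression
texts = soup.find_all(string=re.compile('link'))
# An exact string only matches text that is equal to it
exact = soup.find_all(string='Hello, this is a link')
# With a tag name, the result is the <a> tags whose text matches
tags = soup.find_all('a', string=re.compile('too'))

print([str(t) for t in texts])   # ['Hello, this is a link', 'Hello, this is a link, too']
print(len(exact))                # 1
print(len(tags), tags[0].name)   # 1 a
```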

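Other Methods

find_all() returns a list of all matches; find() returns only a single element.
find_parents() and find_parent(): the former returns all ancestor nodes; the latter returns the direct parent node.
find_next_siblings() and find_next_sibling(): the former returns all following siblings; the latter returns the first following sibling.
find_previous_siblings() and find_previous_sibling(): the former returns all preceding siblings; the latter returns the first preceding sibling.
find_all_next() and find_next(): the former returns all matching nodes after the current node; the latter returns the first matching node after it.
find_all_previous() and find_previous(): the former returns all matching nodes before the current node; the latter returns the first matching node before it.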
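A quick way to get a feel for this family of methods is to call each one from the same starting node. The sketch below uses a made-up snippet and the built-in 'html.parser'; the ids are only there to make the results easy to read.

```python
from bs4 import BeautifulSoup

# Hypothetical snippet, used only for this illustration
html = '<div id="box"><p id="p1">one</p><p id="p2">two</p><p id="p3">three</p></div>'
soup = BeautifulSoup(html, 'html.parser')

first_id = soup.find('p')['id']                   # find() returns the first match only
all_ids = [p['id'] for p in soup.find_all('p')]   # find_all() returns every match
missing = soup.find('table')                      # find() returns None when nothing matches

second = soup.find(id='p2')
parent_id = second.find_parent('div')['id']
next_sibling_id = second.find_next_sibling('p')['id']
previous_sibling_id = second.find_previous_sibling('p')['id']
next_id = second.find_next('p')['id']
previous_id = second.find_previous('p')['id']

print(first_id, all_ids, missing)             # p1 ['p1', 'p2', 'p3'] None
print(parent_id)                              # box
print(next_sibling_id, previous_sibling_id)   # p3 p1
print(next_id, previous_id)                   # p3 p1
```

Because find() returns None when there is no match, it is worth checking the result before indexing into it.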

CSS Selectors

To use CSS selectors, just call the select() method and pass in the selector string.
The details are not covered here.
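For a quick taste of select(), here is a minimal sketch. The document, class names and ids are assumptions made for illustration, 'html.parser' stands in for 'lxml', and select_one() is used as the single-result counterpart of select().

```python
from bs4 import BeautifulSoup

# Assumed document; not taken from the article
html = '''
<div class="panel-body">
<ul class="list" id="list-1">
<li class="element">Foo</li>
<li class="element">Bar</li>
<li class="element">Jay</li>
</ul>
<ul class="list list-small" id="list-2">
<li class="element">Foo</li>
<li class="element">Bar</li>
</ul>
</div>
'''
soup = BeautifulSoup(html, 'html.parser')

all_items = soup.select('ul li')                 # every <li> inside any <ul>
small_items = soup.select('#list-2 .element')    # by id, then by class
first_ul = soup.select_one('.panel-body ul')     # first match only

print(len(all_items))                            # 5
print([li.get_text() for li in small_items])     # ['Foo', 'Bar']
print(first_ul['id'])                            # list-1
```

Like find_all(), select() returns a list of Tag objects, so the results can be queried further.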

Original link: http://yoursite.com/2019/04/03/爬虫系列之Beautiful-Soup的使用/

Copyright notice: please credit the source when reposting.
