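BeautifulSoup's next_siblings attribute (a generator, not a function call) makes it easy to collect rows from a table, especially a table that has a title row.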
from urllib.request import urlopen
from bs4 import BeautifulSoup

html = urlopen("http://www.pythonscraping.com/pages/page3.html")
soup = BeautifulSoup(html, 'lxml')
# Start from the first row (the title row) and take every sibling after it.
siblings = soup.find("table", {'id': 'giftList'}).tr.next_siblings
count = 0  # named count to avoid shadowing the built-in sum()
for sibling in siblings:
    print(sibling)
    count += 1
print(count)
Output:
<tr class="gift" id="gift1"><td>
Vegetable Basket
</td><td>
This vegetable basket is the perfect gift for your health conscious (or overweight) friends!
<span class="excitingNote">Now with super-colorful bell peppers!</span>
</td><td>
$15.00
</td><td>
<img src="../img/gifts/img1.jpg"/>
</td></tr>
<tr class="gift" id="gift2"><td>
Russian Nesting Dolls
</td><td>
Hand-painted by trained monkeys, these exquisite dolls are priceless! And by "priceless," we mean "extremely expensive"! <span class="excitingNote">8 entire dolls per set! Octuple the presents!</span>
</td><td>
$10,000.52
</td><td>
<img src="../img/gifts/img2.jpg"/>
</td></tr>
...
11
0
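The code prints every product row in the table except the title row at the top. There are two reasons:
1. An object is never its own sibling, so the sibling attributes always skip the object they are called on.
2. The code uses next_siblings, which returns only the siblings that come after the object, not the ones before it.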
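Both points can be checked offline with a small sketch. The inline HTML below is a shortened, made-up stand-in for the gift table, and "html.parser" is used only so the example does not depend on lxml. Note that next_siblings also yields the whitespace text nodes between tags, so a plain count of siblings can be larger than the number of rows.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Hypothetical, shortened stand-in for the gift table (not the real page).
SAMPLE_HTML = """<table id="giftList">
<tr><th>Item</th><th>Cost</th></tr>
<tr id="gift1"><td>Vegetable Basket</td><td>$15.00</td></tr>
<tr id="gift2"><td>Russian Nesting Dolls</td><td>$10,000.52</td></tr>
</table>"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
header_row = soup.find("table", {"id": "giftList"}).tr

# Every node after the header row, including "\n" text nodes between rows.
all_siblings = list(header_row.next_siblings)

# Keep only real tags to get the product rows.
row_ids = [sib.get("id") for sib in all_siblings if isinstance(sib, Tag)]

print(len(all_siblings))
print(row_ids)
```

Filtering with isinstance(sib, Tag) is one simple way to drop the whitespace nodes; the header row itself never appears in the result.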
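Note: besides next_siblings, keep previous_siblings in mind. It is often useful when the last row is easy to locate and is not something you want to scrape: start from that row and walk backwards. The singular next_sibling and previous_sibling return a single sibling node instead of a sequence.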
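A minimal sketch of these attributes, again on made-up inline HTML: the "total" row and its id are hypothetical, standing in for a last row that is easy to find but not wanted.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Hypothetical table whose last row is a summary we do not want to scrape.
SAMPLE_HTML = """<table id="giftList">
<tr id="gift1"><td>Vegetable Basket</td></tr>
<tr id="gift2"><td>Russian Nesting Dolls</td></tr>
<tr id="total"><td>Total</td></tr>
</table>"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
last_row = soup.find("tr", {"id": "total"})

# previous_siblings walks backwards, so the nearest row comes first.
earlier_ids = [sib["id"] for sib in last_row.previous_siblings
               if isinstance(sib, Tag)]
print(earlier_ids)

first_row = soup.find("tr", {"id": "gift1"})
# next_sibling returns exactly one node, which here is a whitespace string.
nearest_node = first_row.next_sibling
# find_next_sibling("tr") skips text nodes and returns the next <tr> tag.
nearest_row = first_row.find_next_sibling("tr")
print(repr(nearest_node), nearest_row["id"])
```

Because next_sibling may land on a whitespace text node, find_next_sibling with a tag name is usually the safer choice when a tag is what you need.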