手写python解析xml (html可以用xml解析吗)

Beautiful Soup 包:

Beautiful Soup: Python 的第三方插件用来提取 xml 和 HTML 中的数据。官网地址 https://www.crummy.com/software/BeautifulSoup/

1、安装 Beautiful Soup

打开 cmd(命令提示符),进入到 Python(Python2.7版本)安装目录中的 scripts 下,输入 dir 查看是否有 pip*ex.e**, 如果用就可以使用 Python 自带的 pip 命令进行安装,输入以下命令进行安装即可:

pip install beautifulsoup4

2、测试是否安装成功

编写一个 Python 文件,输入:

import bs4

print bs4

运行该文件,如果能够正常输出则安装成功。

五、使用 Beautiful Soup 解析 html 文件

# -*- coding: UTF-8 -*-
import bs4
import re

from bs4 import BeautifulSoup

html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""
# 创建一个BeautifulSoup解析对象
soup = BeautifulSoup(html_doc, "html.parser", from_encoding="utf-8")
# 获取所有的链接
links = soup.find_all('a')
print("所有的链接")

for link in links:
    print(link.name, link['href'], link.get_text())

print("获取特定的URL地址")
link_node = soup.find('a', href="http://example.com/elsie")
print(link_node.name, link_node['href'], link_node['class'], link_node.get_text())

print("正则表达式匹配")

link_node = soup.find('a', href=re.compile(r"ti"))
print(link_node.name, link_node['href'], link_node['class'], link_node.get_text())

print("获取P段落的文字")

p_node = soup.find('p', class_='story')
print(p_node.name, p_node['class'], p_node.get_text())

===========

输出:

C:\Python3\python*ex.e** D:/python/project/爬虫/testBeautifllSoup.py

C:\Python3\lib\site-packages\bs4\__init__.py:226: UserWarning: You provided Unicode markup but also provided a value for from_encoding. Your from_encoding will be ignored.

warnings.warn("You provided Unicode markup but also provided a value for from_encoding. Your from_encoding will be ignored.")

所有的链接

a http://example.com/elsie Elsie

a http://example.com/lacie Lacie

a http://example.com/tillie Tillie

获取特定的URL地址

a http://example.com/elsie ['sister'] Elsie

正则表达式匹配

a http://example.com/tillie ['sister'] Tillie

获取P段落的文字

p ['story'] Once upon a time there were three little sisters; and their names were

Elsie,

Lacie and

Tillie;

and they lived at the bottom of a well.

进程已结束,退出代码0