[Learning Python] The Four Modules of the urllib Library Explained



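[Preface] It has been a while since I last wrote any Python, so today I worked through the urllib library fairly thoroughly. Honestly, writing Python is still the most comfortable. Of course, requests, Beautiful Soup, and regular expressions are also unavoidable when processing scraped articles.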

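The urllib library

I: The four modules of urllib

1: request

The HTTP request module, used to simulate sending requests. Just like typing a URL into a browser and pressing Enter, you pass a URL and any extra parameters to the library functions to simulate that process.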

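2: error

The exception module. It defines the exceptions raised by urllib.request, namely URLError and HTTPError.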

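3: parse

A utility module that provides many URL-handling functions, such as splitting, parsing, and joining.

4: robotparser

Mainly used to parse a site's robots.txt file and decide which pages may be crawled and which may not. It is rarely used.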

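1: urlopen()

"""
Author: 贾继康
Date:
Purpose: use urllib.request to simulate a browser sending a request
         Goal: fetch the source code of a web page
"""
# Import the urllib.request module
import urllib.request
response = urllib.request.urlopen('https://www.python.org')
print(response.read().decode('utf-8'))  # read the response body and decode it as UTF-8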


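2: The data parameter

"""
Author: 贾继康
Date:
Purpose: the data parameter of urlopen()
"""
import urllib.request  # request module
import urllib.parse  # URL utility module of urllib
# Send one form field, word=hello. The data argument must be bytes, so first use
# urllib.parse.urlencode() to turn the parameter dict into a string, then convert it
# with bytes(); the second argument of bytes() is the encoding, utf-8.
# Passing data makes urlopen() send a POST request instead of a GET.

data = bytes(urllib.parse.urlencode({'word': 'hello'}), encoding='utf-8')
response = urllib.request.urlopen('http://httpbin.org/post', data=data)
print('\n', response.read())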

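To see what ends up in the request body, the encoding step can be run on its own without touching the network. This is a minimal sketch; the field names and values are illustrative assumptions, not taken from a real form.

```python
from urllib.parse import urlencode

# Illustrative form fields (assumed values)
fields = {'word': 'hello', 'lang': 'zh CN'}

query_string = urlencode(fields)      # str: percent-encoded key=value pairs
body = query_string.encode('utf-8')   # bytes: the type urlopen(data=...) expects

print(query_string)
print(type(body).__name__, body)
```

Calling .encode('utf-8') on the string is equivalent to the bytes(..., encoding='utf-8') call used above.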

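3: The timeout parameter

"""
Author: 贾继康
Date:
Purpose: set a timeout (in seconds) and handle the resulting exception
"""
import socket  # used to identify the timeout exception
import urllib.error
import urllib.request
try:
    response = urllib.request.urlopen('http://httpbin.org/get', timeout=0.1)
except urllib.error.URLError as e:
    # e.reason is a socket.timeout instance when the request timed out
    if isinstance(e.reason, socket.timeout):
        print('Request timed out')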

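The timeout check can be exercised offline by constructing the exceptions by hand. The sketch below uses a hypothetical helper, is_timeout(), which is not part of urllib. It also accepts a bare timeout exception, on the assumption that a timeout while reading the response may surface without being wrapped in URLError.

```python
import socket
import urllib.error

def is_timeout(exc):
    """Return True if exc represents a timeout (hypothetical helper)."""
    # A timeout while connecting is usually wrapped in URLError.reason
    if isinstance(exc, urllib.error.URLError):
        return isinstance(exc.reason, (socket.timeout, TimeoutError))
    # A timeout raised directly, for example while reading the body
    return isinstance(exc, (socket.timeout, TimeoutError))

wrapped = urllib.error.URLError(socket.timeout('timed out'))
other = urllib.error.URLError('connection refused')
bare = socket.timeout('timed out')

print(is_timeout(wrapped), is_timeout(other), is_timeout(bare))
```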

II: The request.Request class

1: Basic usage

"""
Author: 贾继康
Date:
Purpose: the Request class
class urllib.request.Request(url, data=None, headers={}, origin_req_host=None, unverifiable=False, method=None)
"""
"""
import urllib.request
request = urllib.request.Request('http://httpbin.org/get')  # build the request object
response = urllib.request.urlopen(request)  # urlopen() also accepts a Request object
print(response.read().decode('utf-8'))

"""
from urllib import request, parse  # request module and URL utilities

url = "http://httpbin.org/post"
headers = {
    # Pretend to be a Chrome browser
    'User-Agent': 'Mozilla/5.0 (Linux; Android 6.0; Nexus 5 Build/MRA58N) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/69.0.3497.12 Mobile Safari/537.36',
    'Host': 'httpbin.org'
}
# Named form_data rather than dict, so the built-in dict type is not shadowed
form_data = {
    'name': 'Germey'
}

data = bytes(parse.urlencode(form_data), encoding='utf-8')
# Alternative: add headers one at a time with add_header()
# req = request.Request(url=url, data=data, method='POST')
# req.add_header('User-Agent', 'Mozilla/5.0 (Linux; Android 6.0; Nexus 5 Build/MRA58N) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/69.0.3497.12 Mobile Safari/537.36')
req = request.Request(url=url, data=data, headers=headers, method='POST')
response = request.urlopen(req)
print(response.read().decode('utf-8'))


Result

{
  "args": {}, 
  "data": "", 
  "files": {}, 
  "form": {
    "name": "Germey"
  }, 
  "headers": {
    "Accept-Encoding": "identity", 
    "Connection": "close", 
    "Content-Length": "11", 
    "Content-Type": "application/x-www-form-urlencoded", 
    "Host": "httpbin.org", 
    "User-Agent": "Mozilla/5.0 (Linux; Android 6.0; Nexus 5 Build/MRA58N) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/69.0.3497.12 Mobile Safari/537.36"
  }, 
  "json": null, 
  "origin": "182.245.65.138", 
  "url": "http://httpbin.org/post"
}

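A Request object can be inspected before it is sent, which is handy for checking the method, body, and headers offline. A minimal sketch with assumed example values (the short User-Agent string is a placeholder):

```python
from urllib import parse, request

# Assumed example values; nothing is sent over the network here
data = parse.urlencode({'name': 'Germey'}).encode('utf-8')
req = request.Request(
    url='http://httpbin.org/post',
    data=data,
    headers={'User-Agent': 'Mozilla/5.0'},
    method='POST',
)

print(req.get_method())              # HTTP verb that would be used
print(req.full_url)                  # target URL
print(req.data)                      # encoded request body
print(req.get_header('User-agent'))  # Request stores header names capitalized

# Without data and without method, the verb defaults to GET
plain = request.Request('http://httpbin.org/get')
print(plain.get_method())
```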

2: Advanced usage

1: Authentication

Some pages require authentication: when requested, they prompt for a username and password.

"""
Author: 贾继康
Date:
Purpose: HTTP basic authentication
"""
# HTTPPasswordMgrWithDefaultRealm: manages passwords; it maintains a table of usernames and passwords
# HTTPBasicAuthHandler: handles the authentication, using that password manager
from urllib.request import HTTPPasswordMgrWithDefaultRealm, HTTPBasicAuthHandler, build_opener
from urllib.error import URLError  # exception raised when the request fails

username = 'username'
password = 'password'
url = 'http://localhost:5000/'  # a URL that requires authentication

p = HTTPPasswordMgrWithDefaultRealm()
p.add_password(None, url, username, password)
auth_handler = HTTPBasicAuthHandler(p)  # instantiate HTTPBasicAuthHandler with the password manager as its argument
opener = build_opener(auth_handler)  # build an Opener; requests sent through it carry the credentials

try:
    result = opener.open(url)  # open the URL with the opener's open() method
    html = result.read().decode('utf-8')
    print(html)
except URLError as e:
    print(e.reason)

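The password manager can be queried directly to see which URLs the stored credentials would apply to. A small offline sketch; the realm name and the second URL are assumptions made up for illustration.

```python
from urllib.request import HTTPPasswordMgrWithDefaultRealm

mgr = HTTPPasswordMgrWithDefaultRealm()
# Passing None as the realm registers the credentials for any realm at this URL
mgr.add_password(None, 'http://localhost:5000/', 'username', 'password')

# A URL below the registered one matches, whatever realm the server announces
inside = mgr.find_user_password('Any Realm', 'http://localhost:5000/admin/page')
# A different host does not match
outside = mgr.find_user_password('Any Realm', 'http://example.com/')

print(inside)
print(outside)
```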

2: Proxies

"""
Author: 贾继康
Date:
Purpose: use a proxy server
1: ProxyHandler takes a dict; each key is a protocol type and each value is a proxy URL (several proxies can be added)
2: Use the handler and build_opener() to construct an opener
3: Send the request
"""
from urllib.error import URLError
from urllib.request import ProxyHandler, build_opener


proxy_handler = ProxyHandler({
    'http': 'http://127.0.0.1:9743',
    'https': 'http://127.0.0.1:9743'
})
# Build an opener with build_opener()
opener = build_opener(proxy_handler)
try:
    response = opener.open('https://www.baidu.com')
    print(response.read().decode('utf-8'))
except URLError as e:
    print(e.reason)


3: Cookies

"""
Author: 贾继康
Date:
Purpose:
   Get a website's cookies
   1: Declare a CookieJar object
   2: Build a handler with urllib.request.HTTPCookieProcessor(cookie)
   3: Build an opener with build_opener()
   4: Call open()

   Improved version: V2.0
   Replace CookieJar with MozillaCookieJar, which is needed when writing a file.
   It can read and save cookies, storing them in the Mozilla browser cookie format.

   Improved version: V3.0
   Save format: libwww-perl (LWP) cookie file
   To save in LWP format, just declare: cookie = http.cookiejar.LWPCookieJar(filename)

"""
import http.cookiejar, urllib.request

"""
cookie = http.cookiejar.CookieJar()  # 1: declare a CookieJar object
handler = urllib.request.HTTPCookieProcessor(cookie) # 2
opener = urllib.request.build_opener(handler) # 3

response = opener.open('http://www.baidu.com')
for item in cookie:
    print(item.name+"="+item.value)
Result:
    BAIDUID=77B6920A1FCACD2B94C2905DD2B83C90:FG=1
    BIDUPSID=77B6920A1FCACD2B94C2905DD2B83C90
    H_PS_PSSID=1445_21110_22074
    PSTM=1537189285
    BDSVRTM=0
    BD_HOME=0
    delPer=0

"""

"""
# V2.0
filename = 'cookies.txt'
cookie = http.cookiejar.MozillaCookieJar(filename)
handler  = urllib.request.HTTPCookieProcessor(cookie)
opener = urllib.request.build_opener(handler)

response = opener.open('http://www.baidu.com')
cookie.save(ignore_discard=True,ignore_expires=True)

"""

"""
# V3.0
filename = 'cookies.txt'
cookie = http.cookiejar.LWPCookieJar(filename)
handler  = urllib.request.HTTPCookieProcessor(cookie)
opener = urllib.request.build_opener(handler)

response = opener.open('http://www.baidu.com')
cookie.save(ignore_discard=True,ignore_expires=True)

"""
# Read the file saved in V3.0 (LWP) format and reuse it
cookie = http.cookiejar.LWPCookieJar()
# Load the local cookie file with load()
cookie.load('cookies.txt', ignore_discard=True, ignore_expires=True)
# Build a handler and an opener that send the loaded cookies
handler = urllib.request.HTTPCookieProcessor(cookie)
opener = urllib.request.build_opener(handler)

response = opener.open('http://www.baidu.com')
print(response.read().decode('utf-8'))  # print the source of the Baidu home page

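The save and load round trip can also be tried without any network access by putting a hand-built cookie into the jar. This is only a sketch: make_cookie() is a hypothetical helper, and the cookie name, value, and domain are made-up values.

```python
import http.cookiejar
import os
import tempfile

def make_cookie(name, value, domain):
    """Build a session cookie by hand (hypothetical helper for this demo)."""
    return http.cookiejar.Cookie(
        version=0, name=name, value=value,
        port=None, port_specified=False,
        domain=domain, domain_specified=True, domain_initial_dot=False,
        path='/', path_specified=True,
        secure=False, expires=None, discard=True,
        comment=None, comment_url=None, rest={},
    )

with tempfile.TemporaryDirectory() as tmp:
    filename = os.path.join(tmp, 'cookies.txt')

    # Save in LWP format; ignore_discard keeps session cookies in the file
    jar = http.cookiejar.LWPCookieJar(filename)
    jar.set_cookie(make_cookie('session', 'abc123', 'example.com'))
    jar.save(ignore_discard=True, ignore_expires=True)

    with open(filename, encoding='utf-8') as f:
        first_line = f.readline().strip()

    # Load the file back into a fresh jar
    loaded = http.cookiejar.LWPCookieJar()
    loaded.load(filename, ignore_discard=True, ignore_expires=True)

pairs = [(c.name, c.value) for c in loaded]
print(first_line)
print(pairs)
```

Without ignore_discard=True the session cookie would be skipped on save, because it is marked to be discarded at the end of the session.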

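III: Exception handling

1: URLError

"""
Author: 贾继康
Date:
Purpose: catch URLError, the base exception raised by urllib.request
"""
from urllib import request, error
try:
    response = request.urlopen('http://jiajiknag.com/index.html')
except error.URLError as e:
    # e.reason describes why the request failed
    print('Request failed:', e.reason)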


2: HTTPError

"""
Author: 贾继康
Date:
Purpose:
        HTTPError is the subclass
        URLError is the parent class
        Catch the subclass error first, then the parent class error

"""
from urllib import request, error
try:
    response = request.urlopen('http://jiajiknag.com/index.html')
except error.HTTPError as e:
    print(e.reason, e.code, e.headers, sep='\n')
except error.URLError as e:
    print(e.reason)
else:
    print('Request successful')

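The reason for checking the subclass first can be shown offline by constructing both exceptions by hand. In this sketch describe() is a hypothetical helper, and the URL and messages are made-up values.

```python
import io
from urllib.error import HTTPError, URLError

def describe(exc):
    """Summarize an exception (hypothetical helper)."""
    # Check the subclass first: an HTTPError is also a URLError
    if isinstance(exc, HTTPError):
        return f'HTTP error {exc.code}: {exc.reason}'
    if isinstance(exc, URLError):
        return f'URL error: {exc.reason}'
    return f'other error: {exc}'

http_err = HTTPError('http://example.com/missing', 404, 'Not Found', {}, io.BytesIO())
url_err = URLError('name resolution failed')

print(issubclass(HTTPError, URLError))
print(describe(http_err))
print(describe(url_err))
```

If the URLError check came first, the HTTPError would be caught by it and the status code would never be reported.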

IV: Parsing URLs

1: urlparse()

"""
Author: 贾继康
Date:
Purpose: split a URL into its components
"""
from urllib.parse import urlparse  # identifies and splits the parts of a URL
result = urlparse('http://www.baidu.com/index.html;user?id=5#comment')
print(type(result), result)

Result:

<class 'urllib.parse.ParseResult'> ParseResult(scheme='http', netloc='www.baidu.com', path='/index.html', params='user', query='id=5', fragment='comment')


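urlparse() splits the URL into 6 parts. The part before :// is the scheme, which is the protocol;

the part before the first / is the netloc, which is the domain name;

what follows it is the path, the access path;

the part after the semicolon ; is params, the parameters;

the part after the question mark ? is the query, generally used in GET-style URLs; the part after the hash # is the fragment (anchor), used to jump directly to a position within the page.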
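The six parts can be read either by attribute name or by index, since ParseResult is a named tuple. A short sketch, assuming the example URL http://www.baidu.com/index.html;user?id=5#comment:

```python
from urllib.parse import urlparse

result = urlparse('http://www.baidu.com/index.html;user?id=5#comment')

# Attribute access and index access return the same values
print(result.scheme, result[0])
print(result.netloc, result[1])
print(result.path, result.params, result.query, result.fragment)

# geturl() reassembles the parts into a URL string
print(result.geturl())
```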

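Other parameters of urlparse()

1: urlstring: required; the URL to parse

2: scheme: the default protocol, used only when the URL itself carries no scheme

3: allow_fragments: whether to recognize the fragment part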

"""
Author: 贾继康
Date:
Purpose: the scheme and allow_fragments parameters of urlparse()
"""
"""
from urllib.parse import urlparse  # identifies and splits the parts of a URL
result = urlparse('http://www.baidu.com/index.html;user?id=5#comment')
print(type(result), result)
"""
"""
scheme: the default protocol
from urllib.parse import urlparse  # identifies and splits the parts of a URL
# scheme: the default protocol; it is ignored here because the URL already has one
result = urlparse('http://www.baidu.com/index.html;user?id=5#comment', scheme='https')
# print(type(result), result)
print(result)
Result: ParseResult(scheme='http', netloc='www.baidu.com', path='/index.html', params='user', query='id=5', fragment='comment')
"""

# allow_fragments: whether to recognize the fragment. If it is set to False, the fragment part is ignored:
# it is parsed as part of the path, params, or query, and the fragment field is left empty.
from urllib.parse import urlparse  # identifies and splits the parts of a URL
# allow_fragments=False: '#comment' stays attached to the query
result = urlparse('http://www.baidu.com/index.html;user?id=5#comment', allow_fragments=False)
# print(type(result), result)
print(result)

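The effect of both parameters is easiest to see side by side. The sketch below assumes a URL without a scheme for the first case and a URL without a query for the last one; both are illustrative inputs.

```python
from urllib.parse import urlparse

# scheme is only a fallback: it applies because this URL has no scheme of its own.
# Without a leading //, the host name is treated as part of the path.
no_scheme = urlparse('www.baidu.com/index.html;user?id=5#comment', scheme='https')
print(no_scheme)

# With allow_fragments=False the '#comment' text stays attached to the query...
no_frag = urlparse('http://www.baidu.com/index.html;user?id=5#comment', allow_fragments=False)
print(no_frag.query, repr(no_frag.fragment))

# ...or to the path when the URL has no params or query
no_frag_no_query = urlparse('http://www.baidu.com/index.html#comment', allow_fragments=False)
print(no_frag_no_query.path, repr(no_frag_no_query.fragment))
```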

2: urlunparse()

"""
Author: 贾继康
Date:
Purpose: urlparse() has a counterpart, urlunparse().
         It accepts an iterable as its argument, but the length must be exactly 6;
         otherwise it raises an error about too few or too many values.
"""
from urllib.parse import urlunparse

data = ['http', 'www.baidu.com', 'index.html', 'user', 'a=6', 'comment']
print(urlunparse(data))


Result:

http://www.baidu.com/index.html;user?a=6#comment

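The length requirement can be checked directly. In the sketch below the five-element tuple is an assumed bad input, used only to trigger the error.

```python
from urllib.parse import urlunparse

six = ('http', 'www.baidu.com', 'index.html', 'user', 'a=6', 'comment')
print(urlunparse(six))

# Only five components: params is missing, so the call is rejected
five = ('http', 'www.baidu.com', 'index.html', 'a=6', 'comment')
try:
    urlunparse(five)
    error_name = None
except ValueError as exc:
    error_name = type(exc).__name__
    print('rejected:', exc)
```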

3: urlsplit()

"""
Author: 贾继康
Date:
Purpose: this function is very similar to urlparse(),
         except that it does not parse params separately and returns only 5 results.
"""
from urllib.parse import urlsplit
result = urlsplit('http://www.baidu.com/index.html;user?id=5#comment')
# The return value is a SplitResult, which is also a tuple type: values can be read
# by attribute or by index.
print(result)
print(result.scheme, result[0])


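4: urlunsplit()

"""
Author: 贾继康
Date:
Purpose:
     It also combines the parts of a URL into a complete URL. The argument is again an
     iterable, such as a list or tuple; the only difference is that its length must be 5.
"""
from urllib.parse import urlunsplit
data = ['http', 'www.baidu.com', 'index.html', 'a=6', 'comment']
print(urlunsplit(data))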


Result:

http://www.baidu.com/index.html?a=6#comment

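Since urlsplit() and urlunsplit() are inverses, they can be combined to change one part of a URL and rebuild it. A minimal sketch, assuming the same example URL as above and a made-up replacement query:

```python
from urllib.parse import urlsplit, urlunsplit

url = 'http://www.baidu.com/index.html;user?id=5#comment'
parts = urlsplit(url)

# params are not split out: ';user' stays inside the path
print(len(parts), parts.path)

# Splitting and unsplitting gives back the original URL
rebuilt = urlunsplit(parts)

# SplitResult is a named tuple, so _replace() returns a modified copy
changed = urlunsplit(parts._replace(query='id=6', fragment=''))
print(rebuilt)
print(changed)
```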

5: urljoin()

"""
Author: 贾继康
Date:
Purpose:
     There is another way to build a URL: urljoin(). We pass a base_url (the base URL)
     as the first argument and the new URL as the second argument.
     The function analyzes the scheme, netloc, and path of base_url, fills in whatever the new URL is missing, and returns the result.
"""
from urllib.parse import urljoin
print(urljoin('http://www.baidu.com', 'FAQ.html'))
print(urljoin('http://www.baidu.com', 'https://cuiqingcai.com/FAQ.html'))
print(urljoin('http://www.baidu.com/about.html', 'https://cuiqingcai.com/FAQ.html'))
print(urljoin('http://www.baidu.com/about.html', 'https://cuiqingcai.com/FAQ.html?question=2'))
print(urljoin('http://www.baidu.com?wd=abc', 'https://cuiqingcai.com/index.php'))
print(urljoin('http://www.baidu.com', '?category=2#comment'))
print(urljoin('www.baidu.com', '?category=2#comment'))
print(urljoin('www.baidu.com#comment', '?category=2'))


Result:

http://www.baidu.com/FAQ.html
https://cuiqingcai.com/FAQ.html
https://cuiqingcai.com/FAQ.html
https://cuiqingcai.com/FAQ.html?question=2
https://cuiqingcai.com/index.php
http://www.baidu.com?category=2#comment
www.baidu.com?category=2#comment
www.baidu.com?category=2

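A common use of urljoin() when crawling is resolving relative links found in a page against that page's URL. The sketch below uses an assumed base URL with a deeper path and a made-up CDN host to show how the different kinds of relative reference are resolved.

```python
from urllib.parse import urljoin

# Assumed page URL; the links below are illustrative
base = 'http://www.baidu.com/docs/guide/intro.html'

same_dir = urljoin(base, 'faq.html')                     # sibling file
parent_dir = urljoin(base, '../faq.html')                # one directory up
site_root = urljoin(base, '/faq.html')                   # absolute path on the same host
other_host = urljoin(base, '//cdn.example.com/app.js')   # scheme-relative link

for link in (same_dir, parent_dir, site_root, other_host):
    print(link)
```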

6: urlencode()

"""
Author: 贾继康
Date:
Purpose: declare a dict that holds the parameters, then call urlencode() to serialize it
into GET request parameters.

"""
from urllib.parse import urlencode
params = {
    'name': 'germey',
    'age': 22
}

base_url = 'http://www.baidu.com?'  # base URL, ending with '?'
url = base_url + urlencode(params)
print(url)


Result:

http://www.baidu.com?name=germey&age=22


7: parse_qs()

"""
Author: 贾继康
Date:
Purpose: parse_qs() converts a query string back into a dict, as shown below:
"""
from urllib.parse import parse_qs
query = 'name=jiajikang&age=20'
print(query)
print(parse_qs(query))


Result:

name=jiajikang&age=20
{'name': ['jiajikang'], 'age': ['20']}


8: parse_qsl()

"""
Author: 贾继康
Date:
Purpose: there is also parse_qsl(), which converts the parameters into a list of tuples.
"""
from urllib.parse import parse_qsl
query = 'name=jiajiknag&age=22'
print(parse_qsl(query))


Result:

[('name', 'jiajiknag'), ('age', '22')]

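urlencode(), parse_qs(), and parse_qsl() fit together as a round trip. The sketch below uses made-up parameters that include a space, an ampersand, and a repeated key, the cases where hand-built query strings usually go wrong.

```python
from urllib.parse import parse_qs, parse_qsl, urlencode

# Illustrative parameters (assumed values)
params = {'q': 'a b&c', 'tag': ['x', 'y']}

# doseq=True expands the list into one key=value pair per element
query = urlencode(params, doseq=True)
print(query)

as_dict = parse_qs(query)    # every value becomes a list
as_pairs = parse_qsl(query)  # one (key, value) tuple per pair, in order
print(as_dict)
print(as_pairs)
```

parse_qs() always returns lists as values because a key may appear more than once in a query string.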

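9: quote()

"""
Author: 贾继康
Date:
Purpose:
      This function converts content into URL-encoded form. A URL with Chinese parameters can
      sometimes produce garbled text; quote() converts the Chinese characters into URL encoding.
"""
from urllib.parse import quote
keyword = '贾继康'
url = 'https://www.baidu.com/s?wd=' + quote(keyword)
print(url)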


Result:

https://www.baidu.com/s?wd=%E8%B4%BE%E7%BB%A7%E5%BA%B7


10: unquote()

"""
Author: 贾继康
Date:
Purpose: decode a URL-encoded string
"""
from urllib.parse import unquote

url = 'https://www.baidu.com/s?wd=%E8%B4%BE%E7%BB%A7%E5%BA%B7'
print(unquote(url))


Result:

https://www.baidu.com/s?wd=贾继康

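Two details of quote() are worth knowing: by default it leaves / unescaped, and it encodes a space as %20 rather than +. The sketch below compares it with quote_plus(), using a made-up input string.

```python
from urllib.parse import quote, quote_plus, unquote, unquote_plus

text = 'a/b c'  # illustrative input

path_style = quote(text)            # '/' is kept because safe defaults to '/'
strict = quote(text, safe='')       # escape '/' as well
form_style = quote_plus(text)       # form encoding: space becomes '+'

print(path_style, strict, form_style)

# unquote() reverses percent-escapes; unquote_plus() also turns '+' back into a space
print(unquote(strict), unquote_plus(form_style))
```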

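V: Analyzing the Robots protocol (the robotparser module of urllib)

1: The Robots protocol

The Robots protocol, also called the crawler protocol or robot protocol, is formally named the Robots Exclusion Protocol. It tells crawlers and search engines which pages may be fetched and which may not. It usually takes the form of a text file named robots.txt, normally placed in the site's root directory. When a search crawler visits a site, it first checks whether robots.txt exists in the root directory. If it does, the crawler crawls within the scope the file defines. If the file is not found, the crawler visits every page that is directly accessible.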

2: Crawler names

Crawler name   Name         Website
BaiduSpider    Baidu        www.baidu.com
Googlebot      Google       www.google.com
360Spider      360 Search   www.so.com
YodaoBot       Youdao       www.youdao.com
ia_archiver    Alexa        www.alexa.cn

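3: robotparser (checking whether a page may be crawled)

Once you understand the Robots protocol, you can use the robotparser module to parse robots.txt. The module provides a class, RobotFileParser, which uses a site's robots.txt file to decide whether a given crawler is allowed to fetch a page. The class is simple to use: just pass the robots.txt URL to the constructor. Its declaration is:

urllib.robotparser.RobotFileParser(url='')

You can also omit the URL when creating the object (it defaults to empty) and set it later with set_url(). The commonly used methods of this class are listed below.

1: set_url(): sets the URL of the robots.txt file. If the URL was passed when the RobotFileParser object was created, there is no need to call this method.
2: read(): fetches the robots.txt file and parses it. Note that this method performs the actual fetch and parse; if it is not called, every subsequent check returns False, so remember to call it. It returns nothing.
3: parse(): parses robots.txt content. The argument is a list of lines from robots.txt, which are analyzed according to the robots.txt syntax rules.
4: can_fetch(): takes two arguments, the User-agent and the URL to fetch. It returns True or False, indicating whether that crawler may fetch the URL.
5: mtime(): returns the time robots.txt was last fetched and parsed. This is useful for long-running crawlers, which may need to re-check periodically to pick up the latest robots.txt.
6: modified(): also useful for long-running crawlers; it sets the time robots.txt was last fetched and parsed to the current time.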

"""
Author: 贾继康
Date:
Purpose: check whether a page may be crawled
          1: Create a RobotFileParser() object
          2: Set the robots.txt URL with set_url()

"""

from urllib.robotparser import RobotFileParser
rp = RobotFileParser()
rp.set_url('http://www.jianshu.com/robots.txt')
rp.read()
print(rp.can_fetch('*', 'http://www.jianshu.com/p/b67554025d7d'))
print(rp.can_fetch('*', 'http://www.jianshu.com/search?q=python&page=1&type=collections'))


Result:

False
False

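The parse() method makes it possible to try the rules offline by feeding robots.txt lines directly instead of fetching them with read(). The rules and the example.com URLs below are made up for illustration and are not taken from any real site.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content
robots_txt = """\
User-agent: *
Disallow: /search
Allow: /
"""

rp = RobotFileParser()
rp.parse(robots_txt.splitlines())  # parse() takes a list of lines

search_allowed = rp.can_fetch('*', 'http://example.com/search?q=python')
article_allowed = rp.can_fetch('*', 'http://example.com/p/1')
print(search_allowed, article_allowed)

# parse() also records when the rules were last loaded
print(rp.mtime() > 0)
```

Rules are checked in the order they appear, so the Disallow line for /search takes effect before the general Allow line.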

Published by 全栈程序员栈长. Please credit the source when reposting: https://javaforall.cn/234893.html Original link: https://javaforall.cn