Sulao's Study Notes - Fixing the decode error when reading a scraped page with read() in Python

Author: shevechco  Date: 2018-11-19  Category: Python Notes

I had used Python to scrape images before without trouble, but recently scraping web pages kept failing with the following error:

Exception raised: UnicodeDecodeError
'utf-8' codec can't decode byte 0x8b in position 1: invalid start byte

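This is a decoding error. After some reading, the likely cause is that the remote server sent the page gzip-compressed: byte 0x8b at position 1 is the second byte of the gzip magic number 0x1f 0x8b. The body therefore has to be decompressed first and only then decoded.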
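To see why the error points at byte 0x8b in position 1, here is a minimal sketch that reproduces it without any network access. The sample HTML string is made up for illustration; only the standard-library gzip module is used.

```python
import gzip

# Hypothetical page body, compressed the way a server might send it.
compressed = gzip.compress("<html>hello</html>".encode("utf-8"))

# Every gzip stream starts with the magic bytes 0x1f 0x8b.
is_gzip = compressed[:2] == b"\x1f\x8b"

try:
    compressed.decode("utf-8")
    error = None
except UnicodeDecodeError as exc:
    error = exc

# 0x1f is a valid ASCII control character, so decoding gets past index 0
# and only fails at index 1, where 0x8b cannot start a UTF-8 sequence.
print(is_gzip, error)
```

Checking the first two bytes for the gzip magic number is also a quick way to confirm the diagnosis when the response headers are not at hand.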

The fix:

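# Import the zlib module
import zlib
# Find this line:
response = response.read()
# and replace it with:
# ("or ''" guards against a missing Content-Encoding header, where get() returns None)
if 'gzip' in (response.headers.get('Content-Encoding') or ''):
    # 16 + zlib.MAX_WBITS tells zlib to expect a gzip header and trailer
    response = zlib.decompress(response.read(), 16 + zlib.MAX_WBITS)
else:
    response = response.read()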
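The same check can be wrapped in a small helper so that decompression and decoding happen in one place. This is only a sketch: the function name decode_body and its parameters are my own, not part of the original script, and the sample bytes are generated locally with gzip.compress instead of being fetched.

```python
import gzip
import zlib


def decode_body(raw, content_encoding=None, charset="utf-8"):
    """Return the body as text, gunzipping first when the header says gzip."""
    if content_encoding and "gzip" in content_encoding.lower():
        # 16 + zlib.MAX_WBITS makes zlib accept the gzip header and trailer.
        raw = zlib.decompress(raw, 16 + zlib.MAX_WBITS)
    return raw.decode(charset)


# Locally built sample data standing in for a real response body.
sample = gzip.compress("<html>ok</html>".encode("utf-8"))
text = decode_body(sample, "gzip")
plain = decode_body(b"<html>ok</html>", None)
print(text, plain)
```

Assuming the response object comes from urllib.request.urlopen, the call would look like decode_body(response.read(), response.headers.get('Content-Encoding')).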

Printing response.headers shows which header fields the response carries, for example:

Server: DnionOS/1.11.2.4_12
Date: Sun, 18 Nov 2018 16:43:17 GMT
Content-Type: text/html; charset=UTF-8
Content-Length: 5389
Connection: close
Vary: Accept-Encoding
X-Powered-By: PHP/7.2.5
Content-Encoding: gzip
Dnion-Transfer-Encoding: 1
Age: 32953
Via: https/1.1 CMC-CT-CNC-JSCZ-P-164-8 (DLC-6.1.19), http/1.1 CCKD-CT-GDFS-C-48-141 (DLC-6.1.19)
Server-Info: DnionATS
HitInfo: CDN_HIT
HitType: TCP_MEM_HIT

Content-Encoding is indeed gzip here, so with the change described above the script runs normally. For the rest of the code, see the earlier note:

https://sulao.cn/post/568.html
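The header check can also be tried offline. The sketch below parses three of the header lines listed above with the standard-library email package. With urllib, response.headers is an http.client.HTTPMessage, a subclass of email.message.Message, so the same get() and get_content_charset() calls should apply there too. The variable names are illustrative.

```python
from email import message_from_string

# A few of the header lines from the listing above, followed by a blank line.
raw_headers = (
    "Content-Type: text/html; charset=UTF-8\n"
    "Content-Length: 5389\n"
    "Content-Encoding: gzip\n"
    "\n"
)
headers = message_from_string(raw_headers)

# get() takes a default, which avoids a None when the header is missing.
encoding = headers.get("Content-Encoding", "")
# get_content_charset() reads the charset parameter of Content-Type.
charset = headers.get_content_charset() or "utf-8"
needs_gunzip = "gzip" in encoding.lower()
print(needs_gunzip, charset)
```

Reading the charset from Content-Type instead of hard-coding 'utf-8' is a design choice that may help with pages served in other encodings.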

Please credit the source when reposting: https://sulao.cn/post/576.html
