Fixing the decode error when reading a scraped page with read() in Python

Published 2018-11-19 00:35:53, updated 2025-03-31 18:06:27
Development
shevechco

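I had used Python to scrape images before without trouble, but recently scraping web pages kept failing with the following error:

Exception raised: UnicodeDecodeError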
'utf-8' codec can't decode byte 0x8b in position 1: invalid start byte

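This is a decoding error. After some reading, the likely cause is that the server sent the page gzip-compressed, so the body has to be decompressed first and only then decoded.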
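The byte named in the error is a useful clue: a gzip stream always starts with the two bytes 0x1f 0x8b, and 0x8b cannot begin a UTF-8 character. Here is a minimal sketch that reproduces the same message locally, using the standard gzip module and a made-up HTML string (an assumption, not the real page):

```python
import gzip

# Hypothetical page body, standing in for what the server would send.
html = "<html><body>hello</body></html>"
compressed = gzip.compress(html.encode("utf-8"))

# Every gzip stream begins with the magic bytes 0x1f 0x8b.
magic = compressed[:2]

try:
    compressed.decode("utf-8")
    error_message = None
except UnicodeDecodeError as exc:
    # Byte 0 (0x1f) is valid ASCII, so decoding fails at byte 1 (0x8b).
    error_message = str(exc)

print(magic.hex())
print(error_message)
```

If the scraper reports byte 0x8b at position 1, the body is very probably still gzip-compressed.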

The fix is:

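# Import the zlib module
import zlib
# Find this line
response = response.read()
# and change it to
# The header may be absent, so fall back to an empty string before testing it
encoding = response.headers.get('Content-Encoding') or ''
if 'gzip' in encoding.lower():
    # 16 + zlib.MAX_WBITS tells zlib to expect a gzip header and trailer
    response = zlib.decompress(response.read(), 16 + zlib.MAX_WBITS)
else:
    response = response.read()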
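The same idea can be wrapped in a small helper so that decompression and decoding happen in one place. This is only a sketch: the function name read_text, its parameters, and the sample string are made up for illustration, and the charset defaults to UTF-8 as an assumption.

```python
import gzip
import zlib


def read_text(raw, content_encoding=None, charset="utf-8"):
    """Decompress raw if it is gzip-encoded, then decode it to str."""
    # The header may be missing, so treat None as "no compression".
    if "gzip" in (content_encoding or "").lower():
        # 16 + zlib.MAX_WBITS makes zlib expect a gzip header and trailer.
        raw = zlib.decompress(raw, 16 + zlib.MAX_WBITS)
    return raw.decode(charset)


# Local check with made-up data, no network needed.
page = "<p>你好, gzip</p>"
packed = gzip.compress(page.encode("utf-8"))
from_gzip = read_text(packed, "gzip")
from_plain = read_text(page.encode("utf-8"), None)
```

Assuming the response object comes from urllib.request, the call would look like read_text(response.read(), response.headers.get('Content-Encoding')).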

Printing response.headers shows which headers the server returned, for example:

Server: DnionOS/1.11.2.4_12
Date: Sun, 18 Nov 2018 16:43:17 GMT
Content-Type: text/html; charset=UTF-8
Content-Length: 5389
Connection: close
Vary: Accept-Encoding
X-Powered-By: PHP/7.2.5
Content-Encoding: gzip
Dnion-Transfer-Encoding: 1
Age: 32953
Via: https/1.1 CMC-CT-CNC-JSCZ-P-164-8 (DLC-6.1.19), http/1.1 CCKD-CT-GDFS-C-48-141 (DLC-6.1.19)
Server-Info: DnionATS
HitInfo: CDN_HIT
HitType: TCP_MEM_HIT

Content-Encoding is indeed gzip here, so with the change above the script runs normally.
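The header check can also be tried without a network request. The sketch below assumes the response comes from urllib.request, whose headers attribute is an http.client.HTTPMessage. It parses a few of the header lines from the listing above into that same class and reads the two values that matter for decoding.

```python
from email.parser import Parser
from http.client import HTTPMessage

# A few of the header lines from the listing above, as raw text.
raw_headers = (
    "Content-Type: text/html; charset=UTF-8\r\n"
    "Content-Length: 5389\r\n"
    "Vary: Accept-Encoding\r\n"
    "Content-Encoding: gzip\r\n"
    "\r\n"
)

# Parse the text into the message class that http.client uses for headers.
headers = Parser(_class=HTTPMessage).parsestr(raw_headers)

# Header names are matched case-insensitively.
content_encoding = headers.get("Content-Encoding")
# The charset parameter of Content-Type, returned in lowercase.
charset = headers.get_content_charset()
is_gzip = "gzip" in (content_encoding or "").lower()

print(content_encoding, charset, is_gzip)
```

The charset found this way can be passed to decode() instead of hard-coding UTF-8.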

Tags
python
decode
zlib

Please credit the source when reposting: https://sulao.cn/post/573
