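I had previously used Python to scrape images, but recently the script kept failing whenever it fetched a page. The error was:
Exception has occurred: UnicodeDecodeError 'utf-8' codec can't decode byte 0x8b in position 1: invalid start byte
This is a decoding error. After some reading, the likely cause is that the server sends the page gzip-compressed: a gzip stream always begins with the two bytes 0x1f 0x8b, which matches the 0x8b reported at position 1. The body therefore has to be decompressed first, and only then decoded into text.
The fix is: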
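The link between gzip and this particular error can be checked offline. The sketch below is only an illustration: the page content is made up, and `gzip.compress` stands in for whatever compression the remote server applies. Decoding the compressed bytes directly as UTF-8 should fail in the same way as the error above.

```python
import gzip

# Hypothetical page body, compressed the way a server would send it
# when it answers with Content-Encoding: gzip.
page = "<html>你好</html>".encode("utf-8")
compressed = gzip.compress(page)

# Every gzip stream starts with the magic bytes 0x1f 0x8b.
magic = compressed[:2]

# Decoding the still-compressed bytes as UTF-8 fails on the second byte.
try:
    compressed.decode("utf-8")
    error_message = None
except UnicodeDecodeError as exc:
    error_message = str(exc)

print(magic.hex())
print(error_message)
```

The first byte, 0x1f, happens to be a valid ASCII control character, which is why the decoder only complains at position 1.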
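# Import the zlib module
import zlib

# Find this line
response = response.read()

# and change it to the following.
# The "or ''" guards against a missing Content-Encoding header,
# in which case .get() returns None and the "in" test would raise TypeError.
if 'gzip' in (response.headers.get('Content-Encoding') or ''):
    # 16 + zlib.MAX_WBITS tells zlib to expect a gzip header and trailer
    response = zlib.decompress(response.read(), 16 + zlib.MAX_WBITS)
else:
    response = response.read()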
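The same logic can be wrapped in a small helper so that it can be tested without touching the network. This is a minimal sketch, not the original script: the name `decode_body`, its parameters, and the sample HTML are all hypothetical, and the default charset of UTF-8 is an assumption.

```python
import gzip
import zlib


def decode_body(raw, content_encoding=None, charset="utf-8"):
    """Return the body as text, gunzipping first if the header says gzip.

    raw              -- the bytes returned by response.read()
    content_encoding -- value of the Content-Encoding header, or None
    charset          -- text encoding of the page (assumed UTF-8 by default)
    """
    if content_encoding and "gzip" in content_encoding.lower():
        # 16 + zlib.MAX_WBITS makes zlib accept the gzip header and trailer.
        raw = zlib.decompress(raw, 16 + zlib.MAX_WBITS)
    return raw.decode(charset)


# Simulate a gzip-compressed response and an uncompressed one.
html = "<html>你好</html>"
compressed = gzip.compress(html.encode("utf-8"))

print(decode_body(compressed, "gzip") == html)
print(decode_body(html.encode("utf-8"), None) == html)
```

Assuming the response object comes from `urllib.request.urlopen`, it could be called as `decode_body(resp.read(), resp.headers.get('Content-Encoding'))`, where `resp` is the object returned by `urlopen`.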
We can print response.headers to see which header fields the response carries, for example:
Server: DnionOS/1.11.2.4_12
Date: Sun, 18 Nov 2018 16:43:17 GMT
Content-Type: text/html; charset=UTF-8
Content-Length: 5389
Connection: close
Vary: Accept-Encoding
X-Powered-By: PHP/7.2.5
Content-Encoding: gzip
Dnion-Transfer-Encoding: 1
Age: 32953
Via: https/1.1 CMC-CT-CNC-JSCZ-P-164-8 (DLC-6.1.19), http/1.1 CCKD-CT-GDFS-C-48-141 (DLC-6.1.19)
Server-Info: DnionATS
HitInfo: CDN_HIT
HitType: TCP_MEM_HIT
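Reading these fields programmatically can be sketched as follows. The example assumes the response comes from `urllib`, whose `response.headers` is an `http.client.HTTPMessage`, a subclass of `email.message.Message`. To stay offline it parses a few of the header lines above with `email.message_from_string` instead of making a request, and the variable names are only illustrative.

```python
import email

# A few of the header lines from the dump above, followed by the
# blank line that ends a header block.
raw_headers = (
    "Content-Type: text/html; charset=UTF-8\n"
    "Content-Length: 5389\n"
    "Content-Encoding: gzip\n"
    "\n"
)
headers = email.message_from_string(raw_headers)

# Header lookup is case-insensitive; a missing header gives None.
content_encoding = headers.get("Content-Encoding")
# get_content_charset() reads the charset parameter of Content-Type.
charset = headers.get_content_charset()
is_gzip = "gzip" in (content_encoding or "").lower()

print(content_encoding, charset, is_gzip)
```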
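The Content-Encoding header confirms that the body is gzip-compressed, so after making the change described above the script runs normally. For the rest of the code, see my earlier note: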
https://sulao.cn/post/568.html