I have used Python to scrape images before, but recently my scraper keeps failing when it fetches web pages. The error is:
Exception has occurred: UnicodeDecodeError
'utf-8' codec can't decode byte 0x8b in position 1: invalid start byte
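This is a decoding error. After some reading, the likely cause is that the server sent the page gzip-compressed: a gzip stream always starts with the two bytes 0x1f 0x8b, which matches the 0x8b at position 1 in the message. We need to decompress the body first and only then decode it to text.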
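The error can be reproduced offline, which helps confirm the diagnosis. The following is a minimal sketch; the sample HTML string is made up for illustration and stands in for a real response body.

```python
import gzip

# Hypothetical page body, compressed the way a gzip-enabled server would send it.
payload = gzip.compress("<html>hello</html>".encode("utf-8"))

# Every gzip stream begins with the magic number 0x1f 0x8b.
magic = payload[:2]
print(magic)

# Decoding the compressed bytes directly fails on the second byte (0x8b).
try:
    payload.decode("utf-8")
except UnicodeDecodeError as exc:
    error_message = str(exc)
    error_position = exc.start
    print(error_message)

# Decompressing first yields bytes that decode cleanly.
text = gzip.decompress(payload).decode("utf-8")
print(text)
```

Byte 0x1f is a valid single-byte UTF-8 character, so the decoder only gives up at position 1, exactly as in the traceback.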
The fix is:
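# Import the zlib module
import zlib

# Find this line:
response = response.read()

# and change it to:
# (the '' default avoids a TypeError when the Content-Encoding header is absent)
if response.headers and 'gzip' in response.headers.get('Content-Encoding', ''):
    # 16 + zlib.MAX_WBITS tells zlib to expect a gzip header and trailer
    response = zlib.decompress(response.read(), 16 + zlib.MAX_WBITS)
else:
    response = response.read()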
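The same idea can be wrapped in a small helper so that every request goes through one code path. This is only a sketch under a few assumptions: the names `decode_body` and `fetch_text` are hypothetical, the code targets Python 3 with `urllib.request`, and UTF-8 is used as the fallback when the server does not declare a charset.

```python
import gzip
import urllib.request
import zlib


def decode_body(raw, content_encoding, charset="utf-8"):
    """Decompress raw bytes if they are gzip-encoded, then decode to str."""
    if content_encoding and "gzip" in content_encoding.lower():
        # 16 + zlib.MAX_WBITS makes zlib accept the gzip header and trailer.
        raw = zlib.decompress(raw, 16 + zlib.MAX_WBITS)
    return raw.decode(charset)


def fetch_text(url):
    """Fetch url and return its body as text (hypothetical helper)."""
    request = urllib.request.Request(url, headers={"Accept-Encoding": "gzip"})
    with urllib.request.urlopen(request) as response:
        encoding = response.headers.get("Content-Encoding", "")
        # Assumption: fall back to UTF-8 when no charset is declared.
        charset = response.headers.get_content_charset() or "utf-8"
        return decode_body(response.read(), encoding, charset)


# Offline check with locally compressed bytes, so no network is needed.
sample = gzip.compress("你好, gzip".encode("utf-8"))
print(decode_body(sample, "gzip"))
print(decode_body(b"plain text", ""))
```

Keeping the decompression separate from the network call makes it possible to test the logic without hitting a real site.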
We can print response.headers to see which headers the response carries, for example:
Server: DnionOS/1.11.2.4_12
Date: Sun, 18 Nov 2018 16:43:17 GMT
Content-Type: text/html; charset=UTF-8
Content-Length: 5389
Connection: close
Vary: Accept-Encoding
X-Powered-By: PHP/7.2.5
Content-Encoding: gzip
Dnion-Transfer-Encoding: 1
Age: 32953
Via: https/1.1 CMC-CT-CNC-JSCZ-P-164-8 (DLC-6.1.19), http/1.1 CCKD-CT-GDFS-C-48-141 (DLC-6.1.19)
Server-Info: DnionATS
HitInfo: CDN_HIT
HitType: TCP_MEM_HIT
Content-Encoding is indeed gzip, so after making the change described above the script runs normally.
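For reference, here is a rough sketch of how the header check behaves. In Python 3 the `response.headers` object returned by `urllib.request.urlopen` is an `http.client.HTTPMessage`, a subclass of `email.message.Message`, so a plain `Message` filled in by hand is used here as an offline stand-in. Only two of the headers from the listing are copied over.

```python
from email.message import Message

# Stand-in for response.headers, populated with two headers from the listing.
headers = Message()
headers["Content-Type"] = "text/html; charset=UTF-8"
headers["Content-Encoding"] = "gzip"

# Iterate over all header name/value pairs.
for name, value in headers.items():
    print(f"{name}: {value}")

# Header lookup is case-insensitive; the '' default covers a missing header.
is_gzip = "gzip" in headers.get("content-encoding", "")

# The declared charset is returned in lower case.
charset = headers.get_content_charset()
print(is_gzip, charset)
```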
Copyright notice: unless otherwise stated, all articles on this site are original.
When reposting, please credit the source: https://sulao.cn/post/573