How to fix the error when scraping HTTPS sites with python3
- 2018-08-05 22:25:44
- Development
- 75
- shevechco
I have been tinkering with web scraping in my spare time. Previously I did all my fetching in PHP with file_get_contents and curl, but now I have started learning Python: as an ops engineer, Python will clearly be my main language for development and automation work going forward. Today I studied Python's urllib module. In PHP I would normally pull in a library for this kind of thing, so some habits have to change with Python. Without further ado, my code is below:
```python
#!/usr/local/bin/python3
#coding:utf-8
from urllib import request
import re

# open the URL and get a response object
response = request.urlopen(r'https://www.uumnt.cc/')
# read the response body into bytes
response = response.read()
# after read(), the bytes must be decoded into a string
response = response.decode('utf-8')
# define the regex pattern
reg = '<img alt="(.*)" src="(.*)">'
# compile the pattern string into a Pattern instance
img_src = re.compile(reg)
# findall with multiple groups returns a list of tuples
img_list = re.findall(img_src, response)
# flatten into a one-dimensional list
img = []
for i in img_list:
    # take the src element of each tuple and append it
    img.append(i[1])

# print the list
print(img)
```
Running it raises an error right away:
```
  File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/urllib/request.py", line 525, in open
    response = self._open(req, data)
  File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/urllib/request.py", line 543, in _open
    '_open', req)
  File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/urllib/request.py", line 503, in _call_chain
    result = func(*args)
  File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/urllib/request.py", line 1360, in https_open
    context=self._context, check_hostname=self._check_hostname)
  File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/urllib/request.py", line 1319, in do_open
    raise URLError(err)
urllib.error.URLError: <urlopen error [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: unable to get local issuer certificate (_ssl.c:1045)>
```
As we can see, the error is SSL-related. According to what I found online, we need to import the ssl module.
We import the ssl module at the top of the script:
```python
import ssl
# then, before opening the page object, add:
ssl._create_default_https_context = ssl._create_unverified_context
```
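Note that patching `ssl._create_default_https_context` disables certificate verification for every subsequent HTTPS request in the process. If that feels too broad, an unverified context can also be passed to a single call; a minimal sketch:

```python
import ssl
from urllib import request

# build an unverified context explicitly; certificate checks are
# disabled only for requests that receive this context
ctx = ssl._create_unverified_context()
print(ctx.verify_mode == ssl.CERT_NONE)  # True

# pass it per request instead of replacing the module-wide default:
# response = request.urlopen('https://www.uumnt.cc/', context=ctx)
```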
Then we run the script again. This time it succeeds: the result is a list, which we can carry on processing in later steps.
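The tuple-flattening step can be seen in isolation with a tiny self-contained example (the HTML snippet and the non-greedy pattern here are just for illustration, not from the target site):

```python
import re

html = '<img alt="a" src="1.jpg"><img alt="b" src="2.jpg">'
# with two capture groups, findall returns a list of tuples
pairs = re.findall('<img alt="(.*?)" src="(.*?)">', html)
# keep only the second element (the src) of each tuple
srcs = [p[1] for p in pairs]
print(srcs)  # ['1.jpg', '2.jpg']
```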
Below is the complete code for a simple image scraper:
```python
#!/usr/local/bin/python3
#coding:utf-8

import os
import re
import datetime
import random
import ssl
from urllib import request

def getImgList(url):
    # disable certificate verification for HTTPS requests
    ssl._create_default_https_context = ssl._create_unverified_context
    response = request.urlopen(url)
    response = response.read()
    response = response.decode('utf-8')
    # match the page's image tags and capture the src attribute
    reg = r'<img src="(.+?\.jpg)" width="180" height="270" alt="(.*)" />'
    img_src = re.compile(reg)
    img_list = re.findall(img_src, response)
    # keep only the src part of each (src, alt) tuple
    img = []
    for p in img_list:
        img.append(p[0])
    return img

def downloadImg(img):
    # create a directory named after today's date, e.g. 20180805
    now = datetime.datetime.now()
    now = now.strftime('%Y%m%d')
    if not os.path.exists(now):
        os.mkdir(now)

    now = int(now)
    for p in img:
        # save each image under a random three-digit filename
        num = random.randint(100, 999)
        num = str(num)
        request.urlretrieve(p, '%d/%s.jpg' % (now, num))

if __name__ == '__main__':
    html = getImgList("http://www.umtu.cc/")
    downloadImg(html)
```
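One caveat with the script above: `random.randint(100, 999)` can produce the same number twice, in which case an earlier image is silently overwritten. A sketch of a collision-free variant using a sequential index (the directory name and the `imgPath` helper are illustrative, not part of the original):

```python
import os
from urllib import request

def imgPath(dest, i):
    # sequential zero-padded filename: never collides, unlike random.randint
    return '%s/%03d.jpg' % (dest, i)

def downloadImg(img, dest='20180805'):
    os.makedirs(dest, exist_ok=True)  # replaces the manual exists() check
    for i, url in enumerate(img):
        request.urlretrieve(url, imgPath(dest, i))

print(imgPath('20180805', 7))  # 20180805/007.jpg
```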
Copyright notice: unless stated otherwise, all articles on this site are original.
When republishing, please credit the source: http://www.sulao.cn/post/526