I've been tinkering with web scraping in my spare time. Previously I did all my scraping in PHP with file_get_contents and curl, but now I've started learning Python. As an ops engineer, my future work will mostly use Python for development and automation, so today I'm studying Python's urllib module. In PHP I basically always pulled in a class library for this kind of thing, so some habits need adjusting for Python. Without further ado, here is my code:
#!/usr/local/bin/python3
#coding:utf-8
from urllib import request
import re

# open the URL and get a response object
response = request.urlopen(r'https://www.uumnt.cc/')
# read the response object into bytes
response = response.read()
# after read(), the bytes must be decoded into a string
response = response.decode('utf-8')
# define the regular expression
reg = '<img alt="(.*)" src="(.*)">'
# compile the pattern string into a Pattern object
img_src = re.compile(reg)
# find all matches in the string; the result is a list of tuples
img_list = re.findall(img_src, response)
# reassemble into a one-dimensional list
img = []
for i in img_list:
    # slice the tuple and append the src to the list
    img.append(i[1])
# print the list
print(img)
Running it immediately raises an error:
  File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/urllib/request.py", line 525, in open
    response = self._open(req, data)
  File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/urllib/request.py", line 543, in _open
    '_open', req)
  File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/urllib/request.py", line 503, in _call_chain
    result = func(*args)
  File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/urllib/request.py", line 1360, in https_open
    context=self._context, check_hostname=self._check_hostname)
  File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/urllib/request.py", line 1319, in do_open
    raise URLError(err)
urllib.error.URLError: <urlopen error [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: unable to get local issuer certificate (_ssl.c:1045)>
As we can see, it's an SSL-related error. According to what I found online, we need to bring in the ssl module.
At the top of the script, import the ssl module:
import ssl
# then, before opening the page, add
ssl._create_default_https_context = ssl._create_unverified_context
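Note that overriding `ssl._create_default_https_context` is a global, process-wide switch, and it relies on a private attribute. A sketch of an alternative is to build the unverified context explicitly and pass it to `urlopen` per call; either way certificate checking is disabled, so this is only suitable for experiments:

```python
import ssl

# build a context that skips certificate verification
ctx = ssl._create_unverified_context()

# the context disables hostname checks and certificate validation
print(ctx.check_hostname)                # False
print(ctx.verify_mode == ssl.CERT_NONE)  # True

# it can then be passed per call instead of patching the module global:
# response = request.urlopen('https://www.uumnt.cc/', context=ctx)
```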
Then we run and debug it again, and get the following result.
What we get back is a list, which we can then use for the follow-up steps.
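One thing worth noting about the pattern above: `(.*)` is greedy, so if a single stretch of HTML contains more than one `<img>` tag, it can swallow everything between the first `alt` and the last `src`. A small self-contained sketch (with made-up HTML) shows the difference between greedy and non-greedy groups:

```python
import re

# hypothetical HTML with two image tags on one line
html = '<img alt="a" src="1.jpg"> <img alt="b" src="2.jpg">'

# greedy groups: one over-long match spanning both tags
greedy = re.findall('<img alt="(.*)" src="(.*)">', html)

# non-greedy groups: one clean match per tag
lazy = re.findall('<img alt="(.*?)" src="(.*?)">', html)

print(len(greedy))  # 1
print(lazy)         # [('a', '1.jpg'), ('b', '2.jpg')]
```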
Below is the complete code for a simple image crawler:
#!/usr/local/bin/python3
#coding:utf-8
import os
import re
import datetime
import random
import ssl
from urllib import request

def getImgList(url):
    # skip certificate verification for HTTPS
    ssl._create_default_https_context = ssl._create_unverified_context
    response = request.urlopen(url)
    response = response.read()
    response = response.decode('utf-8')
    # non-greedy match for the .jpg URL, then the alt text
    reg = r'<img src="(.+?\.jpg)" width="180" height="270" alt="(.*)" />'
    img_src = re.compile(reg)
    img_list = re.findall(img_src, response)
    img = []
    for p in img_list:
        img.append(p[0])
    return img

def downloadImg(img):
    # use today's date (YYYYMMDD) as the download directory
    now = datetime.datetime.now()
    now = now.strftime('%Y%m%d')
    if not os.path.exists(now):
        os.mkdir(now)
    for p in img:
        # random three-digit number as the file name
        num = str(random.randint(100, 999))
        request.urlretrieve(p, '%s/%s.jpg' % (now, num))

if __name__ == '__main__':
    html = getImgList("http://www.umtu.cc/")
    downloadImg(html)
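One caveat in the download step: `random.randint(100, 999)` can produce the same number twice, so one download may silently overwrite another. A sketch of a safer naming scheme (the `buildTargets` helper is hypothetical, assuming the same date-directory layout) uses a running index instead:

```python
import os

def buildTargets(img, directory):
    # create the target directory; no error if it already exists
    os.makedirs(directory, exist_ok=True)
    targets = []
    for i, url in enumerate(img):
        # index-based names cannot collide within one run
        targets.append('%s/%03d.jpg' % (directory, i))
        # the actual download would then be:
        # request.urlretrieve(url, targets[-1])
    return targets

print(buildTargets(['a.jpg', 'b.jpg', 'c.jpg'], '20190101'))
# ['20190101/000.jpg', '20190101/001.jpg', '20190101/002.jpg']
```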