I've been tinkering with web scraping in my spare time. In PHP I always did the fetching with file_get_contents and curl, but now I've started learning Python: as an ops engineer, Python is bound to be my main language for later development and automation work, so today I'm studying Python's urllib module. In PHP I would basically pull in a library to handle this, so learning Python means changing a few habits. Without further ado, here is my code:
#!/usr/local/bin/python3
#coding:utf-8
from urllib import request
import re

# Open the URL and get a response object
response = request.urlopen(r'https://www.uumnt.cc/')
# Read the response body (as bytes)
response = response.read()
# After read() the body is bytes, so decode it into a str
response = response.decode('utf-8')
# Define the regex pattern
reg = '<img alt="(.*)" src="(.*)">'
# Compile the pattern string into a Pattern object
img_src = re.compile(reg)
# findall with two groups returns a list of (alt, src) tuples
img_list = re.findall(img_src, response)
# Rebuild a flat one-dimensional list
img = []
for i in img_list:
    # Append the second element of each tuple (the src)
    img.append(i[1])
# Print the list
print(img)

Running this fails immediately:
  File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/urllib/request.py", line 525, in open
    response = self._open(req, data)
  File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/urllib/request.py", line 543, in _open
    '_open', req)
  File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/urllib/request.py", line 503, in _call_chain
    result = func(*args)
  File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/urllib/request.py", line 1360, in https_open
    context=self._context, check_hostname=self._check_hostname)
  File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/urllib/request.py", line 1319, in do_open
    raise URLError(err)
urllib.error.URLError: <urlopen error [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: unable to get local issuer certificate (_ssl.c:1045)>
As we can see, it's an SSL-related error. According to what I found online, we need to bring in the ssl module.

At the top of the script add:

import ssl

and then, before opening the page object, add:

ssl._create_default_https_context = ssl._create_unverified_context
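A minimal sketch of what this workaround actually does: the unverified context skips certificate checking entirely, which is fine for a quick test but removes the protection that HTTPS verification provides.

```python
import ssl

# Swap the default HTTPS context factory for one that skips verification
ssl._create_default_https_context = ssl._create_unverified_context

# The resulting context neither verifies certificates nor checks hostnames
ctx = ssl._create_default_https_context()
print(ctx.verify_mode == ssl.CERT_NONE)  # True
print(ctx.check_hostname)                # False
```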
Running the script again now succeeds, with the following result:

What we get is a list, and from there we can carry on with the rest of the processing.
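Here is a self-contained sketch of that findall-then-flatten step on a made-up HTML fragment. Note it uses non-greedy groups `(.*?)`; the greedy `(.*)` from the script above can over-match when several img tags share one line.

```python
import re

# A hypothetical HTML fragment shaped like the pages being scraped
html = ('<img alt="first pic" src="https://example.com/1.jpg">'
        '<img alt="second pic" src="https://example.com/2.jpg">')
reg = '<img alt="(.*?)" src="(.*?)">'

# findall with two groups returns a list of (alt, src) tuples
img_list = re.findall(reg, html)
print(img_list)

# Flatten to the src values only -- equivalent to the append loop above
img = [i[1] for i in img_list]
print(img)  # ['https://example.com/1.jpg', 'https://example.com/2.jpg']
```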
Below is the complete code for a simple image crawler:
#!/usr/local/bin/python3
#coding:utf-8
import os
import re
import datetime
import random
import ssl
from urllib import request

def getImgList(url):
    # Skip certificate verification so https pages open without an SSL error
    ssl._create_default_https_context = ssl._create_unverified_context
    response = request.urlopen(url)
    response = response.read()
    response = response.decode('utf-8')
    # Raw string so the \. in the pattern is not treated as a string escape
    reg = r'<img src="(.+?\.jpg)" width="180" height="270" alt="(.*)" />'
    img_src = re.compile(reg)
    # List of (src, alt) tuples; keep only the src of each match
    img_list = re.findall(img_src, response)
    img = []
    for p in img_list:
        img.append(p[0])
    return img

def downloadImg(img):
    # Name the download directory after today's date, e.g. 20190308
    now = datetime.datetime.now()
    now = now.strftime('%Y%m%d')
    if not os.path.exists(now):
        os.mkdir(now)
    for p in img:
        # Random three-digit filename (collisions are possible)
        num = str(random.randint(100, 999))
        request.urlretrieve(p, '%s/%s.jpg' % (now, num))

if __name__ == '__main__':
    html = getImgList("http://www.umtu.cc/")
    downloadImg(html)
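The dated directory name in downloadImg comes from strftime. A quick sketch of the format (the date here is a made-up example):

```python
import datetime

# '%Y%m%d' packs year, month and day into one sortable stamp
stamp = datetime.datetime(2019, 3, 8).strftime('%Y%m%d')
print(stamp)  # 20190308
```

One thing to watch: random.randint(100, 999) only yields 900 distinct names, so when downloading many images two files can end up with the same name and overwrite each other.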
Copyright notice: unless otherwise stated, articles on this site are original. Reprints should credit the source: https://sulao.cn/post/526