I fought with PHP for half a day and got nowhere, and the Python code I found online didn't work either; none of it fit my image-scraping scenario. After digging through some more material I finally realized that the get method of the requests module can carry custom headers, and the response object can then be written straight out as an image file.
Sites without hotlink protection used to be easy: urllib's request.urlretrieve would save an image to disk directly. But many image sites now put their pictures on image hosts or third-party storage servers, and nginx's Referer-based hotlink protection blocks direct downloads, so you have to forge the Referer and User-Agent headers a browser would send. A minimal sketch of the idea is right below, followed by the main code of a simple script, for reference only; tweak it a little and it will scrape most sites. They're all pretty-girl photo sites, sorry about that!
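To make the contrast concrete, here is a minimal sketch of both cases; the URLs and the Referer value are placeholders, substitute the real ones:

from urllib import request
import requests

# No hotlink protection: one call saves the image (placeholder URL).
request.urlretrieve('https://example.com/pic.jpg', 'pic.jpg')

# Referer-based protection: forge the headers a browser would send
# when viewing a page on the hosting site (placeholder values).
headers = {
    'Referer': 'https://example.com/',
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 '
                  '(KHTML, like Gecko) Chrome/63.0.3239.132 Safari/537.36',
}
resp = requests.get('https://example.com/pic.jpg', headers=headers)
with open('pic.jpg', 'wb') as f:
    f.write(resp.content)  # resp.content holds the raw image bytes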
#!/usr/bin/python3
# coding:utf-8
from urllib import request
import os
import random
import re
import ssl

import requests


def getImg(html):
    # The gallery pages are served over HTTPS with certificates urllib
    # rejects, so fall back to an unverified SSL context.
    ssl._create_default_https_context = ssl._create_unverified_context
    response = request.urlopen(html).read().decode('utf-8')
    # The page title looks like '<h1 class="center">title(1/N)</h1>';
    # capture N, the total number of pages in the gallery.
    page = re.findall(r'<h1 class="center">(.*)\(1/(.*)\)</h1>', response)
    img_src = re.compile(r'<img src="(.*)" alt="(.*)" />')
    img_list = []
    i = 1
    while True:
        if i == 1:
            # Page 1 is the URL as given; we already fetched it above.
            page_html = response
        else:
            # Page i of the gallery is the same URL with "_i" before ".html".
            html_more = html.replace(".html", "_%s.html" % i)
            page_html = request.urlopen(html_more).read().decode('utf-8')
        img = re.findall(img_src, page_html)
        img_list.append(img[0][0])  # first match is the gallery image
        if i >= int(page[0][1]):
            break
        i += 1
    return img_list


def downImg(img_list):
    # Forged Referer/User-Agent so the requests pass the nginx hotlink check.
    headers = {
        'Referer': 'http://www.uumnt.cc/',
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.132 Safari/537.36'
    }
    os.makedirs("img", exist_ok=True)  # make sure the target dir exists
    for p in img_list:
        img = requests.get(p, headers=headers)
        num = random.randint(10000, 99999)  # random five-digit filename
        with open("img/%d.jpg" % num, 'wb') as f:
            f.write(img.content)


if __name__ == "__main__":
    html = r"https://www.uumtu.com/xinggan/30086.html"
    img_list = getImg(html)
    downImg(img_list)
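One caveat in the script above: random five-digit filenames can collide and silently overwrite an earlier download. A small variant of my own (the name downImgSeq and the out_dir parameter are not from the original script) names each file by its position in the list instead:

import os
import requests

def downImgSeq(img_list, out_dir="img"):
    # Same forged headers as above; sequential names cannot collide.
    headers = {
        'Referer': 'http://www.uumnt.cc/',
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.132 Safari/537.36'
    }
    os.makedirs(out_dir, exist_ok=True)
    for idx, p in enumerate(img_list, start=1):
        img = requests.get(p, headers=headers)
        # 001.jpg, 002.jpg, ... keeps the images in gallery order.
        with open(os.path.join(out_dir, "%03d.jpg" % idx), 'wb') as f:
            f.write(img.content)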
Scraper scripts for several other big image sites are already written up below; grab them, no thanks needed.