
Huaban Image Scraper

 
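Images on Huaban (huaban.com) are loaded dynamically by JavaScript, and more are loaded as you scroll down, so simply searching the page for <img> tags is not practical.
A better approach is to analyze the request URLs and fetch the data in several requests; the URL analysis here follows http://blog.chinaunix.net/uid-23500957-id-3878770.html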


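The program works as follows:
1. First, visit the home page of a board, for example http://huaban.com/boards/18484185/
2. The page source (fetched without executing any JS) usually contains the information for 20 images, stored in a JSON object like this: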
app.page["board"] = {"board_id":18484185, "user_id":16352918, "title":"可爱动漫", "description":"", "category_id":"anime", "seq":2, "pin_count":178, "follow_count":50, "like_count":1, "created_at":1415945887, "updated_at":1419412105, "deleting":0, "is_private":0, "extra":null, "user":{"user_id":16352918, "username":"爱吃饭团的小小泽", "urlname":"jbnihpdw84", "created_at":1415945122, "avatar":{"id":63028007, "farm":"farm1", "bucket":"hbimg", "key":"31ccfe5585d691cd7b6d48959397eec51daf5bae1d7b6-G3Fatz", "type":"image/jpeg", "width":720, "height":960, "frames":1}, "pin_count":3643, "board_count":67, "like_count":2, "follower_count":300, "creations_count":0, "boards_like_count":0, "following_count":83, "commodity_count":15, "profile":{"location":"", "sex":"0", "birthday":"", "job":"", "url":"", "about":""}, "status":{"emailvalid":false, "newbietask":0, "lr":1421675510, "invites":0, "share":"0"}}, "category_name":"动漫", "following":false, "liked":false, "pins":[{"pin_id":297911086, "user_id":16352918, "board_id":18484185, "file_id":65136253, "file":{"farm":"farm1", "bucket":"hbimg", "key":"8b5906dc77a84e6bdbdfda7d882378dfb3e8401724273-RFklTX", "type":"image/jpeg", "width":1024, "height":1575, "frames":1, "theme":"FAF8F0"}, "media_type":0, "source":"donmai.us", "link":"http://donmai.us/posts/1880339?tags=touhou", "raw_text":"#东方project#\n#娜兹玲#", "text_meta":{"tags":[{"start":0, "offset":11}, {"start":12, "offset":5}]}, "via":297275179, "via_user_id":6303198, "original":297275179, "created_at":1419412104, "like_count":0, "comment_count":0, "repin_count":2, "is_private":0, "orig_source":null, "hide_origin":false}, {"pin_id":297910029, "user_id":16352918, "board_id":18484185, "file_id":65137757, "file":{"farm":"farm1", "bucket":"hbimg", "key":"67f6920b0f1cc039c251b3a4d467b27464830fcc5240b-SC07Pk", "type":"image/jpeg", "width":621, "height":869, "frames":1, "theme":"F6EADB"}, "media_type":0, "source":"pixiv.net", 
"link":"http://www.pixiv.net/member_illust.php?mode=medium&illust_id=47696761", "raw_text":"#东方project#\n#今泉影狼##博丽灵梦##雾雨魔理沙#うちの子。", "text_meta":{"tags":[{"start":0, "offset":11}, {"start":12, "offset":6}, {"start":18, "offset":6}, {"start":24, "offset":7}]}, "via":297288273, "via_user_id":6303198, "original":297288273, "created_at":1419412016, "like_count":0, "comment_count":0, "repin_count":28, "is_private":0, "orig_source":null, "hide_origin":false}, ...
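Fetching the page source without running any JS can be sketched roughly as below. This is only a minimal sketch: the class name PageFetcher, the User-Agent value, and the timeouts are my own assumptions, not something taken from the attached jar.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.net.URI;
import java.net.URLConnection;
import java.nio.charset.StandardCharsets;

// Hypothetical helper: downloads the raw body of a URL as text.
class PageFetcher {
    /** Returns the response body decoded as UTF-8; no JavaScript is executed. */
    static String fetch(String url) throws IOException {
        URLConnection conn = URI.create(url).toURL().openConnection();
        // Assumption: a browser-like User-Agent; the site may or may not require one.
        conn.setRequestProperty("User-Agent", "Mozilla/5.0");
        conn.setConnectTimeout(10_000);
        conn.setReadTimeout(10_000);
        try (InputStream in = conn.getInputStream()) {
            ByteArrayOutputStream buf = new ByteArrayOutputStream();
            byte[] chunk = new byte[8192];
            int n;
            // Copy the stream chunk by chunk until end of input.
            while ((n = in.read(chunk)) != -1) {
                buf.write(chunk, 0, n);
            }
            return new String(buf.toByteArray(), StandardCharsets.UTF_8);
        }
    }
}
```

Calling PageFetcher.fetch("http://huaban.com/boards/18484185/") should then return the HTML that embeds the app.page["board"] JSON, provided the site still serves it in that form.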

3. We use the following regular expression to extract each image's pinId, its key (used to build the image URL), and its image type:
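import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Despite its name, xmlStr is the raw page source that embeds the JSON shown above.
private List<Img> parsePinsFromXml(String xmlStr) {
	List<Img> pins = new ArrayList<Img>();
	// Group 1: pin_id, group 2: file key, group 3: image subtype (e.g. jpeg)
	String pattern = "\\{\"pin_id\":(\\d+),.+?\"key\":\"(.+?)\",.\"type\":\"image/(.+?)\",";

	// Create the Pattern object
	Pattern r = Pattern.compile(pattern);

	// Now create the Matcher object
	Matcher m = r.matcher(xmlStr);
	while (m.find()) {
		Img pin = new Img();
		System.out.println(m.group()); // debug: the full matched text
		pin.setPinId(m.group(1));
		pin.setKey(m.group(2));
		pin.setType(m.group(3));
		pins.add(pin);
		System.out.println(pin.getPinId() + "," + pin.getKey() + "," + pin.getType());
	}
	return pins;
}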
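The Img class is not shown in this post. A minimal stand-in might look like the sketch below, together with a static copy of the parsing logic so the regular expression can be tried on its own. The field types (all String) and the class name PinParserDemo are assumptions for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical stand-in for the Img bean used above. */
class Img {
    private String pinId;
    private String key;
    private String type;

    String getPinId() { return pinId; }
    void setPinId(String pinId) { this.pinId = pinId; }
    String getKey() { return key; }
    void setKey(String key) { this.key = key; }
    String getType() { return type; }
    void setType(String type) { this.type = type; }
}

class PinParserDemo {
    // Same expression as above: pin_id, file key, image subtype.
    private static final Pattern PIN = Pattern.compile(
            "\\{\"pin_id\":(\\d+),.+?\"key\":\"(.+?)\",.\"type\":\"image/(.+?)\",");

    /** Collects one Img per pin found in the page source. */
    static List<Img> parse(String source) {
        List<Img> pins = new ArrayList<>();
        Matcher m = PIN.matcher(source);
        while (m.find()) {
            Img pin = new Img();
            pin.setPinId(m.group(1));
            pin.setKey(m.group(2));
            pin.setType(m.group(3));
            pins.add(pin);
        }
        return pins;
    }
}
```

Note that the single "." between the key and "type" expects exactly one character (the space after the comma in the JSON above), and that "." does not match line breaks by default, so the expression relies on each pin sitting on one line in that layout.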

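4. A request usually returns 20 pins. We take the pinId of the last one and request the URL below to get the next 20 pins, repeating until all pins have been collected or no more can be fetched:
url = this.boardUrl + "?max=" + img.getPinId() + "&limit=20&wfl=1";

where boardUrl = "http://huaban.com/boards/18484185/"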
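The paging loop of step 4 could be sketched like this. The fetcher is passed in as a function so the loop can be tried without touching the network; BoardPager, collectAll, and the maxPages safety cap are hypothetical names of mine, and I assume the follow-up responses contain pins in the same {"pin_id":...} form as the first page.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class BoardPager {
    private static final Pattern PIN_ID = Pattern.compile("\\{\"pin_id\":(\\d+),");

    /** Extracts all pin ids, in order of appearance, from one response. */
    static List<String> pinIds(String source) {
        List<String> ids = new ArrayList<>();
        Matcher m = PIN_ID.matcher(source);
        while (m.find()) {
            ids.add(m.group(1));
        }
        return ids;
    }

    /** Builds the URL for the page that follows the given last pin. */
    static String nextPageUrl(String boardUrl, String lastPinId) {
        return boardUrl + "?max=" + lastPinId + "&limit=20&wfl=1";
    }

    /** Follows the pages until one comes back without pins, or maxPages is hit. */
    static List<String> collectAll(String boardUrl, Function<String, String> fetcher, int maxPages) {
        List<String> all = new ArrayList<>();
        String url = boardUrl;
        for (int page = 0; page < maxPages; page++) {
            List<String> ids = pinIds(fetcher.apply(url));
            if (ids.isEmpty()) {
                break; // nothing more to fetch
            }
            all.addAll(ids);
            url = nextPageUrl(boardUrl, ids.get(ids.size() - 1));
        }
        return all;
    }
}
```

The cap on the number of pages is only there so that an unexpected response cannot make the loop run forever.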
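5. At this point we have the key of every image,
for example 31ccfe5585d691cd7b6d48959397eec51daf5bae1d7b6-G3Fatz
Prepending "http://img.hb.aicdn.com/" and appending "_fw658" to the key gives the URL of the large image.
6. The rest needs little explanation: download each image from its URL to the target location.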
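As a small sketch of step 5, the URL can be assembled with plain string concatenation (ImageUrls and largeImageUrl are hypothetical names; the prefix and suffix are the ones given above):

```java
class ImageUrls {
    static final String PREFIX = "http://img.hb.aicdn.com/";
    static final String SUFFIX = "_fw658";

    /** Wraps an image key with the host prefix and the size suffix. */
    static String largeImageUrl(String key) {
        return PREFIX + key + SUFFIX;
    }
}
```

For the example key this yields http://img.hb.aicdn.com/31ccfe5585d691cd7b6d48959397eec51daf5bae1d7b6-G3Fatz_fw658.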
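One possible way to do the download in step 6 is to stream the URL straight into a file with java.nio. The sketch below is an assumption about how it could be done, not the code in the attached jar; in particular, naming each file <pinId>.<type> is my own choice.

```java
import java.io.IOException;
import java.io.InputStream;
import java.net.URI;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

class ImageDownloader {
    /** Saves the resource at imageUrl as targetDir/pinId.type and returns that path. */
    static Path download(String imageUrl, Path targetDir, String pinId, String type)
            throws IOException {
        Files.createDirectories(targetDir); // no-op if the directory already exists
        Path target = targetDir.resolve(pinId + "." + type);
        try (InputStream in = URI.create(imageUrl).toURL().openStream()) {
            // Overwrite a previous download of the same pin, if any.
            Files.copy(in, target, StandardCopyOption.REPLACE_EXISTING);
        }
        return target;
    }
}
```

Using the pinId in the file name keeps names unique within a board, since every pin has its own id.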

The images that were finally downloaded:


The attached jar needs JRE 8 to run.