crawl

demo

[{"descr":"Train information","name":"train"}
,{"descr":"Novel reader","name":"pdnovel"}
,{"descr":"Mzitu pictures","name":"mzitu"}
,{"descr":"Layui","name":"layui"}
,{"descr":"Administrative divisions","name":"district"}]

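Script example. The url variable has the form html: followed by the page address. Select child elements with nth-child(n). Prefer substring(from,to), which takes an end index, over substr(from,len), which takes a length. urls holds the next-level URLs, and data[key]=json holds the extracted data. Useful cheerio methods: attr, text, children, first, eq(n), last, find, next, each.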
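The difference between substring and substr mentioned in the notes is easy to get wrong, so here is a minimal sketch of it. The sample url value is an assumption, built from the html: form described above and one of the district addresses on this page.

```javascript
// Sample value in the form the notes describe: "html:" followed by the page address.
const url = 'html:http://www.stats.gov.cn/tjsj/tjbz/tjyqhdmhcxhfdm/2020/51.html';

const from = url.lastIndexOf('/') + 1; // index just past the last slash
const to = url.lastIndexOf('.');       // index of the dot before "html"

// substring(from, to): the second argument is an end index (exclusive).
const bySubstring = url.substring(from, to);

// substr(from, len): the second argument is a length, so the end index
// has to be converted to a length first.
const bySubstr = url.substr(from, to - from);

// Passing the end index to substr by mistake reads to the end of the string.
const mistaken = url.substr(from, to);

console.log(bySubstring); // 51
console.log(bySubstr);    // 51
console.log(mistaken);    // 51.html
```

Since substr is a legacy method, using substring throughout a script avoids mixing the two conventions.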

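urls=[]   // next-level URLs to crawl
data={}   // extracted records, keyed by storage key
// drop the html: marker and keep the address up to and including the last slash
prefix=url.substring(url.indexOf(':')+1,url.lastIndexOf('/')+1)
// file name of the current page without its extension
province=url.substring(url.lastIndexOf('/')+1,url.lastIndexOf('.'))
// each link in the second column of a tr.citytr row
$('tr.citytr td:nth-child(2) a').each(function(idx,a){
  href=$(this).attr('href')
  urls.push(prefix+href)
  // code is the part of href between the first slash and the extension
  code=href.substring(href.indexOf('/')+1,href.lastIndexOf('.'))
  name=$(this).text().trim()
  data['hset_province'+province+'_'+code]={'name':name}
})
// result: next-level URLs plus extracted data
crawl = {'urls':urls,data:data}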
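To follow what the script produces without running a crawl, its string handling can be tried on plain data. This is a rough sketch only: buildCrawlResult is a hypothetical helper, the links are passed in as objects instead of being read from the page with cheerio, and the link text is made up.

```javascript
// Hypothetical helper mirroring the script above, with the link list passed
// in as plain objects instead of being selected from the page.
function buildCrawlResult(url, links) {
  // Drop the "html:" marker and keep everything up to the last slash.
  const prefix = url.substring(url.indexOf(':') + 1, url.lastIndexOf('/') + 1);
  // File name without extension, e.g. "51" for ".../2020/51.html".
  const province = url.substring(url.lastIndexOf('/') + 1, url.lastIndexOf('.'));

  const urls = [];
  const data = {};
  for (const { href, text } of links) {
    urls.push(prefix + href);
    const code = href.substring(href.indexOf('/') + 1, href.lastIndexOf('.'));
    data['hset_province' + province + '_' + code] = { name: text.trim() };
  }
  return { urls, data };
}

// Sample input; the link text is invented for illustration.
const result = buildCrawlResult(
  'html:http://www.stats.gov.cn/tjsj/tjbz/tjyqhdmhcxhfdm/2020/51.html',
  [{ href: '51/5116.html', text: ' Sample City ' }]
);
console.log(JSON.stringify(result, null, 2));
```

With this input the single next-level URL ends in 2020/51/5116.html and the record is stored under the key hset_province51_5116.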

Data processing

redis-cli -n 1 keys "property:city*" | xargs -t -I{} redis-cli -n 1 del "{}"
runStep('district',0)  // when testing, check the log; after a successful run, clean up the data, then run the full crawl

pdnovel

pdnovel: novel reader crawl

[{"descr":"Fetch pagination","level":"1","name":"page","url":"http://bbs.xlongwei.com/pdnovel.php?mod=list"}
,{"descr":"Fetch books","level":"2","name":"novel","url":"http://bbs.xlongwei.com/pdnovel.php?mod=list"}
,{"descr":"Fetch chapters","level":"3","name":"chapter","url":"http://bbs.xlongwei.com/pdnovel.php?mod=chapter&novelid=48"}
,{"descr":"Fetch content","level":"4","name":"content","url":"http://bbs.xlongwei.com/pdnovel.php?mod=read&novelid=48&chapterid=15435"}]
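The level 3 and level 4 sample URLs identify a book and a chapter through their query parameters. As a small sketch, assuming Node.js with its built-in URL class, the parameters can be read like this. queryParams is a hypothetical helper and is not part of the crawler.

```javascript
// Step URLs copied from the pdnovel configuration above.
const stepUrls = {
  chapter: 'http://bbs.xlongwei.com/pdnovel.php?mod=chapter&novelid=48',
  content: 'http://bbs.xlongwei.com/pdnovel.php?mod=read&novelid=48&chapterid=15435',
};

// Hypothetical helper: return the query parameters of a URL as a plain object.
function queryParams(address) {
  return Object.fromEntries(new URL(address).searchParams);
}

const chapterParams = queryParams(stepUrls.chapter);
const contentParams = queryParams(stepUrls.content);
console.log(chapterParams); // { mod: 'chapter', novelid: '48' }
console.log(contentParams); // { mod: 'read', novelid: '48', chapterid: '15435' }
```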

trains

trains: train information crawl

[{"level":"1","name":"train","descr":"Fetch province links","url":"http://qq.ip138.com/train/"}
,{"level":"2","name":"province","descr":"Fetch city links","url":"http://qq.ip138.com/train/anhui/"}
,{"level":"3","name":"city","descr":"Fetch train links","url":"http://qq.ip138.com/train/anhui/AnQing.htm"}
,{"level":"4","name":"line","descr":"Fetch station information","url":"http://qq.ip138.com/train/D5601.htm"}]

mzitu

mzitu: image URL crawl

[{"level":"1","name":"page","descr":"Fetch all page numbers","url":"http://www.mzitu.com/"}
,{"level":"2","name":"img","descr":"Fetch images on a single page","url":"https://www.mzitu.com/page/2/"}
,{"level":"3","name":"page2","descr":"Fetch full-size image pagination","url":"https://www.mzitu.com/181419/"}
,{"level":"4","name":"img2","descr":"Fetch full-size image links","url":"https://www.mzitu.com/181419/50"}]

district

district: national administrative divisions, 2020

[{"descr":"Fetch level 1 list","level":"1","name":"level1","url":"http://www.stats.gov.cn/tjsj/tjbz/tjyqhdmhcxhfdm/2020/index.html"}
,{"descr":"Fetch level 2 list","level":"2","name":"level2","url":"http://www.stats.gov.cn/tjsj/tjbz/tjyqhdmhcxhfdm/2020/51.html"}
,{"descr":"Fetch level 3 list","level":"3","name":"level3","url":"http://www.stats.gov.cn/tjsj/tjbz/tjyqhdmhcxhfdm/2020/51/5116.html"}
,{"descr":"Fetch level 4 list","level":"4","name":"level4","url":"http://www.stats.gov.cn/tjsj/tjbz/tjyqhdmhcxhfdm/2020/51/16/511623.html"}
,{"descr":"Fetch level 5 list","level":"5","name":"level5","url":"http://www.stats.gov.cn/tjsj/tjbz/tjyqhdmhcxhfdm/2020/51/16/23/511623113.html"}]
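The sample URLs for levels 2 to 5 appear to follow one pattern: each level above the page becomes a two-digit directory, and the file name is the full code. The sketch below encodes that pattern. It is inferred only from the four sample URLs above, and districtPageUrl is a hypothetical helper, so treat it as an illustration, not a description of the site.

```javascript
const BASE = 'http://www.stats.gov.cn/tjsj/tjbz/tjyqhdmhcxhfdm/2020/';

// Hypothetical helper: build the page URL for a division code, following
// the pattern visible in the sample URLs.
function districtPageUrl(code) {
  // Leading digits that become directories: 2 -> 0, 4 -> 2, 6 -> 4, 9 -> 6.
  const dirDigits = Math.min(code.length - 2, 6);
  const dirs = [];
  for (let i = 0; i < dirDigits; i += 2) {
    dirs.push(code.substring(i, i + 2));
  }
  // Directories first, then the full code as the file name.
  return BASE + dirs.concat(code + '.html').join('/');
}

for (const code of ['51', '5116', '511623', '511623113']) {
  console.log(districtPageUrl(code));
}
```

The four printed addresses match the level 2 to level 5 url values in the configuration above.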
web/crawler.txt · Last modified: 2021/03/21 14:11 by admin