A First Taste of Project Code
Published: 2019-06-28


Influenced by my senior Tiantian, I fell into the Python rabbit hole. After idling away more than half a day plus an evening, I finally got a crawler working. I get the feeling my NOIP prep is doomed qaq

I learned it from a course on imooc. I originally wanted to crawl the Baidu Baike entries about OI, but I was naive and the rule I used for picking target links was far too crude, so the crawl wandered off topic right from the start…
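In hindsight, one way to keep the crawl on topic (my own guess, not something from the course) would be to filter candidate links by their anchor text before handing them to the URL manager. A minimal sketch, assuming Baidu Baike's /item/ URL scheme from the root URL (the /view/ pattern in the parser below is the course's original), with a hypothetical get_topic_urls helper and keyword list of my own:

# -*- coding: utf-8 -*-
# Hypothetical helper (not part of the spider below): keep only links whose
# anchor text looks OI-related, instead of following every /item/ link.
import re
import urlparse

KEYWORDS = [u'信息学', u'OI', u'竞赛']   # assumed keyword list, tune as needed

def get_topic_urls(page_url, soup):
    new_urls = set()
    for link in soup.find_all('a', href=re.compile(r"/item/")):
        text = link.get_text()
        if any(key in text for key in KEYWORDS):
            new_urls.add(urlparse.urljoin(page_url, link['href']))
    return new_urls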

The resulting output page is pretty ugly too; I'll build a nicer one next time OvO

 

I picked up a lot of experience along the way. The most useful lesson: write the except for failed crawls in the main method as

except Exception as f:
    print 'crew failed: ', f

so that it prints roughly where and why a crawl failed.
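If the one-line message isn't enough, a possible next step (my addition, not from the course) is the standard-library traceback module, which prints the full stack trace instead of just the exception text. A small sketch with a hypothetical download_or_report wrapper:

import traceback

def download_or_report(downloader, url):
    # Same idea as the except block in craw() below, plus the failing file/line.
    try:
        return downloader.download(url)
    except Exception as f:
        print 'crew failed: ', f
        traceback.print_exc()   # full stack trace of the failure
        return None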

 

Some typos in Python don't produce any error until the code actually runs.

I still regularly make the classic beginner mistake: leaving out self or misspelling it.
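For example (a contrived snippet of mine, not from the post), a misspelled attribute parses fine and only fails when that line actually executes:

class Demo(object):
    def __init__(self):
        self.downloader = 'ready'

    def run(self):
        # Typo: 'downloder' instead of 'downloader'. Python happily accepts this
        # at definition time; it only raises AttributeError when run() executes.
        print self.downloder

Demo().run()   # AttributeError: 'Demo' object has no attribute 'downloder'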

Here is the crawler's code (copied almost verbatim from the imooc template 一,一):

Main

import url_manager, html_downloader, html_parser, html_outputer

class SpiderMain(object):
    def __init__(self):
        self.urls = url_manager.UrlManager()
        self.downloader = html_downloader.HtmlDownloader()
        self.parser = html_parser.HtmlParser()
        self.outputer = html_outputer.HtmlOutputer()

    def craw(self, root_url):
        count = 1
        self.urls.add_new_url(root_url)
        while self.urls.has_new_url():
            try:
                new_url = self.urls.get_new_url()
                print 'craw %d : %s' % (count, new_url)
                html_cont = self.downloader.download(new_url)
                new_urls, new_data = self.parser.parse(new_url, html_cont)
                self.urls.add_new_urls(new_urls)
                self.outputer.collect_data(new_data)
            except Exception as f:  # printing the exception shows roughly why a crawl failed
                print 'crew failed: ', f

            if count == 100:
                break

            count = count + 1

        self.outputer.output_html()


if __name__ == "__main__":
    root_url = "http://baike.baidu.com/item/oi/74020"
    boj_spider = SpiderMain()
    boj_spider.craw(root_url)


Outputer (html_outputer.py)

class HtmlOutputer(object):
    def __init__(self):
        self.datas = []

    def collect_data(self, data):
        if data is None:
            return
        self.datas.append(data)

    def output_html(self):
        fout = open('output.html', 'w')

        fout.write("<html>")
        fout.write("<body>")
        fout.write("<table>")

        for data in self.datas:
            fout.write("<tr>")
            fout.write("<td>%s</td>" % data['url'])
            fout.write("<td>%s</td>" % data['title'].encode('utf-8'))
            fout.write("<td>%s</td>" % data['summary'].encode('utf-8'))  # old error: forgot .encode('utf-8')
            fout.write("</tr>")

        fout.write("</table>")
        fout.write("</body>")
        fout.write("</html>")

        fout.close()


URL manager (url_manager.py)

class UrlManager(object):

    def __init__(self):
        self.new_urls = set()
        self.old_urls = set()

    def add_new_url(self, url):
        if url is None:
            return
        if url not in self.new_urls and url not in self.old_urls:
            self.new_urls.add(url)

    def add_new_urls(self, urls):
        if urls is None or len(urls) == 0:
            return
        for url in urls:
            self.add_new_url(url)

    def has_new_url(self):
        return len(self.new_urls) != 0

    def get_new_url(self):
        new_url = self.new_urls.pop()
        self.old_urls.add(new_url)
        return new_url


Parser (html_parser.py)

from bs4 import BeautifulSoup
import re
import urlparse

class HtmlParser(object):

    def _get_new_urls(self, page_url, soup):  # old error: forgot self
        new_urls = set()

        links = soup.find_all('a', href=re.compile(r"/view/\d+\.htm"))  # old error: typed re.complie
        for link in links:
            new_url = link['href']
            new_full_url = urlparse.urljoin(page_url, new_url)
            new_urls.add(new_full_url)

        return new_urls

    def _get_new_data(self, page_url, soup):
        res_data = {}

        # url
        res_data['url'] = page_url

        # title lives in <dd class="lemmaWgt-lemmaTitle-title"><h1>...</h1>
        title_node = soup.find('dd', class_="lemmaWgt-lemmaTitle-title").find("h1")
        res_data['title'] = title_node.get_text()

        # summary lives in <div class="lemma-summary">
        summary_node = soup.find('div', class_="lemma-summary")
        res_data['summary'] = summary_node.get_text()

        return res_data

    def parse(self, page_url, html_cont):
        if page_url is None or html_cont is None:
            return

        soup = BeautifulSoup(html_cont, 'html.parser', from_encoding='utf-8')
        new_urls = self._get_new_urls(page_url, soup)
        new_data = self._get_new_data(page_url, soup)  # old error: assigned to new_urls here
        return new_urls, new_data


Downloader (html_downloader.py)

import urllib2

class HtmlDownloader(object):

    def download(self, url):
        if url is None:
            return None

        response = urllib2.urlopen(url)

        if response.getcode() != 200:
            return None

        return response.read()
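The listing above is Python 2 (print statements, urllib2, urlparse). For anyone trying it on Python 3, a minimal sketch of the standard-library equivalents for the downloader and the URL joining; the rest of the structure can stay as is:

# Python 3 sketch: urllib2 -> urllib.request, urlparse.urljoin -> urllib.parse.urljoin
import urllib.request
import urllib.parse

class HtmlDownloader(object):
    def download(self, url):
        if url is None:
            return None
        response = urllib.request.urlopen(url)
        if response.getcode() != 200:
            return None
        return response.read()

# in the parser: urllib.parse.urljoin(page_url, new_url)
# and every print statement becomes a print() call, e.g. print('craw %d : %s' % (count, new_url))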

 

Reposted from: https://www.cnblogs.com/JSL2018/p/6067543.html
