A First Taste of Project Code
Published: 2019-06-28


Influenced by my senior Tiantian, I fell into the Python rabbit hole. After idling away more than half a day plus an evening, I finally got a crawler working. I get the feeling my NOIP prep is doomed qaq

I learned it from a course on imooc. I originally wanted to crawl the Baidu Baike entries about OI, but I was naive and the rule I used for picking target links was far too crude, so the crawl wandered off topic right from the start…
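In hindsight, one way to keep the crawl on topic (my own guess, not something from the course) would be to filter candidate links by their anchor text before handing them to the URL manager. A minimal sketch, assuming Baidu Baike's /item/ URL scheme from the root URL (the /view/ pattern in the parser below is the course's original), with a hypothetical get_topic_urls helper and keyword list of my own:

# -*- coding: utf-8 -*-
# Hypothetical helper (not part of the spider below): keep only links whose
# anchor text looks OI-related, instead of following every /item/ link.
import re
import urlparse

KEYWORDS = [u'信息学', u'OI', u'竞赛']   # assumed keyword list, tune as needed

def get_topic_urls(page_url, soup):
    new_urls = set()
    for link in soup.find_all('a', href=re.compile(r"/item/")):
        text = link.get_text()
        if any(key in text for key in KEYWORDS):
            new_urls.add(urlparse.urljoin(page_url, link['href']))
    return new_urls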

The resulting output page is pretty ugly too; I'll build a nicer one next time OvO

 

I picked up a lot of experience along the way. The most useful lesson: write the except for failed crawls in the main method as

except Exception as f:
    print 'crew failed: ', f

so that it prints roughly where and why a crawl failed.
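If the one-line message isn't enough, a possible next step (my addition, not from the course) is the standard-library traceback module, which prints the full stack trace instead of just the exception text. A small sketch with a hypothetical download_or_report wrapper:

import traceback

def download_or_report(downloader, url):
    # Same idea as the except block in craw() below, plus the failing file/line.
    try:
        return downloader.download(url)
    except Exception as f:
        print 'crew failed: ', f
        traceback.print_exc()   # full stack trace of the failure
        return None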

 

Some typos in Python don't produce any error until the code actually runs.

I still regularly make the classic beginner mistake: leaving out self or misspelling it.
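For example (a contrived snippet of mine, not from the post), a misspelled attribute parses fine and only fails when that line actually executes:

class Demo(object):
    def __init__(self):
        self.downloader = 'ready'

    def run(self):
        # Typo: 'downloder' instead of 'downloader'. Python happily accepts this
        # at definition time; it only raises AttributeError when run() executes.
        print self.downloder

Demo().run()   # AttributeError: 'Demo' object has no attribute 'downloder'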

Here is the crawler's code (copied almost verbatim from the imooc template 一,一):

Main

import url_manager, html_downloader, html_parser, html_outputer

class SpiderMain(object):
    def __init__(self):
        self.urls = url_manager.UrlManager()
        self.downloader = html_downloader.HtmlDownloader()
        self.parser = html_parser.HtmlParser()
        self.outputer = html_outputer.HtmlOutputer()

    def craw(self, root_url):
        count = 1
        self.urls.add_new_url(root_url)
        while self.urls.has_new_url():
            try:
                new_url = self.urls.get_new_url()
                print 'craw %d : %s' % (count, new_url)
                html_cont = self.downloader.download(new_url)
                new_urls, new_data = self.parser.parse(new_url, html_cont)
                self.urls.add_new_urls(new_urls)
                self.outputer.collect_data(new_data)
            except Exception as f:  # printing the exception shows roughly why a crawl failed
                print 'crew failed: ', f

            if count == 100:
                break

            count = count + 1

        self.outputer.output_html()


if __name__ == "__main__":
    root_url = "http://baike.baidu.com/item/oi/74020"
    boj_spider = SpiderMain()
    boj_spider.craw(root_url)


Outputer (html_outputer.py)

class HtmlOutputer(object):
    def __init__(self):
        self.datas = []

    def collect_data(self, data):
        if data is None:
            return
        self.datas.append(data)

    def output_html(self):
        fout = open('output.html', 'w')

        fout.write("<html>")
        fout.write("<body>")
        fout.write("<table>")

        for data in self.datas:
            fout.write("<tr>")
            fout.write("<td>%s</td>" % data['url'])
            fout.write("<td>%s</td>" % data['title'].encode('utf-8'))
            fout.write("<td>%s</td>" % data['summary'].encode('utf-8'))  # old error: forgot .encode('utf-8')
            fout.write("</tr>")

        fout.write("</table>")
        fout.write("</body>")
        fout.write("</html>")

        fout.close()


URL manager (url_manager.py)

class UrlManager(object):

    def __init__(self):
        self.new_urls = set()
        self.old_urls = set()

    def add_new_url(self, url):
        if url is None:
            return
        if url not in self.new_urls and url not in self.old_urls:
            self.new_urls.add(url)

    def add_new_urls(self, urls):
        if urls is None or len(urls) == 0:
            return
        for url in urls:
            self.add_new_url(url)

    def has_new_url(self):
        return len(self.new_urls) != 0

    def get_new_url(self):
        new_url = self.new_urls.pop()
        self.old_urls.add(new_url)
        return new_url


Parser (html_parser.py)

from bs4 import BeautifulSoup
import re
import urlparse

class HtmlParser(object):

    def _get_new_urls(self, page_url, soup):  # old error: forgot self
        new_urls = set()

        links = soup.find_all('a', href=re.compile(r"/view/\d+\.htm"))  # old error: typed re.complie
        for link in links:
            new_url = link['href']
            new_full_url = urlparse.urljoin(page_url, new_url)
            new_urls.add(new_full_url)

        return new_urls

    def _get_new_data(self, page_url, soup):
        res_data = {}

        # url
        res_data['url'] = page_url

        # title lives in <dd class="lemmaWgt-lemmaTitle-title"><h1>...</h1>
        title_node = soup.find('dd', class_="lemmaWgt-lemmaTitle-title").find("h1")
        res_data['title'] = title_node.get_text()

        # summary lives in <div class="lemma-summary">
        summary_node = soup.find('div', class_="lemma-summary")
        res_data['summary'] = summary_node.get_text()

        return res_data

    def parse(self, page_url, html_cont):
        if page_url is None or html_cont is None:
            return

        soup = BeautifulSoup(html_cont, 'html.parser', from_encoding='utf-8')
        new_urls = self._get_new_urls(page_url, soup)
        new_data = self._get_new_data(page_url, soup)  # old error: assigned to new_urls here
        return new_urls, new_data


Downloader (html_downloader.py)

import urllib2

class HtmlDownloader(object):

    def download(self, url):
        if url is None:
            return None

        response = urllib2.urlopen(url)

        if response.getcode() != 200:
            return None

        return response.read()
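The listing above is Python 2 (print statements, urllib2, urlparse). For anyone trying it on Python 3, a minimal sketch of the standard-library equivalents for the downloader and the URL joining; the rest of the structure can stay as is:

# Python 3 sketch: urllib2 -> urllib.request, urlparse.urljoin -> urllib.parse.urljoin
import urllib.request
import urllib.parse

class HtmlDownloader(object):
    def download(self, url):
        if url is None:
            return None
        response = urllib.request.urlopen(url)
        if response.getcode() != 200:
            return None
        return response.read()

# in the parser: urllib.parse.urljoin(page_url, new_url)
# and every print statement becomes a print() call, e.g. print('craw %d : %s' % (count, new_url))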

 

Reposted from: https://www.cnblogs.com/JSL2018/p/6067543.html
