Mastering Python Web Crawlers: Basic Usage from Chapters 4 and 5

Bkfish

2019-03-21

Web crawlers

1. Python 2.x versus Python 3.x

Python 2.x	Python 3.x
import urllib2	import urllib.request, urllib.error
import urllib	import urllib.request, urllib.error, urllib.parse
import urlparse	import urllib.parse
urllib2.urlopen	urllib.request.urlopen
urllib.urlencode	urllib.parse.urlencode
urllib.quote	urllib.parse.quote
cookielib.CookieJar	http.cookiejar.CookieJar
urllib2.Request	urllib.request.Request

1.1 Quick start

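# Import the module
>>> import urllib.request
# Open the page
>>> file=urllib.request.urlopen("http://baidu.com")
# Read the first line (bytes)
>>> dataline=file.readline()
# Read the rest of the body into a single bytes object
>>> data=file.read()
# Alternative to read(): file.readlines() returns the remaining lines as a list.
# The response is a stream, so whatever one call consumes is gone for the next.
>>> print(dataline)
>>> print(data)
# Response headers, HTTP status code, and the URL that was actually fetched
>>> print(file.info())
>>> print(file.getcode())
>>> print(file.geturl())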

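The URL standard generally allows only a subset of ASCII characters, such as digits, letters and some symbols. Special characters, for example Chinese characters, ":" or "&", must be percent-encoded. Encoding and decoding look like this:

urllib.parse.quote("http://www.baidu.com")
urllib.parse.unquote("http%3A//www.baidu.com")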
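As a small illustration of the encoding step, the sketch below percent-encodes a string containing Chinese characters and decodes it again with urllib.parse.quote and urllib.parse.unquote. The sample keyword and the variable names are assumptions chosen for the example, not part of the original notes.

```python
from urllib.parse import quote, unquote

keyword = "中文"            # sample text, chosen only for illustration
encoded = quote(keyword)    # the UTF-8 bytes of the text are percent-encoded
decoded = unquote(encoded)  # reverses the encoding

print(encoded)
print(decoded)

# "/" is treated as safe by default, so in a full URL only ":" is escaped here.
encoded_url = quote("http://www.baidu.com")
print(encoded_url)
```

If a whole URL should stay readable, passing safe=":/" to quote keeps the scheme separator unescaped as well.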

Saving to a local file, method 1

# data is bytes, so the file must be opened in binary write mode
>>> fhandle=open("demo.html","wb")
>>> fhandle.write(data)
>>> fhandle.close()

Saving to a local file, method 2

filename=urllib.request.urlretrieve("http://www.baidu.com",filename="demo.html")

1.2 Imitating a browser: the Headers attribute

Method 1: modify the headers with build_opener()

import urllib.request
url="http://blog.csdn.net/weiwei_pig/article/details/51178226"
headers=("User-Agent","Mozilla/5.0 (Linux; Android 6.0; Nexus 5 Build/MRA58N) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/59.0.3071.115 Mobile Safari/537.36")
opener = urllib.request.build_opener()
opener.addheaders = [headers]
data = opener.open(url).read()
print(data)

Method 2: add a header with add_header()

import urllib.request
url="http://blog.csdn.net/weiwei_pig/article/details/51178226"
req=urllib.request.Request(url)
req.add_header("User-Agent","Mozilla/5.0 (Linux; Android 6.0; Nexus 5 Build/MRA58N) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/59.0.3071.115 Mobile Safari/537.36")
data = urllib.request.urlopen(req).read()
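To check what add_header() actually stored before sending anything, the Request object can be inspected locally. The following is a minimal sketch that runs without network access; the shortened User-Agent string is a placeholder assumption, not a real browser value.

```python
import urllib.request

url = "http://blog.csdn.net/weiwei_pig/article/details/51178226"
user_agent = "Mozilla/5.0 (illustrative value)"  # placeholder, not a real UA

req = urllib.request.Request(url)
req.add_header("User-Agent", user_agent)

# urllib stores header names via str.capitalize(), so the key is "User-agent".
print(req.has_header("User-agent"))
print(req.get_header("User-agent"))
print(req.header_items())
```

Note the capitalization: looking the header up as "User-Agent" would not find it, because the stored key is "User-agent".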

1.3 Setting a timeout (GitHub as the example)

import urllib.request
for i in range(1,100):
    try:
        # Give up if the server does not respond within 1 second
        file = urllib.request.urlopen("https://github.com/",timeout=1)
        data = file.read()
        print(len(data))
    except Exception as e:
        print("Exception raised --> "+str(e))

1.4 POST requests

import urllib.request
import urllib.parse
url = "http://www.iqianyue.com/mypost"
postdata = urllib.parse.urlencode({
"name":"Kitty",
"pass":"haha"
}).encode("utf-8") 
req = urllib.request.Request(url,postdata)
req.add_header("User-Agent","Mozilla/5.0 (Linux; Android 6.0; Nexus 5 Build/MRA58N) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/59.0.3071.115 Mobile Safari/537.36")
data = urllib.request.urlopen(req).read().decode('utf-8')
print(data)
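To see what is actually sent as the request body, the encoding step can be run on its own. This sketch reuses the form fields from the example above and only builds the Request object, so nothing is sent over the network.

```python
import urllib.parse
import urllib.request

form = {"name": "Kitty", "pass": "haha"}

encoded = urllib.parse.urlencode(form)  # str: key=value pairs joined by "&"
postdata = encoded.encode("utf-8")      # urlopen expects bytes for the body
print(encoded)
print(postdata)

# Supplying data switches the request method from GET to POST.
req = urllib.request.Request("http://www.iqianyue.com/mypost", postdata)
print(req.get_method())
```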

2. Exception handling

2.1 Enabling DebugLog

import urllib.request
httphd=urllib.request.HTTPHandler(debuglevel=1)
httpshd=urllib.request.HTTPSHandler(debuglevel=1)
opener=urllib.request.build_opener(httphd,httpshd)
urllib.request.install_opener(opener)
data=urllib.request.urlopen("http://edu.51cto.com")
print(data)

2.2 URLError

import urllib.request
import urllib.error
try:
    file=urllib.request.urlopen("http://www.baiduddd.com")
    print(file.read())
except urllib.error.URLError as e:
    if hasattr(e,"code"):
        print(e.code)
    if hasattr(e,"reason"):
        print(e.reason)
# Alternative: separate handlers. HTTPError is a subclass of URLError, so it must come first.
# except urllib.error.HTTPError as e:
#     print(e.code)
#     print(e.reason)
# except urllib.error.URLError as e:
#     print(e.reason)
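The hasattr() checks work because urllib.error.HTTPError is a subclass of urllib.error.URLError that additionally carries a status code. The sketch below makes that relationship explicit. The describe_error helper and the hand-built error objects are assumptions for illustration, constructed so the example runs without network access.

```python
import urllib.error
from email.message import Message


def describe_error(err):
    """Describe an error, checking HTTPError first because it subclasses URLError."""
    if isinstance(err, urllib.error.HTTPError):
        return "HTTP error {}: {}".format(err.code, err.reason)
    if isinstance(err, urllib.error.URLError):
        return "URL error: {}".format(err.reason)
    return "other error: {}".format(err)


# Synthetic errors, created by hand instead of by a failing request.
http_err = urllib.error.HTTPError(
    "http://example.com/missing", 404, "Not Found", Message(), None
)
url_err = urllib.error.URLError("name resolution failed")

print(issubclass(urllib.error.HTTPError, urllib.error.URLError))
print(describe_error(http_err))
print(describe_error(url_err))
```

If the two isinstance checks were swapped, every HTTPError would be reported as a plain URLError, which is the same reason the more specific except clause has to come first.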

3. Using regular expressions

If regular expressions are new to you, read 正则表达式必知必会 first.

3.1 Commonly used regular expression functions: match, search, findall, sub

1. The re.match() function

Syntax:

re.match(pattern, string, flags=0)

def match(pattern, string, flags=0):
    """Try to apply the pattern at the start of the string, returning
    a match object, or None if no match was found."""
    return _compile(pattern, flags).match(string)

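Parameters:

pattern: the regular expression to match
string: the string to match against
flags: flag bits that control how matching is done (case sensitivity, multi-line matching, and so on)
Purpose: match() tries the pattern only at the start of the string, that is, from position 0. On success it returns a match object; if the start of the string does not fit the pattern, the match fails and the function returns None.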
import re
test = 'http://news.163.com/17/0624/10/CNMHVBJP0001899N.html'
print(re.match(r'http',test)) # <_sre.SRE_Match object; span=(0, 4), match='http'>
print(re.match(r'news',test)) # None
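The flags parameter is easiest to understand by toggling it. The sketch below reuses the test URL from the example above. The uppercase pattern and the two-line sample string are assumptions added for illustration.

```python
import re

test = 'http://news.163.com/17/0624/10/CNMHVBJP0001899N.html'

# Without flags the match is case sensitive, so 'HTTP' does not match 'http'.
case_sensitive = re.match(r'HTTP', test)
case_insensitive = re.match(r'HTTP', test, flags=re.IGNORECASE)
print(case_sensitive)
print(case_insensitive.group())

# re.MULTILINE lets ^ match at the start of every line, not only the string.
lines = "first\nsecond"
single = re.findall(r'^\w+', lines)
multi = re.findall(r'^\w+', lines, flags=re.MULTILINE)
print(single)
print(multi)
```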

2. The re.search() function

Syntax:

re.search(pattern, string, flags=0)

def search(pattern, string, flags=0):
    """Scan through string looking for a match to the pattern, returning
    a match object, or None if no match was found."""
    return _compile(pattern, flags).search(string)

re.search() scans the whole string and stops at the first match; if nothing in the string matches, it returns None.

import re
test = 'I am a loving child to learn.'
print(re.search(r'I',test)) # <_sre.SRE_Match object; span=(0, 1), match='I'>
print(re.search(r'learn',test)) # <_sre.SRE_Match object; span=(23, 28), match='learn'>
print(re.search(r'alina',test)) # None
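The difference between the two functions shows most clearly when the same pattern is applied to the same string. A minimal sketch, reusing the news URL from the re.match() example:

```python
import re

test = 'http://news.163.com/17/0624/10/CNMHVBJP0001899N.html'

m = re.match(r'news', test)   # anchored at position 0, where the text is 'http'
s = re.search(r'news', test)  # scans forward until the pattern fits

print(m)
print(s.span(), s.group())
```

Here match() fails because the string starts with 'http', while search() finds 'news' further in.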

3. The re.sub() function

Syntax:

re.sub(pattern, repl, string, count=0, flags=0)

def sub(pattern, repl, string, count=0, flags=0):
    """Return the string obtained by replacing the leftmost
    non-overlapping occurrences of the pattern in string by the
    replacement repl.  repl can be either a string or a callable;
    if a string, backslash escapes in it are processed.  If it is
    a callable, it's passed the match object and must return
    a replacement string to be used."""
    return _compile(pattern, flags).sub(repl, string, count)

Parameters:

pattern: the regular expression to match
repl: the replacement string
string: the original string to search and replace in
count: the maximum number of replacements; the default 0 replaces every match

re.sub() replaces the matches found in a string.

import re
test = 'I am a loving child to learn.'
print(re.sub(r'child','MMMMM',test)) # replace 'child' with 'MMMMM'
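The count parameter and the callable form of repl described in the docstring are not exercised by the example above. The following sketch shows both. The sample strings and the double_number helper are assumptions made up for the illustration.

```python
import re

# count limits how many matches are replaced, working from the left.
limited = re.sub(r'a', 'A', 'banana', count=2)
print(limited)


def double_number(match):
    """Receive the match object and return the replacement text."""
    return str(int(match.group()) * 2)


# When repl is callable, it is invoked once per match.
doubled = re.sub(r'\d+', double_number, 'a1b22')
print(doubled)
```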

4. The re.findall() function

Syntax:

re.findall(pattern, string, flags=0)

def findall(pattern, string, flags=0):
    """Return a list of all non-overlapping matches in the string.

    If one or more capturing groups are present in the pattern, return
    a list of groups; this will be a list of tuples if the pattern
    has more than one group.

    Empty matches are included in the result."""
    return _compile(pattern, flags).findall(string)

re.findall() returns every match found in the string.

import re
test = '<a href="http://www.educity.cn/zhibo/" target="_blank">直播课堂</a>'
print(re.findall(r'<a href="(.*)" target="_blank">(.*)</a>',test)) #[('http://www.educity.cn/zhibo/', '直播课堂')]
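The pattern above works because the test string contains a single link. With several links on one line, the greedy (.*) swallows everything up to the last closing tag. The sketch below compares it with the non-greedy (.*?) form. The two-link HTML snippet and its example URLs are assumptions invented for this comparison.

```python
import re

# Two links on one line; the hosts are placeholders.
html = ('<a href="http://a.example/" target="_blank">A</a>'
        '<a href="http://b.example/" target="_blank">B</a>')

greedy = re.findall(r'<a href="(.*)" target="_blank">(.*)</a>', html)
lazy = re.findall(r'<a href="(.*?)" target="_blank">(.*?)</a>', html)

print(len(greedy))  # the greedy pattern merges both links into one match
print(lazy)
```

For real pages an HTML parser is usually the more robust choice; the regular expression is enough for short, regular snippets like this one.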

3.2 Common patterns

Email address

import re
pattern = r"\w+([.+-]\w+)*@\w+([.-]\w+)*\.\w+([.-]\w+)*"
string = "<a href='http://www.baidu.com'>百度</a><br><a href='w.linkings@gail.com'>电邮</a>"
result = re.search(pattern,string)
print(result)
print(result.group(0))

Phone number

import re
pattern = r"\d{4}-\d{7}|\d{3}-\d{8}"
string = "0551-54321234513451451345"
result1 = re.search(pattern,string)
print(result1)
print(result1.group(0))

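Website URL

import re
# (?:\.com|\.cn) is an alternation; [.com|.cn] would be a character class matching one character
pattern = r"[a-zA-Z]+://[^\s]*(?:\.com|\.cn)"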
string = "<a href='http://www.baidu.com'>百度首页</a>"
result1 = re.search(pattern,string)
print(result1)
print(result1.group(0))
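A common pitfall with URL patterns is writing the suffix alternatives inside square brackets. The sketch below contrasts a character class such as [.com|.cn], which matches exactly one of the characters ".", "c", "o", "m", "|" or "n", with a grouped alternation. The example.org test URL is an assumption picked because it ends in "n" without containing ".com" or ".cn".

```python
import re

char_class = r"[a-zA-Z]+://[^\s]*[.com|.cn]"        # one character from the set
alternation = r"[a-zA-Z]+://[^\s]*(?:\.com|\.cn)"   # literal ".com" or ".cn"

url = "http://example.org/on"  # placeholder URL without .com or .cn

loose = re.search(char_class, url)
strict = re.search(alternation, url)
print(loose.group())
print(strict)

# The alternation still accepts the URL from the example above.
page = "<a href='http://www.baidu.com'>百度首页</a>"
found = re.search(alternation, page).group(0)
print(found)
```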

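4. Session

A simple CTF challenge (秋名山老司机): compute the expression shown on the page and post the result back to the server quickly. The GET and the POST go through the same requests.Session so that both requests share cookies.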

import requests
import re
url = 'http://120.24.86.145:8002/qiumingshan/'
s = requests.Session()
source = s.get(url)
expression = re.search(r'(\d+[+\-*])+(\d+)', source.text).group()
result = eval(expression)
post = {'value': result}
print(s.post(url, data = post).text)
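The script relies on eval(), which is acceptable here only because the regular expression restricts the input to digits and the operators +, - and *. Where that guarantee is weaker, one possible alternative is to walk the parsed expression with the ast module. This is a sketch under stated assumptions: the safe_eval helper and the sample page text are hypothetical, and it expects Python 3.8 or later.

```python
import ast
import operator
import re

# Only the operators the challenge pattern allows.
_OPS = {ast.Add: operator.add, ast.Sub: operator.sub, ast.Mult: operator.mul}


def safe_eval(expression):
    """Evaluate an integer expression made of +, - and * without eval()."""
    def walk(node):
        if isinstance(node, ast.Expression):
            return walk(node.body)
        if isinstance(node, ast.BinOp) and type(node.op) in _OPS:
            return _OPS[type(node.op)](walk(node.left), walk(node.right))
        if isinstance(node, ast.Constant) and type(node.value) is int:
            return node.value
        raise ValueError("unsupported expression")
    return walk(ast.parse(expression, mode="eval"))


# Hypothetical page text standing in for source.text.
page_text = "<div>1165+23*7-88=?;</div>"
expression = re.search(r'(\d+[+\-*])+(\d+)', page_text).group()
result = safe_eval(expression)
print(expression, result)
```

Operator precedence is handled by the parser, so the multiplication is applied before the addition and subtraction, just as eval() would.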

5. A more elegant package: requests

5.1 Basic usage

>>> import requests
Then try fetching a web page. In this example we fetch GitHub's public timeline:
>>> r = requests.get('https://github.com/timeline.json')
>>> r = requests.post("http://httpbin.org/post")
>>> r = requests.put("http://httpbin.org/put")
>>> r = requests.delete("http://httpbin.org/delete")
>>> r = requests.head("http://httpbin.org/get")
>>> r = requests.options("http://httpbin.org/get")

5.2 Passing parameters in URLs

>>> payload = {'key1': 'value1', 'key2': 'value2'}
>>> r = requests.get("http://httpbin.org/get", params=payload)
Printing the URL shows that it has been encoded correctly:

>>> print(r.url)
http://httpbin.org/get?key1=value1&key2=value2

5.3 Response content

>>> import requests
>>> r = requests.get('https://github.com/timeline.json')
>>> r.text
'[{"repository":{"open_issues":0,"url":"https://github.com/...

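Requests automatically decodes content from the server. Most Unicode charsets are decoded seamlessly.

When a request is made, Requests makes an educated guess about the response encoding based on the HTTP headers. When you access r.text, Requests uses that guessed text encoding. You can find out which encoding Requests is using, and change it, through the r.encoding attribute: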

>>> r.encoding
'utf-8'
>>> r.encoding = 'ISO-8859-1'

If you change the encoding, Requests uses the new value of r.encoding whenever you access r.text.
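Why the encoding matters can be shown with the standard library alone: the same bytes yield different text depending on the codec used to decode them. The sketch below is only an analogy for what changing r.encoding does to r.text; the sample text is an assumption.

```python
# The same raw bytes decoded with two different encodings.
raw = "中文".encode("utf-8")  # stands in for the bytes of a response body

as_utf8 = raw.decode("utf-8")
as_latin1 = raw.decode("ISO-8859-1")  # never fails, but produces mojibake

print(as_utf8)
print(len(as_utf8), len(as_latin1))
```

ISO-8859-1 maps every byte to one character, so the two Chinese characters (three UTF-8 bytes each) turn into six unrelated characters. Garbled r.text is often a sign that the guessed encoding is wrong.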

5.4 Binary response content

For non-text requests you can also access the response body as bytes:

>>> r.content
b'[{"repository":{"open_issues":0,"url":"https://github.com/...

Requests automatically decodes gzip and deflate transfer-encodings for you.

For example, to create an image from binary data returned by a request, you can use the following code:

>>> from PIL import Image
>>> from io import BytesIO
>>> i = Image.open(BytesIO(r.content))

5.5 JSON response content

>>> import requests
>>> r = requests.get('https://github.com/timeline.json')
>>> r.json()
[{'repository': {'open_issues': 0, 'url': 'https://github.com/...

If JSON decoding fails, r.json() raises an exception.
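A defensive way to handle that failure is to catch the exception and fall back to a default. The sketch below uses json.loads from the standard library as a stand-in so that it runs without network access. It assumes that the exception raised by r.json() is a ValueError subclass, which is worth checking against the installed Requests version. The parse_json_or_none helper is hypothetical.

```python
import json


def parse_json_or_none(text):
    """Return the decoded JSON value, or None when the text is not valid JSON."""
    try:
        return json.loads(text)
    except ValueError:  # json.JSONDecodeError is a subclass of ValueError
        return None


good = parse_json_or_none('{"open_issues": 0}')
bad = parse_json_or_none('<html>not json</html>')
print(good)
print(bad)
```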

5.6 Raw response content

In the rare case that you want the raw socket response from the server, you can access r.raw. If you do, make sure you set stream=True in the initial request. For example:

>>> r = requests.get('https://github.com/timeline.json', stream=True)
>>> r.raw
<requests.packages.urllib3.response.HTTPResponse object at 0x101194810>
>>> r.raw.read(10)
b'\x1f\x8b\x08\x00\x00\x00\x00\x00\x00\x03'

5.7 Custom headers

If you want to add HTTP headers to a request, simply pass a dict to the headers parameter.

For example, we did not specify the content-type in the previous example:

>>> import json
>>> url = 'https://api.github.com/some/endpoint'
>>> payload = {'some': 'data'}
>>> headers = {'content-type': 'application/json'}
>>> r = requests.post(url, data=json.dumps(payload), headers=headers)

6. Basic usage of BeautifulSoup4

Installation

pip install beautifulsoup4

Usage

from bs4 import BeautifulSoup

html_str = """
    <ul>
        <li>
            <a href="http://www.baidu.com/">百度一下</a>
        </li>
        <li>合适的话发多少</li>
        <li>
            <a class="baidu" href="http://www.baidu.com/">不会发生看到</a>
        </li>
        <li>
            <a  id="lagou" href="http://www.lagou.com/">lagou</a>
        </li>
        <li>
            <label class="enterText enterArea">列表图预览：</label>
            <p class="enterImg">
                <img id="previewImage" title='mmm' src="http://www.google.com/logo.png"/>
            </p>
            <div class="Validform_checktip">范德萨范德萨</div>
        </li>
    </ul>
"""

soup = BeautifulSoup(html_str,'html.parser')

# The soup object is the parsed HTML; .text is the text with the tags stripped
# print(soup)
# print(soup.text)


# Object type: <class 'bs4.BeautifulSoup'>
# print(type(soup))

# Find the first a tag; .text prints its content
# print(soup.find('a'))
# print(soup.find('a').text)

# Find the a tag with class=baidu
# print(soup.find('a',class_='baidu'))

# Find the element with id=lagou
# print(soup.find(id='lagou'))

# Find the element with title='mmm'; a tag name can be given first to narrow the search
# print(soup.find(title='mmm'))

# find_all finds every match and returns them as a list
# print(soup.find_all('a'))
# print(soup.find_all('a')[0]) # the first one
all_a = soup.find_all('a')
for item in all_a:
    if item:
        # print(item.attrs)
        print(item.attrs['href']) # item.attrs is a dict