Python Web Scraping

Introduction

requests and lxml are two very popular Python libraries, mainly used for making HTTP requests and parsing XML/HTML.

  1. requests: a library for sending HTTP requests that makes them simple to work with. With requests you can send all the common HTTP request types (GET, POST, DELETE, and so on) and handle the responses. It supports custom headers, parameters, and cookies, and handles redirects, timeouts, and more; a combined sketch follows this list.

    import requests

    response = requests.get('https://www.example.com')
    print(response.status_code)
    print(response.text)
  2. lxml: a library for parsing XML and HTML. With lxml you can parse, query, and manipulate XML and HTML documents.

    from lxml import etree

    html = """
    <html>
    <head><title>Example</title></head>
    <body>
    <p>Hello, world!</p>
    </body>
    </html>
    """
    tree = etree.fromstring(html)
    print(tree.findtext('.//p'))

    These two libraries are frequently used together, especially for web scraping and data extraction: requests fetches the page content, and lxml parses it to pull out the data you need.
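To illustrate the features listed above, here is a minimal sketch (my own example, not from the original notes) combining the two libraries: a GET request with custom headers, query parameters, and a timeout, followed by an XPath query on the response. The header and parameter values are placeholders.

import requests
from lxml import etree

# Placeholder headers and query parameters to demonstrate the options
headers = {'User-Agent': 'Mozilla/5.0'}
params = {'q': 'example'}

response = requests.get('https://www.example.com',
                        headers=headers, params=params, timeout=10)
print(response.status_code)

# Parse the returned HTML and extract the page title via XPath
root = etree.HTML(response.text)
print(root.xpath('//title/text()'))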

Exercise Notes

import requests

target = "http://www.spiderbuf.cn"

url = f"{target}/s01"
html = requests.get(url).text
print(html)

Fetch the page and print its source.
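A small variation on the request above (an addition, not in the original notes): check the HTTP status before trusting the body, since a blocked or failed request still returns text.

import requests

target = "http://www.spiderbuf.cn"

response = requests.get(f"{target}/s01")
response.raise_for_status()  # raises an HTTPError on 4xx/5xx responses
html = response.text
print(html)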

import requests

target = "http://www.spiderbuf.cn"

url = f"{target}/s01"
html = requests.get(url).text

f = open('S01.html', 'w', encoding='utf-8')
f.write(html)
f.close()

print(html)

Save the page source to a local file. To avoid getting blocked, you can then debug against the local copy instead of re-requesting the page.
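For that offline debugging, a sketch like the following (assumed, not part of the original notes) reads the saved copy back so parsing can be tested without touching the server:

# Read the previously saved page source instead of re-requesting it
with open('S01.html', 'r', encoding='utf-8') as f:
    html = f.read()

print(html[:200])  # preview the first 200 characters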

import requests
from lxml import etree

target = "http://www.spiderbuf.cn"

url = f"{target}/s01"
html = requests.get(url).text

f = open('S01.html', 'w', encoding='utf-8')
f.write(html)
f.close()

root = etree.HTML(html)
trs = root.xpath('//tr')
print(trs)

Use XPath to get the list of <tr> elements.
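As an aside (not from the original notes), xpath() can also return text nodes directly. The sketch below pulls all <td> text in one query; note that it skips empty cells entirely rather than returning None for them, which matters for the row formatting later.

# Alternative: extract every <td> text node in a single XPath query.
# Empty cells are skipped rather than returned as None.
texts = root.xpath('//tr/td/text()')  # root as built from html above
print(texts)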

import requests
from lxml import etree

target = "http://www.spiderbuf.cn"

url = f"{target}/s01"
html = requests.get(url).text

f = open('S01.html', 'w', encoding='utf-8')
f.write(html)
f.close()

root = etree.HTML(html)
trs = root.xpath('//tr')
for tr in trs:
    print(tr.xpath('./td'))

Use XPath to get the list of <td> elements under each <tr>.

import requests
from lxml import etree

target = "http://www.spiderbuf.cn"

url = f"{target}/s01"
html = requests.get(url).text

f = open('S01.html', 'w', encoding='utf-8')
f.write(html)
f.close()

root = etree.HTML(html)
trs = root.xpath('//tr')
for tr in trs:
    tds = tr.xpath('./td')
    for td in tds:
        print(td.text)

Print the text of each <td> in the list.

import requests
from lxml import etree

target = "http://www.spiderbuf.cn"

url = f"{target}/s01"
html = requests.get(url).text

f = open('S01.html', 'w', encoding='utf-8')
f.write(html)
f.close()

root = etree.HTML(html)
trs = root.xpath('//tr')
for tr in trs:
    tds = tr.xpath('./td')
    s = ''
    for td in tds:
        s = s + td.text + '|'
    print(s)

Format each row by concatenating the <td> values onto an empty string with a '|' separator. But some <td> elements are empty (their .text is None), which breaks the concatenation with the error below; an explicit str() conversion handles it.

Traceback (most recent call last):
  File "D:\PyBatch\cyber\S01.py", line 19, in <module>
    s = s + td.text + '|'
TypeError: can only concatenate str (not "NoneType") to str
import requests
from lxml import etree

target = "http://www.spiderbuf.cn"

url = f"{target}/s01"
html = requests.get(url).text

f = open('S01.html', 'w', encoding='utf-8')
f.write(html)
f.close()

root = etree.HTML(html)
trs = root.xpath('//tr')
for tr in trs:
    tds = tr.xpath('./td')
    s = ''
    for td in tds:
        s = s + str(td.text) + '|'
    print(s)
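The str() conversion works, but it writes the literal string 'None' into the output for empty cells. A variation (my suggestion, not in the original notes) substitutes an empty string instead:

for tr in trs:
    tds = tr.xpath('./td')
    s = ''
    for td in tds:
        s = s + (td.text or '') + '|'  # use '' when td.text is None
    print(s)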

import requests
from lxml import etree

target = "http://www.spiderbuf.cn"

url = f"{target}/s01"
html = requests.get(url).text

f = open('S01.html', 'w', encoding='utf-8')
f.write(html)
f.close()

f = open('datas01.txt', 'w', encoding='utf-8')
root = etree.HTML(html)
trs = root.xpath('//tr')
for tr in trs:
    tds = tr.xpath('./td')
    s = ''
    for td in tds:
        s = s + str(td.text) + '|'
    print(s)
    if s != '':
        f.write(s + '\n')
f.close()

Write the extracted data to a local file.
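As a closing note (an alternative I am adding, not part of the original exercise), the save step can also be written with a with block, which closes the file automatically even if an exception is raised mid-loop:

# Equivalent save step using a context manager; 'root' is the parsed
# tree from above. The file is closed automatically on exit.
with open('datas01.txt', 'w', encoding='utf-8') as f:
    for tr in root.xpath('//tr'):
        s = ''.join(str(td.text) + '|' for td in tr.xpath('./td'))
        print(s)
        if s != '':
            f.write(s + '\n')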