How to handle garbled gzip data returned by the server when using urllib.request.urlopen

Last updated 2020-12-02 17:50:04

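The urlopen function does not automatically decompress gzip data returned by the server, so the response body shows up as garbled bytes such as \x1f\x8b\x08\x00.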

If the Accept-Encoding request header is set to gzip, a server that supports it returns gzip-compressed data, as shown below:

import urllib.request

# url is assumed to already hold the address to fetch
req = urllib.request.Request(url)
req.add_header('Accept', '*/*')
req.add_header('Accept-Encoding', 'gzip')
req.add_header('Accept-Language', 'zh-CN,zh;q=0.8,en;q=0.6')
req.add_header('Cache-Control', 'max-age=0')
req.add_header('Connection', 'keep-alive')
req.add_header('User-Agent', 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/49.0.2623.110 Safari/537.36')

f = urllib.request.urlopen(req)
content = f.read()
# content now holds the raw gzip bytes, for example:
# \x1f\x8b\x08\x00\x00\x00\x00\x00\x00\x03\xdd=k\x8f$Wu\xdf-\xf9?Tfcyw5\xd5\xd3\xef\xee\x19\x0b\x048v\x90\x12\x92...
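
The leading bytes \x1f\x8b in the output above are the gzip magic number, and the following \x08 marks the deflate compression method. As a rough sketch of how to recognise such a body, the snippet below compresses a made-up payload locally instead of fetching one from a server; the helper name looks_like_gzip and the sample data are illustrative assumptions, not part of the original example.

```python
import gzip

# Hypothetical sample payload, compressed locally rather than downloaded.
sample = gzip.compress(b"<html>hello</html>")


def looks_like_gzip(data: bytes) -> bool:
    """Return True when data starts with the gzip magic number 0x1f 0x8b."""
    return data[:2] == b"\x1f\x8b"


print(looks_like_gzip(sample))                 # compressed bytes
print(looks_like_gzip(b"<html>hello</html>"))  # plain bytes
```

Checking the Content-Encoding response header remains the more reliable signal; inspecting the first bytes is only a quick diagnostic when the output looks garbled.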

Decompressing gzip

Use the gzip module to decompress the response bytes, as shown below:

import urllib.request
import gzip


# url is assumed to already hold the address to fetch
req = urllib.request.Request(url)
req.add_header('Accept', '*/*')
req.add_header('Accept-Encoding', 'gzip')
req.add_header('Accept-Language', 'zh-CN,zh;q=0.8,en;q=0.6')
req.add_header('Cache-Control', 'max-age=0')
req.add_header('Connection', 'keep-alive')
req.add_header('User-Agent', 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/49.0.2623.110 Safari/537.36')


f = urllib.request.urlopen(req)

# Decompress the body only when the server says it is gzip-encoded
encoding = f.info().get('Content-Encoding')
if encoding == 'gzip':
    content = gzip.decompress(f.read())
else:
    content = f.read()
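
The branch above can also be wrapped in a small helper so it can be tried without a network connection. This is a minimal sketch under a few assumptions: the function name decode_body is made up for illustration, the header comparison is made case-insensitive, and the body is assumed to be UTF-8 text, which a real server may not guarantee.

```python
import gzip


def decode_body(raw: bytes, content_encoding, charset: str = "utf-8") -> str:
    """Decompress raw if it is gzip-encoded, then decode it to text.

    content_encoding is the value of the Content-Encoding response header,
    or None when the header is absent. charset is an assumption here.
    """
    if (content_encoding or "").strip().lower() == "gzip":
        raw = gzip.decompress(raw)
    return raw.decode(charset)


# Simulate a gzip-compressed response body locally.
compressed = gzip.compress("你好, gzip".encode("utf-8"))
text = decode_body(compressed, "gzip")
print(text)
```

With the response object from the example above, the call would look like decode_body(f.read(), f.info().get('Content-Encoding')).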
