Some problems encountered when saving scraped website data as docx

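A colleague was recently scraping data from a website and wanted to save it as a docx file. The scraping itself went smoothly, but saving the data failed with the following error.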

      File "src\lxml\etree.pyx", line 1024, in lxml.etree._Element.text.__set__
      File "src\lxml\apihelpers.pxi", line 747, in lxml.etree._setNodeText
      File "src\lxml\apihelpers.pxi", line 735, in lxml.etree._createTextNode
      File "src\lxml\apihelpers.pxi", line 1540, in lxml.etree._utf8
    ValueError: All strings must be XML compatible: Unicode or ASCII, no NULL bytes or control characters

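The data is saved with the python-docx module. The error roughly says that some characters are incompatible: every string must be XML compatible.

So the next step was programming by Baidu and Google. According to the Baidu search results, python-docx is supposedly incompatible with Chinese characters, and the text needs to be converted to Unicode. Let's first look at the content of the data and its encoding.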

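    import chardet
    from docx import Document
    from docx.enum.text import WD_PARAGRAPH_ALIGNMENT


    def save(doc_title, doc_content_list):
        document = Document()
        # Title
        heading = document.add_heading(doc_title, 0)
        # Center the title
        heading.alignment = WD_PARAGRAPH_ALIGNMENT.CENTER
        # Print the detected character set
        print(chardet.detect(doc_content_list.encode()))
        # Print the data content
        print(doc_content_list)
        # Body text, passed in unchanged
        document.add_paragraph(doc_content_list)
        # Split the title; the first token is used in the file name
        t_title = doc_title.split()[0]
        # Save the document
        document.save('下載-%s.docx' % t_title)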

The output is shown below.

Figure: character set and data content

The data content looks fine, and the detected character set is utf-8.
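One caveat is worth sketching here: a string can be perfectly valid utf-8 and still contain control characters, so an encoding check alone does not prove the data is clean. The sample string below is hypothetical (a stray backspace in otherwise normal text), not the actual scraped data.

```python
# Hypothetical sample: normal Chinese text with a stray backspace (\x08) inside.
sample = "爬取的資料\x08內容"

# The string encodes to utf-8 and decodes back without any error.
encoded = sample.encode("utf-8")
decoded = encoded.decode("utf-8")

# Control characters other than tab, newline and carriage return are suspicious.
has_control = any(ord(ch) < 0x20 and ch not in "\t\n\r" for ch in sample)

print(decoded == sample)  # the utf-8 round trip succeeds
print(has_control)        # yet a control character is present
```

Printing such a string to a terminal usually hides the control character as well, which may be why the data looks fine on screen.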

Could python-docx really be incompatible with Chinese characters? I wrote a quick test:

    from docx import Document
    from docx.enum.text import WD_PARAGRAPH_ALIGNMENT  # used to center the title

    document = Document()
    # Test title; note that the conversion to Unicode was forgotten here
    heading = document.add_heading("測試標題", 0)
    heading.alignment = WD_PARAGRAPH_ALIGNMENT.CENTER  # center the title
    # Test content, converted to Unicode here
    document.add_paragraph(u'測試內容')
    document.save('測試文件.docx')

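It ran without any problem and the file was saved.

Figure: test result

When writing the test title I forgot to convert the string to Unicode, yet it was saved normally, which shows that python-docx can handle utf-8 text. The test content was converted to Unicode and also displays correctly in the document, which suggests that python-docx converts Unicode to utf-8 by default when saving.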
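A small check helps interpret this test, assuming the script runs on Python 3 (the source does not state the version): there, a plain string literal and a u-prefixed one are the same type, so the test above really exercises a single case.

```python
# On Python 3 the u prefix is accepted for compatibility but changes nothing.
plain = "測試標題"
prefixed = u"測試標題"

print(type(plain) is type(prefixed))  # both are str
print(plain == prefixed)              # and they compare equal

# utf-8 only appears once the text is encoded to bytes, e.g. when written to a file.
print(plain.encode("utf-8"))
```

On Python 2 the two literals would differ (byte string versus unicode), which is probably where the advice found through the search engine comes from.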

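In theory, then, python-docx should have no trouble saving the data, so why does it raise an error? Let's try converting the scraped data to Unicode. The code is roughly as follows: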

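    import json

    from docx import Document
    from docx.enum.text import WD_PARAGRAPH_ALIGNMENT


    def save(doc_title, doc_content_list):
        document = Document()
        # Title
        heading = document.add_heading(doc_title, 0)
        # Center the title
        heading.alignment = WD_PARAGRAPH_ALIGNMENT.CENTER
        # Body text, converted with json.dumps
        document.add_paragraph(json.dumps(doc_content_list))
        # Split the title; the first token is used in the file name
        t_title = doc_title.split()[0]
        # Save the docx file in the current script directory
        document.save('下載-%s.docx' % t_title)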

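Here json.dumps is used to convert the string to Unicode. With this step added, no exception is raised during the run, but the result is not what we wanted. It looks roughly like the figure below.

Figure: run result

I was baffled on the spot: the title was saved without any problem, yet the content came out as Unicode escape sequences. In theory the content should have been converted straight to utf-8.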
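A likely explanation for this result can be sketched with the standard library alone. By default json.dumps returns an ASCII-only string: non-ASCII characters become \uXXXX escapes and control characters are escaped too. That would explain both why the error disappears and why the document shows escape sequences. The sample string is hypothetical.

```python
import json

# Hypothetical sample: Chinese text with a stray backspace control character.
sample = "測試\x08內容"

# Default behaviour (ensure_ascii=True): everything is escaped into plain ASCII.
escaped = json.dumps(sample)
print(escaped)

# With ensure_ascii=False the Chinese text is kept, but the control character
# is still written as an escape, and the result is still wrapped in quotes.
readable = json.dumps(sample, ensure_ascii=False)
print(readable)

# Decoding restores the original string, control character included.
print(json.loads(escaped) == sample)
```

So json.dumps does not clean the data; it only hides the offending character behind an escape sequence, and the paragraph then contains the JSON text rather than the original content.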

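Once the confusion passed, I sorted out what we know:

The data can be printed, so fetching the data works

The data is encoded as utf-8

python-docx can save utf-8 data directly, and it can also save Unicode

Saving the data through python-docx as Unicode raises no error, but it displays incorrectly

Saving the data through python-docx directly as utf-8 raises an error

Taken together, this suggests that the data itself may be the problem. Let's go back to the error message: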

All strings must be XML compatible: Unicode or ASCII, no NULL bytes or control characters

In other words, the text may be Unicode or ASCII, but it must not contain NULL bytes or control characters.
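To see which characters trigger this kind of complaint, a small helper can list everything in a string that falls outside the character set allowed by the XML 1.0 specification. This is a sketch based on the spec's Char production; the exact check inside lxml may differ in detail, and the names is_xml_char and find_illegal_chars as well as the sample string are made up for illustration.

```python
def is_xml_char(ch):
    """Return True if ch is allowed by the XML 1.0 'Char' production."""
    cp = ord(ch)
    return (
        cp in (0x9, 0xA, 0xD)          # tab, newline, carriage return
        or 0x20 <= cp <= 0xD7FF
        or 0xE000 <= cp <= 0xFFFD
        or 0x10000 <= cp <= 0x10FFFF
    )


def find_illegal_chars(text):
    """Return (index, code point in hex) for every disallowed character."""
    return [(i, hex(ord(ch))) for i, ch in enumerate(text) if not is_xml_char(ch)]


# Hypothetical sample: a NULL byte and a vertical tab hidden in normal text.
sample = "資料\x00內容\x0b結尾\n"
print(find_illegal_chars(sample))  # [(2, '0x0'), (5, '0xb')]
```

Running such a helper on the scraped text before calling add_paragraph shows where the bad characters are, instead of guessing from the printed output.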

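Next we try cleaning the data by removing every character that is not expected in the text, keeping only digits, Latin letters, Chinese characters and common punctuation. The code is roughly as follows: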

    import re

    from docx import Document
    from docx.enum.text import WD_PARAGRAPH_ALIGNMENT


    # Replace every run of characters outside the whitelist with a space
    def cleantxt(raw):
        # \u4e00-\u9fa5 is the range of common Chinese (CJK) characters
        fil = re.compile(u'[^0-9a-zA-Z\u4e00-\u9fa5.,，。？“”《》_（）！；：]+', re.UNICODE)
        return fil.sub(' ', raw)


    def save(doc_title, doc_content_list):
        document = Document()
        # Title
        heading = document.add_heading(doc_title, 0)
        # Center the title
        heading.alignment = WD_PARAGRAPH_ALIGNMENT.CENTER
        # Body text, with unexpected characters cleaned out
        document.add_paragraph(cleantxt(doc_content_list))
        # Split the title; the first token is used in the file name
        t_title = doc_title.split()[0]
        # Save the docx file in the current script directory
        document.save('下載-%s.docx' % t_title)
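As an aside, the whitelist above also replaces line breaks and any punctuation it does not list. If that is too aggressive, a narrower alternative (not the approach used in this post) is to remove only the characters that XML 1.0 forbids. The function name strip_xml_illegal and the sample string are made up for this sketch.

```python
import re

# Anything outside the XML 1.0 "Char" production: control characters other
# than tab/newline/carriage return, surrogates, and U+FFFE / U+FFFF.
_XML_ILLEGAL = re.compile(
    "[^\t\n\r\x20-\uD7FF\uE000-\uFFFD\U00010000-\U0010FFFF]"
)


def strip_xml_illegal(text):
    """Drop only the characters that XML 1.0 does not allow."""
    return _XML_ILLEGAL.sub("", text)


# Hypothetical sample: a NULL byte and a vertical tab inside otherwise valid text.
sample = "第1段: hello\x00 world\x0b\n第2段 (ok)"
print(repr(strip_xml_illegal(sample)))
```

Line breaks, ASCII punctuation and spaces survive this version, so the saved paragraph stays closer to the original text.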

With the cleaning step in place, the script ran without any problem. Now for the exciting moment.