AI Storytelling - From Zero - 104 | Data Cleaning in One Lesson

Updated 2024/07/26 | Published 2024/07/26 | 7 min read

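Each day I want to share a little of the technology that LLMs are built on, from the ground up. I keep every post under three minutes of reading so that nobody feels overwhelmed, yet everyone can still grow a little each day.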

Continuing with the dataset loaded in AI Storytelling - From Zero - 103, we now preprocess the data. First, import the required dependencies:

import pickle
from pickle import dump

Next, define a function that loads a document:

# load doc into memory
def load_doc(filename):
	# open the file as read-only text; the with block closes it automatically
	with open(filename, mode='rt', encoding='utf-8') as file:
		# read all text
		text = file.read()
	return text

Next, define a function that splits the text into sentences, one per line:

# split a loaded document into sentences
def to_sentences(doc):
	return doc.strip().split('\n')

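Then define a function that returns the shortest and longest sentence lengths, counted in whitespace-separated tokens: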

# shortest and longest sentence lengths
def sentence_lengths(sentences):
	lengths = [len(s.split()) for s in sentences]
	return min(lengths), max(lengths)
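As a quick check of the two helpers above, the sketch below applies them to a small in-memory string instead of a file. The sample text is made up for illustration and is not taken from the dataset; the functions are repeated only so the snippet runs on its own.

```python
# split a loaded document into sentences (same logic as above)
def to_sentences(doc):
    return doc.strip().split('\n')

# shortest and longest sentence lengths, in whitespace-separated tokens
def sentence_lengths(sentences):
    lengths = [len(s.split()) for s in sentences]
    return min(lengths), max(lengths)

# hypothetical sample text, one sentence per line
sample_doc = "Hello world\nThis is a longer sentence\n"

sentences = to_sentences(sample_doc)
shortest, longest = sentence_lengths(sentences)

print(sentences)          # ['Hello world', 'This is a longer sentence']
print(shortest, longest)  # 2 5
```

Note that strip() removes the trailing newline first, so the split does not produce an empty final sentence.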

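The imported sentences must be cleaned so that the model is not trained on useless, noisy tokens. Each line is Unicode-normalized, split into tokens on whitespace, and converted to lowercase. Punctuation is then stripped from each token, non-printable characters are removed, and tokens containing digits are discarded. Each cleaned line is stored as a string.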

The following function carries out the cleaning and returns the cleaned lines as a list of strings:

import re
import string
import unicodedata

def clean_lines(lines):
	cleaned = list()
	
	# prepare regex for char filtering
	re_print = re.compile('[^%s]' % re.escape(string.printable))
	
	# prepare translation table for removing punctuation
	table = str.maketrans('', '', string.punctuation)
	
	for line in lines:
		# normalize unicode characters
		line = unicodedata.normalize('NFD', line).encode('ascii', 'ignore')
		line = line.decode('UTF-8')
	
		# tokenize on white space
		line = line.split()
	
		# convert to lower case
		line = [word.lower() for word in line]
	
		# remove punctuation from each token
		line = [word.translate(table) for word in line]
	
		# remove non-printable chars from each token
		line = [re_print.sub('', w) for w in line]
	
		# remove tokens with numbers in them
		line = [word for word in line if word.isalpha()]
	
		# store as string
		cleaned.append(' '.join(line))
	return cleaned
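To make the order of operations concrete, here is a minimal sketch that traces one line through the same steps as clean_lines, one step at a time. The sample sentence is hypothetical and was chosen only so that every step has something to remove.

```python
import re
import string
import unicodedata

re_print = re.compile('[^%s]' % re.escape(string.printable))
table = str.maketrans('', '', string.punctuation)

# hypothetical sample line ("\u00e9" is a precomposed e with acute accent)
line = "Caf\u00e9 prices rose 3% in 2023, she said!"

# 1. normalize: decompose accents, drop non-ASCII marks, decode back to str
line = unicodedata.normalize('NFD', line).encode('ascii', 'ignore').decode('UTF-8')

# 2. tokenize on white space
tokens = line.split()

# 3. convert to lower case
tokens = [w.lower() for w in tokens]

# 4. remove punctuation from each token ('3%' -> '3', 'said!' -> 'said')
tokens = [w.translate(table) for w in tokens]

# 5. remove non-printable chars from each token
tokens = [re_print.sub('', w) for w in tokens]

# 6. keep purely alphabetic tokens only ('3' and '2023' are dropped)
tokens = [w for w in tokens if w.isalpha()]

cleaned = ' '.join(tokens)
print(cleaned)  # cafe prices rose in she said
```

Because the last filter uses isalpha(), it also discards any token that became empty after punctuation removal, not just tokens that contain digits.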

The key lines are explained below:

re_print = re.compile('[^%s]' % re.escape(string.printable))

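re: Python's regular expression module.
re.compile(): compiles a regular expression pattern and returns a pattern object that can then be used to match characters.
string.printable: a string constant in the standard library's string module that contains all printable ASCII characters (digits, letters, punctuation, and whitespace).
re.escape(): escapes the characters in a string that have special meaning in a regular expression, so the string can be used safely as a literal inside a pattern.
'[^%s]': a regular expression pattern that, once string.printable is substituted in, matches every character that is not in string.printable. [^...] is a negated character set: it matches any character not listed inside the brackets.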
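As a small illustration of what this pattern removes, the sketch below runs it on a made-up string that contains two ASCII control characters and one non-ASCII letter. The input string is an assumption chosen for the demonstration.

```python
import re
import string

re_print = re.compile('[^%s]' % re.escape(string.printable))

# hypothetical input: '\x00' (NUL) and '\x7f' (DEL) are ASCII but not printable,
# and '\u00f6' (o with umlaut) is not ASCII at all
raw = "he\x00llo\x7f w\u00f6rld"

filtered = re_print.sub('', raw)
print(filtered)  # hello wrld
```

Every character outside string.printable is replaced with the empty string, so accented letters are deleted here rather than reduced to their base letter. That is why clean_lines normalizes the text before applying this filter.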

table = str.maketrans('', '', string.punctuation)

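str.maketrans(): a static method of str that builds a character translation table. Passed to str.translate(), the table replaces or deletes characters in a string.
str.maketrans() accepts up to three arguments:

The first argument is a string whose characters will each be replaced by the character at the same position in the second argument.

The second argument is a string of the same length that holds the replacement characters.

The third argument is a string containing the characters to delete.

In this code the first and second arguments are both empty strings, so no replacement takes place. The third argument is string.punctuation, a string containing all ASCII punctuation characters. str.maketrans('', '', string.punctuation) therefore builds a table that deletes every ASCII punctuation character.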
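The sketch below shows the table in use, followed by a contrasting table that uses all three arguments. Both input strings and the swap table are hypothetical examples, not part of the original pipeline.

```python
import string

# deletion-only table, as used in clean_lines
table = str.maketrans('', '', string.punctuation)
no_punct = "don't stop, believing!".translate(table)
print(no_punct)  # dont stop believing

# hypothetical three-argument table: 'a' -> 'x', 'b' -> 'y', and '!' is deleted
swap = str.maketrans('ab', 'xy', '!')
swapped = "abba!".translate(swap)
print(swapped)  # xyyx
```

Note that the apostrophe counts as punctuation, so contractions such as "don't" collapse into a single token without it.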

line = unicodedata.normalize('NFD', line).encode('ascii', 'ignore')

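unicodedata.normalize('NFD', line): unicodedata is a standard library module for working with Unicode character data, and normalize is its function for converting a Unicode string to a normalization form. 'NFD' stands for Normalization Form D (canonical decomposition), which splits each character into a base character and its combining characters. For example, "é" is decomposed into "e" plus a combining acute accent.
.encode('ascii', 'ignore'): encodes the normalized string as ASCII bytes. 'ascii' selects the target encoding, and 'ignore' is the error handler, which silently drops any character that cannot be encoded. Because combining marks are not ASCII, they are dropped while the base letter survives, so "é" ends up as a plain "e". The result is a bytes object, which is why the next line of the function decodes it back to a string.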
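The effect of the two calls can be checked on a single character. The following is a minimal sketch; the second input string is a made-up example that shows what happens to characters with no ASCII base letter.

```python
import unicodedata

word = "\u00e9"  # precomposed e with acute accent: one code point
decomposed = unicodedata.normalize('NFD', word)

print(len(word), len(decomposed))         # 1 2
print([hex(ord(c)) for c in decomposed])  # ['0x65', '0x301']

# the combining accent U+0301 is not ASCII, so 'ignore' drops it
ascii_text = decomposed.encode('ascii', 'ignore').decode('UTF-8')
print(ascii_text)  # e

# hypothetical mixed input: characters without an ASCII base vanish entirely
mixed = unicodedata.normalize('NFD', "na\u00efve \u6771\u4eac")
mixed = mixed.encode('ascii', 'ignore').decode('UTF-8')
print(repr(mixed))  # 'naive '
```

This approach suits Latin-script text with accents. For scripts such as Chinese or Japanese it removes the text altogether, so it should only be applied to data where that loss is acceptable.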