(Memo: Python) How to use Transformer and BERT (a code summary for text classification)

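This post is a brief summary of Transformer and BERT, two of the natural language processing topics I wrote up earlier. The code is included so that you can copy and paste it and use it even without understanding the underlying theory.

It should also be relevant to GPT (Generative Pretrained Transformer), which has been getting a lot of attention.

Note: the explanations are brief and the focus is on code. The texts being classified are in English.

Note: this is also a memo for my own use, so please read it with a forgiving eye.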

Transformer
BERT
Summary
Reference article: How to use ChatGPT

Transformer

Features of the Transformer

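The Transformer, and BERT, which is built from it, is based on the attention mechanism. Attention can be thought of as a set of scores that indicate which words to focus on when interpreting the meaning of a word in a sentence. The model uses these scores to capture the relationships between words and to extract the information needed to understand context.

Specifically, it uses a mechanism called self-attention. Self-attention scores every pair of words in the sentence directly, so it can select the words worth attending to regardless of how far apart they are or where they sit in the sentence. This lets the model judge the importance of each word dynamically and concentrate on the important information.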
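
As an illustration of the scoring described above, here is a minimal sketch of scaled dot-product self-attention for a single sequence. It is written in plain Python so that it runs without any library; the toy input and the identity projection matrices are assumptions chosen only to keep the numbers easy to follow, and a real layer would use learned weights and several heads.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def self_attention(x, w_q, w_k, w_v):
    """Scaled dot-product self-attention for one sequence.

    x: (seq_len, d_model); w_q, w_k, w_v: (d_model, d_k) projection matrices.
    Returns (output, weights).
    """
    q, k, v = matmul(x, w_q), matmul(x, w_k), matmul(x, w_v)
    d_k = len(k[0])
    k_t = [list(col) for col in zip(*k)]
    # scores[i][j]: how strongly token i attends to token j
    scores = matmul(q, k_t)
    weights = [softmax([s / math.sqrt(d_k) for s in row]) for row in scores]
    return matmul(weights, v), weights

# Toy input: 3 tokens with d_model = 2 (hypothetical values).
x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
identity = [[1.0, 0.0], [0.0, 1.0]]
output, weights = self_attention(x, identity, identity, identity)
for row in weights:
    print([round(w, 3) for w in row])
```

Each printed row sums to 1 and shows how one token distributes its attention over all tokens, including itself. Every row can be computed independently of the others, which is what makes the parallel computation mentioned in this section possible.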

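(Memo: Python) A super-introduction to natural language processing: before the Transformer... learning and using the attention mechanism (for near-beginners)

Explains how the attention mechanism works in natural language processing and how it is applied, while implementing it in Python. An introductory article for understanding the basics of attention, which is also used in recent models such as Transformer and BERT.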

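In the form actually used, self-attention derives its output from the input vectors alone, and all positions can be computed in parallel. Because there is no step-by-step dependency between positions, as there is in an RNN, processing is very fast on parallel hardware. When the sequence is shorter than the dimension of the representation, which is the usual case for sentences, a self-attention layer also needs fewer operations than a recurrent or convolutional layer, so the computation is efficient.

Attention is the main component (no RNN, LSTM, or CNN layers are used at all)
Thanks to attention, computation can be parallelized and the amount of computation is small
Highly versatile, which led to BERT, GPT, and others
Well worth studying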

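Overall structure

The Transformer has the basic structure of an encoder-decoder model. In the architecture diagram, the encoder is on the left and the decoder is on the right.

The encoder takes a sequence of words as input and produces embedding representations (vectors) that capture the context of those words.

The decoder takes the embeddings produced by the encoder, together with the words of the target-language sentence, and predicts the next word. It uses self-attention and a position-wise feed-forward network (position-wise FFN) to make use of context when predicting the next word.

Note that this description leaves out the details of how the decoder works and how it makes its predictions. The Transformer performs well on a wide range of natural language processing tasks, and the combination of the encoder-decoder architecture and the self-attention mechanism is its defining characteristic.

The Transformer is built from a stack of encoder layers and a stack of decoder layers. The encoder consists of N=6 layers, each with the same structure. Each layer has two sub-layers: a Multi-Head Attention layer and a position-wise fully connected layer (FFN). Each sub-layer is followed by Add & Norm (a residual connection and a normalization layer).

The decoder also consists of N=6 layers, each with the same structure. Each decoder layer has an additional Multi-Head Attention sub-layer that receives the encoder output. The first sub-layer of each decoder layer performs Masked Multi-Head Attention, as shown in the diagram. The mask prevents a position from attending to words that come after it, which stops the model from cheating by looking at future words in the translation task.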
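
The masking described above can be sketched as follows. This is a minimal illustration in plain Python, not the implementation used by any particular library: the uniform toy scores are an assumption, and the mask simply sets the scores of future positions to negative infinity before the softmax so that their weights become zero.

```python
import math

def causal_mask(seq_len):
    """mask[i][j] is True when position i may attend to position j (j <= i)."""
    return [[j <= i for j in range(seq_len)] for i in range(seq_len)]

def masked_softmax(scores, mask):
    """Row-wise softmax in which masked-out positions receive zero weight."""
    out = []
    for row, allowed in zip(scores, mask):
        # Future positions are set to -inf so that exp() maps them to 0.
        masked = [s if ok else float("-inf") for s, ok in zip(row, allowed)]
        m = max(masked)
        exps = [math.exp(s - m) for s in masked]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out

# Uniform toy scores for a sequence of 4 positions (hypothetical values).
scores = [[0.0] * 4 for _ in range(4)]
weights = masked_softmax(scores, causal_mask(4))
for row in weights:
    print([round(w, 2) for w in row])
```

The first row puts all of its weight on position 0, and each later row spreads its weight only over itself and earlier positions, so no position can see the words that follow it.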

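Add & Norm combines a residual connection (skip connection) with a normalization layer. The residual connection adds a sub-layer's input to its output so that information keeps flowing directly through the layers, and the normalization layer adjusts the scale of the data.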
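
A minimal sketch of Add & Norm for one token vector is shown below, in plain Python. It is a simplification under stated assumptions: the learned scale and shift parameters of a real layer normalization are omitted, and the input values are made up. The epsilon of 1e-6 matches the value passed to LayerNormalization in the training code of this article.

```python
import math

def layer_norm(vec, eps=1e-6):
    """Normalize one token vector to zero mean and (almost) unit variance.

    The learned scale and shift of a real layer are omitted in this sketch.
    """
    mean = sum(vec) / len(vec)
    var = sum((v - mean) ** 2 for v in vec) / len(vec)
    return [(v - mean) / math.sqrt(var + eps) for v in vec]

def add_and_norm(x, sublayer_out):
    """Residual connection followed by layer normalization, token by token."""
    return [
        layer_norm([a + b for a, b in zip(x_row, s_row)])
        for x_row, s_row in zip(x, sublayer_out)
    ]

# One token with 4 features, and a hypothetical sub-layer output for it.
x = [[1.0, 2.0, 3.0, 4.0]]
sublayer_out = [[0.5, -0.5, 0.5, -0.5]]
y = add_and_norm(x, sublayer_out)
print([round(v, 3) for v in y[0]])
```

In the TransformerBlock class used for training, the same pattern appears as layernorm1(inputs + attn_output) and layernorm2(out1 + ffn_output).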

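That covers the basic structure and components of the Transformer.

For details, see the following article.

(Memo: Python) A super-introduction to natural language processing: (at last) learning and using how the Transformer works (for near-beginners)

Explains the Transformer, now essential knowledge in natural language processing, from scratch, covering everything from the background of its creation to the model structure. Understand the mechanism and performance of a model that makes full use of attention. (With worked examples.)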

Up to preprocessing

Loading and mapping the data

from tensorflow.keras.datasets import imdb
import pandas as pd

# Load the IMDB reviews; each review is a list of integer word IDs.
(train_data, train_labels), (test_data, test_labels) = imdb.load_data()

# Get the dictionary that maps words to integers.
word_index = imdb.get_word_index()

# Shift every index by 3 to reserve the first IDs for special tokens.
word_index = {k: (v + 3) for k, v in word_index.items()}
word_index["<PAD>"] = 0
word_index["<START>"] = 1
word_index["<UNK>"] = 2  # unknown word
word_index["<UNUSED>"] = 3

# Build the reverse dictionary that maps integers back to words.
reverse_word_index = {value: key for key, value in word_index.items()}

def decode_review(text):
    # Convert a list of word IDs back into a space-separated string.
    return ' '.join([reverse_word_index.get(i, '?') for i in text])

# Decode every review back into text.
train_df = pd.DataFrame(train_data)[0].map(decode_review).reset_index(drop=True)
test_df = pd.DataFrame(test_data)[0].map(decode_review).reset_index(drop=True)

# all_df=pd.concat([train_df,test_df])[::10].reset_index(drop=True)
# all_df
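
To see what the index shift and decode_review do without downloading the dataset, here is a small self-contained sketch. The three-word toy_word_index is a hypothetical stand-in for the dictionary returned by imdb.get_word_index(); the rest mirrors the code above.

```python
# Hypothetical miniature of the dictionary returned by imdb.get_word_index().
toy_word_index = {"the": 1, "movie": 2, "great": 3}

# Shift every index by 3 to make room for the reserved tokens.
word_index = {word: idx + 3 for word, idx in toy_word_index.items()}
word_index.update({"<PAD>": 0, "<START>": 1, "<UNK>": 2, "<UNUSED>": 3})

reverse_word_index = {idx: word for word, idx in word_index.items()}

def decode_review(ids):
    """Turn a list of integer IDs back into a space-separated string."""
    return " ".join(reverse_word_index.get(i, "?") for i in ids)

encoded = [1, 4, 5, 2, 6]  # <START> the movie <UNK> great
print(decode_review(encoded))
```

Without the shift, the most frequent words would collide with the IDs that the encoded reviews use for the start marker and for unknown words.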

Preprocessing

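import re

import nltk
from bs4 import BeautifulSoup
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.stem.porter import PorterStemmer

# Uncomment on the first run to download the required NLTK data.
# nltk.download('wordnet')
# nltk.download('omw-1.4')
# nltk.download('stopwords')

lemmatizer = WordNetLemmatizer()
stemmer = PorterStemmer()

# Build the stop-word set once instead of reloading the list for every word.
stop_words = set(stopwords.words('english'))

def clean_text(x):
    # Remove noise (HTML tags).
    soup = BeautifulSoup(x, 'html.parser')
    text = soup.get_text()

    # Replace everything except letters with spaces.
    text_ = re.sub(r'[^a-zA-Z]', ' ', text)

    # Split into words and drop very short ones (adjust depending on the data).
    text_ = [word for word in text_.split() if len(word) > 2]

    # Lowercase and lemmatize, treating each word as a verb.
    text_ = [lemmatizer.lemmatize(word.lower(), pos="v") for word in text_]

    # Stemming (disabled).
    # text_ = [stemmer.stem(word) for word in text_]

    # Remove stop words.
    words = [word for word in text_ if word not in stop_words]

    # Join the words with spaces to turn them back into a sentence.
    # Return `words` instead if later steps do not need a string.
    return ' '.join(words)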


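# Use every 10th review to keep processing light.
clean_text_df = train_df[::10].map(clean_text)
clean_text_df

clean_text_test_df = test_df[::10].map(clean_text)
clean_text_test_df

# Texts and labels used in the following steps (subsampled the same way).
train_texts = clean_text_df.reset_index(drop=True)
train_label = train_labels[::10]
test_texts = clean_text_test_df.reset_index(drop=True)
test_label = test_labels[::10]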

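
The library-free part of the cleaning can be tried on a single sentence with the sketch below. It is a simplified stand-in for clean_text, not a replacement: HTML stripping and lemmatization are left out, and TOY_STOPWORDS is a hypothetical four-word list used in place of the NLTK stop-word list.

```python
import re

# Hypothetical, tiny stop-word list used only for this illustration.
TOY_STOPWORDS = {"the", "and", "was", "this"}

def simple_clean(text):
    """Apply the regex, length, and stop-word steps to one review."""
    # Replace every non-alphabetic character with a space.
    text = re.sub(r"[^a-zA-Z]", " ", text)
    # Drop words of one or two characters, then lowercase the rest.
    words = [w.lower() for w in text.split() if len(w) > 2]
    # Remove stop words.
    return " ".join(w for w in words if w not in TOY_STOPWORDS)

print(simple_clean("This movie was GREAT!!! 10/10, I loved it."))
```

Digits and punctuation disappear in the first step, which is why the rating "10/10" leaves no trace in the cleaned text. Whether that loss is acceptable depends on the data.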

Converting the sequences to IDs

from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

all_text = pd.concat([train_texts, test_texts]).reset_index(drop=True)

# Split each cleaned review into a list of words.
sentences = [text.split(' ') for text in all_text]

tokenizer = Tokenizer()
tokenizer.fit_on_texts(sentences)

# Replace every word with its integer ID.
sequences_tk = tokenizer.texts_to_sequences(sentences)

# Length of the longest review, in tokens.
MAX_SEQUENCE_LENGTH = max(len(seq) for seq in sequences_tk)
MAX_SEQUENCE_LENGTH

# Pad shorter sequences with 0 so that every row has the same length.
X = pad_sequences(sequences_tk, maxlen=MAX_SEQUENCE_LENGTH, truncating='post')

# Split back into the training and test parts.
X_train = X[:train_texts.shape[0]]
X_test = X[train_texts.shape[0]:]

word_index = tokenizer.word_index
num_words = len(word_index)
print(num_words)
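
What pad_sequences does with these arguments can be sketched in a few lines of plain Python. The helper name pad_pre_truncate_post is hypothetical; it imitates the behavior of the default padding='pre' together with truncating='post', and is meant only as an illustration of the result, not as the Keras implementation.

```python
def pad_pre_truncate_post(sequences, maxlen, value=0):
    """Left-pad short sequences with `value` and cut long ones at the end."""
    padded = []
    for seq in sequences:
        # truncating='post': keep the first maxlen IDs of a long sequence.
        seq = list(seq)[:maxlen]
        # padding='pre' (the Keras default): fill in on the left.
        padded.append([value] * (maxlen - len(seq)) + seq)
    return padded

print(pad_pre_truncate_post([[5, 6], [1, 2, 3, 4, 5]], maxlen=4))
```

Because maxlen is set to the length of the longest review in the code above, nothing is actually truncated there; the truncating argument only matters if a smaller maxlen is chosen to save memory.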

Training

import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers
from tensorflow.keras import optimizers
from tensorflow.keras import regularizers

class TransformerBlock(layers.Layer):
    def __init__(self, embed_dim, num_heads, ff_dim, rate=0.1):
        super(TransformerBlock, self).__init__()
        self.att = layers.MultiHeadAttention(num_heads=num_heads, key_dim=embed_dim)
        self.ffn = keras.Sequential(
            [layers.Dense(ff_dim, activation="relu"), layers.Dense(embed_dim),]
        )
        self.layernorm1 = layers.LayerNormalization(epsilon=1e-6)
        self.layernorm2 = layers.LayerNormalization(epsilon=1e-6)
        self.dropout1 = layers.Dropout(rate)
        self.dropout2 = layers.Dropout(rate)

    def call(self, inputs, training=None):
        attn_output = self.att(inputs, inputs)
        attn_output = self.dropout1(attn_output, training=training)
        out1 = self.layernorm1(inputs + attn_output)
        ffn_output = self.ffn(out1)
        ffn_output = self.dropout2(ffn_output, training=training)
        return self.layernorm2(out1 + ffn_output)


class TokenAndPositionEmbedding(layers.Layer):
    def __init__(self, maxlen, vocab_size, embed_dim):
        super(TokenAndPositionEmbedding, self).__init__()
        self.token_emb = layers.Embedding(input_dim=vocab_size, output_dim=embed_dim)
        self.pos_emb = layers.Embedding(input_dim=maxlen, output_dim=embed_dim)

    def call(self, x):
        maxlen = tf.shape(x)[-1]
        positions = tf.range(start=0, limit=maxlen, delta=1)
        positions = self.pos_emb(positions)
        x = self.token_emb(x)
        return x + positions


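embed_dim = 128  # Embedding size for each token
num_heads = 8  # Number of attention heads
ff_dim = 128  # Hidden layer size in feed forward network inside transformer
maxlen = MAX_SEQUENCE_LENGTH  # Input length produced by pad_sequences above
vocab_size = num_words + 1  # +1 because ID 0 is reserved for padding

inputs = layers.Input(shape=(maxlen,))
embedding_layer = TokenAndPositionEmbedding(maxlen, vocab_size, embed_dim)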
x = embedding_layer(inputs)
transformer_block = TransformerBlock(embed_dim, num_heads, ff_dim)
x = transformer_block(x)
x = layers.GlobalAveragePooling1D()(x)
x = layers.Dropout(0.1)(x)

x = layers.Dense(64, kernel_regularizer=regularizers.l1_l2(l1=0.01, l2=0.01), activation="relu")(x)
x = layers.Dropout(0.1)(x)
outputs = layers.Dense(2, activation="softmax")(x)

model = keras.Model(inputs=inputs, outputs=outputs)



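# Optimizer: Adam with a learning rate of 0.001 (other settings left at their defaults).
adam = optimizers.Adam(learning_rate=0.001)
model.compile(optimizer=adam,
              loss='sparse_categorical_crossentropy',
              metrics=['acc'])

# Callbacks: stop when val_loss has not improved for 10 epochs.
from tensorflow.keras.callbacks import EarlyStopping
callbacks = [
    EarlyStopping(patience=10)
]

# seed_everything is not defined in this article; the version below is an
# assumed definition that fixes the Python, NumPy, and TensorFlow seeds.
import random
import numpy as np

def seed_everything(seed):
    random.seed(seed)
    np.random.seed(seed)
    tf.random.set_seed(seed)

seed_everything(42)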

# Train the model.
history=model.fit(x=X_train,
                  y=train_label,
                  batch_size=64,
                  epochs=100,
                  validation_split=0.2,
                  callbacks=callbacks,
                  shuffle=True)


import matplotlib.pyplot as plt

hist_df = pd.DataFrame(history.history)

# Plot accuracy (DataFrame.plot creates its own figure).
hist_df[['acc', 'val_acc']].plot()
plt.ylabel('accuracy')
plt.xlabel('epoch')
plt.show()

# Plot loss.
hist_df[['loss', 'val_loss']].plot()
plt.ylabel('loss')
plt.xlabel('epoch')
plt.show()

That is all that is needed to get as far as classification.
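
The model above ends in a two-unit softmax layer, so a call such as model.predict(X_test) would return one probability per class for each review. The sketch below shows, in plain Python, how such probabilities could be turned into labels and an accuracy figure. The probabilities and labels are made-up values, and predict_labels and accuracy are hypothetical helper names, not part of Keras.

```python
def predict_labels(probabilities):
    """Return the index of the largest probability in each row."""
    return [max(range(len(row)), key=lambda i: row[i]) for row in probabilities]

def accuracy(predicted, expected):
    """Fraction of predictions that match the expected labels."""
    correct = sum(p == e for p, e in zip(predicted, expected))
    return correct / len(expected)

# Hypothetical softmax outputs for four reviews: [P(negative), P(positive)].
probabilities = [[0.9, 0.1], [0.2, 0.8], [0.6, 0.4], [0.3, 0.7]]
true_labels = [0, 1, 1, 1]

predicted = predict_labels(probabilities)
print(predicted, accuracy(predicted, true_labels))
```

With real model output, the same argmax step is usually done on the whole array at once, but the logic is the same.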

BERT

Features

BERT

＝Bidirectional Encoder Representations from Transformers

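= in other words, bidirectional encoded representations obtained with the Transformer

As the name says, a model that uses the Transformer encoder
Can handle a variety of natural language processing tasks
- Can be fine-tuned simply by adding an output layer.
- Can make high-accuracy predictions for translation, document classification, question answering, and so on
Pre-trained on two tasks
- Learns text from both directions (from the start and from the end of the sentence), and can also learn the relationship between sentences
- Performs well enough that it is sometimes described as "having become able to read context"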

Principle and architecture

BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding

We introduce a new language representation model called BERT, which stands for Bidirectional Encoder Representations fro...

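Natural language processing uses distributed representations, a technique that converts words into high-dimensional vectors. These representations capture the meaning and context of words. The input data, an ordered run of words or text, is called a sequence.

BERT (Bidirectional Encoder Representations from Transformers) is a model that takes a sequence as input and can predict another sequence from it. It is trained in two stages: pre-training and fine-tuning.

First, in pre-training, the model is trained on unlabeled data using two tasks: predicting masked-out words (masked language modeling) and predicting whether one sentence follows another (next sentence prediction). Through these tasks, the model acquires the ability to understand linguistic features and context.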
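
The masked-word task can be pictured with the following sketch, which builds one training example from a sentence. It is deliberately simplified and should be read as an assumption-laden illustration: every selected token is replaced by [MASK], whereas the original BERT recipe selects about 15% of the tokens and sometimes substitutes a random token or leaves the token unchanged. The function name mask_tokens and the example sentence are made up.

```python
import random

def mask_tokens(tokens, mask_prob=0.15, seed=0):
    """Replace randomly chosen tokens with [MASK].

    Returns (masked_tokens, labels), where labels[i] holds the original token
    at a masked position and None everywhere else.
    """
    rng = random.Random(seed)
    masked, labels = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            masked.append("[MASK]")
            labels.append(tok)   # the model must predict this token
        else:
            masked.append(tok)
            labels.append(None)  # not part of the prediction target
    return masked, labels

tokens = "the movie was surprisingly good and i liked it".split()
# A high mask_prob is used here only so that the short example shows some masks.
masked, labels = mask_tokens(tokens, mask_prob=0.5, seed=0)
print(masked)
print(labels)
```

No human labeling is needed: the labels come from the text itself, which is why this stage can use large amounts of unlabeled data.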

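Next, in fine-tuning, the pre-trained weights are used as initial values and the model is adjusted using labeled data. The goal of fine-tuning is to optimize the model for a specific task and improve its prediction accuracy, for example on concrete natural language processing tasks such as document classification or sentiment analysis.

In summary, BERT is trained in two stages, pre-training and fine-tuning: pre-training uses unlabeled data, and fine-tuning uses labeled data to optimize the model. This allows BERT to perform well on a wide range of natural language processing tasks.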

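E: embedding representation of the input
C: hidden vector of the [CLS] token
Ti: hidden vector of the i-th token of the text
[CLS] is a special symbol added to the start of every input
[SEP] is a special separator token placed between sentences (for example, between a question and an answer)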
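
The role of [CLS] and [SEP] is easier to see on a concrete input. The sketch below assembles a sentence pair by hand in plain Python; build_bert_input is a hypothetical helper written for this illustration, and it works on word strings rather than on the vocabulary IDs that a real tokenizer returns.

```python
def build_bert_input(tokens_a, tokens_b, max_length):
    """Assemble [CLS] A [SEP] B [SEP] with segment IDs and an attention mask."""
    tokens = ["[CLS]"] + tokens_a + ["[SEP]"] + tokens_b + ["[SEP]"]
    # Segment 0 covers [CLS], sentence A and its [SEP]; segment 1 covers the rest.
    token_type_ids = [0] * (len(tokens_a) + 2) + [1] * (len(tokens_b) + 1)
    # 1 marks real tokens; padding added below is marked with 0.
    attention_mask = [1] * len(tokens)
    pad = max_length - len(tokens)
    if pad < 0:
        raise ValueError("max_length is too small for this pair")
    return (
        tokens + ["[PAD]"] * pad,
        token_type_ids + [0] * pad,
        attention_mask + [0] * pad,
    )

tokens, token_type_ids, attention_mask = build_bert_input(
    ["who", "directed", "it"], ["nolan", "did"], max_length=10
)
print(tokens)
print(token_type_ids)
print(attention_mask)
```

These three lists correspond to the input_ids, token_type_ids, and attention_mask that the to_features function in this article fills in. For single-text classification, as in this article, only the first segment is used.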

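The details are summarized in the following article.

(Memo: Python) A super-introduction to natural language processing: (at last) learning and using how BERT works (English text) (for near-beginners)

Explains BERT, now essential knowledge in natural language processing, in a way that beginners can follow. Covers BERT's features, how it works, and example applications. Use BERT to improve your natural language processing skills.

Loading and preprocessing the data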

from bs4 import BeautifulSoup    # used by clean_text to strip HTML

from tensorflow.keras.datasets import imdb
import pandas as pd

# Load the IMDB reviews; each review is a list of integer word IDs.
(train_data, train_labels), (test_data, test_labels) = imdb.load_data()

# Get the dictionary that maps words to integers.
word_index = imdb.get_word_index()

# Shift every index by 3 to reserve the first IDs for special tokens.
word_index = {k: (v + 3) for k, v in word_index.items()}
word_index["<PAD>"] = 0
word_index["<START>"] = 1
word_index["<UNK>"] = 2  # unknown word
word_index["<UNUSED>"] = 3

# Build the reverse dictionary that maps integers back to words.
reverse_word_index = {value: key for key, value in word_index.items()}

def decode_review(text):
    # Convert a list of word IDs back into a space-separated string.
    return ' '.join([reverse_word_index.get(i, '?') for i in text])

# Decode every review back into text (subsampling happens later).
train_df = pd.DataFrame(train_data)[0].map(decode_review).reset_index(drop=True)
test_df = pd.DataFrame(test_data)[0].map(decode_review).reset_index(drop=True)



# Preprocessing
import re

import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.stem.porter import PorterStemmer

# Uncomment on the first run to download the required NLTK data.
# nltk.download('wordnet')
# nltk.download('omw-1.4')
# nltk.download('stopwords')

lemmatizer = WordNetLemmatizer()
stemmer = PorterStemmer()

def clean_text(x):
    # Remove noise (HTML tags).
    soup = BeautifulSoup(x, 'html.parser')
    text = soup.get_text()

    # Replace everything except letters with spaces.
    text_ = re.sub(r'[^a-zA-Z]', ' ', text)

    # Split into words (the short-word filter is disabled here).
    text_ = [word for word in text_.split()]  # if len(word) > 2]

    # Lowercase and lemmatize, treating each word as a verb.
    text_ = [lemmatizer.lemmatize(word.lower(), pos="v") for word in text_]

    # Stemming (disabled).
    # text_ = [stemmer.stem(word) for word in text_]

    # Stop-word removal (disabled).
    # text_ = [word for word in text_ if word not in stopwords.words('english')]

    # Join the words with spaces to turn them back into a sentence.
    return ' '.join(text_)



import numpy as np

# Subsample to reduce the amount of data and keep processing light.
clean_text_df = train_df[::5].map(clean_text)
clean_text_df

clean_text_test_df = test_df[::10].map(clean_text)
clean_text_test_df

# Texts and labels, subsampled with the same step so that they stay aligned.
train_texts = clean_text_df.reset_index(drop=True)
train_label = train_labels[::5]

test_texts = clean_text_test_df.reset_index(drop=True)
test_label = test_labels[::10]

np.shape(train_texts), np.shape(train_label), np.shape(test_texts), np.shape(test_label)

Tokenizer: BERT

Shape the data so that it can be fed into the BERT model.

import numpy as np
import tensorflow as tf
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")


# Convert a list of texts into input data for transformers
# (written for BERT; it may not work for other BERT-family models).

def to_features(texts, max_length):
    shape = (len(texts), max_length)
    # See the glossary for input_ids, attention_mask, and token_type_ids (cf. https://huggingface.co/transformers/glossary.html)
    input_ids = np.zeros(shape, dtype="int32")
    attention_mask = np.zeros(shape, dtype="int32")
    token_type_ids = np.zeros(shape, dtype="int32")
    for i, text in enumerate(texts):
        # Pad or truncate every text to exactly max_length tokens.
        encoded_dict = tokenizer(text, max_length=max_length, padding="max_length", truncation=True)
        input_ids[i] = encoded_dict["input_ids"]
        attention_mask[i] = encoded_dict["attention_mask"]
        token_type_ids[i] = encoded_dict["token_type_ids"]
    return [tf.cast(input_ids, tf.int32), tf.cast(attention_mask, tf.int32), tf.cast(token_type_ids, tf.int32)]


max_length=500

x_train = to_features(train_texts, max_length)
# y_train = tf.keras.utils.to_categorical(train_labels, num_classes=4)
y_train = tf.cast(train_label, tf.int32)

x_valid = to_features(test_texts, max_length)
# y_valid = tf.keras.utils.to_categorical(valid_labels, num_classes=4)
y_valid = tf.cast(test_label, tf.int32)

Training a classifier with BERT

from transformers import TFBertModel
bert = TFBertModel.from_pretrained('bert-base-uncased')

# Uncomment to freeze BERT (so that it is not trained).
# bert.trainable= not True

from tensorflow import keras
from tensorflow.keras import optimizers, losses, metrics
from tensorflow.keras.models import Model
from tensorflow.keras.layers import Input, Dense, Dropout
from tensorflow.keras.callbacks import EarlyStopping, ModelCheckpoint

def make_model(bert, num_classes, max_length, bert_frozen=True):
    # The BERT network is held as the first layer of TFBertModel, so take it out via bert.layers[0].

    # Uncomment to freeze BERT according to bert_frozen.
    # As written, the line is disabled, so BERT itself is fine-tuned and bert_frozen has no effect.
    #bert.layers[0].trainable = not bert_frozen

    # input
    input_ids = Input(shape=(max_length, ), dtype='int32', name='input_ids')
    attention_mask = Input(shape=(max_length, ), dtype='int32', name='attention_mask')
    token_type_ids = Input(shape=(max_length, ), dtype='int32', name='token_type_ids')
    inputs = [input_ids, attention_mask,token_type_ids]

    # bert
    x = bert.layers[0](inputs)
    # x: sequence_output, pooled_output
    # There are two kinds of output.

    # Following TFBertForSequenceClassification, use only pooled_output.
    out = x[1]

    # fc layer(add layers for transfer learning)
    #out = Dropout(0.25)(out)
    out = Dense(128, activation='relu')(out)
    out = Dropout(0.4)(out)
    out = Dense(num_classes, activation='softmax')(out)
    return Model(inputs=inputs, outputs=out)



# seed_everything is assumed to be a helper that fixes the random seeds
# (its definition is not part of this section).
seed_everything(42)
# Adam with a learning rate of 0.001 (other settings left at their defaults).
adam = optimizers.Adam(learning_rate=0.001)

epochs = 10
max_length = 500
batch_size = 64
num_classes = 2

model = make_model(bert, num_classes, max_length)

model.compile(optimizer=adam,
              loss='sparse_categorical_crossentropy',
              metrics=['acc'])

# Train the model.
seed_everything(42)

# Stop after 3 epochs without improvement and restore the best weights.
callbacks = [
    EarlyStopping(patience=3,restore_best_weights=True)
]

# use_multiprocessing and workers are omitted: they only apply to generator input.
result=model.fit(x=x_train,
              y=y_train,
              batch_size=batch_size,
              epochs=epochs,
              validation_data=(x_valid, y_valid),  # validation is slow; remove this argument to train only
              callbacks=callbacks,
              )

This is how you can get as far as classification.