An Advanced Method for Fine-Tuning Mistral-7B: Direct Preference Optimization


Boost the performance of your supervised fine-tuned models

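Pre-trained large language models (LLMs) can only perform next-token prediction, which makes them unable to answer questions on their own. This is why these base models are then fine-tuned on pairs of instructions and answers so that they act as helpful assistants. However, this process can still be flawed: fine-tuned LLMs can be biased, toxic, harmful, and so on. This is where Reinforcement Learning from Human Feedback (RLHF) comes into play.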

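RLHF presents the LLM with different answers that are ranked according to a desired behavior (helpfulness, toxicity, etc.). The model learns to output the best answer among these candidates, thereby mimicking the behavior we want to instill. Often seen as a way to censor models, this process has recently become popular for improving performance, as shown by neural-chat-7b-v3-1.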

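In this article, we will create NeuralHermes-2.5 by fine-tuning OpenHermes-2.5 with an RLHF-like technique: Direct Preference Optimization (DPO). To do so, we will introduce a preference dataset, describe how the DPO algorithm works, and apply it to our model. We will see that it significantly improves the performance of the base model on the Open LLM Leaderboard.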


Choosing the Preference Dataset

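Preference datasets are not standardized, but they typically consist of a collection of answers ranked by humans. This ranking is essential, because the RLHF process fine-tunes LLMs to output the preferred answer. One example is the Anthropic/hh-rlhf preference dataset.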

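The structure of the dataset is straightforward: each row has one chosen (preferred) answer and one rejected answer. The goal of RLHF is to guide the model to output the preferred answer.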
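To make that row structure concrete, here is a minimal sketch of a single preference record in Python. The PreferencePair class and the example strings are hypothetical and made up for illustration; they are not taken from Anthropic/hh-rlhf.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PreferencePair:
    """One row of a preference dataset (hypothetical structure)."""
    prompt: str    # what the user asked
    chosen: str    # the answer that was ranked higher
    rejected: str  # the answer that was ranked lower


# Made-up example row, for illustration only.
row = PreferencePair(
    prompt="How do I boil an egg?",
    chosen="Place the egg in boiling water for about ten minutes, then cool it.",
    rejected="Eggs cannot be boiled.",
)

print(row.prompt)
print("chosen:  ", row.chosen)
print("rejected:", row.rejected)
```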

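Preference datasets are costly and difficult to make, as they require collecting manual feedback from humans. This feedback is also subjective: it can easily be biased toward confident (but wrong) answers, or contradict itself (different annotators have different values). Over time, several solutions have been proposed to tackle these issues, such as replacing human feedback with AI feedback (RLAIF).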

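These datasets also tend to be much smaller than fine-tuning datasets. For example, neural-chat-7b-v3-1, the best 7B LLM on the Open LLM Leaderboard when it was released, used 518k samples for fine-tuning (Open-Orca/SlimOrca) but only 12.9k samples for RLHF (Intel/orca_dpo_pairs). In this case, the authors generated the preferred answers with GPT-4/3.5 and the rejected answers with Llama-2-13b-chat. It is a clever way to bypass human feedback by relying only on models with different levels of performance.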
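The idea of letting a stronger model supply the preferred answer and a weaker model the rejected one can be sketched in a few lines. The function name build_pairs and the sample strings are hypothetical; in practice the answers would come from real model calls, which are omitted here. Skipping pairs with identical answers is an assumption of this sketch, not something the dataset authors describe.

```python
def build_pairs(prompts, strong_answers, weak_answers):
    """Pair each prompt with a 'chosen' answer from the stronger model
    and a 'rejected' answer from the weaker model.

    Pairs whose two answers are identical carry no preference signal,
    so this sketch skips them.
    """
    pairs = []
    for prompt, strong, weak in zip(prompts, strong_answers, weak_answers):
        if strong.strip() == weak.strip():
            continue
        pairs.append({"prompt": prompt, "chosen": strong, "rejected": weak})
    return pairs


# Hypothetical outputs standing in for the two models.
prompts = ["What is 2 + 2?", "Name a primary color."]
strong = ["2 + 2 equals 4.", "Red."]
weak = ["It is probably 5.", "Red."]

pairs = build_pairs(prompts, strong, weak)
print(pairs)
```

Treating the stronger model's answer as always better is a heuristic, so some pairs built this way will be mislabeled.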

Direct Preference Optimization (DPO)

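While the concept of RLHF has been used in robotics for a long time, it was popularized for LLMs in OpenAI's paper "Fine-Tuning Language Models from Human Preferences". In this paper, the authors present a framework in which a reward model is trained to approximate human feedback. This reward model is then used to optimize the fine-tuned model's policy with the Proximal Policy Optimization (PPO) algorithm.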

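The core idea of PPO is to make smaller, incremental updates to the policy, since larger updates may lead to instability or suboptimal solutions. In practice, this technique is unfortunately still unstable (the loss diverges), difficult to reproduce (numerous hyperparameters, sensitivity to random seeds), and computationally expensive.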

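This is where Direct Preference Optimization (DPO) comes into play. DPO simplifies control by treating the task as a classification problem. Concretely, it uses two models: the trained model (or policy model) and a copy of it called the reference model. During training, the goal is to make sure the trained model assigns a higher probability to preferred answers than the reference model does. Conversely, we also want it to assign a lower probability to rejected answers. This means we reward the LLM for good answers and penalize it for bad ones.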

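By using the LLM itself as a reward model and employing a binary cross-entropy objective, DPO efficiently aligns the model's outputs with human preferences without the need for extensive sampling, reward model fitting, or intricate hyperparameter tuning. The result is a more stable, more efficient, and computationally less demanding process.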
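The binary cross-entropy objective described above can be written out for a single (chosen, rejected) pair. The following is a minimal sketch in plain Python rather than the TRL implementation; the function name dpo_loss and the log-probability values are hypothetical, and each argument is assumed to be the summed log-probability of a full answer given the prompt.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """DPO loss for one (chosen, rejected) pair.

    Each argument is the log-probability of a full answer given the
    prompt, under the trained (policy) model or the frozen reference.
    """
    # How much more the policy likes each answer than the reference does.
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp

    # Binary cross-entropy on the scaled margin: -log(sigmoid(z)).
    z = beta * (chosen_logratio - rejected_logratio)
    return math.log1p(math.exp(-z))


# Policy identical to the reference: nothing has been learned yet.
start = dpo_loss(-12.0, -15.0, -12.0, -15.0)
# Policy raised the chosen answer and lowered the rejected one.
later = dpo_loss(-10.0, -18.0, -12.0, -15.0)
print(start, later)
```

The loss starts at log(2) when the two models agree and shrinks as the policy widens the gap between chosen and rejected answers relative to the reference. A production implementation would use a numerically stable log-sigmoid instead of the direct formula used here.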

Formatting the Data

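In this example, we will fine-tune OpenHermes-2.5-Mistral-7B, a Mistral-7b model that has only been supervised fine-tuned. To this end, we will use the Intel/orca_dpo_pairs dataset to align our model and improve its performance. We call this new model NeuralHermes-2.5-Mistral-7B.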

First, install the required packages:

pip install -q datasets trl peft bitsandbytes sentencepiece wandb

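Next, add your Hugging Face access token to the Secrets tab in Google Colab. The code below also reads a Weights & Biases key, so store both secrets, under the names huggingface and wandb.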

import os
import gc
import torch

import transformers
from transformers import AutoModelForCausalLM, AutoTokenizer, TrainingArguments, BitsAndBytesConfig
from datasets import load_dataset
from peft import LoraConfig, PeftModel, get_peft_model, prepare_model_for_kbit_training
from trl import DPOTrainer
import bitsandbytes as bnb
from google.colab import userdata
import wandb

# Defined in the secrets tab in Google Colab
hf_token = userdata.get('huggingface')
wb_token = userdata.get('wandb')
wandb.login(key=wb_token)

model_name = "teknium/OpenHermes-2.5-Mistral-7B"
new_model = "NeuralHermes-2.5-Mistral-7B"

OpenHermes-2.5-Mistral-7B uses a specific chat template called ChatML. Here is an example of a conversation formatted with this template:

<|im_start|>system
You are a helpful chatbot assistant.<|im_end|>
<|im_start|>user
Hi<|im_end|>
<|im_start|>assistant
Hi, how can I help you?<|im_end|>

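ChatML defines different roles (system, user, assistant) and appends special tokens (<|im_start|> and <|im_end|>) to separate them. In addition, DPOTrainer requires a specific format with three columns: prompt, chosen, and rejected.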

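Our dataset contains four columns: system, question, and two answer columns, one generated by ChatGPT and one by llama2-13b-chat. We will simply concatenate the system and question columns into the prompt column. The ChatGPT answer becomes "chosen" and the llama2-13b-chat answer becomes "rejected". Note that the code below reads the two answers from columns already named chosen and rejected; if your copy of the dataset names them chatgpt and llama2-13b-chat instead, adjust the keys accordingly.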

To format the dataset in a reliable way, we will use the tokenizer's apply_chat_template() function, which already uses ChatML.

def chatml_format(example):
    # Format system
    if len(example['system']) > 0:
        message = {"role": "system", "content": example['system']}
        system = tokenizer.apply_chat_template([message], tokenize=False)
    else:
        system = ""

    # Format instruction
    message = {"role": "user", "content": example['question']}
    prompt = tokenizer.apply_chat_template([message], tokenize=False, add_generation_prompt=True)

    # Format chosen answer
    chosen = example['chosen'] + "<|im_end|>\n"

    # Format rejected answer
    rejected = example['rejected'] + "<|im_end|>\n"

    return {
        "prompt": system + prompt,
        "chosen": chosen,
        "rejected": rejected,
    }

# Load dataset
dataset = load_dataset("Intel/orca_dpo_pairs")['train']

# Save columns
original_columns = dataset.column_names

# Tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "left"

# Format dataset
dataset = dataset.map(
    chatml_format,
    remove_columns=original_columns
)

Let's print a sample of the formatted dataset:

{
    'prompt': '<|im_start|>system\nYou are an AI assistant. You will be given a task. You must generate a detailed and long answer.<|im_end|>\n<|im_start|>user\nGenerate an approximately fifteen-word sentence that describes all this data: Midsummer House eatType restaurant; Midsummer House food Chinese; Midsummer House priceRange moderate; Midsummer House customer rating 3 out of 5; Midsummer House near All Bar One<|im_end|>\n<|im_start|>assistant\n',
    'chosen': 'Midsummer House is a moderately priced Chinese restaurant with a 3/5 customer rating, located near All Bar One.<|im_end|>\n',
    'rejected': ' Sure! Here\'s a sentence that describes all the data you provided:\n\n"Midsummer House is a moderately priced Chinese restaurant with a customer rating of 3 out of 5, located near All Bar One, offering a variety of delicious dishes."<|im_end|>\n'
}

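As we can see, the prompt combines the system and user instructions. Thanks to the add_generation_prompt=True argument, it also appends the beginning of the assistant's answer. If you want to skip this step, you can directly use the preprocessed dataset mlabonne/chatml_dpo_pairs.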
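To see what the template contributes, the sketch below builds the same kind of prompt string by hand, without a tokenizer. The helper to_chatml is hypothetical and only mirrors the layout visible in the printed sample; the tokenizer's real template may differ in details, so apply_chat_template() remains the reliable route.

```python
def to_chatml(messages, add_generation_prompt=False):
    """Render a list of {"role", "content"} dicts as a ChatML string."""
    text = ""
    for message in messages:
        text += f"<|im_start|>{message['role']}\n{message['content']}<|im_end|>\n"
    if add_generation_prompt:
        # Open the assistant turn so the model continues from here.
        text += "<|im_start|>assistant\n"
    return text


messages = [
    {"role": "system", "content": "You are a helpful chatbot assistant."},
    {"role": "user", "content": "Hi"},
]
prompt = to_chatml(messages, add_generation_prompt=True)
print(prompt)
```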

Training the Model with DPO

Next, we define the LoRA configuration used to train the model. As described in Intel's blog post, we set the rank value equal to lora_alpha and add adapters to all the linear modules.

# LoRA configuration
peft_config = LoraConfig(
    r=16,
    lora_alpha=16,
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
    target_modules=['k_proj', 'gate_proj', 'v_proj', 'up_proj', 'q_proj', 'o_proj', 'down_proj']
)

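We are now ready to load the model we want to fine-tune with DPO. In this case, two models are required: the model to fine-tune and the reference model. Loading both explicitly is mostly for readability, since the DPOTrainer object automatically creates a reference model if none is provided.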

# Model to fine-tune
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.float16,
    load_in_4bit=True
)
model.config.use_cache = False

# Reference model
ref_model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.float16,
    load_in_4bit=True
)

The final step is to provide all the hyperparameters to TrainingArguments and DPOTrainer:

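The beta parameter is unique to DPO, since it controls how far the trained model may diverge from the initial policy (0.1 is a typical value).
Compared to the values described in Intel's blog post, we lower the learning rate (from 5e-4 to 5e-5) and the number of steps (from 1,000 to 200). I tuned these values manually over a few runs to stabilize training and get the best results.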

We can now start training the model. Note that it requires an A100 GPU and takes about 1 hour to complete.

# Training arguments
training_args = TrainingArguments(
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,
    gradient_checkpointing=True,
    learning_rate=5e-5,
    lr_scheduler_type="cosine",
    max_steps=200,
    save_strategy="no",
    logging_steps=1,
    output_dir=new_model,
    optim="paged_adamw_32bit",
    warmup_steps=100,
    bf16=True,
    report_to="wandb",
)

# Create DPO trainer
dpo_trainer = DPOTrainer(
    model,
    ref_model,
    args=training_args,
    train_dataset=dataset,
    tokenizer=tokenizer,
    peft_config=peft_config,
    beta=0.1,
    max_prompt_length=1024,
    max_length=1536,
)

# Fine-tune model with DPO
dpo_trainer.train()
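As a quick sanity check on the training arguments above, the arithmetic below estimates how much of the dataset a 200-step run actually sees. Two values are assumptions rather than settings from the code: a single GPU, and a dataset size of roughly 12.9k pairs, the figure quoted earlier for Intel/orca_dpo_pairs.

```python
# Values copied from the training arguments above.
per_device_train_batch_size = 4
gradient_accumulation_steps = 4
max_steps = 200

# Assumptions: one GPU and roughly 12.9k preference pairs.
num_gpus = 1
dataset_size = 12_900

# Number of pairs consumed per optimizer step.
effective_batch_size = (
    per_device_train_batch_size * gradient_accumulation_steps * num_gpus
)
# Total pairs consumed over the whole run.
samples_seen = effective_batch_size * max_steps
fraction_of_dataset = samples_seen / dataset_size

print(effective_batch_size, samples_seen, round(fraction_of_dataset, 2))
```

Under these assumptions the run processes 3,200 pairs, about a quarter of the dataset, which is consistent with the remark in the conclusion that more preference data could still be injected.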

Once fine-tuning is complete, we can analyze the training metrics logged to Weights & Biases.

Interestingly, the training loss quickly drops to zero (before step 50), despite the 100 warmup steps. Meanwhile, the other metrics keep evolving.
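For context on how small the learning rate still is at step 50, here is a sketch of a linear warmup followed by cosine decay, using the values from the training arguments (5e-5 peak, 100 warmup steps, 200 total steps). The function lr_at_step is hypothetical and assumes the usual warmup-plus-cosine shape; the scheduler inside transformers may differ in edge cases.

```python
import math


def lr_at_step(step, base_lr=5e-5, warmup_steps=100, max_steps=200):
    """Learning rate under linear warmup followed by cosine decay."""
    if step < warmup_steps:
        # Linear ramp from 0 up to base_lr.
        return base_lr * step / warmup_steps
    # Cosine decay from base_lr down to 0 over the remaining steps.
    progress = (step - warmup_steps) / (max_steps - warmup_steps)
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))


for step in (0, 50, 100, 150, 200):
    print(step, lr_at_step(step))
```

At step 50 the learning rate is only half of its peak value, so the loss reaches zero while the optimizer is still warming up.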

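The train/rewards/chosen and train/rewards/rejected plots correspond to the mean difference between the log probabilities output by the trained model and the reference model. It makes sense that they diverge over time, as the trained model learns the preferred answers. The train/rewards/margins plot shows the difference between these two plots. Finally, the train/rewards/accuracies plot shows how often the preferred answer is chosen. The trained model quickly reaches perfect accuracy, which is a good sign but may also mean that the difference between preferred and rejected answers is too obvious.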
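The four reward metrics can be reproduced from per-example log-probabilities. The sketch below is plain Python with hypothetical numbers; the function reward_metrics is not part of TRL. It scales the log-probability differences by beta, following the implicit reward used in DPO, and whether the logged values use exactly this scaling should be treated as an assumption.

```python
def reward_metrics(policy_chosen, policy_rejected, ref_chosen, ref_rejected, beta=0.1):
    """Batch-level DPO reward statistics from per-example log-probabilities."""
    # Implicit reward: beta * (policy log-prob - reference log-prob).
    chosen = [beta * (p - r) for p, r in zip(policy_chosen, ref_chosen)]
    rejected = [beta * (p - r) for p, r in zip(policy_rejected, ref_rejected)]
    margins = [c - r for c, r in zip(chosen, rejected)]
    n = len(margins)
    return {
        "rewards/chosen": sum(chosen) / n,
        "rewards/rejected": sum(rejected) / n,
        "rewards/margins": sum(margins) / n,
        # Share of pairs where the chosen answer gets the higher reward.
        "rewards/accuracies": sum(m > 0 for m in margins) / n,
    }


# Hypothetical log-probabilities for a batch of three pairs.
metrics = reward_metrics(
    policy_chosen=[-10.0, -20.0, -30.0],
    policy_rejected=[-18.0, -25.0, -29.0],
    ref_chosen=[-12.0, -21.0, -30.0],
    ref_rejected=[-15.0, -24.0, -31.0],
)
print(metrics)
```

In this made-up batch the third pair has a negative margin, so the accuracy is two out of three. An accuracy of 1.0 means every pair in the batch has a positive margin.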

Now we can merge the adapter with the original model. We then save the merged model and the tokenizer and push them to the Hugging Face Hub.

# Save artifacts
dpo_trainer.model.save_pretrained("final_checkpoint")
tokenizer.save_pretrained("final_checkpoint")

# Flush memory
del dpo_trainer, model, ref_model
gc.collect()
torch.cuda.empty_cache()

# Reload model in FP16 (instead of 4-bit)
base_model = AutoModelForCausalLM.from_pretrained(
    model_name,
    return_dict=True,
    torch_dtype=torch.float16,
)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Merge base model with the adapter
model = PeftModel.from_pretrained(base_model, "final_checkpoint")
model = model.merge_and_unload()

# Save model and tokenizer
model.save_pretrained(new_model)
tokenizer.save_pretrained(new_model)

# Push them to the HF Hub
model.push_to_hub(new_model, use_temp_dir=False, token=hf_token)
tokenizer.push_to_hub(new_model, use_temp_dir=False, token=hf_token)

Let's see how our model performs in a real test. Our test question: "What is a Large Language Model?"

# Format prompt
message = [
    {"role": "system", "content": "You are a helpful assistant chatbot."},
    {"role": "user", "content": "What is a Large Language Model?"}
]
tokenizer = AutoTokenizer.from_pretrained(new_model)
prompt = tokenizer.apply_chat_template(message, add_generation_prompt=True, tokenize=False)

# Create pipeline
pipeline = transformers.pipeline(
    "text-generation",
    model=new_model,
    tokenizer=tokenizer
)

# Generate text
sequences = pipeline(
    prompt,
    do_sample=True,
    temperature=0.7,
    top_p=0.9,
    num_return_sequences=1,
    max_length=200,
)
print(sequences[0]['generated_text'])

The model's answer:

A large language model is a type of artificial intelligence (AI) system that has been trained on vast amounts of text data. These models are designed to understand and generate human language, allowing them to perform various natural language processing tasks, such as text generation, language translation, and question answering. Large language models typically use deep learning techniques, like recurrent neural networks (RNNs) or transformers, to learn patterns and relationships in the data, enabling them to generate coherent and contextually relevant responses. The size of these models, in terms of the number of parameters and the volume of data they are trained on, plays a significant role in their ability to comprehend and produce complex language structures.

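Next, let's evaluate the merged model. Since it is a general-purpose model, we can use lm-evaluation-harness to evaluate it. As this process is quite resource-intensive, we can also submit the model directly for evaluation on the Open LLM Leaderboard.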

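Compared to the original model, NeuralHermes-2.5-Mistral-7B improved the average score by 6.7 points (particularly on GSM8K). This is an unexpectedly large improvement, which showcases the power of Direct Preference Optimization.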

Conclusion

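In this article, we used DPO to fine-tune a model that had already been supervised fine-tuned, and created our own NeuralHermes-2.5 model. By leveraging a high-quality preference dataset, we built a sample-efficient fine-tuning pipeline that produced a significant improvement on the Open LLM Leaderboard.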

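Note that our fine-tuning pipeline can still be improved in several ways. For example, the preference dataset is still quite raw and could be improved with more filtering and by using different models. In addition, many hyperparameters can still be tuned to achieve better results. In particular, the learning rate can be lowered further in order to train the model for more steps and inject more preference data.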

References

Fine-tune a Mistral-7b model with Direct Preference Optimization (Maxime Labonne)

Fine-tune Llama 2 with DPO (Kashif Rasul, Younes Belkada, Leandro von Werra)
Supervised Fine-Tuning and Direct Preference Optimization on Intel Gaudi2 (Kaokao Lv, Wenxin Zhang, Haihao Shen)
llama2-fine-tune (mzbac)
