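Training an Agent¶

When we talk about training a reinforcement learning (RL) agent, we mean teaching it to make good decisions through experience. Unlike supervised learning, where we show examples of correct answers, an RL agent learns by trying different actions and observing the results. It is like learning to ride a bicycle: you try different movements, fall a few times, and gradually learn what works.

The goal is to develop a **policy**: a rule that tells the agent which action to take in each situation to maximize long-term reward.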

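Understanding Q-Learning Intuitively¶

In this tutorial, we'll use Q-learning to solve the Blackjack environment. But first, let's understand conceptually how Q-learning works.

Q-learning builds a large "cheat sheet" called a Q-table that tells the agent how good each action is in each situation.

Rows = the different situations (states) the agent can encounter
Columns = the different actions the agent can take
Values = how good that action is in that situation (expected future reward)

For Blackjack:

States: your hand value, the dealer's showing card, and whether you hold a usable ace
Actions: hit (take another card) or stand (keep the current hand)
Q-values: the expected reward of each action in each state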
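To make the table picture concrete, here is a minimal sketch of a Q-table holding just two Blackjack states. The numbers are made up for illustration and `best_action` is a hypothetical helper, not part of Gymnasium.

```python
# A hand-filled Q-table for two Blackjack states (values are invented for illustration).
# Key: (player_sum, dealer_card, usable_ace); value: [Q(stand), Q(hit)]
STAND, HIT = 0, 1

q_table = {
    (20, 10, False): [0.45, -0.85],   # standing on 20 looks far better than hitting
    (12, 10, False): [-0.55, -0.40],  # both look bad; hitting is the lesser evil
}


def best_action(table, state):
    """Return the index of the action with the highest Q-value in `state`."""
    values = table[state]
    return max(range(len(values)), key=lambda a: values[a])


print(best_action(q_table, (20, 10, False)))  # 0 (stand)
print(best_action(q_table, (12, 10, False)))  # 1 (hit)
```

Each row is one state and each position in the list is one action, so reading off the best action is a lookup followed by an argmax.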

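The Learning Process¶

Try an action and see what happens (reward + new state)
Update your cheat sheet: "That action was better/worse than I thought"
Gradually improve by trying actions and updating estimates
Balance exploration and exploitation: try new things vs. use what you already know works

Why it works: over time, good actions get higher Q-values and bad actions get lower Q-values. The agent learns to pick the action with the highest expected reward.

This page provides a short outline of how to train an agent for a Gymnasium environment. We'll use tabular Q-learning to solve Blackjack-v1. For complete tutorials with other environments and algorithms, see the training tutorials. Please read Basic Usage before this page.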

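About the Environment: Blackjack¶

Blackjack is one of the most popular casino card games, and it is well suited to learning reinforcement learning because it has:

Clear rules: get closer to 21 than the dealer without going over
Simple observations: your hand value, the dealer's showing card, usable ace
Discrete actions: hit (take a card) or stand (keep the current hand)
Immediate feedback: win, lose, or draw after each hand

This version uses an infinite deck (cards are drawn with replacement), so card counting does not help. The agent has to learn the optimal basic strategy through trial and error.

Environment details:

Observation: (player_sum, dealer_card, usable_ace)
- player_sum: current hand value (4-21)
- dealer_card: the dealer's face-up card (1-10)
- usable_ace: whether the player holds a usable ace (True/False)
Actions: 0 = stand, 1 = hit
Rewards: +1 for a win, -1 for a loss, 0 for a draw
Episode ends: when the player stands or busts (goes over 21)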
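The "usable ace" flag is the least obvious part of the observation. The sketch below shows the usual Blackjack rule: an ace counts as 11 whenever that does not bust the hand. `hand_value` is a hypothetical helper written for this illustration, not a Gymnasium function.

```python
def hand_value(cards):
    """Return (total, usable_ace) for a list of card values, with aces given as 1."""
    total = sum(cards)
    # An ace is "usable" if counting one ace as 11 (adding 10) does not bust the hand.
    usable_ace = 1 in cards and total + 10 <= 21
    return (total + 10 if usable_ace else total), usable_ace


print(hand_value([1, 6]))      # (17, True)
print(hand_value([1, 6, 9]))   # (16, False)
print(hand_value([10, 5, 8]))  # (23, False)
```

The second hand shows why the flag matters: after drawing a 9, the ace has to fall back to 1, so a hand with a usable ace can be hit without risk of busting on the next card.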

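Executing an Action¶

After receiving the first observation from env.reset(), we interact with the environment using env.step(action). This function takes an action and returns five values: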

observation, reward, terminated, truncated, info = env.step(action)

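observation: what the agent sees after taking the action (the new game state)
reward: immediate feedback for that action (+1, -1, or 0 in Blackjack)
terminated: whether the episode ended naturally (the hand is over)
truncated: whether the episode was cut short (time limit; not used in Blackjack)
info: additional debugging information (can usually be ignored)

The key insight is that reward tells us how good the immediate action was, but the agent needs to learn about long-term consequences. Q-learning handles this by estimating the total future reward, not just the immediate reward.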
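As a small illustration of immediate versus long-term reward, the sketch below computes a discounted return for one made-up hand. The reward sequence and the helper `discounted_return` are assumptions for this example; 0.95 matches the default discount factor used later on this page.

```python
def discounted_return(rewards, discount_factor):
    """Sum rewards, weighting the reward k steps ahead by discount_factor ** k."""
    total = 0.0
    for k, r in enumerate(rewards):
        total += (discount_factor ** k) * r
    return total


# A hand where the player hits twice (reward 0 each time), then stands and wins (+1).
rewards = [0.0, 0.0, 1.0]
print(rewards[0])                                   # 0.0
print(f"{discounted_return(rewards, 0.95):.4f}")    # 0.9025
```

The first hit earns a reward of 0, yet it was part of a winning hand. The discounted return of about 0.9 is the quantity a Q-value tries to estimate for that first action.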

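Building a Q-Learning Agent¶

Let's build our agent step by step. It needs to be able to:

Choose actions (balancing exploration and exploitation)
Learn from experience (update Q-values)
Manage exploration (reduce randomness over time)

Exploration vs Exploitation¶

This is a fundamental challenge in reinforcement learning:

Exploration: try new actions to learn about the environment
Exploitation: use current knowledge to get the best reward

We use an epsilon-greedy strategy:

With probability epsilon: choose a random action (explore)
With probability 1-epsilon: choose the best known action (exploit)

Starting with a high epsilon (lots of exploration) and gradually reducing it (more exploitation as the agent learns) works well in practice.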
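The epsilon-greedy rule can be sketched in a few lines of standard-library Python. This is an illustrative stand-in, not the agent class used on this page: `epsilon_greedy`, the seeded `rng`, and the Q-values are all assumptions made for the example.

```python
import random


def epsilon_greedy(q_values, epsilon, rng):
    """Pick a random action with probability epsilon, else the highest-valued one."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))                       # explore
    return max(range(len(q_values)), key=lambda a: q_values[a])   # exploit


rng = random.Random(0)     # fixed seed so the run is repeatable
q_values = [0.2, -0.3]     # made-up estimates: action 0 currently looks best

for epsilon in (1.0, 0.1, 0.0):
    picks = [epsilon_greedy(q_values, epsilon, rng) for _ in range(10_000)]
    share_greedy = picks.count(0) / len(picks)
    print(f"epsilon={epsilon}: greedy action chosen {share_greedy:.0%} of the time")
```

With two actions, the greedy action should be picked roughly half the time at epsilon=1.0, about 95% of the time at epsilon=0.1 (a random pick still lands on it half the time), and always at epsilon=0.0.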

from collections import defaultdict
import gymnasium as gym
import numpy as np


class BlackjackAgent:
    def __init__(
        self,
        env: gym.Env,
        learning_rate: float,
        initial_epsilon: float,
        epsilon_decay: float,
        final_epsilon: float,
        discount_factor: float = 0.95,
    ):
        """Initialize a Q-Learning agent.

        Args:
            env: The training environment
            learning_rate: How quickly to update Q-values (0-1)
            initial_epsilon: Starting exploration rate (usually 1.0)
            epsilon_decay: How much to reduce epsilon each episode
            final_epsilon: Minimum exploration rate (usually 0.1)
            discount_factor: How much to value future rewards (0-1)
        """
        self.env = env

        # Q-table: maps (state, action) to expected reward
        # defaultdict automatically creates entries with zeros for new states
        self.q_values = defaultdict(lambda: np.zeros(env.action_space.n))

        self.lr = learning_rate
        self.discount_factor = discount_factor  # How much we care about future rewards

        # Exploration parameters
        self.epsilon = initial_epsilon
        self.epsilon_decay = epsilon_decay
        self.final_epsilon = final_epsilon

        # Track learning progress
        self.training_error = []

    def get_action(self, obs: tuple[int, int, bool]) -> int:
        """Choose an action using epsilon-greedy strategy.

        Returns:
            action: 0 (stand) or 1 (hit)
        """
        # With probability epsilon: explore (random action)
        if np.random.random() < self.epsilon:
            return self.env.action_space.sample()

        # With probability (1-epsilon): exploit (best known action)
        else:
            return int(np.argmax(self.q_values[obs]))

    def update(
        self,
        obs: tuple[int, int, bool],
        action: int,
        reward: float,
        terminated: bool,
        next_obs: tuple[int, int, bool],
    ):
        """Update Q-value based on experience.

        This is the heart of Q-learning: learn from (state, action, reward, next_state)
        """
        # What's the best we could do from the next state?
        # (Zero if episode terminated - no future rewards possible)
        future_q_value = (not terminated) * np.max(self.q_values[next_obs])

        # What should the Q-value be? (Bellman equation)
        target = reward + self.discount_factor * future_q_value

        # How wrong was our current estimate?
        temporal_difference = target - self.q_values[obs][action]

        # Update our estimate in the direction of the error
        # Learning rate controls how big steps we take
        self.q_values[obs][action] = (
            self.q_values[obs][action] + self.lr * temporal_difference
        )

        # Track learning progress (useful for debugging)
        self.training_error.append(temporal_difference)

    def decay_epsilon(self):
        """Reduce exploration rate after each episode."""
        self.epsilon = max(self.final_epsilon, self.epsilon - self.epsilon_decay)

Understanding the Q-Learning Update¶

The core learning happens in the update method. Let's break down the math:

# Current estimate: Q(state, action)
current_q = self.q_values[obs][action]

# What we actually experienced: reward + discounted future value
# (the future value counts as zero once the episode has terminated)
future_q = (not terminated) * np.max(self.q_values[next_obs])
target = reward + self.discount_factor * future_q

# How wrong were we?
error = target - current_q

# Update estimate: move toward the target by a fraction given by the learning rate
new_q = current_q + self.lr * error

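This is the famous **Bellman equation** in action: it says that the value of a state-action pair should equal the immediate reward plus the discounted value of the best next action.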
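A worked example with concrete numbers may help. The sketch below applies one tabular Q-learning update as a standalone function. `q_update` is a hypothetical helper, and the input values are made up; the learning rate 0.01 and discount factor 0.95 match the settings used on this page.

```python
def q_update(current_q, reward, best_next_q, terminated, learning_rate, discount_factor):
    """Return the updated Q-value after one step of tabular Q-learning."""
    future_q = 0.0 if terminated else best_next_q   # no future value after the hand ends
    target = reward + discount_factor * future_q
    return current_q + learning_rate * (target - current_q)


# Non-terminal step: hit, reward 0, best next-state estimate 0.5
print(round(q_update(0.0, 0.0, 0.5, False, 0.01, 0.95), 6))   # 0.00475
# Terminal step: stand and win (+1); the next-state value is ignored
print(round(q_update(0.2, 1.0, 0.5, True, 0.01, 0.95), 6))    # 0.208
```

In the first call the target is 0.95 * 0.5 = 0.475, and the estimate moves 1% of the way there. In the second call the hand is over, so the target is just the reward of 1.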

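Training the Agent¶

Now let's train our agent. The process is:

Reset the environment to start a new episode
Play one complete hand (one episode), choosing actions and learning from each step
Update the exploration rate (reduce epsilon)
Repeat for many episodes until the agent learns a good strategy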

# Training hyperparameters
learning_rate = 0.01        # How fast to learn (higher = faster but less stable)
n_episodes = 100_000        # Number of hands to practice
start_epsilon = 1.0         # Start with 100% random actions
epsilon_decay = start_epsilon / (n_episodes / 2)  # Reduce exploration over time
final_epsilon = 0.1         # Always keep some exploration

# Create environment and agent
env = gym.make("Blackjack-v1", sab=False)
env = gym.wrappers.RecordEpisodeStatistics(env, buffer_length=n_episodes)

agent = BlackjackAgent(
    env=env,
    learning_rate=learning_rate,
    initial_epsilon=start_epsilon,
    epsilon_decay=epsilon_decay,
    final_epsilon=final_epsilon,
)
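To see what these hyperparameters imply for exploration, the sketch below evaluates the linear epsilon schedule in closed form. `epsilon_at` is a hypothetical helper; the agent itself subtracts `epsilon_decay` once per episode, which gives the same values up to floating-point rounding.

```python
def epsilon_at(episode, start_epsilon, epsilon_decay, final_epsilon):
    """Exploration rate after `episode` decay steps of a linear schedule with a floor."""
    return max(final_epsilon, start_epsilon - episode * epsilon_decay)


n_episodes = 100_000
start_epsilon, final_epsilon = 1.0, 0.1
epsilon_decay = start_epsilon / (n_episodes / 2)   # 2e-05 per episode

for episode in (0, 10_000, 25_000, 45_000, 100_000):
    eps = epsilon_at(episode, start_epsilon, epsilon_decay, final_epsilon)
    print(f"{episode:>7}: epsilon = {eps:.2f}")
```

This prints 1.00, 0.80, 0.50, 0.10 and 0.10. Because of the 0.1 floor, epsilon stops decreasing at about episode 45,000, slightly before the halfway point, and the agent keeps taking a random action on roughly one step in ten for the rest of training.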

The Training Loop¶

from tqdm import tqdm  # Progress bar

for episode in tqdm(range(n_episodes)):
    # Start a new hand
    obs, info = env.reset()
    done = False

    # Play one complete hand
    while not done:
        # Agent chooses action (initially random, gradually more intelligent)
        action = agent.get_action(obs)

        # Take action and observe result
        next_obs, reward, terminated, truncated, info = env.step(action)

        # Learn from this experience
        agent.update(obs, action, reward, terminated, next_obs)

        # Move to next state
        done = terminated or truncated
        obs = next_obs

    # Reduce exploration rate (agent becomes less random over time)
    agent.decay_epsilon()

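What to Expect During Training¶

Early episodes (0-10,000):

The agent acts mostly at random (high epsilon)
Win rate is low, well short of what a sound strategy achieves
Learning error is large because the Q-values are still very inaccurate

Middle episodes (10,000-50,000):

The agent starts to find good strategies
Win rate rises as the greedy actions improve
Learning error shrinks as the estimates improve

Later episodes (50,000+):

The agent converges to a near-optimal strategy
Win rate levels off around 42-45%; the dealer's edge keeps it below 50%
Learning error is small once the Q-values have stabilized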

Analyzing Training Results¶

Let's visualize the training progress:

from matplotlib import pyplot as plt

def get_moving_avgs(arr, window, convolution_mode):
    """Compute moving average to smooth noisy data."""
    return np.convolve(
        np.array(arr).flatten(),
        np.ones(window),
        mode=convolution_mode
    ) / window

# Smooth over a 500-episode window
rolling_length = 500
fig, axs = plt.subplots(ncols=3, figsize=(12, 5))

# Episode rewards (win/loss performance)
axs[0].set_title("Episode rewards")
reward_moving_average = get_moving_avgs(
    env.return_queue,
    rolling_length,
    "valid"
)
axs[0].plot(range(len(reward_moving_average)), reward_moving_average)
axs[0].set_ylabel("Average Reward")
axs[0].set_xlabel("Episode")

# Episode lengths (how many actions per hand)
axs[1].set_title("Episode lengths")
length_moving_average = get_moving_avgs(
    env.length_queue,
    rolling_length,
    "valid"
)
axs[1].plot(range(len(length_moving_average)), length_moving_average)
axs[1].set_ylabel("Average Episode Length")
axs[1].set_xlabel("Episode")

# Training error (how much we're still learning)
axs[2].set_title("Training Error")
training_error_moving_average = get_moving_avgs(
    agent.training_error,
    rolling_length,
    "same"
)
axs[2].plot(range(len(training_error_moving_average)), training_error_moving_average)
axs[2].set_ylabel("Temporal Difference Error")
axs[2].set_xlabel("Step")

plt.tight_layout()
plt.show()

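Interpreting the Results¶

Reward plot: should show a gradual improvement from clearly negative average rewards early on, when most actions are random, toward a small negative value. Blackjack is a hard game: even perfect play loses slightly on average because of the house edge.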

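Episode length: should settle at a low, stable value, typically between one and three actions per hand. Very short episodes suggest the agent is standing too early; long episodes suggest it is hitting too often.

Training error: should shrink over time, which indicates that the agent's predictions are becoming more accurate. Large spikes early in training are normal, since the agent keeps meeting new situations.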
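The plots rely on a moving average because per-hand rewards are just -1, 0, or +1 and look like noise when plotted raw. The sketch below is a pure-Python stand-in for a no-padding moving average (the behavior of `np.convolve` in "valid" mode with a uniform window); `moving_average_valid` and the reward list are assumptions made for the example.

```python
def moving_average_valid(values, window):
    """Average every full window of `window` consecutive values (no padding)."""
    return [
        sum(values[i:i + window]) / window
        for i in range(len(values) - window + 1)
    ]


# Per-hand rewards are noisy: win, loss, loss, draw, win, loss
rewards = [1, -1, -1, 0, 1, -1]
smoothed = moving_average_valid(rewards, window=3)
print(len(smoothed))            # 4, i.e. len(rewards) - window + 1
print(round(smoothed[1], 3))    # -0.667, the mean of -1, -1, 0
```

Without padding the output is shorter than the input by window - 1 points. The "same" mode used for the training-error panel keeps the original length by padding with zeros instead, so the two ends of that curve are pulled toward zero and should not be over-interpreted.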

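Common Training Issues and Solutions¶

🚨 Agent Never Improves¶

Symptoms: reward stays constant, training error stays large
Causes: learning rate too high or too low, poor reward design, bugs in the update logic
Solutions:

Try learning rates between 0.001 and 0.1
Check that the rewards are meaningful (-1, 0, +1 for Blackjack)
Verify that the Q-table is actually being updated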
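One rough way to check that the Q-table is being updated is to count how many states it holds and how many entries have moved away from their initial zero. The sketch below uses a stand-in table built from plain lists; `q_table_summary` is a hypothetical helper, and it should also work on the agent's `q_values` since it only iterates over the stored values.

```python
from collections import defaultdict


def q_table_summary(q_values):
    """Count visited states and how many (state, action) entries are no longer zero."""
    n_states = len(q_values)
    n_nonzero = sum(1 for values in q_values.values() for v in values if v != 0.0)
    return n_states, n_nonzero


# Stand-in table using plain lists; the tutorial's agent stores NumPy arrays instead.
q_values = defaultdict(lambda: [0.0, 0.0])
q_values[(20, 10, False)][0] += 0.01 * (1.0 - 0.0)   # one update after a win
_ = q_values[(12, 10, False)]                         # looked up, never updated

print(q_table_summary(q_values))   # (2, 1)
```

If both numbers stay at zero, or the second stops growing very early in training, the update call is probably not being reached or is writing to the wrong key.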

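🚨 Unstable Training¶

Symptoms: rewards fluctuate wildly and never converge
Causes: learning rate too high, insufficient exploration
Solutions:

Reduce the learning rate (try 0.01 instead of 0.1)
Ensure a minimum level of exploration (final_epsilon ≥ 0.05)
Train for more episodes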

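🚨 Agent Gets Stuck in a Poor Strategy¶

Symptoms: improvement stops early, final performance is suboptimal
Causes: too little exploration, learning rate too low
Solutions:

Increase exploration time (slower epsilon decay)
Try a higher learning rate initially
Use a different exploration strategy (optimistic initialization)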
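Optimistic initialization can be sketched as a one-line change to how the Q-table is created: unseen entries start at an optimistic value instead of zero, so untried actions look attractive until they have been sampled. The helpers `make_q_table` and `greedy`, the initial value 1.0, and the learning rate 0.1 are assumptions for this illustration.

```python
from collections import defaultdict


def make_q_table(n_actions, initial_value=0.0):
    """Q-table whose unseen states start with `initial_value` for every action."""
    return defaultdict(lambda: [initial_value] * n_actions)


def greedy(values):
    """Index of the highest value (ties go to the lowest index)."""
    return max(range(len(values)), key=lambda a: values[a])


q = make_q_table(n_actions=2, initial_value=1.0)   # +1 is the best reward in Blackjack
state = (15, 10, False)

# Suppose action 0 (stand) is tried once and loses: its estimate drops below 1.0 ...
q[state][0] += 0.1 * (-1.0 - q[state][0])
print(q[state])          # [0.8, 1.0]
# ... so the greedy choice now moves to the action that has not been tried yet.
print(greedy(q[state]))  # 1
```

With zero initialization the same loss would leave the untried action at 0 and the tried one at -0.1, which also prefers the untried action here; the optimistic start matters most when an early lucky win would otherwise lock the agent onto one action.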

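🚨 Learning Is Too Slow¶

Symptoms: the agent improves, but very slowly
Causes: learning rate too low, too much exploration
Solutions:

Increase the learning rate (but watch for instability)
Decay epsilon faster (less random exploration)
Focus training on difficult states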

Testing the Trained Agent¶

Once training is complete, test the agent's performance:

# Test the trained agent
def test_agent(agent, env, num_episodes=1000):
    """Test agent performance without learning or exploration."""
    total_rewards = []

    # Temporarily disable exploration for testing
    old_epsilon = agent.epsilon
    agent.epsilon = 0.0  # Pure exploitation

    for _ in range(num_episodes):
        obs, info = env.reset()
        episode_reward = 0
        done = False

        while not done:
            action = agent.get_action(obs)
            obs, reward, terminated, truncated, info = env.step(action)
            episode_reward += reward
            done = terminated or truncated

        total_rewards.append(episode_reward)

    # Restore original epsilon
    agent.epsilon = old_epsilon

    win_rate = np.mean(np.array(total_rewards) > 0)
    average_reward = np.mean(total_rewards)

    print(f"Test Results over {num_episodes} episodes:")
    print(f"Win Rate: {win_rate:.1%}")
    print(f"Average Reward: {average_reward:.3f}")
    print(f"Standard Deviation: {np.std(total_rewards):.3f}")

# Test your agent
test_agent(agent, env)

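Good Blackjack performance:

Win rate: 42-45% (the house edge makes >50% impossible)
Average reward: around -0.02 to +0.01 (with 1,000 test hands the estimate has a standard error of roughly 0.03, so differences of this size are within noise)
Consistency: rewards are only -1, 0, or +1, so the per-hand standard deviation stays close to 1 for any strategy; judge reliability by how stable the average is across repeated test runs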
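To judge how much a test result can be trusted, it helps to attach a standard error to the average reward. The sketch below does this with the standard library; the win/draw/loss counts are made up for illustration and `mean_and_standard_error` is a hypothetical helper.

```python
import math
import statistics


def mean_and_standard_error(rewards):
    """Return the sample mean and the standard error of that mean."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return mean, std / math.sqrt(len(rewards))


# Made-up test outcome over 1,000 hands: 430 wins, 90 draws, 480 losses
rewards = [1.0] * 430 + [0.0] * 90 + [-1.0] * 480
mean, sem = mean_and_standard_error(rewards)
print(f"average reward = {mean:.3f} +/- {sem:.3f}")   # average reward = -0.050 +/- 0.030
```

A standard error of 0.03 means two agents whose 1,000-hand averages differ by 0.02 cannot be told apart. Raising `num_episodes` to 100,000 would shrink the standard error by a factor of 10.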

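Next Steps¶

Congratulations! You have trained your first reinforcement learning agent. Here is what to explore next:

Try other environments: CartPole, MountainCar, LunarLander
Experiment with hyperparameters: learning rates, exploration strategies
Implement other algorithms: SARSA, Expected SARSA, Monte Carlo methods
Add function approximation: neural networks for larger state spaces
Create custom environments: design your own reinforcement learning problems

For more information, see:

Basic Usage - understand the Gymnasium fundamentals
Custom Environments - build your own reinforcement learning problems
Complete Training Tutorials - more algorithms and environments
Recording Agent Behavior - save videos and performance data

The key takeaway of this tutorial is that a reinforcement learning agent learns by trial and error, gradually building up knowledge about which actions work best in which situations. Q-learning provides a systematic way to learn this knowledge while balancing exploration of new possibilities against exploitation of what is already known.

Keep experimenting, and remember that reinforcement learning is as much art as science: finding the right hyperparameters and environment design often takes patience and intuition!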