Recording Agents

Why Record Your Agent?

Recording agent behavior serves several important purposes in reinforcement learning development:

🎥 Visual understanding: See exactly what your agent is doing; sometimes a 10-second video reveals problems that hours of staring at reward plots cannot.

📊 Performance tracking: Collect systematic data on episode rewards, durations, and timing to understand training progress.

🐛 Debugging: Identify specific failure modes, unusual behaviors, or environments where your agent struggles.

📈 Evaluation: Objectively compare different training runs, algorithms, or hyperparameters.

🎓 Communication: Share results with collaborators, include them in papers, or create educational content.

When to Record

During evaluation (record every episode)

  • Testing a trained agent's final performance

  • Creating demonstration videos

  • Detailed analysis of specific behaviors

During training (record periodically)

  • Monitoring learning progress over time

  • Catching training problems early

  • Creating time-lapse videos of learning

Gymnasium provides two essential wrappers for recording: RecordEpisodeStatistics for numerical data and RecordVideo for video recording. The former tracks episode metrics such as total reward, episode length, and elapsed time. The latter uses the environment's rendering to generate MP4 videos of agent behavior.

We will show how to use these wrappers in two common scenarios: recording data for every episode (typically during evaluation) and recording data periodically (during training).

Recording Every Episode (Evaluation)

When evaluating a trained agent, you typically want to record several episodes to understand average performance and consistency. Here is how to set this up with RecordEpisodeStatistics and RecordVideo.

import gymnasium as gym
from gymnasium.wrappers import RecordEpisodeStatistics, RecordVideo
import numpy as np

# Configuration
num_eval_episodes = 4
env_name = "CartPole-v1"  # Replace with your environment

# Create environment with recording capabilities
env = gym.make(env_name, render_mode="rgb_array")  # rgb_array needed for video recording

# Add video recording for every episode
env = RecordVideo(
    env,
    video_folder="cartpole-agent",    # Folder to save videos
    name_prefix="eval",               # Prefix for video filenames
    episode_trigger=lambda x: True    # Record every episode
)

# Add episode statistics tracking
env = RecordEpisodeStatistics(env, buffer_length=num_eval_episodes)

print(f"Starting evaluation for {num_eval_episodes} episodes...")
print(f"Videos will be saved to: cartpole-agent/")

for episode_num in range(num_eval_episodes):
    obs, info = env.reset()
    episode_reward = 0
    step_count = 0

    episode_over = False
    while not episode_over:
        # Replace this with your trained agent's policy
        action = env.action_space.sample()  # Random policy for demonstration

        obs, reward, terminated, truncated, info = env.step(action)
        episode_reward += reward
        step_count += 1

        episode_over = terminated or truncated

    print(f"Episode {episode_num + 1}: {step_count} steps, reward = {episode_reward}")

env.close()

# Print summary statistics
print(f'\nEvaluation Summary:')
print(f'Episode durations: {list(env.time_queue)}')
print(f'Episode rewards: {list(env.return_queue)}')
print(f'Episode lengths: {list(env.length_queue)}')

# Calculate some useful metrics
avg_reward = np.mean(env.return_queue)
avg_length = np.mean(env.length_queue)
std_reward = np.std(env.return_queue)

print(f'\nAverage reward: {avg_reward:.2f} ± {std_reward:.2f}')
print(f'Average episode length: {avg_length:.1f} steps')
# Note: "r > 0" is a placeholder success criterion; adapt it to your task
print(f'Success rate: {sum(1 for r in env.return_queue if r > 0) / len(env.return_queue):.1%}')

Understanding the Output

After running this code, you will find:

Video files: cartpole-agent/eval-episode-0.mp4, eval-episode-1.mp4, and so on.

  • Each file shows one complete episode from start to finish

  • Helpful for seeing exactly how your agent behaves

  • Can be shared, embedded in presentations, or analyzed frame by frame

Console output: per-episode performance plus summary statistics

Episode 1: 23 steps, reward = 23.0
Episode 2: 15 steps, reward = 15.0
Episode 3: 200 steps, reward = 200.0
Episode 4: 67 steps, reward = 67.0

Average reward: 76.25 ± 78.29
Average episode length: 76.2 steps
Success rate: 100.0%

Statistics queues: timing, reward, and length data for each episode

  • env.time_queue: elapsed wall-clock time per episode

  • env.return_queue: total reward per episode

  • env.length_queue: number of steps per episode

In the script above, the RecordVideo wrapper saves videos to the specified folder with filenames like "eval-episode-0.mp4". episode_trigger=lambda x: True ensures that every episode is recorded.

The RecordEpisodeStatistics wrapper tracks performance metrics in internal queues, which we can access after evaluation to compute averages and other statistics.

For computational efficiency during evaluation, you can use vector environments to evaluate N episodes in parallel rather than sequentially.

Recording During Training (Periodic)

During training, you will run hundreds or thousands of episodes, so recording every one is impractical. Instead, record periodically to track learning progress.

import logging
import gymnasium as gym
from gymnasium.wrappers import RecordEpisodeStatistics, RecordVideo

# Training configuration
training_period = 250           # Record video every 250 episodes
num_training_episodes = 10_000  # Total training episodes
env_name = "CartPole-v1"

# Set up logging for episode statistics
logging.basicConfig(level=logging.INFO, format='%(message)s')

# Create environment with periodic video recording
env = gym.make(env_name, render_mode="rgb_array")

# Record videos periodically (every 250 episodes)
env = RecordVideo(
    env,
    video_folder="cartpole-training",
    name_prefix="training",
    episode_trigger=lambda x: x % training_period == 0  # Only record every 250th episode
)

# Track statistics for every episode (lightweight)
env = RecordEpisodeStatistics(env)

print(f"Starting training for {num_training_episodes} episodes")
print(f"Videos will be recorded every {training_period} episodes")
print(f"Videos saved to: cartpole-training/")

for episode_num in range(num_training_episodes):
    obs, info = env.reset()
    episode_over = False

    while not episode_over:
        # Replace with your actual training agent
        action = env.action_space.sample()  # Random policy for demonstration
        obs, reward, terminated, truncated, info = env.step(action)
        episode_over = terminated or truncated

    # Log episode statistics (available in info after episode ends)
    if "episode" in info:
        episode_data = info["episode"]
        logging.info(f"Episode {episode_num}: "
                    f"reward={episode_data['r']:.1f}, "
                    f"length={episode_data['l']}, "
                    f"time={episode_data['t']:.2f}s")

        # Additional analysis for milestone episodes
        if episode_num % 1000 == 0:
            # Look at recent performance (last 100 episodes)
            recent_rewards = list(env.return_queue)[-100:]
            if recent_rewards:
                avg_recent = sum(recent_rewards) / len(recent_rewards)
                print(f"  -> Average reward over last 100 episodes: {avg_recent:.1f}")

env.close()

Benefits of Training Recording

Progress videos: watch your agent improve over time

  • training-episode-0.mp4: random initial behavior

  • training-episode-250.mp4: some patterns emerging

  • training-episode-500.mp4: clear improvement

  • training-episode-1000.mp4: strong performance

Learning curves: plot per-episode statistics over time

import matplotlib.pyplot as plt

# Plot learning progress
episodes = range(len(env.return_queue))
rewards = list(env.return_queue)

plt.figure(figsize=(10, 6))
plt.plot(episodes, rewards, alpha=0.3, label='Episode Rewards')

# Add moving average for clearer trend
window = 100
if len(rewards) > window:
    moving_avg = [sum(rewards[i:i+window])/window
                  for i in range(len(rewards)-window+1)]
    plt.plot(range(window-1, len(rewards)), moving_avg,
             label=f'{window}-Episode Moving Average', linewidth=2)

plt.xlabel('Episode')
plt.ylabel('Reward')
plt.title('Learning Progress')
plt.legend()
plt.grid(True, alpha=0.3)
plt.show()

Integration with Experiment Tracking

For more sophisticated projects, integrate with experiment-tracking tools.

# Example with Weights & Biases (wandb)
import os

import wandb

# Initialize experiment tracking
wandb.init(project="cartpole-training", name="q-learning-run-1")

# Log episode statistics
for episode_num in range(num_training_episodes):
    # ... training code ...

    if "episode" in info:
        episode_data = info["episode"]
        wandb.log({
            "episode": episode_num,
            "reward": episode_data['r'],
            "length": episode_data['l'],
            "episode_time": episode_data['t']
        })

        # Upload videos periodically
        if episode_num % training_period == 0:
            video_path = f"cartpole-training/training-episode-{episode_num}.mp4"
            if os.path.exists(video_path):
                wandb.log({"training_video": wandb.Video(video_path)})

Best Practices Summary

For evaluation:

  • Record every episode for a complete performance picture

  • Use multiple seeds for statistical significance

  • Save both videos and numerical data

  • Compute confidence intervals for your metrics
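The confidence-interval point can be sketched with plain NumPy; mean_confidence_interval is a hypothetical helper, and z=1.96 assumes a normal approximation for the mean return (reasonable for roughly 30+ episodes):

```python
import numpy as np

def mean_confidence_interval(returns, z=1.96):
    """Approximate 95% confidence interval for the mean episode return."""
    returns = np.asarray(returns, dtype=float)
    mean = returns.mean()
    # Standard error of the mean: sample std / sqrt(n)
    sem = returns.std(ddof=1) / np.sqrt(len(returns))
    return mean, mean - z * sem, mean + z * sem

# Example with the four episode rewards from the evaluation run above
mean, ci_lo, ci_hi = mean_confidence_interval([23.0, 15.0, 200.0, 67.0])
print(f"{mean:.2f} (95% CI: [{ci_lo:.2f}, {ci_hi:.2f}])")
```

With only four episodes the interval is very wide, which is itself a useful signal that more evaluation episodes (or seeds) are needed.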

For training:

  • Record periodically (every 100-1000 episodes)

  • Focus on episode statistics rather than videos during training

  • Use adaptive recording triggers for interesting episodes

  • Monitor memory usage during long training runs
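One possible shape for an adaptive trigger is a callable that records whenever the previous episode set a new best return. MilestoneTrigger is a hypothetical helper, not part of Gymnasium; since RecordVideo only passes the episode index to episode_trigger, episode returns must be fed to the trigger from your training loop:

```python
class MilestoneTrigger:
    """Record an episode whenever the previous episode set a new best return."""

    def __init__(self):
        self.best_return = float("-inf")
        self.record_next = True  # always record the first episode

    def observe(self, episode_return):
        # Call after each episode with its total reward
        if episode_return > self.best_return:
            self.best_return = episode_return
            self.record_next = True

    def __call__(self, episode_index):
        # Used as: RecordVideo(env, ..., episode_trigger=trigger)
        should_record, self.record_next = self.record_next, False
        return should_record
```

In the training loop, call trigger.observe(info["episode"]["r"]) after each finished episode so the next milestone episode is captured on video.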

For analysis:

  • Create moving averages to smooth noisy learning curves

  • Look for patterns across successful and failed episodes

  • Compare agent behavior at different stages of training

  • Save raw data for later analysis and comparison

Further Information

Recording agent behavior is an essential skill for reinforcement learning practitioners. It helps you understand what your agent has actually learned, debug training problems, and communicate results effectively. Start with a simple recording setup and add more sophisticated analysis as your project's complexity grows!