Recording Agents

Why Record Your Agent?

Recording agent behavior serves several important purposes in reinforcement learning development:

🎥 Visual understanding: See exactly what your agent is doing; sometimes a 10-second video reveals problems that hours of staring at reward plots cannot.

📊 Performance tracking: Collect systematic data on episode rewards, durations, and timing to understand training progress.

🐛 Debugging: Identify specific failure modes, unusual behaviors, or environments where your agent struggles.

📈 Evaluation: Objectively compare different training runs, algorithms, or hyperparameters.

🎓 Communication: Share results with collaborators, include them in papers, or create educational content.

When to Record

During evaluation (record every episode)

  • Testing a trained agent's final performance

  • Creating demonstration videos

  • Detailed analysis of specific behaviors

During training (record periodically)

  • Monitoring learning progress over time

  • Catching training problems early

  • Creating time-lapse videos of learning

Gymnasium provides two essential wrappers for recording: RecordEpisodeStatistics for numerical data and RecordVideo for video recording. The former tracks episode metrics such as total reward, episode length, and elapsed time. The latter uses the environment's rendering to generate MP4 videos of agent behavior.

We will show how to use these wrappers in two common scenarios: recording data for every episode (typically during evaluation) and recording data periodically (during training).

Recording Every Episode (Evaluation)

When evaluating a trained agent, you typically want to record several episodes to understand average performance and consistency. Here is how to set this up with RecordEpisodeStatistics and RecordVideo.

import gymnasium as gym
from gymnasium.wrappers import RecordEpisodeStatistics, RecordVideo
import numpy as np

# Configuration
num_eval_episodes = 4
env_name = "CartPole-v1"  # Replace with your environment

# Create environment with recording capabilities
env = gym.make(env_name, render_mode="rgb_array")  # rgb_array needed for video recording

# Add video recording for every episode
env = RecordVideo(
    env,
    video_folder="cartpole-agent",    # Folder to save videos
    name_prefix="eval",               # Prefix for video filenames
    episode_trigger=lambda x: True    # Record every episode
)

# Add episode statistics tracking
env = RecordEpisodeStatistics(env, buffer_length=num_eval_episodes)

print(f"Starting evaluation for {num_eval_episodes} episodes...")
print(f"Videos will be saved to: cartpole-agent/")

for episode_num in range(num_eval_episodes):
    obs, info = env.reset()
    episode_reward = 0
    step_count = 0

    episode_over = False
    while not episode_over:
        # Replace this with your trained agent's policy
        action = env.action_space.sample()  # Random policy for demonstration

        obs, reward, terminated, truncated, info = env.step(action)
        episode_reward += reward
        step_count += 1

        episode_over = terminated or truncated

    print(f"Episode {episode_num + 1}: {step_count} steps, reward = {episode_reward}")

env.close()

# Print summary statistics
print(f'\nEvaluation Summary:')
print(f'Episode durations: {list(env.time_queue)}')
print(f'Episode rewards: {list(env.return_queue)}')
print(f'Episode lengths: {list(env.length_queue)}')

# Calculate some useful metrics
avg_reward = np.mean(env.return_queue)
avg_length = np.mean(env.length_queue)
std_reward = np.std(env.return_queue)

print(f'\nAverage reward: {avg_reward:.2f} ± {std_reward:.2f}')
print(f'Average episode length: {avg_length:.1f} steps')
# Note: "r > 0" is a placeholder success criterion; adapt it to your task
print(f'Success rate: {sum(1 for r in env.return_queue if r > 0) / len(env.return_queue):.1%}')

Understanding the Output

After running this code, you will find:

Video files: cartpole-agent/eval-episode-0.mp4, eval-episode-1.mp4, and so on.

  • Each file shows one complete episode from start to finish

  • Helpful for seeing exactly how your agent behaves

  • Can be shared, embedded in presentations, or analyzed frame by frame

Console output: per-episode performance plus summary statistics

Episode 1: 23 steps, reward = 23.0
Episode 2: 15 steps, reward = 15.0
Episode 3: 200 steps, reward = 200.0
Episode 4: 67 steps, reward = 67.0

Average reward: 76.25 ± 78.29
Average episode length: 76.2 steps
Success rate: 100.0%

Statistics queues: timing, reward, and length data for each episode

  • env.time_queue: elapsed wall-clock time per episode

  • env.return_queue: total reward per episode

  • env.length_queue: number of steps per episode

In the script above, the RecordVideo wrapper saves videos to the specified folder with filenames like "eval-episode-0.mp4". episode_trigger=lambda x: True ensures that every episode is recorded.

The RecordEpisodeStatistics wrapper tracks performance metrics in internal queues, which we can access after evaluation to compute averages and other statistics.

For computational efficiency during evaluation, you can use vector environments to evaluate N episodes in parallel rather than sequentially.

Recording During Training (Periodic)

During training, you will run hundreds or thousands of episodes, so recording every one is impractical. Instead, record periodically to track learning progress.

import logging
import gymnasium as gym
from gymnasium.wrappers import RecordEpisodeStatistics, RecordVideo

# Training configuration
training_period = 250           # Record video every 250 episodes
num_training_episodes = 10_000  # Total training episodes
env_name = "CartPole-v1"

# Set up logging for episode statistics
logging.basicConfig(level=logging.INFO, format='%(message)s')

# Create environment with periodic video recording
env = gym.make(env_name, render_mode="rgb_array")

# Record videos periodically (every 250 episodes)
env = RecordVideo(
    env,
    video_folder="cartpole-training",
    name_prefix="training",
    episode_trigger=lambda x: x % training_period == 0  # Only record every 250th episode
)

# Track statistics for every episode (lightweight)
env = RecordEpisodeStatistics(env)

print(f"Starting training for {num_training_episodes} episodes")
print(f"Videos will be recorded every {training_period} episodes")
print(f"Videos saved to: cartpole-training/")

for episode_num in range(num_training_episodes):
    obs, info = env.reset()
    episode_over = False

    while not episode_over:
        # Replace with your actual training agent
        action = env.action_space.sample()  # Random policy for demonstration
        obs, reward, terminated, truncated, info = env.step(action)
        episode_over = terminated or truncated

    # Log episode statistics (available in info after episode ends)
    if "episode" in info:
        episode_data = info["episode"]
        logging.info(f"Episode {episode_num}: "
                    f"reward={episode_data['r']:.1f}, "
                    f"length={episode_data['l']}, "
                    f"time={episode_data['t']:.2f}s")

        # Additional analysis for milestone episodes
        if episode_num % 1000 == 0:
            # Look at recent performance (last 100 episodes)
            recent_rewards = list(env.return_queue)[-100:]
            if recent_rewards:
                avg_recent = sum(recent_rewards) / len(recent_rewards)
                print(f"  -> Average reward over last 100 episodes: {avg_recent:.1f}")

env.close()

Benefits of Training Recording

Progress videos: watch your agent improve over time

  • training-episode-0.mp4: random initial behavior

  • training-episode-250.mp4: some patterns emerging

  • training-episode-500.mp4: clear improvement

  • training-episode-1000.mp4: strong performance

Learning curves: plot per-episode statistics over time

import matplotlib.pyplot as plt

# Plot learning progress
episodes = range(len(env.return_queue))
rewards = list(env.return_queue)

plt.figure(figsize=(10, 6))
plt.plot(episodes, rewards, alpha=0.3, label='Episode Rewards')

# Add moving average for clearer trend
window = 100
if len(rewards) > window:
    moving_avg = [sum(rewards[i:i+window])/window
                  for i in range(len(rewards)-window+1)]
    plt.plot(range(window-1, len(rewards)), moving_avg,
             label=f'{window}-Episode Moving Average', linewidth=2)

plt.xlabel('Episode')
plt.ylabel('Reward')
plt.title('Learning Progress')
plt.legend()
plt.grid(True, alpha=0.3)
plt.show()

Integration with Experiment Tracking

For more sophisticated projects, integrate with experiment-tracking tools.

# Example with Weights & Biases (wandb)
import os

import wandb

# Initialize experiment tracking
wandb.init(project="cartpole-training", name="q-learning-run-1")

# Log episode statistics
for episode_num in range(num_training_episodes):
    # ... training code ...

    if "episode" in info:
        episode_data = info["episode"]
        wandb.log({
            "episode": episode_num,
            "reward": episode_data['r'],
            "length": episode_data['l'],
            "episode_time": episode_data['t']
        })

        # Upload videos periodically
        if episode_num % training_period == 0:
            video_path = f"cartpole-training/training-episode-{episode_num}.mp4"
            if os.path.exists(video_path):
                wandb.log({"training_video": wandb.Video(video_path)})

Best Practices Summary

For evaluation:

  • Record every episode for a complete performance picture

  • Use multiple seeds for statistical significance

  • Save both videos and numerical data

  • Compute confidence intervals for your metrics
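The confidence-interval point can be sketched with plain NumPy; mean_confidence_interval is a hypothetical helper, and z=1.96 assumes a normal approximation for the mean return (reasonable for roughly 30+ episodes):

```python
import numpy as np

def mean_confidence_interval(returns, z=1.96):
    """Approximate 95% confidence interval for the mean episode return."""
    returns = np.asarray(returns, dtype=float)
    mean = returns.mean()
    # Standard error of the mean: sample std / sqrt(n)
    sem = returns.std(ddof=1) / np.sqrt(len(returns))
    return mean, mean - z * sem, mean + z * sem

# Example with the four episode rewards from the evaluation run above
mean, ci_lo, ci_hi = mean_confidence_interval([23.0, 15.0, 200.0, 67.0])
print(f"{mean:.2f} (95% CI: [{ci_lo:.2f}, {ci_hi:.2f}])")
```

With only four episodes the interval is very wide, which is itself a useful signal that more evaluation episodes (or seeds) are needed.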

For training:

  • Record periodically (every 100-1000 episodes)

  • Focus on episode statistics rather than videos during training

  • Use adaptive recording triggers for interesting episodes

  • Monitor memory usage during long training runs
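One possible shape for an adaptive trigger is a callable that records whenever the previous episode set a new best return. MilestoneTrigger is a hypothetical helper, not part of Gymnasium; since RecordVideo only passes the episode index to episode_trigger, episode returns must be fed to the trigger from your training loop:

```python
class MilestoneTrigger:
    """Record an episode whenever the previous episode set a new best return."""

    def __init__(self):
        self.best_return = float("-inf")
        self.record_next = True  # always record the first episode

    def observe(self, episode_return):
        # Call after each episode with its total reward
        if episode_return > self.best_return:
            self.best_return = episode_return
            self.record_next = True

    def __call__(self, episode_index):
        # Used as: RecordVideo(env, ..., episode_trigger=trigger)
        should_record, self.record_next = self.record_next, False
        return should_record
```

In the training loop, call trigger.observe(info["episode"]["r"]) after each finished episode so the next milestone episode is captured on video.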

For analysis:

  • Create moving averages to smooth noisy learning curves

  • Look for patterns across successful and failed episodes

  • Compare agent behavior at different stages of training

  • Save raw data for later analysis and comparison

Further Information

Recording agent behavior is an essential skill for reinforcement learning practitioners. It helps you understand what your agent has actually learned, debug training problems, and communicate results effectively. Start with a simple recording setup and add more sophisticated analysis as your project's complexity grows!