| # Trackio Integration for TRL Training | |
| **Trackio** is an experiment tracking library that provides real-time metrics visualization for remote training on Hugging Face Jobs infrastructure. | |
| ⚠️ **IMPORTANT**: For Jobs training (remote cloud GPUs): | |
| - Training happens on ephemeral cloud runners (not your local machine) | |
| - Trackio syncs metrics to a Hugging Face Space for real-time monitoring | |
| - Without a Space, metrics are lost when the job completes | |
| - The Space dashboard persists your training metrics permanently | |
| ## Setting Up Trackio for Jobs | |
| **Step 1: Add trackio dependency** | |
| ```python | |
| # /// script | |
| # dependencies = [ | |
| # "trl>=0.12.0", | |
| # "trackio", # Required! | |
| # ] | |
| # /// | |
| ``` | |
| **Step 2: Create a Trackio Space (one-time setup)** | |
| **Option A: Let Trackio auto-create (Recommended)** | |
| Pass a `space_id` to `trackio.init()` and Trackio will automatically create the Space if it doesn't exist. | |
| **Option B: Create manually** | |
| - Create Space via Hub UI at https://huggingface.co/new-space | |
| - Select Gradio SDK | |
| - OR use command: `huggingface-cli repo create my-trackio-dashboard --type space --space_sdk gradio` | |
| **Step 3: Initialize Trackio with space_id** | |
| ```python | |
| import trackio | |
| trackio.init( | |
|     project="my-training", | |
|     space_id="username/trackio",  # CRITICAL for Jobs! Replace 'username' with your HF username | |
|     config={ | |
|         "model": "Qwen/Qwen2.5-0.5B", | |
|         "dataset": "trl-lib/Capybara", | |
|         "learning_rate": 2e-5, | |
|     }, | |
| ) | |
| ``` | |
| **Step 4: Configure TRL to use Trackio** | |
| ```python | |
| from trl import SFTConfig | |
| training_args = SFTConfig( | |
|     report_to="trackio",  # send training metrics to Trackio | |
|     # ... other config | |
| ) | |
| ``` | |
| **Step 5: Finish tracking** | |
| ```python | |
| trainer.train() | |
| trackio.finish() # Ensures final metrics are synced | |
| ``` | |
| ## What Trackio Tracks | |
| Trackio automatically logs: | |
| - ✅ Training loss | |
| - ✅ Learning rate | |
| - ✅ GPU utilization | |
| - ✅ Memory usage | |
| - ✅ Training throughput | |
| - ✅ Custom metrics | |
| ## How It Works with Jobs | |
| 1. **Training runs** → Metrics are logged to a local SQLite DB | |
| 2. **Every 5 minutes** → Trackio syncs the DB to an HF Dataset (Parquet) | |
| 3. **Space dashboard** → Reads from the Dataset and displays metrics in real time | |
| 4. **Job completes** → A final sync ensures all metrics are persisted | |
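The flow above can be pictured with a small stand-in. This sketch is not Trackio's implementation: the `metrics` table, `log_metric`, and `export_if_due` are hypothetical names, and the upload to a Hub dataset is replaced by simply returning the rows. It assumes only the behavior described in the list: local SQLite writes, a periodic export (300 seconds here, matching the 5-minute interval), and a forced export at the end.

```python
import sqlite3

SYNC_INTERVAL_S = 300  # 5 minutes, per the description above


def open_db(path=":memory:"):
    """Create a local metrics store (hypothetical schema, for illustration)."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS metrics (step INTEGER, name TEXT, value REAL)"
    )
    return conn


def log_metric(conn, step, name, value):
    """Step 1: every logged metric is written to the local database."""
    conn.execute("INSERT INTO metrics VALUES (?, ?, ?)", (step, name, value))
    conn.commit()


def export_if_due(conn, now, last_sync, force=False):
    """Steps 2 and 4: export all rows once the interval has elapsed,
    or unconditionally when force=True (the final sync at job end).

    Returns (rows, new_last_sync); rows is None when no sync was due.
    A real implementation would upload the rows instead of returning them.
    """
    if not force and now - last_sync < SYNC_INTERVAL_S:
        return None, last_sync
    rows = conn.execute(
        "SELECT step, name, value FROM metrics ORDER BY step"
    ).fetchall()
    return rows, now


conn = open_db()
log_metric(conn, 1, "loss", 2.5)
log_metric(conn, 2, "loss", 2.1)

early, last = export_if_due(conn, now=60, last_sync=0)    # too soon, no sync
due, last = export_if_due(conn, now=300, last_sync=last)  # interval elapsed
final, last = export_if_due(conn, now=310, last_sync=last, force=True)
```

In this sketch the `force=True` call plays the role of the final sync: anything logged after the last periodic export would otherwise stay on the ephemeral runner.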
| ## Default Configuration Pattern | |
| **Use sensible defaults for trackio configuration unless user requests otherwise.** | |
| ### Recommended Defaults | |
| ```python | |
| import trackio | |
| trackio.init( | |
|     project="qwen-capybara-sft", | |
|     name="baseline-run",  # Descriptive name user will recognize | |
|     space_id="username/trackio",  # Default space: {username}/trackio | |
|     config={ | |
|         # Keep config minimal - hyperparameters and model/dataset info only | |
|         "model": "Qwen/Qwen2.5-0.5B", | |
|         "dataset": "trl-lib/Capybara", | |
|         "learning_rate": 2e-5, | |
|         "num_epochs": 3, | |
|     }, | |
| ) | |
| ``` | |
| **Key principles:** | |
| - **Space ID**: Use `{username}/trackio` with "trackio" as default space name | |
| - **Run naming**: Unless otherwise specified, name the run in a way the user will recognize | |
| - **Config**: Keep minimal - don't automatically capture job metadata unless requested | |
| - **Grouping**: Optional - only use if user requests organizing related experiments | |
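One way to apply these principles consistently is a small helper that builds the keyword arguments for `trackio.init()`. The helper name `default_trackio_kwargs` and its parameters are hypothetical; the keys `project`, `name`, `space_id`, `group`, and `config` are taken from the examples in this guide.

```python
def default_trackio_kwargs(username, project, run_name, hyperparams, group=None):
    """Build trackio.init() keyword arguments following the defaults above.

    Hypothetical helper for illustration; it does not import or call trackio.
    """
    kwargs = {
        "project": project,
        "name": run_name,                   # a name the user will recognize
        "space_id": f"{username}/trackio",  # default Space: {username}/trackio
        "config": dict(hyperparams),        # minimal: hyperparameters only
    }
    if group is not None:                   # grouping only on request
        kwargs["group"] = group
    return kwargs


kwargs = default_trackio_kwargs(
    username="username",
    project="qwen-capybara-sft",
    run_name="baseline-run",
    hyperparams={"model": "Qwen/Qwen2.5-0.5B", "learning_rate": 2e-5},
)
# In a training script: trackio.init(**kwargs)
```

Keeping `group` out of the dictionary unless it is requested mirrors the last principle: grouping stays opt-in.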
| ## Grouping Runs (Optional) | |
| The `group` parameter organizes related runs together in the dashboard sidebar. This is useful when the user runs multiple experiments with different configurations and wants to compare them side by side: | |
| ```python | |
| # Example: Group runs by experiment type | |
| trackio.init(project="my-project", name="baseline-run-1", group="baseline") | |
| trackio.init(project="my-project", name="augmented-run-1", group="augmented") | |
| trackio.init(project="my-project", name="tuned-run-1", group="tuned") | |
| ``` | |
| Runs with the same group name can be grouped together in the sidebar, making it easier to compare related experiments. You can group by any configuration parameter: | |
| ```python | |
| # Hyperparameter sweep - group by learning rate | |
| trackio.init(project="hyperparam-sweep", name="lr-0.001-run", group="lr_0.001") | |
| trackio.init(project="hyperparam-sweep", name="lr-0.01-run", group="lr_0.01") | |
| ``` | |
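For a sweep, the run names and groups can be generated rather than typed by hand. This is a minimal sketch that assumes the naming scheme from the example above; `sweep_runs` is a hypothetical helper, and the `name` keyword follows the Recommended Defaults example.

```python
def sweep_runs(project, learning_rates):
    """Return one dict of trackio.init() keyword arguments per learning rate.

    Hypothetical helper: it only builds the arguments and never calls trackio.
    """
    return [
        {
            "project": project,
            "name": f"lr-{lr}-run",
            "group": f"lr_{lr}",
            "config": {"learning_rate": lr},
        }
        for lr in learning_rates
    ]


runs = sweep_runs("hyperparam-sweep", [0.001, 0.01])
for run in runs:
    print(run["name"], run["group"])
    # In a training script: trackio.init(**run), train, then trackio.finish()
```

Note that Python formats very small floats in scientific notation (for example `2e-5` becomes `2e-05`), so format the value explicitly if the names need a fixed style.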
| ## Environment Variables for Jobs | |
| You can configure trackio using environment variables instead of passing parameters to `trackio.init()`. This is useful for managing configuration across multiple jobs. | |
| **`HF_TOKEN`** | |
| Required for creating Spaces and writing to datasets (passed via `secrets`): | |
| ```python | |
| hf_jobs("uv", { | |
|     "script": "...", | |
|     "secrets": { | |
|         "HF_TOKEN": "$HF_TOKEN"  # Enables Space creation and Hub push | |
|     } | |
| }) | |
| ``` | |
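A training script can check for the token before any training starts, so a misconfigured job fails immediately with a clear message. This is a minimal sketch; `require_hf_token` is a hypothetical helper, not part of Trackio or `huggingface_hub`.

```python
import os


def require_hf_token(env=None):
    """Return the HF_TOKEN value or raise a clear error if it is missing."""
    env = os.environ if env is None else env
    token = env.get("HF_TOKEN", "").strip()
    if not token:
        raise RuntimeError(
            "HF_TOKEN is not set; pass it to the job via `secrets`."
        )
    return token


# Demonstration with explicit mappings instead of the real environment.
token = require_hf_token({"HF_TOKEN": "hf_example"})
try:
    require_hf_token({})
    missing_detected = False
except RuntimeError:
    missing_detected = True
```

In a real script, calling `require_hf_token()` with no argument reads from `os.environ`.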
| ### Example with Environment Variables | |
| ```python | |
| hf_jobs("uv", { | |
|     "script": """ | |
| # Training script - trackio config from environment | |
| import trackio | |
| from datetime import datetime | |
| # Auto-generate run name | |
| timestamp = datetime.now().strftime("%Y-%m-%d_%H-%M") | |
| run_name = f"sft_qwen25_{timestamp}" | |
| # Project and space_id can come from environment variables | |
| trackio.init(name=run_name, group="SFT") | |
| # ... training code ... | |
| trackio.finish() | |
| """, | |
|     "flavor": "a10g-large", | |
|     "timeout": "2h", | |
|     "secrets": {"HF_TOKEN": "$HF_TOKEN"} | |
| }) | |
| ``` | |
| **When to use environment variables:** | |
| - Managing multiple jobs with same configuration | |
| - Keeping training scripts portable across projects | |
| - Separating configuration from code | |
| **When to use direct parameters:** | |
| - Single job with specific configuration | |
| - When clarity in code is preferred | |
| - When each job has different project/space | |
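The two approaches can be combined by letting explicit arguments override the environment. This sketch assumes the variable names `TRACKIO_PROJECT` and `TRACKIO_SPACE_ID`, which this guide does not name; check the Trackio and TRL documentation for the variables your installed versions actually read. `resolve_tracking_config` is a hypothetical helper.

```python
import os


def resolve_tracking_config(project=None, space_id=None, env=None):
    """Pick project and space_id: explicit argument first, then environment.

    TRACKIO_PROJECT and TRACKIO_SPACE_ID are assumed variable names.
    """
    env = os.environ if env is None else env
    resolved = {
        "project": project or env.get("TRACKIO_PROJECT"),
        "space_id": space_id or env.get("TRACKIO_SPACE_ID"),
    }
    missing = [key for key, value in resolved.items() if not value]
    if missing:
        raise ValueError(f"Missing tracking settings: {', '.join(missing)}")
    return resolved


job_env = {"TRACKIO_PROJECT": "my-training", "TRACKIO_SPACE_ID": "username/trackio"}
from_env = resolve_tracking_config(env=job_env)
overridden = resolve_tracking_config(project="ablation", env=job_env)
```

The resulting dictionary can be unpacked into `trackio.init(**resolved, name=...)`, which keeps shared settings in the environment while still allowing a single job to override them.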
| ## Viewing the Dashboard | |
| After starting training: | |
| 1. Navigate to the Space: `https://huggingface.co/spaces/username/trackio` | |
| 2. The Gradio dashboard shows all tracked experiments | |
| 3. Filter by project, compare runs, view charts with smoothing | |
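The dashboard URL follows directly from the `space_id`, so a script can print it at startup for easy access from the job logs. A minimal sketch; `dashboard_url` is a hypothetical helper.

```python
def dashboard_url(space_id):
    """Build the Space URL for a space_id of the form 'owner/name'."""
    owner, sep, name = space_id.partition("/")
    if not (owner and sep and name) or "/" in name:
        raise ValueError(f"Expected 'owner/name', got {space_id!r}")
    return f"https://huggingface.co/spaces/{space_id}"


url = dashboard_url("username/trackio")
print(f"Track training at: {url}")
```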
| ## Recommendation | |
| - **Trackio**: Best for real-time monitoring during long training runs | |
| - **Weights & Biases**: Best for team collaboration, requires account | |