# Trackio Integration for TRL Training
**Trackio** is an experiment tracking library that provides real-time metrics visualization for remote training on Hugging Face Jobs infrastructure.
⚠️ **IMPORTANT**: For Jobs training (remote cloud GPUs):
- Training happens on ephemeral cloud runners (not your local machine)
- Trackio syncs metrics to a Hugging Face Space for real-time monitoring
- Without a Space, metrics are lost when the job completes
- The Space dashboard persists your training metrics permanently
## Setting Up Trackio for Jobs
**Step 1: Add trackio dependency**
```python
# /// script
# dependencies = [
#     "trl>=0.12.0",
#     "trackio",  # Required!
# ]
# ///
```
**Step 2: Create a Trackio Space (one-time setup)**
**Option A: Let Trackio auto-create (Recommended)**
Pass a `space_id` to `trackio.init()` and Trackio will automatically create the Space if it doesn't exist.
**Option B: Create manually**
- Create Space via Hub UI at https://huggingface.co/new-space
- Select Gradio SDK
- Or use the command: `huggingface-cli repo create my-trackio-dashboard --type space --space_sdk gradio`
**Step 3: Initialize Trackio with space_id**
```python
import trackio

trackio.init(
    project="my-training",
    space_id="username/trackio",  # CRITICAL for Jobs! Replace 'username' with your HF username
    config={
        "model": "Qwen/Qwen2.5-0.5B",
        "dataset": "trl-lib/Capybara",
        "learning_rate": 2e-5,
    },
)
```
**Step 4: Configure TRL to use Trackio**
```python
from trl import SFTConfig

training_args = SFTConfig(
    report_to="trackio",  # Send training metrics to Trackio
    # ... other config
)
```
**Step 5: Finish tracking**
```python
trainer.train()
trackio.finish() # Ensures final metrics are synced
```
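If training raises partway through, the `trackio.finish()` call in Step 5 is never reached. One way to guard against that is a `try`/`finally` wrapper. The sketch below is only an illustration: `run_with_tracking` is a hypothetical helper, not part of Trackio or TRL, and it accepts plain callables so it can be tried without either library installed.

```python
from typing import Callable


def run_with_tracking(train: Callable[[], None], finish: Callable[[], None]) -> None:
    """Run `train`, then always call `finish`, even if `train` raises."""
    try:
        train()
    finally:
        # Executes on success and on failure, so the final sync is still attempted.
        finish()


if __name__ == "__main__":
    calls = []
    run_with_tracking(lambda: calls.append("train"), lambda: calls.append("finish"))
    print(calls)
```

In a training script, the call would presumably look like `run_with_tracking(trainer.train, trackio.finish)`.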
## What Trackio Tracks
Trackio automatically logs:
- ✅ Training loss
- ✅ Learning rate
- ✅ GPU utilization
- ✅ Memory usage
- ✅ Training throughput
- ✅ Custom metrics
## How It Works with Jobs
1. **Training runs** → Metrics are logged to a local SQLite DB
2. **Every 5 minutes** → Trackio syncs the DB to an HF Dataset (Parquet)
3. **Space dashboard** → Reads from the Dataset and displays metrics in real time
4. **Job completes** → A final sync ensures all metrics are persisted
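The buffer-locally-then-sync flow above can be illustrated with the standard library alone. This is a conceptual sketch, not Trackio's actual implementation: the `metrics` table layout and the functions `open_db`, `log_metric`, and `export_unsynced` are hypothetical, and the export returns plain dictionaries where Trackio would write Parquet to a Dataset.

```python
import sqlite3


def open_db(path: str = ":memory:") -> sqlite3.Connection:
    """Open the local buffer and create the (hypothetical) metrics table."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS metrics ("
        "id INTEGER PRIMARY KEY AUTOINCREMENT, "
        "step INTEGER, name TEXT, value REAL, synced INTEGER DEFAULT 0)"
    )
    return conn


def log_metric(conn: sqlite3.Connection, step: int, name: str, value: float) -> None:
    """Append one metric row to the local buffer."""
    conn.execute(
        "INSERT INTO metrics (step, name, value) VALUES (?, ?, ?)",
        (step, name, value),
    )
    conn.commit()


def export_unsynced(conn: sqlite3.Connection) -> list[dict]:
    """Return rows not yet exported and mark them as synced."""
    rows = conn.execute(
        "SELECT id, step, name, value FROM metrics WHERE synced = 0 ORDER BY id"
    ).fetchall()
    if rows:
        conn.executemany(
            "UPDATE metrics SET synced = 1 WHERE id = ?", [(r[0],) for r in rows]
        )
        conn.commit()
    return [{"step": s, "name": n, "value": v} for _, s, n, v in rows]
```

A periodic task would call `export_unsynced` on a timer and once more when the job ends, which is why only rows logged since the previous export are returned.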
## Default Configuration Pattern
**Use sensible defaults for Trackio configuration unless the user requests otherwise.**
### Recommended Defaults
```python
import trackio

trackio.init(
    project="qwen-capybara-sft",
    name="baseline-run",  # Descriptive name user will recognize
    space_id="username/trackio",  # Default space: {username}/trackio
    config={
        # Keep config minimal - hyperparameters and model/dataset info only
        "model": "Qwen/Qwen2.5-0.5B",
        "dataset": "trl-lib/Capybara",
        "learning_rate": 2e-5,
        "num_epochs": 3,
    },
)
```
**Key principles:**
- **Space ID**: Use `{username}/trackio` with "trackio" as default space name
- **Run naming**: Unless otherwise specified, name the run in a way the user will recognize
- **Config**: Keep minimal - don't automatically capture job metadata unless requested
- **Grouping**: Optional - only use if user requests organizing related experiments
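These principles can be captured in a small helper that assembles the arguments in one place. The sketch below is an assumption-laden illustration: `default_trackio_kwargs` is a hypothetical function, and its keys simply mirror the arguments used in the Recommended Defaults example.

```python
def default_trackio_kwargs(
    username: str,
    project: str,
    run_name: str,
    model: str,
    dataset: str,
    learning_rate: float,
    num_epochs: int,
) -> dict:
    """Build init arguments following the defaults described above."""
    return {
        "project": project,
        "name": run_name,
        # Default Space: {username}/trackio
        "space_id": f"{username}/trackio",
        # Minimal config: hyperparameters and model/dataset info only
        "config": {
            "model": model,
            "dataset": dataset,
            "learning_rate": learning_rate,
            "num_epochs": num_epochs,
        },
    }


if __name__ == "__main__":
    kwargs = default_trackio_kwargs(
        "username", "qwen-capybara-sft", "baseline-run",
        "Qwen/Qwen2.5-0.5B", "trl-lib/Capybara", 2e-5, 3,
    )
    print(kwargs["space_id"])
```

The resulting dictionary could then be unpacked into the call, for example `trackio.init(**kwargs)`. No `group` key is set, in line with grouping being optional.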
## Grouping Runs (Optional)
The `group` parameter organizes related runs together in the dashboard sidebar. This is useful when the user is running multiple experiments with different configurations and wants to compare them side by side:
```python
# Example: Group runs by experiment type
trackio.init(project="my-project", name="baseline-run-1", group="baseline")
trackio.init(project="my-project", name="augmented-run-1", group="augmented")
trackio.init(project="my-project", name="tuned-run-1", group="tuned")
```
Runs that share a group name appear together in the sidebar, making it easier to compare related experiments. You can group by any configuration parameter:
```python
# Hyperparameter sweep - group by learning rate
trackio.init(project="hyperparam-sweep", name="lr-0.001-run", group="lr_0.001")
trackio.init(project="hyperparam-sweep", name="lr-0.01-run", group="lr_0.01")
```
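For larger sweeps, the run and group names can be derived from the swept value instead of being typed by hand. The following sketch assumes the naming scheme from the example above; `sweep_labels` is a hypothetical helper that only builds strings and does not call Trackio.

```python
def sweep_labels(learning_rates: list[float]) -> list[tuple[str, str]]:
    """Return (run name, group name) pairs, one per learning rate."""
    return [(f"lr-{lr}-run", f"lr_{lr}") for lr in learning_rates]


if __name__ == "__main__":
    for run_label, group_label in sweep_labels([0.001, 0.01]):
        print(run_label, group_label)
```

Each pair would then be passed to `trackio.init()` along with the project. Note that Python formats very small floats in scientific notation (for example `1e-05`), so the labels may need custom formatting for such values.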
## Environment Variables for Jobs
You can configure trackio using environment variables instead of passing parameters to `trackio.init()`. This is useful for managing configuration across multiple jobs.
**`HF_TOKEN`**
Required for creating Spaces and writing to datasets (passed via `secrets`):
```python
hf_jobs("uv", {
    "script": "...",
    "secrets": {
        "HF_TOKEN": "$HF_TOKEN"  # Enables Space creation and Hub push
    }
})
```
### Example with Environment Variables
```python
hf_jobs("uv", {
    "script": """
# Training script - trackio config from environment
import trackio
from datetime import datetime

# Auto-generate run name
timestamp = datetime.now().strftime("%Y-%m-%d_%H-%M")
run_name = f"sft_qwen25_{timestamp}"

# Project and space_id can come from environment variables
trackio.init(name=run_name, group="SFT")

# ... training code ...

trackio.finish()
""",
    "flavor": "a10g-large",
    "timeout": "2h",
    "secrets": {"HF_TOKEN": "$HF_TOKEN"}
})
```
**When to use environment variables:**
- Managing multiple jobs with the same configuration
- Keeping training scripts portable across projects
- Separating configuration from code
**When to use direct parameters:**
- Single job with specific configuration
- When clarity in code is preferred
- When each job has different project/space
## Viewing the Dashboard
After starting training:
1. Navigate to the Space: `https://huggingface.co/spaces/username/trackio`
2. The Gradio dashboard shows all tracked experiments
3. Filter by project, compare runs, view charts with smoothing
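Since a job's logs are often the first place someone looks, printing the dashboard link at the start of training can save a lookup. The sketch below derives the URL from the `space_id` using the pattern shown in step 1; `dashboard_url` is a hypothetical helper.

```python
def dashboard_url(space_id: str) -> str:
    """Build the Space URL from an 'owner/name' space_id."""
    owner, sep, name = space_id.partition("/")
    if not sep or not owner or not name or "/" in name:
        raise ValueError(f"Expected 'owner/name', got {space_id!r}")
    return f"https://huggingface.co/spaces/{owner}/{name}"


if __name__ == "__main__":
    print(dashboard_url("username/trackio"))
```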
## Recommendation
- **Trackio**: Best for real-time monitoring during long training runs
- **Weights & Biases**: Best for team collaboration, requires account