Spaces:

hf-skills
/

llm-trainer

Running

App Files Files Community

llm-trainer / references /trackio_guide.md

burtenshaw HF Staff

Upload folder using huggingface_hub

6ab17a7 verified 11 days ago

preview code

raw

history blame contribute delete

5.8 kB

	# Trackio Integration for TRL Training

	Trackio is an experiment tracking library that provides real-time metrics visualization for remote training on Hugging Face Jobs infrastructure.

	⚠️ IMPORTANT: For Jobs training (remote cloud GPUs):
	- Training happens on ephemeral cloud runners (not your local machine)
	- Trackio syncs metrics to a Hugging Face Space for real-time monitoring
	- Without a Space, metrics are lost when the job completes
	- The Space dashboard persists your training metrics permanently

	## Setting Up Trackio for Jobs

	Step 1: Add trackio dependency
	```python
	# /// script
	# dependencies = [
	# "trl>=0.12.0",
	# "trackio", # Required!
	# ]
	# ///
	```

	Step 2: Create a Trackio Space (one-time setup)

	Option A: Let Trackio auto-create (Recommended)
	Pass a `space_id` to `trackio.init()` and Trackio will automatically create the Space if it doesn't exist.

	Option B: Create manually
	- Create Space via Hub UI at https://huggingface.co/new-space
	- Select Gradio SDK
	- OR use command: `huggingface-cli repo create my-trackio-dashboard --type space --space_sdk gradio`

	Step 3: Initialize Trackio with space_id
	```python
	import trackio

	trackio.init(
	project="my-training",
	space_id="username/trackio", # CRITICAL for Jobs! Replace 'username' with your HF username
	config={
	"model": "Qwen/Qwen2.5-0.5B",
	"dataset": "trl-lib/Capybara",
	"learning_rate": 2e-5,
	}
	)
	```

	Step 4: Configure TRL to use Trackio
	```python
	SFTConfig(
	report_to="trackio",
	# ... other config
	)
	```

	Step 5: Finish tracking
	```python
	trainer.train()
	trackio.finish() # Ensures final metrics are synced
	```

	## What Trackio Tracks

	Trackio automatically logs:
	- ✅ Training loss
	- ✅ Learning rate
	- ✅ GPU utilization
	- ✅ Memory usage
	- ✅ Training throughput
	- ✅ Custom metrics

	## How It Works with Jobs

	1. Training runs → Metrics logged to local SQLite DB
	2. Every 5 minutes → Trackio syncs DB to HF Dataset (Parquet)
	3. Space dashboard → Reads from Dataset, displays metrics in real-time
	4. Job completes → Final sync ensures all metrics persisted

	## Default Configuration Pattern

	Use sensible defaults for trackio configuration unless user requests otherwise.

	### Recommended Defaults

	```python
	import trackio

	trackio.init(
	project="qwen-capybara-sft",
	name="baseline-run", # Descriptive name user will recognize
	space_id="username/trackio", # Default space: {username}/trackio
	config={
	# Keep config minimal - hyperparameters and model/dataset info only
	"model": "Qwen/Qwen2.5-0.5B",
	"dataset": "trl-lib/Capybara",
	"learning_rate": 2e-5,
	"num_epochs": 3,
	}
	)
	```

	Key principles:
	- Space ID: Use `{username}/trackio` with "trackio" as default space name
	- Run naming: Unless otherwise specified, name the run in a way the user will recognize
	- Config: Keep minimal - don't automatically capture job metadata unless requested
	- Grouping: Optional - only use if user requests organizing related experiments

	## Grouping Runs (Optional)

	The `group` parameter helps organize related runs together in the dashboard sidebar. This is useful when user is running multiple experiments with different configurations but wants to compare them together:

	```python
	# Example: Group runs by experiment type
	trackio.init(project="my-project", run_name="baseline-run-1", group="baseline")
	trackio.init(project="my-project", run_name="augmented-run-1", group="augmented")
	trackio.init(project="my-project", run_name="tuned-run-1", group="tuned")
	```

	Runs with the same group name can be grouped together in the sidebar, making it easier to compare related experiments. You can group by any configuration parameter:

	```python
	# Hyperparameter sweep - group by learning rate
	trackio.init(project="hyperparam-sweep", run_name="lr-0.001-run", group="lr_0.001")
	trackio.init(project="hyperparam-sweep", run_name="lr-0.01-run", group="lr_0.01")
	```

	## Environment Variables for Jobs

	You can configure trackio using environment variables instead of passing parameters to `trackio.init()`. This is useful for managing configuration across multiple jobs.



	`HF_TOKEN`
	Required for creating Spaces and writing to datasets (passed via `secrets`):
	```python
	hf_jobs("uv", {
	"script": "...",
	"secrets": {
	"HF_TOKEN": "$HF_TOKEN" # Enables Space creation and Hub push
	}
	})
	```

	### Example with Environment Variables

	```python
	hf_jobs("uv", {
	"script": """
	# Training script - trackio config from environment
	import trackio
	from datetime import datetime

	# Auto-generate run name
	timestamp = datetime.now().strftime("%Y-%m-%d_%H-%M")
	run_name = f"sft_qwen25_{timestamp}"

	# Project and space_id can come from environment variables
	trackio.init(run_name=run_name, group="SFT")

	# ... training code ...
	trackio.finish()
	""",
	"flavor": "a10g-large",
	"timeout": "2h",
	"secrets": {"HF_TOKEN": "$HF_TOKEN"}
	})
	```

	When to use environment variables:
	- Managing multiple jobs with same configuration
	- Keeping training scripts portable across projects
	- Separating configuration from code

	When to use direct parameters:
	- Single job with specific configuration
	- When clarity in code is preferred
	- When each job has different project/space

	## Viewing the Dashboard

	After starting training:
	1. Navigate to the Space: `https://huggingface.co/spaces/username/trackio`
	2. The Gradio dashboard shows all tracked experiments
	3. Filter by project, compare runs, view charts with smoothing

	## Recommendation

	- Trackio: Best for real-time monitoring during long training runs
	- Weights & Biases: Best for team collaboration, requires account