# Saving Training Results to Hugging Face Hub
**⚠️ CRITICAL:** Training environments are ephemeral. ALL results are lost when a job completes unless pushed to the Hub.
## Why Hub Push is Required
When running on Hugging Face Jobs:
- Environment is temporary
- All files deleted on job completion
- No local disk persistence
- Cannot access results after job ends
**Without Hub push, training is completely wasted.**
## Required Configuration
### 1. Training Configuration
In your SFTConfig or trainer config:
```python
SFTConfig(
    push_to_hub=True,                    # Enable Hub push
    hub_model_id="username/model-name",  # Target repository
)
```
### 2. Job Configuration
When submitting the job:
```python
hf_jobs("uv", {
    "script": "train.py",
    "secrets": {"HF_TOKEN": "$HF_TOKEN"},  # Provide authentication
})
```
**The `$HF_TOKEN` placeholder is automatically replaced with your Hugging Face token.**
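Inside the script itself, the secret surfaces as a plain environment variable. A minimal sanity check (the variable name matches the key used in `secrets`):

```python
import os

# Fail fast if the token was not injected into the job environment
assert os.environ.get("HF_TOKEN"), "HF_TOKEN missing - check the job's secrets config"
```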
## Complete Example
```python
# train.py
# /// script
# dependencies = ["trl"]
# ///
from trl import SFTTrainer, SFTConfig
from datasets import load_dataset

dataset = load_dataset("trl-lib/Capybara", split="train")

# Configure with Hub push
config = SFTConfig(
    output_dir="my-model",
    num_train_epochs=3,
    # ✅ CRITICAL: Hub push configuration
    push_to_hub=True,
    hub_model_id="myusername/my-trained-model",
    # Optional: when to push ("end", "every_save", "checkpoint", "all_checkpoints")
    hub_strategy="every_save",
)

trainer = SFTTrainer(
    model="Qwen/Qwen2.5-0.5B",
    train_dataset=dataset,
    args=config,
)
trainer.train()

# ✅ Push final model
trainer.push_to_hub()
print("✅ Model saved to: https://huggingface.co/myusername/my-trained-model")
```
**Submit with authentication:**
```python
hf_jobs("uv", {
    "script": "train.py",
    "flavor": "a10g-large",
    "timeout": "2h",
    "secrets": {"HF_TOKEN": "$HF_TOKEN"},  # ✅ Required!
})
```
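Once the job finishes and the push completes, you can verify the result by loading the model straight from the Hub. A minimal sketch (the repo name matches the `hub_model_id` above):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the trained model directly from the Hub repository
model = AutoModelForCausalLM.from_pretrained("myusername/my-trained-model")
tokenizer = AutoTokenizer.from_pretrained("myusername/my-trained-model")
```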
## What Gets Saved
When `push_to_hub=True`:
1. **Model weights** - Final trained parameters
2. **Tokenizer** - Associated tokenizer
3. **Configuration** - Model config (config.json)
4. **Training arguments** - Hyperparameters used
5. **Model card** - Auto-generated documentation
6. **Checkpoints** - When checkpointing is enabled (e.g. `save_strategy="steps"`); exactly what is pushed depends on `hub_strategy`
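To confirm these artifacts actually landed, you can list the repository contents. A small sketch using `huggingface_hub` (the repo name is a placeholder):

```python
from huggingface_hub import list_repo_files

# Print every file the trainer pushed to the Hub repo
for path in list_repo_files("username/model-name"):
    print(path)
```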
## Checkpoint Saving
Save intermediate checkpoints during training:
```python
SFTConfig(
    output_dir="my-model",
    push_to_hub=True,
    hub_model_id="username/my-model",
    # Checkpoint configuration
    save_strategy="steps",
    save_steps=100,       # Save every 100 steps
    save_total_limit=3,   # Keep only last 3 checkpoints
)
```
**Benefits:**
- Resume training if the job fails (see the resume sketch below)
- Compare checkpoint performance
- Use intermediate models

**Where checkpoints go:** with the default `hub_strategy="every_save"`, every save is pushed to `username/my-model` (the same repo); `hub_strategy="checkpoint"` additionally keeps a resumable copy in a `last-checkpoint` folder.
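To pick up a failed run, point `trainer.train()` at the saved state. A minimal sketch, assuming a `trainer` constructed from the config above as in the Complete Example:

```python
# Resume from the newest checkpoint found in output_dir (raises if none exists)
trainer.train(resume_from_checkpoint=True)

# Or, if hub_strategy="checkpoint" pushed a resumable copy to the repo,
# name that folder explicitly:
# trainer.train(resume_from_checkpoint="last-checkpoint")
```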
## Authentication Methods
### Method 1: Automatic Token (Recommended)
```python
"secrets": {"HF_TOKEN": "$HF_TOKEN"}
```
Uses your logged-in Hugging Face token automatically.
### Method 2: Explicit Token
```python
"secrets": {"HF_TOKEN": "hf_abc123..."}
```
Provide token explicitly (not recommended for security).
### Method 3: Environment Variable
```python
"env": {"HF_TOKEN": "hf_abc123..."}
```
Pass as regular environment variable (less secure than secrets).
**Always prefer Method 1** for security and convenience.
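To check which account a given token resolves to (locally or inside a job script), `huggingface_hub.whoami` works. A small sketch:

```python
import os

from huggingface_hub import whoami

# Resolve the identity behind the token; raises if the token is invalid
info = whoami(token=os.environ.get("HF_TOKEN"))
print(f"Pushing as: {info['name']}")
```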
## Verification Checklist
Before submitting any training job, verify:
- [ ] `push_to_hub=True` in training config
- [ ] `hub_model_id` is specified (format: `username/model-name`)
- [ ] `secrets={"HF_TOKEN": "$HF_TOKEN"}` in job config
- [ ] Repository name doesn't conflict with existing repos
- [ ] You have write access to the target namespace
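A small pre-flight script can cover the last two checklist items before you spend GPU time. A sketch using `huggingface_hub` (the repo id is a placeholder):

```python
from huggingface_hub import HfApi

api = HfApi()
repo_id = "username/model-name"  # placeholder: your target repo

# An existing repo isn't an error, but the push will add commits to it
if api.repo_exists(repo_id):
    print(f"Note: {repo_id} already exists; training will push new commits to it.")
```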
## Repository Setup
### Automatic Creation
If the repository doesn't exist, it's created automatically on the first push.
### Manual Creation
Create repository before training:
```python
from huggingface_hub import HfApi

api = HfApi()
api.create_repo(
    repo_id="username/model-name",
    repo_type="model",
    private=False,  # or True for a private repo
)
```
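If the repository may already exist, pass `exist_ok=True` to `create_repo` so the call doesn't raise on a second run.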
### Repository Naming
**Valid names:**
- `username/my-model`
- `username/model-name`
- `organization/model-name`
**Invalid or discouraged names:**
- `model-name` (missing username)
- `username/model name` (spaces not allowed)
- `username/MODEL` (accepted, but uppercase is discouraged)
## Troubleshooting
### Error: 401 Unauthorized
**Cause:** HF_TOKEN not provided or invalid
**Solutions:**
1. Verify `secrets={"HF_TOKEN": "$HF_TOKEN"}` in job config
2. Check you're logged in: `huggingface-cli whoami`
3. Re-login: `huggingface-cli login`
### Error: 403 Forbidden
**Cause:** No write access to repository
**Solutions:**
1. Check repository namespace matches your username
2. Verify you're a member of organization (if using org namespace)
3. For org repos, check the repository settings actually grant your role write access
### Error: Repository not found
**Cause:** Repository doesn't exist and auto-creation failed
**Solutions:**
1. Manually create repository first
2. Check repository name format
3. Verify namespace exists
### Error: Push failed during training
**Cause:** Network issues or Hub unavailable
**What happens:**
1. Training itself continues; only the push fails
2. Checkpoints pushed earlier may already be on the Hub
**Solution:** re-run the push manually while the job environment still exists (see below)
### Issue: Model saved but not visible
**Possible causes:**
1. Repository is private—check https://huggingface.co/username
2. Wrong namespace—verify `hub_model_id` matches login
3. Push still in progress—wait a few minutes
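To confirm the repo is actually reachable under your token, `model_info` gives a quick programmatic check. A small sketch (the repo name is a placeholder):

```python
from huggingface_hub import model_info

# Raises RepositoryNotFoundError if the repo is missing or not visible to you
info = model_info("username/model-name")
print(info.last_modified)
```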
## Manual Push After Training
If training completes but push fails, push manually:
```python
import os

from transformers import AutoModelForCausalLM, AutoTokenizer

# Load from the local checkpoint directory
model = AutoModelForCausalLM.from_pretrained("./output_dir")
tokenizer = AutoTokenizer.from_pretrained("./output_dir")

# Push to Hub, reading the token from the environment
token = os.environ["HF_TOKEN"]
model.push_to_hub("username/model-name", token=token)
tokenizer.push_to_hub("username/model-name", token=token)
```
**Note:** Only possible if job hasn't completed (files still exist).
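Alternatively, `upload_folder` pushes the raw output directory (checkpoint folders included) without loading the model into memory. A hedged sketch with placeholder paths:

```python
from huggingface_hub import HfApi

api = HfApi()
# Uploads everything under the local output directory as-is
api.upload_folder(
    folder_path="./output_dir",
    repo_id="username/model-name",
    repo_type="model",
)
```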
## Best Practices
1. **Always enable `push_to_hub=True`**
2. **Use checkpoint saving** for long training runs
3. **Verify Hub push** in logs before job completes
4. **Set appropriate `save_total_limit`** to avoid excessive checkpoints
5. **Use descriptive repo names** (e.g., `qwen-capybara-sft` not `model1`)
6. **Add model card** with training details
7. **Tag models** with relevant tags (e.g., `text-generation`, `fine-tuned`)
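Tags can also be added after the fact by updating the model card metadata. A minimal sketch with `metadata_update` (the repo name and tag values are examples):

```python
from huggingface_hub import metadata_update

# Merge extra tags into the repo's model card metadata
metadata_update(
    "username/model-name",
    {"tags": ["text-generation", "fine-tuned"]},
)
```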
## Monitoring Push Progress
Check logs for push progress:
```python
hf_jobs("logs", {"job_id": "your-job-id"})
```
**Look for:**
```
Pushing model to username/model-name...
Upload file pytorch_model.bin: 100%
✅ Model pushed successfully
```
## Example: Full Production Setup
```python
# production_train.py
# /// script
# dependencies = ["trl>=0.12.0", "peft>=0.7.0"]
# ///
import os

from datasets import load_dataset
from peft import LoraConfig
from trl import SFTTrainer, SFTConfig

# Verify token is available
assert "HF_TOKEN" in os.environ, "HF_TOKEN not found in environment!"

# Load dataset
dataset = load_dataset("trl-lib/Capybara", split="train")
print(f"✅ Dataset loaded: {len(dataset)} examples")

# Configure with comprehensive Hub settings
config = SFTConfig(
    output_dir="qwen-capybara-sft",
    # Hub configuration
    push_to_hub=True,
    hub_model_id="myusername/qwen-capybara-sft",
    hub_strategy="checkpoint",  # Also push a resumable last-checkpoint folder
    # Checkpoint configuration
    save_strategy="steps",
    save_steps=100,
    save_total_limit=3,
    # Training settings
    num_train_epochs=3,
    per_device_train_batch_size=4,
    # Logging
    logging_steps=10,
    logging_first_step=True,
)

# Train with LoRA
trainer = SFTTrainer(
    model="Qwen/Qwen2.5-0.5B",
    train_dataset=dataset,
    args=config,
    peft_config=LoraConfig(r=16, lora_alpha=32),
)

print("🚀 Starting training...")
trainer.train()

print("💾 Pushing final model to Hub...")
trainer.push_to_hub()

print("✅ Training complete!")
print("Model available at: https://huggingface.co/myusername/qwen-capybara-sft")
```
**Submit:**
```python
hf_jobs("uv", {
    "script": "production_train.py",
    "flavor": "a10g-large",
    "timeout": "6h",
    "secrets": {"HF_TOKEN": "$HF_TOKEN"},
})
```
## Key Takeaway
**Without `push_to_hub=True` and `secrets={"HF_TOKEN": "$HF_TOKEN"}`, all training results are permanently lost.**
Always verify both are configured before submitting any training job.