# Saving Training Results to Hugging Face Hub
**⚠️ CRITICAL:** Training environments are ephemeral. ALL results are lost when a job completes unless pushed to the Hub.
## Why Hub Push is Required
When running on Hugging Face Jobs:
- Environment is temporary
- All files deleted on job completion
- No local disk persistence
- Cannot access results after job ends
**Without Hub push, training is completely wasted.**
## Required Configuration
### 1. Training Configuration
In your SFTConfig or trainer config:
```python
SFTConfig(
    push_to_hub=True,                    # Enable Hub push
    hub_model_id="username/model-name",  # Target repository
)
```
### 2. Job Configuration
When submitting the job:
```python
hf_jobs("uv", {
    "script": "train.py",
    "secrets": {"HF_TOKEN": "$HF_TOKEN"},  # Provide authentication
})
```
**The `$HF_TOKEN` placeholder is automatically replaced with your Hugging Face token.**
## Complete Example
```python
# train.py
# /// script
# dependencies = ["trl"]
# ///
from trl import SFTTrainer, SFTConfig
from datasets import load_dataset
dataset = load_dataset("trl-lib/Capybara", split="train")
# Configure with Hub push
config = SFTConfig(
    output_dir="my-model",
    num_train_epochs=3,
    # ✅ CRITICAL: Hub push configuration
    push_to_hub=True,
    hub_model_id="myusername/my-trained-model",
    # Optional: control when pushes happen and which token is used
    hub_strategy="every_save",  # "end", "every_save", "checkpoint", or "all_checkpoints"
    hub_token=None,             # None = use the HF_TOKEN from the environment
)
trainer = SFTTrainer(
    model="Qwen/Qwen2.5-0.5B",
    train_dataset=dataset,
    args=config,
)
trainer.train()
# ✅ Push final model
trainer.push_to_hub()
print("✅ Model saved to: https://huggingface.co/myusername/my-trained-model")
```
**Submit with authentication:**
```python
hf_jobs("uv", {
    "script": "train.py",
    "flavor": "a10g-large",
    "timeout": "2h",
    "secrets": {"HF_TOKEN": "$HF_TOKEN"},  # ✅ Required!
})
```
## What Gets Saved
When `push_to_hub=True`:
1. **Model weights** - Final trained parameters
2. **Tokenizer** - Associated tokenizer
3. **Configuration** - Model config (config.json)
4. **Training arguments** - Hyperparameters used
5. **Model card** - Auto-generated documentation
6. **Checkpoints** - If checkpoint saving is enabled (see the next section)
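After a job finishes, you can confirm everything landed on the Hub. A minimal sketch using `huggingface_hub.list_repo_files` (the repo id is a placeholder):

```python
# Sketch: list the files that were pushed to a model repo.
from huggingface_hub import list_repo_files

for f in list_repo_files("username/model-name"):  # placeholder repo id
    print(f)
# Expect config.json, tokenizer files, and the model weights.
```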
## Checkpoint Saving
Save intermediate checkpoints during training:
```python
SFTConfig(
    output_dir="my-model",
    push_to_hub=True,
    hub_model_id="username/my-model",
    # Checkpoint configuration
    save_strategy="steps",
    save_steps=100,             # Save every 100 steps
    save_total_limit=3,         # Keep only the last 3 checkpoints
    hub_strategy="checkpoint",  # Also push the latest checkpoint to the Hub
)
```
**Benefits:**
- Resume training if job fails
- Compare checkpoint performance
- Use intermediate models
**Checkpoints are pushed to** the same repo (`username/my-model`); with `hub_strategy="checkpoint"`, the latest one lands in a `last-checkpoint/` subfolder.
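To resume after a failed job, download the pushed checkpoint and hand it to the trainer. A sketch, assuming `hub_strategy="checkpoint"` pushed a `last-checkpoint/` folder and `trainer` is the `SFTTrainer` from your script:

```python
# Sketch: resume training from the checkpoint pushed to the Hub.
from huggingface_hub import snapshot_download

local_dir = snapshot_download(
    repo_id="username/my-model",         # placeholder repo id
    allow_patterns="last-checkpoint/*",  # only fetch the checkpoint folder
)
trainer.train(resume_from_checkpoint=f"{local_dir}/last-checkpoint")
```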
## Authentication Methods
### Method 1: Automatic Token (Recommended)
```python
"secrets": {"HF_TOKEN": "$HF_TOKEN"}
```
Uses your logged-in Hugging Face token automatically.
### Method 2: Explicit Token
```python
"secrets": {"HF_TOKEN": "hf_abc123..."}
```
Provide token explicitly (not recommended for security).
### Method 3: Environment Variable
```python
"env": {"HF_TOKEN": "hf_abc123..."}
```
Pass as regular environment variable (less secure than secrets).
**Always prefer Method 1** for security and convenience.
## Verification Checklist
Before submitting any training job, verify:
- [ ] `push_to_hub=True` in training config
- [ ] `hub_model_id` is specified (format: `username/model-name`)
- [ ] `secrets={"HF_TOKEN": "$HF_TOKEN"}` in job config
- [ ] Repository name doesn't conflict with existing repos
- [ ] You have write access to the target namespace
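A quick pre-flight check can catch the last two items before you spend GPU time. A sketch, assuming you are logged in locally and `myusername/my-model` is a placeholder for your target repo:

```python
# Sketch: verify the token works and the target namespace matches your account.
from huggingface_hub import HfApi

user = HfApi().whoami()  # raises if the token is missing or invalid
print(f"Logged in as: {user['name']}")
# For org repos, check membership via user["orgs"] instead.
assert "myusername/my-model".split("/")[0] == user["name"], "Namespace mismatch!"
```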
## Repository Setup
### Automatic Creation
If repository doesn't exist, it's created automatically when first pushing.
### Manual Creation
Create repository before training:
```python
from huggingface_hub import HfApi

api = HfApi()
api.create_repo(
    repo_id="username/model-name",
    repo_type="model",
    private=False,  # or True for a private repo
)
```
### Repository Naming
**Valid names:**
- `username/my-model`
- `username/model-name`
- `organization/model-name`
**Invalid names:**
- `model-name` (missing username)
- `username/model name` (spaces not allowed)
- `username/MODEL` (uppercase discouraged)
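To catch a bad name before training starts, `huggingface_hub` ships a validator. A minimal sketch (the repo id is a placeholder):

```python
# Sketch: validate a repo id programmatically before training.
from huggingface_hub.utils import HFValidationError, validate_repo_id

try:
    validate_repo_id("username/my-model")
    print("Repo id is valid")
except HFValidationError as err:
    print(f"Invalid repo id: {err}")
```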
## Troubleshooting
### Error: 401 Unauthorized
**Cause:** HF_TOKEN not provided or invalid
**Solutions:**
1. Verify `secrets={"HF_TOKEN": "$HF_TOKEN"}` in job config
2. Check you're logged in: `huggingface-cli whoami`
3. Re-login: `huggingface-cli login`
### Error: 403 Forbidden
**Cause:** No write access to repository
**Solutions:**
1. Check repository namespace matches your username
2. Verify you're a member of organization (if using org namespace)
3. Check repository isn't private (if accessing org repo)
### Error: Repository not found
**Cause:** Repository doesn't exist and auto-creation failed
**Solutions:**
1. Manually create repository first
2. Check repository name format
3. Verify namespace exists
### Error: Push failed during training
**Cause:** Network issues or Hub unavailability
**What happens:**
1. Training continues, but the final push fails
2. Earlier checkpoints may already be on the Hub
**Solution:** Re-run the push manually before the job completes (see "Manual Push After Training" below)
### Issue: Model saved but not visible
**Possible causes:**
1. Repository is private—check https://huggingface.co/username
2. Wrong namespace—verify `hub_model_id` matches login
3. Push still in progress—wait a few minutes
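To check programmatically, a sketch using `HfApi.repo_info` (repo id is a placeholder; pass your token for private repos):

```python
# Sketch: confirm the repo exists and inspect its visibility.
from huggingface_hub import HfApi

info = HfApi().repo_info("username/model-name")  # placeholder repo id
print(f"private: {info.private}, last modified: {info.last_modified}")
```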
## Manual Push After Training
If training completes but push fails, push manually:
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load from the local checkpoint directory
model = AutoModelForCausalLM.from_pretrained("./output_dir")
tokenizer = AutoTokenizer.from_pretrained("./output_dir")

# Push to the Hub (the token defaults to HF_TOKEN in the environment)
model.push_to_hub("username/model-name")
tokenizer.push_to_hub("username/model-name")
```
**Note:** This only works while the job is still running; once it completes, the local files are deleted.
## Best Practices
1. **Always enable `push_to_hub=True`**
2. **Use checkpoint saving** for long training runs
3. **Verify Hub push** in logs before job completes
4. **Set appropriate `save_total_limit`** to avoid excessive checkpoints
5. **Use descriptive repo names** (e.g., `qwen-capybara-sft` not `model1`)
6. **Add model card** with training details
7. **Tag models** with relevant tags (e.g., `text-generation`, `fine-tuned`)
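For the last two practices, the auto-generated card can be updated after training. A sketch using `huggingface_hub.ModelCard` (the repo id is a placeholder):

```python
# Sketch: add tags to the auto-generated model card after training.
from huggingface_hub import ModelCard

card = ModelCard.load("myusername/qwen-capybara-sft")
card.data.tags = (card.data.tags or []) + ["text-generation", "fine-tuned"]
card.push_to_hub("myusername/qwen-capybara-sft")
```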
## Monitoring Push Progress
Check logs for push progress:
```python
hf_jobs("logs", {"job_id": "your-job-id"})
```
**Look for:**
```
Pushing model to username/model-name...
Upload file pytorch_model.bin: 100%
✅ Model pushed successfully
```
## Example: Full Production Setup
```python
# production_train.py
# /// script
# dependencies = ["trl>=0.12.0", "peft>=0.7.0"]
# ///
from datasets import load_dataset
from peft import LoraConfig
from trl import SFTTrainer, SFTConfig
import os
# Verify token is available
assert "HF_TOKEN" in os.environ, "HF_TOKEN not found in environment!"
# Load dataset
dataset = load_dataset("trl-lib/Capybara", split="train")
print(f"✅ Dataset loaded: {len(dataset)} examples")
# Configure with comprehensive Hub settings
config = SFTConfig(
    output_dir="qwen-capybara-sft",
    # Hub configuration
    push_to_hub=True,
    hub_model_id="myusername/qwen-capybara-sft",
    hub_strategy="checkpoint",  # Also push the latest checkpoint
    # Checkpoint configuration
    save_strategy="steps",
    save_steps=100,
    save_total_limit=3,
    # Training settings
    num_train_epochs=3,
    per_device_train_batch_size=4,
    # Logging
    logging_steps=10,
    logging_first_step=True,
)
# Train with LoRA
trainer = SFTTrainer(
    model="Qwen/Qwen2.5-0.5B",
    train_dataset=dataset,
    args=config,
    peft_config=LoraConfig(r=16, lora_alpha=32),
)
print("🚀 Starting training...")
trainer.train()
print("💾 Pushing final model to Hub...")
trainer.push_to_hub()
print("✅ Training complete!")
print(f"Model available at: https://huggingface.co/myusername/qwen-capybara-sft")
```
**Submit:**
```python
hf_jobs("uv", {
    "script": "production_train.py",
    "flavor": "a10g-large",
    "timeout": "6h",
    "secrets": {"HF_TOKEN": "$HF_TOKEN"},
})
```
## Key Takeaway
**Without `push_to_hub=True` and `secrets={"HF_TOKEN": "$HF_TOKEN"}`, all training results are permanently lost.**
Always verify both are configured before submitting any training job.