# HuggingFace Space Deployment Instructions
## 1. Create Space on HuggingFace
1. Go to https://huggingface.co/new-space
2. Fill in details:
- **Owner**: `appsmithery` (or your organization)
- **Space name**: `code-chef-modelops-trainer`
- **License**: `apache-2.0`
- **SDK**: `Gradio`
- **Hardware**: `t4-small` (upgrade to `a10g-large` for 3-7B models)
- **Visibility**: `Private` (recommended) or `Public`
## 2. Configure Secrets
In Space Settings > Variables and secrets:
1. Add secret: `HF_TOKEN`
- Value: Your HuggingFace write access token from https://huggingface.co/settings/tokens
- Required permissions: `write` (for pushing trained models)
## 3. Upload Files
Upload these files to the Space repository:
```
code-chef-modelops-trainer/
├── app.py              # Main application
├── requirements.txt    # Python dependencies
└── README.md           # Space documentation
```
**Option A: Via Web UI**
- Drag and drop files to Space Files tab
**Option B: Via Git**
```bash
# Clone the Space repo
git clone https://huggingface.co/spaces/appsmithery/code-chef-modelops-trainer
cd code-chef-modelops-trainer
# Copy files
cp deploy/huggingface-spaces/modelops-trainer/* .
# Commit and push
git add .
git commit -m "Initial ModelOps trainer deployment"
git push
```
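The same upload can also be scripted from Python. The sketch below assumes the `huggingface_hub` package is installed and that the token has write access; the helper names `missing_space_files` and `upload_space` are hypothetical and not part of this repository. It checks that the three files listed above exist before pushing anything.

```python
from pathlib import Path

# Files this guide says the Space repository needs.
SPACE_FILES = ("app.py", "requirements.txt", "README.md")


def missing_space_files(folder: str) -> list[str]:
    """Return the required Space files that are absent from `folder`."""
    root = Path(folder)
    return [name for name in SPACE_FILES if not (root / name).is_file()]


def upload_space(folder: str, repo_id: str, token: str) -> None:
    """Upload `folder` to an existing Space (hypothetical helper)."""
    missing = missing_space_files(folder)
    if missing:
        raise FileNotFoundError(f"Missing Space files: {', '.join(missing)}")
    # Imported lazily so the check above works without the package installed.
    from huggingface_hub import HfApi

    HfApi(token=token).upload_folder(
        folder_path=folder,
        repo_id=repo_id,
        repo_type="space",
        commit_message="Initial ModelOps trainer deployment",
    )


# Example call (performs a real upload, so it is left commented out):
# upload_space("deploy/huggingface-spaces/modelops-trainer",
#              "appsmithery/code-chef-modelops-trainer", token="hf_...")
```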
## 4. Verify Deployment
1. Wait for Space to build (2-3 minutes)
2. Check logs for errors
3. Test health endpoint:
```bash
curl https://appsmithery-code-chef-modelops-trainer.hf.space/health
```
Expected response:
```json
{
"status": "healthy",
"service": "code-chef-modelops-trainer",
"autotrain_available": true,
"hf_token_configured": true
}
```
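For scripted checks, the response can be validated field by field. This is a minimal sketch using only the standard library; `health_problems` and `fetch_health` are hypothetical helper names. It assumes a private Space needs a bearer token on the request, so the token parameter is optional.

```python
import json
import urllib.request
from typing import Optional

# Boolean flags from the expected response that should all be true.
EXPECTED_FLAGS = ("autotrain_available", "hf_token_configured")


def health_problems(payload: dict) -> list[str]:
    """Return a list of problems found in a /health payload; empty means healthy."""
    problems = []
    if payload.get("status") != "healthy":
        problems.append(f"status is {payload.get('status')!r}, expected 'healthy'")
    for flag in EXPECTED_FLAGS:
        if payload.get(flag) is not True:
            problems.append(f"{flag} is not true")
    return problems


def fetch_health(space_url: str, token: Optional[str] = None, timeout: float = 60.0) -> dict:
    """GET <space_url>/health and decode the JSON body."""
    request = urllib.request.Request(f"{space_url.rstrip('/')}/health")
    if token:
        # Assumption: private Spaces accept a Hugging Face token as a bearer token.
        request.add_header("Authorization", f"Bearer {token}")
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)


# Example (makes a network call, so it is left commented out):
# print(health_problems(fetch_health("https://appsmithery-code-chef-modelops-trainer.hf.space")))
```

An empty list from `health_problems` corresponds to the expected response shown above. A `hf_token_configured is not true` entry suggests the `HF_TOKEN` secret from step 2 is missing.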
## 5. Update code-chef Configuration
Add Space URL to `config/env/.env`:
```bash
# ModelOps - HuggingFace Space
MODELOPS_SPACE_URL=https://appsmithery-code-chef-modelops-trainer.hf.space
MODELOPS_SPACE_TOKEN=your_hf_token_here
```
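A small loader can catch a missing or placeholder value before any request is sent. This is a sketch only: `ModelOpsConfig` and `load_modelops_config` are hypothetical names, and it assumes the two variables above have already been exported into the process environment.

```python
import os
from typing import Mapping, NamedTuple

PLACEHOLDER_TOKEN = "your_hf_token_here"  # the placeholder used in the .env example


class ModelOpsConfig(NamedTuple):
    space_url: str
    space_token: str


def load_modelops_config(env: Mapping[str, str] = os.environ) -> ModelOpsConfig:
    """Read and sanity-check the ModelOps Space settings from `env`."""
    names = ("MODELOPS_SPACE_URL", "MODELOPS_SPACE_TOKEN")
    missing = [name for name in names if not env.get(name)]
    if missing:
        raise RuntimeError(f"Missing environment variables: {', '.join(missing)}")
    url = env["MODELOPS_SPACE_URL"].rstrip("/")
    if not url.startswith("https://"):
        raise RuntimeError("MODELOPS_SPACE_URL should be an https:// URL")
    token = env["MODELOPS_SPACE_TOKEN"]
    if token == PLACEHOLDER_TOKEN:
        raise RuntimeError("MODELOPS_SPACE_TOKEN still holds the placeholder value")
    return ModelOpsConfig(space_url=url, space_token=token)
```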
## 6. Test from code-chef
Use the client example:
```python
import os

# Assumes the client module is importable under this package path; the files
# themselves live in deploy/huggingface-spaces/modelops-trainer/, and hyphenated
# directory names cannot be imported directly.
from deploy.huggingface_spaces.modelops_trainer.client_example import ModelOpsTrainerClient

client = ModelOpsTrainerClient(
    space_url=os.environ["MODELOPS_SPACE_URL"],
    hf_token=os.environ["MODELOPS_SPACE_TOKEN"],
)

# Health check
health = client.health_check()
print(health)

# Submit demo job
result = client.submit_training_job(
    agent_name="feature_dev",
    base_model="Qwen/Qwen2.5-Coder-7B",
    dataset_csv_path="/tmp/demo.csv",
    demo_mode=True,
)
print(f"Job ID: {result['job_id']}")
```
## 7. Hardware Upgrades
For larger models (3-7B), upgrade hardware:
1. Go to Space Settings
2. Change Hardware to `a10g-large`
3. Note: Cost increases from ~$0.75/hr to ~$2.20/hr
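To see what the upgrade means for a given job, the approximate hourly rates quoted above can be multiplied out. This is a rough sketch; actual billing may differ, and the rates are the approximations from this guide, not current pricing.

```python
# Approximate hourly rates quoted in this guide (USD per hour).
HOURLY_RATE_USD = {"t4-small": 0.75, "a10g-large": 2.20}


def estimated_cost(hardware: str, hours: float) -> float:
    """Estimate the cost in USD of running `hardware` for `hours`."""
    return round(HOURLY_RATE_USD[hardware] * hours, 2)


# A two-hour job on each tier:
print(estimated_cost("t4-small", 2))
print(estimated_cost("a10g-large", 2))
```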
## 8. Monitoring
- **Logs**: Check Space logs for errors
- **TensorBoard**: Each job provides a TensorBoard URL
- **LangSmith**: Client example includes `@traceable` for observability
## 9. Production Considerations
- **Persistence**: Job state is stored in `/tmp` and is lost when the Space restarts. Use persistent storage or an external database for production
- **Queuing**: The current version runs jobs sequentially. Add a job queue (for example Celery with Redis) for concurrent training
- **Authentication**: Add API key auth for production use
- **Rate Limiting**: Add rate limits to prevent abuse
- **Monitoring**: Set up alerts for failed jobs
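As one illustration of the authentication item, an API key check can be as small as the sketch below. The header name `X-API-Key` and the function `is_authorized` are assumptions for illustration; the current `app.py` is not known to include anything like this.

```python
import hmac
from typing import Mapping

API_KEY_HEADER = "X-API-Key"  # hypothetical header name


def is_authorized(headers: Mapping[str, str], expected_key: str) -> bool:
    """Return True only if the request carries the expected API key."""
    if not expected_key:
        # Refuse everything when no key is configured, rather than allowing all.
        return False
    supplied = headers.get(API_KEY_HEADER, "")
    # compare_digest avoids leaking how many leading characters matched.
    return hmac.compare_digest(supplied.encode(), expected_key.encode())
```

The expected key would typically come from a Space secret, in the same way as `HF_TOKEN`.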
## 10. Cost Optimization
- **Auto-scaling**: Set a sleep time in Space Settings so the Space sleeps after a period of inactivity
- **Demo mode**: Always test with demo mode first (about $0.50 versus about $15 for a full training run)
- **Batch jobs**: Train multiple agents in sequence to maximize GPU utilization
- **Local development**: Test locally before deploying to Space
## Troubleshooting
**Space won't build**:
- Check requirements.txt versions
- Verify Python version compatibility (3.9+ recommended)
- Check Space logs for build errors
**Training fails**:
- Verify HF_TOKEN has write permissions
- Check dataset format (must have `text` and `response` columns)
- Ensure model repo exists on HuggingFace Hub
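The dataset format can be checked locally before a job is submitted. This is a minimal sketch using the standard `csv` module; `missing_columns` is a hypothetical helper and it only inspects the header row.

```python
import csv
from typing import TextIO

# Columns the trainer expects, per the note above.
REQUIRED_COLUMNS = {"text", "response"}


def missing_columns(csv_file: TextIO) -> set:
    """Return the required column names absent from the CSV header row."""
    header = next(csv.reader(csv_file), [])
    return REQUIRED_COLUMNS - {name.strip() for name in header}


# Example usage with a file on disk:
# with open("/tmp/demo.csv", newline="", encoding="utf-8") as f:
#     print(missing_columns(f) or "columns OK")
```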
**Out of memory**:
- Enable demo mode to test with smaller dataset
- Use quantization: `int4` or `int8`
- Upgrade to larger GPU (`a10g-large`)
- Reduce `max_seq_length` in config
**Connection timeout**:
- The Space may be sleeping; the first request wakes it, which adds a delay of about 30 seconds
- Increase the client timeout to 60s for the first request
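The wake-up delay can be handled with a longer timeout on the first attempt and a short retry loop. The sketch below is independent of any particular HTTP client: `call_with_wake_retry` is a hypothetical helper that takes a callable accepting a timeout in seconds. The retry count and pause are assumptions to tune.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def call_with_wake_retry(
    request: Callable[[float], T],
    first_timeout: float = 60.0,
    later_timeout: float = 30.0,
    attempts: int = 3,
    pause: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call `request(timeout)`, giving the first attempt extra time to wake the Space."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(attempts):
        timeout = first_timeout if attempt == 0 else later_timeout
        try:
            return request(timeout)
        except OSError:
            # TimeoutError, ConnectionError and urllib's URLError all derive from OSError.
            if attempt == attempts - 1:
                raise
            sleep(pause)
    raise AssertionError("unreachable")
```

With the standard library, `request` could be `lambda t: urllib.request.urlopen(url, timeout=t).read()`. Other clients may raise exception types that do not derive from `OSError`, in which case the `except` clause needs adjusting.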