Spaces:

alextorelli
/

code-chef-modelops-trainer

Sleeping

File size: 4,416 Bytes

7b442a3

# HuggingFace Space Deployment Instructions

## 1. Create Space on HuggingFace

1. Go to https://huggingface.co/new-space
2. Fill in details:
   - **Owner**: `appsmithery` (or your organization)
   - **Space name**: `code-chef-modelops-trainer`
   - **License**: `apache-2.0`
   - **SDK**: `Gradio`
   - **Hardware**: `t4-small` (upgrade to `a10g-large` for 3-7B models)
   - **Visibility**: `Private` (recommended) or `Public`

## 2. Configure Secrets

In Space Settings > Variables and secrets:

1. Add secret: `HF_TOKEN`
   - Value: Your HuggingFace write access token from https://huggingface.co/settings/tokens
   - Required permissions: `write` (for pushing trained models)

## 3. Upload Files

Upload these files to the Space repository:

```

code-chef-modelops-trainer/

├── app.py                  # Main application

├── requirements.txt        # Python dependencies

└── README.md              # Space documentation

```

**Option A: Via Web UI**

- Drag and drop files to Space Files tab

**Option B: Via Git**

```bash

# Clone the Space repo

git clone https://huggingface.co/spaces/appsmithery/code-chef-modelops-trainer

cd code-chef-modelops-trainer



# Copy files

cp deploy/huggingface-spaces/modelops-trainer/* .



# Commit and push

git add .

git commit -m "Initial ModelOps trainer deployment"

git push

```

## 4. Verify Deployment

1. Wait for Space to build (2-3 minutes)
2. Check logs for errors
3. Test health endpoint:
   ```bash

   curl https://appsmithery-code-chef-modelops-trainer.hf.space/health

   ```

Expected response:

```json

{

  "status": "healthy",

  "service": "code-chef-modelops-trainer",

  "autotrain_available": true,

  "hf_token_configured": true

}

```

## 5. Update code-chef Configuration

Add Space URL to `config/env/.env`:

```bash

# ModelOps - HuggingFace Space

MODELOPS_SPACE_URL=https://appsmithery-code-chef-modelops-trainer.hf.space

MODELOPS_SPACE_TOKEN=your_hf_token_here

```

## 6. Test from code-chef

Use the client example:

```python

from deploy.huggingface_spaces.modelops_trainer.client_example import ModelOpsTrainerClient



client = ModelOpsTrainerClient(

    space_url=os.environ["MODELOPS_SPACE_URL"],

    hf_token=os.environ["MODELOPS_SPACE_TOKEN"]

)



# Health check

health = client.health_check()

print(health)



# Submit demo job

result = client.submit_training_job(

    agent_name="feature_dev",

    base_model="Qwen/Qwen2.5-Coder-7B",

    dataset_csv_path="/tmp/demo.csv",

    demo_mode=True

)



print(f"Job ID: {result['job_id']}")

```

## 7. Hardware Upgrades

For larger models (3-7B), upgrade hardware:

1. Go to Space Settings
2. Change Hardware to `a10g-large`
3. Note: Cost increases from ~$0.75/hr to ~$2.20/hr

## 8. Monitoring

- **Logs**: Check Space logs for errors
- **TensorBoard**: Each job provides a TensorBoard URL
- **LangSmith**: Client example includes `@traceable` for observability

## 9. Production Considerations

- **Persistence**: Jobs stored in `/tmp` - lost on restart. Use persistent storage or external DB for production
- **Queuing**: Current version runs jobs sequentially. Add job queue (Celery/Redis) for concurrent training
- **Authentication**: Add API key auth for production use
- **Rate Limiting**: Add rate limits to prevent abuse
- **Monitoring**: Set up alerts for failed jobs

## 10. Cost Optimization

- **Auto-scaling**: Set Space to sleep after inactivity
- **Demo mode**: Always test with demo mode first ($0.50 vs $15)
- **Batch jobs**: Train multiple agents in sequence to maximize GPU utilization
- **Local development**: Test locally before deploying to Space

## Troubleshooting

**Space won't build**:

- Check requirements.txt versions
- Verify Python version compatibility (3.9+ recommended)
- Check Space logs for build errors

**Training fails**:

- Verify HF_TOKEN has write permissions

- Check dataset format (must have `text` and `response` columns)

- Ensure model repo exists on HuggingFace Hub



**Out of memory**:



- Enable demo mode to test with smaller dataset

- Use quantization: `int4` or `int8`

- Upgrade to larger GPU (`a10g-large`)

- Reduce `max_seq_length` in config



**Connection timeout**:



- Space may be sleeping - first request wakes it (30s delay)

- Increase client timeout to 60s for first request