# HuggingFace Space Deployment Instructions

## 1. Create Space on HuggingFace

1. Go to https://huggingface.co/new-space
2. Fill in details:
   - **Owner**: `appsmithery` (or your organization)
   - **Space name**: `code-chef-modelops-trainer`
   - **License**: `apache-2.0`
   - **SDK**: `Gradio`
   - **Hardware**: `t4-small` (upgrade to `a10g-large` for 3-7B models)
   - **Visibility**: `Private` (recommended) or `Public`

## 2. Configure Secrets

In Space Settings > Variables and secrets:

1. Add secret: `HF_TOKEN`
   - Value: Your HuggingFace write access token from https://huggingface.co/settings/tokens
   - Required permissions: `write` (for pushing trained models)
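
Spaces expose secrets to the running app as environment variables, so the app can check for the token at startup. A minimal sketch, assuming the app reads `HF_TOKEN` straight from the environment; the helper name `hf_token_configured` is hypothetical and only mirrors the `hf_token_configured` field reported by the health endpoint:

```python
import os


def hf_token_configured(env=None):
    """Return True if a non-empty HF_TOKEN is present in the environment."""
    # `env` is injectable for testing; it defaults to the process environment.
    env = os.environ if env is None else env
    return bool(env.get("HF_TOKEN", "").strip())


if __name__ == "__main__":
    if not hf_token_configured():
        print("HF_TOKEN is not set; pushing trained models will fail.")
```

This only confirms that the secret is present, not that the token carries `write` permission.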

## 3. Upload Files

Upload these files to the Space repository:

```
code-chef-modelops-trainer/
├── app.py                  # Main application
├── requirements.txt        # Python dependencies
└── README.md               # Space documentation
```

**Option A: Via Web UI**

- Drag and drop files to Space Files tab

**Option B: Via Git**

```bash
# Clone the Space repo
git clone https://huggingface.co/spaces/appsmithery/code-chef-modelops-trainer
cd code-chef-modelops-trainer

# Copy files (the source path is relative to the code-chef repo root;
# adjust it to wherever your code-chef checkout lives)
cp deploy/huggingface-spaces/modelops-trainer/* .

# Commit and push
git add .
git commit -m "Initial ModelOps trainer deployment"
git push
```

## 4. Verify Deployment

1. Wait for Space to build (2-3 minutes)
2. Check logs for errors
3. Test health endpoint:
   ```bash
   curl https://appsmithery-code-chef-modelops-trainer.hf.space/health
   ```

Expected response:

```json
{
  "status": "healthy",
  "service": "code-chef-modelops-trainer",
  "autotrain_available": true,
  "hf_token_configured": true
}
```
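
The same check can be scripted so that a deployment pipeline fails fast on an unhealthy Space. A minimal sketch using only the standard library; the function names `is_healthy` and `fetch_health` are hypothetical, and the bearer-token header is an assumption about how a private Space is reached:

```python
import json
import urllib.request

EXPECTED_KEYS = ("status", "service", "autotrain_available", "hf_token_configured")


def is_healthy(payload):
    """Check a decoded /health payload against the expected response."""
    if any(key not in payload for key in EXPECTED_KEYS):
        return False
    return (
        payload["status"] == "healthy"
        and payload["autotrain_available"] is True
        and payload["hf_token_configured"] is True
    )


def fetch_health(space_url, token=None, timeout=60):
    """GET <space_url>/health and return the decoded JSON body."""
    request = urllib.request.Request(space_url.rstrip("/") + "/health")
    if token:
        # Assumption: a private Space expects a bearer token on each request.
        request.add_header("Authorization", f"Bearer {token}")
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)
```

Calling `is_healthy(fetch_health(url))` returns `False` when either `autotrain_available` or `hf_token_configured` is `false`, which usually points to a dependency or secret problem rather than a build failure.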

## 5. Update code-chef Configuration

Add Space URL to `config/env/.env`:

```bash
# ModelOps - HuggingFace Space
MODELOPS_SPACE_URL=https://appsmithery-code-chef-modelops-trainer.hf.space
MODELOPS_SPACE_TOKEN=your_hf_token_here
```
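
The Space URL used here follows the pattern `https://<owner>-<space-name>.hf.space`. A minimal sketch of deriving the URL and loading both settings; the helper names `space_url` and `load_modelops_config` are hypothetical, and the lowercasing is an assumption, so names with characters other than letters, digits, and hyphens should be checked against the URL shown on the Space page:

```python
import os


def space_url(owner, space_name):
    """Build the direct URL of a Space from its owner and name."""
    # Assumption: the host is the lowercased "<owner>-<space-name>" pair.
    return f"https://{owner}-{space_name}.hf.space".lower()


def load_modelops_config(env=None):
    """Read the two ModelOps settings, failing loudly if either is missing."""
    env = os.environ if env is None else env
    missing = [
        key
        for key in ("MODELOPS_SPACE_URL", "MODELOPS_SPACE_TOKEN")
        if not env.get(key)
    ]
    if missing:
        raise KeyError(f"Missing settings: {', '.join(missing)}")
    return {
        "space_url": env["MODELOPS_SPACE_URL"].rstrip("/"),
        "hf_token": env["MODELOPS_SPACE_TOKEN"],
    }
```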

## 6. Test from code-chef

Use the client example:

```python
import os

from deploy.huggingface_spaces.modelops_trainer.client_example import ModelOpsTrainerClient

# Both values come from config/env/.env (see step 5)
client = ModelOpsTrainerClient(
    space_url=os.environ["MODELOPS_SPACE_URL"],
    hf_token=os.environ["MODELOPS_SPACE_TOKEN"],
)

# Health check
health = client.health_check()
print(health)

# Submit demo job
result = client.submit_training_job(
    agent_name="feature_dev",
    base_model="Qwen/Qwen2.5-Coder-7B",
    dataset_csv_path="/tmp/demo.csv",
    demo_mode=True,
)

print(f"Job ID: {result['job_id']}")
```

## 7. Hardware Upgrades

For larger models (3-7B), upgrade hardware:

1. Go to Space Settings
2. Change Hardware to `a10g-large`
3. Note: Cost increases from ~$0.75/hr to ~$2.20/hr
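
To see what the upgrade means for a given run, multiply the hourly rate by the expected training time. A small worked sketch; the rates are the approximate figures quoted above, not authoritative pricing, and `estimate_cost` is a hypothetical helper:

```python
# Approximate hourly rates in USD, taken from the figures in this guide.
HOURLY_RATE_USD = {"t4-small": 0.75, "a10g-large": 2.20}


def estimate_cost(hardware, hours):
    """Return the approximate cost in USD of running `hardware` for `hours`."""
    return round(HOURLY_RATE_USD[hardware] * hours, 2)


if __name__ == "__main__":
    for tier in HOURLY_RATE_USD:
        print(f"{tier}: ~${estimate_cost(tier, 4):.2f} for a 4-hour run")
```

At these rates a 4-hour run costs about $3.00 on `t4-small` and about $8.80 on `a10g-large`.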

## 8. Monitoring

- **Logs**: Check Space logs for errors
- **TensorBoard**: Each job provides a TensorBoard URL
- **LangSmith**: Client example includes `@traceable` for observability

## 9. Production Considerations

- **Persistence**: Jobs are stored in `/tmp` and are lost on restart. Use persistent storage or an external database for production
- **Queuing**: The current version runs jobs sequentially. Add a job queue (Celery/Redis) for concurrent training
- **Authentication**: Add API key authentication for production use
- **Rate Limiting**: Add rate limits to prevent abuse
- **Monitoring**: Set up alerts for failed jobs
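
As an illustration of the persistence point, job records can be written as JSON files to a directory that survives restarts instead of `/tmp`. A minimal sketch, not the trainer's actual storage code; the `JobStore` class is hypothetical, and the storage directory (for example a mounted persistent volume) is an assumption you would need to provide:

```python
import json
from pathlib import Path


class JobStore:
    """Keep one JSON file per job under a directory that survives restarts."""

    def __init__(self, root):
        self.root = Path(root)
        self.root.mkdir(parents=True, exist_ok=True)

    def save(self, job_id, record):
        # Write to a temporary file first, then rename it into place, so a
        # crash mid-write cannot leave a half-written record behind.
        target = self.root / f"{job_id}.json"
        tmp = self.root / f"{job_id}.json.tmp"
        tmp.write_text(json.dumps(record))
        tmp.replace(target)

    def load(self, job_id):
        """Return the stored record, or None if the job is unknown."""
        path = self.root / f"{job_id}.json"
        return json.loads(path.read_text()) if path.exists() else None
```

A file-per-job layout keeps the sketch dependency-free; an external database is the better fit once several workers need to update job state concurrently.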

## 10. Cost Optimization

- **Sleep time**: Configure the Space to sleep after a period of inactivity
- **Demo mode**: Always test with demo mode first (about $0.50 for a demo run vs. about $15 for a full run)
- **Batch jobs**: Train multiple agents in sequence to maximize GPU utilization
- **Local development**: Test locally before deploying to Space

## Troubleshooting

**Space won't build**:

- Check requirements.txt versions
- Verify Python version compatibility (3.9+ recommended)
- Check Space logs for build errors

**Training fails**:

- Verify `HF_TOKEN` has write permissions
- Check dataset format (must have `text` and `response` columns)
- Ensure model repo exists on HuggingFace Hub
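
The dataset format can be checked locally before a job is submitted. A minimal sketch that reads only the CSV header; the helper name `missing_columns` is hypothetical, and it assumes a comma-delimited, UTF-8 file with a header row:

```python
import csv

REQUIRED_COLUMNS = {"text", "response"}


def missing_columns(csv_path):
    """Return the required columns absent from the CSV header, sorted."""
    with open(csv_path, newline="", encoding="utf-8") as handle:
        # Only the first row is read; an empty file yields an empty header.
        header = next(csv.reader(handle), [])
    return sorted(REQUIRED_COLUMNS - {name.strip() for name in header})
```

An empty list means both columns are present; anything else names the columns to add before uploading.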



**Out of memory**:

- Enable demo mode to test with smaller dataset
- Use quantization: `int4` or `int8`
- Upgrade to larger GPU (`a10g-large`)
- Reduce `max_seq_length` in config



**Connection timeout**:

- Space may be sleeping - first request wakes it (30s delay)
- Increase client timeout to 60s for first request
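
One way to absorb the wake-up delay is to send a warm-up request with a generous timeout and retry a couple of times before submitting real work. A minimal sketch using only the standard library; `wake_space` and its parameters are hypothetical, and the retry count and delay are illustrative defaults rather than tuned values:

```python
import time
import urllib.request


def wake_space(url, timeout=60, retries=2, delay=5, opener=urllib.request.urlopen):
    """Request `url`, retrying a few times while a sleeping Space wakes up."""
    last_error = None
    for attempt in range(retries + 1):
        try:
            with opener(url, timeout=timeout) as response:
                return response.status
        except OSError as error:
            # URLError and socket timeouts are both OSError subclasses.
            last_error = error
            if attempt < retries:
                time.sleep(delay)
    raise last_error
```

Pointing this at the `/health` URL before the first job submission keeps the longer timeout out of the normal request path.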