Spaces:

lapnt3
/

my-gradio-app

Runtime error

File size: 6,668 Bytes

eeb0f9c

# Scripts Documentation 🚀

Automated scripts for HeoCare Chatbot setup and maintenance.

## 📋 Quick Start

### One-Command Setup (Recommended)

```bash
# Run everything in one command
bash scripts/setup_rag.sh
```

**What it does:**
1. ✅ Check Python & dependencies
2. ✅ Install required packages
3. ✅ Download 6 medical datasets from HuggingFace
4. ✅ Build ChromaDB vector stores (~160 MB)
5. ✅ Generate training data (200 conversations)
6. ✅ Optional: Fine-tune agents

**Time:** ~15-20 minutes (depends on internet speed)

---

## 📜 Available Scripts

### 1. `setup_rag.sh` ⭐ Main Setup

```bash
bash scripts/setup_rag.sh
```

**Features:**
- Downloads 6 datasets from HuggingFace:
  - ViMedical (603 diseases)
  - MentalChat16K (16K conversations)
  - Nutrition recommendations
  - Vietnamese food nutrition
  - Fitness exercises (1.66K)
  - Medical Q&A (9.3K pairs)
- Builds ChromaDB vector stores
- Generates training data
- Optional fine-tuning

**Skip existing databases automatically!**

---

### 2. `generate_training_data.py` - Training Data

```bash
python scripts/generate_training_data.py
```

**What it does:**
- Generates 200 synthetic conversations
- 50 scenarios per agent (nutrition, symptom, exercise, mental_health)
- Uses GPT-4o-mini
- Output: `fine_tuning/training_data/*.jsonl`

**Cost:** ~$0.50 (OpenAI API)

---

### 3. `auto_finetune.py` - Batch Fine-tuning

```bash
python scripts/auto_finetune.py
```

**What it does:**
- Fine-tunes all 4 agents automatically
- Uploads training files
- Creates fine-tuning jobs
- Tracks progress
- Updates model config

**Requirements:** OpenAI official API (custom APIs not supported)

---

### 4. `fine_tune_agent.py` - Single Agent Fine-tuning

```bash
python scripts/fine_tune_agent.py nutrition_agent
```

**What it does:**
- Fine-tune one specific agent
- Manual control over the process
- Alternative to auto_finetune.py

**Agents:** `nutrition_agent`, `symptom_agent`, `exercise_agent`, `mental_health_agent`

---

### 5. `check_rag_status.py` - Diagnostic Tool

```bash
python scripts/check_rag_status.py
```

**What it checks:**
- ✅ ChromaDB folders exist
- 📊 Database sizes
- 📚 Document counts
- 🧪 Test queries

**Note:** May need updates for new vector store paths

---

## 📁 Directory Structure

```
scripts/
├── setup_rag.sh                   # ⭐ Main setup script
├── generate_training_data.py      # Generate synthetic data
├── auto_finetune.py               # Batch fine-tuning
├── fine_tune_agent.py             # Single agent fine-tuning
├── check_rag_status.py            # Diagnostic tool
└── README.md                      # This file

data_mining/                       # Dataset downloaders
├── mining_vimedical.py            # ViMedical diseases
├── mining_mentalchat.py           # Mental health conversations
├── mining_nutrition.py            # Nutrition recommendations
├── mining_vietnamese_food.py      # Vietnamese food data
├── mining_fitness.py              # Fitness exercises
└── mining_medical_qa.py           # Medical Q&A pairs

rag/vector_store/                  # ChromaDB (NOT committed)
├── medical_diseases/              # ViMedical (603 diseases)
├── mental_health/                 # MentalChat (16K conversations)
├── nutrition/                     # Nutrition plans
├── vietnamese_nutrition/          # Vietnamese foods (73)
├── fitness/                       # Exercises (1.66K)
├── symptom_qa/                    # Medical Q&A
└── general_health_qa/             # General health Q&A

fine_tuning/training_data/         # Generated data (NOT committed)
├── nutrition_training.jsonl
├── symptom_training.jsonl
├── exercise_training.jsonl
└── mental_health_training.jsonl
```

---

## 🔄 Team Workflow

### First Time Setup (New Team Member)

```bash
# 1. Clone repo
git clone <repo-url>
cd heocare-chatbot

# 2. Create .env file
cp .env.example .env
# Add your OPENAI_API_KEY

# 3. Setup everything (one command)
bash scripts/setup_rag.sh

# 4. Run app
python app.py
```

**Time:** ~15-20 minutes

---

### Daily Development

```bash
# Pull latest code
git pull

# If setup_rag.sh was updated, run it again
# (It will skip existing databases automatically)
bash scripts/setup_rag.sh

# Run app
python app.py
```

---

### Regenerate Training Data

```bash
# If you updated agent prompts or scenarios
python scripts/generate_training_data.py

# Optional: Fine-tune with new data
python scripts/auto_finetune.py
```

---

### Reset Everything

```bash
# Delete all generated data
rm -rf rag/vector_store/*
rm -rf fine_tuning/training_data/*
rm -rf data_mining/datasets/*
rm -rf data_mining/output/*

# Setup from scratch
bash scripts/setup_rag.sh
```

---

## 🐛 Troubleshooting

### Setup Failed

```bash
# Check Python version (need 3.8+)
python --version

# Check dependencies
pip install -r requirements.txt

# Check API key
echo $OPENAI_API_KEY
```

---

### Dataset Download Failed

```bash
# Check internet connection
ping huggingface.co

# Try manual download for specific dataset
python data_mining/mining_vimedical.py
python data_mining/mining_mentalchat.py
```

---

### ChromaDB Issues

```bash
# Check status
python scripts/check_rag_status.py

# Delete and rebuild specific database
rm -rf rag/vector_store/medical_diseases
python data_mining/mining_vimedical.py

# Move to correct location
mkdir -p rag/vector_store
mv data_mining/output/medical_chroma rag/vector_store/medical_diseases
```

---

### Fine-tuning 404 Error

```
Error: 404 - {'detail': 'Not Found'}
```

**Cause:** Custom API endpoint doesn't support fine-tuning

**Solution:**
1. Use OpenAI official API for fine-tuning
2. Or skip fine-tuning (app works fine with base model + RAG)

```bash
# Option 1: Update .env to use official API
OPENAI_BASE_URL=https://api.openai.com/v1
OPENAI_API_KEY=sk-proj-your-official-key

# Option 2: Skip fine-tuning
# Just run the app without fine-tuning
python app.py
```

---

## 📊 Performance

| Task | Time | Size |
|------|------|------|
| Download datasets | ~5-8 min | ~50 MB |
| Build ChromaDB | ~5-7 min | ~160 MB |
| Generate training data | ~2-3 min | ~500 KB |
| Fine-tuning (optional) | ~30-60 min | - |
| **Total Setup** | **~15-20 min** | **~160 MB** |

---

## 🆘 Support

If you encounter issues:

1. Run `python scripts/check_rag_status.py` for diagnostics
2. Check console logs for errors
3. Verify `.gitignore` is correct
4. Try deleting and rebuilding specific databases
5. Check that `.env` has valid API key

---

**Happy Coding! 🚀**