AldawsariNLP committed on
Commit
3d910e2
·
1 Parent(s): 526c6c2

Remove chunks from API response and frontend display; update most of the files for the final version

.gitignore CHANGED
@@ -49,3 +49,5 @@ yarn-error.log*
49
  Thumbs.db
50
 
51
  # Documents (tracked via Hugging Face Xet)
 
 
 
49
  Thumbs.db
50
 
51
  # Documents (tracked via Hugging Face Xet)
52
+ GITHUB_SETUP.md
53
+ QUICKSTART.md
QUICKSTART.md CHANGED
@@ -1,13 +1,19 @@
1
  # Quick Start Guide
2
 
 
 
3
  ## Prerequisites
4
 
5
  - Python 3.10 or 3.11 (required for faiss-cpu compatibility)
6
  - uv (fast Python package manager) - [Install uv](https://github.com/astral-sh/uv)
7
  - Node.js 16+ and npm
8
  - OpenAI API key
 
 
 
 
9
 
10
- ## 5-Minute Setup
11
 
12
  ### 1. Install uv (if not already installed)
13
 
@@ -24,7 +30,7 @@ powershell -ExecutionPolicy ByPass -c "irm https://astral.sh/uv/install.ps1 | ie
24
  ### 2. Install Node.js (REQUIRED - if not already installed)
25
 
26
  **Check if Node.js is installed:**
27
- ```powershell
28
  node --version
29
  npm --version
30
  ```
@@ -41,19 +47,17 @@ npm --version
41
  - Complete the installation
42
 
43
  3. **CRITICAL: Restart Your Terminal**:
44
- - **Close PowerShell completely**
45
- - **Open a new PowerShell window**
46
  - This is required for PATH changes to take effect
47
 
48
  4. **Verify Installation**:
49
- ```powershell
50
  node --version
51
  npm --version
52
  ```
53
  Both should show version numbers.
54
 
55
- **For detailed Windows installation instructions, see: [INSTALL_NODEJS_WINDOWS.md](INSTALL_NODEJS_WINDOWS.md)**
56
-
57
  ### 3. Install Dependencies
58
 
59
  **Backend (using uv):**
@@ -75,11 +79,9 @@ Create `.env` in the project root:
75
  OPENAI_API_KEY=sk-your-actual-api-key-here
76
  ```
77
 
78
- ### 5. Add Documents / Processed Data
79
 
80
- - **Local development:** copy your PDF/TXT/DOC/DOCX files into the `documents/` folder before running `uv run python backend/main.py`.
81
- - **Deploying to Hugging Face Spaces:** large PDFs should be uploaded via the Space UI (Files & versions → Upload). Git pushes can’t include big binaries.
82
- - If you have a pre-generated `processed_documents.json`, keep it in the project root (it’s copied by the Dockerfile). The backend logs will print whether this file and the `documents/` folder exist at startup.
83
 
84
  ### 6. Run the Application
85
 
@@ -92,38 +94,184 @@ uv run python backend/main.py
92
  # macOS/Linux: source .venv/bin/activate && python backend/main.py
93
  # Windows: .venv\Scripts\activate && python backend\main.py
94
  ```
 
95
 
96
  **Terminal 2 - Frontend:**
97
  ```bash
98
  cd frontend
99
  npm start
100
  ```
 
101
 
102
  ### 7. Use the Application
103
 
104
  1. Open http://localhost:3000 in your browser
105
- 2. Click "Index Documents" to index files in the `documents/` folder
106
  3. Ask questions about your documents!
107
 
108
- ## Example Questions
109
 
110
  - "What are the key provisions in the contract?"
111
  - "What does the law say about [topic]?"
112
  - "Summarize the main points of the document"
113
114
  ## Troubleshooting
115
 
 
 
116
  **"OpenAI API key is required"**
117
  - Make sure you created `.env` in the project root with your API key
118
 
119
  **"No documents found"**
120
  - Check that files are in the `documents/` folder
121
  - Supported formats: PDF, TXT, DOCX, DOC
122
- - On Hugging Face Spaces, make sure you uploaded the PDFs (or a `processed_documents.json`) via the **Files and versions** tab. Watch the build/startup logs for messages such as `[RAG Init] processed_documents.json exists? True`.
123
-
124
- **"RAG system not initialized" (on Spaces)**
125
- - Ensure `processed_documents.json` is present in the repo **and** not excluded by `.dockerignore`.
126
- - Upload your source PDFs (or processed data) in the Space UI, then restart the Space so the startup hook can detect them.
127
 
128
  **Frontend can't connect to backend**
129
  - Ensure backend is running on port 8000
@@ -135,8 +283,42 @@ npm start
135
  - Restart your terminal after installation
136
  - Verify installation: `node --version` and `npm --version`
137
 
138
- ## Next Steps
139
 
140
- - See [README.md](README.md) for full documentation
141
- - See [README_HF_SPACES.md](README_HF_SPACES.md) for deployment instructions
142
1
  # Quick Start Guide
2
 
3
+ Complete guide for local development and deployment to Hugging Face Spaces.
4
+
5
  ## Prerequisites
6
 
7
  - Python 3.10 or 3.11 (required for faiss-cpu compatibility)
8
  - uv (fast Python package manager) - [Install uv](https://github.com/astral-sh/uv)
9
  - Node.js 16+ and npm
10
  - OpenAI API key
11
+ - Git installed (for deployment)
12
+ - Hugging Face account (for deployment) - [Sign up](https://huggingface.co)
13
+
14
+ ---
15
 
16
+ ## Part 1: Local Development
17
 
18
  ### 1. Install uv (if not already installed)
19
 
 
30
  ### 2. Install Node.js (REQUIRED - if not already installed)
31
 
32
  **Check if Node.js is installed:**
33
+ ```bash
34
  node --version
35
  npm --version
36
  ```
 
47
  - Complete the installation
48
 
49
  3. **CRITICAL: Restart Your Terminal**:
50
+ - **Close your terminal completely**
51
+ - **Open a new terminal window**
52
  - This is required for PATH changes to take effect
53
 
54
  4. **Verify Installation**:
55
+ ```bash
56
  node --version
57
  npm --version
58
  ```
59
  Both should show version numbers.
60
 
 
 
61
  ### 3. Install Dependencies
62
 
63
  **Backend (using uv):**
 
79
  OPENAI_API_KEY=sk-your-actual-api-key-here
80
  ```
81
 
82
+ ### 5. Add Documents
83
 
84
+ Copy your PDF/TXT/DOC/DOCX files into the `documents/` folder. The application will automatically process them when you start the backend.
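To confirm which files will be picked up before starting the backend, a quick check along these lines can help (a rough sketch; the actual loading logic lives in `backend/document_processor.py`):

```python
from pathlib import Path

# Rough sketch: list the files the backend should pick up from documents/.
SUPPORTED = {".pdf", ".txt", ".docx", ".doc"}

docs_dir = Path("documents")
found = [p for p in sorted(docs_dir.glob("*")) if p.suffix.lower() in SUPPORTED]

if not found:
    print("No supported documents found - add PDF/TXT/DOC/DOCX files to documents/ first.")
for p in found:
    print("Will be processed:", p.name)
```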
 
 
85
 
86
  ### 6. Run the Application
87
 
 
94
  # macOS/Linux: source .venv/bin/activate && python backend/main.py
95
  # Windows: .venv\Scripts\activate && python backend\main.py
96
  ```
97
+ The API will run on `http://localhost:8000`
98
 
99
  **Terminal 2 - Frontend:**
100
  ```bash
101
  cd frontend
102
  npm start
103
  ```
104
+ The app will open at `http://localhost:3000`
105
 
106
  ### 7. Use the Application
107
 
108
  1. Open http://localhost:3000 in your browser
109
+ 2. The system will automatically detect and process documents from the `documents/` folder
110
  3. Ask questions about your documents!
111
 
112
+ ### Example Questions
113
 
114
  - "What are the key provisions in the contract?"
115
  - "What does the law say about [topic]?"
116
  - "Summarize the main points of the document"
117
 
118
+ ---
119
+
120
+ ## Part 2: Deployment to Hugging Face Spaces
121
+
122
+ ### 1. Create a New Space
123
+
124
+ 1. Go to https://huggingface.co/spaces
125
+ 2. Click "Create new Space"
126
+ 3. Fill in the details:
127
+ - **Space name**: `saudi-law-ai-assistant` (or your preferred name)
128
+ - **SDK**: Select **Docker**
129
+ - **Visibility**: Public or Private
130
+ 4. Click "Create Space"
131
+
132
+ ### 2. Prepare Your Code
133
+
134
+ 1. **Build the React frontend**:
135
+ ```bash
136
+ cd frontend
137
+ npm install
138
+ npm run build
139
+ cd ..
140
+ ```
141
+
142
+ 2. **Ensure all files are ready** (a quick pre-flight check sketch follows this list):
143
+ - `app.py` - Main entry point
144
+ - `pyproject.toml` and `uv.lock` - Python dependencies
145
+ - `Dockerfile` - Docker configuration
146
+ - `backend/` - Backend code
147
+ - `frontend/build/` - Built React app (always run `npm run build` before pushing)
148
+ - `processed_documents.json` - Optional bundled data so the Space can answer immediately (make sure it is **not** ignored in `.dockerignore`)
149
+ - `vectorstore/` - Optional pre-built vectorstore folder (if it exists in your repo, it will be included in the Docker image)
150
+ - `documents/` — PDF sources that power preview/download. Because Hugging Face blocks large binaries in standard git pushes, you have two options:
151
+ - Use [HF Xet storage](https://huggingface.co/docs/hub/xet/using-xet-storage#git) for the `documents/` folder so it can live in the repo.
152
+ - Or keep the folder locally, and after every push upload the PDFs through the Space UI (**Files and versions → Upload files**) into `documents/`.
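As referenced above, a small pre-flight script can confirm this checklist before you push (a convenience sketch, not part of the project):

```python
from pathlib import Path

# Pre-flight sketch: verify the files listed above exist before pushing.
required = ["app.py", "pyproject.toml", "uv.lock", "Dockerfile", "frontend/build/index.html"]
optional = ["processed_documents.json", "vectorstore", "documents"]

for path in required:
    print(f"{path}: {'OK' if Path(path).exists() else 'MISSING (required)'}")

for path in optional:
    print(f"{path}: {'present' if Path(path).exists() else 'not present (optional)'}")
```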
153
+
154
+ ### 3. Set Up Environment Variables
155
+
156
+ 1. In your Hugging Face Space, go to **Settings**
157
+ 2. Scroll to **Repository secrets**
158
+ 3. Add secrets:
159
+ - **Name**: `OPENAI_API_KEY`
160
+ - **Value**: Your OpenAI API key
161
+ - (Optional) **Name**: `HF_TOKEN` (if you need to upload files programmatically)
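Space secrets are exposed to the running container as environment variables, so the backend reads them the same way it reads a local `.env`. An illustrative check:

```python
import os

# Illustrative only: secrets added in the Space settings surface as
# environment variables inside the running container.
api_key = os.getenv("OPENAI_API_KEY")
hf_token = os.getenv("HF_TOKEN")  # optional, only needed for programmatic uploads

print("OPENAI_API_KEY set?", bool(api_key))
print("HF_TOKEN set?", bool(hf_token))
```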
162
+
163
+ ### 4. Set Up Xet Storage (Recommended for PDFs)
164
+
165
+ If you want to store PDFs in the repository:
166
+
167
+ 1. **Enable Xet storage** on your Space:
168
+ - Go to Space Settings → Large file storage
169
+ - Enable "Hugging Face Xet" (or request access at https://huggingface.co/join/xet)
170
+
171
+ 2. **Install git-xet locally**:
172
+ ```bash
173
+ # macOS/Linux
174
+ curl --proto '=https' --tlsv1.2 -sSf https://raw.githubusercontent.com/huggingface/xet-core/refs/heads/main/git_xet/install.sh | sh
175
+
176
+ # Or via Homebrew
177
+ brew tap huggingface/tap
178
+ brew install git-xet
179
+ git xet install
180
+ ```
181
+
182
+ 3. **Configure git to use Xet**:
183
+ ```bash
184
+ git lfs install
185
+ git lfs track "documents/*.pdf"
186
+ git add .gitattributes documents/*.pdf
187
+ git commit -m "Track PDFs with Xet"
188
+ ```
189
+
190
+ ### 5. Push to Hugging Face
191
+
192
+ 1. **Initialize git** (if not already done):
193
+ ```bash
194
+ git init
195
+ ```
196
+
197
+ 2. **Add Hugging Face remote**:
198
+ ```bash
199
+ git remote add hf https://huggingface.co/spaces/YOUR_USERNAME/YOUR_SPACE_NAME
200
+ ```
201
+ Replace `YOUR_USERNAME` and `YOUR_SPACE_NAME` with your actual values.
202
+
203
+ 3. **Add and commit files**:
204
+ ```bash
205
+ git add .
206
+ git commit -m "Initial deployment"
207
+ ```
208
+
209
+ 4. **Push to Hugging Face**:
210
+ ```bash
211
+ git push hf main
212
+ ```
213
+
214
+ ### 6. Wait for Build
215
+
216
+ - Hugging Face will automatically build your Docker image
217
+ - This may take 5-10 minutes
218
+ - You can monitor the build logs in the Space's "Logs" tab
219
+
220
+ ### 7. Access Your Application
221
+
222
+ Once the build completes, your application will be available at:
223
+ ```
224
+ https://YOUR_USERNAME-YOUR_SPACE_NAME.hf.space
225
+ ```
226
+
227
+ ### 8. Upload Documents / Processed Data (if not using Xet)
228
+
229
+ If you didn't use Xet storage for PDFs:
230
+
231
+ - After the Space builds, open the **Files and versions** tab and click **Upload files** to add your `documents/*.pdf`
232
+ - If you have a prebuilt `processed_documents.json`, upload it as well so the backend can build the vectorstore immediately
233
+ - The startup logs will print whether `processed_documents.json` and `documents/` were detected inside the container
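The presence check behind those log lines amounts to something like this (a sketch; the exact wording of the logs may differ):

```python
from pathlib import Path

# Sketch of the startup presence check; the backend logs lines like
# "[RAG Init] processed_documents.json exists? True" at startup.
processed = Path("processed_documents.json")
documents = Path("documents")

print(f"[RAG Init] processed_documents.json exists? {processed.exists()}")
pdf_count = len(list(documents.glob("*.pdf"))) if documents.exists() else 0
print(f"[RAG Init] documents/ exists? {documents.exists()} ({pdf_count} PDFs)")
```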
234
+
235
+ ### 9. Redeploy Checklist
236
+
237
+ When updating your Space:
238
+
239
+ 1. `cd frontend && npm install && npm run build && cd ..`
240
+ 2. `git add .`
241
+ 3. `git commit -m "Update application"`
242
+ 4. `git push hf main` (or `git push hf main --force` if needed)
243
+ 5. Watch the Space build logs and confirm the new startup logs show the presence of `processed_documents.json`/`documents`
244
+
245
+ ---
246
+
247
+ ## Important Notes
248
+
249
+ 1. **API Endpoints**: The frontend is configured to use the `/api` prefix for backend calls. This is handled by `app.py` (see the sketch after this list).
250
+
251
+ 2. **Documents Folder**: The `documents/` folder is automatically created if it doesn't exist. To bundle PDFs, either:
252
+ - Enable HF Xet storage for `documents/` (recommended)
253
+ - Or upload the files via the Space UI after each push
254
+
255
+ 3. **Processed Data**: `processed_documents.json` can be bundled with the repo. The backend tries to bootstrap from this file at startup, so make sure it reflects the content you expect the Space to serve.
256
+
257
+ 4. **Vectorstore**: The `vectorstore/` folder is included in the Docker image if it exists in your repo. If it doesn't exist, it will be created at runtime from `processed_documents.json`.
258
+
259
+ 5. **Port**: Hugging Face Spaces uses port 7860 by default, which is configured in `app.py`.
260
+
261
+ 6. **Dependencies**: This project uses `uv` for Python package management. Dependencies are defined in `pyproject.toml` and `uv.lock`.
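For orientation, a Docker Space entry point along the lines of `app.py` typically mounts the backend under `/api`, serves the built frontend, and listens on port 7860. The sketch below is illustrative only; the import path and the details of the real `app.py` may differ:

```python
import uvicorn
from fastapi import FastAPI
from fastapi.staticfiles import StaticFiles

from backend.main import app as api_app  # assumed import path; the real app.py may differ

app = FastAPI()
app.mount("/api", api_app)  # backend reachable under the /api prefix
app.mount("/", StaticFiles(directory="frontend/build", html=True), name="frontend")

if __name__ == "__main__":
    # Hugging Face Spaces expects the app to listen on port 7860
    uvicorn.run(app, host="0.0.0.0", port=7860)
```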
262
+
263
+ ---
264
+
265
  ## Troubleshooting
266
 
267
+ ### Local Development
268
+
269
  **"OpenAI API key is required"**
270
  - Make sure you created `.env` in the project root with your API key
271
 
272
  **"No documents found"**
273
  - Check that files are in the `documents/` folder
274
  - Supported formats: PDF, TXT, DOCX, DOC
 
 
 
 
 
275
 
276
  **Frontend can't connect to backend**
277
  - Ensure backend is running on port 8000
 
283
  - Restart your terminal after installation
284
  - Verify installation: `node --version` and `npm --version`
285
 
286
+ ### Hugging Face Spaces Deployment
287
+
288
+ **Build Fails**
289
+ - Check the build logs in the Space's "Logs" tab
290
+ - Ensure all dependencies are in `pyproject.toml`
291
+ - Verify the Dockerfile is correct
292
+ - Make sure `frontend/build/` exists (run `npm run build`)
293
+
294
+ **"RAG system not initialized" (on Spaces)**
295
+ - Ensure `processed_documents.json` is present in the repo **and** not excluded by `.dockerignore`
296
+ - Upload your source PDFs (or processed data) in the Space UI, then restart the Space
297
+ - Check startup logs for initialization messages
298
+
299
+ **API Errors**
300
+ - Check that `OPENAI_API_KEY` is set correctly in Space secrets
301
+ - Verify the API key is valid and has credits
302
+ - Check the Space logs for detailed error messages
303
+
304
+ **Frontend Not Loading**
305
+ - Ensure `npm run build` was run successfully before pushing
306
+ - Check that `frontend/build/` directory exists and contains `index.html`
307
+ - Verify the build completed without errors
308
+
309
+ **Document Preview Not Working**
310
+ - Ensure PDFs are uploaded to the `documents/` folder in the Space
311
+ - Check that filenames match exactly (including encoding)
312
+ - Verify documents are accessible via the Space's file browser
313
 
314
+ **Push Rejected - Binary Files**
315
+ - Enable Xet storage for your Space (see Step 4 above)
316
+ - Or exclude PDFs from git and upload via Space UI
317
+
318
+ ---
319
+
320
+ ## Next Steps
321
 
322
+ - See [README.md](README.md) for full documentation and API details
323
+ - Check the Space logs for detailed startup and error information
324
+ - Monitor your OpenAI API usage to avoid unexpected charges
README.md CHANGED
@@ -35,141 +35,54 @@ A web application that allows users to ask questions about indexed legal documen
35
  ## Project Structure
36
 
37
  ```
38
- law_project1/
39
  ├── backend/
40
  │ ├── main.py # FastAPI application
41
  │ ├── rag_system.py # RAG implementation
42
- │ ├── requirements.txt # Python dependencies
43
- └── .env.example # Environment variables template
 
44
  ├── frontend/
45
  │ ├── src/
46
  │ │ ├── App.js # Main React component
47
  │ │ ├── App.css # Styles
48
  │ │ ├── index.js # React entry point
49
  │ │ └── index.css # Global styles
 
50
  │ ├── public/
51
  │ │ └── index.html # HTML template
52
  │ └── package.json # Node dependencies
53
- ├── documents/ # Place your documents here
 
54
  ├── app.py # Hugging Face Spaces entry point
55
  ├── Dockerfile # Docker configuration
56
- ├── requirements.txt # Main Python dependencies
57
- └── README.md # This file
 
 
 
58
  ```
59
 
60
- ## Setup Instructions
61
-
62
- ### Local Development
63
-
64
- 1. **Navigate to the project**:
65
- ```bash
66
- cd law_project1
67
- ```
68
-
69
- 2. **Set up the backend with uv**:
70
- ```bash
71
- # Install uv if you haven't already
72
- # On macOS/Linux: curl -LsSf https://astral.sh/uv/install.sh | sh
73
- # On Windows: powershell -ExecutionPolicy ByPass -c "irm https://astral.sh/uv/install.ps1 | iex"
74
-
75
- # Install dependencies
76
- uv sync
77
- ```
78
-
79
- 3. **Set up environment variables**:
80
- ```bash
81
- # Create .env file in project root
82
- echo "OPENAI_API_KEY=your_openai_api_key_here" > .env
83
- ```
84
- Or manually create `.env` file in the project root with:
85
- ```
86
- OPENAI_API_KEY=your_openai_api_key_here
87
- ```
88
-
89
- 4. **Install Node.js** (REQUIRED - if not already installed):
90
- - **Windows**: Download from https://nodejs.org/ (click the green "LTS" button)
91
- - Run the installer (make sure "Add to PATH" is checked)
92
- - **CRITICAL**: Close and restart your terminal/PowerShell after installation
93
- - Verify: `node --version` and `npm --version`
94
- - **For detailed Windows instructions, see: [INSTALL_NODEJS_WINDOWS.md](INSTALL_NODEJS_WINDOWS.md)**
95
-
96
- 5. **Set up the frontend**:
97
- ```bash
98
- cd frontend
99
- npm install
100
- cd ..
101
- ```
102
-
103
- 6. **Add documents**:
104
- - Create a `documents` folder in the project root (if it doesn't exist)
105
- - Add your PDF, TXT, DOCX, or DOC files
106
-
107
- 7. **Run the backend**:
108
- ```bash
109
- # Using uv run (recommended)
110
- uv run python backend/main.py
111
-
112
- # Or activate the virtual environment first
113
- # On macOS/Linux:
114
- source .venv/bin/activate
115
- python backend/main.py
116
-
117
- # On Windows:
118
- # .venv\Scripts\activate
119
- # python backend\main.py
120
- ```
121
- The API will run on `http://localhost:8000`
122
-
123
- 8. **Run the frontend** (in a new terminal):
124
- ```bash
125
- cd frontend
126
- npm start
127
- ```
128
- The app will open at `http://localhost:3000`
129
-
130
- ### Usage
131
-
132
- 1. **Index Documents**:
133
- - Click the "Index Documents" button in the UI, or
134
- - Make a POST request to `http://localhost:8000/index` with:
135
- ```json
136
- {
137
- "folder_path": "documents"
138
- }
139
- ```
140
-
141
- 2. **Ask Questions**:
142
- - Type your question in the chat input
143
- - The system will retrieve relevant context and return exact text from the documents
144
-
145
- ## Hugging Face Spaces Deployment
146
-
147
- See [README_HF_SPACES.md](README_HF_SPACES.md) for detailed deployment instructions.
148
-
149
- ### Quick Deployment Steps
150
-
151
- 1. **Build the frontend**:
152
- ```bash
153
- cd frontend
154
- npm install
155
- npm run build
156
- cd ..
157
- ```
158
-
159
- 2. **Create a Hugging Face Space** (Docker SDK)
160
-
161
- 3. **Set environment variable**:
162
- - In Space Settings → Repository secrets
163
- - Add `OPENAI_API_KEY` with your API key
164
-
165
- 4. **Push to Hugging Face**:
166
- ```bash
167
- git init
168
- git add .
169
- git commit -m "Initial commit"
170
- git remote add origin https://huggingface.co/spaces/YOUR_USERNAME/YOUR_SPACE_NAME
171
- git push -u origin main
172
- ```
173
 
174
  ## API Endpoints
175
 
@@ -197,28 +110,23 @@ See [README_HF_SPACES.md](README_HF_SPACES.md) for detailed deployment instructi
197
  - The system extracts exact text from documents, not generated responses
198
  - Supported document formats: PDF, TXT, DOCX, DOC
199
  - The vectorstore is saved locally and persists between sessions
200
- - Make sure to index documents before asking questions
201
  - For Hugging Face Spaces, the frontend automatically uses `/api` as the API URL
202
- - This project uses `uv` for Python package management - dependencies are defined in `pyproject.toml`
203
  - The `.env` file should be in the project root (not in the backend folder)
 
204
 
205
  ## Troubleshooting
206
 
207
- ### Backend Issues
208
-
209
- - **OpenAI API Key Error**: Make sure `OPENAI_API_KEY` is set in your environment or `.env` file
210
- - **No documents found**: Ensure documents are in the `documents/` folder with supported extensions
211
-
212
- ### Frontend Issues
213
-
214
- - **API Connection Error**: Check that the backend is running on port 8000
215
- - **CORS Errors**: The backend has CORS enabled for all origins in development
216
 
217
- ### Deployment Issues
218
 
219
- - **Build fails**: Ensure all dependencies are in `requirements.txt`
220
- - **Frontend not loading**: Make sure `npm run build` was run successfully
221
- - **API not working**: Verify `OPENAI_API_KEY` is set in Hugging Face Space secrets
 
 
222
 
223
  ## License
224
 
 
35
  ## Project Structure
36
 
37
  ```
38
+ KSAlaw-document-agent/
39
  ├── backend/
40
  │ ├── main.py # FastAPI application
41
  │ ├── rag_system.py # RAG implementation
42
+ │ ├── document_processor.py # Document processing logic
43
+ │ ├── embeddings.py # Embeddings wrappers (OpenAI / Hugging Face)
44
+ │ └── chat_history.py # Chat history management
45
  ├── frontend/
46
  │ ├── src/
47
  │ │ ├── App.js # Main React component
48
  │ │ ├── App.css # Styles
49
  │ │ ├── index.js # React entry point
50
  │ │ └── index.css # Global styles
51
+ │ ├── build/ # Built React app (for deployment)
52
  │ ├── public/
53
  │ │ └── index.html # HTML template
54
  │ └── package.json # Node dependencies
55
+ ├── documents/ # Place your PDF documents here
56
+ ├── vectorstore/ # FAISS vectorstore (auto-generated)
57
  ├── app.py # Hugging Face Spaces entry point
58
  ├── Dockerfile # Docker configuration
59
+ ├── pyproject.toml # Python dependencies (uv)
60
+ ├── uv.lock # Locked dependencies
61
+ ├── processed_documents.json # Processed document summaries
62
+ ├── QUICKSTART.md # Complete setup and deployment guide
63
+ └── README.md # This file
64
  ```
65
 
66
+ ## Quick Start
67
+
68
+ For complete setup and deployment instructions, see **[QUICKSTART.md](QUICKSTART.md)**.
69
+
70
+ ### Quick Overview
71
+
72
+ **Local Development:**
73
+ 1. Install dependencies: `uv sync` and `cd frontend && npm install`
74
+ 2. Create `.env` with your `OPENAI_API_KEY`
75
+ 3. Add documents to `documents/` folder
76
+ 4. Run backend: `uv run python backend/main.py`
77
+ 5. Run frontend: `cd frontend && npm start`
78
+
79
+ **Deployment to Hugging Face Spaces:**
80
+ 1. Build frontend: `cd frontend && npm run build`
81
+ 2. Set up Xet storage (recommended) or prepare to upload PDFs via UI
82
+ 3. Push to Hugging Face: `git push hf main`
83
+ 4. Set `OPENAI_API_KEY` in Space secrets
84
+
85
+ See [QUICKSTART.md](QUICKSTART.md) for detailed step-by-step instructions for both local development and deployment.
86
 
87
  ## API Endpoints
88
 
 
110
  - The system extracts exact text from documents, not generated responses
111
  - Supported document formats: PDF, TXT, DOCX, DOC
112
  - The vectorstore is saved locally and persists between sessions
113
+ - Documents are automatically processed on startup (no manual indexing needed)
114
  - For Hugging Face Spaces, the frontend automatically uses `/api` as the API URL
115
+ - This project uses `uv` for Python package management - dependencies are defined in `pyproject.toml` and `uv.lock`
116
  - The `.env` file should be in the project root (not in the backend folder)
117
+ - PDFs can be stored using Hugging Face Xet storage or uploaded via the Space UI
118
 
119
  ## Troubleshooting
120
 
121
+ For detailed troubleshooting, see the [Troubleshooting section in QUICKSTART.md](QUICKSTART.md#troubleshooting).
122
 
123
+ ### Common Issues
124
 
125
+ - **OpenAI API Key Error**: Make sure `OPENAI_API_KEY` is set in your `.env` file (local) or Space secrets (deployment)
126
+ - **No documents found**: Ensure documents are in the `documents/` folder with supported extensions (PDF, TXT, DOCX, DOC)
127
+ - **Frontend can't connect**: Check that the backend is running on port 8000
128
+ - **Build fails on Spaces**: Ensure `frontend/build/` exists (run `npm run build`), check Dockerfile, verify dependencies in `pyproject.toml`
129
+ - **RAG system not initialized**: Check Space logs, ensure `processed_documents.json` exists and is not ignored by `.dockerignore`
130
 
131
  ## License
132
 
README_HF_SPACES.md DELETED
@@ -1,148 +0,0 @@
1
- # Deploying to Hugging Face Spaces
2
-
3
- This guide will help you deploy the Law Document RAG application to Hugging Face Spaces.
4
-
5
- ## Prerequisites
6
-
7
- 1. A Hugging Face account (sign up at https://huggingface.co)
8
- 2. An OpenAI API key
9
- 3. Git installed on your machine
10
-
11
- ## Step-by-Step Deployment
12
-
13
- ### 1. Create a New Space
14
-
15
- 1. Go to https://huggingface.co/spaces
16
- 2. Click "Create new Space"
17
- 3. Fill in the details:
18
- - **Space name**: `law-document-rag` (or your preferred name)
19
- - **SDK**: Select **Docker**
20
- - **Visibility**: Public or Private
21
- 4. Click "Create Space"
22
-
23
- ### 2. Prepare Your Code
24
-
25
- 1. **Build the React frontend**:
26
- ```bash
27
- cd frontend
28
- npm install
29
- npm run build
30
- cd ..
31
- ```
32
-
33
- 2. **Ensure all files are ready**:
34
- - `app.py` - Main entry point
35
- - `requirements.txt` - Python dependencies
36
- - `Dockerfile` - Docker configuration
37
- - `backend/` - Backend code
38
- - `frontend/build/` - Built React app (always run `npm run build` before pushing)
39
- - `processed_documents.json` - Optional bundled data so the Space can answer immediately (make sure it is **not** ignored in `.dockerignore`; the backend now initializes at import time and expects this file if no PDFs are present)
40
- - `vectorstore/` - Optional pre-built vectorstore folder (if it exists in your repo, it will be included in the Docker image; otherwise it will be created at runtime from `processed_documents.json`. To ensure the folder exists even if empty, create it with: `mkdir -p vectorstore && touch vectorstore/.gitkeep`)
41
- - `documents/` — PDF sources that power preview/download. Because Hugging Face blocks large binaries in standard git pushes, you have two options:
42
- - Use [HF Xet storage](https://huggingface.co/docs/hub/xet/using-xet-storage#git) for the `documents/` folder so it can live in the repo.
43
- - Or keep the folder locally, and after every push upload the PDFs through the Space UI (**Files and versions → Upload files**) into `documents/`.
44
- The Dockerfile now copies `documents/` into the image when present, and still creates the folder if it’s empty.
45
-
46
- ### 3. Set Up Environment Variables
47
-
48
- 1. In your Hugging Face Space, go to **Settings**
49
- 2. Scroll to **Repository secrets**
50
- 3. Add a new secret:
51
- - **Name**: `OPENAI_API_KEY`
52
- - **Value**: Your OpenAI API key
53
-
54
- ### 4. Push to Hugging Face
55
-
56
- 1. **Initialize git** (if not already done):
57
- ```bash
58
- git init
59
- ```
60
-
61
- 2. **Add Hugging Face remote**:
62
- ```bash
63
- git remote add origin https://huggingface.co/spaces/YOUR_USERNAME/YOUR_SPACE_NAME
64
- ```
65
- Replace `YOUR_USERNAME` and `YOUR_SPACE_NAME` with your actual values.
66
-
67
- 3. **Add and commit files**:
68
- ```bash
69
- git add .
70
- git commit -m "Initial deployment"
71
- ```
72
-
73
- 4. **Push to Hugging Face**:
74
- ```bash
75
- git push origin main
76
- ```
77
-
78
- ### 5. Wait for Build
79
-
80
- - Hugging Face will automatically build your Docker image
81
- - This may take 5-10 minutes
82
- - You can monitor the build logs in the Space's "Logs" tab
83
-
84
- ### 6. Access Your Application
85
-
86
- Once the build completes, your application will be available at:
87
- ```
88
- https://YOUR_USERNAME-YOUR_SPACE_NAME.hf.space
89
- ```
90
-
91
- ### 7. Upload Documents / Processed Data
92
-
93
- - Hugging Face blocks large binary files in git pushes. After the Space builds, open the **Files and versions** tab and click **Upload files** to add your `documents/*.pdf`. They will be available under `/data/Spaces/<space-name>/`.
94
- - If you have a prebuilt `processed_documents.json`, upload it as well so the backend can build the vectorstore immediately. The startup logs now print whether `processed_documents.json` and `documents/` were detected inside the container.
95
-
96
- ### 8. Redeploy Checklist
97
-
98
- 1. `cd frontend && npm install && npm run build && cd ..`
99
- 2. `git add .`
100
- 3. `git commit -m "Prepare deployment"`
101
- 4. `git push hf main --force` (authenticate with your HF access token)
102
- 5. Watch the Space build logs and confirm the new startup logs show the presence of `processed_documents.json`/`documents`.
103
-
104
- ## Important Notes
105
-
106
- 1. **API Endpoints**: The frontend is configured to use `/api` prefix for backend calls. This is handled by the `app.py` file.
107
- 2. **Documents Folder**: The `documents/` folder is automatically created if it doesn't exist. To bundle PDFs, either enable HF Xet storage for `documents/` or upload the files via the Space UI after each push (standard git pushes reject large binaries).
108
- 3. **Processed Data**: `processed_documents.json` can be bundled with the repo. Because the backend now tries to bootstrap from this file at import/startup, make sure it reflects the same content you expect the Space to serve (and keep it under version control if you rely on it).
109
- 4. **Vectorstore**: The `vectorstore/` folder is now included in the Docker image if it exists in your repo. If you have a pre-built vectorstore, include it in your repository and it will be copied to the Docker image. If the vectorstore folder doesn't exist in your repo, ensure an empty folder exists (create with `mkdir -p vectorstore && touch vectorstore/.gitkeep`) or the Docker build may fail. The vectorstore will be created at runtime from `processed_documents.json` if not pre-built.
110
- 5. **Port**: Hugging Face Spaces uses port 7860 by default, which is configured in `app.py`.
111
-
112
- ## Troubleshooting
113
-
114
- ### Build Fails
115
-
116
- - Check the build logs in the Space's "Logs" tab
117
- - Ensure all dependencies are in `requirements.txt`
118
- - Verify the Dockerfile is correct
119
-
120
- ### API Errors
121
-
122
- - Check that `OPENAI_API_KEY` is set correctly in Space secrets
123
- - Verify the API key is valid and has credits
124
-
125
- ### Frontend Not Loading
126
-
127
- - Ensure `npm run build` was run successfully
128
- - Check that `frontend/build/` directory exists and contains `index.html`
129
-
130
- ## Updating Your Space
131
-
132
- To update your deployed application:
133
-
134
- ```bash
135
- git add .
136
- git commit -m "Update description"
137
- git push origin main
138
- ```
139
-
140
- Hugging Face will automatically rebuild and redeploy.
141
-
142
-
143
-
144
-
145
-
146
-
147
-
148
-
backend/chat_history.py CHANGED
@@ -11,13 +11,30 @@ class ChatHistory:
11
  self.max_history = max_history
12
  self.history: List[Dict[str, str]] = []
13
 
14
- def add_message(self, role: str, content: str):
15
- """Add a message to chat history"""
16
- self.history.append({
 
 
 
 
 
 
 
17
  "role": role,
18
  "content": content,
19
  "timestamp": datetime.now().isoformat()
20
- })
21
 
22
  # Keep only last N messages
23
  if len(self.history) > self.max_history * 2: # *2 because we have user + assistant pairs
@@ -45,6 +62,59 @@ class ChatHistory:
45
  # Format for OpenAI API (remove timestamp)
46
  return [{"role": msg["role"], "content": msg["content"]} for msg in last_two]
47
48
  def clear(self):
49
  """Clear chat history"""
50
  self.history = []
 
11
  self.max_history = max_history
12
  self.history: List[Dict[str, str]] = []
13
 
14
+ def add_message(self, role: str, content: str, source_document: Optional[str] = None, chunks: Optional[List[str]] = None):
15
+ """Add a message to chat history
16
+
17
+ Args:
18
+ role: Message role ("user" or "assistant")
19
+ content: Message content
20
+ source_document: Optional document filename used for assistant messages
21
+ chunks: Optional list of chunk texts used for assistant messages (in chunk mode)
22
+ """
23
+ message = {
24
  "role": role,
25
  "content": content,
26
  "timestamp": datetime.now().isoformat()
27
+ }
28
+
29
+ # Store document source for assistant messages
30
+ if role == "assistant" and source_document:
31
+ message["source_document"] = source_document
32
+
33
+ # Store chunks for assistant messages
34
+ if role == "assistant" and chunks:
35
+ message["chunks"] = chunks
36
+
37
+ self.history.append(message)
38
 
39
  # Keep only last N messages
40
  if len(self.history) > self.max_history * 2: # *2 because we have user + assistant pairs
 
62
  # Format for OpenAI API (remove timestamp)
63
  return [{"role": msg["role"], "content": msg["content"]} for msg in last_two]
64
 
65
+ def get_last_document(self) -> Optional[str]:
66
+ """Get the document filename used in the last assistant response
67
+
68
+ Returns:
69
+ Document filename if last message was assistant with a document, None otherwise
70
+ """
71
+ if not self.history:
72
+ return None
73
+
74
+ # Check last message
75
+ last_msg = self.history[-1]
76
+ if last_msg.get("role") == "assistant":
77
+ return last_msg.get("source_document")
78
+
79
+ return None
80
+
81
+ def get_last_turn_with_document(self) -> Optional[List[Dict[str, str]]]:
82
+ """Get the last chat turn that used a document (skipping general questions)
83
+
84
+ Returns:
85
+ List of messages from the last turn with a document, or None if no such turn exists
86
+ """
87
+ # Search backwards through history to find last assistant message with a document
88
+ for i in range(len(self.history) - 1, 0, -1):
89
+ msg = self.history[i]
90
+ if msg.get("role") == "assistant" and msg.get("source_document"):
91
+ # Found an assistant message with a document
92
+ # Get the turn (user + assistant pair)
93
+ if i >= 1 and self.history[i-1].get("role") == "user":
94
+ # Format for OpenAI API (remove timestamp, keep source_document in metadata)
95
+ return [
96
+ {"role": self.history[i-1]["role"], "content": self.history[i-1]["content"]},
97
+ {"role": msg["role"], "content": msg["content"]}
98
+ ]
99
+
100
+ return None
101
+
102
+ def get_last_chunks(self) -> Optional[List[str]]:
103
+ """Get the chunks used in the last assistant response
104
+
105
+ Returns:
106
+ List of chunk texts if last message was assistant with chunks, None otherwise
107
+ """
108
+ if not self.history:
109
+ return None
110
+
111
+ # Check last message
112
+ last_msg = self.history[-1]
113
+ if last_msg.get("role") == "assistant":
114
+ return last_msg.get("chunks")
115
+
116
+ return None
117
+
118
  def clear(self):
119
  """Clear chat history"""
120
  self.history = []
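Taken together, the new metadata lets follow-up questions be routed back to the document and chunks used in the previous turn. A usage sketch (import path and filenames are assumptions for illustration):

```python
# Usage sketch for the new ChatHistory metadata (import path assumed).
from backend.chat_history import ChatHistory

history = ChatHistory(max_history=10)
history.add_message("user", "What does the law say about custody?")
history.add_message(
    "assistant",
    "...answer text...",
    source_document="personal_status_law.pdf",  # hypothetical filename
    chunks=["chunk 1 text", "chunk 2 text"],
)

# Follow-up questions can be answered from the same document/chunks.
print(history.get_last_document())            # -> "personal_status_law.pdf"
print(len(history.get_last_chunks() or []))   # -> 2
```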
backend/document_processor.py CHANGED
@@ -1,5 +1,6 @@
1
  import os
2
  import json
 
3
  from pathlib import Path
4
  from typing import Dict, List, Optional
5
  from openai import OpenAI
@@ -21,10 +22,27 @@ class DocumentProcessor:
21
  raise ValueError("OpenAI API key is required")
22
 
23
  os.environ.setdefault("OPENAI_API_KEY", api_key)
24
- http_client = NoProxyHTTPClient(timeout=300.0)
25
  self.client = OpenAI(http_client=http_client)
26
  self.model = model
27
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
28
 
29
  def process_pdf_with_llm(self, pdf_path: str) -> Dict[str, str]:
30
  """
@@ -48,13 +66,74 @@ class DocumentProcessor:
48
  purpose="user_data"
49
  )
50
 
51
- prompt = (
52
- "You are processing an Arabic legal document. "
53
- "Extract ONLY the main content text (remove headers, footers, page numbers, duplicate elements). "
54
- "Clean the text to remove formatting artifacts. "
55
- "Generate a concise summary in Arabic covering all important content. "
56
- '\nReturn ONLY valid JSON with exactly these fields: {"text": "...", "summary": "..."}'
57
- )
58
 
59
  # Use SDK responses API
60
  response = self.client.responses.create(
@@ -116,12 +195,22 @@ class DocumentProcessor:
116
 
117
  # Load existing processed documents
118
  existing_docs = []
119
- existing_filenames = set()
 
120
  if skip_existing:
121
  existing_docs = self.load_from_json()
122
- existing_filenames = {doc.get("filename") for doc in existing_docs if doc.get("filename")}
 
 
 
 
 
 
 
123
  if existing_filenames:
124
  print(f"Found {len(existing_filenames)} already processed documents")
 
 
125
 
126
  pdf_files = list(folder.glob("*.pdf"))
127
  new_processed_docs = []
@@ -129,13 +218,23 @@ class DocumentProcessor:
129
 
130
  for pdf_file in pdf_files:
131
  filename = pdf_file.name
 
132
 
133
- # Skip if already processed
134
- if skip_existing and filename in existing_filenames:
 
 
 
135
  print(f"⊘ Skipped (already processed): {filename}")
136
  skipped_count += 1
137
  continue
138
 
 
 
 
 
 
 
139
  # Process new document
140
  try:
141
  result = self.process_pdf_with_llm(str(pdf_file))
@@ -171,11 +270,16 @@ class DocumentProcessor:
171
  if append and json_path.exists():
172
  # Load existing and merge, avoiding duplicates
173
  existing_docs = self.load_from_json(json_path)
174
- existing_filenames = {doc.get("filename") for doc in existing_docs}
 
175
 
176
- # Add only new documents
177
  for doc in processed_docs:
178
- if doc.get("filename") not in existing_filenames:
 
 
 
 
179
  existing_docs.append(doc)
180
 
181
  processed_docs = existing_docs
 
1
  import os
2
  import json
3
+ import unicodedata
4
  from pathlib import Path
5
  from typing import Dict, List, Optional
6
  from openai import OpenAI
 
22
  raise ValueError("OpenAI API key is required")
23
 
24
  os.environ.setdefault("OPENAI_API_KEY", api_key)
25
+ http_client = NoProxyHTTPClient(timeout=900.0)
26
  self.client = OpenAI(http_client=http_client)
27
  self.model = model
28
 
29
+ @staticmethod
30
+ def _normalize_filename(filename: str) -> str:
31
+ """
32
+ Normalize filename for comparison (handle Unicode encoding variations).
33
+
34
+ Args:
35
+ filename: Original filename
36
+
37
+ Returns: Normalized filename (NFC form, lowercased, stripped)
38
+ """
39
+ if not filename:
40
+ return ""
41
+ # Normalize to NFC (composed form) to handle encoding variations
42
+ normalized = unicodedata.normalize("NFC", filename)
43
+ # Lowercase and strip for case-insensitive comparison
44
+ return normalized.lower().strip()
45
+
46
 
47
  def process_pdf_with_llm(self, pdf_path: str) -> Dict[str, str]:
48
  """
 
66
  purpose="user_data"
67
  )
68
 
69
+ prompt =("""
70
+ You are processing a legal PDF document (in Arabic) that has been uploaded as a file.
71
+
72
+ Your task has TWO parts:
73
+
74
+ 1) TEXT EXTRACTION & CLEANING
75
+ 2) GLOBAL SUMMARY IN ARABIC
76
+
77
+ ========================
78
+ 1) TEXT EXTRACTION & CLEANING
79
+ ========================
80
+ Extract ONLY the **main body text** of the entire document, in order, exactly as it appears logically in the statute, while cleaning away non-content noise.
81
+
82
+ INCLUDE:
83
+ - All legal text and provisions
84
+ - Article numbers and titles
85
+ - Section / chapter / part / الباب / الفصل headings
86
+ - Numbered clauses, subclauses, bullet points
87
+ - Any explanatory legal text that is part of the law itself
88
+
89
+ EXCLUDE (REMOVE COMPLETELY):
90
+ - Headers on each page (e.g., publication dates, التصنيف, نوع التشريع, حالة التشريع, etc.)
91
+ - Footers on each page
92
+ - Page numbers
93
+ - Any repeated boilerplate that appears identically on each page
94
+ - Scanning artifacts, junk characters, or layout noise
95
+ - Empty or whitespace-only lines that are not meaningful
96
+
97
+ IMPORTANT CLEANING RULES:
98
+ - Preserve the original language (Arabic). Do NOT translate the law.
99
+ - Preserve the logical order of the articles and sections as in the original law.
100
+ - Do NOT paraphrase, shorten, summarize, or reword the legal text. Copy the body text as-is (except for removing headers/footers/page numbers and cleaning artifacts).
101
+ - If the same header/footer text appears on many pages, remove all occurrences.
102
+ - If you are unsure whether a short line is a page number or header/footer (e.g. just a digit or date in the margin), treat it as NON-content and remove it.
103
+ - Keep reasonable line breaks and blank lines between titles, articles, and sections so the text is readable and structured, but do not insert additional commentary.
104
+ - Do NOT invent or hallucinate any missing articles or text. Only use what is actually present in the PDF content.
105
+
106
+ The final "text" field should contain the **full cleaned main body** of the law as ONE string, with newline characters where appropriate.
107
+
108
+ ========================
109
+ 2) GLOBAL SUMMARY (IN ARABIC)
110
+ ========================
111
+ After extracting the cleaned body text, generate a **concise summary in Arabic** that:
112
+
113
+ - Covers جميع الأبواب والفصول والمواد بشكل موجز
114
+ - يوضح موضوع النظام، نطاق تطبيقه، وأهم الأحكام (مثل: الزواج، الحقوق والواجبات، النفقة، النسب، الفرقة، العدة، الحضانة، الوصاية، الولاية، الوصية، المفقود، إلخ)
115
+ - يكون بصياغة عربية فصحى واضحة ومباشرة
116
+ - يكون في بضع فقرات قصيرة أو قائمة نقاط موجزة (بدون إطالة مفرطة)
117
+
118
+ لا تُدخل في الملخص أي تحليلات فقهية أو آراء، فقط وصف منظم لأهم الأحكام.
119
+
120
+
121
+ REQUIREMENTS:
122
+ - Do NOT wrap the JSON in Markdown.
123
+ - Do NOT add any extra keys or metadata.
124
+ - Do NOT add explanations before or after the JSON.
125
+ - Ensure the JSON is valid and parseable (proper quotes, commas, and escaping).
126
+
127
+
128
+ ========================
129
+ OUTPUT FORMAT (STRICT)
130
+ ========================
131
+ Return ONLY a single JSON object, with EXACTLY these two fields:
132
+
133
+ {
134
+ "text": "<the full cleaned main body text of the document as one string>",
135
+ "summary": "<the concise Arabic summary of the entire document>"
136
+ } """)
137
 
138
  # Use SDK responses API
139
  response = self.client.responses.create(
 
195
 
196
  # Load existing processed documents
197
  existing_docs = []
198
+ existing_filenames = set() # Original filenames for reference
199
+ existing_filenames_normalized = set() # Normalized filenames for comparison
200
  if skip_existing:
201
  existing_docs = self.load_from_json()
202
+ for doc in existing_docs:
203
+ original_filename = doc.get("filename")
204
+ if original_filename:
205
+ original_filename = original_filename.strip()
206
+ normalized = self._normalize_filename(original_filename)
207
+ existing_filenames.add(original_filename)
208
+ existing_filenames_normalized.add(normalized)
209
+
210
  if existing_filenames:
211
  print(f"Found {len(existing_filenames)} already processed documents")
212
+ print(f"Existing filenames (original): {list(existing_filenames)}")
213
+ print(f"Existing filenames (normalized): {list(existing_filenames_normalized)}")
214
 
215
  pdf_files = list(folder.glob("*.pdf"))
216
  new_processed_docs = []
 
218
 
219
  for pdf_file in pdf_files:
220
  filename = pdf_file.name
221
+ filename_normalized = self._normalize_filename(filename)
222
 
223
+ # Debug: Print comparison attempt
224
+ print(f"[Filename Check] Checking: '{filename}' (normalized: '{filename_normalized}')")
225
+
226
+ # Skip if already processed (using normalized comparison)
227
+ if skip_existing and filename_normalized in existing_filenames_normalized:
228
  print(f"⊘ Skipped (already processed): {filename}")
229
  skipped_count += 1
230
  continue
231
 
232
+ # Also check original filename for backward compatibility
233
+ if skip_existing and filename in existing_filenames:
234
+ print(f"⊘ Skipped (already processed, exact match): {filename}")
235
+ skipped_count += 1
236
+ continue
237
+
238
  # Process new document
239
  try:
240
  result = self.process_pdf_with_llm(str(pdf_file))
 
270
  if append and json_path.exists():
271
  # Load existing and merge, avoiding duplicates
272
  existing_docs = self.load_from_json(json_path)
273
+ existing_filenames = {doc.get("filename") for doc in existing_docs if doc.get("filename")}
274
+ existing_filenames_normalized = {self._normalize_filename(fn) for fn in existing_filenames}
275
 
276
+ # Add only new documents (using normalized comparison)
277
  for doc in processed_docs:
278
+ doc_filename = doc.get("filename", "")
279
+ doc_filename_normalized = self._normalize_filename(doc_filename)
280
+
281
+ # Check both normalized and original for backward compatibility
282
+ if doc_filename not in existing_filenames and doc_filename_normalized not in existing_filenames_normalized:
283
  existing_docs.append(doc)
284
 
285
  processed_docs = existing_docs
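The effect of the normalization is easiest to see with Arabic filenames that differ only in Unicode form; a small stand-alone illustration:

```python
import unicodedata

def normalize_filename(filename: str) -> str:
    # Mirrors DocumentProcessor._normalize_filename: NFC form, lowercased, stripped.
    return unicodedata.normalize("NFC", filename).lower().strip() if filename else ""

# The same Arabic name in composed (NFC) and decomposed (NFD) form compares
# unequal as raw strings but equal after normalization.
composed = "نظام الأحوال الشخصية.pdf"
decomposed = unicodedata.normalize("NFD", composed)

print(composed == decomposed)                                            # False
print(normalize_filename(composed) == normalize_filename(decomposed))   # True
```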
backend/embeddings.py CHANGED
@@ -1,11 +1,15 @@
1
  import os
2
  import time
3
  import random
4
- from typing import List
5
  from pathlib import Path
6
  from dotenv import load_dotenv
7
  import httpx
8
  from openai import OpenAI
 
 
 
 
9
 
10
 
11
  def _chunk_list(items: List[str], chunk_size: int) -> List[List[str]]:
@@ -91,5 +95,169 @@ class OpenAIEmbeddingsWrapper:
91
 
92
  def embed_documents(self, texts: List[str]) -> List[List[float]]:
93
  return self._embed(texts)
94
 
95
  #شرح نظام الأحوال الشخصية
 
1
  import os
2
  import time
3
  import random
4
+ from typing import List, Optional
5
  from pathlib import Path
6
  from dotenv import load_dotenv
7
  import httpx
8
  from openai import OpenAI
9
+ try:
10
+ from huggingface_hub import InferenceClient
11
+ except ImportError:
12
+ InferenceClient = None
13
 
14
 
15
  def _chunk_list(items: List[str], chunk_size: int) -> List[List[str]]:
 
95
 
96
  def embed_documents(self, texts: List[str]) -> List[List[float]]:
97
  return self._embed(texts)
98
+
99
+ def __call__(self, text: str) -> List[float]:
100
+ """
101
+ Make the embeddings wrapper callable for compatibility with FAISS.
102
+ When FAISS calls the embeddings object directly, this delegates to embed_query.
103
+ """
104
+ return self.embed_query(text)
105
+
106
+
107
+ class HuggingFaceEmbeddingsWrapper:
108
+ """
109
+ Embeddings wrapper compatible with LangChain's embeddings interface.
110
+ Uses HuggingFace InferenceClient with Nebius provider for embeddings.
111
+ Implements same interface as OpenAIEmbeddingsWrapper for drop-in replacement.
112
+ """
113
+ def __init__(self, model: str = "Qwen/Qwen3-Embedding-8B", api_key: str | None = None, timeout: float = 60.0):
114
+ if InferenceClient is None:
115
+ raise ImportError("huggingface_hub is required for HuggingFace embeddings. Install it with: pip install huggingface_hub")
116
+
117
+ # Load .env from project root (one level up from backend/)
118
+ project_root = Path(__file__).resolve().parents[1]
119
+ load_dotenv(project_root / ".env")
120
+
121
+ self.model = model or os.getenv("HF_EMBEDDING_MODEL", "Qwen/Qwen3-Embedding-8B")
122
+ self.api_key = api_key or os.getenv("HF_TOKEN")
123
+ if not self.api_key:
124
+ raise ValueError("HF_TOKEN is required for HuggingFace embeddings. Set HF_TOKEN environment variable.")
125
+
126
+ # Timeout/backoff config
127
+ self.timeout = timeout
128
+ self.batch_size = int(os.getenv("HF_EMBED_BATCH_SIZE", "32")) # Smaller batch size for HF
129
+ self.max_retries = int(os.getenv("HF_EMBED_MAX_RETRIES", "6"))
130
+ self.initial_backoff = float(os.getenv("HF_EMBED_INITIAL_BACKOFF", "1.0"))
131
+ self.backoff_multiplier = float(os.getenv("HF_EMBED_BACKOFF_MULTIPLIER", "2.0"))
132
+
133
+ # Initialize HuggingFace InferenceClient with Nebius provider
134
+ self.client = InferenceClient(
135
+ provider="nebius",
136
+ api_key=self.api_key
137
+ )
138
+ print(f"[HF Embeddings] Initialized with model: {self.model}, provider: nebius")
139
+
140
+ def _embed_once(self, inputs: List[str]) -> List[List[float]]:
141
+ """Call HuggingFace feature_extraction API for a batch of texts"""
142
+ import numpy as np
143
+
144
+ # HuggingFace feature_extraction can handle single or batch inputs
145
+ if len(inputs) == 1:
146
+ # Single text
147
+ result = self.client.feature_extraction(inputs[0], model=self.model)
148
+ # Result is numpy.ndarray - convert to list
149
+ if isinstance(result, np.ndarray):
150
+ if result.ndim == 2:
151
+ # 2D array - extract first row
152
+ result = result[0].tolist()
153
+ else:
154
+ # 1D array - convert directly
155
+ result = result.tolist()
156
+ # Result is a list of floats (embedding vector)
157
+ return [result]
158
+ else:
159
+ # Batch processing - HF may support batch, but we'll process one by one for reliability
160
+ embeddings = []
161
+ for text in inputs:
162
+ result = self.client.feature_extraction(text, model=self.model)
163
+ # Convert numpy array to list if needed
164
+ if isinstance(result, np.ndarray):
165
+ if result.ndim == 2:
166
+ result = result[0].tolist() # Extract first row if 2D
167
+ else:
168
+ result = result.tolist()
169
+ embeddings.append(result)
170
+ return embeddings
171
+
172
+ def _embed_with_retries(self, inputs: List[str]) -> List[List[float]]:
173
+ """Embed with retry logic similar to OpenAI wrapper"""
174
+ attempt = 0
175
+ backoff = self.initial_backoff
176
+ while True:
177
+ try:
178
+ return self._embed_once(inputs)
179
+ except Exception as err:
180
+ status = None
181
+ try:
182
+ # Try to extract status code from error if available
183
+ status = getattr(getattr(err, "response", None), "status_code", None)
184
+ except Exception:
185
+ status = None
186
+
187
+ if (status in (429, 500, 502, 503, 504) or status is None) and attempt < self.max_retries:
188
+ retry_after = 0.0
189
+ try:
190
+ retry_after = float(getattr(getattr(err, "response", None), "headers", {}).get("Retry-After", 0))
191
+ except Exception:
192
+ retry_after = 0.0
193
+ jitter = random.uniform(0, 0.5)
194
+ sleep_s = max(retry_after, backoff) + jitter
195
+ time.sleep(sleep_s)
196
+ attempt += 1
197
+ backoff *= self.backoff_multiplier
198
+ continue
199
+ raise
200
+
201
+ def _embed(self, inputs: List[str]) -> List[List[float]]:
202
+ """Process embeddings in batches with delays between batches"""
203
+ all_embeddings: List[List[float]] = []
204
+ for batch in _chunk_list(inputs, self.batch_size):
205
+ embeds = self._embed_with_retries(batch)
206
+ all_embeddings.extend(embeds)
207
+ # Small delay between batches to avoid rate limiting
208
+ time.sleep(float(os.getenv("HF_EMBED_INTER_BATCH_DELAY", "0.2")))
209
+ return all_embeddings
210
+
211
+ def embed_query(self, text: str) -> List[float]:
212
+ """Embed a single query text"""
213
+ return self._embed([text])[0]
214
+
215
+ def embed_documents(self, texts: List[str]) -> List[List[float]]:
216
+ """Embed multiple documents"""
217
+ return self._embed(texts)
218
+
219
+ def __call__(self, text: str) -> List[float]:
220
+ """
221
+ Make the embeddings wrapper callable for compatibility with FAISS.
222
+ When FAISS calls the embeddings object directly, this delegates to embed_query.
223
+ """
224
+ return self.embed_query(text)
225
+
226
+
227
+ def get_embeddings_wrapper(
228
+ model: Optional[str] = None,
229
+ api_key: Optional[str] = None,
230
+ timeout: float = 30.0
231
+ ):
232
+ """
233
+ Factory function to get the appropriate embeddings wrapper based on configuration.
234
+
235
+ Args:
236
+ model: Model name (provider-specific)
237
+ api_key: API key (provider-specific)
238
+ timeout: Timeout in seconds
239
+
240
+ Returns:
241
+ Either OpenAIEmbeddingsWrapper or HuggingFaceEmbeddingsWrapper instance
242
+
243
+ Environment Variables:
244
+ EMBEDDINGS_PROVIDER: "openai" (default), "huggingface", "hf", or "nebius"
245
+ HF_TOKEN: Required if using HuggingFace provider
246
+ HF_EMBEDDING_MODEL: Optional model override for HuggingFace (default: "Qwen/Qwen3-Embedding-8B")
247
+ """
248
+ # Load .env from project root
249
+ project_root = Path(__file__).resolve().parents[1]
250
+ load_dotenv(project_root / ".env")
251
+
252
+ provider = os.getenv("EMBEDDINGS_PROVIDER", "hf").lower()  # defaults to "hf"; set EMBEDDINGS_PROVIDER=openai to use OpenAI
253
+
254
+ if provider in ["huggingface", "hf", "nebius"]:
255
+ print(f"[Embeddings Factory] Using HuggingFace/Nebius provider")
256
+ hf_model = model or os.getenv("HF_EMBEDDING_MODEL", "Qwen/Qwen3-Embedding-8B")
257
+ return HuggingFaceEmbeddingsWrapper(model=hf_model, api_key=api_key, timeout=timeout)
258
+ else:
259
+ print(f"[Embeddings Factory] Using OpenAI provider (default)")
260
+ openai_model = model or os.getenv("OPENAI_EMBEDDING_MODEL", "text-embedding-ada-002")
261
+ return OpenAIEmbeddingsWrapper(model=openai_model, api_key=api_key, timeout=timeout)
262
 
263
  #شرح نظام الأحوال الشخصية
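The factory keeps the calling code provider-agnostic. A usage sketch (import path assumed; running it requires `HF_TOKEN` for the default provider, or `OPENAI_API_KEY` with `EMBEDDINGS_PROVIDER=openai`):

```python
# Usage sketch for the provider factory (import path assumed).
from backend.embeddings import get_embeddings_wrapper

embeddings = get_embeddings_wrapper()

vector = embeddings.embed_query("What is the scope of the law?")
print(len(vector))  # embedding dimension depends on the selected model

batch = embeddings.embed_documents(["Article 1 ...", "Article 2 ..."])
print(len(batch))   # one vector per input text
```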
backend/main.py CHANGED
@@ -104,6 +104,8 @@ app.add_middleware(
104
  class QuestionRequest(BaseModel):
105
  question: str
106
  use_history: Optional[bool] = True
 
 
107
 
108
 
109
  class QuestionResponse(BaseModel):
@@ -175,6 +177,9 @@ async def health():
175
  @app.post("/ask", response_model=QuestionResponse)
176
  async def ask_question(request: QuestionRequest):
177
  """Answer a question using RAG with multi-turn chat history"""
 
 
 
178
  global rag_system, rag_ready
179
  if rag_system is None or not rag_ready:
180
  raise HTTPException(
@@ -186,13 +191,14 @@ async def ask_question(request: QuestionRequest):
186
  raise HTTPException(status_code=400, detail="Question cannot be empty")
187
 
188
  try:
189
- answer, sources = rag_system.answer_question(
190
  request.question,
191
  use_history=request.use_history,
192
- model_provider="qwen",
 
193
  )
194
- # print("[/ask] Qwen answer:", answer)
195
- # print("[/ask] Sources:", sources)
196
  return QuestionResponse(answer=answer, sources=sources)
197
  except Exception as e:
198
  raise HTTPException(
@@ -229,12 +235,8 @@ async def get_document(filename: str, mode: str = Query("download", enum=["downl
229
 
230
  # If file doesn't exist, try to find it by matching actual files in directory
231
  if not file_path.exists():
232
- print(f"[get_document] Document not found at direct path: {file_path}")
233
- print(f"[get_document] Searching for filename: {decoded_filename}")
234
-
235
  # List all PDF files in documents directory
236
  actual_files = list(documents_dir.glob("*.pdf"))
237
- print(f"[get_document] Found {len(actual_files)} PDF files in directory")
238
 
239
  # Normalize the requested filename for comparison
240
  def normalize_name(name: str) -> str:
@@ -253,24 +255,17 @@ async def get_document(filename: str, mode: str = Query("download", enum=["downl
253
  actual_name = actual_file.name
254
  actual_normalized = normalize_name(actual_name)
255
 
256
- print(f"[get_document] Comparing: '{requested_normalized}' with '{actual_normalized}'")
257
-
258
  if requested_normalized == actual_normalized:
259
  matched_file = actual_file
260
- print(f"[get_document] Found match: {actual_file.name}")
261
  break
262
 
263
  if matched_file:
264
  file_path = matched_file.resolve()
265
  else:
266
- # Log all available files for debugging
267
- print(f"[get_document] Available files in directory:")
268
- for f in actual_files:
269
- print(f"[get_document] - {f.name}")
270
- print(f"[get_document] Requested filename (normalized): {requested_normalized}")
271
  raise HTTPException(
272
  status_code=404,
273
- detail=f"Document not found: {decoded_filename}. Available files: {[f.name for f in actual_files]}"
274
  )
275
 
276
  file_extension = file_path.suffix.lower()
@@ -288,12 +283,35 @@ async def get_document(filename: str, mode: str = Query("download", enum=["downl
288
 
289
  if mode == "preview":
290
  if file_extension != ".pdf":
291
- return JSONResponse({"filename": filename, "error": "Preview only available for PDF files"}, status_code=400)
292
  return FileResponse(
293
  str(file_path),
294
  media_type="application/pdf",
295
  filename=filename,
296
- headers=build_headers("inline")
297
  )
298
 
299
  media_type = "application/pdf" if file_extension == ".pdf" else "application/octet-stream"
 
104
  class QuestionRequest(BaseModel):
105
  question: str
106
  use_history: Optional[bool] = True
107
+ context_mode: Optional[str] = "chunks"
108
+ model_provider: Optional[str] = "qwen"  # "qwen", "openai", or "huggingface"
109
 
110
 
111
  class QuestionResponse(BaseModel):
 
177
  @app.post("/ask", response_model=QuestionResponse)
178
  async def ask_question(request: QuestionRequest):
179
  """Answer a question using RAG with multi-turn chat history"""
180
+ import time
181
+ request_start = time.perf_counter()
182
+
183
  global rag_system, rag_ready
184
  if rag_system is None or not rag_ready:
185
  raise HTTPException(
 
191
  raise HTTPException(status_code=400, detail="Question cannot be empty")
192
 
193
  try:
194
+ answer, sources, _chunks = rag_system.answer_question(
195
  request.question,
196
  use_history=request.use_history,
197
+ model_provider=request.model_provider,
198
+ context_mode=request.context_mode or "full",
199
  )
200
+ request_time = (time.perf_counter() - request_start) * 1000
201
+ print(f"[Timing] Total /ask endpoint time: {request_time:.2f}ms")
202
  return QuestionResponse(answer=answer, sources=sources)
203
  except Exception as e:
204
  raise HTTPException(
 
235
 
236
  # If file doesn't exist, try to find it by matching actual files in directory
237
  if not file_path.exists():
 
 
 
238
  # List all PDF files in documents directory
239
  actual_files = list(documents_dir.glob("*.pdf"))
 
240
 
241
  # Normalize the requested filename for comparison
242
  def normalize_name(name: str) -> str:
 
255
  actual_name = actual_file.name
256
  actual_normalized = normalize_name(actual_name)
257
 
 
 
258
  if requested_normalized == actual_normalized:
259
  matched_file = actual_file
 
260
  break
261
 
262
  if matched_file:
263
  file_path = matched_file.resolve()
264
  else:
265
+ error_detail = f"Document not found: '{decoded_filename}'. Available files: {[f.name for f in actual_files]}"
 
 
 
 
266
  raise HTTPException(
267
  status_code=404,
268
+ detail=error_detail
269
  )
270
 
271
  file_extension = file_path.suffix.lower()
 
283
 
284
  if mode == "preview":
285
  if file_extension != ".pdf":
286
+ error_msg = f"Preview only available for PDF files. File extension: {file_extension}"
287
+ return JSONResponse({"filename": filename, "error": error_msg}, status_code=400)
288
+
289
+ # Verify file exists before returning
290
+ if not file_path.exists():
291
+ error_msg = f"File not found for preview: {file_path}"
292
+ raise HTTPException(status_code=404, detail=error_msg)
293
+
294
+ # Verify file is readable and not empty
295
+ try:
296
+ file_size = file_path.stat().st_size
297
+ if file_size == 0:
298
+ error_msg = f"File is empty: {file_path}"
299
+ raise HTTPException(status_code=400, detail=error_msg)
300
+ except Exception as e:
301
+ error_msg = f"Error accessing file: {str(e)}"
302
+ raise HTTPException(status_code=500, detail=error_msg)
303
+
304
+ # Build headers for preview (inline display)
305
+ preview_headers = build_headers("inline")
306
+ # Add CORS headers if needed
307
+ preview_headers["Access-Control-Allow-Origin"] = "*"
308
+ preview_headers["Access-Control-Expose-Headers"] = "Content-Disposition, Content-Type"
309
+
310
  return FileResponse(
311
  str(file_path),
312
  media_type="application/pdf",
313
  filename=filename,
314
+ headers=preview_headers
315
  )
316
 
317
  media_type = "application/pdf" if file_extension == ".pdf" else "application/octet-stream"
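
For a quick manual check of the new request fields, a minimal client sketch; the base URL and port are assumptions (adjust to wherever the FastAPI app is actually served):

```python
# Minimal sketch: exercise the updated /ask request schema introduced in this commit.
# The host/port below are assumptions, not taken from the diff.
import requests

payload = {
    "question": "ما هي شروط الزواج؟",
    "use_history": True,
    "context_mode": "chunks",       # "full" or "chunks"
    "model_provider": "qwen",       # "qwen", "openai", or "huggingface"
}

response = requests.post("http://localhost:8000/ask", json=payload, timeout=120)
response.raise_for_status()
data = response.json()
print(data["answer"])   # Arabic answer text
print(data["sources"])  # list of matched source filenames (chunks are no longer returned)
```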
backend/rag_system.py CHANGED
@@ -1,15 +1,17 @@
1
  import os
2
  import json
 
3
  from pathlib import Path
4
  from typing import List, Tuple, Optional, Dict
5
  from langchain_community.vectorstores import FAISS
6
  from langchain.schema import Document
 
7
  try:
8
- from backend.embeddings import OpenAIEmbeddingsWrapper
9
  from backend.document_processor import DocumentProcessor
10
  from backend.chat_history import ChatHistory
11
  except ModuleNotFoundError:
12
- from embeddings import OpenAIEmbeddingsWrapper
13
  from document_processor import DocumentProcessor
14
  from chat_history import ChatHistory
15
  from openai import OpenAI
@@ -36,18 +38,29 @@ class RAGSystem:
36
  self.json_path = json_path
37
  self.vectorstore = None
38
 
39
- # Initialize embeddings
40
- api_key = openai_api_key or os.getenv("OPENAI_API_KEY")
41
- if not api_key:
42
- raise ValueError("OpenAI API key is required. Set OPENAI_API_KEY environment variable.")
 
 
 
 
 
 
 
 
43
 
44
- self.embeddings = OpenAIEmbeddingsWrapper(api_key=api_key)
45
 
46
- # Initialize document processor
47
- self.processor = DocumentProcessor(api_key=api_key)
 
 
 
48
 
49
  # Initialize LLM client for answering questions
50
- os.environ.setdefault("OPENAI_API_KEY", api_key)
51
  http_client = NoProxyHTTPClient(timeout=60.0)
52
  self.llm_client = OpenAI(http_client=http_client)
53
  self.llm_model = os.getenv("OPENAI_LLM_MODEL", "gpt-4o-mini")
@@ -55,6 +68,13 @@ class RAGSystem:
55
  # Chat history manager
56
  self.chat_history = ChatHistory(max_history=int(os.getenv("CHAT_HISTORY_TURNS", "10")))
57
 
 
 
 
 
 
 
 
58
  # Try to load existing vectorstore
59
  self._load_vectorstore()
60
 
@@ -67,8 +87,22 @@ class RAGSystem:
67
  embeddings=self.embeddings,
68
  allow_dangerous_deserialization=True
69
  )
70
- # Ensure embedding function is callable
71
- self.vectorstore.embedding_function = self.embeddings.embed_query
 
 
 
 
 
 
 
 
 
 
 
 
 
 
72
  print(f"Loaded existing vectorstore from {self.vectorstore_path}")
73
  except Exception as e:
74
  print(f"Could not load existing vectorstore: {e}")
@@ -171,7 +205,298 @@ class RAGSystem:
171
 
172
  return len(new_processed_docs)
173
 
174
- def answer_question(self, question: str, use_history: bool = True, model_provider: str = "openai") -> Tuple[str, List[str]]:
 
175
  """
176
  Answer a question using RAG with multi-turn chat history
177
 
@@ -179,18 +504,33 @@ class RAGSystem:
179
  question: The user's question
180
  use_history: Whether to use chat history
181
  model_provider: Model provider to use - "openai" (default) or "qwen"/"huggingface" for Qwen model
 
182
 
183
  Returns:
184
- Tuple of (answer, list of source filenames)
185
  """
 
 
186
  if self.vectorstore is None:
187
  raise ValueError("No documents indexed. Please process documents first.")
188
 
189
- # Ensure embedding function is callable
190
- if getattr(self.vectorstore, "embedding_function", None) is None or not callable(self.vectorstore.embedding_function):
191
- self.vectorstore.embedding_function = self.embeddings.embed_query
 
 
 
 
 
 
 
 
 
 
 
 
 
192
 
193
- # Step 1: Find most similar summary
194
  # Build search query with last chat turn context if history is enabled
195
  search_query = question
196
  if use_history:
@@ -207,70 +547,181 @@ class RAGSystem:
207
  # Combine with current question
208
  search_query = f"{last_turn_text}\nCurrent Q: {question}"
209
 
210
- similar_docs = self.vectorstore.similarity_search(search_query, k=1)
 
 
 
 
211
 
212
- if not similar_docs:
213
- return "I couldn't find any relevant information to answer your question.", []
214
 
215
- # Step 2: Get filename from matched summary
216
- matched_filename = similar_docs[0].metadata.get("filename", "")
 
217
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
218
 
219
- if not matched_filename:
220
- return "Error: No filename found in matched document metadata.", []
 
 
 
 
221
 
222
- # Step 3: Retrieve full text from JSON
223
- print(f"DEBUG: Searching for filename: '{matched_filename}'")
224
- print(f"DEBUG: JSON path: {self.json_path}")
 
225
 
226
- full_text = self.processor.get_text_by_filename(matched_filename, json_path=self.json_path)
 
 
 
 
 
 
 
 
 
 
 
227
 
228
  if not full_text:
229
- # Debug: Check what's in the JSON file
230
- json_path = Path(self.json_path)
231
- if not json_path.exists():
232
- error_msg = f"Error: JSON file not found at {self.json_path}. Please process documents first."
233
- print(f"DEBUG: {error_msg}")
234
- return error_msg, [matched_filename]
235
 
236
- try:
237
- with open(json_path, "r", encoding="utf-8") as f:
238
- docs = json.load(f)
239
- available_filenames = [doc.get("filename", "unknown") for doc in docs] if isinstance(docs, list) else []
240
- print(f"DEBUG: JSON file exists with {len(available_filenames) if isinstance(docs, list) else 0} documents")
241
- print(f"DEBUG: Available filenames: {available_filenames}")
242
-
243
- error_msg = f"Could not retrieve text for document: '{matched_filename}'. "
244
- if available_filenames:
245
- error_msg += f"Available filenames in JSON: {', '.join(available_filenames)}"
246
- else:
247
- error_msg += "JSON file is empty or invalid."
248
- return error_msg, [matched_filename]
249
- except Exception as e:
250
- error_msg = f"Error loading JSON file: {str(e)}"
251
- print(f"DEBUG: {error_msg}")
252
- return error_msg, [matched_filename]
253
 
254
- # Step 4: Build prompt with full text, question, and chat history
 
255
  history_messages = []
256
  if use_history:
257
  # Get last 3 messages (get 2 turns = 4 messages, then take last 3)
258
  history_messages = self.chat_history.get_recent_history(n_turns=2)
259
 
260
- system_prompt = f"""You are a helpful legal document assistant. Answer questions based on the provided document text.
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
261
 
262
  MODE 1 - General Questions:
263
  - Understand the context and provide a clear, helpful answer
264
  - You may paraphrase or summarize information from the document
265
  - Explain concepts in your own words while staying true to the document's meaning
266
- - Whenever you refere to the document, refer to it by the filename WITHOUT the extension such as ".pdf" or".doc": {matched_filename}
267
 
268
  MODE 2 - Legal Articles/Terms (المادة):
269
- - When the user asks about specific legal articles (المادة), legal terms, or exact regulations, you MUST quote the EXACT text from the document verbatim
 
270
  - Copy the complete text word-for-word, including all numbers, punctuation, and formatting
271
- - Do NOT paraphrase, summarize, or generate new text for legal articles
272
  - NEVER create or generate legal text - only use what exists in the document
273
 
 
 
 
 
 
 
 
274
  If the answer is not in the document, say so clearly. MUST Answer in Arabic."""
275
 
276
  # Check if question contains legal article/term keywords
@@ -278,17 +729,18 @@ If the answer is not in the document, say so clearly. MUST Answer in Arabic."""
278
 
279
  legal_instruction = ""
280
  if is_legal_term_question:
281
- legal_instruction = "\n\nCRITICAL: The user is asking about a legal article or legal term. You MUST quote the EXACT text from the document verbatim. Copy the complete text word-for-word, including all numbers and punctuation. Do NOT paraphrase or generate any text."
282
  else:
283
  legal_instruction = "\n\nAnswer the question by understanding the context from the document. You may paraphrase or explain in your own words while staying true to the document's meaning."
284
 
285
- user_prompt = f"""Document Text:
286
- {full_text[:8000]} # Limit to avoid token limits
287
 
288
  User Question: {question}
289
  {legal_instruction}
290
 
291
- Please answer the question based on the document text above. MUST Answer the Question in Arabic"""
 
292
 
293
  messages = [
294
  {"role": "system", "content": system_prompt}
@@ -301,8 +753,11 @@ Please answer the question based on the document text above. MUST Answer the Que
301
  messages.append(msg)
302
 
303
  messages.append({"role": "user", "content": user_prompt})
 
 
304
 
305
  # Step 5: Get answer from LLM
 
306
  try:
307
  # Initialize client based on model_provider
308
  if model_provider.lower() in ["qwen", "huggingface"]:
@@ -323,28 +778,41 @@ Please answer the question based on the document text above. MUST Answer the Que
323
  llm_client = self.llm_client
324
  llm_model = self.llm_model
325
 
 
326
  response = llm_client.chat.completions.create(
327
  model=llm_model,
328
  messages=messages,
329
  temperature=0.3
330
  )
331
-
332
- answer = response.choices[0].message.content
 
333
 
334
  # Filter thinking process from Qwen responses
335
  if model_provider.lower() in ["qwen", "huggingface"]:
336
- answer = self._filter_thinking_process(answer)
 
 
 
 
 
 
337
 
338
- # Step 6: Update chat history
339
  self.chat_history.add_message("user", question)
340
- self.chat_history.add_message("assistant", answer)
 
 
 
341
 
342
- return answer, [matched_filename]
343
  except Exception as e:
 
 
344
  error_msg = f"Error generating answer: {str(e)}"
345
  self.chat_history.add_message("user", question)
346
- self.chat_history.add_message("assistant", error_msg)
347
- return error_msg, [matched_filename]
348
 
349
  def clear_chat_history(self):
350
  """Clear chat history"""
 
1
  import os
2
  import json
3
+ import time
4
  from pathlib import Path
5
  from typing import List, Tuple, Optional, Dict
6
  from langchain_community.vectorstores import FAISS
7
  from langchain.schema import Document
8
+ from langchain.text_splitter import RecursiveCharacterTextSplitter
9
  try:
10
+ from backend.embeddings import get_embeddings_wrapper
11
  from backend.document_processor import DocumentProcessor
12
  from backend.chat_history import ChatHistory
13
  except ModuleNotFoundError:
14
+ from embeddings import get_embeddings_wrapper
15
  from document_processor import DocumentProcessor
16
  from chat_history import ChatHistory
17
  from openai import OpenAI
 
38
  self.json_path = json_path
39
  self.vectorstore = None
40
 
41
+ # Initialize embeddings (supports OpenAI or HuggingFace based on EMBEDDINGS_PROVIDER env var)
42
+ provider = os.getenv("EMBEDDINGS_PROVIDER", "openai").lower()
43
+ if provider in ["huggingface", "hf", "nebius"]:
44
+ # For HuggingFace, use HF_TOKEN
45
+ embeddings_api_key = os.getenv("HF_TOKEN")
46
+ if not embeddings_api_key:
47
+ raise ValueError("HF_TOKEN is required for HuggingFace embeddings. Set HF_TOKEN environment variable.")
48
+ else:
49
+ # For OpenAI, use OPENAI_API_KEY
50
+ embeddings_api_key = openai_api_key or os.getenv("OPENAI_API_KEY")
51
+ if not embeddings_api_key:
52
+ raise ValueError("OpenAI API key is required. Set OPENAI_API_KEY environment variable.")
53
 
54
+ self.embeddings = get_embeddings_wrapper(api_key=embeddings_api_key)
55
 
56
+ # Initialize document processor (always uses OpenAI for LLM processing)
57
+ openai_api_key_for_processor = openai_api_key or os.getenv("OPENAI_API_KEY")
58
+ if not openai_api_key_for_processor:
59
+ raise ValueError("OpenAI API key is required for document processing. Set OPENAI_API_KEY environment variable.")
60
+ self.processor = DocumentProcessor(api_key=openai_api_key_for_processor)
61
 
62
  # Initialize LLM client for answering questions
63
+ os.environ.setdefault("OPENAI_API_KEY", openai_api_key_for_processor)
64
  http_client = NoProxyHTTPClient(timeout=60.0)
65
  self.llm_client = OpenAI(http_client=http_client)
66
  self.llm_model = os.getenv("OPENAI_LLM_MODEL", "gpt-4o-mini")
 
68
  # Chat history manager
69
  self.chat_history = ChatHistory(max_history=int(os.getenv("CHAT_HISTORY_TURNS", "10")))
70
 
71
+ # Cache for JSON file contents and document texts
72
+ self._json_cache = None
73
+ self._json_cache_path = None
74
+ self._text_cache: Dict[str, str] = {} # Cache for document texts by filename
75
+ # Cache for per-document chunk vectorstores: {filename: {"vectorstore": FAISS, "chunks": List[Document]}}
76
+ self._chunk_cache: Dict[str, Dict[str, object]] = {}
77
+
78
  # Try to load existing vectorstore
79
  self._load_vectorstore()
80
 
 
87
  embeddings=self.embeddings,
88
  allow_dangerous_deserialization=True
89
  )
90
+ # Ensure embedding function is properly set
91
+ # FAISS may use either embedding_function attribute or call embeddings directly
92
+ # Set embedding_function to the embed_query method for compatibility
93
+ if not hasattr(self.vectorstore, 'embedding_function') or self.vectorstore.embedding_function is None:
94
+ self.vectorstore.embedding_function = self.embeddings.embed_query
95
+ elif not callable(self.vectorstore.embedding_function):
96
+ self.vectorstore.embedding_function = self.embeddings.embed_query
97
+
98
+ # Also ensure the embeddings object itself is accessible and callable
99
+ # This handles cases where FAISS tries to call the embeddings object directly
100
+ if hasattr(self.vectorstore, 'embeddings'):
101
+ self.vectorstore.embeddings = self.embeddings
102
+
103
+ # Verify embedding function is working
104
+ if not callable(self.vectorstore.embedding_function):
105
+ raise ValueError("Embedding function is not callable after initialization")
106
  print(f"Loaded existing vectorstore from {self.vectorstore_path}")
107
  except Exception as e:
108
  print(f"Could not load existing vectorstore: {e}")
 
205
 
206
  return len(new_processed_docs)
207
 
208
+ @staticmethod
209
+ def _parse_llm_response(raw_response: str) -> str:
210
+ """
211
+ Parse LLM response to extract answer.
212
+
213
+ Args:
214
+ raw_response: The raw response from LLM
215
+
216
+ Returns:
217
+ The answer text
218
+ """
219
+ # Try to parse as JSON first
220
+ try:
221
+ # Look for JSON in the response (might be wrapped in markdown code blocks)
222
+ response_text = raw_response.strip()
223
+
224
+ # Remove markdown code blocks if present
225
+ if response_text.startswith("```json"):
226
+ response_text = response_text[7:] # Remove ```json
227
+ elif response_text.startswith("```"):
228
+ response_text = response_text[3:] # Remove ```
229
+
230
+ if response_text.endswith("```"):
231
+ response_text = response_text[:-3] # Remove closing ```
232
+
233
+ response_text = response_text.strip()
234
+
235
+ # Try to parse as JSON
236
+ parsed = json.loads(response_text)
237
+
238
+ answer = parsed.get("answer", raw_response)
239
+ return answer
240
+
241
+ except (json.JSONDecodeError, ValueError) as e:
242
+ # If JSON parsing fails, return the raw response
243
+ return raw_response
244
+
245
+ def _load_json_cached(self) -> List[Dict[str, str]]:
246
+ """Load JSON file with caching to avoid repeated file I/O"""
247
+ json_path = Path(self.json_path)
248
+
249
+ # Check if cache is valid (file hasn't changed)
250
+ if self._json_cache is not None and self._json_cache_path == str(json_path):
251
+ if json_path.exists():
252
+ # Check if file modification time changed
253
+ current_mtime = json_path.stat().st_mtime
254
+ if hasattr(self, '_json_cache_mtime') and self._json_cache_mtime == current_mtime:
255
+ return self._json_cache
256
+
257
+ # Load from file
258
+ if not json_path.exists():
259
+ return []
260
+
261
+ try:
262
+ with open(json_path, "r", encoding="utf-8") as f:
263
+ docs = json.load(f)
264
+ # Cache the results
265
+ self._json_cache = docs if isinstance(docs, list) else []
266
+ self._json_cache_path = str(json_path)
267
+ self._json_cache_mtime = json_path.stat().st_mtime
268
+ return self._json_cache
269
+ except Exception as e:
270
+ return []
271
+
272
+ def _get_text_by_filename_cached(self, filename: str) -> Optional[str]:
273
+ """Get full text for a document by filename using cache"""
274
+ # Check text cache first
275
+ if filename in self._text_cache:
276
+ return self._text_cache[filename]
277
+
278
+ # Load from JSON cache
279
+ docs = self._load_json_cached()
280
+ for doc in docs:
281
+ if doc.get("filename") == filename:
282
+ text = doc.get("text", "")
283
+ # Cache the text
284
+ self._text_cache[filename] = text
285
+ return text
286
+
287
+ return None
288
+
289
+ def _get_or_build_chunk_vectorstore(
290
+ self,
291
+ filename: str,
292
+ full_text: str,
293
+ chunk_size: int = 2000,
294
+ chunk_overlap: int = 300
295
+ ) -> Tuple[FAISS, List[Document]]:
296
+ """
297
+ Build or retrieve an in-memory FAISS vectorstore of semantic chunks for a single document.
298
+
299
+ Args:
300
+ filename: Document filename used as key in cache/metadata
301
+ full_text: Full document text to chunk
302
+ chunk_size: Approximate character length for each chunk
303
+ chunk_overlap: Overlap between consecutive chunks (characters)
304
+
305
+ Returns:
306
+ Tuple of (FAISS vectorstore over chunks, list of chunk Documents)
307
+ """
308
+ # Return from cache if available
309
+ if filename in self._chunk_cache:
310
+ entry = self._chunk_cache[filename]
311
+ return entry["vectorstore"], entry["chunks"] # type: ignore[return-value]
312
+
313
+ # Create text splitter tuned for Arabic legal text
314
+ text_splitter = RecursiveCharacterTextSplitter(
315
+ chunk_size=chunk_size,
316
+ chunk_overlap=chunk_overlap,
317
+ separators=[
318
+ "\n\n",
319
+ "\n",
320
+ "المادة ",
321
+ "مادة ",
322
+ ". ",
323
+ " ",
324
+ ""
325
+ ],
326
+ )
327
+
328
+ chunks = text_splitter.split_text(full_text)
329
+ chunk_docs: List[Document] = []
330
+ for idx, chunk in enumerate(chunks):
331
+ chunk_docs.append(
332
+ Document(
333
+ page_content=chunk,
334
+ metadata={
335
+ "filename": filename,
336
+ "chunk_index": idx,
337
+ },
338
+ )
339
+ )
340
+
341
+ if not chunk_docs:
342
+ # Fallback: single chunk with entire text
343
+ chunk_docs = [
344
+ Document(
345
+ page_content=full_text,
346
+ metadata={
347
+ "filename": filename,
348
+ "chunk_index": 0,
349
+ },
350
+ )
351
+ ]
352
+
353
+ chunk_vectorstore = FAISS.from_documents(chunk_docs, embedding=self.embeddings)
354
+ self._chunk_cache[filename] = {
355
+ "vectorstore": chunk_vectorstore,
356
+ "chunks": chunk_docs,
357
+ }
358
+ return chunk_vectorstore, chunk_docs
359
+
360
+ def _classify_question(self, question: str, use_history: bool = True, model_provider: str = "openai") -> Tuple[str, Optional[str], Optional[List[str]], Optional[List[str]]]:
361
+ """
362
+ Classify question into one of three categories: law-new, law-followup, or general.
363
+
364
+ Args:
365
+ question: The user's question
366
+ use_history: Whether to use chat history
367
+ model_provider: Model provider to use
368
+
369
+ Returns:
370
+ Tuple of (label, answer, sources, chunks) where:
371
+ - label: "law-new", "law-followup", or "general"
372
+ - For "general": answer contains the answer string, sources=[], chunks=None
373
+ - For "law-new" or "law-followup": answer=None, sources=None, chunks=None (RAG will handle answering)
374
+ """
375
+ # Get previous turn context for distinguishing law-new from law-followup
376
+ previous_context = ""
377
+ if use_history:
378
+ last_turn = self.chat_history.get_last_turn()
379
+ if last_turn and len(last_turn) >= 2:
380
+ prev_user = last_turn[0].get("content", "") if last_turn[0].get("role") == "user" else ""
381
+ prev_assistant = last_turn[1].get("content", "") if last_turn[1].get("role") == "assistant" else ""
382
+ if prev_user and prev_assistant:
383
+ previous_context = f"\n\nPrevious conversation:\nUser: {prev_user}\nAssistant: {prev_assistant}"
384
+
385
+ classification_prompt = f"""Classify the following question as one of: "law-new", "law-followup", or "general".
386
+
387
+ A "law-new" question is:
388
+ - A law-related question that starts a new topic/thread
389
+ - Not primarily dependent on the immediately previous answer
390
+ - About legal documents, regulations, laws, articles (المادة), legal cases, procedures, terms, definitions
391
+ - Anything related to legal matters in documents, but as a new inquiry
392
+
393
+ A "law-followup" question is:
394
+ - A law-related question that is a follow-up, inference, or clarification based on the previous assistant response
395
+ - Refers to or builds upon the previous answer (e.g., "what about...", "can you explain more about...", "based on that...", "how about...", "what if...")
396
+ - Asks for clarification, elaboration, or related information about what was just discussed
397
+ - Continues the conversation thread about the same legal topic
398
+ - Uses pronouns or references that relate to the previous response
399
+
400
+ A "general" question is:
401
+ - Greetings (السلام عليكم, مرحبا, etc.)
402
+ - Casual conversation
403
+ - Questions not related to legal documents or law
404
+
405
+ {previous_context}
406
+
407
+ Current Question: {question}
408
+
409
+ If the question is "general", provide a helpful answer in Arabic.
410
+ If the question is "law-new", respond with only "law-new".
411
+ If the question is "law-followup", respond with only "law-followup".
412
+ """
413
+
414
+ try:
415
+ # Initialize client based on model_provider
416
+ if model_provider.lower() in ["qwen", "huggingface"]:
417
+ hf_token = os.getenv("HF_TOKEN")
418
+ if not hf_token:
419
+ # Fallback to OpenAI if HF_TOKEN not available
420
+ llm_client = self.llm_client
421
+ llm_model = self.llm_model
422
+ else:
423
+ http_client = NoProxyHTTPClient(timeout=60.0)
424
+ llm_client = OpenAI(
425
+ base_url="https://router.huggingface.co/v1",
426
+ api_key=hf_token,
427
+ http_client=http_client
428
+ )
429
+ llm_model = os.getenv("QWEN_MODEL", "Qwen/Qwen3-32B:nscale")
430
+ else:
431
+ llm_client = self.llm_client
432
+ llm_model = self.llm_model
433
+
434
+ # Build messages with chat history if enabled
435
+ history_messages = []
436
+ if use_history:
437
+ history_messages = self.chat_history.get_recent_history(n_turns=2)
438
+
439
+ system_prompt = """You are a helpful assistant. Classify questions into one of three categories and answer general questions in Arabic.
440
+ If the question is a greeting or general question, provide a friendly, helpful answer in Arabic.
441
+ If the question is law-related and starts a new topic, respond with only "law-new".
442
+ If the question is law-related and is a follow-up to the previous response, respond with only "law-followup".
443
+ Respond with ONLY one of: "law-new", "law-followup", or provide an answer if it's general."""
444
+
445
+ messages = [{"role": "system", "content": system_prompt}]
446
+
447
+ # Add chat history
448
+ if history_messages:
449
+ for msg in history_messages[:-1] if len(history_messages) > 0 and history_messages[-1].get("content") == question else history_messages:
450
+ messages.append(msg)
451
+
452
+ messages.append({"role": "user", "content": classification_prompt})
453
+
454
+ response = llm_client.chat.completions.create(
455
+ model=llm_model,
456
+ messages=messages,
457
+ temperature=0.3
458
+ )
459
+
460
+ raw_response = response.choices[0].message.content.strip()
461
+
462
+ # Filter thinking process from Qwen responses
463
+ if model_provider.lower() in ["qwen", "huggingface"]:
464
+ raw_response = self._filter_thinking_process(raw_response)
465
+
466
+ # Check classification result
467
+ response_lower = raw_response.lower().strip()
468
+ is_law_new = "law-new" in response_lower and len(response_lower) < 20
469
+ is_law_followup = "law-followup" in response_lower and len(response_lower) < 20
470
+
471
+ if is_law_new:
472
+ print(f"[Classification] Question classified as: law-new")
473
+ return ("law-new", None, None, None) # Continue with RAG flow
474
+ elif is_law_followup:
475
+ print(f"[Classification] Question classified as: law-followup")
476
+ return ("law-followup", None, None, None) # Continue with RAG flow, will reuse chunks if available
477
+ else:
478
+ # General question - use the response as answer
479
+ answer = self._parse_llm_response(raw_response)
480
+
481
+ # Update chat history
482
+ self.chat_history.add_message("user", question)
483
+ self.chat_history.add_message("assistant", answer)
484
+
485
+ print(f"[Classification] Question classified as: general, answered directly")
486
+ return ("general", answer, [], None) # Return answer with empty sources and no chunks
487
+
488
+ except Exception as e:
489
+ # On error, default to law-new to use RAG flow
490
+ print(f"[Classification] Error classifying question, defaulting to law-new: {e}")
491
+ return ("law-new", None, None, None)
492
+
493
+ def answer_question(
494
+ self,
495
+ question: str,
496
+ use_history: bool = True,
497
+ model_provider: str = "openai",
498
+ context_mode: str = "full",
499
+ ) -> Tuple[str, List[str], Optional[List[str]]]:
500
  """
501
  Answer a question using RAG with multi-turn chat history
502
 
 
504
  question: The user's question
505
  use_history: Whether to use chat history
506
  model_provider: Model provider to use - "openai" (default) or "qwen"/"huggingface" for Qwen model
507
+ context_mode: Context construction mode - "full" (entire document) or "chunks" (top semantic chunks)
508
 
509
  Returns:
510
+ Tuple of (answer, list of source filenames, optional list of chunk texts for testing)
511
  """
512
+ start_time = time.perf_counter()
513
+
514
  if self.vectorstore is None:
515
  raise ValueError("No documents indexed. Please process documents first.")
516
 
517
+ # Step 0: Classify question into law-new, law-followup, or general
518
+ classification_start = time.perf_counter()
519
+ label, answer, sources, chunks = self._classify_question(question, use_history, model_provider)
520
+ classification_time = (time.perf_counter() - classification_start) * 1000
521
+ print(f"[Timing] Question classification: {classification_time:.2f}ms")
522
+
523
+ # If general question was handled, return the result immediately
524
+ if label == "general":
525
+ return answer, sources, chunks
526
+
527
+ # Step 1: Find most similar summary (law-related questions only)
528
+ # Check if there's a previous document to potentially reuse
529
+ search_start = time.perf_counter()
530
+ previous_document = None
531
+ if use_history:
532
+ previous_document = self.chat_history.get_last_document()
533
 
 
534
  # Build search query with last chat turn context if history is enabled
535
  search_query = question
536
  if use_history:
 
547
  # Combine with current question
548
  search_query = f"{last_turn_text}\nCurrent Q: {question}"
549
 
550
+ # Perform similarity search with scores for relevance checking
551
+ # Use k=3 to get multiple candidates for comparison
552
+ similar_docs_with_scores = self.vectorstore.similarity_search_with_score(search_query, k=3)
553
+ search_time = (time.perf_counter() - search_start) * 1000
554
+ print(f"[Timing] Similarity search: {search_time:.2f}ms")
555
 
556
+ if not similar_docs_with_scores:
557
+ return "I couldn't find any relevant information to answer your question.", [], None
558
 
559
+ # Extract best matching document and score
560
+ best_doc, best_score = similar_docs_with_scores[0]
561
+ best_filename = best_doc.metadata.get("filename", "")
562
 
563
+ # Step 2: Check if we should reuse previous document
564
+ matched_filename = best_filename
565
+ if previous_document and use_history:
566
+ # Check if previous document is in the search results
567
+ previous_doc_found = False
568
+ previous_doc_score = None
569
+
570
+ for doc, score in similar_docs_with_scores:
571
+ filename = doc.metadata.get("filename", "")
572
+ if filename == previous_document:
573
+ previous_doc_found = True
574
+ previous_doc_score = score
575
+ break
576
+
577
+ if previous_doc_found and previous_doc_score is not None:
578
+ # Check if previous document score is close to best score
579
+ # FAISS returns distance scores (lower is better), so we compare the difference
580
+ score_difference = abs(best_score - previous_doc_score)
581
+ # If difference is small (within 0.15), reuse previous document
582
+ # This threshold can be adjusted based on testing
583
+ relevance_threshold = 0.15
584
+
585
+ if score_difference <= relevance_threshold:
586
+ matched_filename = previous_document
587
+ print(f"[RAG] Reusing previous document: {matched_filename} (score diff: {score_difference:.4f})")
588
+ else:
589
+ print(f"[RAG] Previous document less relevant, using best match: {best_filename} (score diff: {score_difference:.4f})")
590
+ else:
591
+ print(f"[RAG] Previous document not in top results, using best match: {best_filename}")
592
 
593
+ # Get the matched document object
594
+ matched_doc = None
595
+ for doc, _ in similar_docs_with_scores:
596
+ if doc.metadata.get("filename", "") == matched_filename:
597
+ matched_doc = doc
598
+ break
599
 
600
+ # If matched document not found in results (shouldn't happen), use best match
601
+ if matched_doc is None:
602
+ matched_doc = best_doc
603
+ matched_filename = best_filename
604
 
605
+ # Print the filename and most similar summary
606
+ print(f"[RAG] Matched filename: {matched_filename}")
607
+
608
+
609
+ if not matched_filename:
610
+ return "Error: No filename found in matched document metadata.", [], None
611
+
612
+ # Step 3: Retrieve full text from JSON (with caching)
613
+ retrieval_start = time.perf_counter()
614
+ full_text = self._get_text_by_filename_cached(matched_filename)
615
+ retrieval_time = (time.perf_counter() - retrieval_start) * 1000
616
+ print(f"[Timing] Text retrieval from JSON: {retrieval_time:.2f}ms")
617
 
618
  if not full_text:
619
+ # Load JSON to get available filenames for error message
620
+ docs = self._load_json_cached()
621
+ available_filenames = [doc.get("filename", "unknown") for doc in docs] if isinstance(docs, list) else []
 
 
 
622
 
623
+ error_msg = f"Could not retrieve text for document: '{matched_filename}'. "
624
+ if available_filenames:
625
+ error_msg += f"Available filenames in JSON: {', '.join(available_filenames)}"
626
+ else:
627
+ error_msg += "JSON file is empty or invalid."
628
+ return error_msg, [matched_filename], None
629
+
 
 
 
 
 
 
 
 
 
 
630
 
631
+ # Step 4: Build context (full document or top semantic chunks), prompt, and chat history
632
+ prompt_start = time.perf_counter()
633
  history_messages = []
634
  if use_history:
635
  # Get last 3 messages (get 2 turns = 4 messages, then take last 3)
636
  history_messages = self.chat_history.get_recent_history(n_turns=2)
637
 
638
+ # Decide how to construct document context for the LLM
639
+ context_mode_normalized = (context_mode or "full").lower()
640
+ if context_mode_normalized not in ["full", "chunks"]:
641
+ context_mode_normalized = "full"
642
+
643
+ # Default: use full document text (truncated)
644
+ document_context_label = "Document Text"
645
+ selected_chunks: Optional[List[str]] = None # Store chunks for return to frontend
646
+ if context_mode_normalized == "full":
647
+ print(f"[RAG] full mode ...")
648
+ document_context = full_text[:16000] # Limit to avoid token limits
649
+ else:
650
+ print(f"[RAG] Chunk mode ...")
651
+ # Check if we should reuse previous chunks (only for law-followup AND same document)
652
+ previous_chunks = None
653
+ if label == "law-followup" and use_history:
654
+ previous_chunks = self.chat_history.get_last_chunks()
655
+ previous_doc = self.chat_history.get_last_document()
656
+ if previous_chunks and previous_doc == matched_filename:
657
+ print(f"[RAG] Reusing previous chunks for law-followup question ({len(previous_chunks)} chunks)")
658
+ selected_chunks = previous_chunks # Reuse previous chunks
659
+ document_context_label = "Selected Document Excerpts"
660
+ chunk_texts: List[str] = []
661
+ for idx, chunk_text in enumerate(previous_chunks, start=1):
662
+ chunk_texts.append(f"[مقطع {idx}]\n{chunk_text}")
663
+ document_context = "\n\n".join(chunk_texts)[:25000]
664
+ else:
665
+ previous_chunks = None # Can't reuse, do new search
666
+ print(f"[RAG] Cannot reuse chunks: law-followup but different document or no previous chunks")
667
+
668
+ # If not reusing previous chunks, do normal chunk search (for law-new or when reuse not possible)
669
+ if previous_chunks is None:
670
+ # Chunk mode: build or load per-document chunk vectorstore and retrieve top-k chunks
671
+ chunk_vs, _ = self._get_or_build_chunk_vectorstore(matched_filename, full_text)
672
+ # Use the current question as the chunk search query
673
+ # (we already used enriched search_query for document selection)
674
+ top_k = 4
675
+ try:
676
+ top_chunks = chunk_vs.similarity_search(question, k=top_k)
677
+ except Exception as e:
678
+ print(f"[RAG] Chunk similarity search failed for {matched_filename}, falling back to full text: {e}")
679
+ document_context = full_text[:25000]
680
+ context_mode_normalized = "full"
681
+ else:
682
+ if not top_chunks:
683
+ print(f"[RAG] No chunks returned for {matched_filename}, falling back to full text")
684
+ document_context = full_text[:8000]
685
+ context_mode_normalized = "full"
686
+ else:
687
+ document_context_label = "Selected Document Excerpts"
688
+ chunk_texts: List[str] = []
689
+ selected_chunks = [] # Store raw chunk texts for return
690
+ for idx, doc in enumerate(top_chunks, start=1):
691
+ chunk_text = doc.page_content
692
+ selected_chunks.append(chunk_text) # Store raw chunk text
693
+ chunk_texts.append(f"[مقطع {idx}]\n{chunk_text}")
694
+ document_context = "\n\n".join(chunk_texts)[:20000]
695
+
696
+ # Build prompts
697
+ mode_note = ""
698
+ if context_mode_normalized == "chunks":
699
+ mode_note = (
700
+ "\n\nNote: The provided document text consists of selected relevant excerpts (مقاطع) "
701
+ "from the same document, not the full law. Answer strictly based on these excerpts."
702
+ )
703
+
704
+ system_prompt = f"""You are a helpful legal document assistant. Answer questions based on the provided document text. {mode_note}
705
 
706
  MODE 1 - General Questions:
707
  - Understand the context and provide a clear, helpful answer
708
  - You may paraphrase or summarize information from the document
709
  - Explain concepts in your own words while staying true to the document's meaning
 
710
 
711
  MODE 2 - Legal Articles/Terms (المادة):
712
+ - When the user asks about specific legal articles (المادة), legal terms, or exact regulations, you MUST quote the EXACT text from the document (context) verbatim
714
  - Copy the complete text word-for-word, including all numbers, punctuation, and formatting
715
+ - Do NOT paraphrase, summarize, or generate new text for legal articles (المادة)
716
  - NEVER create or generate legal text - only use what exists in the document
717
 
718
+ IMPORTANT - Response Format:
719
+ - Do NOT include source citations in your answer (e.g., do NOT write "المصدر: نظام الاحوال الشخصية.pdf" or similar source references)
720
+ - Do NOT mention the document filename or source at the end of your answer
721
+ - Simply provide the answer directly without any source attribution
722
+ - Whenever you refer to the document (context or filename) in the response, refer to it by the filename WITHOUT the extension such as ".pdf" or ".doc"
723
+
724
+
725
  If the answer is not in the document, say so clearly. MUST Answer in Arabic."""
726
 
727
  # Check if question contains legal article/term keywords
 
729
 
730
  legal_instruction = ""
731
  if is_legal_term_question:
732
+ legal_instruction = "\n\nCRITICAL: The user is asking about a legal article or legal term. Carefully search the provided context to find the relevant article. Reference the article correctly as it has been stated in the context. Articles might be referenced by their content, position, or topic - for example, 'المادة الأولى' might refer to the first article in a section even if not explicitly numbered. Find and quote the relevant text accurately from the document, maintaining the exact wording as it appears. Do NOT create or generate legal text - only use what exists in the document."
733
  else:
734
  legal_instruction = "\n\nAnswer the question by understanding the context from the document. You may paraphrase or explain in your own words while staying true to the document's meaning."
735
 
736
+ user_prompt = f"""{document_context_label}:
737
+ {document_context}
738
 
739
  User Question: {question}
740
  {legal_instruction}
741
 
742
+ Please answer the question based on the document text above.
743
+ MUST Answer the Question in Arabic."""
744
 
745
  messages = [
746
  {"role": "system", "content": system_prompt}
 
753
  messages.append(msg)
754
 
755
  messages.append({"role": "user", "content": user_prompt})
756
+ prompt_time = (time.perf_counter() - prompt_start) * 1000
757
+ print(f"[Timing] Prompt construction: {prompt_time:.2f}ms")
758
 
759
  # Step 5: Get answer from LLM
760
+ llm_start = time.perf_counter()
761
  try:
762
  # Initialize client based on model_provider
763
  if model_provider.lower() in ["qwen", "huggingface"]:
 
778
  llm_client = self.llm_client
779
  llm_model = self.llm_model
780
 
781
+ # Get answer from LLM (non-streaming)
782
  response = llm_client.chat.completions.create(
783
  model=llm_model,
784
  messages=messages,
785
  temperature=0.3
786
  )
787
+ raw_response = response.choices[0].message.content
788
+ llm_time = (time.perf_counter() - llm_start) * 1000
789
+ print(f"[Timing] LLM API call: {llm_time:.2f}ms")
790
 
791
  # Filter thinking process from Qwen responses
792
  if model_provider.lower() in ["qwen", "huggingface"]:
793
+ raw_response = self._filter_thinking_process(raw_response)
794
+
795
+ # Step 6: Parse LLM response to extract answer
796
+ parse_start = time.perf_counter()
797
+ answer = self._parse_llm_response(raw_response)
798
+ parse_time = (time.perf_counter() - parse_start) * 1000
799
+ print(f"[Timing] Response parsing: {parse_time:.2f}ms")
800
 
801
+ # Step 7: Update chat history with document source and chunks
802
  self.chat_history.add_message("user", question)
803
+ self.chat_history.add_message("assistant", answer, source_document=matched_filename, chunks=selected_chunks)
804
+
805
+ total_time = (time.perf_counter() - start_time) * 1000
806
+ print(f"[Timing] Total inference time: {total_time:.2f}ms")
807
 
808
+ return answer, [matched_filename], selected_chunks
809
  except Exception as e:
810
+ total_time = (time.perf_counter() - start_time) * 1000
811
+ print(f"[Timing] Total inference time (error): {total_time:.2f}ms")
812
  error_msg = f"Error generating answer: {str(e)}"
813
  self.chat_history.add_message("user", question)
814
+ self.chat_history.add_message("assistant", error_msg, source_document=matched_filename, chunks=None)
815
+ return error_msg, [matched_filename], None
816
 
817
  def clear_chat_history(self):
818
  """Clear chat history"""
documents/شرح نظام الأحوال الشخصية.pdf DELETED
@@ -1,3 +0,0 @@
1
- version https://git-lfs.github.com/spec/v1
2
- oid sha256:2322f04afd2f881ad7118847022039c633d30e3552b84947f0ef81d3702fe444
3
- size 2779178
 
 
 
 
files_upload.py DELETED
@@ -1,42 +0,0 @@
1
- from huggingface_hub import HfApi
2
- from pathlib import Path
3
- import os
4
- from dotenv import load_dotenv
5
-
6
- # Load environment variables from .env if present
7
- load_dotenv()
8
-
9
- # Get token from environment variable (more secure)
10
- token = os.getenv("HF_TOKEN")
11
- if not token:
12
- raise ValueError("HF_TOKEN environment variable not set. Set it with: export HF_TOKEN='your_token_here'")
13
-
14
- # Initialize API with token
15
- api = HfApi(token=token)
16
- repo_id = "AldawsariNLP/Saudi-Law-AI-Assistant"
17
-
18
- # Upload all PDFs from local documents folder
19
- local_docs = Path("documents")
20
- pdf_files = list(local_docs.glob("*.pdf"))
21
-
22
- if not pdf_files:
23
- print("No PDF files found in documents/ folder; skipping upload.")
24
- exit(0)
25
-
26
- print(f"Found {len(pdf_files)} PDF file(s) to upload")
27
- for pdf_file in pdf_files:
28
- print(f"Uploading {pdf_file.name}...")
29
- try:
30
- api.upload_file(
31
- path_or_fileobj=str(pdf_file),
32
- path_in_repo=f"documents/{pdf_file.name}",
33
- repo_id=repo_id,
34
- repo_type="space",
35
- token=token, # Also pass token here for safety
36
- )
37
- print(f"✓ Successfully uploaded {pdf_file.name}")
38
- except Exception as e:
39
- print(f"✗ Failed to upload {pdf_file.name}: {e}")
40
- raise
41
-
42
- print("Upload complete!")
 
frontend/src/App.js CHANGED
@@ -40,7 +40,12 @@ function App() {
40
  e.preventDefault();
41
  if (!input.trim() || loading) return;
42
 
43
- const userMessage = { role: 'user', content: input };
 
 
 
 
 
44
  setMessages(prev => [...prev, userMessage]);
45
  setInput('');
46
  setLoading(true);
@@ -51,15 +56,17 @@ function App() {
51
  });
52
 
53
  const assistantMessage = {
 
54
  role: 'assistant',
55
  content: response.data.answer,
56
- sources: response.data.sources
57
  };
58
  setMessages(prev => [...prev, assistantMessage]);
59
  } catch (error) {
60
  const errorMessage = {
 
61
  role: 'assistant',
62
- content: error.response?.data?.detail || 'عذراً، حدث خطأ. يرجى المحاولة مرة أخرى.',
63
  error: true
64
  };
65
  setMessages(prev => [...prev, errorMessage]);
@@ -98,10 +105,21 @@ function App() {
98
  setPreviewLoading(false);
99
  };
100
 
 
 
 
 
 
 
 
 
 
 
101
  const handleSourceClick = async (source) => {
102
  if (!source) return;
103
  const filename = source.split('/').pop() || source;
104
  const extension = filename.split('.').pop()?.toLowerCase();
 
105
  setPreviewFilename(filename);
106
  setPreviewError(null);
107
  setPreviewLoading(true);
@@ -111,19 +129,51 @@ function App() {
111
  }
112
  setPreviewUrl(null);
113
  if (extension !== 'pdf') {
114
- setPreviewError('المعاينة متاحة فقط لملفات PDF.');
 
 
115
  setPreviewLoading(false);
116
  return;
117
  }
118
  try {
119
  const url = `${DOCS_URL}/${encodeURIComponent(filename)}?mode=preview`;
120
- const response = await axios.get(url, { responseType: 'blob' });
 
 
 
 
 
 
 
 
 
 
121
  const blob = new Blob([response.data], { type: 'application/pdf' });
122
  const objectUrl = URL.createObjectURL(blob);
123
  previewUrlRef.current = objectUrl;
124
  setPreviewUrl(objectUrl);
 
125
  } catch (error) {
126
- setPreviewError(error.response?.data?.detail || 'تعذر تحميل المعاينة.');
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
127
  } finally {
128
  setPreviewLoading(false);
129
  }
@@ -196,10 +246,10 @@ function App() {
196
  }
197
  };
198
  return (
199
- <div key={idx} className={`message ${msg.role}`}>
200
  <div className="message-content">
201
  <div className="message-header">
202
- {msg.role === 'user' ? '👤 أنت' : '🤖 المساعد'}
203
  </div>
204
  <div className={`message-text ${msg.error ? 'error' : ''}`}>
205
  {renderContent()}
@@ -210,7 +260,7 @@ function App() {
210
  <ul>
211
  {msg.sources.map((source, i) => (
212
  <li key={i}>
213
- <span className="source-name">{source.split('/').pop()}</span>
214
  <div className="source-actions">
215
  <button
216
  type="button"
 
40
  e.preventDefault();
41
  if (!input.trim() || loading) return;
42
 
43
+ // Use unique IDs to prevent collision
44
+ const baseTime = Date.now();
45
+ const userMessageId = baseTime;
46
+ const assistantMessageId = baseTime + 1; // Ensure different ID
47
+
48
+ const userMessage = { id: userMessageId, role: 'user', content: input };
49
  setMessages(prev => [...prev, userMessage]);
50
  setInput('');
51
  setLoading(true);
 
56
  });
57
 
58
  const assistantMessage = {
59
+ id: assistantMessageId,
60
  role: 'assistant',
61
  content: response.data.answer,
62
+ sources: response.data.sources || []
63
  };
64
  setMessages(prev => [...prev, assistantMessage]);
65
  } catch (error) {
66
  const errorMessage = {
67
+ id: assistantMessageId,
68
  role: 'assistant',
69
+ content: error.response?.data?.detail || error.message || 'عذراً، حدث خطأ. يرجى المحاولة مرة أخرى.',
70
  error: true
71
  };
72
  setMessages(prev => [...prev, errorMessage]);
 
105
  setPreviewLoading(false);
106
  };
107
 
108
+ const getDisplaySourceName = (source) => {
109
+ if (!source) return '';
110
+ const fullName = source.split('/').pop() || source;
111
+ const lastDot = fullName.lastIndexOf('.');
112
+ if (lastDot > 0) {
113
+ return fullName.substring(0, lastDot);
114
+ }
115
+ return fullName;
116
+ };
117
+
118
  const handleSourceClick = async (source) => {
119
  if (!source) return;
120
  const filename = source.split('/').pop() || source;
121
  const extension = filename.split('.').pop()?.toLowerCase();
122
+ console.log('[Preview] Requesting preview for:', filename);
123
  setPreviewFilename(filename);
124
  setPreviewError(null);
125
  setPreviewLoading(true);
 
129
  }
130
  setPreviewUrl(null);
131
  if (extension !== 'pdf') {
132
+ const errorMsg = 'المعاينة متاحة فقط لملفات PDF.';
133
+ console.error('[Preview] Error:', errorMsg);
134
+ setPreviewError(errorMsg);
135
  setPreviewLoading(false);
136
  return;
137
  }
138
  try {
139
  const url = `${DOCS_URL}/${encodeURIComponent(filename)}?mode=preview`;
140
+ console.log('[Preview] Requesting URL:', url);
141
+ const response = await axios.get(url, {
142
+ responseType: 'blob',
143
+ timeout: 30000 // 30 second timeout
144
+ });
145
+ console.log('[Preview] Response received, status:', response.status, 'size:', response.data.size);
146
+
147
+ if (!response.data || response.data.size === 0) {
148
+ throw new Error('Received empty file');
149
+ }
150
+
151
  const blob = new Blob([response.data], { type: 'application/pdf' });
152
  const objectUrl = URL.createObjectURL(blob);
153
  previewUrlRef.current = objectUrl;
154
  setPreviewUrl(objectUrl);
155
+ console.log('[Preview] Successfully created object URL');
156
  } catch (error) {
157
+ console.error('[Preview] Error details:', {
158
+ message: error.message,
159
+ response: error.response?.data,
160
+ status: error.response?.status,
161
+ statusText: error.response?.statusText,
162
+ url: error.config?.url
163
+ });
164
+
165
+ let errorMsg = 'تعذر تحميل المعاينة.';
166
+ if (error.response?.data?.detail) {
167
+ errorMsg = `خطأ: ${error.response.data.detail}`;
168
+ } else if (error.response?.status === 404) {
169
+ errorMsg = 'الملف غير موجود.';
170
+ } else if (error.response?.status === 403) {
171
+ errorMsg = 'غير مسموح بالوصول إلى هذا الملف.';
172
+ } else if (error.message) {
173
+ errorMsg = `خطأ: ${error.message}`;
174
+ }
175
+
176
+ setPreviewError(errorMsg);
177
  } finally {
178
  setPreviewLoading(false);
179
  }
 
246
  }
247
  };
248
  return (
249
+ <div key={msg.id || idx} className={`message ${msg.role}`}>
250
  <div className="message-content">
251
  <div className="message-header">
252
+ {msg.role === 'user' ? '👤 أنت' : '🤖 المساعد القانوني'}
253
  </div>
254
  <div className={`message-text ${msg.error ? 'error' : ''}`}>
255
  {renderContent()}
 
260
  <ul>
261
  {msg.sources.map((source, i) => (
262
  <li key={i}>
263
+ <span className="source-name">{getDisplaySourceName(source)}</span>
264
  <div className="source-actions">
265
  <button
266
  type="button"
processed_documents.json CHANGED
The diff for this file is too large to render. See raw diff
 
test_nebius_embeddings.py ADDED
@@ -0,0 +1,292 @@
 
1
+ #!/usr/bin/env python3
2
+ """
3
+ Test script for Nebius Embeddings API via HuggingFace Router
4
+ Tests direct API calls to verify authentication and functionality
5
+ """
6
+
7
+ import os
8
+ import sys
9
+ import requests
10
+ from pathlib import Path
11
+ from dotenv import load_dotenv
12
+
13
+ try:
14
+ from huggingface_hub import InferenceClient
15
+ HF_HUB_AVAILABLE = True
16
+ except ImportError:
17
+ HF_HUB_AVAILABLE = False
18
+ print("WARNING: huggingface_hub not available. InferenceClient test will be skipped.")
19
+
20
+ # Load .env from project root
21
+ project_root = Path(__file__).resolve().parent
22
+ load_dotenv(project_root / ".env")
23
+
24
+ API_URL = "https://router.huggingface.co/nebius/v1/embeddings"
25
+ MODEL = os.getenv("HF_EMBEDDING_MODEL", "Qwen/Qwen3-Embedding-8B")
26
+
27
+ def get_headers():
28
+ """Get authorization headers"""
29
+ hf_token = os.getenv("HF_TOKEN")
30
+ if not hf_token:
31
+ print("ERROR: HF_TOKEN environment variable is not set!")
32
+ print("Please set HF_TOKEN in your .env file or environment variables.")
33
+ sys.exit(1)
34
+
35
+ return {
36
+ "Authorization": f"Bearer {hf_token}",
37
+ "Content-Type": "application/json"
38
+ }
39
+
40
+ def query(payload):
41
+ """Make API request to Nebius embeddings endpoint"""
42
+ headers = get_headers()
43
+ try:
44
+ response = requests.post(API_URL, headers=headers, json=payload, timeout=60.0)
45
+ return response
46
+ except requests.exceptions.RequestException as e:
47
+ print(f"ERROR: Request failed: {e}")
48
+ return None
49
+
50
+ def test_single_text():
51
+ """Test embedding a single text"""
52
+ print("\n" + "="*60)
53
+ print("TEST 1: Single Text Embedding")
54
+ print("="*60)
55
+
56
+ test_text = "ما هي المادة المتعلقة بالنفقة في نظام الأحوال الشخصية؟"
57
+ print(f"Input text: {test_text}")
58
+ print(f"Model: {MODEL}")
59
+
60
+ payload = {
61
+ "model": MODEL,
62
+ "input": test_text
63
+ }
64
+
65
+ response = query(payload)
66
+ if response is None:
67
+ return False
68
+
69
+ print(f"\nStatus Code: {response.status_code}")
70
+
71
+ if response.status_code == 200:
72
+ data = response.json()
73
+ print(f"Response keys: {list(data.keys())}")
74
+
75
+ if "data" in data and len(data["data"]) > 0:
76
+ embedding = data["data"][0]["embedding"]
77
+ print(f"Embedding dimensions: {len(embedding)}")
78
+ print(f"First 10 values: {embedding[:10]}")
79
+ print(f"Last 10 values: {embedding[-10:]}")
80
+ print("✓ Single text embedding successful!")
81
+ return True
82
+ else:
83
+ print(f"Unexpected response format: {data}")
84
+ return False
85
+ else:
86
+ print(f"ERROR: Request failed with status {response.status_code}")
87
+ print(f"Response: {response.text}")
88
+ if response.status_code == 401:
89
+ print("\nAuthentication failed. Please check:")
90
+ print("1. HF_TOKEN is correct and valid")
91
+ print("2. Token has proper permissions for Nebius provider")
92
+ print("3. Token is not expired")
93
+ return False
94
+
95
+ def test_batch_texts():
96
+ """Test embedding multiple texts"""
97
+ print("\n" + "="*60)
98
+ print("TEST 2: Batch Text Embedding")
99
+ print("="*60)
100
+
101
+ test_texts = [
102
+ "ما هي المادة المتعلقة بالنفقة؟",
103
+ "ما هي شروط الزواج؟",
104
+ "كيف يتم الطلاق؟"
105
+ ]
106
+ print(f"Input texts ({len(test_texts)}):")
107
+ for i, text in enumerate(test_texts, 1):
108
+ print(f" {i}. {text}")
109
+ print(f"Model: {MODEL}")
110
+
111
+ payload = {
112
+ "model": MODEL,
113
+ "input": test_texts
114
+ }
115
+
116
+ response = query(payload)
117
+ if response is None:
118
+ return False
119
+
120
+ print(f"\nStatus Code: {response.status_code}")
121
+
122
+ if response.status_code == 200:
123
+ data = response.json()
124
+ print(f"Response keys: {list(data.keys())}")
125
+
126
+ if "data" in data:
127
+ print(f"Number of embeddings returned: {len(data['data'])}")
128
+ for i, item in enumerate(data["data"]):
129
+ embedding = item["embedding"]
130
+ print(f" Embedding {i+1}: {len(embedding)} dimensions")
131
+ print("✓ Batch text embedding successful!")
132
+ return True
133
+ else:
134
+ print(f"Unexpected response format: {data}")
135
+ return False
136
+ else:
137
+ print(f"ERROR: Request failed with status {response.status_code}")
138
+ print(f"Response: {response.text}")
139
+ return False
140
+
141
+def test_huggingface_hub_client():
+    """Test using HuggingFace Hub InferenceClient (same approach as HuggingFaceEmbeddingsWrapper)"""
+    print("\n" + "="*60)
+    print("TEST 3: HuggingFace Hub InferenceClient")
+    print("="*60)
+
+    if not HF_HUB_AVAILABLE:
+        print("SKIPPED: huggingface_hub package not installed")
+        return None
+
+    hf_token = os.getenv("HF_TOKEN")
+    if not hf_token:
+        print("ERROR: HF_TOKEN not set")
+        return False
+
+    test_text = "ما هي المادة المتعلقة بالنفقة في نظام الأحوال الشخصية؟"
+    print(f"Input text: {test_text}")
+    print(f"Model: {MODEL}")
+    print("Provider: nebius")
+
+    try:
+        # Initialize client (same as HuggingFaceEmbeddingsWrapper)
+        client = InferenceClient(
+            provider="nebius",
+            api_key=hf_token
+        )
+        print("✓ InferenceClient initialized successfully")
+
+        # Test feature_extraction (same as HuggingFaceEmbeddingsWrapper)
+        print("Calling client.feature_extraction()...")
+        result = client.feature_extraction(
+            test_text,
+            model=MODEL
+        )
+
+        # InferenceClient returns a numpy.ndarray; convert it to a plain list
+        import numpy as np
+
+        if isinstance(result, np.ndarray):
+            # tolist() handles both 2D (batch) and 1D (single) arrays
+            result = result.tolist()
+
+        if isinstance(result, list):
+            # Handle nested list (batch) or flat list (single)
+            if len(result) > 0 and isinstance(result[0], list):
+                # Batch result
+                print("✓ Feature extraction successful! (batch format)")
+                print(f"Number of embeddings: {len(result)}")
+                for i, emb in enumerate(result):
+                    print(f" Embedding {i+1}: {len(emb)} dimensions")
+            else:
+                # Single result
+                print("✓ Feature extraction successful!")
+                print(f"Embedding dimensions: {len(result)}")
+                print(f"First 10 values: {result[:10]}")
+                print(f"Last 10 values: {result[-10:]}")
+
+            # Test batch processing
+            print("\nTesting batch processing...")
+            test_texts = [
+                "ما هي المادة المتعلقة بالنفقة؟",
+                "ما هي شروط الزواج؟"
+            ]
+            results = []
+            for text in test_texts:
+                embedding = client.feature_extraction(text, model=MODEL)
+                # Convert numpy array to list if needed
+                if isinstance(embedding, np.ndarray):
+                    if embedding.ndim == 2:
+                        embedding = embedding.tolist()[0]  # Extract first row if 2D
+                    else:
+                        embedding = embedding.tolist()
+                results.append(embedding)
+            print(f"✓ Batch processing successful! Processed {len(results)} texts")
+            print(f" Embedding 1: {len(results[0])} dimensions")
+            print(f" Embedding 2: {len(results[1])} dimensions")
+
+            return True
+        else:
+            print(f"Unexpected result format: {type(result)}")
+            print(f"Result: {result}")
+            return False
+
+    except Exception as e:
+        print("ERROR: InferenceClient test failed")
+        print(f"Error type: {type(e).__name__}")
+        print(f"Error message: {str(e)}")
+
+        # Provide helpful error messages
+        if "401" in str(e) or "Unauthorized" in str(e):
+            print("\nAuthentication failed. Please check:")
+            print("1. HF_TOKEN is correct and valid")
+            print("2. Token has proper permissions for Nebius provider")
+            print("3. Token is not expired")
+        elif "404" in str(e) or "Not Found" in str(e):
+            print("\nModel or endpoint not found. Please check:")
+            print(f"1. Model '{MODEL}' is available on Nebius")
+            print("2. Provider 'nebius' is correctly configured")
+
+        return False
+
+def main():
+    """Run all tests"""
+    print("Nebius Embeddings API Test")
+    print("="*60)
+    print(f"API URL: {API_URL}")
+    print(f"Model: {MODEL}")
+    print(f"HF_TOKEN: {'*' * 20 if os.getenv('HF_TOKEN') else 'NOT SET'}")
+
+    # Check if token is set
+    if not os.getenv("HF_TOKEN"):
+        print("\nERROR: HF_TOKEN not found!")
+        print("Please set it in your .env file:")
+        print(" HF_TOKEN=your_token_here")
+        sys.exit(1)
+
+    # Run tests
+    results = []
+    results.append(("Single Text (Direct API)", test_single_text()))
+    results.append(("Batch Texts (Direct API)", test_batch_texts()))
+
+    # Test HuggingFace Hub InferenceClient if available
+    if HF_HUB_AVAILABLE:
+        hf_result = test_huggingface_hub_client()
+        if hf_result is not None:
+            results.append(("HuggingFace Hub InferenceClient", hf_result))
+
+    # Summary
+    print("\n" + "="*60)
+    print("TEST SUMMARY")
+    print("="*60)
+    for test_name, success in results:
+        status = "✓ PASSED" if success else "✗ FAILED"
+        print(f"{test_name}: {status}")
+
+    all_passed = all(result[1] for result in results)
+    if all_passed:
+        print("\n✓ All tests passed! API is working correctly.")
+        sys.exit(0)
+    else:
+        print("\n✗ Some tests failed. Check the errors above.")
+        sys.exit(1)
+
+if __name__ == "__main__":
+    main()
+
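A minimal way to run this test script locally, assuming the file added by this commit is saved as `test_nebius_embeddings.py` in the project root (the filename is hypothetical; it is not visible in this hunk) and that `HF_TOKEN` is set in `.env` or the shell environment:

```bash
# Hypothetical filename; adjust to the path actually added in this commit
uv run python test_nebius_embeddings.py
```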