mackenzietechdocs committed
Commit af1f59b · verified · 1 parent: 6cee5e8

Add BoolQ evaluation results via inspect-ai on HF Jobs

**Description:**

This PR adds BoolQ evaluation results for `openai/gpt-oss-20b`, following the Hugging Face Skills evaluation workflow.

- Benchmark: BoolQ (google/boolq, validation split)
- Task: `inspect_evals/boolq`
- Framework: `inspect-ai` + `inspect-evals`
- Infra: `hf jobs uv run` on `a10g-small`, with model calls routed through Inference Providers
- Metric: accuracy = 89.1% (stderr = 0.005 on the 0-1 scale, i.e. about 0.5 percentage points)
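
As a rough sanity check on the reported standard error, the sketch below assumes each BoolQ answer is scored as an independent 0/1 outcome, so the standard error of the mean accuracy is approximately sqrt(p(1-p)/n). It uses the 3,270-example validation split size stated in the README change. The helper name `binomial_stderr` is hypothetical, and this is not necessarily how `inspect-ai` computes its `stderr` metric.

```python
import math


def binomial_stderr(accuracy: float, n: int) -> float:
    """Standard error of the mean of n independent 0/1 scores (assumed model)."""
    return math.sqrt(accuracy * (1.0 - accuracy) / n)


accuracy = 0.891   # reported accuracy, expressed as a fraction
n_examples = 3270  # BoolQ validation split size, taken from the README change

stderr = binomial_stderr(accuracy, n_examples)

# Normal-approximation 95% interval around the reported accuracy.
ci_low = accuracy - 1.96 * stderr
ci_high = accuracy + 1.96 * stderr

print(f"stderr = {stderr:.4f}")
print(f"95% CI = [{ci_low:.3f}, {ci_high:.3f}]")
```

Under these assumptions the estimate comes out at roughly 0.0054, which rounds to the reported 0.005, so the accuracy and standard error are mutually consistent for a split of this size.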

The command used was:

```bash
hf jobs uv run hf_model_evaluation/scripts/inspect_eval_uv.py \
--flavor a10g-small \
--secrets HF_TOKEN \
-- \
--model "openai/gpt-oss-20b" \
--task "inspect_evals/boolq"
```

Files changed (1)
  1. README.md +37 -1
README.md CHANGED
@@ -4,6 +4,21 @@ pipeline_tag: text-generation
4
  library_name: transformers
5
  tags:
6
  - vllm
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
7
  ---
8
 
9
  <p align="center">
@@ -179,4 +194,25 @@ This smaller model `gpt-oss-20b` can be fine-tuned on consumer hardware, whereas
179
  primaryClass={cs.CL},
180
  url={https://arxiv.org/abs/2508.10925},
181
  }
182
- ```
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
4
  library_name: transformers
5
  tags:
6
  - vllm
7
+ model-index:
8
+ - name: ChatGPT-OSS 20B
9
+ results:
10
+ - task:
11
+ name: BoolQ
12
+ type: boolq
13
+ dataset:
14
+ name: BoolQ
15
+ type: google/boolq
16
+ config: default
17
+ split: validation
18
+ metrics:
19
+ - name: accuracy
20
+ type: accuracy
21
+ value: 89.1
22
  ---
23
 
24
  <p align="center">
 
194
  primaryClass={cs.CL},
195
  url={https://arxiv.org/abs/2508.10925},
196
  }
197
+ ```
198
+
199
+ ## Evaluation
200
+
201
+ This model was evaluated on the **BoolQ** benchmark using the `inspect-ai` framework and `inspect-evals`, run on Hugging Face Jobs with Inference Providers.
202
+
203
+ **Benchmark:** BoolQ (google/boolq, validation split, 3,270 examples)
204
+ **Task:** `inspect_evals/boolq`
205
+ **Framework:** `inspect-ai` + `inspect-evals`
206
+ **Infrastructure:** `hf jobs uv run` on an `a10g-small` GPU
207
+ **Provider model:** `hf-inference-providers/openai/gpt-oss-20b`
208
+ **Metric:** accuracy = **89.1%** (stderr = 0.005)
209
+
210
+ **Command used:**
211
+
212
+ ```bash
213
+ hf jobs uv run hf_model_evaluation/scripts/inspect_eval_uv.py \
214
+ --flavor a10g-small \
215
+ --secrets HF_TOKEN \
216
+ -- \
217
+ --model "openai/gpt-oss-20b" \
218
+ --task "inspect_evals/boolq"