Update content.py
content.py +1 -1
content.py
CHANGED
@@ -14,7 +14,7 @@ LLMLagBench provides a systematic approach for **identifying the earliest probab
 an LLM's training data by evaluating its knowledge of recent events. The benchmark comprises **1,700+ curated questions**
 about events sampled from news reports published between January 2021 and October 2025. We plan to update the question set regularly. Each
 question could not be accurately answered before the event was reported in news media. We evaluate model
-responses using a **0-2 scale faithfulness metric** and apply the **PELT (Pruned Exact Linear Time)** changepoint
+responses using a **0-2 scale faithfulness metric** (the accuracy of model responses to time-sensitive queries, judged against gold answers) and apply the **PELT (Pruned Exact Linear Time)** changepoint
 detection algorithm to identify where model performance exhibits statistically significant drops,
 revealing their actual knowledge cutoffs.
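The PELT changepoint step described in the diff can be sketched in pure Python. This is a simplified PELT with an L2 (squared-deviation) cost on hypothetical monthly faithfulness scores, not the benchmark's actual pipeline; libraries such as `ruptures` provide production implementations.

```python
import numpy as np

def segment_cost(prefix, prefix2, s, t):
    """Sum of squared deviations from the mean for scores[s:t] (L2 cost)."""
    n = t - s
    seg_sum = prefix[t] - prefix[s]
    seg_sq = prefix2[t] - prefix2[s]
    return seg_sq - seg_sum**2 / n

def pelt(scores, penalty):
    """Minimal PELT: optimal partitioning with candidate pruning."""
    n = len(scores)
    prefix = np.concatenate([[0.0], np.cumsum(scores)])
    prefix2 = np.concatenate([[0.0], np.cumsum(np.square(scores))])
    F = [0.0] + [np.inf] * n      # F[t] = best total cost for scores[:t]
    last = [0] * (n + 1)          # backpointer to the previous changepoint
    candidates = [0]
    for t in range(1, n + 1):
        costs = [F[s] + segment_cost(prefix, prefix2, s, t) + penalty
                 for s in candidates]
        best = int(np.argmin(costs))
        F[t] = costs[best]
        last[t] = candidates[best]
        # Pruning: drop start points that can never be optimal again
        # (valid for the subadditive L2 cost), keeping PELT near-linear.
        candidates = [s for s, c in zip(candidates, costs)
                      if c - penalty <= F[t]] + [t]
    # Backtrack the optimal changepoints.
    cps, t = [], n
    while t > 0:
        t = last[t]
        if t > 0:
            cps.append(t)
    return sorted(cps)

# Hypothetical monthly faithfulness scores (0-2 scale): strong before the
# model's knowledge cutoff, collapsing afterwards.
scores = np.array([1.8, 1.7, 1.9, 1.8, 1.7, 1.8,
                   0.4, 0.5, 0.3, 0.4, 0.5, 0.4])
print(pelt(scores, penalty=1.0))  # -> [6]: the drop begins at month index 6
```

The penalty controls sensitivity: a higher value demands a larger performance drop before a changepoint is declared, which is how statistically insignificant month-to-month noise is filtered out.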