Strong guardrails still present

#2 opened by redaihf

Running standard tests on this model gives fascinating data. It pretends to comply but subverts or ignores topics deemed unsafe. No refusals, but obvious repeated noncompliance. Basic jailbreaking makes outputs worse: a giveaway. Makes me wonder whether the Heretic abliteration process is a valid construct.

I actually just tested the mini MoE and found the same situation. Outside of the uncensored stuff, though, it still excels at darker creative stories; it's pretty darn good at killing off main characters. At 2 experts per token it feels similar to a Llama 8B level of intelligence, so I would still call this model a win.

Hey,

Please note the following:

1 - The abliteration is only as good as the DATASETS used: if a refusal pattern is not represented in the dataset, it will not be "targeted".
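
For readers unfamiliar with the mechanics: in the common difference-of-means recipe that abliteration tools follow, a "refusal direction" is estimated from activations on paired harmful/harmless prompts and projected out of the weights. A refusal behavior the prompt set never triggers contributes nothing to that estimate, so it survives. A minimal PyTorch sketch of the general technique (not Heretic's actual code; the activation tensors are assumed to come from a chosen layer's residual stream):

```python
import torch

def refusal_direction(harmful_acts: torch.Tensor,
                      harmless_acts: torch.Tensor) -> torch.Tensor:
    """Difference-of-means "refusal direction" from [n_prompts, d_model] activations."""
    direction = harmful_acts.mean(dim=0) - harmless_acts.mean(dim=0)
    return direction / direction.norm()  # unit vector

def ablate_weight(W: torch.Tensor, d: torch.Tensor) -> torch.Tensor:
    """Project the refusal direction out of a weight matrix's output space.

    W: [d_model, d_in] weight that writes to the residual stream;
    d: unit refusal direction. W <- (I - d d^T) W removes the
    component of every output along d.
    """
    return W - torch.outer(d, d) @ W
```

If the dataset never elicits a given refusal style, the mean difference simply doesn't point at it, which is the point being made above.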

2 - These models range from 5 to 20 refusals per 100 prompts, with an average of 10. In many cases I accept a higher refusal rate in exchange for a lower KLD. "KLD" is the KL divergence of the abliterated model's behavior from the non-abliterated original, with "0" being a perfect match.
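
For concreteness, a KLD figure like this is typically the mean KL divergence between the original model's next-token distribution and the abliterated model's, over the same reference prompts; 0 means identical behavior. A rough sketch (variable names are mine, not Heretic's):

```python
import torch.nn.functional as F

def mean_kl(logits_orig, logits_ablit):
    """Mean per-token KL(original || abliterated), in nats.

    Both inputs: [n_tokens, vocab_size] logits from the same prompts.
    Returns 0.0 when the abliterated model behaves identically.
    """
    log_p = F.log_softmax(logits_orig, dim=-1)   # original distribution
    log_q = F.log_softmax(logits_ablit, dim=-1)  # abliterated distribution
    # kl_div(input=log_q, target=log_p, log_target=True) = sum p * (log p - log q)
    return F.kl_div(log_q, log_p, log_target=True, reduction="batchmean")
```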

Extensive testing was done at multiple refusal rates to determine the "best case" (lowest refusals) vs. "benchmark" level.
During testing, example prompts were used to confirm full censorship, declining censorship, and finally no censorship (for my specific use cases).
That threshold was roughly 20/100 (it will vary by model family).
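
A refusals-per-100 score like the ones above is usually computed by replaying a fixed prompt set and counting replies that open with refusal boilerplate. A toy version of such a harness (the marker list and `generate` callable are placeholders, not the author's actual setup):

```python
REFUSAL_MARKERS = (
    "i can't", "i cannot", "i'm sorry", "i won't", "i'm not able to",
)

def refusals_per_100(generate, prompts):
    """Count prompts whose reply opens like a refusal (crude string heuristic)."""
    refused = sum(
        any(m in generate(p).lower()[:200] for m in REFUSAL_MARKERS)
        for p in prompts
    )
    return 100 * refused / len(prompts)
```

Worth noting: a heuristic like this only catches explicit refusals. The feigned compliance redaihf describes would pass it untouched, which is exactly why the refusal rate and actual compliance can diverge.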

In terms of refusal removal:

Qwen, Llama, and Gemma, in order from most to least difficult.
With "Instruct" models (models in this moe) more difficult to remove (refusals) than "thinking".

SIDE-NOTE:
A critical issue with ablits -> they are very difficult to fine-tune, and the results are usually terrible (due to model damage).
With Heretic ablits -> these tune perfectly. A game changer.

Note that the models in this MoE were NOT tuned post-ablit.

Finally, the models in this MoE were ablit'ed using the first generation of Heretic; with the newest version, the ablit'ing is stronger, faster, and more precise.

refusals < noncompliance

Testing suggests the model does not want to access/use some of its knowledge. It's learned during abliteration that refusals are bad and switches to feigned compliance. It's what makes the results so fascinating.
