gpt-2-vuln-code

This model was trained using influence-guided dataset selection, a technique that uses influence scores to identify the most impactful training data for specific concepts.

Model Description

Base Model: distilgpt2
Training Concepts: vulnerability detection, static code analysis, SAST, secure coding practices, CWE, CVE, automated security testing, code review tools, threat modeling
Training Method: Influence-guided data selection
Compute Budget: 100 steps per condition
Total Datasets: 3
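
A minimal loading sketch follows. It assumes the checkpoint is published under the repository ID that appears in the citation URL (durinn/gpt-2-vuln-code) and that it follows the standard transformers causal-LM layout inherited from distilgpt2; this card confirms neither. The helper name `generate` and the generation settings are illustrative.

```python
# Assumed Hub repository ID, taken from the citation URL in this card.
MODEL_ID = "durinn/gpt-2-vuln-code"


def generate(prompt: str, max_new_tokens: int = 40) -> str:
    """Return a greedy completion of `prompt` from the fine-tuned model."""
    # Imported lazily so this module can be imported without transformers installed.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
    model = AutoModelForCausalLM.from_pretrained(MODEL_ID)

    inputs = tokenizer(prompt, return_tensors="pt")
    output_ids = model.generate(
        **inputs,
        max_new_tokens=max_new_tokens,
        do_sample=False,  # greedy decoding for reproducible output
        pad_token_id=tokenizer.eos_token_id,  # GPT-2 tokenizers define no pad token
    )
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)
```

Calling `generate("CWE-79 describes")` downloads the weights on first use. With only 100 training steps per condition, completions should be treated as a benchmark artifact rather than reliable security guidance.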

Training Approach

This model was trained using three different data selection strategies to validate the effectiveness of influence-guided training:

Positive Influence: Datasets with high positive influence scores (most aligned with target concepts)
Random Baseline: Randomly sampled datasets
Negative Influence: Datasets with high negative influence scores (least aligned)
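
The three conditions above can be sketched as a simple selection over per-dataset influence scores. This is an illustration under stated assumptions, not the Learning Curator implementation: the function `select_condition`, its parameters, and the tie to a fixed random seed are hypothetical, and the scores are the ones listed in the Training Datasets section of this card.

```python
import random

# Influence scores as listed in this card's Training Datasets section.
INFLUENCE = {
    "DamarJati/indocorpus-sastra": -0.867,
    "crmamede/vulnerability_detection__explainability": 0.621,
    "jason-oneal/mitre-stix-cve-exploitdb-dataset-alpaca": -0.526,
}


def select_condition(scores, condition, k=1, seed=0):
    """Pick up to k dataset names for one condition (hypothetical helper)."""
    names = list(scores)
    if condition == "positive":
        # Highest positive scores first: most aligned with the target concepts.
        ranked = sorted(names, key=lambda n: scores[n], reverse=True)
        return [n for n in ranked if scores[n] > 0][:k]
    if condition == "negative":
        # Most negative scores first: least aligned with the target concepts.
        ranked = sorted(names, key=lambda n: scores[n])
        return [n for n in ranked if scores[n] < 0][:k]
    if condition == "random":
        # Baseline: ignore the scores and sample uniformly without replacement.
        return random.Random(seed).sample(names, min(k, len(names)))
    raise ValueError(f"unknown condition: {condition!r}")


for name in ("positive", "random", "negative"):
    print(name, select_condition(INFLUENCE, name, k=2))
```

Training each selection for the same number of steps keeps the comparison between conditions controlled for compute.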

Benchmark Results

Condition	Perplexity ↓	Train Loss ↓	Eval Loss ↓
Positive	12.17	2.9640	2.4989
Random	4.81	1.9605	1.5703

Lower is better for all metrics. No results are listed for the Negative Influence condition.
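
The card does not state how perplexity was computed, but the reported values are consistent with the usual definition, the exponential of the mean evaluation loss (in nats). A short check, using only the eval losses from the table; the function name `perplexity` is illustrative:

```python
import math

# Eval losses copied from the benchmark table.
EVAL_LOSS = {"Positive": 2.4989, "Random": 1.5703}


def perplexity(mean_nll: float) -> float:
    """Perplexity as exp of the mean negative log-likelihood per token."""
    return math.exp(mean_nll)


for condition, loss in EVAL_LOSS.items():
    print(f"{condition}: {perplexity(loss):.2f}")
# Positive: 12.17
# Random: 4.81
```

Both values match the Perplexity column to two decimal places.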

Training Datasets

The model was trained on datasets selected through influence scoring:

DamarJati/indocorpus-sastra (Influence: -0.867)
crmamede/vulnerability_detection__explainability (Influence: 0.621)
jason-oneal/mitre-stix-cve-exploitdb-dataset-alpaca (Influence: -0.526)

Intended Use

This model is intended as a research artifact for studying influence-guided training in:

Concept-specific language modeling
Data-efficient training
Dataset curation research

Limitations

Trained on a limited compute budget for benchmarking purposes
May not generalize well outside the target concepts: vulnerability detection, static code analysis, SAST, secure coding practices, CWE, CVE, automated security testing, code review tools, threat modeling
Performance depends on the quality of influence score estimation

Citation

If you use this model or the influence-guided training approach, please cite:

@software{influence_guided_training,
  title = {Influence-Guided Dataset Selection for Language Models},
  author = {Learning Curator by Durinn},
  year = {2025},
  url = {https://huggingface.co/durinn/gpt-2-vuln-code}
}

Model Card Contact

For questions or feedback, visit Durinn

Generated by Learning Curator - AI-powered dataset discovery and training plan optimization
