English
influence-guided-training
dataset-curation
distilgpt2

gpt-2-vuln-code

This model was trained using influence-guided dataset selection, a technique that uses influence scores to identify the most impactful training data for specific concepts.

Model Description

  • Base Model: distilgpt2
  • Training Concepts: vulnerability detection, static code analysis, SAST, secure coding practices, CWE, CVE, automated security testing, code review tools, threat modeling
  • Training Method: Influence-guided data selection
  • Compute Budget: 100 steps per condition
  • Total Datasets: 3

Training Approach

This model was trained using three different data selection strategies to validate the effectiveness of influence-guided training:

  1. Positive Influence: Datasets with high positive influence scores (most aligned with target concepts)
  2. Random Baseline: Randomly sampled datasets
  3. Negative Influence: Datasets with high negative influence scores (least aligned)

Benchmark Results

Condition   Perplexity ↓   Train Loss ↓   Eval Loss ↓
Positive    12.17          2.9640         2.4989
Random      4.81           1.9605         1.5703

Lower is better for all metrics. No results are reported here for the Negative Influence condition.
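The reported perplexities appear consistent with the usual definition of perplexity as the exponential of the mean evaluation cross-entropy loss. The short check below is a sketch under that assumption; the card does not state how perplexity was computed, and the `perplexity` helper is illustrative only.

```python
import math

# Eval losses copied from the benchmark table above.
EVAL_LOSS = {"Positive": 2.4989, "Random": 1.5703}


def perplexity(loss: float) -> float:
    """Perplexity as exp(mean cross-entropy loss), assuming natural-log loss."""
    return math.exp(loss)


for condition, loss in EVAL_LOSS.items():
    print(f"{condition}: perplexity = {perplexity(loss):.2f}")
# Positive: perplexity = 12.17
# Random: perplexity = 4.81
```

Both values match the Perplexity column to two decimal places, which suggests that column was derived from the eval loss rather than measured separately.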

Training Datasets

The model was trained on datasets selected through influence scoring:

  • DamarJati/indocorpus-sastra (Influence: -0.867)
  • crmamede/vulnerability_detection__explainability (Influence: 0.621)
  • jason-oneal/mitre-stix-cve-exploitdb-dataset-alpaca (Influence: -0.526)
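One way the three selection conditions could be derived from these scores is sketched below. This is a minimal illustration, not the pipeline that produced this model: the `select_datasets` helper, its `k` and `seed` parameters, and the rule of ranking by raw score are all assumptions.

```python
import random

# Influence scores as listed above.
INFLUENCE = {
    "DamarJati/indocorpus-sastra": -0.867,
    "crmamede/vulnerability_detection__explainability": 0.621,
    "jason-oneal/mitre-stix-cve-exploitdb-dataset-alpaca": -0.526,
}


def select_datasets(scores, condition, k=1, seed=0):
    """Hypothetical helper: choose k dataset names for one training condition."""
    names = sorted(scores)  # fixed order so the random draw is reproducible
    if condition == "positive":
        # Highest influence scores first.
        return sorted(names, key=lambda n: scores[n], reverse=True)[:k]
    if condition == "negative":
        # Lowest (most negative) influence scores first.
        return sorted(names, key=lambda n: scores[n])[:k]
    if condition == "random":
        # Uniform sample that ignores the scores entirely.
        return random.Random(seed).sample(names, k)
    raise ValueError(f"unknown condition: {condition!r}")


for condition in ("positive", "random", "negative"):
    print(condition, select_datasets(INFLUENCE, condition))
```

With `k=1`, the positive condition picks `crmamede/vulnerability_detection__explainability` and the negative condition picks `DamarJati/indocorpus-sastra`.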

Intended Use

This model is intended as a benchmark artifact for evaluating influence-guided training in:

  • Concept-specific language modeling
  • Data-efficient training
  • Dataset curation research
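For experimentation, the checkpoint can probably be loaded like any other GPT-2 style causal language model. The sketch below assumes the weights are published under the repository id taken from the citation URL (`durinn/gpt-2-vuln-code`) and that the `transformers` and `torch` packages are installed; the `complete` helper and its defaults are illustrative, not part of the model's documented interface.

```python
MODEL_ID = "durinn/gpt-2-vuln-code"  # assumed from the citation URL in this card


def complete(prompt: str, max_new_tokens: int = 40) -> str:
    """Hypothetical helper: greedy continuation of `prompt` with the fine-tuned model."""
    # Imported lazily so the module can be imported without transformers installed.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
    model = AutoModelForCausalLM.from_pretrained(MODEL_ID)
    inputs = tokenizer(prompt, return_tensors="pt")
    output_ids = model.generate(
        **inputs,
        max_new_tokens=max_new_tokens,
        do_sample=False,  # greedy decoding for repeatable output
        pad_token_id=tokenizer.eos_token_id,  # GPT-2 tokenizers define no pad token
    )
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)


if __name__ == "__main__":
    print(complete("A buffer overflow occurs when"))
```

Given the 100-step compute budget, generations should be treated as a qualitative probe rather than as reliable security guidance.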

Limitations

  • Trained on a limited compute budget for benchmarking purposes
  • May not generalize well outside the target concepts listed under Model Description
  • Performance depends on the quality of influence score estimation

Citation

If you use this model or the influence-guided training approach, please cite:

@software{influence_guided_training,
  title = {Influence-Guided Dataset Selection for Language Models},
  author = {Learning Curator by Durinn},
  year = {2025},
  url = {https://huggingface.co/durinn/gpt-2-vuln-code}
}

Model Card Contact

For questions or feedback, visit Durinn


Generated by Learning Curator - AI-powered dataset discovery and training plan optimization
