
🧠 DeBERTa-v3-Base Code Quality Classifier

A DeBERTa-v3-base model fine-tuned to classify clean vs. buggy code on the CodeXGlue Defect Detection dataset.
It is designed specifically for dataset filtering to improve downstream code language model training (e.g., Qwen2.5-Coder).


📌 Model Summary

This classifier predicts whether a given code snippet is non-defective (label 0) or buggy (label 1).
The output probabilities are used to rank samples by quality and select the highest-quality subset.
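
A minimal sketch of this ranking step, assuming PyTorch and the Transformers library; the repo id below is a placeholder for this checkpoint, and clean_probability is an illustrative helper rather than part of this repository:

import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model_id = "path/to/this-checkpoint"  # placeholder; replace with the actual repo id or a local path
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSequenceClassification.from_pretrained(model_id)
model.eval()

def clean_probability(code: str) -> float:
    # P(label 0 = clean) for one snippet, truncated to the 512-token limit
    inputs = tokenizer(code, truncation=True, max_length=512, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits
    return torch.softmax(logits, dim=-1)[0, 0].item()

# Rank a corpus by predicted quality and keep, e.g., the top half
snippets = ["int add(int a, int b) { return a + b; }", "int bad(int a) { return a / 0; }"]
ranked = sorted(snippets, key=clean_probability, reverse=True)
high_quality = ranked[: len(ranked) // 2]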

This model is part of a research pipeline analyzing how data quality affects token-level performance in generative code models.


Expected Result

A 5% improvement in perplexity, based on early benchmarking with the tuned DeBERTa filter.

🧱 Model Description

Architecture

  • Base model: microsoft/deberta-v3-base
  • Task: Binary sequence classification
  • Labels:
    • 0 = clean code
    • 1 = buggy / defective code
  • Max sequence length: 512 tokens
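
As a sketch, the label convention above can be made explicit in the model config; the id2label strings here are illustrative and not shipped with this checkpoint:

from transformers import AutoConfig, AutoModelForSequenceClassification

# Illustrative explicit label mapping matching the card (0 = clean, 1 = buggy)
config = AutoConfig.from_pretrained(
    "microsoft/deberta-v3-base",
    num_labels=2,
    id2label={0: "clean", 1: "buggy"},
    label2id={"clean": 0, "buggy": 1},
)
model = AutoModelForSequenceClassification.from_pretrained(
    "microsoft/deberta-v3-base", config=config
)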

Purpose

This model is intended for:

  • Dataset quality filtering
  • Improving generative model training stability
  • Research on LLM token quality and perplexity
  • Understanding effects of removing noisy samples

This model is not intended for real-world bug detection or vulnerability scanning.


📚 Dataset

Training Dataset

CodeXGlue Code Defect Detection
(https://huggingface.co/datasets/code_x_glue_cc_defect_detection)

  • "func": raw function-level source code
  • "target": binary label (0 = clean, 1 = buggy)
  • ~21,000 training examples
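
The training split can be loaded directly with the datasets library; a minimal sketch (depending on your datasets version, this script-based dataset may additionally require trust_remote_code=True):

from datasets import load_dataset

ds = load_dataset("code_x_glue_cc_defect_detection", split="train")
print(len(ds))               # roughly 21k examples
print(ds[0]["func"][:200])   # raw function-level source code
print(int(ds[0]["target"]))  # 0 = clean, 1 = buggy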

Preprocessing

  • Tokenized with DeBERTa-v3-base tokenizer
  • Truncated to 512 tokens
  • Padded dynamically using DataCollatorWithPadding
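
Putting those steps together, a preprocessing sketch assuming the ds split loaded above; casting target to an integer labels column is one straightforward way to feed the Trainer, as the card does not specify the exact label handling:

from transformers import AutoTokenizer, DataCollatorWithPadding

tokenizer = AutoTokenizer.from_pretrained("microsoft/deberta-v3-base")

def tokenize(batch):
    # Truncate to the 512-token limit; padding is deferred to the collator
    enc = tokenizer(batch["func"], truncation=True, max_length=512)
    enc["labels"] = [int(t) for t in batch["target"]]  # integer 0/1 labels
    return enc

tokenized = ds.map(tokenize, batched=True)
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)  # dynamic per-batch padding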

🧪 Training Procedure

Hyperparameters

Hyperparameter            Value
Epochs                    1
Learning rate             2e-5
Batch size                8
FP16                      Yes
Max length                512
Optimizer                 AdamW
Loss                      Cross-entropy
remove_unused_columns     False

Training Code Snippet

from transformers import AutoModelForSequenceClassification, TrainingArguments

# Binary classification head (0 = clean, 1 = buggy) on top of DeBERTa-v3-base
model = AutoModelForSequenceClassification.from_pretrained(
    "microsoft/deberta-v3-base", num_labels=2
)

# Training configuration for the single-epoch fine-tuning run
training_args = TrainingArguments(
    output_dir="filter_model",
    learning_rate=2e-5,
    num_train_epochs=1,
    fp16=True,
    per_device_train_batch_size=8,
    logging_steps=50,
    remove_unused_columns=False,
)
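
A sketch of wiring the pieces above into a Trainer run; tokenized, data_collator, and tokenizer refer to the preprocessing sketch in the Dataset section, and the final save step is an assumption rather than a detail from the original run:

from transformers import Trainer

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized,      # tokenized CodeXGlue training split from the sketch above
    data_collator=data_collator,  # dynamic padding, as described in Preprocessing
    tokenizer=tokenizer,
)
trainer.train()
trainer.save_model("filter_model")  # same output_dir as in TrainingArguments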