
🧠 DeBERTa-v3-Base Code Quality Classifier

A DeBERTa-v3-base model fine-tuned to classify clean vs. buggy code on the CodeXGlue Defect Detection dataset.
It is designed specifically for dataset filtering to improve downstream code language model training (e.g., Qwen2.5-Coder).


📌 Model Summary

This classifier predicts whether a given code snippet is non-defective (label 0) or buggy (label 1).
The output probabilities are used to rank samples by quality and select the highest-quality subset.
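
A minimal sketch of this ranking step, assuming PyTorch and the Transformers library; the repo id below is a placeholder for this checkpoint, and clean_probability is an illustrative helper rather than part of this repository:

import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model_id = "path/to/this-checkpoint"  # placeholder; replace with the actual repo id or a local path
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSequenceClassification.from_pretrained(model_id)
model.eval()

def clean_probability(code: str) -> float:
    # P(label 0 = clean) for one snippet, truncated to the 512-token limit
    inputs = tokenizer(code, truncation=True, max_length=512, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits
    return torch.softmax(logits, dim=-1)[0, 0].item()

# Rank a corpus by predicted quality and keep, e.g., the top half
snippets = ["int add(int a, int b) { return a + b; }", "int bad(int a) { return a / 0; }"]
ranked = sorted(snippets, key=clean_probability, reverse=True)
high_quality = ranked[: len(ranked) // 2]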

This model is part of a research pipeline analyzing how data quality affects token-level performance in generative code models.


Expected Result

A 5% improvement in perplexity, based on early benchmarking with the tuned DeBERTa filter.

🧱 Model Description

Architecture

  • Base model: microsoft/deberta-v3-base
  • Task: Binary sequence classification
  • Labels:
    • 0 = clean code
    • 1 = buggy / defective code
  • Max sequence length: 512 tokens
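
As a sketch, the label convention above can be made explicit in the model config; the id2label strings here are illustrative and not shipped with this checkpoint:

from transformers import AutoConfig, AutoModelForSequenceClassification

# Illustrative explicit label mapping matching the card (0 = clean, 1 = buggy)
config = AutoConfig.from_pretrained(
    "microsoft/deberta-v3-base",
    num_labels=2,
    id2label={0: "clean", 1: "buggy"},
    label2id={"clean": 0, "buggy": 1},
)
model = AutoModelForSequenceClassification.from_pretrained(
    "microsoft/deberta-v3-base", config=config
)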

Purpose

This model is intended for:

  • Dataset quality filtering
  • Improving generative model training stability
  • Research on LLM token quality and perplexity
  • Understanding effects of removing noisy samples

This model is not intended for real-world bug detection or vulnerability scanning.


📚 Dataset

Training Dataset

CodeXGlue Code Defect Detection
(https://huggingface.co/datasets/code_x_glue_cc_defect_detection)

  • "func": raw function-level source code
  • "target": binary label (0 = clean, 1 = buggy)
  • ~21,000 training examples
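
The training split can be loaded directly with the datasets library; a minimal sketch (depending on your datasets version, this script-based dataset may additionally require trust_remote_code=True):

from datasets import load_dataset

ds = load_dataset("code_x_glue_cc_defect_detection", split="train")
print(len(ds))               # roughly 21k examples
print(ds[0]["func"][:200])   # raw function-level source code
print(int(ds[0]["target"]))  # 0 = clean, 1 = buggy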

Preprocessing

  • Tokenized with DeBERTa-v3-base tokenizer
  • Truncated to 512 tokens
  • Padded dynamically using DataCollatorWithPadding
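
Putting those steps together, a preprocessing sketch assuming the ds split loaded above; casting target to an integer labels column is one straightforward way to feed the Trainer, as the card does not specify the exact label handling:

from transformers import AutoTokenizer, DataCollatorWithPadding

tokenizer = AutoTokenizer.from_pretrained("microsoft/deberta-v3-base")

def tokenize(batch):
    # Truncate to the 512-token limit; padding is deferred to the collator
    enc = tokenizer(batch["func"], truncation=True, max_length=512)
    enc["labels"] = [int(t) for t in batch["target"]]  # integer 0/1 labels
    return enc

tokenized = ds.map(tokenize, batched=True)
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)  # dynamic per-batch padding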

🧪 Training Procedure

Hyperparameters

Hyperparameter            Value
Epochs                    1
Learning rate             2e-5
Batch size                8
FP16                      Yes
Max length                512
Optimizer                 AdamW
Loss                      Cross-entropy
remove_unused_columns     False

Training Code Snippet

from transformers import AutoModelForSequenceClassification, TrainingArguments

# Binary classification head (0 = clean, 1 = buggy) on top of DeBERTa-v3-base
model = AutoModelForSequenceClassification.from_pretrained(
    "microsoft/deberta-v3-base", num_labels=2
)

# Training configuration for the single-epoch fine-tuning run
training_args = TrainingArguments(
    output_dir="filter_model",
    learning_rate=2e-5,
    num_train_epochs=1,
    fp16=True,
    per_device_train_batch_size=8,
    logging_steps=50,
    remove_unused_columns=False,
)
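
A sketch of wiring the pieces above into a Trainer run; tokenized, data_collator, and tokenizer refer to the preprocessing sketch in the Dataset section, and the final save step is an assumption rather than a detail from the original run:

from transformers import Trainer

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized,      # tokenized CodeXGlue training split from the sketch above
    data_collator=data_collator,  # dynamic padding, as described in Preprocessing
    tokenizer=tokenizer,
)
trainer.train()
trainer.save_model("filter_model")  # same output_dir as in TrainingArguments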