# 🧠 DeBERTa-v3-Base Code Quality Classifier
A fine-tuned DeBERTa-v3-base model trained to classify clean vs. buggy code using the CodeXGlue Defect Detection dataset.
This model is designed specifically for dataset filtering to improve downstream code language model training (e.g., Qwen2.5-Coder).
## 📌 Model Summary
This classifier predicts whether a given code snippet is non-defective (label 0) or buggy (label 1).
The output probabilities are used to rank samples by quality and select the highest-quality subset.
This model is part of a research pipeline analyzing how data quality affects token-level performance in generative code models.
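The output probabilities translate into a quality ranking in a few lines of code. The sketch below is illustrative, assuming the fine-tuned weights are saved under the `filter_model` output directory from the training configuration further down; the `clean_probability` helper and the top-half cutoff are hypothetical choices, not part of the released pipeline:

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Tokenizer comes from the base model; weights from the fine-tuned checkpoint
# (assumed here to be the "filter_model" output_dir used during training).
tokenizer = AutoTokenizer.from_pretrained("microsoft/deberta-v3-base")
model = AutoModelForSequenceClassification.from_pretrained("filter_model")
model.eval()

def clean_probability(code: str) -> float:
    """P(label == 0): the model's confidence that the snippet is clean."""
    inputs = tokenizer(code, truncation=True, max_length=512, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits
    return torch.softmax(logits, dim=-1)[0, 0].item()

# Rank a corpus by predicted quality and keep the top half (cutoff is illustrative).
corpus = ["int add(int a, int b) { return a + b; }", "/* more functions */"]
ranked = sorted(corpus, key=clean_probability, reverse=True)
top_half = ranked[: len(ranked) // 2]
```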
### Expected Result
Early benchmarking with the tuned DeBERTa filter suggests roughly a 5% perplexity improvement in the downstream code model.
## 🧱 Model Description
### Architecture
- Base model: microsoft/deberta-v3-base
- Task: Binary sequence classification
- Labels:
  - `0` = clean code
  - `1` = buggy / defective code
- Max sequence length: 512 tokens
### Purpose
This model is intended for:
- Dataset quality filtering
- Improving generative model training stability
- Research on LLM token quality and perplexity
- Understanding effects of removing noisy samples
This model is not intended for real-world bug detection or vulnerability scanning.
## 📚 Dataset
### Training Dataset
[CodeXGlue Code Defect Detection](https://huggingface.co/datasets/code_x_glue_cc_defect_detection)
"func": raw function-level source code"target": binary label (0 = clean, 1 = buggy)- ~21,000 training examples
### Preprocessing
- Tokenized with DeBERTa-v3-base tokenizer
- Truncated to 512 tokens
- Padded dynamically using `DataCollatorWithPadding` (see the sketch below)
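A minimal preprocessing sketch consistent with these steps (the `tokenize` helper and the explicit column cleanup are illustrative; `ds` is the dataset loaded in the example above):

```python
from transformers import AutoTokenizer, DataCollatorWithPadding

tokenizer = AutoTokenizer.from_pretrained("microsoft/deberta-v3-base")

def tokenize(batch):
    # Truncate to the 512-token limit; padding is deferred to the collator.
    enc = tokenizer(batch["func"], truncation=True, max_length=512)
    # Trainer expects an integer "labels" column.
    enc["labels"] = [int(t) for t in batch["target"]]
    return enc

# Drop the raw columns so only model inputs reach the collator.
tokenized = ds.map(tokenize, batched=True, remove_columns=ds["train"].column_names)
collator = DataCollatorWithPadding(tokenizer=tokenizer)  # pads each batch dynamically
```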
## 🧪 Training Procedure
### Hyperparameters
| Hyperparameter | Value |
|---|---|
| Epochs | 1 |
| Learning rate | 2e-5 |
| Batch size | 8 |
| FP16 | Yes |
| Max length | 512 |
| Optimizer | AdamW |
| Loss | Cross-entropy |
| remove_unused_columns | False |
### Training Code Snippet
```python
from transformers import AutoModelForSequenceClassification, TrainingArguments

# Binary classification head on top of DeBERTa-v3-base.
model = AutoModelForSequenceClassification.from_pretrained(
    "microsoft/deberta-v3-base", num_labels=2
)

training_args = TrainingArguments(
    output_dir="filter_model",
    learning_rate=2e-5,
    num_train_epochs=1,
    fp16=True,
    per_device_train_batch_size=8,
    logging_steps=50,
    remove_unused_columns=False,  # do not drop dataset columns unused by forward()
)
```
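To make the snippet runnable end to end, here is a hedged sketch of wiring it into a `Trainer`, reusing `tokenized` and `collator` from the preprocessing example above (the card does not specify an eval setup, so evaluation is omitted):

```python
from transformers import Trainer

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized["train"],
    data_collator=collator,
    tokenizer=tokenizer,
)
trainer.train()
trainer.save_model("filter_model")  # matches output_dir above
```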