ERC Classifiers
This repository contains a model trained for multi-label classification of scientific papers in the ERC (European Research Council) context. The model predicts multiple categories for a paper, such as its research domain or topic, based on the abstract and title.
Model Description
The model is based on SPECTER (a transformer-based model pre-trained on scientific literature), fine-tuned for multi-label classification on a dataset of scientific papers. It classifies papers into the categories defined by the ERC and is trained to predict these categories given the title and abstract of each paper.
Preprocessing
The preprocessing pipeline involves:
- Data Loading: Papers are loaded from a Parquet file containing the title, abstract, and their respective categories.
- Label Cleaning: Labels (categories) are processed to remove any unnecessary information (like content within parentheses).
- Label Encoding: Categories are transformed into a binary matrix using the `MultiLabelBinarizer` from scikit-learn. Each category corresponds to a column whose value is `1` if the paper belongs to that category and `0` otherwise (see the sketch after this list).
- Statistics and Visualization: Basic statistics and visualizations, such as label distributions, are generated to help understand the dataset better.
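A minimal sketch of the label cleaning and encoding steps, assuming a pandas DataFrame with a `label` column holding lists of category strings; the raw file name `papers_raw.parquet` and the cleaning regex are illustrative, not taken from the actual pipeline:

```python
import re
import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer

# Load the papers (file name is a placeholder for the source Parquet file)
df = pd.read_parquet("papers_raw.parquet")

# Label cleaning: drop content within parentheses, e.g. "Category A (extra info)" -> "Category A"
def clean_label(label: str) -> str:
    return re.sub(r"\s*\(.*?\)", "", label).strip()

df["label"] = df["label"].apply(lambda labels: [clean_label(l) for l in labels])

# Label encoding: one binary column per category (1 = paper has the label, 0 = it does not)
mlb = MultiLabelBinarizer()
y = mlb.fit_transform(df["label"])
print(mlb.classes_)  # category names, one per column of y
print(y.shape)       # (num_papers, num_categories)
```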
Training
The model is fine-tuned on the preprocessed dataset using the following setup:
- Base Model: The model uses the `allenai/specter` transformer as the base model for sequence classification.
- Optimizer: AdamW with a learning rate of `5e-5`.
- Loss Function: Binary Cross-Entropy with logits (`BCEWithLogitsLoss`), as the task is multi-label classification.
- Epochs: The model is trained for 5 epochs with a batch size of 4.
- Training Data: The model is trained on a processed dataset stored in `train_ready.parquet` (a minimal training sketch follows this list).
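The exact training script is not part of this repository; the following is a rough sketch of the setup described above, reusing the `df` and `y` objects from the preprocessing sketch. The title/abstract concatenation and the `max_length` value are assumptions:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset
from transformers import AutoModelForSequenceClassification, AutoTokenizer

BASE_MODEL = "allenai/specter"
LR = 5e-5
EPOCHS = 5
BATCH_SIZE = 4

tokenizer = AutoTokenizer.from_pretrained(BASE_MODEL)
model = AutoModelForSequenceClassification.from_pretrained(
    BASE_MODEL,
    num_labels=y.shape[1],
    problem_type="multi_label_classification",
)

# Concatenate title and abstract as the input text (assumed format)
texts = (df["title"] + tokenizer.sep_token + df["abstract"]).tolist()
enc = tokenizer(texts, padding=True, truncation=True, max_length=512, return_tensors="pt")
dataset = TensorDataset(enc["input_ids"], enc["attention_mask"], torch.tensor(y, dtype=torch.float))
loader = DataLoader(dataset, batch_size=BATCH_SIZE, shuffle=True)

optimizer = torch.optim.AdamW(model.parameters(), lr=LR)
loss_fn = torch.nn.BCEWithLogitsLoss()

model.train()
for epoch in range(EPOCHS):
    for input_ids, attention_mask, labels in loader:
        optimizer.zero_grad()
        logits = model(input_ids=input_ids, attention_mask=attention_mask).logits
        loss = loss_fn(logits, labels)
        loss.backward()
        optimizer.step()
```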
Evaluation
The model is evaluated using both single-label and multi-label metrics:
Single-Label Evaluation
- Accuracy: The accuracy is measured by checking how often the true label appears in the predicted labels.
- Precision, Recall, F1: These metrics are calculated for each class and averaged for the entire dataset.
Multi-Label Evaluation
- Micro and Macro Metrics: Precision, recall, and F1 scores are computed using both micro-averaging (overall performance across all labels) and macro-averaging (unweighted average of per-label performance); a computation sketch follows this list.
- Label Frequency Plot: A plot showing the frequency distribution of labels in the test set.
- Top and Bottom F1 Plot: A plot visualizing the top and bottom labels based on their F1 scores.
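A sketch of how the micro- and macro-averaged metrics could be computed with scikit-learn, assuming `y_true` and `y_pred` are binary indicator matrices of shape `(num_papers, num_labels)`:

```python
from sklearn.metrics import precision_recall_fscore_support

# Micro-averaged metrics: aggregate over all (paper, label) decisions
micro_p, micro_r, micro_f1, _ = precision_recall_fscore_support(
    y_true, y_pred, average="micro", zero_division=0
)

# Macro-averaged metrics: unweighted mean of per-label scores
macro_p, macro_r, macro_f1, _ = precision_recall_fscore_support(
    y_true, y_pred, average="macro", zero_division=0
)

# Per-label F1, e.g. as input for the top/bottom F1 plot
_, _, per_label_f1, _ = precision_recall_fscore_support(
    y_true, y_pred, average=None, zero_division=0
)

print(f"micro  P={micro_p:.3f} R={micro_r:.3f} F1={micro_f1:.3f}")
print(f"macro  P={macro_p:.3f} R={macro_r:.3f} F1={macro_f1:.3f}")
```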
Dataset
The dataset consists of scientific papers, each with the following columns:
- title: The title of the paper.
- abstract: The abstract of the paper.
- label: A list of categories (labels) assigned to the paper.
The dataset is preprocessed and stored in a `train_ready.parquet` file.
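For a quick look at the data, the processed file can be inspected with pandas (column names as listed above):

```python
import pandas as pd

df = pd.read_parquet("train_ready.parquet")
print(df.columns.tolist())                          # expected: ['title', 'abstract', 'label']
print(df["label"].explode().value_counts().head())  # most frequent categories
```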
Files
- `config.json`: Model configuration file.
- `model.safetensors`: Saved fine-tuned model weights.
- `tokenizer.json`: Tokenizer configuration for the fine-tuned model.
- `tokenizer_config.json`: Tokenizer settings.
- `special_tokens_map.json`: Special tokens used by the tokenizer.
- `vocab.txt`: Vocabulary file for the fine-tuned tokenizer.
Usage
To use the model, follow these steps:
1. Install Dependencies:

```bash
pip install transformers torch datasets
```

2. Load the Model and Tokenizer:

```python
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model_name = "SIRIS-Lab/erc-classifiers"

# Load the fine-tuned model and tokenizer
model = AutoModelForSequenceClassification.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)
```

3. Use the Model for Prediction:

```python
import torch

# Example paper title and abstract
text = "Example title and abstract of a scientific paper."

# Tokenize the input text
inputs = tokenizer(text, return_tensors="pt", padding=True, truncation=True)

# Make predictions
with torch.no_grad():
    logits = model(**inputs).logits

# Apply sigmoid activation to get probabilities
probabilities = torch.sigmoid(logits)

# Get predicted labels (threshold at 0.5)
predicted_labels = (probabilities >= 0.5).long().cpu().numpy()
print(predicted_labels)
```
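To map the predicted column indices back to category names, the label mapping stored in the model configuration can be used; this assumes `id2label` was populated when the model was saved:

```python
# Map predicted indices to category names (assumes id2label is present in the config)
predicted_names = [
    model.config.id2label[i]
    for i, flag in enumerate(predicted_labels[0])
    if flag == 1
]
print(predicted_names)
```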
Conclusion
This model provides an efficient solution for classifying scientific papers into multiple categories based on their content. It uses state-of-the-art transformer-based techniques and is fine-tuned on a real-world dataset of ERC-related scientific papers.