EnvGPT-7B

EnvGPT-7B is a domain-specific large language model tailored for environmental science tasks, fine-tuned on both English and Chinese datasets.

Environmental science presents unique challenges for LLMs due to its interdisciplinary nature. EnvGPT-7B was developed to address these challenges by leveraging environmental science-specific instruction datasets and benchmarks.

The model was fine-tuned through Supervised Fine-Tuning (SFT) on two environmental science-specific instruction datasets, ChatEnv and ChatEnv-zh. The combined dataset contains over 200 million tokens covering diverse environmental science topics in both English and Chinese. This bilingual training enables EnvGPT-7B to achieve strong performance on Chinese as well as English tasks.

🚀 Getting Started

Download the model

Download the model: EnvGPT-7B

git lfs install
git clone https://huggingface.co/SustcZhangYX/EnvGPT-7B
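
If you prefer the Python API over Git LFS, here is a minimal sketch using huggingface_hub; it assumes the model is also available on the Hugging Face Hub under the same repo id, and the local_dir path is just an example:

from huggingface_hub import snapshot_download

# Download all model files into a local directory (path is a placeholder)
snapshot_download(
    repo_id="SustcZhangYX/EnvGPT-7B",
    local_dir="./EnvGPT-7B",
)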

Model Usage

Here is a Python code snippet that demonstrates how to load the tokenizer and model and generate text using EnvGPT.

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, pipeline

# 1. Set your local EnvGPT model path here
model_path = "YOUR_LOCAL_MODEL_PATH"

# 2. Load tokenizer and model
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForCausalLM.from_pretrained(
    model_path,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

# 3. Build chat messages
messages = [
    {"role": "system", "content": "You are an expert assistant in environmental science, EnvGPT. You are a helpful assistant."},
    {"role": "user",   "content": "What is the definition of environmental science?"},
]

# 4. Format the prompt using the chat template
#    add_generation_prompt=True appends the assistant turn marker so the
#    model continues the conversation as the assistant
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
)
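
# For reference (an assumption based on EnvGPT-7B being fine-tuned from
# Qwen2.5, whose chat template it likely inherits), the formatted string
# looks roughly like:
#   <|im_start|>system\n...<|im_end|>\n<|im_start|>user\n...<|im_end|>\n<|im_start|>assistant\n
# The exact markers depend on the tokenizer's chat template.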

# 5. Initialize the text-generation pipeline
#    The model was already loaded with device_map="auto" and bfloat16,
#    so those arguments are not repeated here
text_gen = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    return_full_text=False,  # Only return the newly generated text
)

# 6. Generate the response (sampling parameters explained inline below)
outputs = text_gen(
    text,
    max_new_tokens=4096,  # Up to 4096 new tokens
    do_sample=True,       # Enable sampling instead of greedy decoding
    top_p=0.6,            # Nucleus sampling parameter
    temperature=0.8,      # Sampling temperature
)

# 7. Print the assistant’s reply (without the original prompt)
print(outputs[0]["generated_text"])

This code demonstrates how to load the tokenizer and model from your local path, define environmental science-specific prompts, and generate responses using sampling techniques like top-p and temperature.
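
If you need finer control than the pipeline offers (for example, custom stopping criteria or streaming), here is a minimal sketch that calls model.generate() directly on the same templated prompt. It assumes the tokenizer, model, and text variables from the snippet above are already defined; the shorter max_new_tokens is just for illustration:

# Tokenize the templated prompt and move it to the model's device
inputs = tokenizer(text, return_tensors="pt").to(model.device)

# Generate with the same sampling settings as above
output_ids = model.generate(
    **inputs,
    max_new_tokens=512,
    do_sample=True,
    top_p=0.6,
    temperature=0.8,
)

# Strip the prompt tokens and decode only the newly generated reply
reply = tokenizer.decode(
    output_ids[0][inputs["input_ids"].shape[1]:],
    skip_special_tokens=True,
)
print(reply)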

🌏 Acknowledgement

EnvGPT-7B is fine-tuned based on the open-sourced Qwen2.5. We sincerely thank the Qwen team for their efforts in developing and releasing such a powerful open-source foundation model, which makes domain-specific adaptations like EnvGPT possible.

❗Disclaimer

This project is intended solely for academic research and exploration. Please note that, like all large language models, this model may exhibit limitations, including potential inaccuracies or hallucinations in generated outputs.

Limitations

  • The model may produce hallucinated outputs or inaccuracies, which are inherent to large language models.
  • The model's identity has not been specifically tuned, so it may generate content that resembles outputs from other Qwen-based models or similar architectures.
  • Generated outputs can vary between attempts due to sensitivity to prompt phrasing and token context; for reproducible experiments, fix the random seed as sketched below.
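
Because sampling is stochastic, repeated runs with the same prompt will differ. A minimal sketch for making runs repeatable with transformers' built-in seeding helper (the seed value is arbitrary):

from transformers import set_seed

# Fix the Python, NumPy, and PyTorch RNGs so sampled generations repeat
set_seed(42)

# ...then run the generation code from the Getting Started section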

🚩Citation

If you find our work helpful, please consider citing our paper "Fine-Tuning Large Language Models for Interdisciplinary Environmental Challenges":

@article{ZHANG2025100608,
  title   = {Fine-Tuning Large Language Models for Interdisciplinary Environmental Challenges},
  journal = {Environmental Science and Ecotechnology},
  pages   = {100608},
  year    = {2025},
  issn    = {2666-4984},
  doi     = {10.1016/j.ese.2025.100608},
  url     = {https://www.sciencedirect.com/science/article/pii/S2666498425000869},
  author  = {Yuanxin Zhang and Sijie Lin and Yaxin Xiong and Nan Li and Lijin Zhong and Longzhen Ding and Qing Hu}
}