Missing tokenizer files in the model repository
hi the model repository currently does not include tokenizer files, which causes transformers to fail when loading the model through
from transformers import pipeline
classifier = pipeline("text-classification", "dcarpintero/pangolin-guard-base")
heres the error
config.json: 1.48kB [00:00, 11.3MB/s]
model.safetensors: 100%|ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ| 1.58G/1.58G [00:17<00:00, 91.6MB/s]
Flash Attention 2 only supports torch.float16 and torch.bfloat16 dtypes, but the current dype in ModernBertForSequenceClassification is torch.float32. You should run training or inference using Automatic Mixed-Precision via the with torch.autocast(device_type='torch_device'): decorator, or load the model with the dtype argument. Example: model = AutoModel.from_pretrained("openai/whisper-tiny", attn_implementation="flash_attention_2", dtype=torch.float16)
Flash Attention 2 only supports torch.float16 and torch.bfloat16 dtypes, but the current dype in ModernBertModel is torch.float32. You should run training or inference using Automatic Mixed-Precision via the with torch.autocast(device_type='torch_device'): decorator, or load the model with the dtype argument. Example: model = AutoModel.from_pretrained("openai/whisper-tiny", attn_implementation="flash_attention_2", dtype=torch.float16)
Traceback (most recent call last):
File "/home/api-server/yes/envs/aifirewall/lib/python3.10/site-packages/transformers/convert_slow_tokenizer.py", line 1783, in convert_slow_tokenizer
).converted()
File "/home/api-server/yes/envs/aifirewall/lib/python3.10/site-packages/transformers/convert_slow_tokenizer.py", line 1677, in converted
tokenizer = self.tokenizer()
File "/home/api-server/yes/envs/aifirewall/lib/python3.10/site-packages/transformers/convert_slow_tokenizer.py", line 1670, in tokenizer
vocab_scores, merges = self.extract_vocab_merges_from_model(self.vocab_file)
File "/home/api-server/yes/envs/aifirewall/lib/python3.10/site-packages/transformers/convert_slow_tokenizer.py", line 1646, in extract_vocab_merges_from_model
bpe_ranks = load_tiktoken_bpe(tiktoken_url)
File "/home/api-server/yes/envs/aifirewall/lib/python3.10/site-packages/tiktoken/load.py", line 148, in load_tiktoken_bpe
contents = read_file_cached(tiktoken_bpe_file, expected_hash)
File "/home/api-server/yes/envs/aifirewall/lib/python3.10/site-packages/tiktoken/load.py", line 48, in read_file_cached
cache_key = hashlib.sha1(blobpath.encode()).hexdigest()
AttributeError: 'NoneType' object has no attribute 'encode'
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/mnt/c/Users/api-server/tianyi/aifirewall/test.py", line 4, in
classifier = pipeline("text-classification","dcarpintero/pangolin-guard-large")
File "/home/api-server/yes/envs/aifirewall/lib/python3.10/site-packages/transformers/pipelines/init.py", line 1078, in pipeline
raise e
File "/home/api-server/yes/envs/aifirewall/lib/python3.10/site-packages/transformers/pipelines/init.py", line 1073, in pipeline
tokenizer = AutoTokenizer.from_pretrained(
File "/home/api-server/yes/envs/aifirewall/lib/python3.10/site-packages/transformers/models/auto/tokenization_auto.py", line 1159, in from_pretrained
return tokenizer_class_fast.from_pretrained(pretrained_model_name_or_path, *inputs, **kwargs)
File "/home/api-server/yes/envs/aifirewall/lib/python3.10/site-packages/transformers/tokenization_utils_base.py", line 2097, in from_pretrained
return cls._from_pretrained(
File "/home/api-server/yes/envs/aifirewall/lib/python3.10/site-packages/transformers/tokenization_utils_base.py", line 2343, in _from_pretrained
tokenizer = cls(*init_inputs, **init_kwargs)
File "/home/api-server/yes/envs/aifirewall/lib/python3.10/site-packages/transformers/tokenization_utils_fast.py", line 139, in init
fast_tokenizer = convert_slow_tokenizer(self, from_tiktoken=True)
File "/home/api-server/yes/envs/aifirewall/lib/python3.10/site-packages/transformers/convert_slow_tokenizer.py", line 1785, in convert_slow_tokenizer
raise ValueError(
ValueError: Converting from SentencePiece and Tiktoken failed, if a converter for SentencePiece is available, provide a model path with a SentencePiece tokenizer.model file.Currently available slow->fast converters: ['AlbertTokenizer', 'BartTokenizer', 'BarthezTokenizer', 'BertTokenizer', 'BigBirdTokenizer', 'BlenderbotTokenizer', 'CamembertTokenizer', 'CLIPTokenizer', 'CodeGenTokenizer', 'ConvBertTokenizer', 'DebertaTokenizer', 'DebertaV2Tokenizer', 'DistilBertTokenizer', 'DPRReaderTokenizer', 'DPRQuestionEncoderTokenizer', 'DPRContextEncoderTokenizer', 'ElectraTokenizer', 'FNetTokenizer', 'FunnelTokenizer', 'GPT2Tokenizer', 'HerbertTokenizer', 'LayoutLMTokenizer', 'LayoutLMv2Tokenizer', 'LayoutLMv3Tokenizer', 'LayoutXLMTokenizer', 'LongformerTokenizer', 'LEDTokenizer', 'LxmertTokenizer', 'MarkupLMTokenizer', 'MBartTokenizer', 'MBart50Tokenizer', 'MPNetTokenizer', 'MobileBertTokenizer', 'MvpTokenizer', 'NllbTokenizer', 'OpenAIGPTTokenizer', 'PegasusTokenizer', 'Qwen2Tokenizer', 'RealmTokenizer', 'ReformerTokenizer', 'RemBertTokenizer', 'RetriBertTokenizer', 'RobertaTokenizer', 'RoFormerTokenizer', 'SeamlessM4TTokenizer', 'SqueezeBertTokenizer', 'T5Tokenizer', 'UdopTokenizer', 'WhisperTokenizer', 'XLMRobertaTokenizer', 'XLNetTokenizer', 'SplinterTokenizer', 'XGLMTokenizer', 'LlamaTokenizer', 'CodeLlamaTokenizer', 'GemmaTokenizer', 'Phi3Tokenizer']