BEEspoke Data
community

AI & ML interests
'an LLM is only as good as the dataset it was trained on' - Sun Tzu

Books from the Survivor Library (mostly ~1920s & earlier), OCR'd with recent VLMs
smol_llama 220M fine-tunes we did
- BEE-spoke-data/smol_llama-220M-openhermes • Text Generation • 0.2B • Updated • 1.16k • 5
- BEE-spoke-data/smol_llama-220M-open_instruct • Text Generation • 0.2B • Updated • 25 • 2
- BEE-spoke-data/beecoder-220M-python • Text Generation • 0.2B • Updated • 28 • 3
- BEE-spoke-data/zephyr-220m-sft-full • Text Generation • 0.2B • Updated • 1.09k • 1
models fine-tuned to be knowledgeable about apiary practice
- BEE-spoke-data/TinyLlama-3T-1.1bee • Text Generation • 1B • Updated • 30 • 2
- BEE-spoke-data/TinyLlama-1.1bee • Text Generation • 1B • Updated • 17 • 1
- BEE-spoke-data/Meta-Llama-3-8Bee • Text Generation • 8B • Updated • 32
- BEE-spoke-data/phi-1bee5 • Text Generation • 1B • Updated • 12 • 1
trained and adapted tokenizers - various
🚧 "raw" pretrained smol_llama checkpoints - WIP 🚧
- BEE-spoke-data/smol_llama-101M-GQA • Text Generation • 0.1B • Updated • 2.98k • 30
- BEE-spoke-data/smol_llama-81M-tied • Text Generation • 81.3M • Updated • 1.17k • 9
- BEE-spoke-data/smol_llama-220M-GQA • Text Generation • 0.2B • Updated • 3.3k • 13
- BEE-spoke-data/verysmol_llama-v11-KIx2 • Text Generation • 58.1M • Updated • 1.15k • 4
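The `-GQA` suffix on the checkpoints above refers to grouped-query attention, where several query heads share a single key/value head to shrink the KV cache. A minimal sketch of the head mapping (the head counts are illustrative, not the actual smol_llama configs):

```python
# Sketch of the grouped-query attention (GQA) head mapping used by
# checkpoints like smol_llama-101M-GQA. Head counts below are
# illustrative examples, not the real model config.

def kv_head_for(query_head: int, n_query_heads: int, n_kv_heads: int) -> int:
    """Map a query-head index to the key/value head it shares."""
    assert n_query_heads % n_kv_heads == 0, "query heads must divide evenly"
    group_size = n_query_heads // n_kv_heads
    return query_head // group_size

# e.g. 8 query heads sharing 2 KV heads: heads 0-3 read KV head 0,
# heads 4-7 read KV head 1.
mapping = [kv_head_for(q, n_query_heads=8, n_kv_heads=2) for q in range(8)]
print(mapping)  # [0, 0, 0, 0, 1, 1, 1, 1]
```

With `n_kv_heads == n_query_heads` this degenerates to standard multi-head attention, and with `n_kv_heads == 1` to multi-query attention.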
Pretrained encoder (fill-mask) models we made
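For context, these encoders are pretrained with a masked-language-modelling (fill-mask) objective: a fraction of input tokens is swapped for a `[MASK]` token and the model learns to recover the originals. A toy sketch of the masking step (mask rate, seed, and example sentence are illustrative):

```python
import random

def mask_tokens(tokens, mask_rate=0.15, mask_token="[MASK]", seed=0):
    """Replace roughly mask_rate of tokens with mask_token.

    Returns the masked sequence plus a dict of {position: original token}
    that the encoder is trained to predict back.
    """
    rng = random.Random(seed)
    masked, targets = [], {}
    for i, tok in enumerate(tokens):
        if rng.random() < mask_rate:
            masked.append(mask_token)
            targets[i] = tok  # the training label at this position
        else:
            masked.append(tok)
    return masked, targets

masked, targets = mask_tokens("bees make honey in the hive".split(), mask_rate=0.3)
```

Real BERT-style pretraining adds refinements (keeping some selected tokens unchanged or replacing them with random tokens), which this sketch omits.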
text classification models for book genres
- BEE-spoke-data/albert-xxlarge-v2-description2genre • Text Classification • 0.2B • Updated • 24 • 2
- BEE-spoke-data/mobilebert-uncased-title2genre • Text Classification • 24.6M • Updated • 34 • 1
- BEE-spoke-data/roberta-large-title2genre • Text Classification • 0.4B • Updated • 33 • 1
- BEE-spoke-data/roberta-base-description2genre • Text Classification • 0.1B • Updated • 19
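A classifier like these maps a title or description to one genre by taking a softmax over per-genre logits and picking the argmax. A minimal sketch of that last step (the genre labels and logit values here are illustrative, not the models' actual label sets):

```python
import math

# Illustrative genre labels; the real title2genre/description2genre
# models define their own label sets in their configs.
GENRES = ["fantasy", "history", "romance", "science"]

def predict_genre(logits):
    """Softmax the logits and return (best genre, its probability)."""
    shifted = [x - max(logits) for x in logits]  # subtract max for stability
    exps = [math.exp(x) for x in shifted]
    total = sum(exps)
    probs = [e / total for e in exps]
    best = max(range(len(probs)), key=probs.__getitem__)
    return GENRES[best], probs[best]

genre, prob = predict_genre([0.2, 2.1, -0.5, 0.4])
print(genre)  # history
```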
concept datasets extracted from fineweb