<!DOCTYPE html>
<html lang="en">
<head>
  <title>Bootstrap Example</title>
  <meta charset="utf-8">
  <meta name="viewport" content="width=device-width, initial-scale=1">
  <link rel="stylesheet" href="https://maxcdn.bootstrapcdn.com/bootstrap/3.4.1/css/bootstrap.min.css">
  <script src="https://ajax.googleapis.com/ajax/libs/jquery/3.5.1/jquery.min.js"></script>
  <script src="https://maxcdn.bootstrapcdn.com/bootstrap/3.4.1/js/bootstrap.min.js"></script>
  <style>
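    /* Note: --window-color is not defined in this stylesheet; it is assumed
       to be provided by the page that embeds this demo. */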
    .faded {
      margin: 0 auto;
      background: var(--window-color);
      box-shadow: 0 0 5px 5px var(--window-color);
      font-family: "Gill Sans", sans-serif;
      display: inline-block;
    }
    .padded {
      width: 100%;
      max-width: 800px;
      text-align: left;
    }
    .demo_title {
      font-size: 32px;
      box-shadow: 0 0 5px 5px var(--window-color);
      font-family: -apple-system, BlinkMacSystemFont, "Segoe UI", Helvetica, Arial,
        sans-serif, "Apple Color Emoji", "Segoe UI Emoji";
    }
    .demo_text {
      font-size: 16px;
      box-shadow: 0 0 5px 5px var(--window-color);
      font-family: -apple-system, BlinkMacSystemFont, "Segoe UI", Helvetica, Arial,
        sans-serif, "Apple Color Emoji", "Segoe UI Emoji";
    }
    .tab-group {
      font-size: 15px;
    }
    .tab-content {
      margin-top: 16px;
    }
    ul > li {
      margin: 3px 0;
    }
    ol > li {
      margin: 5px 0;
    }
    /* a:link {
      color: #00194a;
      text-decoration: none;
    }
    a:visited {
      color: #3f004a;
      text-decoration: none;
    } */
  </style>
</head>
<body>
<div class="tab-group" style="width: 100%; margin: 0 auto;">
  <div>
    <!-- Nav tabs -->
    <ul class="nav nav-tabs" role="tablist">
      <li role="presentation" class="active"><a href="#tab1" aria-controls="tab1" role="tab" data-toggle="tab">Memory-Efficient Training</a></li>
      <li role="presentation"><a href="#tab2" aria-controls="tab2" role="tab" data-toggle="tab">Security</a></li>
      <li role="presentation"><a href="#tab3" aria-controls="tab3" role="tab" data-toggle="tab">Make Your Own</a></li>
    </ul>
    <!-- Tab panes -->
    <div class="tab-content">
| <div role="tabpanel" class="tab-pane active" id="tab1"> | |
| <p> | |
| Our aim is to train a large model in a decentralized fashion on consumer hardware or low-end cloud instances. | |
| This means we need to make the model, dataset, and other memory buffers fit onto a few GB of disk, 12-16 GB of CPU RAM, | |
| and 8-12 GB of GPU memory. Unfortunately, this rules out many popular techniques such as | |
| <a target="_blank" rel="noopener noreferrer" href="https://arxiv.org/abs/2101.06840">ZeRO-Offload</a>: | |
| there is simply not enough RAM for that. Instead, we must make better use of what limited memory we have. | |
| To do this, we use two techniques: 8-bit Optimizers for GPU memory and dataset streaming for RAM & HDD. | |
| </p> | |
        <p>
          <b>8-bit optimizers:</b>
          Optimizers such as LAMB or Adam require four times as much GPU memory as simply storing the model parameters (8 bytes vs. 2 bytes),
          since they maintain additional statistics for every parameter.
          As such, when training large models, the optimizer states make up the largest chunk of memory.
          With 8-bit optimizers, this memory is reduced by 75% (to 2 bytes), making it much easier to fit large models onto consumer GPUs.
        </p><p>
          Naturally, we can combine this technique with offloading: storing the 8-bit optimizer states in CPU memory rather
          than GPU memory (0 bytes on GPU, 2 bytes on CPU). To perform an optimizer update, we transfer the GPU gradients
          to the CPU, perform the update, and then transfer the updated weights back to the GPU.
          We can do this for each weight one by one, so that the additional CPU memory required for the optimizer update
          is minimal.
          The combination of offloading and 8-bit optimizers means that we conserve GPU memory (0 bytes per parameter)
          while using only a limited amount of CPU memory (2 bytes per parameter).
        </p>
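        <p>
          To illustrate, here is a minimal sketch of how one might switch a PyTorch training script to an 8-bit optimizer with the
          <a target="_blank" rel="noopener noreferrer" href="https://github.com/TimDettmers/bitsandbytes">bitsandbytes</a> library.
          The model and learning rate below are placeholders, not the actual configuration of our experiment:
        </p>
<pre><code># A minimal sketch: swapping 32-bit Adam for its 8-bit counterpart.
import torch
import bitsandbytes as bnb

model = torch.nn.Linear(1024, 1024).cuda()  # placeholder model

# optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)   # 8 bytes per parameter
optimizer = bnb.optim.Adam8bit(model.parameters(), lr=1e-4)   # 2 bytes per parameter
</code></pre>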
        <p>
          <b>Dataset streaming:</b>
          Usually, data is stored on disk and needs to be fully or partially loaded into CPU memory before it can be used for training.
          Large datasets used for pre-training measure in <a target="_blank" rel="noopener noreferrer" href="https://arxiv.org/abs/2101.00027">hundreds of gigabytes</a>
          or even <a target="_blank" rel="noopener noreferrer" href="https://laion.ai/laion-400-open-dataset/">terabytes</a>.
          This poses a significant problem, as most desktop computers and cheap cloud instances simply do not have that much space.
          Furthermore, downloading the dataset over the internet would take hours before one could even begin training.
          <!--Changing the dataset means downloading a new dataset in full and using additional disk space.-->
        </p>
        <p>
          To circumvent these problems, we stream the training dataset the same way you stream online videos.
          Participants download a small random portion of the training dataset and immediately begin training on it,
          while additional data is loaded in the background. As such, we can train the model with virtually no memory
          overhead from the dataset, and switching to a new dataset is as simple as changing an argument to the dataset class.
        </p>
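        <p>
          For example, with the Hugging Face
          <a target="_blank" rel="noopener noreferrer" href="https://github.com/huggingface/datasets">datasets</a> library,
          streaming is a one-line change. The sketch below uses the C4 corpus purely as an illustration:
        </p>
<pre><code># A minimal sketch of dataset streaming: with streaming=True, examples are
# downloaded on the fly instead of being saved to disk in advance.
from datasets import load_dataset

dataset = load_dataset("c4", "en", split="train", streaming=True)
for example in dataset:
    ...  # feed the example into the training loop
</code></pre>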
        <center>
          Here's a tutorial for using these techniques:<br>
          <a target="_blank" rel="noopener noreferrer" href="https://colab.research.google.com/gist/justheuristic/75f6a2a731f05a213a55cd2c8a458aaf/fine-tune-a-language-model-with-dataset-streaming-and-8-bit-optimizers.ipynb">
            <img src="https://colab.research.google.com/assets/colab-badge.svg" width="360">
          </a>
        </center>
      </div>
| <div role="tabpanel" class="tab-pane" id="tab2"> | |
| <p>In this section, we discuss common concerns related to security of the collaborative training.</p> | |
| <p> | |
| <b>Q: If I join a collaborative training, do I allow other people to execute code on my computer?</b> | |
| </p> | |
| <p> | |
| <b>A:</b> During the training, participants only exchange data (gradients, statistics, model weights) and never send code to each other. | |
| No other peer can execute code on your computer. | |
| </p> | |
        <p>
          To join the training, you typically need to run the code (implementing the model, data streaming, training loop, etc.)
          from a repository or a Colab notebook provided by the authors of the experiment.
          This is no different from running any other open source project or Colab notebook.
        </p>
        <p>
          <b>Q: Can a malicious participant influence the training outcome?</b>
        </p>
        <p>
          <b>A:</b> It is indeed possible unless we use a defense mechanism.
          For instance, a malicious participant can damage the model weights by sending large numbers instead of correct gradients.
          The same can happen due to broken hardware or misconfiguration.
        </p>
        <ul>
          <li>
            <p>
              One possible defense is using <b>authentication</b> combined with <b>model checkpointing</b>.
              In this case, participants should log in (e.g., with their Hugging Face account) to interact with the rest of the collaboration.
              In turn, moderators can screen potential participants and add them to an allowlist.
              If something goes wrong (e.g., if a participant sends invalid gradients and the model diverges),
              the moderators remove them from the list and revert the model to the latest checkpoint unaffected by the attack.
            </p>
            <!-- <p><b>Spoiler (TODO): How to implement authentication in a decentralized system efficiently?</b></p>-->
            <p>
              A nice bonus: using this data, the moderators can acknowledge the personal contribution of each participant.
            </p>
          </li>
          <li>
            <p>
              Another defense is replacing the naive averaging of the peers' gradients with an <b>aggregation technique robust to outliers</b>.
              <a target="_blank" rel="noopener noreferrer" href="https://arxiv.org/abs/2012.10333">Karimireddy et al. (2020)</a>
              suggested such a technique (named CenteredClip) and proved that it does not significantly affect the model's convergence
              (a simplified sketch of the procedure appears after this list).
            </p>
            <!-- <p><b>Spoiler (TODO): How does CenteredClip protect from outliers? (Interactive Demo)</b></p>-->
            <p>
              In our case, CenteredClip is useful but not sufficient to protect against malicious participants,
              since it assumes that the CenteredClip procedure itself is performed by a trusted server.
              In contrast, in our decentralized system, each participant aggregates only a part of the gradients, and we cannot assume all of them to be trusted.
            </p>
            <p>
              Recently, <a target="_blank" rel="noopener noreferrer" href="https://arxiv.org/abs/2106.11257">Gorbunov et al. (2021)</a>
              proposed a robust aggregation protocol for decentralized systems that does not require this assumption.
              This protocol uses CenteredClip as a subroutine but is able to detect and ban participants who performed it incorrectly.
            </p>
          </li>
        </ul>
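        <p>
          To give a feel for the procedure, here is a simplified sketch of CenteredClip as described by
          <a target="_blank" rel="noopener noreferrer" href="https://arxiv.org/abs/2012.10333">Karimireddy et al. (2020)</a>:
          starting from the naive average, it repeatedly re-centers the estimate using per-peer deviations clipped to a radius tau.
          The tensor shapes and the clipping radius below are illustrative:
        </p>
<pre><code># A simplified sketch of CenteredClip: robust aggregation of per-peer gradients.
import torch

def centered_clip(peer_grads, tau=1.0, n_iters=5):
    grads = torch.stack(peer_grads)    # shape: (n_peers, n_params)
    v = grads.mean(dim=0)              # start from the naive average
    for _ in range(n_iters):
        deltas = grads - v             # each peer's deviation from the estimate
        norms = deltas.norm(dim=1, keepdim=True).clamp_min(1e-12)
        # clip deviations whose norm exceeds tau, leave the rest unchanged
        deltas = deltas * torch.clamp(tau / norms, max=1.0)
        v = v + deltas.mean(dim=0)     # re-center the estimate
    return v
</code></pre>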
      </div>
      <div role="tabpanel" class="tab-pane" id="tab3">
        <p>In this section, we provide a roadmap for running collaborative training yourself.</p>
        <p>
          <b>Got confused?</b> Feel free to ask any questions in our <a target="_blank" rel="noopener noreferrer" href="https://discord.gg/uGugx9zYvN">Discord</a>!
        </p>
        <ol>
          <li>
            Set up dataset streaming:
            <ul>
              <li>
                <a target="_blank" rel="noopener noreferrer" href="https://huggingface.co/docs/datasets/share_dataset.html">Upload</a> your dataset to the Hugging Face Hub
                in a streaming-friendly format (<a target="_blank" rel="noopener noreferrer" href="https://huggingface.co/datasets/laion/laion_100m_vqgan_f8">example</a>).
              </li>
              <li>Set up dataset streaming (see the "Memory-Efficient Training" section).</li>
            </ul>
          </li>
          <li>
            Write the code for training peers (<a target="_blank" rel="noopener noreferrer" href="https://github.com/learning-at-home/dalle-hivemind/blob/main/run_trainer.py">example</a>):
            <ul>
              <li>Implement your model, set up dataset streaming, and write the training loop.</li>
              <li>
                Get familiar with the hivemind library
                (e.g., via the <a target="_blank" rel="noopener noreferrer" href="https://learning-at-home.readthedocs.io/en/latest/user/quickstart.html">quickstart</a>).
              </li>
              <li>
                In the training loop, wrap your PyTorch optimizer with
                <a target="_blank" rel="noopener noreferrer" href="https://learning-at-home.readthedocs.io/en/latest/modules/optim.html#hivemind.optim.experimental.optimizer.Optimizer">hivemind.Optimizer</a>
                (<a target="_blank" rel="noopener noreferrer" href="https://github.com/learning-at-home/dalle-hivemind/blob/main/task.py#L121">example</a>), as sketched below.
              </li>
            </ul>
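            <p>
              Here is a minimal sketch of this step, loosely following the hivemind quickstart; the run ID, batch sizes, and model below are placeholders:
            </p>
<pre><code># A minimal sketch of wrapping a PyTorch optimizer with hivemind.Optimizer.
import torch
import hivemind

model = torch.nn.Linear(1024, 10)  # placeholder model
base_optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)

# pass initial_peers=[...] here to join an existing collaboration
dht = hivemind.DHT(start=True)

optimizer = hivemind.Optimizer(
    dht=dht,                   # a DHT node connected to the rest of the collaboration
    run_id="my_run",           # common identifier of this collaborative experiment
    batch_size_per_step=4,     # samples processed locally per optimizer step
    target_batch_size=4096,    # global batch size across all peers per model update
    optimizer=base_optimizer,  # the regular PyTorch optimizer being wrapped
)
# the training loop then calls optimizer.step() as usual
</code></pre>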
          </li>
          <li>
            <b>(optional)</b> Write the code for auxiliary peers (<a target="_blank" rel="noopener noreferrer" href="https://github.com/learning-at-home/dalle-hivemind/blob/main/run_aux_peer.py">example</a>):
            <ul>
              <li>
                Auxiliary peers are a special kind of peer responsible for
                logging loss and other metrics (e.g., to <a target="_blank" rel="noopener noreferrer" href="https://wandb.ai/">Weights &amp; Biases</a>)
                and uploading model checkpoints (e.g., to the <a target="_blank" rel="noopener noreferrer" href="https://huggingface.co/docs/transformers/model_sharing">Hugging Face Hub</a>).
              </li>
              <li>
                Such peers don't need to calculate gradients and may be run on cheap machines without GPUs.
              </li>
              <li>
                They can serve as a convenient entry point to the
                <a target="_blank" rel="noopener noreferrer" href="https://learning-at-home.readthedocs.io/en/latest/modules/dht.html">hivemind.DHT</a>
                (i.e., their address can be specified as <code>initial_peers</code>).
              </li>
              <li>
                It is useful to fix their address by providing the <code>host_maddrs</code> and <code>identity_path</code>
                arguments to <code>hivemind.DHT</code>
                (these are forwarded to the underlying <a target="_blank" rel="noopener noreferrer" href="https://libp2p.io/">libp2p</a> daemon),
                as in the sketch below.
              </li>
            </ul>
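            <p>
              A minimal sketch of launching such a peer with a fixed address; the port and identity file path are placeholders:
            </p>
<pre><code># A minimal sketch of an auxiliary peer with a fixed, reusable address.
import hivemind

dht = hivemind.DHT(
    host_maddrs=["/ip4/0.0.0.0/tcp/31337"],  # listen on a fixed TCP port
    identity_path="./peer_identity.key",     # reuse the same libp2p identity across restarts
    start=True,
)
# print the addresses that training peers can pass as initial_peers
print([str(addr) for addr in dht.get_visible_maddrs()])
</code></pre>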
          </li>
          <li>
            <b>(optional)</b> Make it easier for other people to join:
            <ul>
              <li>
                Create notebooks for free GPU providers (Google Colab, Kaggle, AWS SageMaker, etc.).
                People may run them online and/or download and run them on their own hardware.
              </li>
              <li>
                <a target="_blank" rel="noopener noreferrer" href="https://huggingface.co/organizations/new">Create</a> a Hugging Face organization
                with all resources related to the training run
                (dataset, model, inference demo, links to a dashboard with loss and other metrics, etc.).
                Take a look at <a target="_blank" rel="noopener noreferrer" href="https://huggingface.co/training-transformers-together">ours</a> as an example.
              </li>
              <li>
                Set up an authentication system (see the "Security" section).
                For example, you can ask people to join your organization with their Hugging Face accounts
                (Hugging Face allows you to share an invite link or to manually approve new participants).
                This allows you to screen participants,
                acknowledge their contributions (e.g., make a leaderboard), and
                ban accounts that behave maliciously.
              </li>
              <li>
                Set up an inference demo for your model (e.g., using <a target="_blank" rel="noopener noreferrer" href="https://huggingface.co/spaces">Spaces</a>) or
                a script that periodically uploads inference results to show the training progress.
              </li>
            </ul>
          </li>
        </ol>
      </div>
    </div>
  </div>
</div>
</body>
</html>