Large language models have opened up incredible possibilities in natural language text generation and image synthesis. Running them, however, presents its own challenges, from unrunnable documentation examples to platform-specific limitations. In this installment of the It Doesn't Need to Be This Hard series, we take the shortest path to running large language models.
Running Examples with Hugging Face Model Cards
Hugging Face, a popular open-source framework, offers a vast collection of pre-trained models through its model cards. The examples provided in these model cards, however, don't always run as written: they often pass platform-specific parameters that aren't compatible with your environment. To avoid this, aim for a generic approach. The following example has been tested with multiple models and works for text generation:
First, install the necessary requirements for transformers and diffusers, including extras that cover most use cases:
python3 -m pip install "transformers[torch]" "diffusers[torch]" accelerate safetensors einops
Performing Text Generation
import tempfile

import transformers

model = "hf-internal-testing/tiny-random-gpt2"
prompt = "How now brown"

# Use a temporary directory as scratch space in case the model needs to
# offload weights that don't fit in memory.
with tempfile.TemporaryDirectory() as offload_folder:
    tokenizer = transformers.AutoTokenizer.from_pretrained(model)
    model = transformers.AutoModelForCausalLM.from_pretrained(
        model,
        device_map="auto",
        offload_folder=offload_folder,
    )
    tensor = tokenizer(prompt, return_tensors="pt", return_token_type_ids=False).to(
        model.device
    )
    outputs = model.generate(**tensor)
    results = [tokenizer.decode(output, skip_special_tokens=True) for output in outputs]
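If you don't need fine-grained control over the tokenizer and model, a shorter route is transformers' high-level pipeline helper, which wraps the loading and decoding steps in one call. A minimal sketch using the same tiny test model (max_new_tokens is our choice, not from the original example):

```python
import transformers

# The pipeline helper bundles tokenizer + model loading and decoding.
generator = transformers.pipeline(
    "text-generation", model="hf-internal-testing/tiny-random-gpt2"
)
results = generator("How now brown", max_new_tokens=8)
# By default the returned text includes the original prompt.
print(results[0]["generated_text"])
```
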
Performing Image Generation
import diffusers

model_id = "hf-internal-testing/unidiffuser-test-v1"
prompt = "A surrealist self-portrait from the perspective of an old-English mastiff"
output_file = "test.png"

pipeline = diffusers.DiffusionPipeline.from_pretrained(model_id)
result = pipeline(prompt)
image = result.images[0]
with open(output_file, "wb") as output_fd:
    image.save(output_fd, format="PNG")
Feel free to change the model and prompt to experiment until you get the results you want.
Note: If you are using a Mac with an Apple Silicon chip (M1/M2) and encounter tensor compatibility errors, you won't be able to use your GPU with that model. In such cases, force the model to use the CPU by passing "cpu" wherever a device is passed to a .to() call.
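One defensive pattern is to choose the device explicitly up front. This helper is ours, not part of any library; it prefers Apple's Metal backend or CUDA when available and lets you force the CPU when a model's tensors aren't supported:

```python
import torch

def pick_device(force_cpu: bool = False) -> str:
    """Pick a torch device string, with an escape hatch to force the CPU."""
    if force_cpu:
        return "cpu"
    if torch.backends.mps.is_available():
        return "mps"  # Apple Silicon GPU
    if torch.cuda.is_available():
        return "cuda"
    return "cpu"

# Force the CPU when MPS raises tensor compatibility errors for a model.
device = pick_device(force_cpu=True)
```

You can then pass the resulting string to every .to() call instead of hardcoding a device.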
While testing code on non-GPU machines is feasible, it can significantly slow down the development and debugging process. To mitigate this, it is recommended to test your code using smaller example models such as "tiny-random-gpt2" or "unidiffuser-test-v1" from Hugging Face. These lightweight models can provide valuable insights into the functionality of your code before scaling up to more complex and resource-intensive models.
Hosting Large Language Models
Running large language models for one-off executions may appear straightforward, but hosting them for continuous usage presents additional challenges. Most models cannot be loaded multiple times or process simultaneous requests on a single GPU. To address this, a shared process architecture is required, where a single process handles model requests, and a queue ensures that web requests interact with the model one at a time. Implementing this architecture is complex enough to warrant a separate installment of It Doesn't Need to Be This Hard.
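As a rough illustration of that architecture (the names and the stand-in generate function are ours, not a real serving framework), a single worker thread can own the model while web handlers enqueue requests and wait for their answers:

```python
import queue
import threading

# One worker owns the (single) model; all requests are serialized through
# this queue, so the model only ever sees one request at a time.
request_queue = queue.Queue()

def fake_generate(prompt):
    # Stand-in for a real model.generate() call.
    return prompt + " ..."

def model_worker():
    while True:
        prompt, reply_queue = request_queue.get()
        reply_queue.put(fake_generate(prompt))

threading.Thread(target=model_worker, daemon=True).start()

def handle_web_request(prompt):
    # Each web request gets its own reply queue and blocks for its answer.
    reply_queue = queue.Queue()
    request_queue.put((prompt, reply_queue))
    return reply_queue.get()

answer = handle_web_request("How now brown")
```
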
Examples and Utilities for the Functionally Lazy
If you prefer not to write extensive code and simply want the model to generate outputs for you, we've got you covered. The examples in this article are based on our Model Wrangler project, which provides convenient utilities and pre-written code to simplify the process.
Running large language models offers tremendous possibilities but can come with challenges. By following the recommended approaches and addressing platform-specific issues, you can overcome these hurdles and unlock the full potential of large language models in your projects.