Adam Englander

It Doesn't Need to Be This Hard: Running Large Language Models


Large language models have opened up incredible possibilities in natural language text generation and image synthesis. Running them, however, can be a challenge, particularly when published examples fail to run and platforms impose their own limitations. In this installment of the It Doesn't Need to Be This Hard series, we will take the shortest path toward running large language models.

Running Examples with Hugging Face Model Cards

Hugging Face, a popular platform for open-source machine learning, hosts a vast collection of pre-trained models, each documented in a model card. However, the examples provided in these model cards do not always run as written. Errors often come from parameters that are specific to the author's platform and incompatible with your environment. The way around this is to aim for a generic approach. Here is an example that has been tested with multiple models and works for text generation:

Install Requirements

First, install the necessary requirements for transformers and diffusers, including extras that cover most use cases:

python3 -m pip install "transformers[torch]" "diffusers[torch]" accelerate safetensors einops
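
Before moving on, it can help to confirm that the packages are actually importable. The following is a minimal sketch using only the standard library. The helper name missing_packages is hypothetical and not part of any of the installed libraries, and the list assumes each package's import name matches its install name.

```python
import importlib.util

# Import names for the packages installed above (torch comes in via the extras).
REQUIRED = ["transformers", "diffusers", "torch", "accelerate", "safetensors", "einops"]


def missing_packages(names):
    """Return the names from `names` that cannot be found for import."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_packages(REQUIRED)
    print("Missing packages:", ", ".join(missing) if missing else "none")
```

If anything is reported missing, rerun the install command with the same python3 interpreter you use to run the examples.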

Performing Text Generation

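import tempfile
import transformers

model_id = "hf-internal-testing/tiny-random-gpt2"
prompt = "How now brown"

# Weights that do not fit in memory can be offloaded to a temporary folder.
with tempfile.TemporaryDirectory() as offload_folder:
    tokenizer = transformers.AutoTokenizer.from_pretrained(model_id)
    # Assumed arguments: device_map="auto" lets accelerate place the weights
    # and spill to offload_folder when needed.
    model = transformers.AutoModelForCausalLM.from_pretrained(
        model_id, device_map="auto", offload_folder=offload_folder
    )
    # Move the tokenized prompt to the same device as the model.
    tensor = tokenizer(prompt, return_tensors="pt", return_token_type_ids=False).to(model.device)
    outputs = model.generate(**tensor)
    results = [tokenizer.decode(output, skip_special_tokens=True) for output in outputs]

print(results)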

Performing Image Generation

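import diffusers

model_id = "hf-internal-testing/unidiffuser-test-v1"
prompt = "A surrealist self-portrait from the perspective of an old-English mastiff"
output_file = "test.png"

pipeline = diffusers.DiffusionPipeline.from_pretrained(model_id)
# Assumed: keep the pipeline on its current device; pass "cpu" here instead
# to force CPU execution.
pipeline = pipeline.to(pipeline.device)
result = pipeline(prompt)
image = result.images[0]
# Write the generated image to disk as a PNG.
with open(output_file, "wb") as output_fd:
    image.save(output_fd, format="PNG")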

Feel free to change the model and prompt and experiment until you get the results you want.

Note: If you are using a Mac with an Apple Silicon chip (M1/M2) and encounter tensor compatibility errors, that model cannot run on your GPU. In such cases, force the model onto the CPU by replacing model.device or pipeline.device with "cpu" where it is passed to the to() function.
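
One way to make that switch without editing each call is to choose the device string in a single place. This is a minimal sketch, not part of the original examples: the function names choose_device and detect_device and the force_cpu flag are hypothetical, and it assumes PyTorch's usual device strings "cuda", "mps", and "cpu".

```python
def choose_device(cuda_available, mps_available, force_cpu=False):
    """Pick a device string, preferring CUDA, then Apple's MPS, then the CPU."""
    if force_cpu:
        return "cpu"
    if cuda_available:
        return "cuda"
    if mps_available:
        return "mps"
    return "cpu"


def detect_device(force_cpu=False):
    """Ask PyTorch what is available; fall back to the CPU if torch is absent."""
    try:
        import torch
    except ImportError:
        return "cpu"
    # Older PyTorch builds have no torch.backends.mps, so look it up defensively.
    mps = getattr(torch.backends, "mps", None)
    mps_available = mps is not None and mps.is_available()
    return choose_device(torch.cuda.is_available(), mps_available, force_cpu)


if __name__ == "__main__":
    # Set force_cpu=True when a model raises tensor compatibility errors on the GPU.
    print(detect_device(force_cpu=False))
```

The returned string can be passed to to() in place of model.device or pipeline.device.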

Non-GPU Machines

Testing code on non-GPU machines is feasible, but large models run slowly there, which drags out development and debugging. To mitigate this, test your code with smaller example models such as "hf-internal-testing/tiny-random-gpt2" or "hf-internal-testing/unidiffuser-test-v1" from Hugging Face. Their output is not meaningful, but these lightweight models exercise the same code paths, so you can verify that your code works before scaling up to more complex and resource-intensive models.
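
A simple way to follow this advice is to default to the tiny model and opt in to a larger one only when you ask for it. The sketch below is one possible arrangement, not something the examples above require: the function select_model_id and the environment variable name LLM_MODEL_ID are hypothetical.

```python
import os

# The tiny test model used in the text generation example.
TINY_TEXT_MODEL = "hf-internal-testing/tiny-random-gpt2"


def select_model_id(env=None, variable="LLM_MODEL_ID"):
    """Return the model named in `variable`, falling back to the tiny test model."""
    env = os.environ if env is None else env
    # An unset or empty variable means "use the tiny model".
    return env.get(variable) or TINY_TEXT_MODEL


if __name__ == "__main__":
    print(f"Using model: {select_model_id()}")
```

On a development laptop you leave the variable unset; on a GPU machine you set it to the model you actually want to run.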

Hosting Large Language Models

Running large language models for one-off executions may appear straightforward, but hosting them for continuous usage presents additional challenges. Most models cannot be loaded multiple times or process simultaneous requests on a single GPU. To address this, a shared process architecture is required, where a single process handles model requests, and a queue ensures that web requests interact with the model one at a time. Implementing this architecture is complex enough to warrant a separate installment of It Doesn't Need to Be This Hard.
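
The core of that idea, a single owner of the model fed by a queue, can be sketched in a few lines. This is a simplified illustration under stated assumptions, not the architecture from a later installment: it uses a worker thread inside one process rather than a separate process, the class name ModelWorker is hypothetical, and fake_generate stands in for a real call to the model.

```python
import queue
import threading
from concurrent.futures import Future


class ModelWorker:
    """Owns the model and serves queued requests strictly one at a time."""

    def __init__(self, generate_fn):
        self._generate = generate_fn
        self._requests = queue.Queue()
        self._thread = threading.Thread(target=self._run, daemon=True)
        self._thread.start()

    def submit(self, prompt):
        """Queue a prompt and return a Future that will hold the result."""
        future = Future()
        self._requests.put((prompt, future))
        return future

    def _run(self):
        while True:
            item = self._requests.get()
            if item is None:  # Sentinel placed by close(): stop the worker.
                break
            prompt, future = item
            try:
                future.set_result(self._generate(prompt))
            except Exception as exc:
                # Report the failure to the caller instead of killing the worker.
                future.set_exception(exc)

    def close(self):
        """Finish the requests already queued, then stop the worker thread."""
        self._requests.put(None)
        self._thread.join()


def fake_generate(prompt):
    """Stand-in for a real model call such as model.generate()."""
    return prompt + " cow"


if __name__ == "__main__":
    worker = ModelWorker(fake_generate)
    futures = [worker.submit(p) for p in ["How now brown", "Holy"]]
    print([f.result(timeout=5) for f in futures])
    worker.close()
```

Web request handlers would call submit() and wait on the returned Future, so any number of requests can arrive at once while the model only ever sees one at a time.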

Examples and Utilities for the Functionally Lazy

If you prefer not to write extensive code and simply want the model to generate outputs for you, we've got you covered. The examples in this article are based on our Model Wrangler project, which provides convenient utilities and pre-written code to simplify the process.


Running large language models offers tremendous possibilities but can come with challenges. By following the recommended approaches and addressing platform-specific issues, you can overcome these hurdles and unlock the full potential of large language models in your projects.