It looks like everyone's jumping on the AI bandwagon these days, and I'm no exception. But instead of just running AI models locally on a single machine, I want to build a small, distributed AI system using some cool new tech. This blog is your guide to doing the same without emptying your wallet.
The question becomes, “Why would you want to build a distributed AI system at home?” For me, it comes down first and foremost to control and privacy. While tech giants like OpenAI, Anthropic, and Google deploy massive clusters with thousands of GPUs costing millions of dollars, there's something deeply satisfying about creating a miniature version of these architectures in your own space. It's like having a scale model of enterprise AI infrastructure that actually works.
The principles that power these industry behemoths (distributed computing, parallel processing, and efficient resource allocation) can be applied at a much smaller scale using consumer hardware. With the recent advancements in Apple Silicon and tools like Ray and vLLM, we can now build systems that would have been impossible for individuals just a few years ago.
Of course, even small-scale distributed systems face the same security challenges as their enterprise counterparts. Managing API keys, model access tokens, and configuration secrets across multiple machines quickly becomes a headache. That's why I'm incorporating Doppler's secrets management platform into this build. Just as major AI labs need robust secrets management for their infrastructure, our home setup benefits from the same professional-grade security practices. Using Doppler with our Terraform deployment ensures that sensitive credentials are never hardcoded in our infrastructure files while making them seamlessly available across our Mac Mini cluster.
This project isn't just about running bigger models or processing more data: It's about understanding how modern AI systems actually work under the hood. By building your own distributed setup with proper DevOps practices, you'll gain insights into the same architectural and security challenges that AI engineers at major companies tackle daily, just at a more approachable scale.
Best of all, this entire setup costs less than a single high-end GPU while providing a flexible platform for experimentation that can grow with your needs. Let's dive in and see how we can bring enterprise-grade AI architecture to your home office.
I'm using a couple of base model M4 Mac Minis, each with 16GB of RAM, for this project. This setup works fine for what I need, but bumping it up to 32GB would really boost performance and efficiency, especially for heavier models and multitasking. Both systems will run macOS Sequoia. To keep everything connected and ensure fast data transfer, each Mac Mini will hook up to my internal network via a gigabit Ethernet connection. This is key for a smooth and efficient workflow. It’d be awesome to have a pair of fully maxed-out M2 Mac Studios, or maybe even the upcoming M4 variants, to handle bigger models in a setup like this, but maybe one day!
First, we need to install the necessary tools on both Mac Minis. I have created some Terraform configurations with bash wrappers to automate this process. The bootstrap script installs Homebrew (the popular package manager for macOS) and Python, and sets up a Conda environment with the required dependencies. Using a Conda environment helps isolate our project dependencies and ensures compatibility with Apple Silicon. You can follow along here, or you can keep up to date with my latest developments over on my GitHub page: https://github.com/the0x53c/AI-Cluster-Distribution/new/main
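If you'd rather see the shape of it before cloning the repo, here's a minimal sketch of what the bootstrap wrapper does. The environment name and package list are illustrative; the script in the repo is the source of truth.

```bash
#!/usr/bin/env bash
# Illustrative bootstrap sketch -- the real script lives in the repo linked above.
set -euo pipefail

# Install Homebrew if it isn't already present
if ! command -v brew >/dev/null 2>&1; then
  /bin/bash -c "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/HEAD/install.sh)"
fi

# Install Python and Miniconda
brew install python
brew install --cask miniconda

# Create and activate an isolated Conda environment (the name is just an example)
source "$(brew --prefix)/Caskroom/miniconda/base/etc/profile.d/conda.sh"
conda create -y -n ray-cluster python=3.11
conda activate ray-cluster

# Core dependencies: Ray for distributed compute, PyTorch + Transformers for inference
pip install "ray[default]" torch transformers
```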
The next thing we need to do is create an SSH key on the head node (Mac mini 1) and copy the public key to the worker node (Mac mini 2).
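On the head node, that's just the standard two commands (the key path and worker IP below are examples; swap in your own):

```bash
# On the head node (Mac mini 1): generate a key pair with no passphrase
ssh-keygen -t ed25519 -f ~/.ssh/id_ed25519 -N ""

# Copy the public key to the worker node (example user/IP -- use your own)
ssh-copy-id -i ~/.ssh/id_ed25519.pub worker-user@192.168.1.11

# Confirm passwordless login works
ssh worker-user@192.168.1.11 hostname
```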
Now that we have that configured, let's use Terraform to automate the setup of our Ray cluster. First, we set up a baseline provider declaration along with input variables, supplied via environment variables, that hold the IP addresses of our head and worker nodes as well as the SSH username for both.
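A stripped-down version of that baseline might look like this. The Doppler provider we'll use shortly is declared here as well; the variable names are my own conventions, and their values can be supplied through TF_VAR_* environment variables:

```hcl
terraform {
  required_providers {
    doppler = { source = "DopplerHQ/doppler" }
    local   = { source = "hashicorp/local" }
    null    = { source = "hashicorp/null" }
  }
}

# Supplied via TF_VAR_head_node_ip, TF_VAR_worker_node_ip, and TF_VAR_ssh_user
variable "head_node_ip" {
  description = "IP address of the head node (Mac mini 1)"
  type        = string
}

variable "worker_node_ip" {
  description = "IP address of the worker node (Mac mini 2)"
  type        = string
}

variable "ssh_user" {
  description = "SSH username for both nodes"
  type        = string
}
```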
To securely manage our SSH keys, I am going to store both the SSH public and private keys in a dedicated Doppler project labeled “mac-mini-ai-cluster”. From there, I can pull the keys down during the terraform apply operation.
The Doppler CLI gives me a single mechanism for handling my secrets, with every CRUD operation scoped to my particular Doppler token.
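Storing the key pair is a couple of CLI calls. The secret names here (SSH_PUBLIC_KEY, SSH_PRIVATE_KEY) are just the convention I'm using in these examples:

```bash
# Store the SSH key pair as secrets in the mac-mini-ai-cluster project's dev config
doppler secrets set SSH_PUBLIC_KEY "$(cat ~/.ssh/id_ed25519.pub)" \
  --project mac-mini-ai-cluster --config dev
doppler secrets set SSH_PRIVATE_KEY "$(cat ~/.ssh/id_ed25519)" \
  --project mac-mini-ai-cluster --config dev

# Read a value back or list everything -- all scoped to the same token
doppler secrets get SSH_PUBLIC_KEY --project mac-mini-ai-cluster --config dev --plain
doppler secrets --project mac-mini-ai-cluster --config dev
```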
We are going to use the Doppler Terraform provider to hydrate our Terraform workflows with these secrets. This gives us a reusable, decoupled deployment mechanism that can easily span different infrastructures.
The target project where the keys live is labeled mac-mini-ai-cluster, and I am going to target the dev configuration; you can change these values if you're following along with your own setup.
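Wiring that up looks roughly like this. The provider authenticates with a Doppler service token (read from the DOPPLER_TOKEN environment variable), and the doppler_secrets data source exposes every secret in the config as a map:

```hcl
# Authenticates with a Doppler service token supplied via DOPPLER_TOKEN
provider "doppler" {}

# Pull every secret from the mac-mini-ai-cluster project's dev config
data "doppler_secrets" "cluster" {
  project = "mac-mini-ai-cluster"
  config  = "dev"
}

# Individual values are then available as, for example:
#   data.doppler_secrets.cluster.map.SSH_PRIVATE_KEY
```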
Next, we want to onboard our setup script on the head node. This will activate conda and start the Ray service. We have a similar script that will be deployed to the worker node, which joins the cluster by passing the --address flag to the ray start command in the bash script. Lastly, we will onboard vLLM and install its dependencies. Before we do that, though, I am going to onboard the SSH keys from Doppler into their proper locations on the Mac mini nodes. I will be able to do this using a collection of ‘local_file’ resource calls.
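Stripped of the repo-specific details, the two setup scripts boil down to something like this (the port and environment name are examples):

```bash
# start_head.sh -- runs on the head node
source "$(conda info --base)/etc/profile.d/conda.sh"
conda activate ray-cluster
ray start --head --port=6379

# start_worker.sh -- runs on the worker node, joining the head via --address
source "$(conda info --base)/etc/profile.d/conda.sh"
conda activate ray-cluster
ray start --address="<head-node-ip>:6379"
```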
Now that we have the scripts onboarded to the respective nodes, we need to mark them as executable so that they can be run as part of the Terraform operation.
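Here's a sketch covering both of those last pieces: a local_file resource that writes the Doppler-sourced private key to disk with tight permissions, and a null_resource that SSHes into each node to chmod the scripts. Resource names, paths, and script filenames are illustrative.

```hcl
# Write the private key pulled from Doppler to a local file for SSH connections
resource "local_file" "ssh_private_key" {
  content         = data.doppler_secrets.cluster.map.SSH_PRIVATE_KEY
  filename        = "${path.module}/keys/id_ed25519"
  file_permission = "0600"
}

# Mark the onboarded setup script executable on each node
resource "null_resource" "chmod_scripts" {
  for_each = toset([var.head_node_ip, var.worker_node_ip])

  connection {
    type        = "ssh"
    host        = each.value
    user        = var.ssh_user
    private_key = data.doppler_secrets.cluster.map.SSH_PRIVATE_KEY
  }

  provisioner "remote-exec" {
    # Each node only has its own script, so a glob keeps this generic
    inline = ["chmod +x ~/start_*.sh"]
  }
}
```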
I created a small model runner script called run_model.py, which serves as the core component for distributed AI inference across the cluster, and its integration with Doppler Secrets Manager significantly enhances both security and operational flexibility. At startup, the script establishes a secure connection to Doppler using the CLI, retrieving sensitive credentials like HuggingFace API tokens, model access keys, and cluster configuration parameters without hardcoding them in the codebase. This approach ensures that even if the script is committed to version control, no credentials are exposed. The script's Ray initialization leverages Doppler-stored connection details to establish communication with the head node, while the model loading process uses stored API tokens to access gated or private models from HuggingFace.
By implementing a dedicated get_doppler_secrets() function, the script maintains a clean separation between secret management and business logic, allowing credentials to be rotated in Doppler without requiring code changes. This integration also enables environment-specific configurations: developers can switch between development, staging, and production model endpoints by simply changing the Doppler configuration rather than modifying the script itself. The result is a robust, secure inference system where sensitive information is centrally managed, automatically distributed to authorized nodes, and never exposed in logs, code repositories, or deployment artifacts.
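To make that concrete, here's a condensed sketch of the pattern: pull secrets via the Doppler CLI, then feed them into Ray and the model loader. The secret names (RAY_ADDRESS, HUGGINGFACE_TOKEN, MODEL_NAME) are my illustrative conventions rather than requirements, and the full script in the repo does more than this.

```python
import json
import subprocess

import ray
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer


def get_doppler_secrets() -> dict:
    """Fetch all secrets for the current Doppler project/config via the CLI."""
    result = subprocess.run(
        ["doppler", "secrets", "download", "--no-file", "--format", "json"],
        capture_output=True, text=True, check=True,
    )
    return json.loads(result.stdout)


def main() -> None:
    secrets = get_doppler_secrets()

    # Connect to the running Ray cluster using the Doppler-stored head node address
    ray.init(address=secrets["RAY_ADDRESS"])

    # Load a (possibly gated) model with the stored HuggingFace token
    model_name = secrets.get("MODEL_NAME", "TinyLlama/TinyLlama-1.1B-Chat-v1.0")
    hf_token = secrets["HUGGINGFACE_TOKEN"]
    tokenizer = AutoTokenizer.from_pretrained(model_name, token=hf_token)
    model = AutoModelForCausalLM.from_pretrained(model_name, token=hf_token)

    # Run on Apple Silicon's MPS backend when it's available
    device = "mps" if torch.backends.mps.is_available() else "cpu"
    model.to(device)

    inputs = tokenizer("Hello from the Mac mini cluster!", return_tensors="pt").to(device)
    print(tokenizer.decode(model.generate(**inputs, max_new_tokens=32)[0]))


if __name__ == "__main__":
    main()
```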
The test_cluster.py script functions as a critical diagnostic tool for our distributed Mac Mini AI infrastructure, and its integration with Doppler Secrets Manager transforms it into a secure, configurable testing framework. Upon execution, the script first establishes a connection to Doppler to retrieve essential cluster configuration parameters, including node IP addresses, Ray port settings, and performance test thresholds, ensuring that sensitive network information remains protected and centrally managed. This integration enables the script to dynamically connect to the Ray cluster using securely stored connection strings rather than hardcoded endpoints, allowing the same test script to operate across development, staging, and production environments without modification.
The performance testing component leverages Doppler-stored benchmark parameters, such as matrix sizes and iteration counts, which can be adjusted centrally to accommodate different hardware configurations or testing scenarios. By retrieving these values at runtime, the testing methodology can evolve without requiring code changes. Additionally, the script's system information-gathering capabilities are enhanced through Doppler integration, allowing it to check actual system specifications against expected values stored in Doppler, thereby validating that each node meets the required hardware and software prerequisites. This comprehensive integration ensures that cluster testing remains consistent, secure, and adaptable across different environments while maintaining the principle that no sensitive information, whether connection details, authentication credentials, or infrastructure specifications, is ever embedded directly in the codebase.
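A skeleton of that benchmark piece might look like the following. The matrix size and iteration count come out of Doppler rather than being hardcoded; again, the secret names are just my conventions, and this reuses the get_doppler_secrets() helper described above:

```python
import time

import numpy as np
import ray

from run_model import get_doppler_secrets  # assumes the helper lives alongside this script


@ray.remote
def matmul_benchmark(size: int, iterations: int) -> float:
    """Time repeated matrix multiplications on whichever node Ray schedules this task."""
    a = np.random.rand(size, size)
    b = np.random.rand(size, size)
    start = time.perf_counter()
    for _ in range(iterations):
        a @ b
    return time.perf_counter() - start


def main() -> None:
    secrets = get_doppler_secrets()
    ray.init(address=secrets["RAY_ADDRESS"])

    # Benchmark parameters live in Doppler, so they can be tuned without code changes
    size = int(secrets.get("BENCH_MATRIX_SIZE", 2048))
    iterations = int(secrets.get("BENCH_ITERATIONS", 10))

    # Submit one task per node; Ray decides where each one actually runs
    futures = [matmul_benchmark.remote(size, iterations) for _ in ray.nodes()]
    for elapsed in ray.get(futures):
        print(f"{elapsed:.2f}s for {iterations} x {size}x{size} matmuls")

    # Dump the cluster's aggregate resources for a quick sanity check
    print(ray.cluster_resources())


if __name__ == "__main__":
    main()
```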
While my Mac Mini-based distributed AI system offers an accessible entry point into distributed computing for AI workloads, it's important to understand its limitations. vLLM was originally designed for CUDA GPUs found in NVIDIA hardware, not Apple Silicon, which necessitates our adapted approach using the Transformers library and PyTorch's MPS backend. This means we can implement the core concepts of distributed inference, but some of vLLM's advanced optimizations for memory efficiency and throughput aren't directly transferable to our setup.
Memory constraints represent another significant consideration when working with Mac Minis. Even with 32GB of unified memory, these machines have substantially less RAM than dedicated AI servers, limiting the size of models you can run effectively. For optimal performance, focus on models in the 1-7B parameter range and consider implementing techniques like quantization to reduce memory requirements. Models like TinyLlama, Phi-2, and smaller Mistral variants work particularly well in this environment.
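Note that the mature int8/int4 quantization stacks are mostly CUDA-centric, so on Apple Silicon the simplest win is loading weights in half precision, which roughly halves the memory footprint compared to float32. A quick illustrative example:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Loading a small model in float16 roughly halves its memory footprint versus float32
model_name = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float16)
model.to("mps" if torch.backends.mps.is_available() else "cpu")
```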
Network performance between your Mac Minis can become a bottleneck, especially when transferring large model weights or processing numerous inference requests. To mitigate this, ensure your machines are connected via high-speed Ethernet rather than Wi-Fi, and consider optimizing your network configuration for low latency. The physical proximity of your machines also matters: Keeping them on the same local network minimizes latency and maximizes throughput.
Temperature management deserves attention as well. Mac Minis are compact devices with limited cooling capacity, and AI workloads can generate significant heat over extended periods. Monitor system temperatures during operation, ensure adequate ventilation around your devices, and consider implementing cooling breaks for particularly intensive workloads. This is especially important if you're running your cluster in a warm environment or pushing the hardware to its limits with continuous operation.
Despite these limitations, the Mac Mini distributed system offers a compelling balance of performance, energy efficiency, and cost-effectiveness for many AI applications. By understanding these constraints and designing your workloads accordingly, you can build a productive environment for development, testing, and even certain production scenarios that don't require the raw power of dedicated GPU servers.
Hopefully, this guide has helped you successfully set up a distributed AI system using two Mac Minis, leveraging Ray for distributed computing and adapting vLLM concepts for Apple Silicon. This setup provides a cost-effective way to run AI workloads across multiple machines, making it possible to work with larger models and achieve higher throughput than would be possible on a single Mac Mini.
While this setup doesn't match the raw performance of high-end GPU servers, it offers an energy-efficient and affordable alternative for development, testing, and smaller production workloads. The infrastructure-as-code approach using Terraform and bash scripts makes it easy to reproduce and scale your setup as needed.
Next Steps
By following this guide, you've taken the first step toward building a powerful, distributed AI system using consumer hardware. Happy computing!