Tuesday, April 15, 2025

How to Set Up a GPU Cluster Using Mac Studios


Ever wondered what it takes to run cutting-edge AI models with billions of parameters right on your desk? I've been experimenting with a cluster of Mac Studios to tackle some of the largest open-source large language models (LLMs) out there, like Llama 3.1 with its 405 billion parameters and DeepSeek R1's massive 671 billion. These models are memory-hungry beasts, and while high-end GPUs like Nvidia's H100 are the go-to for many, they're expensive, power-hungry, and often limited in RAM. Enter Apple Silicon—specifically, Mac Studios with their unified memory architecture, power efficiency, and surprising AI potential. In this deep dive, we'll walk through Alex Ziskind's experiments clustering Mac Studios, leveraging Apple's MLX framework, and pushing the limits of what's possible with consumer-grade hardware. Let's explore how to run AI models locally, efficiently, and without breaking the bank.

Why Apple Silicon for AI?

Before we get into the nitty-gritty, let's talk about why Apple Silicon, particularly the Mac Studio, is a compelling choice for running LLMs. Traditional AI setups lean heavily on GPUs like Nvidia's H100 or the consumer-grade RTX 5090. These are fast at parallel workloads, but they come with downsides: high costs (H100s can run tens of thousands of dollars), significant power consumption, and limited VRAM (32 GB on the RTX 5090, for instance). Finding these GPUs in stock is another headache.

Apple Silicon, found in devices like the Mac Studio, offers a different approach. The M4 Max and M3 Ultra chips feature unified memory, meaning the CPU and GPU share a single pool of RAM—up to 128 GB on the M4 Max and a whopping 512 GB on the latest M3 Ultra Mac Studio. This shared memory is a game-changer for LLMs, which often require massive amounts of RAM to load and run. Apple’s chips are also ultra-efficient, sipping power compared to the watt-guzzling GPUs. A single Mac Studio with 128 GB can handle a 70 billion parameter model with ease, using about 65 GB of memory at 10 tokens per second. Not blazing fast, but it gets the job done.

The catch? A single Mac Studio, even with 512 GB, might not be enough for the largest models like DeepSeek R1 (671 billion parameters) or the newer DeepSeek V3 (750 GB in 8-bit quantization). That’s where clustering comes in—combining multiple Mac Studios to pool their resources and tackle these AI giants.

Setting Up the Mac Studio Cluster

My setup consists of four Mac Studios, each with an M4 Max chip and 128 GB of unified memory, plus an M3 Ultra Mac Studio with 512 GB for the heavy lifting. The goal? Run massive LLMs locally using Apple’s MLX framework and a clustering solution called MLX Distributed. Here’s how it works.

Hardware Breakdown

  • Four M4 Max Mac Studios: Each with 128 GB of unified memory, these are the workhorses of the cluster. They’re power-efficient, using just 48 watts at idle with a few programs open.
  • One M3 Ultra Mac Studio: The “big boy” with 512 GB of unified memory, priced at around $10,000. It’s not cheap, but it’s a fraction of the cost of equivalent Nvidia GPU setups.
  • Networking: All machines are connected via a 10-gigabit Ethernet switch, achieving transfer speeds of 9.4 Gbps. I also tested Thunderbolt 5 networking (up to 65 Gbps), but more on that later.

Software Setup

To make this cluster hum, I used MLX, Apple’s machine learning framework optimized for Apple Silicon. Think of it as Apple’s answer to Nvidia’s CUDA, but tailored for the unified memory architecture. MLX Distributed allows you to spread an LLM across multiple Macs, splitting the model’s layers and memory demands.
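
If MLX is new to you, here's a minimal sketch of what working with it looks like in Python (nothing cluster-specific yet, just the lazy evaluation and unified-memory model that the distributed setup builds on):

```python
# Minimal MLX sketch: arrays live in unified memory, so there is no explicit
# CPU-to-GPU copy step, and computation is lazy until you force evaluation.
import mlx.core as mx

a = mx.random.normal((4096, 4096))   # allocated once in unified memory
b = mx.random.normal((4096, 4096))

c = a @ b          # builds a lazy compute graph; nothing has run yet
mx.eval(c)         # forces evaluation (on the GPU by default)

print(c.shape, c.dtype)
```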

Here’s what you need to set up MLX Distributed:

  1. SSH and Passwordless Login: Ensure all Macs have SSH enabled and are configured for passwordless login. This allows seamless communication between machines.
  2. Conda Environments: Use Conda to create identical Python environments on each Mac. I set up one environment on a single machine, tested it, then copied it to the others. This ensures consistency across the cluster.
  3. Project Folder: Create a project folder with two key files:
    • hosts.json: Lists the hostnames of your Macs (e.g., ams1, ams2, ams3, ams4 for my setup).
    • MLX scripts: These handle model loading and distribution. I've shared a detailed guide and scripts in a GitHub repository (linked in the source video's description).
  4. Model Files: Download your LLM (e.g., DeepSeek Coder V2 Light Instruct or DeepSeek R1) to each machine's Hugging Face cache directory (~/.cache/huggingface/hub). For massive models, use a Thunderbolt 5 SSD to transfer files quickly—420 GB in about two minutes. (A quick preflight check for the cluster is sketched just after this list.)
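
Before launching anything distributed, a quick preflight pass over the cluster saves a lot of head-scratching. The script below is a hedged sketch rather than part of MLX's tooling; it assumes hosts.json is a flat JSON list of hostnames (adapt it to whatever schema your launch scripts expect) and checks that passwordless SSH works and that every machine reports the same Python and MLX versions:

```python
# preflight.py -- sketch: confirm passwordless SSH and matching Python/MLX
# versions on every host listed in hosts.json before launching MLX Distributed.
# Note: non-interactive SSH may not activate your Conda environment; if so,
# point REMOTE_CMD at the environment's python binary explicitly.
import json
import subprocess
import sys

REMOTE_CMD = 'python -c "import sys, mlx.core as mx; print(sys.version.split()[0], mx.__version__)"'

def check_host(host: str) -> str:
    # BatchMode=yes makes ssh fail fast instead of prompting for a password.
    result = subprocess.run(
        ["ssh", "-o", "BatchMode=yes", host, REMOTE_CMD],
        capture_output=True, text=True, timeout=30,
    )
    if result.returncode != 0:
        raise RuntimeError(f"{host}: {result.stderr.strip()}")
    return result.stdout.strip()

def main() -> None:
    with open("hosts.json") as f:
        hosts = json.load(f)   # assumed format: ["ams1", "ams2", "ams3", "ams4"]
    versions = {host: check_host(host) for host in hosts}
    for host, version in versions.items():
        print(f"{host}: {version}")
    if len(set(versions.values())) != 1:
        sys.exit("Hosts disagree on Python/MLX versions -- fix the environments first.")

if __name__ == "__main__":
    main()
```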


Networking: Ethernet vs. Thunderbolt

I tested two networking setups: a 10-gigabit Ethernet switch and Thunderbolt 5 networking. Ethernet is simpler to configure and more stable, delivering 9.4 Gbps consistently. Thunderbolt 5, with speeds up to 65 Gbps, is faster for file transfers but doesn’t significantly boost MLX Distributed performance. Why? MLX Distributed keeps a full copy of the model on each machine, so the network is mainly used for communication, not large data transfers. For most users, Ethernet is the better choice—easier to set up and just as effective for clustering.
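
If you want to verify what your link actually delivers before blaming the software, a raw TCP throughput test between two of the Macs is enough. The sketch below is a rough stand-in for a proper tool like iperf3 (the port and transfer size are arbitrary); run it in server mode on one machine, then in client mode on another, pointed at the first machine's hostname:

```python
# nettest.py -- rough TCP throughput check between two Macs on the cluster network.
# Usage:  python nettest.py server          (on one Mac)
#         python nettest.py client ams1     (on another Mac)
import socket
import sys
import time

PORT = 5201                       # arbitrary; anything open on both machines works
CHUNK = 4 * 1024 * 1024           # 4 MiB per send
TOTAL = 4 * 1024 * 1024 * 1024    # send roughly 4 GiB in total

def server() -> None:
    with socket.create_server(("", PORT)) as srv:
        conn, _ = srv.accept()
        received, start = 0, time.time()
        while True:
            data = conn.recv(CHUNK)
            if not data:
                break
            received += len(data)
        secs = time.time() - start
        print(f"received {received / 1e9:.1f} GB in {secs:.1f}s "
              f"= {received * 8 / secs / 1e9:.2f} Gbps")

def client(host: str) -> None:
    payload = bytes(CHUNK)        # a buffer of zeros is fine for a throughput test
    with socket.create_connection((host, PORT)) as conn:
        sent = 0
        while sent < TOTAL:
            conn.sendall(payload)
            sent += len(payload)

if __name__ == "__main__":
    server() if sys.argv[1] == "server" else client(sys.argv[2])
```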

Testing the Cluster: Small Models First

To get a feel for the cluster’s performance, I started with a smaller model: DeepSeek Coder V2 Light Instruct (4-bit quantized, MLX format). This model is lightweight enough to run on a single Mac but perfect for testing clustering.
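
For the single-machine baseline, the mlx-lm Python API keeps this to a few lines. The model ID below is an assumed mlx-community 4-bit conversion, so double-check the exact repository name on Hugging Face before running; verbose=True is what prints the tokens-per-second figures quoted below:

```python
# Single-Mac baseline with mlx-lm: load a 4-bit MLX-format model and generate.
from mlx_lm import load, generate

# Assumed model ID -- substitute whichever 4-bit MLX conversion you downloaded.
model, tokenizer = load("mlx-community/DeepSeek-Coder-V2-Lite-Instruct-4bit")

prompt = "Write a JavaScript function that debounces another function."
response = generate(model, tokenizer, prompt=prompt, max_tokens=512, verbose=True)
```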

  • Single M4 Max (128 GB): Loaded the model and ran a task to write a JavaScript function. Result? 168 tokens per second. Not bad for a single machine.
  • Two M4 Max Macs: Added a second Mac to the cluster via MLX Distributed. Performance dropped to 107 tokens per second, with GPU usage at 50–60% on each machine. The overhead of network communication (via MPI, or Message Passing Interface) caused a slight slowdown.
  • Four M4 Max Macs: With all four Macs in the cluster, performance dipped further to 79 tokens per second. GPU usage remained below 100%, indicating MLX Distributed wasn’t fully utilizing the hardware. This scaling inefficiency is a known challenge with distributed computing.

The takeaway? For smaller models, a single Mac Studio is often faster and simpler. Clustering shines when you need to handle models too large for one machine.

Taking on the Big Boys: DeepSeek R1 and V3

Now for the main event: running DeepSeek R1 (671 billion parameters, 420 GB in 4-bit quantization) and DeepSeek V3 (750 GB in 8-bit quantization). These models push the limits of consumer hardware, requiring creative solutions to fit them into memory.

DeepSeek R1 on the Cluster

I first tried DeepSeek R1 on a single M4 Max (128 GB). The memory usage skyrocketed to 120 GB, hit the red zone, and crashed. No surprise—420 GB is far too large for one machine. Next, I tried two M4 Max Macs (256 GB total). Memory usage climbed to 102–107 GB per machine, but the model still failed to load. With four M4 Max Macs (512 GB total), I finally got it running:

  • Performance: 15 tokens per second.
  • Memory Usage: Each machine used about 102 GB, with GPU usage at 40%. The cluster was stable, but the low GPU utilization suggested room for optimization.

For comparison, I ran DeepSeek R1 on the M3 Ultra Mac Studio (512 GB). It loaded the entire 407 GB model into memory with ease, staying in the green zone and delivering 19 tokens per second. The single M3 Ultra outperformed the cluster, highlighting the efficiency of a single high-memory machine for large models.

DeepSeek V3: The 8-Bit Challenge

DeepSeek V3, with its 750 GB 8-bit quantized model, was the ultimate test. My cluster (four M4 Max + one M3 Ultra = 1,024 GB total) should theoretically handle it, but the uneven memory distribution posed a problem. MLX Distributed splits the model equally across machines, meaning each Mac needed to handle 150 GB—too much for the 128 GB M4 Maxes.

To solve this, I dug into the MLX pipeline code and worked with a contact at Apple to tweak the layer distribution. The goal was to assign more layers to the 512 GB M3 Ultra while keeping the M4 Maxes within their memory limits. After days of trial and error, I got a smaller 4-bit model working with a custom split (the layer-assignment arithmetic is sketched after the results below):

  • M3 Ultra: 305 GB of memory used.
  • M4 Maxes: 73–92 GB each.
  • Performance: 16.48 tokens per second.
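
The idea behind the custom split is simple arithmetic: weight each machine's share of the layers by its usable memory instead of dividing them evenly. The sketch below only computes such an assignment; wiring it into MLX's pipeline scripts depends on the version you're running, the hostnames are mine, and the 61-layer count is illustrative (read the real value from the model's config):

```python
# Sketch: assign transformer layers to hosts in proportion to their memory,
# instead of the equal split MLX Distributed uses by default.
def split_layers(num_layers: int, mem_gb: dict[str, int]) -> dict[str, range]:
    total_mem = sum(mem_gb.values())
    assignment, start = {}, 0
    hosts = list(mem_gb)
    for i, host in enumerate(hosts):
        if i == len(hosts) - 1:
            count = num_layers - start     # last host takes the remainder
        else:
            count = round(num_layers * mem_gb[host] / total_mem)
        assignment[host] = range(start, start + count)
        start += count
    return assignment

# One 512 GB M3 Ultra plus four 128 GB M4 Maxes, illustrative 61-layer model.
cluster = {"ultra": 512, "ams1": 128, "ams2": 128, "ams3": 128, "ams4": 128}
for host, layers in split_layers(61, cluster).items():
    print(f"{host}: layers {layers.start}-{layers.stop - 1} ({len(layers)} layers)")
```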

The 8-bit DeepSeek V3 was trickier. Even with a rank file to prioritize the M3 Ultra (assigning it layers 28–60 and 400 GB of memory), memory pressure spiked, and the cluster hit a wall. The issue? MPI’s random rank assignment didn’t guarantee the M3 Ultra got the primary load, despite my efforts to hardcode it. After weeks of debugging, I couldn’t fully crack it.

However, a friend with two 512 GB M3 Ultra Mac Studios ran the 8-bit DeepSeek V3 successfully, achieving 12 tokens per second. The even memory split (512 GB + 512 GB) avoided the distribution issues I faced, proving that identical machines are currently the best bet for MLX Distributed.

Comparing Clustering vs. Single Machine

So, what’s the verdict? Should you cluster Mac Studios or go for a single high-memory machine? Here’s a breakdown:

Clustering Pros

  • Scalability: Combine multiple Macs to handle models too large for one machine (e.g., DeepSeek V3).
  • Cost Flexibility: Four 128 GB Mac Studios are expensive but still cheaper than multiple Nvidia H100s.
  • Power Efficiency: The cluster used 150 watts under load, compared to 251 watts for the M3 Ultra alone.

Clustering Cons

  • Complexity: Setting up MLX Distributed, SSH, and identical environments is time-consuming and error-prone.
  • Scaling Inefficiency: Adding more machines often reduces tokens per second due to network and MPI overhead.
  • Uneven Memory: Models like DeepSeek V3 require custom tweaks to handle machines with different memory capacities.

Single Machine Pros

  • Simplicity: One M3 Ultra Mac Studio (512 GB) ran DeepSeek R1 at 19 tokens per second with zero setup hassle.
  • Efficiency: Higher GPU utilization and no network overhead make single machines faster for most tasks.
  • Future-Proofing: 512 GB of unified memory can handle most current LLMs and likely future ones.

Single Machine Cons

  • Cost: $10,000 for a 512 GB Mac Studio is steep, though still competitive with GPU setups.
  • Limits: Even 512 GB can’t handle the largest 8-bit models alone (e.g., DeepSeek V3).

Practical Recommendations

Based on my experiments, here’s how to approach running LLMs on Apple Silicon:

  1. For Smaller Models (30–70 Billion Parameters):
    • A single M4 Max Mac Studio (128 GB) is ideal. It handles 4-bit to 8-bit quantized models like Llama 3.3 (70 billion) at 10–13 tokens per second.
    • No clustering needed—keep it simple and fast.
  2. For Large Models (400–671 Billion Parameters):
    • A 512 GB M3 Ultra Mac Studio is the sweet spot. It ran DeepSeek R1 (671 billion) at 19 tokens per second with room to spare.
    • If you need more memory, cluster identical machines (e.g., two 512 GB Mac Studios) for stability and ease of setup.
  3. For Massive Models (750 GB+):
    • Clustering is your only option with current hardware. Use identical machines to avoid memory distribution issues.
    • Stay tuned for MLX updates—Apple and the open-source community are actively improving dynamic load balancing.
  4. Networking Tips:
    • Stick with a 10-gigabit Ethernet switch for clustering. It’s reliable and sufficient for MLX Distributed.
    • Reserve Thunderbolt 5 for transferring large model files between machines or to an SSD.
  5. Software Tips:
    • Use MLX for the best performance on Apple Silicon. It’s faster than alternatives like LM Studio (168 vs. 115 tokens per second for DeepSeek Coder V2).
    • Check out my GitHub repository (linked in the video description) for detailed MLX Distributed setup scripts and guides.
    • Monitor GPU and memory usage with tools like mactop to optimize performance (a small snippet for logging MLX's own memory counters follows this list).
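
Beyond mactop, you can instrument your own scripts with MLX's built-in memory counters. The exact namespace has moved between releases (older versions expose these under mx.metal, newer ones at the top level of mx), so treat the names below as version-dependent, and the model ID as an assumption:

```python
# Sketch: print MLX's own memory counters after a generation run.
import mlx.core as mx
from mlx_lm import load, generate

# Assumed model ID -- use whichever MLX-format model is already in your cache.
model, tokenizer = load("mlx-community/DeepSeek-Coder-V2-Lite-Instruct-4bit")
generate(model, tokenizer, prompt="Hello", max_tokens=64)

# In newer MLX releases these may be mx.get_active_memory() / mx.get_peak_memory().
print(f"active memory: {mx.metal.get_active_memory() / 1e9:.2f} GB")
print(f"peak memory:   {mx.metal.get_peak_memory() / 1e9:.2f} GB")
```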

Power Efficiency: A Hidden Win

One standout feature of Apple Silicon is its power efficiency. My four M4 Max Mac Studios consumed just 48 watts at idle and 150 watts under load. The M3 Ultra, while higher at 251 watts, is still a fraction of what a multi-GPU setup demands. For home or small office setups, this means lower electricity bills and less heat—no need for industrial cooling systems. If you’re running AI models 24/7, these savings add up.

Challenges and Future Outlook

Clustering Mac Studios isn’t without its hurdles. MLX Distributed’s reliance on MPI introduces overhead, reducing performance as more machines are added. Uneven memory distribution remains a pain point for mixed setups like mine. However, the MLX team and Apple are making strides. Recent updates to MLX-LM (a tool for managing MLX models) include better dynamic load balancing, which could soon make mixed-memory clusters more viable.

On the hardware front, Apple’s unified memory architecture is a boon for AI workloads. As LLMs grow larger, we may see Mac Studios with even more memory—perhaps 1 TB in future M-series chips. For now, the 512 GB M3 Ultra is a powerhouse, and clustering remains a practical way to push beyond its limits.

Why Run LLMs Locally?

You might be wondering: why go through all this trouble to run LLMs locally? Cloud services like AWS or Google Cloud offer powerful GPU clusters, but they come with recurring costs, data privacy concerns, and dependency on internet connectivity. Running models locally gives you:

  • Control: Full ownership of your data and models.
  • Cost Savings: A one-time hardware investment vs. ongoing cloud fees.
  • Offline Access: Perfect for secure or remote environments.
  • Customization: Tweak models and quantization to suit your needs.

For developers, researchers, or hobbyists, a Mac Studio cluster is a gateway to experimenting with state-of-the-art AI without needing a data center.

Final Thoughts

My journey clustering Mac Studios to run massive LLMs has been equal parts challenging and rewarding. A single 512 GB M3 Ultra Mac Studio is a beast for most models, delivering 19 tokens per second on DeepSeek R1 with minimal setup. Clustering four 128 GB M4 Max Macs and the M3 Ultra pushed the boundaries, handling DeepSeek R1 at 15 tokens per second and coming close on DeepSeek V3. While MLX Distributed needs refinement for mixed-memory setups, it’s a powerful tool for Apple Silicon users.

If you’re diving into local AI, start with a single Mac Studio for models up to 70 billion parameters. For larger models, consider a 512 GB machine or a cluster of identical units. Keep an eye on MLX updates and community resources for the latest optimizations. With Apple Silicon, you don’t need a room full of GPUs to run cutting-edge AI—just a few Macs, some patience, and a passion for experimentation.

Source: Alex Ziskind on YouTube
