Compressing GLM & MiniMax to Run on a MacBook
Executive Summary
This case study documents the compression of GLM-4.7, a 358 billion parameter Mixture-of-Experts (MoE) language model, from an unwieldy 717 GB BF16 format to a deployable 92 GB INT4 representation; the same pipeline is planned for MiniMax (see Future Work). By combining REAP expert pruning (from Cerebras Research) with AutoRound quantization, we achieved a 7.8× size reduction that enables inference on consumer hardware while preserving model quality.
"REAP consistently outperforms merging and other pruning methods on generative benchmarks, especially at 50% compression. Notably, our method achieves near-lossless compression on code generation and tool-calling tasks." — Lasby et al., REAP the Experts: Why Pruning Prevails for One-Shot MoE Compression
The compression pipeline required approximately 18 hours on an 8×H200 GPU cluster, costing roughly $1000 USD. The resulting model checkpoints are available on HuggingFace under the 0xSero namespace.
The Problem: Giant Models Can't Run On Consumer Hardware
Large language model development has accelerated dramatically, with frontier models exceeding hundreds of billions of parameters. GLM-4.7 exemplifies this trend:
| Specification | Value |
|---|---|
| Parameters | 358 billion |
| Experts | 160 (sparse activation) |
| Architecture | Mixture-of-Experts (MoE) |
| Full Precision (BF16) | ~717 GB |
| Inference VRAM (FP16) | ~1 TB |
This is a significant deployment challenge. Even with access to expensive GPU clusters, loading the model consumes all available VRAM, leaving no headroom for inference batching, KV cache, or overhead. The practical implications:
- Datacenter access required: No single machine can load the model
- High operational costs: Running on cloud GPUs at scale is expensive
- Limited accessibility: Researchers and small teams cannot experiment
- Environmental impact: Massive compute requirements per inference
Why MoE Models Are Ideal for Compression
Mixture-of-Experts architectures present a unique opportunity for compression. Unlike dense models, where every parameter participates in every forward pass, MoE models route each token through only a subset of experts:
┌──────────────────────────────────────────────────────────────┐
│                     GLM-4.7 Architecture                     │
├──────────────────────────────────────────────────────────────┤
│                         Input token                          │
│                              │                               │
│                              ▼                               │
│    ┌──────────────────────────────────────────────────┐      │
│    │              Router/Gate Mechanism               │      │
│    │         Selects top-k experts per token          │      │
│    └──────────────────────────────────────────────────┘      │
│           │                │                ○                │
│           ▼                ▼                                 │
│    ┌─────────────┐  ┌─────────────┐  ┌─────────────┐         │
│    │  Expert 1   │  │  Expert 2   │  │  Expert 3   │         │
│    │  (active)   │  │  (active)   │  │ (inactive)  │         │
│    └─────────────┘  └─────────────┘  └─────────────┘         │
│           │                │                                 │
│           └───────┬────────┘                                 │
│                   ▼                                          │
│            ┌─────────────┐                                   │
│            │   Output    │                                   │
│            └─────────────┘                                   │
└──────────────────────────────────────────────────────────────┘
Key observations about MoE redundancy:
- Not all experts are equal: Router biases mean some experts are selected far more frequently than others
- Expert specialization: Some experts develop overlapping capabilities, making them partially redundant
- Submodular contribution: The marginal impact of removing an expert diminishes predictably with network depth
- Structural sparsity: The model is already designed for sparse activation
REAP (Router-weighted Expert Activation Pruning) exploits these properties by measuring expert importance on a calibration dataset matched to the target domain, then removing the least critical experts entirely.
The Solution: Two-Stage Compression Pipeline
Stage 1: REAP Expert Pruning
REAP operates in three phases:
- Calibration Forward Pass: Run representative samples through the model, collecting activation statistics for each expert
- Importance Scoring: Compute saliency scores for each expert based on activation norms and gate values
- Expert Removal: Remove the least important experts
The critical insight is that expert importance is task-dependent. For our compression target (code generation + function calling + agentic workflows), we curated a calibration dataset matching these use cases.
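As a rough illustration of the scoring and removal steps (a minimal sketch, not the REAP implementation; tensor names and shapes are assumptions), saliency can be computed as the calibration-set average of each expert's router-weighted output magnitude:

import torch

def expert_saliency(gate_probs: torch.Tensor, expert_out_norms: torch.Tensor) -> torch.Tensor:
    """One router-weighted saliency score per expert.

    gate_probs:       [num_tokens, num_experts] softmax router weights
    expert_out_norms: [num_tokens, num_experts] L2 norm of each expert's output
                      per token (zero where the expert was not routed)
    """
    # Average gate-weighted output magnitude over all calibration tokens.
    return (gate_probs * expert_out_norms).mean(dim=0)

def experts_to_prune(saliency: torch.Tensor, compression_ratio: float) -> torch.Tensor:
    """Indices of the lowest-saliency experts to remove (0.50 -> half of them)."""
    k = int(saliency.numel() * compression_ratio)
    return torch.argsort(saliency)[:k]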
Calibration Dataset Composition
| Dataset | Samples | Purpose |
|---|---|---|
| theblackcat102/evol-codealpaca-v1 | 700 | Code generation |
| Salesforce/xlam-function-calling-60k | 330 | Function calling / tool use |
| SWE-bench/SWE-smith-trajectories | 330 | Agentic multi-turn workflows |
| Total | 1360 | Domain-appropriate calibration |
This mirrors the approach in the original Cerebras REAP paper, which found that calibration domain significantly impacts downstream quality. Our dataset balances the three primary use cases for compressed GLM-4.7.
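For reference, a mix like the one above can be assembled with the datasets library along the following lines (a sketch: the split names, seed, and crude row-to-text flattening are assumptions; the published 0xSero/glm47-reap-calibration-v2 dataset is the authoritative artifact):

import json
from datasets import load_dataset, concatenate_datasets

SEED = 42  # assumed; any fixed seed keeps the sample reproducible

def sample_as_text(name: str, n: int):
    """Take n shuffled rows and flatten each to a single 'text' column so
    sources with different schemas can be concatenated."""
    ds = load_dataset(name, split="train").shuffle(seed=SEED).select(range(n))
    return ds.map(
        lambda row: {"text": json.dumps(row, default=str)},  # crude flattening for illustration
        remove_columns=ds.column_names,
    )

calibration = concatenate_datasets([
    sample_as_text("theblackcat102/evol-codealpaca-v1", 700),     # code generation
    sample_as_text("Salesforce/xlam-function-calling-60k", 330),  # function calling / tool use
    sample_as_text("SWE-bench/SWE-smith-trajectories", 330),      # agentic multi-turn workflows
]).shuffle(seed=SEED)
# calibration.push_to_hub("0xSero/glm47-reap-calibration-v2")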
Command and Configuration
python /mnt/work/reap/src/reap/prune.py \
--model-name /mnt/work/model/GLM-4.7 \
--dataset-name 0xSero/glm47-reap-calibration-v2 \
--compression-ratio 0.50 \
--seed 42 \
--distance_measure angular
Key parameters:
- --compression-ratio: Fraction of experts to prune (0.50 removes half of the experts)
- --seed: Reproducible randomness for observation caching
- --distance_measure: Angular distance between expert output vectors (illustrated below)
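For intuition, angular distance is the arccos of the cosine similarity between two expert output vectors, normalized to [0, 1]; a minimal sketch (the exact formulation inside the REAP code may differ):

import torch

def angular_distance(a: torch.Tensor, b: torch.Tensor, eps: float = 1e-7) -> torch.Tensor:
    """0 = same direction, 1 = opposite direction."""
    cos = torch.nn.functional.cosine_similarity(a, b, dim=-1)
    return torch.arccos(cos.clamp(-1 + eps, 1 - eps)) / torch.pi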
Runtime Observations
| Phase | Duration |
|---|---|
| Observation collection (1360 samples) | ~12.5 hours |
| Pruning and model repacking | ~1.5 hours |
| Total REAP 50% prune | ~14h |
Once observations are cached at /root/artifacts/GLM-4.7/glm47-reap-calibration-v2/all/observations_1360_angular-seed_42.pt, additional prune ratios (35%, 40%, 45%, 50%) can be computed in minutes rather than hours.
Stage 2: AutoRound INT4 Quantization
Pruning alone reduces model size but not sufficiently for practical deployment. We applied Intel's AutoRound, a GPTQ-compatible quantization method that:
- Compresses weights to 4-bit integers
- Uses group-wise quantization (128 parameters per group)
- Applies asymmetric quantization with GPTQ-style optimization
- Preserves activation precision (W4A16 format)
AutoRound Configuration
auto-round \
--model /path/to/pruned-model \
--bits 4 \
--group_size 128 \
--format auto_gptq \
--output_dir /mnt/work/outputs/GLM-4.7-REAP-50-W4A16
The quantization used the same calibration dataset as REAP, ensuring consistent evaluation across both compression stages.
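The same configuration can also be driven from Python through AutoRound's library API; the sketch below is illustrative (the pruned-model path is a placeholder, and wiring in the custom calibration text is omitted):

from transformers import AutoModelForCausalLM, AutoTokenizer
from auto_round import AutoRound

model_path = "/path/to/pruned-model"  # REAP-pruned BF16 checkpoint
model = AutoModelForCausalLM.from_pretrained(model_path, torch_dtype="auto")
tokenizer = AutoTokenizer.from_pretrained(model_path)

# W4A16: 4-bit weights, 128-wide groups, asymmetric; activations stay in 16-bit.
# Calibration data would be supplied via AutoRound's dataset argument (omitted here).
autoround = AutoRound(model, tokenizer, bits=4, group_size=128, sym=False)
autoround.quantize()
autoround.save_quantized("/mnt/work/outputs/GLM-4.7-REAP-50-W4A16", format="auto_gptq")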
Quantization Runtime
| Phase | Duration |
|---|---|
| Per-layer quantization (~90-92s/layer) | ~3-4 hours |
| Model compilation and export | ~10 minutes |
| Total AutoRound pass | ~4h 10m |
Combined Results
| Model | Size | VRAM Required | Compression |
|---|---|---|---|
| GLM-4.7 Original (BF16) | 717 GB | ~1 TB | 1× |
| GLM-4.7-REAP-50 (BF16) | ~359 GB | ~500 GB | 2× |
| GLM-4.7-REAP-50-W4A16 | 92 GB | ~120 GB | 7.8× |
The 92GB final model can run inference on a high-end MacBook or 8× 3090s, making it accessible to independent researchers and small-team deployments, compared with the roughly $300,000 of hardware required for an 8×H200 cluster.
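As an example of what deployment can look like, here is a sketch of serving the W4A16 checkpoint with vLLM (assuming vLLM support for the GLM MoE architecture and a node with roughly 120 GB of aggregate VRAM, e.g. 8× 3090s):

from vllm import LLM, SamplingParams

llm = LLM(
    model="0xSero/GLM-4.7-REAP-50-W4A16",
    quantization="gptq",     # GPTQ-format W4A16 checkpoint
    tensor_parallel_size=8,  # shard the 92 GB of weights across 8 GPUs
)

outputs = llm.generate(
    ["Write a Python function that parses a cron expression."],
    SamplingParams(max_tokens=256, temperature=0.2),
)
print(outputs[0].outputs[0].text)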
Infrastructure and Economics
Compute Environment
All experiments ran on Prime Intellect cloud infrastructure:
| Resource | Specification |
|---|---|
| GPUs | 8× NVIDIA H200 (143GB each, 1.15TB total VRAM) |
| Pricing | ~$13/hour (spot pricing) |
Available Model Checkpoints
All compressed model variants are available on HuggingFace:
| Repository | Description |
|---|---|
| 0xSero/GLM-4.7-REAP-40 | 40% of experts pruned (BF16) |
| 0xSero/GLM-4.7-REAP-40-W4A16 | 40% pruning + 4-bit quantization (~108 GB) |
| 0xSero/GLM-4.7-REAP-50 | 50% of experts pruned (BF16) |
| 0xSero/GLM-4.7-REAP-50-W4A16 | 50% pruning + 4-bit quantization (~92 GB) |
| 0xSero/glm47-reap-calibration-v2 | Calibration dataset |
| 0xSero/glm47-reap-observations | Cached observations |
Recommended Configuration
For most use cases: GLM-4.7-REAP-40-W4A16 offers the best balance of quality and size. With 40% of experts pruned (60% retained), quality degradation is minimal across most benchmarks while delivering a compact 108 GB model.
For extreme VRAM constraints: GLM-4.7-REAP-50-W4A16 at 92GB can run on single high-VRAM GPUs, but expect measurable quality degradation on reasoning-heavy tasks.
Quality Recovery: Optional Distillation
If quantization introduces unacceptable quality degradation, self-distillation with LoRA can recover performance:
- Generate synthetic data using the BF16 pruned model (teacher) via Magpie-style prompting
- Train LoRA adapter on the quantized model (student) with KL divergence loss against teacher
- Apply adapter during inference to improve output quality
This is the approach used by Apple and Ellora for post-training recovery. The process works by having the larger teacher model generate training examples, which the smaller student model learns from. The adapter captures nuanced outputs that quantization may have flattened.
Most gains come from the first few thousand samples. The distillation curve shows rapidly diminishing returns, meaning 5,000 to 10,000 high-quality samples typically recover most of the lost performance; beyond that, additional samples yield minimal improvement.
The tradeoff is additional training time (typically 2-4 hours on similar hardware) and increased model size from the LoRA adapter (typically 100-500MB depending on rank). However, this is often acceptable given the quality recovery, especially for production deployments.
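A compressed sketch of the student-side update described above, using PEFT for the adapter and a temperature-scaled KL loss against teacher logits (model paths, LoRA rank, target module names, and hyperparameters are illustrative assumptions, and PEFT support for the quantized student is assumed):

import torch
import torch.nn.functional as F
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

TEACHER = "/path/to/pruned-model"          # BF16 REAP-pruned teacher
STUDENT = "0xSero/GLM-4.7-REAP-50-W4A16"   # quantized student

tokenizer = AutoTokenizer.from_pretrained(STUDENT)
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token
teacher = AutoModelForCausalLM.from_pretrained(TEACHER, torch_dtype=torch.bfloat16, device_map="auto").eval()
student = AutoModelForCausalLM.from_pretrained(STUDENT, device_map="auto")

# Attach a small LoRA adapter to the student; only these weights are trained.
# Target module names are assumptions -- adjust to the model's projection layers.
student = get_peft_model(student, LoraConfig(r=16, lora_alpha=32,
                                             target_modules=["q_proj", "k_proj", "v_proj", "o_proj"]))
optimizer = torch.optim.AdamW(student.parameters(), lr=1e-4)

def distill_step(texts, temperature: float = 2.0) -> float:
    batch = tokenizer(texts, return_tensors="pt", padding=True,
                      truncation=True, max_length=2048)
    batch = {k: v.to(next(student.parameters()).device) for k, v in batch.items()}
    with torch.no_grad():
        teacher_logits = teacher(**batch).logits
    student_logits = student(**batch).logits
    # Soften both distributions and penalise divergence from the teacher.
    loss = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(teacher_logits / temperature, dim=-1),
        reduction="batchmean",
    ) * temperature ** 2
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    return loss.item()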
Challenges
1. Cost
Running the full compression pipeline cost approximately $1,000 USD in compute. This includes:
- REAP pruning: ~$600 (14 hours × 8×H200 at ~$13/hour spot pricing)
- AutoRound quantization: ~$200 (4 hours × 8×H200)
- Experimentation and failed runs: ~$200 (additional prune ratios, calibration testing)
This cost is primarily driven by:
- Observation caching overhead: Each new calibration dataset requires 12+ hours of forward passes
- Multi-GPU requirements: MoE models demand distributed inference, limiting cheaper options
- Iterative experimentation: Finding optimal prune ratios and calibration datasets requires repeated runs
At on-demand pricing, the same pipeline would cost roughly $3,000 to $4,000, making spot pricing essential for research budgets.
2. Consequences of Failure
Model compression has meaningful failure modes that waste both time and money:
Irreversible model damage: Once experts are pruned, there is no way to recover them. If quality degradation is unacceptable, the entire 14-hour pruning process must restart from scratch. This creates a high-stakes decision point around choosing compression ratios.
Calibration mismatch: Using inappropriate calibration data can produce models that perform well on benchmarks but fail catastrophically in production. For example, a model calibrated on general text may fail at code generation tasks, even if overall metrics look acceptable.
Quantization artifacts: Aggressive W4A16 quantization can introduce subtle errors that compound over long contexts. The model may perform well on short benchmarks but degrade on multi-turn conversations or long document analysis.
Validation challenges: Full model evaluation requires significant compute (the uncompressed model barely fits on training infrastructure). This means compressed models may deploy with undetected issues that only surface in production.
Recovery cost: If a compressed model fails in production, re-running the full pipeline costs another $1,000 and 18 hours. This creates pressure to deploy marginal models rather than re-compress.
3. Dataset Curation
The quality of compression is directly tied to the calibration dataset. Several challenges emerged:
Domain representation: The calibration data must represent the target use case. Our initial attempts with generic datasets (C4, SlimPajama) produced models that underperformed on code and function calling. We had to curate task-specific datasets:
- Code generation: evol-codealpaca-v1 for programming tasks
- Function calling: xlam-function-calling-60k for tool use
- Agentic workflows: SWE-smith-trajectories for multi-turn reasoning
Sample quality vs. quantity: More samples are not always better. We found that 1,360 high-quality, domain-relevant samples outperformed 10,000 generic samples. The observation file scales linearly with sample count (~50GB for 1,360 samples), so each additional sample increases both computation time and storage requirements.
Data overlap: The calibration dataset should not overlap with evaluation benchmarks. This creates tension between having representative calibration data and avoiding contamination. We manually filtered samples to ensure no benchmark leakage.
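One workable decontamination check, sketched below, drops calibration samples that share long character n-grams with benchmark prompts; the n-gram length and the hypothetical humaneval_prompts list are illustrative, not necessarily what was used here:

def char_ngrams(text: str, n: int = 50) -> set[str]:
    """Character n-grams over whitespace-normalized text, as a cheap fingerprint."""
    text = " ".join(text.lower().split())
    return {text[i:i + n] for i in range(max(len(text) - n + 1, 0))}

def is_contaminated(sample: str, benchmark_prompts: list[str], n: int = 50) -> bool:
    """True if the sample shares any long n-gram with a benchmark prompt."""
    fingerprint = char_ngrams(sample, n)
    return any(fingerprint & char_ngrams(prompt, n) for prompt in benchmark_prompts)

# clean = [s for s in calibration_texts if not is_contaminated(s, humaneval_prompts)]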
Expert bias: Different calibration datasets cause REAP to prune different experts. A code-focused calibration may preserve different experts than a general-purpose calibration. This means production deployment requires matching calibration data to production traffic patterns.
Discussion: What Could Go Wrong
Users of compressed MoE models should understand the limitations and potential failure modes:
Quality Tradeoffs
Expert loss means capability loss: When REAP removes 50% of experts, those experts are gone permanently. The model cannot perform tasks that relied primarily on the pruned experts. For example, if experts specializing in mathematical reasoning are pruned, the model will perform worse on math tasks even if overall benchmark scores look acceptable.
Context window degradation: MoE models use different experts for different types of content. Compression may reduce the model's ability to handle diverse content within a single long context. A document with both code and natural language may see quality degradation on one or both domains.
Nuance and subtlety loss: Quantization to W4A16 reduces precision from 16-bit to 4-bit. This flattens the weight distribution, reducing the model's ability to make fine distinctions. Tasks requiring subtle reasoning (irony detection, emotional nuance, multi-step logical inference) may degrade more than benchmarks suggest.
Performance Variability
Task-dependent quality: The REAP paper demonstrates that compression impact varies significantly by task. Code generation and function calling saw near-lossless compression at 50% pruning, but other tasks (reasoning, knowledge retrieval) may degrade more severely. Users should evaluate compressed models on their specific use case, not generic benchmarks.
Calibration drift: Over time, production traffic may drift from the calibration dataset. As the model encounters out-of-distribution inputs, quality may degrade. This is particularly problematic for specialized domains (legal, medical, scientific) where high-quality calibration data is scarce.
Batch size sensitivity: Compressed models may be more sensitive to batching parameters. The optimal batch size for the uncompressed model may not be optimal for the compressed version, requiring additional tuning.
Operational Risks
Single point of failure: The 92GB model is more deployable but also a single point of failure. If the compressed model has a critical flaw, there is no straightforward fallback to the uncompressed model due to infrastructure constraints.
Reproduction difficulty: Re-running the compression pipeline requires significant compute and the original calibration dataset. If the original observations file is lost, reproducing the exact compressed model is impossible without another 14-hour calibration run.
Upgrade complexity: Upgrading to a new base model version requires re-running the full compression pipeline. This creates friction for staying current with model improvements.
Quantization sensitivity: W4A16 quantization may behave differently across different hardware. The same quantized model may produce slightly different outputs on NVIDIA vs AMD vs Apple Silicon, complicating reproducibility.
When to Avoid Compression
Given these risks, compression may not be appropriate for:
- Safety-critical applications: Medical diagnosis, legal advice, financial planning where errors have real consequences
- High-stakes decisions: Model outputs directly cause significant actions without human oversight
- Regulated industries: Compliance requirements may prohibit modifications to certified models
- Unknown use cases: When deployment patterns don't match available calibration data
For these scenarios, the cost of quality degradation outweighs infrastructure savings.
Expected Losses
For code generation and function calling with 40% of experts pruned + W4A16 quantization:
- Benchmark performance: 0-5% degradation on HumanEval, MBPP
- Nuance: Slight reduction in code style consistency
- Edge cases: 5-10% increase in failure rate on complex multi-step problems
At 50% pruning:
- Benchmark performance: 5-15% degradation
- Nuance: Noticeable reduction in subtle reasoning
- Edge cases: 10-20% increase in failure rate
For general language tasks at 50% pruning:
- Benchmark performance: 10-20% degradation on MMLU, GSM8K
- Nuance: Noticeable reduction in output quality
- Edge cases: Significant increase in hallucinations and errors
These are estimates based on the REAP paper and our evaluation. Actual performance varies by specific task and calibration quality.
Technical Challenges and Solutions
Challenge 1: Calibration Domain Mismatch
Problem: Generic calibration datasets may not capture the expert specializations relevant to our target use cases.
Solution: We selected a domain-specific dataset combining code generation (evol-codealpaca), function calling (xlam), and agentic trajectories (SWE-smith). This mirrors the intended deployment scenarios and ensures REAP scores experts based on relevant performance.
Challenge 2: Multi-GPU Inference Compatibility
Problem: Compressed model checkpoints spanned 8 GPUs, but evaluation frameworks (evalplus) couldn't handle distributed inference.
Solution: This remains a limitation for automated benchmarking; manual evaluation or single-GPU deployment is required for quantitative assessment. Future work involves distributed inference infrastructure for compressed MoE models.
Challenge 3: Spot Instance Reliability
Problem: Long-running jobs (18 hours) face inevitable preemption risk on spot instances.
Solution: Designed pipeline with checkpoint resumability. Observations file (~50GB) cached to persistent storage. Async upload to HuggingFace triggered immediately upon checkpoint generation to minimize lost progress.
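The upload step can be kicked off in the background with huggingface_hub so a preemption costs at most the phase in flight; a sketch (repository and output path taken from above, the run_as_future detail reflects how we would approach it rather than a verbatim excerpt of the pipeline):

from huggingface_hub import HfApi

api = HfApi()

def upload_checkpoint_async(local_dir: str, repo_id: str):
    """Start a non-blocking upload as soon as a checkpoint lands on disk."""
    api.create_repo(repo_id, exist_ok=True)
    return api.upload_folder(
        folder_path=local_dir,
        repo_id=repo_id,
        run_as_future=True,  # returns a Future; the pipeline keeps running
    )

future = upload_checkpoint_async("/mnt/work/outputs/GLM-4.7-REAP-50-W4A16",
                                 "0xSero/GLM-4.7-REAP-50-W4A16")
future.result()  # block only at the very end, before releasing the spot instance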
Lessons Learned
Cache observations strategically: The ~50GB observation file is reusable across multiple prune-ratio experiments. Build it once, then test many configurations rapidly.
40% is the sweet spot: For MoE compression, removing 40-45% of experts provides optimal quality/size tradeoff. 50% is aggressive but useful for extreme constraint scenarios.
Spot pricing changes economics: Compared with on-demand rates (~$40-50/hr), spot pricing drops the full pipeline cost from roughly $3,000-4,000 to about $1,000. Essential for research budget management.
Quantization after pruning compounds gains: W4A16 quantization after pruning achieves better compression than either alone. The weight distribution after pruning appears more amenable to quantization.
Upload before termination: Trigger model uploads as soon as checkpoints are written rather than waiting for the run to finish; losing a finished artifact to preemption is expensive lost time.
Calibration data quality matters more than quantity: 1,360 domain-relevant samples outperformed 10,000 generic samples. Invest in dataset curation.
Future Work
Evaluate compressed variants: Systematic benchmark comparison across MMLU, HumanEval, and agentic benchmark suites
Distillation recovery: Implement LoRA-based self-distillation pipeline for quality recovery on aggressive compression
Additional prune ratios: Complete the 35% and 45% variants for a finer-grained quality/size tradeoff curve
GGUF conversion: Export quantized model to GGUF format for local inference via llama.cpp on consumer hardware
Distributed inference: Infrastructure for compressed MoE distributed inference across multiple consumer GPUs
MiniMax compression: Apply the same pipeline to MiniMax models for comparative analysis
References
- REAP the Experts: Why Pruning Prevails for One-Shot MoE Compression — Lasby et al., Cerebras Research
- AutoRound Quantization — Intel
- Prime Intellect
- GLM-4.7 Model
- HuggingFace Models
This compression work was conducted on Prime Intellect infrastructure. Model checkpoints available under permissive licensing on HuggingFace.