When The Cloud Isn't Enough
Why in-house AI chip development is becoming a competitive necessity
The Magnificent Seven tech stocks have outperformed the market largely because of AI infrastructure investment. But there’s a less obvious story beneath those earnings numbers: the companies building their own custom chips are pulling away from those relying on off-the-shelf solutions.
This isn’t about being early to AI. It’s about who controls the entire stack from silicon to software. And it’s about to become a defining competitive advantage that’s very expensive to replicate.
The Nvidia dependency problem
Every company doing AI at scale is dependent on Nvidia GPUs. This includes OpenAI, Meta, Microsoft, Amazon, Google, and hundreds of smaller companies. Nvidia has 90%+ market share in AI training chips.
That dependency is fine when supply is abundant and pricing is reasonable. It’s a problem when GPUs are constrained, when lead times extend to months, and when Nvidia’s pricing power increases.
The companies that saw this coming - Google with TPUs, Amazon with Trainium, Meta with their custom ASIC work - have optionality. They can use Nvidia when it makes sense and their own chips when it doesn’t. Companies that didn’t invest in custom silicon are stuck waiting in Nvidia’s order queue.
The China-specific calculation
Volkswagen’s development of its own AI chips for the China market through its partnership with Xpeng isn’t just about technology capabilities. It’s about supply chain resilience in a world where chip export controls are tightening.
US restrictions on advanced chip exports to China create uncertainty for any company relying on American semiconductors for Chinese operations. Developing or licensing China-based chip solutions provides insulation from export control risk.
This is a preview of what might happen more broadly if geopolitical tensions increase. Companies will diversify chip sources not just for technical reasons, but for political risk management.
The economic model shift
Building custom chips requires massive upfront investment. Google spent billions developing TPU infrastructure. The business case only makes sense at enormous scale.
But the scale threshold at which custom chips become economically viable keeps dropping. Five years ago, maybe only Google and Amazon had sufficient scale. Today, Microsoft, Meta, Apple, and several other companies clear the threshold. In five more years, dozens more might.
As AI becomes more central to core products, companies reach a point where paying Nvidia’s markup doesn’t make sense compared to investing in custom silicon. The calculation isn’t just about cost per chip - it’s about cost per AI operation, and optimizing for your specific workload.
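To make that framing concrete, here is a minimal break-even sketch in Python. Every figure in it, the per-million-inference costs, the program cost, the amortization period, is an assumption chosen for illustration rather than real pricing; the point is the shape of the calculation, not the specific numbers.

```python
# Illustrative sketch only: every figure below is an assumption, not real pricing.
# It estimates the annual inference volume at which an amortized custom-chip
# program undercuts renting general-purpose GPUs; in other words, it frames the
# decision as cost per AI operation rather than cost per chip.

# Assumed cost to serve one million inferences of a given model.
gpu_cost_per_million = 5_000.0      # assumed: rented general-purpose GPUs
custom_cost_per_million = 2_000.0   # assumed: workload-tuned custom silicon

# Assumed custom-chip program cost, amortized over an assumed 4-year life.
program_cost = 2e9                  # assumed total program cost in dollars
amortized_per_year = program_cost / 4

# Break-even: the volume at which per-operation savings pay for the amortization.
savings_per_million = gpu_cost_per_million - custom_cost_per_million
break_even_inferences = amortized_per_year / savings_per_million * 1_000_000

print(f"Break-even volume: ~{break_even_inferences:.2e} inferences per year")
# ~1.7e11 per year under these assumptions. Below that volume, paying the GPU
# markup is cheaper; above it, the custom program wins, which is why the
# threshold keeps capturing more companies as AI usage grows.
```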
Why general-purpose GPUs are expensive
Nvidia’s chips are designed to be general-purpose. They need to handle gaming, cryptocurrency mining, AI training, AI inference, scientific computing, and other workloads. That versatility comes at a cost.
If you know exactly what workloads you’re running - for instance, you’re only doing large language model inference - you can design chips optimized specifically for that. You remove capabilities you don’t need. You add capabilities that matter for your use case. The result is often better price-performance.
This is why Google’s TPUs excel at certain AI workloads despite having less theoretical peak performance than Nvidia GPUs. They’re optimized for Google’s actual usage patterns rather than general-purpose computing.
The talent barrier
Building custom chips isn’t just expensive in money. It’s expensive in talent. You need chip designers, verification engineers, software engineers to write compilers and frameworks, and systems engineers to integrate everything.
Companies that started early have built these teams. Companies starting now face a brutal hiring market where experienced chip designers are scarce and expensive. And even once the team exists, lead times for advanced-node production capacity at foundries like TSMC are measured in years.
This creates a “rich get richer” dynamic. Companies with existing chip programs can iterate and improve. Companies without programs face years and billions of dollars just to get to v1.
The software stack problem
Having custom chips is only valuable if you can actually use them. That requires software frameworks, compilers, and tooling. Nvidia’s CUDA ecosystem took nearly two decades to build and is one of their biggest competitive advantages.
Companies building custom chips need to either build equivalent software stacks or ensure compatibility with existing frameworks. That’s non-trivial. Many custom AI chips fail not because the hardware is bad, but because the software ecosystem isn’t mature enough.
This is why some companies are open-sourcing their chip designs or software stacks - they need ecosystem support to be viable. You can have the best chip in the world, but if developers can’t easily use it, adoption won’t happen.
The inference vs. training split
An underappreciated nuance: AI training and AI inference have very different requirements. Training demands massive compute and memory bandwidth and runs in data centers. Inference needs to be fast, energy-efficient, and often runs closer to end users.
Many companies are finding that custom chips make more sense for inference than training. Training can use Nvidia GPUs. Inference can use specialized chips optimized for low latency and efficiency.
This is the approach several cloud providers are taking: offer Nvidia for training, offer custom chips for inference. It reduces total dependency on Nvidia while not requiring custom solutions for every workload.
The data center efficiency angle
Beyond cost per chip, there’s cost per watt. Running massive AI infrastructure requires enormous electricity. Chips that deliver better performance per watt directly translate to lower operating costs at scale.
Google’s TPUs and Amazon’s Trainium chips tout power efficiency as a key advantage. For companies running data centers at the scale of millions of square feet, power efficiency differences compound into hundreds of millions in annual operating cost differences.
This matters more as AI workloads grow. A data center full of AI chips might consume 10-50 megawatts of power. Improving efficiency by 20% saves millions annually in electricity alone.
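As a back-of-the-envelope check on that claim, here is a short sketch; the facility size, utilization, PUE, and electricity price are all assumed values, not figures from any particular operator.

```python
# Back-of-the-envelope sketch of the efficiency claim above. The facility size,
# utilization, PUE, and electricity price are assumed figures, not measurements.

facility_mw = 30.0       # assumed IT load, inside the article's 10-50 MW range
utilization = 0.8        # assumed average utilization of that load
pue = 1.3                # assumed power usage effectiveness (cooling, overhead)
price_per_kwh = 0.08     # assumed industrial electricity price in $/kWh
hours_per_year = 8760

annual_kwh = facility_mw * 1_000 * utilization * pue * hours_per_year
annual_cost = annual_kwh * price_per_kwh
savings_20pct = annual_cost * 0.20   # 20% less energy for the same workload

print(f"Annual electricity cost: ${annual_cost / 1e6:.1f}M")
print(f"Savings from a 20% efficiency gain: ${savings_20pct / 1e6:.1f}M per year")
# Roughly $22M and $4.4M per year under these assumptions, for a single site.
# Operators running many such sites see the differences compound toward the
# hundreds of millions described above.
```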
The Amazon Web Services problem
Amazon is in an interesting position: it needs custom chips to reduce costs across AWS infrastructure, yet Nvidia is also a key partner whose GPUs are sold through AWS.
Amazon can’t fully abandon Nvidia, because customers want access to the latest Nvidia chips. But it also can’t rely entirely on Nvidia, because cloud-service margins get compressed if chip costs stay high.
The solution is a portfolio approach: offer Nvidia for customers who want it, and promote Trainium and Inferentia for customers optimizing for cost. Give customers choice while steering them toward higher-margin custom solutions when possible.
The Microsoft-OpenAI dynamic
Microsoft has invested heavily in OpenAI but also develops custom AI chips. These strategies seem contradictory. Why build chips when your primary AI partner (OpenAI) can train models that you then deploy?
The answer is probably insurance. Microsoft doesn’t control OpenAI. If that relationship changes, or if OpenAI’s costs become unreasonable, Microsoft needs alternatives. Custom chips provide optionality.
It’s also possible Microsoft envisions running multiple LLMs, not just OpenAI’s. Supporting other models or developing their own requires infrastructure that isn’t dependent on a single partner.
What this means for smaller companies
The chip development arms race creates a problem for smaller AI companies. They can’t afford to build custom chips. They’re stuck paying market prices for Nvidia GPUs or cloud compute.
This becomes a competitive disadvantage if larger companies with custom chips can deliver AI capabilities at dramatically lower cost. Price per inference might differ by 5-10x between companies running optimized custom chips and those using cloud-based general-purpose GPUs.
Smaller companies either need to be so much better algorithmically that they overcome the hardware disadvantage, or they need to find niches where custom chips don’t matter. Neither is easy.
The edge computing shift
An emerging factor: edge AI. Running AI models on devices rather than in clouds. This requires completely different chip designs - low power, small form factor, but still capable of running inference workloads.
Apple, Qualcomm, and others are developing chips optimized for edge AI. This might be where the next wave of AI infrastructure competition happens. Not bigger data centers, but smarter devices.
Companies that figure out how to deliver useful AI capabilities on device, without needing cloud connectivity, unlock new use cases and business models. That requires purpose-built chips, not repurposed data center GPUs.
The geopolitical wildcards
Export controls on advanced chips are already limiting what companies can deploy in certain regions. If restrictions tighten further, companies operating globally need chip sources that aren’t subject to US export restrictions.
This creates opportunities for non-US chip designers. If Chinese companies, European companies, or others can provide alternatives to Nvidia that aren’t subject to US restrictions, there’s a ready market.
The chip industry has historically been global. Geopolitics is forcing regionalization. Companies serving global markets need chip strategies that work across different regulatory regimes.
The long-term consolidation
We’re probably headed toward a world where there are two tiers: companies that build custom AI chips and companies that use off-the-shelf solutions. The gap between tiers will widen over time.
Tier one companies will have lower costs, better performance for their specific workloads, and more control over their technology stack. Tier two companies will have higher costs, less optimization, and dependency on chip vendors.
This doesn’t mean tier two companies can’t succeed. But they’ll need other advantages - better algorithms, better data, better products - to offset the infrastructure disadvantage.
Why this matters for marketing
This might read as a pure technology discussion, but it has marketing implications. Companies with better AI infrastructure can deliver better AI-powered products. Better recommendations, better search, better personalization, better customer service.
That product advantage translates to marketing advantage. If your product experience is noticeably better because of superior AI capabilities, marketing becomes easier. If you’re trying to market a product with inferior AI because your infrastructure costs more and performs worse, you’re fighting uphill.
Infrastructure isn’t just about cost efficiency. It’s about enabling product capabilities that competitors can’t match. That’s where AI chip development connects to business outcomes.

