Cloudflare Outage Brought the Internet’s Inherent Fragility into the Limelight

Have you had trouble reaching your favourite websites lately? You aren’t the only one. Web infrastructure provider Cloudflare suffered a massive service disruption on Tuesday that knocked out a wide range of online services, including ChatGPT, X, Canva, and even outage-tracking sites like DownDetector. It is the third major internet outage in the span of a month.

Mehdi Daoudi, CEO of internet performance monitoring platform Catchpoint, called the outage a “wake-up call” for companies. He added, “Everybody’s putting all their eggs in one basket, and then they’re surprised when there is a problem. It’s on the company’s side to make sure that they have redundancy and resiliency.”

For context, the Cloudflare outage came shortly after similar issues hit AWS and Microsoft Azure, taking significant parts of the internet down with them. Like those providers, Cloudflare underpins a substantial portion of the web. Its ‘content delivery network’ helps keep sites running, and the company also provides DDoS attack protection and Domain Name System (DNS) services. In December 2024, the connectivity-cloud company reported that its network supports about 20% of all websites. Its customers include 35% of Fortune 500 companies in addition to ‘millions’ of others.

Cloudflare is known globally for its robust performance and security features. However, its latest outage shows how dependent the web has become on a handful of infrastructure providers. After the Amazon Web Services outage disrupted the secure messaging app Signal, its President, Meredith Whittaker, wrote that the company “didn’t have any other choice but to use a major cloud service provider to run on.” “The entire stack, practically speaking, is owned by three or four players,” she added.

These latest outages show that organizations need better, more robust backup plans, especially when 90% of the internet relies on a few providers. Daoudi told The Verge, “Outages will be here, and they’re just going to keep happening more frequently. The blast radius will keep growing.” He added, “The question is, what are you doing about it?”

Unlike Microsoft Azure and Amazon Web Services, which traced their outages to DNS problems, Cloudflare traced its outage to a single file. Cloudflare spokesperson Jackie Dutton said, “The root cause of the outage was a configuration file that is automatically generated to manage threat traffic. The file grew beyond an expected size of entries and triggered a crash in the software system that handles traffic for a number of Cloudflare’s services.”

It might sound implausible that a problem with one file could interrupt large parts of the internet, but at Cloudflare’s scale it is possible. Rob Lee, chief of AI and research at SANS, told The Verge, “When you operate infrastructure at Cloudflare’s scale, even small deviations can have outsized consequences.” He continued, “These platforms are built for speed, so anything that delays or halts decision making can cascade quickly. In high performance environments, a millisecond delay can become a complete traffic stoppage.”

According to Lee, a configuration file of the kind Cloudflare described “drives routing security policies, load balancing decisions, and how traffic is distributed globally.” If such a file grows too large, “it may result in slower parsing, memory issues, CPU contention, or logic failures within the systems that depend on it.”
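To make the failure mode concrete, here is a minimal Python sketch of one common defence: checking a freshly generated configuration against a size limit before loading it, and falling back to the last known-good version instead of crashing. It is not Cloudflare’s code. The limit `MAX_ENTRIES`, the `LoadResult` type, and the `load_entries` function are all hypothetical names chosen for illustration.

```python
from dataclasses import dataclass
from typing import List

# Hypothetical limit, chosen only for illustration.
MAX_ENTRIES = 200


@dataclass
class LoadResult:
    entries: List[str]    # the entries the system will actually use
    used_fallback: bool   # True if the new file was rejected


def load_entries(candidate: List[str],
                 last_known_good: List[str],
                 max_entries: int = MAX_ENTRIES) -> LoadResult:
    """Accept the new config only if it fits within the expected size.

    An oversized file is rejected, and the previous version is kept,
    rather than letting the consumer fail on unexpected input.
    """
    if len(candidate) > max_entries:
        return LoadResult(entries=last_known_good, used_fallback=True)
    return LoadResult(entries=candidate, used_fallback=False)


if __name__ == "__main__":
    good = ["rule-%d" % i for i in range(60)]
    oversized = good * 5  # 300 entries, above the hypothetical limit
    result = load_entries(oversized, last_known_good=good)
    print(result.used_fallback, len(result.entries))
```

In this sketch the oversized file is simply ignored. A real system would also log and alert on the rejection so that the underlying generation bug gets fixed.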

Similarly, Amazon Web Services blamed ‘faulty automation’ for the chain of issues that led to its recent outage. The company also said that disruptions of this kind are likely to happen again. “Are you going to complain about it every time Cloudflare sneezes?” Daoudi asks. “… or are you going to build around it?”