Kubernetes monitoring eases migration, security at scale
Whether making their first move to Kubernetes or staying ahead of security threats in a massive container infrastructure, some IT pros at large companies have used a novel take on monitoring to manage the shift to cloud-native microservices.
Enterprises have a plethora of Kubernetes monitoring tools to choose from, such as application performance monitoring and AIOps. But IT pros at video hosting company JW Player and online retail service provider Shopify chose Kubernetes monitoring tools that use extended Berkeley Packet Filter (eBPF), an embedded Linux kernel utility.
The successor to BPF (a decades-old mechanism that runs a mini-VM inside the Linux kernel to filter network packets), eBPF has grown popular in the last four years alongside Kubernetes. Tools that use eBPF can tap into every system call between containers and hosts without changes to the Linux kernel, and provide detailed data on performance and security operations in lieu of custom instrumentation.
Products from Sysdig and its open source project Falco added support for eBPF in 2019, and can observe system and network calls with minimal interference to running infrastructure, users say.
“[Falco is] great for security because it gives us such detailed visibility, but it doesn’t hog a lot of system resources or introduce a lot of lag when processing those calls,” said Shane Lawrence, senior infrastructure engineer in cloud security at Shopify, in an online interview at KubeCon EU Virtual last month. “It can be set up as read-only, so we don’t need to worry about it interfering with any of the system calls it’s monitoring, and the rest of the application runs in user space, reducing its attack surface.”
Kubernetes monitoring ensures performance amid migration
At JW Player, Kubernetes monitoring with Sysdig’s eBPF instrumentation proved crucial to migrating a large set of monolithic apps to Kubernetes microservices with minimal performance disruption.
The company hosts and distributes video content for tens of thousands of online media entities and serves videos to 1 billion unique devices worldwide every month. Its petabyte-scale infrastructure comprised hundreds of AWS EC2 instances in early 2019, when teams began to break down those apps into microservices to run in a 100-node Kubernetes environment.
This was a huge undertaking, not only in scale, but also in sensitivity — the company must meet an SLA of 99.99% infrastructure availability, even while navigating complex app conversions. JW Player engineers used Sysdig to pick apart the multiple network paths handled by each monolith that would be separated into individual microservices in Kubernetes, while ensuring that they continued to perform well.
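To put that SLA in perspective, the short Python sketch below converts an availability percentage into a downtime budget. The function name and the 30-day and 365-day periods are illustrative assumptions, not figures from JW Player.

```python
def downtime_budget_minutes(availability_pct: float, period_days: float) -> float:
    """Return the minutes of downtime allowed in a period at a given availability."""
    total_minutes = period_days * 24 * 60
    return total_minutes * (1 - availability_pct / 100)


# Hypothetical periods chosen for illustration only.
monthly = downtime_budget_minutes(99.99, 30)
yearly = downtime_budget_minutes(99.99, 365)
print(f"{monthly:.2f} min per 30 days, {yearly:.2f} min per year")
# prints "4.32 min per 30 days, 52.56 min per year"
```

At 99.99%, the budget works out to a little over four minutes in a 30-day month, which leaves almost no room for a slow rollback during a migration.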
“We could get that level of visibility with Sysdig immediately, so we could either roll back or roll forward,” said Kamil Sindi, CTO at JW Player, which is based in New York. “We knew, ‘Was it a TCP connection drop-off, or a load-balancing [issue]?'”
Because Sysdig’s eBPF instrumentation can see all the system calls on Kubernetes nodes, the product interface automatically traces metrics such as query performance in MySQL databases, without custom instrumentation from Sindi’s team, which also saved time during the migration.
Next, JW Player plans to add Sysdig Security, which uses the same eBPF data collection to monitor and enforce compliance and IT security policies. In the meantime, Sindi said he’d like Sysdig to make the tool easier to use for new engineers.
“Because you get so much data, there’s more of a learning curve there” than with other monitoring tools, Sindi said. “[We’d like] to figure out how to make it really easy for a new engineer to dive deep into things and also go back and have a high-level view.”
Sysdig added features on July 27 such as guided onboarding and prepackaged dashboards that are meant to help new users, according to a company spokesperson. The vendor also released a new SaaS-based Essentials tier at that time, with five basic workflows for security, compliance and performance monitoring.
Shopify taps Falco for Kubernetes security monitoring
Shopify had already moved to Google Kubernetes Engine when it began to explore open source Falco in 2018 for security purposes. But with tens of thousands of services spread across more than 50 Kubernetes clusters that serve an average of 170,000 requests per second in Shopify’s environment, the company faced a similarly difficult transition to Kubernetes security.
“We couldn’t put an [intrusion detection system] in, normalize it for a week and switch to [intrusion prevention],” Shopify’s Lawrence said in a KubeCon EU Virtual keynote presentation. “With rapid growth and frequent changes, a rule that was a little bit noisy in the beginning would be completely unmanageable within a year.”
Many security features Kubernetes operators now take for granted, such as role-based access control and access to metadata and cloud audit logs, were missing in version 1.7 at that time. To bridge those gaps, the company looked to Falco, which Sysdig donated to open source in 2016 and which the Cloud Native Computing Foundation (CNCF) accepted as a sandbox project in 2018 and later promoted to incubation.
Falco processes system calls at runtime, with the option of instrumentation through eBPF. Unlike Sysdig, which collects such data for both security and performance use, Falco uses that data to create and enforce security and compliance policies.
Falco helps Shopify identify subtle vulnerabilities in its infrastructure, such as the one uncovered when a security researcher gained access to secrets in Shopify’s lower-tier screenshot environment in 2018.
“If we had been running Falco in that Tier 2 environment at the time, it would’ve been possible to detect this unexpected activity,” Lawrence said. “Then we would’ve seen [Falco] moving [the alert] along to Slack … and this alert would tell us exactly which container it was run in, what the IP addresses were and exactly what command the attacker had run.”
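As a rough illustration of the kind of alert Lawrence describes, the Python sketch below packs the container, IP addresses and command into a chat message. The `RuntimeAlert` fields, the rule name and the sample values are hypothetical and do not reflect Falco's actual output schema; the only outside convention assumed is that a Slack incoming webhook accepts a JSON body with a `text` field. Nothing is sent over the network.

```python
import json
from dataclasses import dataclass


@dataclass
class RuntimeAlert:
    """Hypothetical alert record; field names are illustrative, not Falco's."""
    rule: str
    container: str
    source_ip: str
    dest_ip: str
    command: str


def to_chat_payload(alert: RuntimeAlert) -> str:
    """Build a JSON body with a single 'text' field for a chat webhook."""
    text = (
        f"[{alert.rule}] container={alert.container} "
        f"src={alert.source_ip} dst={alert.dest_ip} "
        f"cmd={alert.command!r}"
    )
    return json.dumps({"text": text})


# Sample values are made up; the IPs come from documentation-only ranges.
example = RuntimeAlert(
    rule="Unexpected shell in container",
    container="screenshot-worker-1",
    source_ip="192.0.2.10",
    dest_ip="198.51.100.7",
    command="cat /var/run/secrets/token",
)
payload = to_chat_payload(example)
print(payload)
```

Keeping the payload builder separate from any network call makes it easy to check the message contents before wiring it to a real webhook.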
Since the company rolled out Falco, upstream Kubernetes security has improved, and prevention should remain the top priority for IT security teams, Lawrence said. But IT pros must also continue to monitor Kubernetes infrastructures for new threats.
“No matter how good a job we do on [configuration], there’s always going to be the issue that prevention is behind,” he said.
Falco is useful, but it isn’t magic, Lawrence cautioned the KubeCon audience.
“It’s great that we have Kubernetes awareness and we can monitor every [system] call, but that’s useless if we don’t have rules that make use of that information,” he said. “All this flexibility doesn’t mean anything if you don’t use it to tell Falco what is normal in your environment.”
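The idea of telling a tool what is normal can be sketched in a few lines of Python. This is not Falco's rule language or engine; the `ExecEvent` type, the image names and the baseline below are all hypothetical, and the sketch only shows the general pattern of flagging activity that falls outside a declared baseline.

```python
from typing import Dict, List, NamedTuple, Set


class ExecEvent(NamedTuple):
    """Hypothetical record of a process started inside a container."""
    image: str
    process: str


# Assumed baseline: which processes are expected for each container image.
BASELINE: Dict[str, Set[str]] = {
    "web-frontend": {"nginx"},
    "screenshot-worker": {"node", "chromium"},
}


def unexpected(events: List[ExecEvent], baseline: Dict[str, Set[str]]) -> List[ExecEvent]:
    """Return events whose process is not in the baseline for its image.

    Images with no baseline entry are flagged too, since nothing is known
    to be normal for them.
    """
    flagged = []
    for event in events:
        allowed = baseline.get(event.image)
        if allowed is None or event.process not in allowed:
            flagged.append(event)
    return flagged


observed = [
    ExecEvent("web-frontend", "nginx"),
    ExecEvent("screenshot-worker", "bash"),
    ExecEvent("unknown-image", "sh"),
]
alerts = unexpected(observed, BASELINE)
print(alerts)
```

A baseline that is too loose misses real intrusions and one that is too tight produces the noise Lawrence warns about, which is why the rules need tuning to each environment.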
Falco is still an incubating project, in version 0.25. Lawrence said in the virtual interview that he’d like to see separation between Falco functions that monitor system calls and those that process data against its rules engine.
“That’s planned for the 1.0 release, but I don’t know when that will be,” he said. “I am looking forward to the additional compartmentalization, since I think it will allow for more flexible scaling of performance on really large and busy nodes.”