Would you rather observe an eclipse through a pair of new Ray-Bans, or a used Shade 12 welding helmet? Undoubtedly the Aviators are more fashionable, but the permanent retinal damage sucks. Fetch the trusty welding helmet.

We've made a number of security choices when building Canary that have held us in pretty good stead. These choices are interesting in that they don't involve the purchase of security products, they don't get much discussion in security engineering threads, and they verge on being unfashionable. One major unsexy architectural choice has proved itself: complete customer isolation.

Background

Fundamentally, Canary relies on two components: the Canary devices (hardware or virtual) that are deployed in customer infrastructure, and the Console (which we run) that these Canaries report into. Very broadly, this is identical to most cloud-managed device or appliance products: appliances send telemetry to the cloud. It's typical for cloud-managed devices to report to a single endpoint (e.g. one HTTPS service), or perhaps a region-specific endpoint. [1] In those products, devices are managed via a website that is multi-tenanted (i.e. the same management site is shared by multiple customers). This comes with multiple operational and cost benefits, and is a natural choice.

Except, we don't make that choice. Every Canary customer has their own tenant, the Console.

[Each customer gets their own Console]

Why? Well, that's the point of this post. Canary is assessed by external security testers on a regular basis. It's always fun watching security testers' reactions when they're told that Canary customers aren't colocated. Faces fall, furious notes are taken, and follow-up questions come thick and fast to try to figure out new (meaningful) angles of attack.

I used to do the same kind of security consulting, and unauthorized access to other users' data was an issue we reliably found when testing multi-tenant applications (especially web applications). There are so many avenues to explore: insecure direct object references (e.g. you can access anyone's bank statement by iterating through the numeric statement ID parameter in the URL https://vulnerablebank.com/account/statement?id=4728309); cross-site scripting bugs abusing in-app messaging between users; server-side bugs which let the attacker gain local access to the server and thereby view all data; query injection bugs (such as SQLi) which let the attacker extract data directly out of the DB without further checks; and many more. A root cause of the attacks we found was that unrelated organisations' data was colocated.
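To make that first bug class concrete, here's a minimal sketch of the vulnerable bank-statement endpoint described above. It's ours, not code from any real product, and it assumes a Flask-style app with an in-memory stand-in for the shared database:

```python
# Illustrative only: the classic insecure direct object reference (IDOR).
# The routes and data are hypothetical stand-ins for the example in the text.
from flask import Flask, abort, request, session

app = Flask(__name__)
app.secret_key = "dev-only"

# A stand-in for the shared, multi-tenant statement store.
STATEMENTS = {4728309: {"owner_id": 17, "body": "balance: $1,234.56"}}

@app.route("/account/statement")
def statement_vulnerable():
    # BUG: trusts the client-supplied ID outright. Any logged-in user can
    # iterate ?id=1,2,3... and read every other customer's statements.
    statement = STATEMENTS.get(request.args.get("id", type=int))
    return statement["body"] if statement else abort(404)

@app.route("/account/statement-fixed")
def statement_fixed():
    statement = STATEMENTS.get(request.args.get("id", type=int))
    if statement is None:
        abort(404)
    # The fix is an ownership check on *every* retrieval. Miss it once,
    # anywhere in the codebase, and tenant data leaks.
    if statement["owner_id"] != session.get("user_id"):
        abort(403)
    return statement["body"]
```

The fix is a one-line check, which is exactly the problem: in a multi-tenant application, that check (or its equivalent) has to be present and correct on every data access, forever.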
When we designed Canary's architecture, we explicitly wanted to keep customer data separated by a strong boundary, not just application-level permissions checks. We picked a model that gave us complete isolation between customers: they wouldn't share services, datastores, or resources like IP addresses. This came with drawbacks on the operational side (and influenced the product design), but we've been extremely happy with the trade. In this post, we'll explore the model more, and point to clear examples of where it worked for us. Let's start with a short explanation of what services we actually run.

Service Requirements

In delivering Thinkst Canary to our customers, we have several services that must be run. The Console is a website run on uWSGI and Nginx; there's honeypot configuration and alert data that must be persistently stored and retrieved quickly; there are services to communicate with the deployed devices; there are Canarytoken services which serve HTTP and DNS traffic on a separate IP from the management interface; and there's a separate service to ship alerts via Syslog for customers who opt for it. All combined, these make up the Thinkst Canary product.

Our architecture: VM isolation

All the aforementioned Canary Console services for a single customer are contained within their own AWS EC2 instance. In other words, we run a VM per customer.

In some circles, this approach isn't considered sexy. There are no containers, no serverless functions, no cloud databases, no hyperscale support, no message buses (well… there are, but they exist on the instance only), no load balancers, no k8s clusters. Nothing here will earn us a speaking invite to CNCF events, or an architecture diagram full of AWS service icons, and we happily accept this trade-off.

The thing is, cloud architectures can become caricatures of themselves. AWS' reference design for hosting WordPress inside AWS looks like this:

[All this, and still no one reads my blog]

Broadly speaking, the main beneficiaries of cloud-native architectures are the developers and operators of a service, not their customers. Your customers don't care whether your database is hosted on RDS, or Aurora, or if it's MySQL running on a 1U in a colocation data centre, so long as they can access their data reliably, safely, and quickly. Cloud-native tech makes it easier for developers and operators to build such systems, but customers (on average) don't have requirements for whether cloud tech is used or not.

[This is what customers care about: can I log in and see my data?]

Choosing cloud-native technologies and approaches comes with its own baggage. For us, the primary issue is that the security boundary separating customers from each other becomes an application issue, and that is too risky. As a security vendor, a breach of customer data is a nightmare scenario for us.

Consider, for example, a standard deployment model where all customer data resides in a single database or a small number of databases (a relational DB, a document database, or similar). The boundaries between that data would either be enforced by the database through different users and roles, or (more typically) by the querying application applying authorisation checks after data has been retrieved from the database, or as part of the query the application supplies to the database. If the application is responsible for maintaining the boundary, then any vulnerability which allows an attacker to bypass the authorisation check violates the boundary. Every query, every data retrieval is a potential source of vulnerabilities, and must be carefully guarded. No mistakes. This is true whenever the data is colocated, or users rely on the same endpoints and web addresses to access their data. It's frighteningly common for these issues to occur.
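Here's what "the application is the boundary" looks like in a shared-database deployment: a hand-rolled sketch with hypothetical table and column names, not anyone's production schema:

```python
# Illustrative only: in a shared database, every query is a load-bearing
# part of the tenant boundary. Schema and data are hypothetical.
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE alerts (id INTEGER, customer_id INTEGER, detail TEXT)")
db.executemany("INSERT INTO alerts VALUES (?, ?, ?)",
               [(1, 100, "canary-a triggered"), (2, 200, "canary-b triggered")])

def alerts_scoped(customer_id):
    # Correct, but only because this query remembered the WHERE clause.
    return db.execute(
        "SELECT detail FROM alerts WHERE customer_id = ?", (customer_id,)
    ).fetchall()

def alerts_unscoped(customer_id):
    # One forgotten filter (or one injection bug in query construction)
    # and customer 100 is reading customer 200's alerts.
    return db.execute("SELECT detail FROM alerts").fetchall()
```

In the VM-per-customer model there is no second tenant in the database at all: even the unscoped query above would return one customer's own data, because that is all the datastore holds.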
Reworking our web backend to rely on Lambdas would be a terrible approach for several reasons, and it also ignores the interrelationship between the other services (such as device communications). Likewise, AWS IoT is a non-starter for managing our devices; we operate in networks where outbound MQTT and HTTPS are simply not allowed (which is why we rely on encrypted DNS traffic for device-to-Console communication).

In other words, piecing together the same service from the Lego blocks of AWS services would result in a more cumbersome and less customer-focused product. Instead, if we take on the responsibility of building those blocks ourselves, we can run a service that fits together beautifully, like an intricate custom puzzle.

The isolated VM approach has drawbacks, which we'll touch on shortly, but it has more benefits.

Benefits

Outsourcing the hardest problem

In going with this battle-tested approach, we're cleanly relying on the AWS hypervisor to maintain a silicon-assisted security boundary. Arguably, AWS' entire computing business rests on the strength of their hypervisor. [2] Bugs in their hypervisor that yield cross-customer access are an existential threat to AWS, and they have a deeply vested interest in maintaining that boundary. They have proved themselves excellent at running and maintaining this hypervisor over a long period of time. [3] And in the extreme event that a 0day hypervisor bug is used to attack a Canary customer, even then a) the attacker would first need to colocate their attack VM on the same physical host as our customer's VM, and b) only one customer would be affected; the splash zone is naturally limited.

Performance and monitoring

Running all services related to a single customer on the same instance makes certain kinds of monitoring and investigation much simpler. We don't need to collect performance data from multiple systems to understand what one customer is experiencing; it's already all in one place. We run a custom DNS server, and at times debugging involves packet captures to understand issues; it's extremely helpful to know that traffic at the Console is related to the customer issue being investigated.

The inverse is also true. Isolating a customer's services to a VM means any spike in usage or service degradation is constrained to just that Console. Faults occur; we see EC2 hardware issues on a regular basis, and when they occur it's a single Console that's affected (and automatically restored), not the entire service for all customers. In a hypothetical scenario where all our services run in, say, a Kubernetes cluster, all our eggs are in the k8s basket. A k8s failure takes out all customers [4]; in the VM model, you have to reach "AWS in several regions fails simultaneously" to find the single point of failure.

Isolated VMs are just a step away from the original horizontal scaling (i.e. more physical servers). Our scaling model when adding more customers is to duplicate the infrastructure for them, not to squeeze them in alongside everyone else. That means we don't focus on hyper-optimising a handful of services that all customers wind up using. Our Consoles have small compute needs and in most cases run "small" EC2 instances. Mean load is usually something like 0.04, and the 99th percentile load is 0.5.

Regulatory compliance

Per-customer VMs mean we can easily meet regulatory burdens; some customers need to keep their data in one geographic region for compliance reasons. It's trivial for us to handle these requests, and we can currently add entire new regions to our supported list in the space of hours. If, instead, we relied on shared infrastructure, then new regions would be big and expensive additions.

Staged rollouts

This approach also means we can roll out code and features to subsets of customers trivially, before making them generally available. New code is deployed in a staggered manner across the fleet and, as a side-effect, rollouts are paused if we see errors before the code makes it to all customers.
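The wave-based deployment logic is simple enough to sketch. This is a toy version of the idea, not our actual tooling; `deploy` and `healthy` stand in for real per-Console deployment and health checks:

```python
# A toy sketch of staggered rollouts: exponentially growing waves,
# halting on the first wave that shows errors.
def rollout(consoles, deploy, healthy, first_wave=5):
    """Deploy to the fleet in waves; return the failing wave, or None."""
    remaining, wave_size = list(consoles), first_wave
    while remaining:
        wave, remaining = remaining[:wave_size], remaining[wave_size:]
        for console in wave:
            deploy(console)
        if not all(healthy(console) for console in wave):
            # Errors surfaced before the code reached the whole fleet:
            # stop here, with most customers still on the old version.
            return wave
        wave_size *= 2  # each wave doubles until the fleet is covered
    return None
```

With a few thousand Consoles and a first wave of five, full coverage takes nine or ten doublings, and a bad build is caught while the blast radius is still a handful of customers.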
Operational security

We can also implement IP filtering at the security-group level for customers, should they require it. This level of isolation is simply not possible on shared infrastructure.

In building the application, our user model is simplified because we don't need to take into account an organisation boundary between data sets. We have users with different permission levels, but they all belong to the same organisation. This makes authorisation code easier to reason about, and helps speed up development.

False Benefits

One supposed argument in favour of the VM model is that it decreases your dependency on any single cloud provider, and makes it easier to switch because, theoretically, you can simply run your VM elsewhere. While it's true that we're not fully dependent on AWS for the compute environment (because we supply our own VM), we're still reliant on other AWS services, especially around monitoring, orchestration, and network services. Switching to another provider would be non-trivial, and I don't see the VM as a real benefit in this regard. The barrier to switching is still incredibly high.

Drawbacks

The flipside of the coin is that we incur a greater operational cost, in terms of both effort and spend. Maintaining thousands of instances requires us to be proficient in configuration management. We need to think both in terms of configuration at scale (managing thousands of instances) and very local issues (a recent example is Ubuntu changing the behaviour of /tmp permissions, necessitating customisations to /etc/sysctl.conf; in a container world, someone else would likely have handled that). Our Linux sysadmin skills have stood us in good stead; without decent sysadmin skills this path would have been a tricky one to pursue.

I've not even touched on the impact that isolated VMs have on product design, but suffice it to say it's deeply built into Canary; when we ship devices (hardware or otherwise), they need a path to discover their Console, which is a whole separate topic.

Custom monitoring

While AWS provides instance checks to track instance health, we've had to build all sorts of monitoring, because the built-in AWS monitoring tools weren't sufficient for what we needed. Consider that we have thousands of instances (and IPs), with corresponding DNS entries and multiple services (speaking both DNS and HTTP). We want to know within minutes if any of them become unresponsive, and we had to build that tooling ourselves. On-instance monitoring is also performed, to understand whether we're approaching a failure mode on a Console.
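For a sense of the external probing involved, here is a minimal sweep in the spirit of (but far simpler than) what such tooling needs to do. The hostname is hypothetical, and a real prober would also query each Console's DNS service directly:

```python
# Illustrative only: concurrently probe each Console's DNS entry and
# HTTPS service, and report the unresponsive ones.
import concurrent.futures
import socket
import urllib.request

def probe(hostname):
    """Return (hostname, ok) after a DNS resolution and HTTPS check."""
    try:
        socket.getaddrinfo(hostname, 443)  # does the name still resolve?
        urllib.request.urlopen(f"https://{hostname}/", timeout=5)
        return hostname, True
    except OSError:  # covers gaierror, URLError, and timeouts
        return hostname, False

def sweep(fleet):
    # Thousands of instances: probe concurrently so a full sweep still
    # fits inside the "know within minutes" budget.
    with concurrent.futures.ThreadPoolExecutor(max_workers=64) as pool:
        return sorted(host for host, ok in pool.map(probe, fleet) if not ok)

# e.g. unresponsive = sweep(["example-customer.example.net", ...])
```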
Hardware reliance and limitations

I mentioned above that hardware fault isolation is a benefit, since a fault affects only the one Console on the broken hardware. The consequence is that that one customer is offline until our tooling automatically restarts the VM, which typically happens within minutes of the failure.

While we've not come close to approaching any real instance size limit, there is a theoretical limit in having a single instance per customer: the largest available EC2 instance. It's not a concern for us; we're not yet at the halfway point in the EC2 instance size progression even for our most demanding customers (and instance sizes double at each step).

Slower rollouts

Rolling out code across thousands of instances usually takes a couple of hours, because we perform the rollout in progressively larger sets. If we simply ran a handful of massive servers or clusters, code rollouts would likely be faster.

Cost

Lastly, this approach is almost certainly more expensive. Our instances sit idle for the most part, and we pay EC2 a pretty penny for the privilege. This is likely the biggest drawback for most organisations considering this approach. (It's worth noting that we bootstrapped Thinkst and never took external funding. While this approach is more costly than the traditional model, it's certainly not prohibitive, because the cost scales linearly as you gain customers. If you're in the red and hoping to improve operating margins by cramming more customers onto the same compute, this approach isn't for you. It won't work with negative profit margins, as every new customer scales the cost linearly.)

With all these drawbacks, the benefits still far outweigh them in our view. We can point to concrete moments when the isolated VM model has proved its mettle.

Case Studies

Case Study 1: Debug console

In our early days we had an incident on a Console, in which a developer left a debugger exposed to the web. If customer data had been present on that Console, we'd have had to engage external incident response and publish breach announcements. Instead, the isolated VM (which was the developer's VM) meant there was no customer data on that VM at all. After detecting the incident we simply burned that infrastructure to the ground, and built in controls to ensure the same bug could never leave us open again.
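We won't detail the actual controls here, but a fail-closed startup check is one illustrative example of the genre (hypothetical, and assuming a Flask-style app object purely for the sketch):

```python
# Illustrative only: refuse to boot if anything debug-like survived into
# production. A crashed deploy is loud and harmless; an exposed debugger
# is quiet and catastrophic.
import os
from flask import Flask

app = Flask(__name__)

def assert_production_safe(app):
    if os.environ.get("DEPLOY_ENV") != "production":
        return  # permissive outside production
    if app.debug or app.config.get("DEBUG"):
        raise RuntimeError("refusing to start: debug mode enabled in production")

assert_production_safe(app)
```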
Case Study 2: Web application bugs

Through the years of security assessments we've undergone, the assessors have found several common bug classes. But they've never gained access to another customer's data. We prefer crystal-box testing and hand over source code during tests; even after extremely clever exploits with full source code access, it's impossible to read data that simply isn't there.

Case Study 3: osslsigncode

Earlier this year we found out that, despite paying for security patches in Ubuntu, we still had a vulnerable version of a package called osslsigncode installed, due to Canonical refusing to upgrade the package in contravention of their advertising copy. The situation is now resolved, but while the vulnerable version was present we had to conduct a risk assessment to decide whether to ship our own version of the package, or resolve the underlying issue by upgrading to an Ubuntu version where the fix was already provided.

If we relied on shared infrastructure, we'd have had to scramble to resolve the situation, because of the risk that one customer exploits the vulnerability and gains access to the host that also handles another customer's data. However, because of the isolated VM model, the risk assessment had a different outcome. We reasoned that an attacker would need to be authenticated; that we don't have open registration; that the attacker would need to craft a custom exploit (no public exploit was or is available for this issue); and that, if they succeeded, they'd see only their own instance and their own data. The upside for the attacker would be: they can access their own data. This heavily discounts the value of the attack, and so the isolated VM gave us breathing room to resolve the issue in a deeper way.

Should you default to isolated VMs?

The obvious answer is… there is no obvious answer. We've thought about the tradeoffs, and for our risk profile the AWS hypervisor is the boundary we're most happy with. Our customers are more interested in having their data remain their own than in whether we're running Canary on various pieces of cloud tech. Returning to our opening question, we pick the unfashionable welding helmet over the Aviators for eclipse viewing, every time.

We know this introduces a cost: product-wise, operationally, and in terms of compute expense. We're happy to eat it because it lets us sleep better at night.

The tradeoff depends on whether you can afford the large slack in paying for mostly idle systems, have the sysadmin capacity to run thousands of instances, and are willing to reimplement tools specifically for your own infrastructure. The upside is a security boundary that is very, very hard. We're happy on this path.

[1] A cluster might service the endpoints, but embedded in the appliance or device is a single or limited set of URLs.
[2] Even services which abstract away the idea of EC2 instances, like Lambdas, still run on VMs underneath and rely on the hypervisor.
[3] When hypervisor bugs are published, AWS is usually given advance notice in order to remediate the issue across all of AWS.
[4] Perhaps it's a mistyped administration command, or a configuration error, or an upgrade that didn't go according to plan. A Kubernetes failure would affect all customers at once.