"The Evolution of a Planetary-scale Distributed Database" by Kevin Scaldeferri (Strange Loop 2022)

Learn how New Relic evolved its globally distributed time series database from a monolithic architecture to a scalable, high-performance system capable of handling exabytes of data and serving sub-second queries.

Key takeaways
  • NRDB is a globally distributed, mostly immutable time series database for schemaless telemetry data. We want the system to be very scalable and very high performance, with sub-second query response times.
  • To support that, we don't require people to predefine any schemas for their data: we want customers to onboard very quickly, and we want it to be really, really easy for them to start sending us data (a minimal sketch of a schemaless event follows this list).
  • At the start we were running in our own data centers. Docker had not yet been released, so we were running JVMs directly on a big pile of dedicated hardware: a single cluster in our data center in the Chicago area.
  • Being the biggest cluster in the world of something is not actually a place you want to be in. You run into problems that no one else has run into, and you have to solve them pretty much without any help from the rest of the community.
  • We don't want to keep vertically scaling our clusters; you'd rather scale horizontally with more instances of the service.
  • Around 2014 to 2015 we evolved into a multi-cluster architecture with the same basic shape: data comes in and is put into a storage layer, and queries come in and go through a query routing system.
  • Assigning customers to clusters and trying to solve that bin packing problem is really, really hard. The realization was that it is a single multidimensional bin packing problem, covering everything from ingest volume and storage up to query performance at the 99th percentile: some people send us lots of data and never really query it, and some people query heavily (a greedy placement sketch follows this list).
  • What we thought at the time for the EU region was that we wanted to really wall off that data. That did introduce a couple of problems, though: if you created new accounts homed in the EU region, you could no longer correlate that data with data in the US or see its history. You can go to the two different versions of the website, but you can't get that data together, so that's not ideal.
  • We want one place where you can look at all of that data together; we don't want to wall off segments of your data.
  • We have a very skewed distribution across the various data streams that we're ingesting, and when you're using Kafka, how you're going to partition the data matters: if the partitioning scheme fails you, you've impacted all of your customers rather than just a subset.
  • This was probably around version 0.7 when we first started using Kafka and switched our ingest pipeline over to it. Hashing keys gives you a nice even distribution, but unfortunately there are a couple of reasons why that didn't work out for us. Our workaround is that we look at the largest accounts and route them explicitly, making sure they are evenly distributed across partitions, and we shuffle things as necessary to keep an even workload across the workers; everything else can go to the hash-based distribution (a partitioning sketch follows this list).
  • An analytics query can be ten times as expensive as everything else, so instead of the fixed numbers of query workers we had back in the single-cluster days, we want different pools of query workers. They have different CPU and memory allocations and different GC profiles and tuning, and queries are routed to the right pool (a routing sketch follows this list).
  • In the large, there's a daily pattern to the query workload, and it's a workload that's fairly easy for us to predict, so we can scale the query tier up and down while keeping the hot storage tier available.
  • The vast majority of queries are for the last hour or the last day of data, so we keep that data set in the hot tier, which limits what we're storing there.
  • We were getting really sick of managing our own hardware; the ops teams were really not having a good time, and getting paged at two in the morning is really not fun.
  • So we got started, as most people do, with a lift and shift architecture. In order to handle failures and still have all of the data available with that architecture, we had to build the system to keep multiple copies of the data on local storage, which would be very inefficient in the cloud; we wouldn't get any of the gains of the move.
  • We needed to think about a different approach to how we were going to continue to scale: migrate to a cell architecture for our deployments, and at the same time migrate to using S3 the way that S3 wants to be used.
  • S3 is a great durable place to store data, and now that we're in the cloud we're much closer to S3 and transfer costs aren't a problem. We're going to be pulling this stuff out of S3 anyway, so the local storage on the query workers is just a cache or a performance optimization (a cache-aside sketch follows this list).
  • This also meant that we could switch our query tier to ephemeral instances and storage, instances that aren't going to last very long, because S3 is a storage tier that can be shared across all of them and we don't have to move data around.
  • We massively reduced the amount of local storage in the system and brought costs down to about 5% of what they had been, with no impact to time to glass (the time from when data is sent by a customer to when they're able to view it in queries). Queries are very fast: the median response is about 50 milliseconds when looking at the last hour or the last day.
  • On the query side, we learned that we don't want to be coupling query routing to the ingest partitioning rules. The query routers only use metadata to decide how to execute a query, and they can find the data wherever it happens to be.
  • We also made another small change to decouple the processes doing that work, moving it outside of the query path.
  • Because we're in the cloud, we could now deploy all of our systems, including our stateful systems, with the same orchestration and have much more standardization across the company even for these stateful systems. This helps us operationally. The one thing that didn't work out was trying to use StatefulSets; they seem appealing, but they weren't a good fit for this system.
  • Cellular architecture is a great approach for trying to cap the size of any individual cluster, and we want to keep loads balanced across the cells. The realization was that the cell needed to be our entire ingest processing pipeline, terminating in the query layer, not just a storage cluster.
  • The end state is essentially a whole bunch of copies of the NRDB picture, with everything queryable in one place, which had always been one of our goals in this system. We're very much back to the original conceptual architecture, except that everything is one level of abstraction higher; it's a sort of fun, fractal property of the system.
  • Our customers' data is not one of these commodity things: it isn't interchangeable with any other data, it's like their collectibles, and they really, really care about it.
  • By and large, customers don't care about your architecture; almost all customers don't care about this. You may have a small number of customers who do take an interest, but try not to expose any of this stuff to them if you don't have to: they don't have to know about all of those different clusters and all of those different cells.
  • Today there are tens of thousands of JVM instances involved, running across cloud providers, and we talk about exabytes of data.
  • Recapping the lessons and advice: we need to think about what the system would have looked like if we had built it in the cloud from the start and with the cell architecture from the start, treating cells and containers the way we do now.
  • That's kind of where we are today, along with a couple of future steps and directions that we're planning on.
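
The schemaless-ingest point above ("we don't require people to predefine any schemas") is easiest to see with a tiny example. The sketch below is a minimal Java illustration, not code from the talk: an event is just an account, an event type, a timestamp, and whatever attributes the customer happens to send. The Event record and the attribute names are invented for illustration.

```java
import java.time.Instant;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch: a schemaless telemetry event is just a timestamp, an
// account, an event type, and whatever attributes the sender includes.
// Nothing is declared up front; attribute types come from the values.
public class SchemalessEventExample {

    record Event(long accountId, String eventType, Instant timestamp,
                 Map<String, Object> attributes) {}

    public static void main(String[] args) {
        // Two events of the same type with different attribute sets --
        // fine, because no schema was predefined.
        Map<String, Object> first = new LinkedHashMap<>();
        first.put("duration.ms", 12.5);        // numeric attribute
        first.put("http.statusCode", 200);     // another numeric attribute
        first.put("host", "web-01");           // string attribute

        Map<String, Object> second = new LinkedHashMap<>();
        second.put("duration.ms", 48.0);
        second.put("errorMessage", "timeout"); // attribute the first event never had

        Event e1 = new Event(1L, "Transaction", Instant.now(), first);
        Event e2 = new Event(1L, "Transaction", Instant.now(), second);
        System.out.println(e1);
        System.out.println(e2);
    }
}
```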
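
For the bin packing takeaway, here is a hedged sketch of one greedy heuristic for multidimensional placement: put each account on the cell whose worst-loaded dimension stays lowest after adding it. The dimensions (ingest rate, storage, query rate), the capacity numbers, and the heuristic itself are assumptions for illustration, not the placement logic the talk describes.

```java
import java.util.Comparator;
import java.util.List;

// Hypothetical sketch of multidimensional account placement across cells.
// The dimensions, the capacities, and the greedy rule are assumptions made
// for illustration only.
public class CellPlacementSketch {

    // Per-account demand along each dimension.
    record AccountDemand(String accountId, double ingestMBps, double storageGB,
                         double queriesPerSec) {}

    static class Cell {
        final String name;
        double ingestMBps, storageGB, queriesPerSec;  // current load
        final double maxIngest, maxStorage, maxQps;   // capacity caps

        Cell(String name, double maxIngest, double maxStorage, double maxQps) {
            this.name = name;
            this.maxIngest = maxIngest;
            this.maxStorage = maxStorage;
            this.maxQps = maxQps;
        }

        // Utilization of the worst dimension if this account were added.
        double loadIfAdded(AccountDemand a) {
            return Math.max((ingestMBps + a.ingestMBps()) / maxIngest,
                   Math.max((storageGB + a.storageGB()) / maxStorage,
                            (queriesPerSec + a.queriesPerSec()) / maxQps));
        }

        void add(AccountDemand a) {
            ingestMBps += a.ingestMBps();
            storageGB += a.storageGB();
            queriesPerSec += a.queriesPerSec();
        }
    }

    // Greedy: choose the cell whose worst dimension stays lowest.
    static Cell place(AccountDemand account, List<Cell> cells) {
        Cell best = cells.stream()
                .min(Comparator.comparingDouble((Cell c) -> c.loadIfAdded(account)))
                .orElseThrow();
        best.add(account);
        return best;
    }

    public static void main(String[] args) {
        List<Cell> cells = List.of(new Cell("cell-a", 500, 200_000, 2_000),
                                   new Cell("cell-b", 500, 200_000, 2_000));

        // One account that ingests a lot but rarely queries,
        // one that queries heavily over very little data.
        Cell c1 = place(new AccountDemand("acct-1", 300, 150_000, 50), cells);
        Cell c2 = place(new AccountDemand("acct-2", 10, 2_000, 1_500), cells);
        System.out.println("acct-1 -> " + c1.name + ", acct-2 -> " + c2.name);
    }
}
```

A real placer also has to rebalance as accounts grow, which is part of why keeping load balanced across cells is called out as its own takeaway.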
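
The partitioning workaround (spread the largest accounts explicitly, hash everything else) can be sketched as below. This is a plain-Java illustration of the idea rather than an actual Kafka Partitioner implementation, and the partition counts and the registry of large accounts are made up.

```java
import java.util.Map;

// Hypothetical sketch of the routing idea: a handful of very large accounts
// are assigned explicitly to reserved partitions so their volume stays evenly
// spread, and every other account falls back to a hash-based distribution
// over the remaining partitions. All numbers here are invented.
public class IngestPartitionerSketch {

    private final int totalPartitions;
    private final int reservedForLargeAccounts;
    // largest accounts -> the reserved partition each one was assigned to
    private final Map<Long, Integer> largeAccountPartitions;

    IngestPartitionerSketch(int totalPartitions, int reservedForLargeAccounts,
                            Map<Long, Integer> largeAccountPartitions) {
        this.totalPartitions = totalPartitions;
        this.reservedForLargeAccounts = reservedForLargeAccounts;
        this.largeAccountPartitions = largeAccountPartitions;
    }

    int partitionFor(long accountId) {
        // Largest accounts get their explicitly assigned partition.
        Integer pinned = largeAccountPartitions.get(accountId);
        if (pinned != null) {
            return pinned;
        }
        // Everyone else hashes into the non-reserved partitions.
        int hashSpace = totalPartitions - reservedForLargeAccounts;
        int bucket = Math.floorMod(Long.hashCode(accountId), hashSpace);
        return reservedForLargeAccounts + bucket;
    }

    public static void main(String[] args) {
        // Pretend accounts 42 and 99 are the two biggest senders.
        IngestPartitionerSketch router =
                new IngestPartitionerSketch(12, 2, Map.of(42L, 0, 99L, 1));

        System.out.println("account 42 -> partition " + router.partitionFor(42L));
        System.out.println("account 99 -> partition " + router.partitionFor(99L));
        System.out.println("account 7  -> partition " + router.partitionFor(7L));
    }
}
```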
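
For the query-routing takeaways (metadata-only routing decisions and separate worker pools for expensive analytics queries), here is a rough sketch. The pool names, the fake cell lookup, and the cost threshold are assumptions for illustration; the talk does not spell out the real routing rules at this level of detail.

```java
import java.time.Duration;

// Hypothetical sketch of metadata-only query routing: look at which cell the
// account lives in and how expensive the query looks, then pick a worker
// pool. Pool names, the cell lookup, and the thresholds are invented.
public class QueryRouterSketch {

    enum WorkerPool { STANDARD, ANALYTICS }

    record QueryMetadata(long accountId, Duration timeWindow, boolean heavyAggregation) {}

    // The account -> cell mapping would come from a metadata service; faked here.
    static String cellFor(long accountId) {
        return "cell-" + Math.floorMod(Long.hashCode(accountId), 4);
    }

    static WorkerPool poolFor(QueryMetadata q) {
        // Long windows or heavy aggregations behave like analytics queries,
        // which can be an order of magnitude more expensive, so they go to
        // workers with different CPU/memory sizing and GC tuning.
        boolean expensive = q.heavyAggregation()
                || q.timeWindow().compareTo(Duration.ofDays(1)) > 0;
        return expensive ? WorkerPool.ANALYTICS : WorkerPool.STANDARD;
    }

    public static void main(String[] args) {
        QueryMetadata lastHour = new QueryMetadata(42L, Duration.ofHours(1), false);
        QueryMetadata monthlyRollup = new QueryMetadata(42L, Duration.ofDays(30), true);

        System.out.println(cellFor(42L) + " / " + poolFor(lastHour));       // STANDARD
        System.out.println(cellFor(42L) + " / " + poolFor(monthlyRollup));  // ANALYTICS
    }
}
```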
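
Finally, "local storage on the query workers is just a cache, S3 is the durable store" corresponds to a cache-aside read path. The sketch below uses a made-up ObjectStore interface in place of a real S3 client so that it stays self-contained; the key layout and cache location are invented.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical cache-aside read path: durable object storage (behind a
// made-up ObjectStore interface standing in for S3) is the source of truth,
// and local disk on the query worker is only a cache that can be lost along
// with the ephemeral instance.
public class SegmentCacheSketch {

    interface ObjectStore {                        // stand-in for an S3 client
        byte[] get(String key) throws IOException;
    }

    private final ObjectStore durableStore;
    private final Path cacheDir;

    SegmentCacheSketch(ObjectStore durableStore, Path cacheDir) {
        this.durableStore = durableStore;
        this.cacheDir = cacheDir;
    }

    byte[] readSegment(String segmentKey) throws IOException {
        Path cached = cacheDir.resolve(segmentKey.replace('/', '_'));
        if (Files.exists(cached)) {
            return Files.readAllBytes(cached);      // hit: serve from local disk
        }
        byte[] data = durableStore.get(segmentKey); // miss: pull from durable storage
        Files.write(cached, data);                  // populate the cache for next time
        return data;
    }

    public static void main(String[] args) throws IOException {
        Path dir = Files.createTempDirectory("segment-cache");
        // Fake object store so the sketch runs without any cloud credentials.
        ObjectStore fake = key -> ("segment-bytes-for:" + key).getBytes();

        SegmentCacheSketch cache = new SegmentCacheSketch(fake, dir);
        System.out.println(new String(cache.readSegment("account-42/2022-09/part-0001")));
        // Second read is served from the local cache.
        System.out.println(new String(cache.readSegment("account-42/2022-09/part-0001")));
    }
}
```

Because the durable copy always lives in the object store, losing a worker or scaling the pool down only costs cache warm-up time, which is what makes the ephemeral query tier described in the takeaways workable.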