HubStor's Geoff Bourgeois and Microsoft's Kumail Hussain discuss the new Azure Archive Blob Storage and how cloud archiving is evolving to meet customer requirements for convenient, low-cost storage.
Full Interview Transcript
[00:00:05] Geoff: All right. We're now live. With me today is Kumail who's a senior program manager at Microsoft, responsible for the archive tier that's just gone generally available and we're excited to have him today. Welcome, Kumail.
[00:00:23] Kumail: Thank you, Geoff, for having me today.
[00:00:24] Geoff: Excellent. What I'd like to do is to talk about this new archive tier. We're very excited about it at HubStor. I think customers are going to be very excited about it, but help us understand first and foremost what is it? I don't think a lot of folks yet are aware of this kind of cold storage concept, or this deep archive concept that's in the cloud. Can you enlighten us and tell us what it's all about?
[00:00:45] Kumail: Yeah, definitely. I'm very excited to share with you that today. Archive storage is a new storage tier that we've released for cold data, for our blob storage. We have hot and cool storage today and archive is an even cooler and colder tier for data than our cool tier. The way it provides benefits to customers is that it has an extremely low storage price as you'd expect. However, access price is higher. As you tier your data from hot, cool to archive the storage price does decrease. However, access pricing does go up. It's really meant for data that will be stored for a long period of time, is rarely accessed and can tolerate several hours of retrieval latency.
[00:01:39] Geoff: Okay, that's excellent. Very interesting, this concept of deep archiving in the cloud. I was at a Gartner Infrastructure Operations conference earlier, and the analyst that covers the public cloud storage market said that there's something like 52 exabytes of data in the cloud. He said a very small percentage of this data was actually archive, backup and DR. He said that the cold storage tiers in the cloud are typically used for in-cloud storage tiering.
But I think, based on what we're seeing, there is a very strong appetite in the market to get on-premises workloads archived into the cloud. I think that companies are struggling with, "Well, how do I get it there?" and that whole issue. When you built the archive tier, when you engineered this, the use cases that you were aiming for, was it a mix of on-prem workloads coming in or were you just thinking about in-cloud storage tiering or was it a combination of both?
[00:02:40] Kumail: It was really a combination of both. There is an IDC report that basically suggested in 2025 the global datasphere is expected to grow to more than 160 zettabytes, and there's another study that found that around that time period, 60% of all data will be classified as archive. The need to store data for long periods of time, and the percentage of data that falls within this archive and cold storage category, I think is going to continually increase both on-prem and in the cloud.
I think having such an offering in the cloud has significant benefits because customers today are using things like tape media or other storage to manage all this and it can become very complicated and costly just to manage this. As hardware changes and just the operational complexity that goes along with it, so having a very low-cost option in the cloud I think makes a lot of sense, not only from a financial perspective, but also from just a complexity and operational standpoint.
It removes a lot of that burden for customers as well which I think makes it a great use case for this growing need to store compliance data or just any type of data that you're not really accessing, but does need to be around for several years.
[00:04:26] Geoff: You mentioned compliance data and I think our view of compliance data is going to change drastically in the year ahead with the GDPR coming into play. Are you familiar with the GDPR?
[00:04:37] Kumail: Yes, I am.
[00:04:38] Geoff: Okay, so the GDPR, where any data can be compliance data all of a sudden. I think one of the shifts that we'll see in the marketplace, and let me know if you agree with this or not, but when we think about the cloud for long-term retention, it makes a lot of sense because we have very efficient compute sitting beside it. When you think about customers today, how they've been handling long-term retention, a lot of them are still in the world of, "Well, it's our backup," or you know, it's tape.
If you put yourself in their shoes you are off-siting things maybe to Iron Mountain or something like that. When you have tape archives, it's really a backup and now you're getting hit with five GDPR right-to-be-forgotten requests per day. Tape is very inefficient in terms of, "Well, we've got to do discovery against this essentially." I think that the cloud offers a great advantage in that sense, as a long-term archive, because of the efficiency, the nimbleness that exists. The agility where yes, it's a deep archive and it's very low cost, but we can now easily get at the data. It's right at our fingertips at any given point. Do you see that being a use case for the cold archive tier as a replacement for tape, as a long-term retention mechanism?
[00:06:01] Kumail: Definitely. Yes, I think there's a lot of things that come into play aside from just the cost of managing something like that. One of the things I mentioned earlier was the operational complexity. When you look at tape hardware, it's equipment that has a lot of physical moving parts. Not only do you have to manage the hardware, but there's things breaking down, they have to be serviced. Also you have new types of drives and media coming out that may not be supported.
You have to manage the migration of your data, and then there's things like what you mentioned, where you have GDPR and other compliance requirements. Within the financial industry, for example, you have WORM requirements and things like that. With our broad set of certifications and a lot of the features that we're working on, customers just get a lot of those for free without any additional work on their end, along with a low-cost storage option that meets all those requirements and also provides the durability that they're looking for.
With all of our storage options we have a locally redundant storage option which stores several copies within one data center, but then also a geo-redundant option that automatically replicates that to another region that's a few hundred miles away. Those I think are just some of the examples that a customer just automatically gets without having to do any additional work, which would be difficult to get and maintain on their own.
[00:07:45] Geoff: I agree. Especially when we start talking about the low cost of the archive tier. What many companies will say, what IT folks will say about the cloud, is that it looks attractive, but it's not when you consider the egress cost. A lot of times I hear people talking about the egress cost and I find that it's a lot of fear, uncertainty and doubt that's really being injected into the conversation. If you take something like archiving workloads, where you don't see a lot of retrieval activity, and even if you make a very large estimate of how much egress you're going to see, model those costs and bake that into the all-in figure, the cloud many times will work out to be much more attractive than the all-in cost on-prem.
I think what happens a lot of times when companies compare the cloud cost and the on-prem cost is they aren't doing an apples-to-apples comparison. They get into this apples and oranges where they'll take many of the cost variables on-prem and put them aside because, well, they've already incurred those costs maybe last year or the year before, it's a capital expenditure, and so they'll compare the tip of the iceberg with the cloud's complete iceberg and they go, "Well, the cloud's not that great."
But I think there are some people out there that are really looking at the all-in cost on-prem and they are seeing a major advantage to going into the cloud. When you talk about the low cost of the archive tier can you tell us about the pricing exactly? Just a locally redundant price point in say, the lowest priced Azure region, where does it start?
[00:09:30] Kumail: Sure. There's a few different variables, namely the region and type of redundancy you choose, that do change the price, but our lowest price, which is very exciting, is in where-- we launched archive at an industry-leading price of two-tenths of a cent per gigabyte per month. That essentially means you can store a terabyte for roughly two dollars a month, which is incredible. There's a few different meters involved but they're not new. They're similar to what we have on hot and cool today. Basically, when you write something into archive, the only thing you're paying for is a write transaction cost, which is 50 cents for 10,000 transactions, and the per-GB cost is free.
Storing it is two-tenths of a cent, as I mentioned earlier, for LRS in our lowest-priced regions. Then when you go to retrieve that data, there's two different meters. There's a read transaction cost, which is fairly high. That's $5 for 10,000 transactions, and there's a per-GB cost of two cents per GB that you also pay.
Now, even though the read transaction costs are high, if you're using it in the right way, where it's rarely accessed, large files stored for a longer time, what you'll find is that the read costs are such a small percentage of your total cost of owning this that it's really negligible. Now, if you mis-estimated or the read patterns are excessively higher than what you expected, you could get into a situation where it's actually more expensive than our cool storage. For that reason, we recommend customers try to estimate what their read access pattern is before putting it into archive or, if they don't know, just leaving it in the hot or cool tier and then aging it off to archive after, let's say, 90 days or a longer period, to give them higher confidence that it won't be accessed beyond a certain point.
The other aspect of pricing that customers should be aware of is that there's also an early deletion period of 180 days. Which means that if a customer deletes a blob, or changes the tier of a blob out of archive to hot or cool, before 180 days, they will still be charged for that 180-day amount, which is prorated.
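The meters described in this answer can be put into a rough cost model. The following is a minimal sketch in Python that uses only the figures quoted in the interview (LRS, lowest-priced regions), which should be treated as illustrative rather than as a current price sheet. The function names are hypothetical, and the way the early deletion charge is prorated (billing the unused remainder of the 180 days at the storage rate, with a 30-day month) is an assumption, not something stated in the interview.

```python
# Figures quoted in the interview; illustrative only, not a current price sheet.
STORAGE_PER_GB_MONTH = 0.002   # two-tenths of a cent per GB per month
WRITE_PER_10K = 0.50           # write transactions, per 10,000
READ_PER_10K = 5.00            # read transactions, per 10,000
RETRIEVAL_PER_GB = 0.02        # per-GB charge when reading data back
EARLY_DELETION_DAYS = 180      # minimum retention before the charge stops applying
DAYS_PER_MONTH = 30            # assumption used for proration


def storage_cost(gb, months):
    """Cost of keeping `gb` gigabytes in archive for `months` months."""
    return gb * STORAGE_PER_GB_MONTH * months


def write_cost(num_objects):
    """Cost of writing objects into archive (one transaction per object)."""
    return num_objects / 10_000 * WRITE_PER_10K


def read_cost(num_objects, gb):
    """Cost of reading objects back: transaction meter plus per-GB meter."""
    return num_objects / 10_000 * READ_PER_10K + gb * RETRIEVAL_PER_GB


def early_deletion_charge(gb, days_stored):
    """Assumed proration: pay storage for the days short of the 180-day minimum."""
    remaining_days = max(0, EARLY_DELETION_DAYS - days_stored)
    return gb * STORAGE_PER_GB_MONTH * remaining_days / DAYS_PER_MONTH


if __name__ == "__main__":
    # Hypothetical workload: 10,000 GB held as 1,000 large objects for a year,
    # with 1% of the objects and data read back once.
    total = (
        write_cost(1_000)
        + storage_cost(10_000, 12)
        + read_cost(num_objects=10, gb=100)
    )
    print(f"Estimated yearly cost: ${total:.2f}")
```

Changing the number of objects or the share of data read back shows how quickly the read meters start to matter compared with the storage meter.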
[00:12:18] Geoff: Okay, that makes sense. I'm sure the folks that maybe aren't familiar with the cloud pricing model probably just felt like they were drinking from a fire hose a little bit, but to summarize, to play that back to you, I think what we're saying is, A, you want to make sure that you're putting the right workloads into the archive tier. For example, let's just pause there and chew on that for a second, the right workloads. When you have the write transaction cost, was it 50 cents per 10,000 writes? A bad workload would probably be something like IoT data or e-mail data where you might have billions of objects, right?
A good workload would be something where you have large files. A small number of large files like video and media content or geospatial data or medical images, things like this, right? That's where you're going to get your best bang for your buck.
[00:13:14] Kumail: That's correct, yes. When we say transactions, you can think of tiering one object or one file as a single transaction. The larger those are, the more cost effective it will be for you. Now, what you can do also on your end is you can create one object out of several small files so that it is a larger object. We do allow objects to go as high as five terabytes. You can optimize your cost in that way but yes, what you said is exactly correct.
[00:13:49] Geoff: Okay, so take e-mail for example. We might have something that says, "All right, we're going to write individual messages to hot or cool. We can do nice message-level search and discovery and holds and recalls," but let's say, if we're again dealing with e-mail and we wanted to move it down to the cold tier, we would perhaps package up certain mailboxes into a PST file and write the PST files down, just as an example.
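The idea of creating one object out of several small files can be sketched as follows. This is a minimal illustration in Python, assuming a plain tar container built in memory with the standard library; the interview does not say which container format to use, and the helper names are hypothetical. The write price is the 50 cents per 10,000 transactions quoted in the interview.

```python
import io
import tarfile

WRITE_PER_10K = 0.50  # write transaction price quoted in the interview


def bundle_files(files):
    """Pack a mapping of {name: bytes} into a single tar archive (as bytes)."""
    buffer = io.BytesIO()
    with tarfile.open(fileobj=buffer, mode="w") as tar:
        for name, data in files.items():
            info = tarfile.TarInfo(name=name)
            info.size = len(data)
            tar.addfile(info, io.BytesIO(data))
    return buffer.getvalue()


def unbundle_files(blob):
    """Recover the {name: bytes} mapping from a tar archive produced above."""
    with tarfile.open(fileobj=io.BytesIO(blob), mode="r") as tar:
        return {
            member.name: tar.extractfile(member).read()
            for member in tar.getmembers()
            if member.isfile()
        }


def write_cost(num_objects):
    """Write transaction cost, counting one transaction per object."""
    return num_objects / 10_000 * WRITE_PER_10K


if __name__ == "__main__":
    messages = {"msg-001.eml": b"first message", "msg-002.eml": b"second message"}
    packed = bundle_files(messages)
    assert unbundle_files(packed) == messages

    # One billion individual objects versus the same data in one million bundles.
    print(f"individual: ${write_cost(1_000_000_000):,.2f}")
    print(f"bundled:    ${write_cost(1_000_000):,.2f}")
```

At the quoted price, a billion single-object writes come to $50,000 in write transactions, while the same data packed into a million bundles comes to $50. The trade-off is that reading one small file back means retrieving the whole bundle.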
[00:14:12] Kumail: That's correct.
[00:14:12] Geoff: Containerize the data in some way. That makes sense. But the other element then is when we talk about the early recall or the early deletion from the archives. Your early delete cost variable is new, right? That doesn't exist for hot and cool. That tells us that we want to make sure that it's not just the right data that's going to the archive tier, but it's going there at the right time in its life cycle. In other words, if it's going to be accessed soon, we don't want to be putting it there. We want to make sure that it's cold data. It's rarely accessed data. Things like closed projects, leaver data or ex-employee data, long-term compliance type data that we just know we have to keep for some reason but nobody's touching this stuff.
[00:14:58] Kumail: That's exactly correct, yes. I mean, it's important to know that your data has reached this point where you don't expect to need it, but if you do need it there is a way to get to it. Again, it'll be impossible for customers to know exactly 100%, but you can do the math and even within some range of error you'll find that even if I do have to incur this charge on a small percentage of my data you'll still find that it's worthwhile. It doesn't have to be precise, but you do have to have a good idea of what the retention of that data would be in order to make sure that the archive tier makes sense.
[00:15:45] Geoff: Okay. Yes, that makes sense. All right, tell us a little bit about blob-level tiering. You touched on it a little bit. We didn't say blob-level tiering before but you touched on it when you said moving things from hot or cool down to the archive tier. Now, my understanding is when we write data into a storage account, we can't go directly to the archive tier. It needs to go to hot or cool first, right?
[00:16:07] Kumail: That's correct.
[00:16:07] Geoff: Okay, can you explain blob-level tiering though? I think it's a really cool feature. It sets Microsoft's approach to cold archives storage apart and I think it's very efficient design, but can you enlighten us and tell us about blob-level tiering?
[00:16:21] Kumail: Sure, yes. In the release of archive, this gets overshadowed a little bit, but I think it's very important to point out that this is a very nice feature that I think customers will appreciate. Today, or before we released archive, you had the tier of cool or hot and it was set at the account level. Which meant that when you created a blob storage account you would specify the access tier as hot or cool and then anything you put into that account, every blob had that same tier. Which was great if you wanted all of your data to be in the same tier.
Because then let's say you put it all in hot. If you want to move it to cool you just change the account-level setting. However, if you have some data that needs to be tiered to a different tier while other data needs to remain in a different tier, then you do get into a situation where the only way to do that is to create a new storage account and manage moving the data between those two, which can get complicated. With blob-level tiering, you can now set the tier of an object at the blob level. Which means that when you create an account it will still have an access tier of hot or cool.
But this will be more of the default tier that's applied if you don't specifically and explicitly set the tier at the blob level. Now you can do the same thing, but as let's say 20% of your data needs to move to the cool tier, instead of creating a new account you can just go and set the tier to cool on those individual blobs. The way we've integrated archive makes this feature even more important because instead of just being able to go between hot and cool at the blob level, you now have archive and you have three tiers at your disposal. The great thing about the way we've integrated archive into our current tiered storage offering is that the APIs are 100% consistent.
Any valid archive API is an existing operation that exists today for hot and cool. However, the only new API we've introduced is Set Blob Tier. This is the operation that enables this blob-level tiering functionality. There's no restrictions on which direction you can go. You can go from hot to cool to archive, or archive to cool to hot, or from hot directly into archive, or from archive directly to hot. This really simplifies a lot of things and makes it easy for customers to seamlessly integrate it into their current platform today.
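As a rough illustration of the single new operation described here, the sketch below builds (but does not send) a Set Blob Tier request in Python. It assumes the REST form of the operation, a PUT against the blob URL with `comp=tier` and an `x-ms-access-tier` header; the account, container, and blob names are hypothetical, and the authentication, date, and version headers that a real request needs are deliberately left out.

```python
from urllib.parse import quote

VALID_TIERS = ("Hot", "Cool", "Archive")


def build_set_tier_request(account, container, blob, tier):
    """Describe a Set Blob Tier request for one blob without sending it.

    A real call would also need authentication, date, and version headers,
    which are omitted from this sketch.
    """
    if tier not in VALID_TIERS:
        raise ValueError(f"tier must be one of {VALID_TIERS}, got {tier!r}")
    url = (
        f"https://{account}.blob.core.windows.net/"
        f"{quote(container)}/{quote(blob)}?comp=tier"
    )
    return {"method": "PUT", "url": url, "headers": {"x-ms-access-tier": tier}}


if __name__ == "__main__":
    # Hypothetical names: move one closed-project file straight from hot to archive.
    request = build_set_tier_request(
        "exampleaccount", "projects", "2014/site-survey.mp4", "Archive"
    )
    print(request["method"], request["url"])
    print(request["headers"])
```

The same request shape covers every direction mentioned above, since only the target tier in the header changes. Moving a blob out of archive is still subject to the several hours of retrieval latency mentioned earlier in the conversation.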
[00:19:04] Geoff: Right, so if I don't want to use it, I don't need to use it, but if I do, it's right there. I don't have to migrate data from one storage account to another or anything like that. It's just this in place feature that exists in blob storage accounts.
[00:19:18] Kumail: Correct.
[00:19:19] Geoff: Very cool. Awesome, and you mentioned Azure regions. We talked about this a little bit earlier. Some of the Azure regions and the pricing being different. I think we're up to 42 Azure regions now. My understanding is that the archive tier is available today in West and North Europe, all of the US regions, Korea and a couple of the India regions. I'm getting asked a lot already, is it going to be available when South Africa comes online in 2018? When's it going to be available in Canada? When's it going to be available in the UK region or regions that are in the UK there?
Can you tell us a little bit about the rollout plan to go GA and to the other regions?
[00:20:00] Kumail: Sure. As you mentioned, we do have a global presence in 14 regions today, so this includes all the eight US public regions, North Europe, West Europe, both regions in Korea and then two of the regions in India. I can't share the exact rollout and dates with you. However, I can tell you that we have a very aggressive plan in place to roll out in significantly more regions in a very short period of time. We plan to expand in several regions in 2018 and we'll continue to update our availability in those regions as we get there.
[00:20:44] Geoff: That's fantastic. I know many of our customers are excited to start using it. We plan in the next couple of weeks to start upgrading HubStor tenants and these features will just be on, our customers will come in and it'll be there, and they can start using it, but we have a lot of deals in our pipeline that are very interested in this.
They're talking about very old data on-prem, very large amounts of it, and the archive tier really seems to attract a lot of interest from a lot of people.
It seems to be what's going to push them over the edge to go all in into the cloud in a big way. We're pretty excited about it and I think you guys have done a great job with it. Kumail, this has been a very fun conversation. I appreciate you making the time to do this with me.
[00:21:30] Kumail: Thank you, Geoff. I had a great time talking to you today.
[00:21:33] Geoff: Great. Thank you.