Azure Storage Blob is an Azure Storage offering that lets you store gigabytes of data across hundreds to billions of objects in hot, cool, or archive tiers, depending on how often the data needs to be accessed. You can store any type of unstructured data (images, videos, audio, documents, and more) easily and cost-effectively.
The latest offering comes with strong consistency, geo-redundant storage, and multiple blob types (block blobs, page blobs, and append blobs). There is also an option to update only a portion of a blob, which makes it bandwidth-efficient when used that way. Azure guarantees 99.99% uptime and supports reading from a secondary region in case of failure (with eventual consistency).
These features make it a strong candidate for storing serialized Machine Learning Models if you have models per tenant.
However, if the application is both read-heavy and write-heavy and the objects are large, this can lead to a variety of issues, primarily:
Large I/O calls
If the objects persisted to and read from Blob are large, the process will spend a significant amount of time and network bandwidth uploading data to Blob and downloading it again.
If not implemented properly, this can lead to low throughput and long overall processing time. For a 100 MB serialized file, a transfer could take up to 200 seconds in the worst case, which corresponds to an effective bandwidth of about 4 Mbit/s.
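As a rough sanity check on that figure, transfer time is approximately size divided by effective bandwidth. The sketch below is illustrative arithmetic only: the link speeds are assumed values, not measurements, and the function name is hypothetical.

```python
def transfer_seconds(size_mb: float, bandwidth_mbit_per_s: float) -> float:
    """Estimate transfer time, ignoring latency, retries, and protocol overhead."""
    size_mbit = size_mb * 8  # 1 byte = 8 bits
    return size_mbit / bandwidth_mbit_per_s

# Assumed link speeds for a 100 MB serialized file.
for mbit in (4, 50, 500):
    print(f"{mbit:>4} Mbit/s -> {transfer_seconds(100, mbit):.1f} s")
```

At 4 Mbit/s the estimate is 200 seconds; at 50 Mbit/s it drops to 16 seconds, which is why reducing the number of bytes on the wire matters so much for large objects.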
OutOfMemory Exceptions (OOM)
JSON is a very popular format for serializing objects for persistence and transport. A variety of libraries are available for serializing and deserializing objects in .NET. However, when dealing with very large objects you can run into out-of-memory issues.
An OutOfMemoryException is thrown when the common language runtime cannot allocate enough contiguous memory to successfully perform an operation. It can be thrown by any property assignment or method call that requires a memory allocation. The following are common causes of OOM exceptions:
Your app runs as a 32-bit process: 32-bit processes can allocate a maximum of 2 GB of virtual user-mode memory on 32-bit systems, and 4 GB of virtual user-mode memory on 64-bit systems. This can make it more difficult for the common language runtime to allocate sufficient contiguous memory when a large allocation is needed. In contrast, 64-bit processes can allocate up to 8 TB of virtual memory. These kinds of issues are particularly common when testing code in a debug environment on a local machine.
Lack of available memory due to memory leaks: Although the garbage collector is able to free memory allocated to managed types, it does not manage memory allocated to unmanaged resources such as operating system handles (including handles to files, memory-mapped files, pipes, registry keys, and wait handles) and memory blocks allocated directly by Windows API calls or by calls to memory allocation functions such as malloc.
You are repeatedly concatenating large strings: Because strings are immutable, each string concatenation operation creates a new string. The impact for small strings, or for a small number of concatenation operations, is negligible. But for large strings or a very large number of concatenation operations, string concatenation can lead to a large number of memory allocations and memory fragmentation, poor performance, and possibly OutOfMemoryException exceptions.
So far I have experimented with the following ways to improve the efficiency and throughput of a system that uses Azure Storage Blob:
JSON Serialization & De-serialization libraries
Many libraries are available in .NET for working with JSON, primarily for serialization and deserialization. Json.NET is the most popular package. While it comes with a lot of features, it is not necessarily the fastest library out there, so you should check whether another library works better for your workload. In the performance tests I ran, the results looked like this:
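The original results are not reproduced here. To run a comparison on your own data, a timing harness along these lines can help. It is a minimal sketch in Python rather than .NET, the `time_round_trip` helper and the sample object are hypothetical, and only the standard-library `json` module is registered; add other libraries to `candidates` as needed.

```python
import json
import time
from typing import Any, Callable

def time_round_trip(dumps: Callable[[Any], Any],
                    loads: Callable[[Any], Any],
                    obj: Any,
                    repeats: int = 5) -> float:
    """Return the best-of-N time (seconds) to serialize and deserialize obj."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        loads(dumps(obj))
        best = min(best, time.perf_counter() - start)
    return best

# Hypothetical sample object; use a representative object from your workload.
sample = {"weights": [i * 0.001 for i in range(50_000)]}

# Map each library name to its (serialize, deserialize) pair.
candidates = {"json (stdlib)": (json.dumps, json.loads)}

for name, (dumps, loads) in candidates.items():
    print(f"{name}: {time_round_trip(dumps, loads, sample):.4f} s")
```

Taking the best of several runs reduces noise from warm-up and background activity; measure with objects of realistic size, since relative performance can change as objects grow.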
Compression
The time needed to download or upload data to Blob is roughly proportional to the size of the object in question. A machine will therefore spend a significant amount of time in I/O if the objects are large. Generally, the flow is:
.Net Objects > Serialized to JSON String > Upload to blob
Download from Blob > Deserialize the JSON String > .Net Objects
If the objects are large, compression can come in really handy. For JSON strings, GZIP can give compression ratios as high as 10:1. This reduces I/O time at the cost of extra CPU cycles, which can increase the throughput and efficiency of the system.
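The ratio you actually get depends heavily on how repetitive your data is, so it is worth measuring on a real payload. The following is a minimal sketch in Python using only the standard library; the sample records are hypothetical and deliberately repetitive, so the ratio it prints says nothing about your own data.

```python
import gzip
import json

# Hypothetical payload: repetitive, model-like records.
records = [{"tenant_id": i % 10, "feature": "weight", "value": 0.5}
           for i in range(10_000)]

raw = json.dumps(records).encode("utf-8")
compressed = gzip.compress(raw)

ratio = len(raw) / len(compressed)
print(f"raw={len(raw)} bytes, gzip={len(compressed)} bytes, ratio={ratio:.1f}:1")
```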
Code: This way you can zip and upload data to blob without double serialization
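The original .NET snippet is not included in this text. As a language-neutral illustration of the idea, the sketch below (Python, standard library only) writes JSON directly into a GZIP stream so the full uncompressed JSON string is never held in memory. The `fake_blob_store` dictionary and `upload_compressed_json` function are hypothetical stand-ins for a real blob client.

```python
import gzip
import io
import json
from typing import Any, Dict

# Hypothetical stand-in for a blob container; a real client would upload the bytes.
fake_blob_store: Dict[str, bytes] = {}

def upload_compressed_json(blob_name: str, obj: Any) -> int:
    """Serialize obj as JSON straight into a gzip stream and 'upload' the result."""
    buffer = io.BytesIO()
    # json.dump writes the document in chunks; the text wrapper encodes each
    # chunk as UTF-8 and gzip compresses it, so the complete uncompressed
    # JSON string is never built in memory.
    with gzip.open(buffer, "wt", encoding="utf-8") as writer:
        json.dump(obj, writer)
    payload = buffer.getvalue()
    fake_blob_store[blob_name] = payload
    return len(payload)

size = upload_compressed_json("tenant-1/model.json.gz", {"weights": [0.1] * 1000})
print(f"uploaded {size} compressed bytes")
```

The same shape applies in .NET: chain the JSON writer into a compression stream and hand the resulting stream to the upload call, rather than serializing to a string first and compressing it afterwards.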
Code: This way you can unzip and download data from blob without double de-serialization
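The original .NET snippet is not included in this text. A minimal sketch of the read side in Python (standard library only) looks like this; `download_compressed_json` is a hypothetical name, and the payload is built locally instead of being downloaded from a blob.

```python
import gzip
import io
import json
from typing import Any

def download_compressed_json(payload: bytes) -> Any:
    """Decompress gzip bytes and parse the JSON they contain."""
    # payload stands in for the bytes downloaded from a blob (hypothetical).
    with gzip.open(io.BytesIO(payload), "rt", encoding="utf-8") as reader:
        return json.load(reader)

# Build a sample payload locally instead of downloading one.
sample = gzip.compress(json.dumps({"tenant": "t1", "weights": [1, 2, 3]}).encode("utf-8"))
model = download_compressed_json(sample)
print(model["weights"])  # [1, 2, 3]
```

Note that Python's `json.load` reads the whole decompressed text before parsing, so this shows the shape of the pipeline rather than truly incremental parsing. With a reader that parses from the stream as it decompresses, the intermediate string can be avoided entirely.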
Caching – In Memory Private caching
If you work with the same data again and again, you can reduce the number of times you have to download it from Azure Storage Blob by implementing a simple in-memory cache. However, this has its own limitations: memory use grows with the size of the cached objects, and the cache must handle concurrent access safely.
In this post I'll show how to use an in-memory cache. Note that for highly distributed systems with multiple nodes, a shared cache could be a better solution in many cases, but that is out of scope for this article.
The flow for In Memory Caching would be something like this:
This can help resolve multiple serialization, de-serialization and OOM issues.
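A read-through cache of this kind can be sketched as follows. This is a minimal illustration in Python under stated assumptions, not the original implementation: `InMemoryBlobCache` and `fake_download` are hypothetical names, the download callable stands in for fetching and deserializing a blob, and there is no eviction or size limit.

```python
import threading
from typing import Any, Callable, Dict, List

class InMemoryBlobCache:
    """Read-through cache: return the cached object, or download and cache it on a miss."""

    def __init__(self, download: Callable[[str], Any]) -> None:
        self._download = download  # hypothetical: fetches and deserializes a blob by name
        self._items: Dict[str, Any] = {}
        self._lock = threading.Lock()

    def get(self, blob_name: str) -> Any:
        with self._lock:
            if blob_name in self._items:
                return self._items[blob_name]
        # Download outside the lock so a slow fetch does not block other readers.
        value = self._download(blob_name)
        with self._lock:
            # Another thread may have filled the entry meanwhile; keep the first one.
            return self._items.setdefault(blob_name, value)

    def invalidate(self, blob_name: str) -> None:
        """Drop an entry, for example after writing a new version of the blob."""
        with self._lock:
            self._items.pop(blob_name, None)

calls: List[str] = []

def fake_download(name: str) -> Dict[str, str]:
    calls.append(name)  # record each simulated download
    return {"name": name}

cache = InMemoryBlobCache(fake_download)
cache.get("tenant-1/model.json")
cache.get("tenant-1/model.json")
print(len(calls))  # 1: the second read is served from memory
```

Because nothing is ever evicted, a cache like this only suits a bounded set of objects; for data that can grow without limit you would need a size cap or expiry policy, and writers must call `invalidate` so readers do not see stale objects.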
MD5 Checks – remove redundant integrity checks
Azure Storage Blob uses MD5 hashes to check the integrity of data. This is done at different levels: for PUT operations, the MD5 is computed at the client and checked at the service, and vice versa for GET operations. However, this is not needed if you enforce HTTPS-only transport, which can be configured in Azure Storage Blob. If HTTPS-only is enabled, the MD5 check is a redundant integrity check. It can be disabled in the following way:
Code: This way you can disable MD5 Validation with Azure Storage Blob
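The original .NET snippet for this setting is not included in this text, and the sketch below does not call any Azure API. Instead, to make concrete what the check involves, it computes the value a client sends in a Content-MD5 header (the Base64 encoding of the MD5 digest) and verifies it on the receiving side. The function names are hypothetical; this is the per-request work that is skipped when validation is disabled.

```python
import base64
import hashlib

def content_md5(data: bytes) -> str:
    """Return the Base64-encoded MD5 digest, as used in a Content-MD5 header."""
    return base64.b64encode(hashlib.md5(data).digest()).decode("ascii")

def verify(data: bytes, expected: str) -> bool:
    """Recompute the hash on the receiving side and compare it to the header value."""
    return content_md5(data) == expected

body = b'{"weights": [0.1, 0.2, 0.3]}'
header_value = content_md5(body)        # computed by the sender
print(verify(body, header_value))       # True: payload arrived intact
print(verify(body + b" ", header_value))  # False: payload was altered
```

Hashing touches every byte of the payload on both sides, so for very large objects the cost is proportional to object size on each upload and download.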
I haven't done much research on this topic, but a first reading suggests promising results. The following article may help further; I'll perform some experiments and share the results:
I work as a Software Engineer on the Microsoft Azure Production & Infrastructure Engineering team. My day-to-day work revolves around distributed systems and machine learning. I am excited to explore areas like Natural Language Processing and Knowledge Bases and to see whether they can help solve a bunch of problems that have yet to be solved commercially.