Since last couple of months we are working on some of the Academic Research project where in we are trying to hit the corner cases or you can say nailing down the narrow passages. I thought it is a good idea to make these public and see how people are thinking on these lines. Let’s talk about our latest academic research project.
We all know how important the data access pattern can be in Storage subsystem. We also know that we have either Sequential IO or Random IO in Storage subsystems. You may also know that the Sequential IO is significantly faster than Random IO on a per thread basis.
However, file sequential multi-threaded access can appear Random on the back end Storage subsystem if users are accessing same disks. In that case those users that are accessing the data sequentially will definitely see performance degradation.
So in a nutshell, on a per-thread basis, sequential I/O is significantly faster than random I/O. If you increase the count of concurrent sequential threads to the same disk then that begins to generate random behavior on the back-end Storage subsystem due to the sequential streams being multiplexed together.
If your production environment is dealing with mostly sequential I/Os, you always need to make sure that you don’t have multi-threaded sequential access to the same set of disks. This will appear random to the storage system, since the data will be in different locations, the disks will have to search all over the disk, rather than reading all of the sequentially files serially. We call it as “Perceived Random IO“.
For example, three users are trying to sequentially read same/different files that are located on the same disk. Even though these users are sending sequential I/Os, they are accessing different locations on the disk at the same time, which for the storage array, feels like random access. Let me show you visually how it will look like to a Storage subsystem.
If you test it using 32 KB client I/Os and add more thread to the test and each thread is accessing a single file on the same file system, you can see, the per-thread performance will decrease as more threads are being used. However, the aggregate bandwidth will be increased. Obviously, more data is being accessed as the number of threads increase, but each client will see a performance decrease individually.
Now we need to tackle this problem smartly and think about a scalable solution.
Summary of the Solution
We need to make some assumptions before we start detailing out the solution.
- This solution is for multi-threaded sequential reads coming from various sources for the same file. This solution does not factor in the Random I/O (be a single or multi-threaded).
- The complete file resides on the same disk in the Storage Subsystem. This solution will not be applicable if the file is distributed across multiple disks.
As I said earlier we need to think about a scalable and modularize solution. Looking at that we wanted to start with modules and it’s functions.
Converting pseudo-random reads to sequential reads a.k.a Converter Module
Whenever we get multiple read requests for different blocks of the same file also known as pseudo-random reads (sequential reads which appear as random reads to storage subsystem), we cache the requests at the Distributed Virtual SCSI Optimization Layer (DVSOL) in a queue. Once in the queue the requests are sorted based on the following algorithm:
- Get the last read block number known as current.
- Until you reach the end of queue
- If the block number is greater than current
- Then add it to the queue A
- Else if the block number is less than current
- Then add it to the queue B
- Sort queue A in ascending order
- Sort queue B in ascending order
- Send queue A to the storage subsystem
- Send queue B to the storage subsystem
- Read the blocks and store them in the cache at DVSOL
- Send the respective blocks back to the requester
Caching on-demand a.k.a Caching Module
This cache sits in the DVSOL layer and has a table to track the blocks per file. Every time a read is made on a particular file, the blocks which were read are copied and stored in the cache and the block numbers are updated in the block per file table. So next time whenever there is a request for the same blocks, the request is fulfilled by reading the blocks from cache and avoiding the need to make a read call to the storage subsystem.
Let us take some example situation and look at the saving that it can bring to your environment.
Suppose we get four read requests for file X. We get one read request first to read block number 10 (assuming one track contains 100 blocks) then after Y ms we get 3 parallel requests to read blocks 20, 15 and 1 respectively.
Time taken to complete the requests as of today
10 (time taken to move the read/write head from 10th to 20th block) + 85 (move the read/write head from 20th to 15th block) + 95 (move the read/write head from 15th to 1st block) = 190
Time taken to complete the requests using our proposed solution
Since we rearrange the queue, our existing queue of 20,15 and 1 will be look like the following:
A | B
15 | 1
Now when we perform the reads it will be:
5 (move the read/write head from 10th to 15th block) + 5 (move the read/write head from 15th to 20th block) + 80 (move the read/write head from 20th to 1st block) = 90
From the above example you can see that there is a direct benefit of approximately 53% plus if there are any more read requests for the same blocks they can be directly serviced by the Caching Module which will add to the time saving.
DVSOL can be implemented as a module at the ESXi kernel level which will talk to other such module sitting across other ESXi hosts combined together to achieve the distributed architecture. Hence this layer will be able to decide if the multi-threaded read requests received across hosts requires optimization or not.
Final state of this solution will be as follows:
So what is so unique in this solution and why someone should pursue this path? Answer lies below 🙂
- Converting pseudo-random IO (multi-threaded sequential IO, perceived as random IO by the storage subsystem)
- Introducing Distributed Virtual SCSI Optimization Layer
Note: Data caching in the DVSOL may not be a unique idea as there is already caching implemented in the storage subsystem. However, we have added it to further compound the benefits of our solution.