Academic Research – Multi-Threaded Sequential IO can be termed “Perceived Random IO”

For the last couple of months we have been working on a few academic research projects where we try to hit the corner cases, or, you could say, nail down the narrow passages. I thought it would be a good idea to make these public and see how people are thinking along these lines. Let’s talk about our latest academic research project.

We all know how important the data access pattern can be in a storage subsystem. We also know that a storage subsystem sees either sequential IO or random IO, and that sequential IO is significantly faster than random IO on a per-thread basis.

However, multi-threaded sequential file access can appear random to the back-end storage subsystem if the users are accessing the same disks. In that case, even the users that are accessing their data sequentially will see performance degradation.

So, in a nutshell, sequential I/O is significantly faster than random I/O on a per-thread basis, but if you increase the number of concurrent sequential threads hitting the same disk, the back-end storage subsystem begins to see random behavior because the sequential streams are multiplexed together.
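
To make the multiplexing effect concrete, here is a minimal Python sketch (the block numbers are made up purely for illustration) that interleaves three per-thread sequential streams the way a shared queue would, and shows that the merged request stream is no longer sequential:

```python
# Three clients, each reading its own file sequentially.
streams = {
    "client_A": list(range(1000, 1008)),   # blocks 1000..1007 of file A
    "client_B": list(range(5000, 5008)),   # blocks 5000..5007 of file B
    "client_C": list(range(9000, 9008)),   # blocks 9000..9007 of file C
}

# Round-robin multiplexing: roughly what the shared disk ends up seeing.
multiplexed = []
for i in range(8):
    for blocks in streams.values():
        multiplexed.append(blocks[i])

print(multiplexed[:9])   # [1000, 5000, 9000, 1001, 5001, 9001, 1002, 5002, 9002]

# Every individual stream is perfectly sequential, but the merged stream jumps around:
jumps = sum(1 for a, b in zip(multiplexed, multiplexed[1:]) if b != a + 1)
print(f"{jumps} of {len(multiplexed) - 1} transitions are non-sequential")
```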

If your production environment deals with mostly sequential I/O, you need to make sure that you don’t have multi-threaded sequential access to the same set of disks. Such access appears random to the storage system: because the data sits in different locations, the disk has to seek all over the platter instead of reading each file sequentially, one after the other. We call this “Perceived Random IO”.

For example, suppose three users are sequentially reading the same or different files located on the same disk. Even though each user is issuing sequential I/Os, they are accessing different locations on the disk at the same time, which, to the storage array, looks like random access. Let me show you visually how it looks to a storage subsystem.

Multi-Threaded Sequential IO

If you run a test with 32 KB client I/Os and keep adding threads, with each thread accessing its own file on the same file system, you will see per-thread performance decrease as more threads are used, while the aggregate bandwidth increases. Obviously, more data is being accessed as the number of threads grows, but each client individually sees a performance decrease. A rough sketch of such a test is shown below.
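
As a rough illustration (not the actual benchmark used here), a test along these lines could be scripted as follows; the file names are placeholders, and a real test would use direct I/O through a tool such as fio so that the OS page cache does not mask the effect:

```python
import threading
import time

BLOCK_SIZE = 32 * 1024                               # 32 KB client I/Os
FILES = ["file1.dat", "file2.dat", "file3.dat"]      # placeholder paths, one file per thread

def sequential_read(path, results):
    """Read one file front to back in 32 KB chunks and record its throughput."""
    start, total = time.time(), 0
    with open(path, "rb") as f:
        while True:
            chunk = f.read(BLOCK_SIZE)
            if not chunk:
                break
            total += len(chunk)
    results[path] = total / (time.time() - start)    # bytes per second for this thread

results = {}
threads = [threading.Thread(target=sequential_read, args=(p, results)) for p in FILES]
for t in threads:
    t.start()
for t in threads:
    t.join()

for path, bps in results.items():
    print(f"{path}: {bps / 1e6:6.1f} MB/s per thread")
print(f"aggregate: {sum(results.values()) / 1e6:6.1f} MB/s")
```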

Now we need to tackle this problem smartly and think about a scalable solution.

Summary of the Solution

We need to make some assumptions before we start detailing the solution.

Assumptions

  1. This solution is for multi-threaded sequential reads coming from various sources for the same file. It does not factor in random I/O (whether single- or multi-threaded).
  2. The complete file resides on a single disk in the storage subsystem. The solution is not applicable if the file is distributed across multiple disks.

As I said earlier, we need a scalable and modular solution, so we started by defining the modules and their functions.

Converting pseudo-random reads to sequential reads, a.k.a. the Converter Module

Whenever we get multiple read requests for different blocks of the same file, also known as pseudo-random reads (sequential reads that appear as random reads to the storage subsystem), we queue the requests at the Distributed Virtual SCSI Optimization Layer (DVSOL). Once in the queue, the requests are sorted based on the following algorithm (a small code sketch follows the list):

  1. Get the last read block number, known as current.
  2. For each request in the queue, until the end of the queue is reached:
      1. If the block number is greater than current, add it to queue A.
      2. Else, if the block number is less than current, add it to queue B.
  3. Sort queue A in ascending order.
  4. Sort queue B in ascending order.
  5. Send queue A to the storage subsystem.
  6. Send queue B to the storage subsystem.
  7. Read the blocks and store them in the cache at the DVSOL.
  8. Send the respective blocks back to the requesters.
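
Here is a minimal Python sketch of that sorting step. The function name and queue layout are my own; in the real design this logic would live inside the DVSOL and issue actual SCSI reads:

```python
def reorder_pseudo_random_reads(pending_blocks, current):
    """Split pending block numbers around the last-read block and sort each half.

    pending_blocks: block numbers queued at the DVSOL for the same file.
    current: the last block number that was read (step 1).
    Returns the order in which blocks should be sent to the storage subsystem.
    """
    queue_a = sorted(b for b in pending_blocks if b > current)   # blocks ahead of the head
    queue_b = sorted(b for b in pending_blocks if b < current)   # blocks behind the head
    return queue_a + queue_b        # queue A first, then queue B (steps 5 and 6)

# Example from later in this post: last read was block 10, then 20, 15 and 1 arrive.
print(reorder_pseudo_random_reads([20, 15, 1], current=10))      # [15, 20, 1]
```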

Caching on demand, a.k.a. the Caching Module

This cache sits in the DVSOL layer and keeps a table that tracks the cached blocks per file. Every time a read is made on a particular file, the blocks that were read are copied into the cache and their block numbers are recorded in the block-per-file table. The next time there is a request for the same blocks, the request is fulfilled by reading the blocks from the cache, avoiding a read call to the storage subsystem. A rough sketch follows the diagram below.

DVSOL Layer 1
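
To make the idea concrete, here is a small Python sketch of such a per-file block cache with its tracking table. The class name and the storage read callback are illustrative assumptions, not the actual DVSOL interfaces:

```python
class DVSOLBlockCache:
    """Illustrative per-file block cache sitting in the DVSOL layer."""

    def __init__(self, read_from_storage):
        self.read_from_storage = read_from_storage   # callback: (file_id, block_no) -> bytes
        self.blocks_per_file = {}                    # file_id -> {block_no: data}

    def read(self, file_id, block_no):
        file_table = self.blocks_per_file.setdefault(file_id, {})
        if block_no in file_table:                        # cache hit: no storage call needed
            return file_table[block_no]
        data = self.read_from_storage(file_id, block_no)  # cache miss: read from storage
        file_table[block_no] = data                       # update the block-per-file table
        return data

# Usage with a dummy back end that fabricates block contents:
cache = DVSOLBlockCache(lambda f, b: f"{f}:block{b}".encode())
cache.read("X", 10)   # miss: fetched from the storage subsystem and cached
cache.read("X", 10)   # hit: served from the DVSOL cache
```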

Let us take an example situation and look at the savings this can bring to your environment.

Suppose we get four read requests for file X. First we get one request to read block number 10 (assuming one track contains 100 blocks); then, after Y ms, we get three parallel requests to read blocks 20, 15, and 1 respectively.

Time taken to complete the requests as of today

10 (time taken to move the read/write head from 10th to 20th block) + 85 (move the read/write head from 20th to 15th block) + 95 (move the read/write head from 15th to 1st block) = 190

Time taken to complete the requests using our proposed solution

Since we rearrange the queue, our existing queue of 20, 15, and 1 will look like the following:

A  | B
15 | 1
20 |

Now when we perform the reads it will be:

5 (move the read/write head from 10th to 15th block) + 5 (move the read/write head from 15th to 20th block) + 80 (move the read/write head from 20th to 1st block) = 90

From the above example you can see a direct benefit of approximately 53% (190 down to 90). In addition, if there are any further read requests for the same blocks, they can be serviced directly by the Caching Module, which adds to the time savings. The arithmetic is sketched below.
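
The same arithmetic in a couple of lines of Python, using the per-hop distances quoted above:

```python
# Head movement per hop (in blocks), taken from the example above.
today = [10, 85, 95]        # 10 -> 20, 20 -> 15, 15 -> 1, in arrival order
reordered = [5, 5, 80]      # 10 -> 15, 15 -> 20, 20 -> 1, after the Converter Module

saving = 1 - sum(reordered) / sum(today)
print(sum(today), sum(reordered), f"{saving:.0%}")   # 190 90 53%
```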

Implementation point-of-view

DVSOL can be implemented as a module at the ESXi kernel level that talks to its peer modules on the other ESXi hosts; combined, they achieve the distributed architecture. Hence this layer will be able to decide whether the multi-threaded read requests received across hosts require optimization or not.
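
As a very rough sketch of that decision, the snippet below shows one way peer DVSOL modules could pool their pending reads per file and flag the files that need optimization. All names, data, and the two-host threshold are illustrative assumptions, not the actual vmkernel interfaces:

```python
from collections import defaultdict

# Pending sequential read requests reported by each host's DVSOL module:
# host -> list of (file_id, block_no). Purely illustrative data.
pending_by_host = {
    "esxi-01": [("X", 20), ("X", 21)],
    "esxi-02": [("X", 15)],
    "esxi-03": [("Y", 7)],
}

def files_needing_optimization(pending_by_host, min_streams=2):
    """Flag a file when two or more hosts are streaming it concurrently."""
    hosts_per_file = defaultdict(set)
    for host, requests in pending_by_host.items():
        for file_id, _block in requests:
            hosts_per_file[file_id].add(host)
    return {f for f, hosts in hosts_per_file.items() if len(hosts) >= min_streams}

print(files_needing_optimization(pending_by_host))   # {'X'}
```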

The final state of this solution will be as follows:

DVSOL Final State

So what is so unique about this solution, and why should someone pursue this path? The answer lies below :-)

  • Converting pseudo-random IO (multi-threaded sequential IO that is perceived as random IO by the storage subsystem) back into sequential IO
  • Introducing the Distributed Virtual SCSI Optimization Layer (DVSOL)

Note: Data caching in the DVSOL may not be a unique idea, since caching is already implemented in the storage subsystem. However, we have added it to further compound the benefits of our solution.

I have to give full credit to Rishi Sharda (Consultant, VMware PSO) and Vineet Sinha (Staff Engineer, vCHS R&D) for collaborating with me on this project and nurturing it from day one. You guys rock.

 

5 thoughts on “Academic Research – Multi-Threaded Sequential IO can be termed “Perceived Random IO””

  1. From a conceptual PoV this idea looks very similar to Infinio Accelerator.
    The solution implementation may be different, though.

    Anyway, the big idea is that cache is king and should be the closest to compute :)

    • That is good to know. I did not look at their solution, but yeah, I agree that caching + some other logic can do miracles. But it has to be carefully placed and implemented.

  2. Nice article for me because I’m very interested in this topic. But I hope the fact that multiple sequential workloads to a shared storage subsystem actually look like a more random workload is nothing new; it is well known and pretty logical.

    Rearranging a random IO workload pattern into a more sequential workload is generally a good idea, and any intelligent shared storage tries to do so. I absolutely agree that the architecture and implementation really matter. The advantage of your proposal is that the acceleration layer is close to the initiators and is plugged into the server OS (hypervisor), which brings better flexibility and the possibility of using less intelligent storage subsystems. But I’m sure you know these solutions also already exist – for example http://www.pernixdata.com.

    BTW, I feel VMware VSAN accelerates local storage like that, or in a similar fashion. Your DVSOL looks to me like VSAN’s vmkernel module + a local flash disk, leveraging the low latency of flash and keeping data buffers there, from where the data is intelligently rearranged and destaged onto the local rotational disks. To be honest, I have had no time to study VSAN deeply so far, but I would expect such behavior.

    Maybe I missed something and your proposed solution is somehow unique. Maybe you want to better organize existing LUN queues without another external data layer, but your drawing contains a “data cache layer”. So what is this layer? RAM, HBA queues, or flash?

    Thanks for investing your time into interesting blog posts.

    David.

  3. BTW, I would like to have another storage-related vmkernel module: a module doing storage (VMDK) encryption using an external multi-tenant key manager. Can you imagine how cool such a solution would be for public cloud providers? I have been calling for it for almost 4 years. When I worked for CISCO Advanced Services (PSO), colleagues were telling me it was not important because CISCO has SAN encryption on MDS Fibre Channel switches. I countered that it is better to have it in the hypervisor. One year ago CISCO stopped SAN encryption, so there is IMHO a big opportunity and need for it, especially in public IaaS clouds.

  4. Thanks for the excellent innovative ideas. I would like to know how you calculate the time taken for the movement of the head from one particular block to another particular block on the storage.

    As you said, the movement of the head from
    20th Block to 1st block – 80
    15th Block to 1st block – 95.

    Could you please let me know how you calculated the above time frames.
    Thanks and Regards
    Ram
