VMFS Deep Dive

Agenda



ESX Storage Stack and VMFS

VMFS vs. RDM

SCSI reservation conflicts

Multipathing

Snapshot LUNs and resignaturing


The Storage Stack in VI3




VMFS – A Clustered filesystem for today’s dynamic IT world

  • Built-in VMFS cluster file system
  • Simplifies VM provisioning
  • Enables independent VMotion and HA restart of VMs in a common LUN
  • File-level locking protects virtual disks
  • Separates VM and storage administration
  • Use RDMs for access to SAN features

Raw Device Mapping (RDM)

Mapping files in a VMFS volume

  • Presented as a virtual SCSI device
  • Key contents of the metadata include the location and locking of the mapped device
  • Use when a virtual machine must interact with a real disk on the SAN
  • Use for Microsoft Cluster Service (MSCS) clustering

Storage – VMFS vs. RDM

RAW                                          | VMFS
---------------------------------------------+----------------------------------------------
RAW may give better performance              | Leverage templates and quick provisioning
RAW means more LUNs – more provisioning time | Fewer LUNs means you don’t have to watch Heap
Advanced features still work                 | Scales better with Consolidated Backup
                                             | Preferred Method

Skeleton of a VMFS


A VMFS holds files and has its own metadata

Metadata gets updated through:

  • Creating a file
  • Changing a file’s attributes
  • Powering on a VM
  • Powering off a VM
  • Growing a file



  • When metadata is updated, the VMkernel places a non-persistent SCSI reservation on the entire VMFS volume
  • Lock held on volume for the duration of the operation
  • Other VMkernels are prevented from doing metadata updates
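
The volume-wide reservation described above can be pictured with a toy model. This is a minimal sketch, not VMkernel code: the `VmfsVolume` class, its methods and the host names are all hypothetical, and the "reservation" is just a field recording which host holds it.

```python
class ReservationConflict(Exception):
    """Raised when another host already holds the volume reservation."""


class VmfsVolume:
    """Toy model of a VMFS volume shared by several hosts (illustrative only)."""

    def __init__(self, name):
        self.name = name
        self.reserved_by = None  # host holding the reservation, or None
        self.metadata = {}       # file name -> attribute dict

    def reserve(self, host):
        # Only one host may hold the reservation at any time.
        if self.reserved_by not in (None, host):
            raise ReservationConflict(
                f"{self.name} is reserved by {self.reserved_by}")
        self.reserved_by = host

    def release(self, host):
        if self.reserved_by == host:
            self.reserved_by = None

    def update_metadata(self, host, filename, **attrs):
        """Reserve the whole volume, apply the update, then release."""
        self.reserve(host)
        try:
            self.metadata.setdefault(filename, {}).update(attrs)
        finally:
            self.release(host)
```

While one host holds the reservation, `update_metadata` from any other host raises `ReservationConflict`; once the reservation is released, the same call succeeds.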

VMFS 3 & SCSI Reservations

  • Concurrent-access filesystem
  • Most I/O happens simultaneously from all hosts
  • Filesystem metadata updates are atomic and performed by the requesting host



Examples of metadata updates:

  • Locking a file for read/write (e.g. a vmdk when powering on a VM)
  • Creating a new directory or file
  • Growing a file, etc.

For the time needed by the locking operation (NOT the metadata update), the LUN is reserved (i.e. locked for access) by a single host

SCSI Reservation Conflict – What it is


What happens if we try to perform I/O to a LUN that’s already reserved?

A retry counter is decremented and the I/O operation is retried

The retry is scheduled with a pseudo-random algorithm

If the counter reaches 0, we have a SCSI reservation conflict

SCSI: 6630: Partition table read from device vmhba1:0:6 failed: SCSI reservation conflict (0xbad0022)

SCSI: vm 1033: 5531: Sync CR at 64

SCSI: vm 1033: 5531: Sync CR at 48

SCSI: vm 1033: 5531: Sync CR at 32

SCSI: vm 1033: 5531: Sync CR at 16

SCSI: vm 1033: 5531: Sync CR at 0

WARNING: SCSI: 5541: Failing I/O due to too many reservation conflicts

WARNING: SCSI: 5637: status SCSI reservation conflict, rstatus 0xc0de01 for vmhba1:0:6. residual R 919, CR 0, ER 3
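
The retry behaviour described above (decrement a counter, retry after a pseudo-random delay, fail when the counter reaches 0) can be sketched as follows. This is an illustration only: the function name, the delay range and the starting counter value are assumptions, not the values the VMkernel uses.

```python
import random
import time


class ReservationConflict(Exception):
    """Raised when the retry counter runs out while the LUN stays reserved."""


def issue_io(lun_is_reserved, retries, rng=None, sleep=time.sleep):
    """Retry an I/O until the LUN is free or the counter reaches 0.

    lun_is_reserved: callable returning True while another host holds the LUN.
    retries: starting value of the retry counter (hypothetical; set by caller).
    Returns the number of retries that were consumed.
    """
    rng = rng or random.Random()
    counter = retries
    while lun_is_reserved():
        counter -= 1
        if counter <= 0:
            raise ReservationConflict("too many reservation conflicts")
        # Pseudo-random delay so competing hosts do not retry in lockstep.
        sleep(rng.uniform(0.0, 0.01))
    return retries - counter
```

Passing `sleep=lambda s: None` makes the sketch run instantly, which is convenient when experimenting with different reservation patterns.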

Who’s holding a SCSI Reservation?

One ESX host (persistent reservation)

  • vmkfstools -L reserve: this should NEVER EVER be done
  • Interaction with installed third-party management agents

Multiple ESX hosts, alternately

  • High latency/slow SAN
      o Critical lock-passing between ESX hosts during VMotion
  • SAN firmware slow in honoring SCSI reserve/release
      o Synchronously mirrored LUNs

One non-ESX host

  • LUN erroneously mapped to e.g. a Windows host

No host

  • Persistent reservation held by the SAN
  • Needs investigation by the SAN vendor

ESX Server Multipathing

Multipathing – vmhbaN:T:L:P notation

Determined at boot, install or rescan:

  • N = adapter number
  • T = target number (generally 1 SP = 1 target)

Determined by the SAN:

  • L = LUN ID
  • SCSI identifier of the LUN (not shown here)

Determined at datastore or extent creation:

  • P = partition number (0 or absent = whole disk)
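
To make the notation concrete, here is a small parser for vmhbaN:T:L:P names. It is a sketch for illustration; the `VmhbaPath` type and the function names are hypothetical and not part of any VMware tool.

```python
import re
from typing import NamedTuple, Optional


class VmhbaPath(NamedTuple):
    adapter: int              # N
    target: int               # T
    lun: int                  # L
    partition: Optional[int]  # P, None when absent


_PATTERN = re.compile(r"^vmhba(\d+):(\d+):(\d+)(?::(\d+))?$")


def parse_vmhba(name: str) -> VmhbaPath:
    """Split a vmhbaN:T:L[:P] name into its numeric components."""
    match = _PATTERN.match(name)
    if match is None:
        raise ValueError(f"not a vmhbaN:T:L:P name: {name!r}")
    n, t, l, p = match.groups()
    return VmhbaPath(int(n), int(t), int(l), int(p) if p is not None else None)


def is_whole_disk(path: VmhbaPath) -> bool:
    """Partition 0 or a missing partition both mean the whole disk."""
    return path.partition in (None, 0)
```

For example, `parse_vmhba("vmhba1:0:6")` yields adapter 1, target 0, LUN 6 with no partition, which `is_whole_disk` treats the same as an explicit partition 0.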

Per-LUN Multipathing Failover Policy

VMware supports using only one path at a time

  • MRU = Most Recently Used
  • Fixed = choose a preferred path and fail back to it
      o With multiple ESX hosts or multiple LUNs, this allows for manual load balancing between SPs

Never set up the Fixed policy with an active/passive SAN! Why?

Path Thrashing

  • Only possible on active/passive SANs
  • Host 1 needs access to the LUN through SP1
  • Host 2 needs access to the LUN through SP2
  • The LUN keeps being trespassed between SPs and it’s never available for I/O
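
The thrashing scenario can be illustrated with a tiny simulation. It assumes a simplified active/passive array in which a LUN is owned by exactly one SP and any I/O arriving through the other SP forces a trespass; the function and the host and SP names are hypothetical.

```python
def count_trespasses(io_sequence, sp_for_host, initial_owner):
    """Count how often LUN ownership moves between storage processors.

    io_sequence: hosts issuing I/O, in order.
    sp_for_host: mapping host -> SP that host insists on using.
    initial_owner: SP that owns the LUN before the first I/O.
    """
    owner = initial_owner
    trespasses = 0
    for host in io_sequence:
        wanted = sp_for_host[host]
        if wanted != owner:
            # The array moves the LUN to the SP the I/O arrived on.
            owner = wanted
            trespasses += 1
    return trespasses


hosts = ["host1", "host2"] * 3
thrashing = count_trespasses(hosts, {"host1": "SP1", "host2": "SP2"}, "SP1")
agreeing = count_trespasses(hosts, {"host1": "SP1", "host2": "SP1"}, "SP1")
```

When the two hosts insist on different SPs, almost every I/O triggers a trespass; when they agree on one SP, none does.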

Multipathing

Active/Active

  • LUNs presented on multiple Storage Processors
  • Fixed path policy
      o Failover on NO_CONNECT
  • Preferred path policy
      o Failback to preferred path if it recovers

Active/Passive

  • LUNs presented on a single Storage Processor
  • MRU (Most Recently Used) path policy
      o Failover on NOT_READY, ILLEGAL_REQUEST or NO_CONNECT
  • No preferred path policy, no failback to preferred path

Load Balancing

Fixed (Preferred Path)

  • First active path discovered, or user-configured
  • Active/Active arrays only

Most Recently Used (MRU)

  • Active/Active arrays
  • Active/Passive arrays
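
The difference between the two policies comes down to what happens when a failed path recovers. The sketch below assumes a deliberately simplified model in which paths are plain strings and the caller supplies the list of usable paths; `select_path` is a hypothetical helper, not an ESX API.

```python
def select_path(policy, paths_up, current, preferred=None):
    """Pick the single path to use for a LUN.

    policy: "fixed" or "mru".
    paths_up: usable paths, in discovery order.
    current: path in use before this decision.
    preferred: preferred path (Fixed policy only).
    """
    if not paths_up:
        return None  # no usable path at all
    if policy == "fixed":
        if preferred in paths_up:
            return preferred      # fail back as soon as the preferred path is up
        if current in paths_up:
            return current
        return paths_up[0]        # fail over to the first usable path
    if policy == "mru":
        if current in paths_up:
            return current        # stay put: MRU never fails back
        return paths_up[0]
    raise ValueError(f"unknown policy: {policy!r}")
```

With Fixed, a host that failed over from path A to path B returns to A once A is usable again; with MRU it stays on B until B itself fails.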

Snapshot LUNs and Resignaturing

How VMware ESX Identifies Disks

  • Each LUN has a SCSI identifier string provided by the SAN vendor
  • The SCSI ID stays the same across different paths
  • The VMkernel identifies disks with a combination of LUN ID, SCSI ID and part of the model string

# ls -l /vmfs/devices/disks/

total 179129968

-rwxrwxrwx 1 root root 72833679360 Nov 13 12:16 vmhba0:0:0:0

lrwxrwxrwx 1 root root 58 Nov 13 12:16 vmhba1:0:0:0 -> vml.020000000060060160432017002a547c3e7893dc11524149442035

lrwxrwxrwx 1 root root 58 Nov 13 12:16 vmhba1:0:1:0 -> vml.02000100006006016043201700a99d1c3bb9c5dc11524149442035

lrwxrwxrwx 1 root root 58 Nov 13 12:16 vmhba1:0:10:0 -> vml.02000a000060060160432017000db2f61d17d3dc11524149442035

(…)
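
The vml names in the listing appear to encode the three components mentioned above. The sketch below decodes them under an assumed field layout inferred only from the three sample names (a 2-digit prefix, a 4-digit hex LUN ID, 4 further digits, a 32-digit SCSI ID, then hex-encoded ASCII from the model string); it is not a documented format, and `decode_vml` is a hypothetical helper.

```python
def decode_vml(name):
    """Split a vml.* name into (lun_id, scsi_id, model_part).

    The offsets are an assumption inferred from the sample listing,
    not an official description of the vml format.
    """
    if not name.startswith("vml."):
        raise ValueError(f"not a vml name: {name!r}")
    body = name[4:]
    lun_id = int(body[2:6], 16)   # e.g. "000a" -> 10
    scsi_id = body[10:42]         # 32 hex digits
    model_part = bytes.fromhex(body[42:]).decode("ascii", errors="replace")
    return lun_id, scsi_id, model_part


sample = "vml.02000a000060060160432017000db2f61d17d3dc11524149442035"
lun_id, scsi_id, model_part = decode_vml(sample)
```

Under this reading, the sample for vmhba1:0:10:0 decodes to LUN ID 10, and the trailing bytes spell out a fragment of the array's model string.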

Snapshot LUNs & Resignaturing – Key Facts

  • ESX identifies objects in a VMFS datastore by path e.g. /vmfs/volumes//
  • The VMFS UUID (aka signature) is generated at VMFS creation
  • The VMFS header includes hashed information about the disk on which it was created

The Check for Snapshot LUNs

VMFS relies on SCSI reservations to acquire on-disk locks, which in turn enforce atomicity of filesystem metadata updates

SCSI reservations don’t work across mirrored LUNs

To avoid corruption, we need to prevent mounting a datastore and a copy of it at the same time

  • On rescan, the information about the disk in the VMFS header metadata (m/d) is checked against the actual values
  • If any field doesn’t match, the VMFS is not mounted and ESX complains it’s a snapshot LUN

LVM: 5739: Device vmhba1:0:1:1 is a snapshot:

LVM: 5745: disk ID:

LVM: 5747: m/d disk ID:

ALERT: LVM: 4903: vmhba1:0:1:1 may be snapshot: disabling access. See resignaturing section in SAN config guide.
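
The rescan check amounts to a field-by-field comparison between what the VMFS header recorded and what the device reports now. A minimal sketch, assuming both are available as dictionaries; the field names `lun_id` and `scsi_id` and the sample values are hypothetical.

```python
def mismatched_fields(header, actual):
    """Return the header fields whose recorded value differs from the device."""
    return sorted(k for k, v in header.items() if actual.get(k) != v)


def is_snapshot(header, actual):
    """A single mismatching field is enough to treat the LUN as a snapshot."""
    return bool(mismatched_fields(header, actual))


# Hypothetical values: the same data presented under a different LUN ID.
recorded = {"lun_id": 1, "scsi_id": "aaaa"}
presented = {"lun_id": 7, "scsi_id": "aaaa"}
```

Here `mismatched_fields(recorded, presented)` reports only the LUN ID, which is already enough for the volume to be flagged and left unmounted.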

LUNs Detected as Snapshots – Causes

  • LUN ID mismatch
  • SCSI ID change (e.g. LUN copied to a new SAN)
  • They actually are snapshots (e.g. at a DR site)



LUNs Detected as Snapshots – How to Fix

Are they mirrored/snapshot LUNs?

  • If yes: will the ESX host(s) ever see both the original and the copy at the same time?
      o Yes – resignature
      o No – either allow snapshots or resignature
  • If no: do multiple ESX hosts see the same LUN with different IDs?
      o Yes – fix the SAN config; if not possible, allow snapshots
      o No – IDs permanently changed: either allow snapshots or resignature
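
The decision tree above can be written down as a small function. This is only a restatement of the slide in code; the function and parameter names are hypothetical.

```python
def recommend_fix(is_copy, sees_both=False,
                  ids_differ_between_hosts=False, san_fixable=True):
    """Return the recommended action for a LUN detected as a snapshot.

    is_copy: the LUN really is a mirror or snapshot.
    sees_both: a host may see original and copy at the same time.
    ids_differ_between_hosts: hosts see the same LUN with different IDs.
    san_fixable: the SAN configuration can be corrected.
    """
    if is_copy:
        if sees_both:
            return "resignature"
        return "allow snapshots or resignature"
    if ids_differ_between_hosts:
        if san_fixable:
            return "fix the SAN config"
        return "allow snapshots"
    # IDs permanently changed.
    return "allow snapshots or resignature"
```

For example, a mirrored LUN that a host could see alongside its original always comes back as "resignature", the only branch with no alternative.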

Resignaturing Issues

  • Never ever resignature while the VMs are running
  • Resignaturing implies changing the UUID and the datastore name
  • All paths to filesystem objects (vmdks, VMs) will become invalid!

About Prasenjit Sarkar

Prasenjit Sarkar is a Product Manager at Oracle for their Public Cloud with a primary focus on Cloud Strategy, Oracle OpenStack, PaaS, Cloud Native Applications and API Platform. His primary focus is driving Oracle’s Cloud Computing business with commercial and public sector customers; helping to shape and deliver on a strategy to build broad use of Oracle’s Infrastructure as a Service (IaaS) and Platform as a Service (PaaS) offerings such as Compute, Storage, Java as a Service, and Database as a Service. He is also responsible for developing public/private cloud integration strategies, customers’ Cloud Computing architecture vision, future state architectures, and implementable architecture roadmaps in the context of the public, private, and hybrid cloud computing solutions Oracle can offer.