MIGRATION OF VIRTUAL MACHINES TO PROXMOX HOST FAILS AFTER NODE MAINTENANCE

ENVIRONMENT

- Multi-node Proxmox cluster (v6.0+)
- The storage volume holding the VMs or CTs is not shared between nodes

SCENARIO

A few days ago, one of my servers developed a faulty hard drive, which caused I/O errors on a physical volume holding several virtual machines and containers on a clustered Proxmox node. Once I had replaced the drive and fixed the problem, I tried to migrate the potentially affected machines back and got the following error message:
 


ERROR: found stale volume copy '${VOLUME}:${VM_ID}/vm-${VM_ID}-disk-0.raw' on node '${AFFECTED_PROXMOX_NODE}'

 
The root cause was that the mount point was flagged 'readonly' at the time I migrated the virtual machines to other cluster nodes, which I did so that I could shut the affected host down cleanly and replace the broken hard drive. Because of that flag, the files belonging to those machines on the affected node could not be removed automatically.
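
To confirm that a read-only mount is what you are dealing with, you can look at the mount options of the volume. The following is a minimal sketch, not part of the original procedure: the function name mount_is_readonly is hypothetical, and it assumes the volume is mounted under /mnt/pve/${VOLUME} (the path used in the cleanup command in this article) and that the host exposes the usual Linux /proc/mounts table.

```shell
#!/bin/sh
# Hypothetical helper: succeeds (exit 0) if the given mount point is
# currently mounted with the "ro" option, fails (exit 1) otherwise.
#   $1 = mount point, e.g. /mnt/pve/${VOLUME}
#   $2 = mounts table to read (optional, defaults to /proc/mounts)
mount_is_readonly() {
    awk -v mp="$1" '
        # Field 2 is the mount point, field 4 the comma-separated options.
        $2 == mp {
            found = 1
            ro = 0
            n = split($4, opts, ",")
            for (i = 1; i <= n; i++)
                if (opts[i] == "ro") ro = 1
        }
        END { exit ((found && ro) ? 0 : 1) }
    ' "${2:-/proc/mounts}"
}
```

For example, mount_is_readonly "/mnt/pve/${VOLUME}" && echo "volume is read-only" prints the message only while the volume is still mounted read-only.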

So, to get rid of that error message and bring those machines back to their original node, I just had to remount the volume with the rw flag and then manually delete those files, on the affected node only and before trying to migrate the machines back:

CAUTION: To avoid potential data loss, issue the following command only on the node where the machines are no longer running, after they have been migrated to other nodes.
 


rm -rf -- "/mnt/pve/${VOLUME}/images/${VM_ID}"

 
After issuing that command, the migration finished successfully and the machines were up and running on their original cluster node without any errors.
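
The manual cleanup described above can be wrapped in a small guard that refuses to delete anything while the guest is still configured on the local node. This is only a sketch under stated assumptions, not the method used in this article: the names PVE_CONF_ROOT, STORAGE_ROOT, guest_is_local and remove_stale_copy are hypothetical, and the check assumes that /etc/pve/qemu-server and /etc/pve/lxc contain the configuration files of the guests that belong to the local node only.

```shell
#!/bin/sh
# Hypothetical, overridable roots so the logic can be tried out in a
# scratch directory before being pointed at a real Proxmox node.
PVE_CONF_ROOT="${PVE_CONF_ROOT:-/etc/pve}"
STORAGE_ROOT="${STORAGE_ROOT:-/mnt/pve}"

# Succeeds if a VM or CT with this ID is configured on the local node.
# Assumption: qemu-server/ and lxc/ under /etc/pve list local guests only.
guest_is_local() {
    [ -e "${PVE_CONF_ROOT}/qemu-server/$1.conf" ] ||
        [ -e "${PVE_CONF_ROOT}/lxc/$1.conf" ]
}

# Removes the stale image directory of a guest that now lives elsewhere.
#   $1 = storage volume name, $2 = VM/CT ID
remove_stale_copy() {
    volume="$1"
    vm_id="$2"
    # ":?" aborts the shell if either value is empty, so the rm below
    # can never expand to a parent directory.
    target="${STORAGE_ROOT}/${volume:?}/images/${vm_id:?}"

    if guest_is_local "$vm_id"; then
        echo "refusing: guest ${vm_id} is still configured on this node" >&2
        return 1
    fi
    if [ ! -d "$target" ]; then
        echo "nothing to remove at ${target}" >&2
        return 0
    fi
    rm -rf -- "$target"
}
```

On the affected node, the sequence would then be mount -o remount,rw "/mnt/pve/${VOLUME}" followed by remove_stale_copy "${VOLUME}" "${VM_ID}", after which the migration can be retried. The guard does not replace the CAUTION above; it only catches the case where the command is run on the node that currently owns the guest.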

DISCLAIMER

Note that this is only necessary if you do not replace the faulty disk containing the volume on which the machines and containers are hosted. Otherwise, if you replace the volume with a new one, the file and directory structure is simply reset, so there are no stale copies left and this procedure no longer applies.