Procedure: Replacing a failed HSM

Written by Rick van Rein in category: Procedures, Resilience, Security

When?
This procedure is needed when one (not both) of the Hardware Security Modules (HSMs) has failed. Before starting, establish that no repair is possible.

What?
The replacement of a single HSM in high-availability mode is foreseen by the HSM vendor, so their procedures can be followed.

Why?
The purpose of replacing an HSM is to regain a situation where the secure key management hardware is redundant. Running on a single HSM should be considered a fragile mode of operation, because the key backups are the only step between fully functioning DNSSEC and total anarchy (a.k.a. unsigned DNS).
To restore high-availability mode, it is vital that a failed HSM is replaced as quickly as possible. Monitoring facilities are required to detect the need for this procedure.

How?

  1. Establish that an HSM has failed, and that no recovery is possible.
  2. Follow the procedures from the HSM manual to fence the broken HSM.
  3. Order a new HSM; a support level agreement may help to speed up this procedure. Have it delivered directly to the place where it needs to replace the broken one.
  4. If the backup token was used with the broken HSM, consider shipping any backup-related hardware to the location of the remaining HSM. If the delivery time is short, the only service disruption from not being able to back up is that no new key pairs can be put to use; in other words, signing will continue, but key rolling and possibly creating new zones will have to wait.
  5. Upon arrival of the new HSM, set it up immediately and integrate it with the other HSM, following the HSM manual.
  6. Verify that monitoring tools pick up on the new HSM and feel free to take a deep breath.
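
Steps 1 and 6 both depend on monitoring that can tell whether the HSM pair is still redundant. The following is a minimal sketch of such a check in Python, not the vendor's tooling. It assumes that each HSM exposes a TCP service port and that reachability of that port is a usable first signal; the host names, the port number and the function names are hypothetical placeholders. A real deployment would more likely rely on the health checks described in the HSM manual.

```python
import socket

# Hypothetical endpoints: replace with the addresses and service port
# of your own HSMs. Neither the names nor the port come from a vendor.
HSMS = {
    "hsm-a": ("hsm-a.example.net", 9000),
    "hsm-b": ("hsm-b.example.net", 9000),
}


def probe(host, port, timeout=3.0):
    """Return True if a TCP connection to the given host and port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name resolution failures.
        return False


def redundancy_status(reachable):
    """Classify a mapping of HSM name -> reachable flag.

    "redundant": more than one HSM is configured and all respond.
    "degraded":  at least one HSM responds, but there is no redundancy.
    "down":      no HSM responds.
    """
    up = sum(1 for ok in reachable.values() if ok)
    if up == len(reachable) and up > 1:
        return "redundant"
    if up >= 1:
        return "degraded"
    return "down"


def check_all(hsms=HSMS):
    """Probe every configured HSM and return (status, per-HSM results)."""
    reachable = {name: probe(host, port) for name, (host, port) in hsms.items()}
    return redundancy_status(reachable), reachable
```

Calling `check_all()` periodically and alerting on anything other than "redundant" would signal that this procedure is needed (step 1); a return to "redundant" after installing the replacement is one indication that monitoring has picked up the new HSM (step 6). A successful TCP connection only shows that the port answers, not that the HSM can sign.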
