Procedure: Signer failure

Written by Rick van Rein in category: Procedures, Resilience

When?
If either the master or slave instance of OpenDNSSEC fails, follow the procedures below. It is assumed that an empty instance is available as a cloneable virtual machine.

What?

  • The failed signer is fenced and/or destroyed
  • At most one master is active (at any time)
  • One master is active after completing the procedure below
  • Eventually, one slave signer is back in sync with the master signer

Why?
Since OpenDNSSEC is currently unsuitable for multi-master mode, there may never be more than one actively signing master instance of OpenDNSSEC. We do set up slave instances that copy in the work done on the master, and having one of those on active standby is the end situation that this procedure re-establishes.
What to do depends on whether the master or slave fails. If the master fails, the best step to take is to turn the slave into a master; if the slave fails, it must be replaced.
Fencing a failed signer is useful because it avoids erratic behaviour of the signing service as a whole. Before doing this, it is generally useful to establish that the problems are not of a passing nature, such as connectivity problems.
Having a standby node is probably the best balance between maintenance overhead and emergency recovery work. Setting up a new instance of OpenDNSSEC from scratch or from another, active node may not be constructive.

How?

Handling slave failure:

  1. Establish that the slave has really failed, and that the problem is not of a passing nature.
  2. Fence the slave.
  3. Destroy the slave.
  4. Undo fencing of the slave.
  5. Clone the running master instance of OpenDNSSEC.
  6. Update its hostname and IP setup and bring it up, but do not start OpenDNSSEC.

Handling master failure:

  1. Establish that the master has really failed, and that the problem is not of a passing nature.
  2. Fence the master.
  3. Destroy the master.
  4. Clone the current slave to the place of the just-destroyed master.
  5. Set up IP and hostname on the new master, and bring it up. Configure it with the current list of zone/policy pairs. Then start OpenDNSSEC.
  6. Undo fencing of the master.
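
The two step lists above can be rehearsed against a toy model before they are needed in an emergency. The following is a minimal sketch, not part of OpenDNSSEC: the class `SignerPair`, its methods and the instance names are all hypothetical, and "fencing" is modelled as nothing more than a set of names. It only checks that the order of the steps never leaves more than one unfenced master.

```python
class SignerPair:
    """Toy model of a master/slave signer pair (all names are hypothetical)."""

    def __init__(self, master, slave):
        self.roles = {master: "master", slave: "slave"}
        self.fenced = set()   # instance names currently cut off from the network
        self.log = []         # steps performed so far

    def active_masters(self):
        return [name for name, role in self.roles.items()
                if role == "master" and name not in self.fenced]

    def _step(self, description):
        # The checklist item "at most one master is active (at any time)".
        if len(self.active_masters()) > 1:
            raise RuntimeError("more than one active master after: " + description)
        self.log.append(description)

    def fence(self, name):
        self.fenced.add(name)
        self._step(f"fence {name}")

    def unfence(self, name):
        self.fenced.discard(name)
        self._step(f"unfence {name}")

    def destroy(self, name):
        if name not in self.fenced:
            raise RuntimeError("refusing to destroy an unfenced signer")
        del self.roles[name]
        self._step(f"destroy {name}")

    def clone(self, source, target, role):
        if source not in self.roles:
            raise RuntimeError("clone source does not exist")
        self.roles[target] = role
        self._step(f"clone {source} -> {target} as {role}")


def handle_slave_failure(pair, master, slave):
    pair.fence(slave)
    pair.destroy(slave)
    pair.unfence(slave)
    pair.clone(master, slave, role="slave")


def handle_master_failure(pair, master, slave):
    pair.fence(master)
    pair.destroy(master)
    pair.clone(slave, master, role="master")   # still fenced at this point
    pair.unfence(master)
```

Calling `handle_master_failure(SignerPair("signer-a", "signer-b"), "signer-a", "signer-b")` walks through the master procedure; swapping two steps so that a second master appears unfenced would raise an error instead of passing silently.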


Procedure: Normal KSK rollover and parent sync

Written by Rick van Rein in category: Procedures, Resilience, Security

When?
Every once in a while, for example once a year, the KSK for each zone needs to be rolled over. This involves communication with the parent zone, which publishes the hash of the KSK as a DS record or, if multiple hash algorithms are supported, as multiple DS records describing one KSK. Depending on the parent, it may support or even enforce the use of one or multiple simultaneous KSK records during a rollover.

What?

  • After the procedure there is a single KSK in the zone being rolled.
  • After the procedure the parent’s DS records represent only a single KSK.
  • Throughout the procedure, the parent’s DS list is backed by at least as many KSKs in the zone.
  • After the parent removes a DS, it takes at least the DS’s TTL before the matching KSK may be removed from public DNS.
  • At any time during the procedure, the KSK signs at least one ZSK that signs the entire zone.

Why?
The parent generally wants to have uncluttered zones, so limiting the number of KSKs represented in its DS records is good. Having a minimum number of KSKs in a zone also helps to limit clutter in one’s own records.
The need to keep a KSK for a DS TTL after a DS vanishes from the parent is due to the possibility that a cache may hold the DS and thus expect to find the matching KSK.

How?

For parents which allow one KSK per domain:

  1. Introduce a new KSK to the zone.
  2. Have the zone signed with the new KSK as well as the old one.
  3. Wait until all caches have access to the newly signed records; that is, wait the longest RRSIG TTL after all authoritatives have picked up on the additional signatures.
  4. Publish the DS for the new KSK in the parent, replacing the old DS because the parent demands that.
  5. Wait until the parent publishes the new DS record.
  6. Wait for the longer of the old and new DS TTLs.
  7. Deprecate the old KSK from the signer.

For parents which welcome multiple DS per domain during rollover:

  1. Add a new KSK to the zone.
  2. Sign the zone with the new KSK as well as the old one.
  3. Publish the new DS in the parent zone, alongside the old one.
  4. Wait until the new DS is published.
  5. Wait for the old DS’s TTL.
  6. Deprecate the old KSK from the signer.
  7. Remove the old DS from the parent zone.
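
The waits in both variants above add up to a small timeline. The following is a minimal sketch of that arithmetic, assuming the TTLs are known and that the delay before the parent publishes the new DS can be estimated; the function name, its parameters and the example values are hypothetical and only serve as illustration.

```python
from datetime import datetime, timedelta


def ksk_rollover_schedule(double_signed_at, parent_delay, old_ds_ttl,
                          new_ds_ttl, max_rrsig_ttl, multiple_ds):
    """Return the earliest moment for each follow-up step of a KSK rollover.

    double_signed_at: when all authoritatives serve the zone signed with both KSKs
    parent_delay:     assumed time the parent needs to publish a submitted DS
    multiple_ds:      True if the parent accepts the new DS alongside the old one
    """
    if multiple_ds:
        # The new DS is added next to the old one, so it can be sent right away;
        # afterwards only the old DS's TTL has to pass.
        submit_ds = double_signed_at
        wait_after_publication = old_ds_ttl
    else:
        # The DS is replaced, so caches must first see the new signatures;
        # afterwards the longer of both DS TTLs has to pass.
        submit_ds = double_signed_at + max_rrsig_ttl
        wait_after_publication = max(old_ds_ttl, new_ds_ttl)

    ds_published = submit_ds + parent_delay
    return {
        "submit_ds": submit_ds,
        "ds_published": ds_published,
        "deprecate_old_ksk": ds_published + wait_after_publication,
    }


if __name__ == "__main__":
    schedule = ksk_rollover_schedule(
        double_signed_at=datetime(2010, 9, 1, 12, 0),
        parent_delay=timedelta(days=1),
        old_ds_ttl=timedelta(hours=2),
        new_ds_ttl=timedelta(hours=4),
        max_rrsig_ttl=timedelta(hours=1),
        multiple_ds=False,
    )
    for step, moment in schedule.items():
        print(step, moment.isoformat())
```

In practice the moment of DS publication should be observed rather than estimated; the sketch is mainly useful for planning how long a rollover will keep two KSKs in the zone.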


Procedure: Key backup and recovery

Written by Rick van Rein in category: Crypto, Procedures

When?
Making backups is a regular task. Recovering from a backup is only done when both Hardware Security Modules (or HSMs) have lost their data; if only one got damaged, follow the procedure for HSM replacement.

What?

  • Keys may never be used for signing before they are backed up.
  • Sufficient keys must always be available for signing until the next scheduled backup.

Why?
Key backup is important because it enables the recovery of keys in case of simultaneous trouble in both HSM locations. By its nature, an HSM will not release private key information, and as a consequence the loss of an HSM means the loss of all private key material contained in it.
Keys may not be used for signing before they are backed up because that would make it impossible to fully rely on the backup for recovery of all zones.
Sufficient key material must be available to roll over keys until the next backup because it is undesirable if rollovers cannot complete simply because no backup has been made.
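
Whether "sufficient" keys are available can be estimated with a small calculation. The following is a minimal sketch under two assumptions that are not stated above: each zone has its own keys, and every rollover consumes exactly one pre-generated, backed-up key. The function names and parameters are hypothetical.

```python
import math


def keys_needed_until_backup(days_until_next_backup, rollover_interval_days, zones):
    """Conservative upper bound on the keys consumed before the next backup."""
    if rollover_interval_days <= 0:
        raise ValueError("rollover interval must be positive")
    # A window of D days contains at most ceil(D / I) rollovers per zone
    # when rollovers happen every I days.
    rollovers_per_zone = math.ceil(days_until_next_backup / rollover_interval_days)
    return zones * rollovers_per_zone


def backup_is_sufficient(spare_backed_up_keys, days_until_next_backup,
                         rollover_interval_days, zones):
    """True if the backed-up spare keys cover all rollovers until the next backup."""
    needed = keys_needed_until_backup(days_until_next_backup,
                                      rollover_interval_days, zones)
    return spare_backed_up_keys >= needed
```

For example, with 10 zones, a 90-day rollover interval and the next backup 30 days away, `keys_needed_until_backup(30, 90, 10)` asks for 10 spare keys, so a stock of 8 backed-up keys would call for an earlier backup.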

How?

Key backup:

  1. Prepare OpenDNSSEC for key backup
  2. Remove the/a backup token from its secure storage location
  3. Plug the backup token into the HSM
  4. Instruct the HSM to make a backup of the HSM contents
  5. Remove the backup token from the HSM
  6. Return the backup token to its secure storage location
  7. If the keys were properly backed up, inform OpenDNSSEC of this

Key recovery:

  1. Be certain that both HSMs have lost the keys needed by OpenDNSSEC, because it would otherwise be better to recover at the HSM-level
  2. Ignore OpenDNSSEC, which will complain heavily
  3. Order new HSMs and install them as their manual dictates
  4. If delivery takes longer than 3 days, consider rolling zones to insecure DNS by removing their DS in the parent; signatures will expire if they are not refreshed
  5. Remove the backup token from its secure storage location
  6. Follow HSM manufacturer’s procedures for recovery from backup token
  7. After removal from the HSM, return the backup token to its secure storage location


Thoughts on procedures and checklists

Written by Rick van Rein in category: Procedures, Technical


These are a few general thoughts about procedures and checklists, before diving into the detail level required by some of them. In general, we see procedures as predefined steps that can satisfy a checklist without further thought for an operator with normal skills.

Procedures can be helpful because they take the creativity (and anxiety) out of a series of steps to be taken. They are useful in some places, but may be frustrating and limiting in others. In general, we like procedures if we are in a rush, such as when emergency recovery is needed. We also like procedures for things that are hard to verify but that must be consistently reliable, such as backups. Finally, we like procedures if we want to know how things are done, rather than just what is done; this may be helpful to formalise the co-operation with external parties.

A general list of questions to be answered for each issue would be:

  1. when is it needed (in which situations, and in which should it not be done)
  2. what is the task at hand (checklist to be fulfilled)
  3. why is it important, potential problems to be avoided
  4. how it is realised in steps of a practical size

The last point, how, is the essential procedure. In cases with more room for creativity, or even a need for some intelligent action, it may suffice to state a checklist under what and, instead of detailing the how, rely on the cleverness of an operator to satisfy each check on the list.


Actor responsibilities towards DNSSEC

Written by Rick van Rein in category: Procedures, Security, Technical, Users

Layers of system induce layers of responsibilities

We are working towards a DNS signing system with various roles at a number of levels. At each of these levels we assign responsibilities, many of which will not be new to the people involved. We are not primarily worried about people with bad intentions (within our organisation), so we do not split roles as zealously as we would if we’d have a cryptographic ball. What we came up with probably makes sense for any community with internal trust, including registrars and companies who do DNSSEC in-house. It is because of the internal trust that we do not object to one person fulfilling more than one role. The roles that we distinguish are:

Conceptual DNS user. This is the end-user who edits DNS zones over a web interface, without any knowledge about operational issues surrounding DNS or DNSSEC. Their responsibilities are straightforward by design:

  • Translating user requirements to conceptual DNS requirements
  • Editing zones accordingly through our web interface “SURFdomeinen”
  • Making a deliberate choice whether to use DNSSEC or not

DNS operator. These are the technicians working behind the SURFdomeinen web interface who maintain the actual publication of the web-entered zone information. In case DNSSEC is requested, the zone data must be sidetracked through a signer before being published. Responsibilities are more technical in nature, but do not involve cryptographic knowledge:

  • Server management: Authoritative DNS servers, notification processing, redundancy.
  • Sidetracking through OpenDNSSEC: If DNSSEC is requested, do not publish the unsigned zone but get it from the OpenDNSSEC signer instead.
  • Monitoring: Presence of valid zone data, processing updates in reasonable time.
  • Processing/propagating zone updates.
  • Zone backup, recovery, redundancy.

Security Officer. These are the people responsible for mindful use of the cryptographic facilities of OpenDNSSEC. They preferably have an active working knowledge of cryptography in general. Responsibilities center around:

  • Choose algorithms, key sizes, timing. Stay up to date with developments.
  • Ensure ZSK rollovers take place in a timely fashion.
  • Ensure KSK rollovers take place in a timely fashion, including parent exchanges.
  • Possibly monitor key material for freshness, signature correctness, key sizes.
  • Oversee key backup and recovery, as well as redundancy.
  • Manage the Hardware Security Modules: Partitioning, access control, emergency handling.
  • Manage the OpenDNSSEC signers: Access control, emergency handling.
  • If selected: secure PIN entry to bootstrap Hardware Security Modules after downtime.

Backup Officer. These are responsible for ensuring that fairly recent information is available in a backup location, and can be recovered in disastrous cases. Responsibilities are:

  • Key backups: Arranging the responsible party, regularly making the backups, informing OpenDNSSEC about the success in doing so.
  • Database backups: Dumping the database used by OpenDNSSEC to couple keys to zones and to track their lifecycle state; moving that dump offsite; being able to restore it if need be.

Some people actually play several of these roles. As stated, we assume that parties are mutually trusted, so there is no need to separate roles in order to keep any one person from having too much control. If you run a digital mint and plan to publish unspent coin identifiers in signed DNS, you should not copy this setup without thinking it over twice.


Monitoring signature expiration online

Written by Roland van Rijswijk in category: Uncategorized

One of the things we discovered while we were rolling out our deployment is that it is very important to monitor the availability of signed zones (see also this post by Migiel de Vos on monitoring). We have deployed default monitoring based on Nagios, with checks that verify if all signer components are running. One of the things we cannot check that way is whether signatures are valid for long enough. And that is a very important indicator of the status of the signer. Even if the signer daemon is running, that does not guarantee that it is actually re-signing the zone correctly.

We therefore decided that we should also monitor the validity of signatures online. To achieve this, we created a small tool that plugs into Nagios and that can check the validity time of the signatures for either a single resource record or for a whole zone using an AXFR-style transfer.

You can download this tool using the link below; the source distribution includes a README with instructions on building and using the tool. The tool is released under a BSD-style license (included).

Download the tool here: sigvalcheck-0.1.tar.gz
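
To illustrate the kind of check involved, here is a minimal sketch in Python. It is not the code of the tool above: it only parses RRSIG records that were already fetched and are given in presentation format, and maps the shortest remaining validity to the usual Nagios plugin exit codes (0 for OK, 1 for WARNING, 2 for CRITICAL). The function names, the thresholds and the sample record are assumptions made for the example.

```python
from datetime import datetime, timedelta, timezone

# Nagios plugin exit codes.
OK, WARNING, CRITICAL = 0, 1, 2


def rrsig_expiration(rrsig_line):
    """Return the expiration time of an RRSIG record in presentation format."""
    fields = rrsig_line.split()
    start = fields.index("RRSIG")
    # RDATA order: type covered, algorithm, labels, original TTL,
    # expiration, inception, key tag, signer name, signature.
    expiration = fields[start + 5]
    parsed = datetime.strptime(expiration, "%Y%m%d%H%M%S")
    return parsed.replace(tzinfo=timezone.utc)


def check_signatures(rrsig_lines, now,
                     warn=timedelta(days=3), crit=timedelta(days=1)):
    """Map the shortest remaining validity to a Nagios status and message."""
    remaining = min(rrsig_expiration(line) for line in rrsig_lines) - now
    if remaining <= timedelta(0):
        return CRITICAL, "CRITICAL: signatures have expired"
    if remaining <= crit:
        return CRITICAL, f"CRITICAL: signatures expire in {remaining}"
    if remaining <= warn:
        return WARNING, f"WARNING: signatures expire in {remaining}"
    return OK, f"OK: signatures valid for another {remaining}"


# Made-up sample record; the signature field is a placeholder.
SAMPLE = ("example.nl. 3600 IN RRSIG SOA 8 2 3600 "
          "20100930120000 20100901120000 12345 example.nl. c2lnbmF0dXJl")
```

A real plugin would print the message and exit with the status code; sensible thresholds depend on the signature validity and refresh periods configured in the signer.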

UPDATE: The trunk of OpenDNSSEC also includes a very useful monitoring tool that integrates with Nagios; it is written in Ruby and available through the OpenDNSSEC subversion repository.


Master/slave replication with OpenDNSSEC

Written by Rick van Rein in category: Architecture, Resilience

In a previous article we discussed the idea of a high-availability Hardware Security Module (or HSM) service. To make the entire DNSSEC signing service act in high-availability mode there is one more part to replicate, namely the OpenDNSSEC signer machines. These manage the procedures and are aware of the DNS and timing intricacies of DNSSEC. If all signing is done on one machine, its failure would introduce a risk of missing the signing deadline of some domains, which would lead to those domains being rejected by secure resolvers. Domain downtime is generally bad, and may even block DNSSEC adoption, so we will set up redundant signer machines.

OpenDNSSEC is not really designed (at least in version 1.1) to function in redundant mode, but there are ways of getting it to function redundantly. If one signer is an actively signing master and the other is a slave that clones the master’s output, then it ought to be possible to switch them from master to slave, and from slave to master. This is possible because OpenDNSSEC from version 1.1 on will not re-sign a full zone but instead tries to reuse the signatures it already has. If we feed the master’s signatures back into the slave, we should be able to have it pick up where a (failing) master left off. As a result, the signing service would continue without noticeable interruption.

This makes our network diagram a little more complex:

One of the identical pair of HSMs will be backed up regularly

Image components by OpenClipArt.org

Furthermore, the serving OpenDNSSEC master will push its signed zones to the authoritative name servers that publish the zone. In fact, the only part of the infrastructure that is not necessarily redundant is the web interface that runs the end-user application “SURFdomeinen”; this interface could go down without domains dropping off the Internet. It would only prevent end users from changing their domains, which is not a concern to address in this exploration of a DNSSEC service.

Needless to say, the HSMs and the signer machines are spread over two locations, and managed by two independent parties. The cross-over in the diagram actually covers quite a few kilometers, but since it won’t carry much data there is hardly a burden from that slight inefficiency. And, because we run the OpenDNSSEC instances as virtual machines, we will actually keep a third one ready to clone so it can quickly replace a failing instance. As long as data is replicated over both active instances, it does no harm to simply destroy a failed one, and start a fresh instance from scratch.

Note that an HSM generally does not mind being addressed by multiple clients at the same time; this is a normal mode of operation for PKCS #11 implementations.

What remains to be done is to define master/slave handover procedures. Since these will usually be emergency-time actions, it is best if these are spelled out in detail in a separate place. Our general approach will be to ensure that the original signer is really down, then we assure that database and HSM are in a proper state, after which we start the signing service on the replacement signer.


Validation rate growing week by week

Written by Roland van Rijswijk in category: General

All SURFnet’s DNS resolvers perform DNSSEC validation and we use the Cacti plug-in for Unbound to graph our nameserver statistics. This yields some interesting data since we can observe the DNSSEC validation rate. And that rate has been showing signs of significant growth since the root got signed.

Let me first show you a snapshot graph for one of our resolvers:

Snapshot for ns0.amsterdam1.surf.net

In the graph you can see some stats for the nameserver. The interesting metric here is the pink line (incorrectly labeled “Answer serure”). What is significant is that this line is actually visible. Before the root was signed, this line was glued to the x-axis, and only in the past few weeks has it become visible for the first time. The numbers, by the way, should be interpreted as follows: the total number of queries that are potentially validatable is NOERROR + NXDOMAIN (in this case that is 533.54 qps + 94.60 qps = 628.14 qps). The number of queries that could actually be validated is 20.67 qps. This gives a validation rate (at the point in time this snapshot was made) of about 3.3%.
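
The same arithmetic as a small sketch, for anyone who wants to compute the rate from their own resolver statistics; the function and parameter names are made up for this example.

```python
def validation_rate(noerror_qps, nxdomain_qps, secure_qps):
    """Fraction of potentially validatable answers that were actually validated."""
    validatable = noerror_qps + nxdomain_qps
    if validatable == 0:
        return 0.0   # no traffic, so no meaningful rate
    return secure_qps / validatable


if __name__ == "__main__":
    # Values taken from the snapshot discussed above.
    rate = validation_rate(noerror_qps=533.54, nxdomain_qps=94.60, secure_qps=20.67)
    print(f"validation rate: {rate:.1%}")
```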

Now this may not sound very significant or spectacular until we look at a graph for the validation rate over time:

Validation rate graphed over 12 weeks

The graph shows the average validation rate over the past twelve weeks (week numbers are shown on the x-axis). The root was signed in week 28, and it is clearly visible that since then the validation rate has been climbing steadily. I’m going to be watching this trend to see if this growth continues, but for now it’s showing that DNSSEC deployment clearly benefited from the root signing 😉


Nearly there :-)

Written by Roland van Rijswijk in category: Uncategorized


We’ve put the champagne on ice and the cake has been ordered… Since yesterday afternoon 13:00h CET surfnet.nl is signed! All we have to wait for now is the ability to get a DS record in the .nl zone, which will hopefully happen later this month. This means that our DNSSEC system is now in full production. So all the months of hard work are about to pay off.


User study results

Written by Roland van Rijswijk in category: General, Users


One of the goals of our project was to perform a user study among our constituency (higher education, academia and research) to find out what the interest in DNSSEC is in our community. We finished this study in August and have just published the results, which are quite interesting: a considerable number of respondents show a keen interest in DNSSEC.

You can download the report here.
