(insert some picture here). What Keccak actually does is this: Keccak-f[1600] works on an array of 1600 bits (therefore its name for this setting). The array is called the state. r bits are added (XOR'ed) to this state and the Keccak permutation applied to it. If more bits are available, the procedure is repeated. In case less than r bits are available, some padding has to be done before the addition step.
After all data has been fed to Keccak, it moves to the sponge phase. The user reads the first r bits from the state. In case more bits are required, the user calls the Keccak permutation again on the state and reads again the first r bits. This way the length of the digest can be arbitrary long.
Our implementation uses only Keccak-f1600, which is also the available reference implementation. But depending on the hardware resources one can choose smaller values like 800, which would divide the memory needs of Keccak by half. The smallest value considered to be secure is 25. But it should be noted that for other values than 1600 it is up to the implementor to create the correct internal tables (round constant table, rho offset table).
At the heart of a Keccak permutation are five functions, named as theta, rho, pi, chi and iota. One of the difficulties of the GPU in general is to avoid branching as far as possible. Our implementation merges these five functions above into five lines of code. Using proper special tables, it runs branch-free. There is one problem left, though. The implementation generates so-called bank conflicts. One reason is that Keccak-f1600 is 64-bit, another one is that the state access pattern of Keccak. We tried a 32-bit implementation of Keccak-f1600 with an interleaving technique, but the profiling results have shown no advantage compared the 64-bit reference implementation. Using the new Fermi architecture would be very helpful here for reducing the bank conflicts. Another possibility might be to exploit texture memory.
We first give the code of the five functions mentioned above (taken from KeccakPermutationReference.c contained in the reference implementation). Then we show how to merge them into the code used by our implementation.
The Keccak permutation of Keccak-f[1600] consists of 24 rounds, each round calling those five function in sequence:The code above has to be compiled into a more GPU-friendly form to be executable by multiple threads in parallel. Here we give the main parts of our implementation of the Keccak permutation.
CUDA threads in the same block communicate via so-called shared memory. We declare four arrays in shared memory: A denotes the state of Keccak, while B, C and D are used as temporary buffer.
We redefine ROL64 as: It works on the GPU also for b = 64 or c = 64, so we can get rid of the ternary operator in ROL64.As the Keccak state A consists of 5×5 lanes of 64 bits size, the Keccak permutation can be executed by 25 threads in parallel. A so-called warp in CUDA language denotes a group of 32 threads. A warp is the entity which is actually schedulded by the thread manager on the GPU. Hence the Keccak permutation is executed by a warp. The remaining seven threads cannot be used on the GTX 295, for this would mean crossing a warp boundary, creating serious thread synchronization problems.
As already mentioned, we have to aim for an implementation which is as branch-free as possible. In order to achieve that, we extended the standard tables and introduced some new ones. They are saved in constant memory of the GPU, for constant memory on the GPU is cached. Keep in mind for the following tables that constant values cannot be initialized as shown below. The values have to be copied from the host CPU to the GPU at kernel call time. It written here in this way for brevity.
We extend the round constants table by zeros. iota can the be executed by a thread without checking the index of A it is dealing with. Therefore, we avoid introducing a branch by using the following table.
The next table id the rho-offsets. Note that for each entry pair the respective sum is 64. Only the first entry of each pair is a rho-offset. The second part is used in the R64 macro. This way it can be written without the ternary operator.