MD5 Revisited
Years ago I implemented MD5 in C as a college assignment. It was fun to implement something complex yet very straightforward: you just go over the RFC and implement each step accurately. The only difficult part was debugging, because even a small mistake anywhere in the code produces a completely different result and gives no clue as to where the error is.
Years later I took another look at the algorithm to understand it better. The MD5 hash algorithm has three basic building blocks.
1. Encryption Function
An encryption function converts a fixed-size input to a fixed-size output such that, for a given key, the mapping from inputs to outputs is a permutation that looks random. The key alters the permutation the function generates. An encryption function is by definition reversible, i.e. there exists a decryption function that, given the output and the key, can produce the original input. In the case of MD5, the encryption function takes a 32-bit key and a 128-bit input and produces a 128-bit output. Following RFC 1321, one step computes a = b + ((a + f(b, c, d) + key + T) <<< S), where <<< is a 32-bit left rotation and the additions are modulo 2^32, and then rotates the four state words a, b, c, d by one position.
The encryption function is parameterized by f, T and S, and MD5 uses 64 different variations of it.
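A minimal Python sketch of such a step, assuming the RFC 1321 step formula. The name `encrypt` follows the text; `decrypt`, `rotl` and `MASK` are names chosen here for illustration. The `decrypt` function is included only to show that a step is reversible when the key is known.

```python
MASK = 0xFFFFFFFF  # keep every word within 32 bits


def rotl(x, s):
    """Rotate the 32-bit word x left by s bits (0 < s < 32)."""
    return ((x << s) | (x >> (32 - s))) & MASK


def encrypt(f, t, s):
    """Return one MD5 step: (32-bit key, 4-word state) -> new 4-word state."""
    def step(key, state):
        a, b, c, d = state
        total = (a + f(b, c, d) + key + t) & MASK
        new_b = (b + rotl(total, s)) & MASK
        return [d, new_b, b, c]  # the state words rotate by one position
    return step


def decrypt(f, t, s):
    """Return the inverse of encrypt(f, t, s) for the same key."""
    def step(key, state):
        d, new_b, b, c = state
        rotated = (new_b - b) & MASK
        total = rotl(rotated, 32 - s)  # rotating left by 32 - s undoes rotl by s
        a = (total - f(b, c, d) - key - t) & MASK
        return [a, b, c, d]
    return step
```

Only one word of the state changes per step, and the other three are carried along unchanged, which is what makes the inverse so easy to write.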
2. Compression Function
A compression function takes a fixed-size input and converts it to a smaller fixed-size output. It is meant to be one-way, i.e. it should be computationally infeasible to find an input that produces a given output. Compression functions are built from an encryption function by applying it again and again, keyed by successive pieces of a larger block.
In the case of MD5, the compression function takes 512 bits as input and generates 128 bits of output.
The 512-bit input is divided into sixteen 32-bit words. Each of these words is used as a key to the encryption function. This is repeated four times, in four phases, and each phase uses a different permutation of the sixteen words. Thus the phases differ in two ways: first, the permutation of the input, and second, the encryption function used. The four permutations, one for each phase, are concatenated, which results in sixty-four 32-bit words.
The permute function takes a function and generates a permutation of the data list according to it. The permutations function returns four permutations, one for each phase of the compression function.
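The two functions just described might look like this in Python. This is a sketch: the index formulas are the ones RFC 1321 uses for its four rounds, while the signatures and the name `INDEX_FNS` are assumptions made for illustration.

```python
# Index function for each phase, as given in RFC 1321.
INDEX_FNS = [
    lambda i: i,
    lambda i: (5 * i + 1) % 16,
    lambda i: (3 * i + 5) % 16,
    lambda i: (7 * i) % 16,
]


def permute(index_fn, data):
    """Reorder data so that position i holds data[index_fn(i)]."""
    return [data[index_fn(i)] for i in range(len(data))]


def permutations(words):
    """Concatenate the four per-phase permutations of 16 words into 64 words."""
    result = []
    for index_fn in INDEX_FNS:
        result.extend(permute(index_fn, words))
    return result
```

Because the multipliers 5, 3 and 7 are odd, each index function visits every one of the 16 positions exactly once, so each phase really is a permutation of the input words.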
Once the input data is ready, we need to generate 64 different encryption functions, 16 for each phase. For this we need values for f, T and S. The MD5 compression function uses four different values of f, one for each phase.
The four values of f are the auxiliary functions that RFC 1321 calls F, G, H and I; we also create a list of these functions. The listyFy function converts a function that takes three ints into a function that takes a list of ints.
The values of S are generated by a list of functions, one for each phase.
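As a sketch, the four functions and the list wrapper can be written in Python as follows. The bitwise definitions are those of RFC 1321; `listify` (standing in for listyFy), `F_LIST` and `MASK` are illustrative names, not the original code.

```python
MASK = 0xFFFFFFFF  # Python ints are unbounded, so results are masked to 32 bits


def F(x, y, z):
    return ((x & y) | (~x & z)) & MASK   # phase 1: "if x then y else z"


def G(x, y, z):
    return ((x & z) | (y & ~z)) & MASK   # phase 2: "if z then x else y"


def H(x, y, z):
    return (x ^ y ^ z) & MASK            # phase 3: bitwise parity


def I(x, y, z):
    return (y ^ (x | ~z)) & MASK         # phase 4


def listify(fn):
    """Turn a function of three ints into a function of one list of ints."""
    return lambda values: fn(*values)


F_LIST = [listify(fn) for fn in (F, G, H, I)]
```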
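One way to express this in Python, assuming each function maps a step index within its phase to a rotation amount. The shift values are the ones listed in RFC 1321; `SHIFTS`, `S_LIST` and `shift_for_step` are hypothetical names.

```python
# Rotation amounts per phase; each phase cycles through its four values.
SHIFTS = [
    [7, 12, 17, 22],
    [5, 9, 14, 20],
    [4, 11, 16, 23],
    [6, 10, 15, 21],
]


def make_s(table):
    """Build the S function for one phase from its four-entry table."""
    return lambda i: table[i % 4]


S_LIST = [make_s(table) for table in SHIFTS]


def shift_for_step(step):
    """Rotation amount for overall step 0..63."""
    return S_LIST[step // 16](step % 16)
```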
The 64 encryption functions, one for each step of the compression function, are generated from these values.
EList is a list of 64 encryption functions, created by passing the appropriate values of f, T and S to the encrypt function, which returns the encryption function corresponding to those values. We can now use this list of encryption functions and the permutations to define our compression function.
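A sketch of how such a list could be assembled in Python. The T constants follow RFC 1321, which defines the i-th constant as the integer part of 2^32 × |sin(i)| for i from 1 to 64. The function `build_elist` and its parameters (`encrypt`, `f_list`, `s_list`) are hypothetical: they stand for the encrypt function, the list of f values and the list of S functions described in the text, passed in so that the block stays self-contained.

```python
import math

MASK = 0xFFFFFFFF

# T[i] = floor(2**32 * |sin(i + 1)|) for steps i = 0..63 (RFC 1321).
T = [int(abs(math.sin(i + 1)) * 2**32) & MASK for i in range(64)]


def build_elist(encrypt, f_list, s_list):
    """Return 64 encryption functions, 16 per phase.

    encrypt(f, t, s) must return a step function; f_list holds one f per
    phase and s_list one S function per phase (index within phase -> shift).
    """
    return [
        encrypt(f_list[i // 16], T[i], s_list[i // 16](i % 16))
        for i in range(64)
    ]
```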
compressMD5 takes a state, a list of four 32-bit words, and a chunk, a list of 64 bytes. It converts the 64 bytes in the chunk to an array of sixteen 32-bit words using the asWorkds utility function. It then creates the permutations, which results in sixty-four 32-bit words (four permutations of sixteen words each). Each of these words is then applied as the key to the corresponding encryption function in EList. The state variable is used as the initialization vector (IV) for this process: it is encrypted, and the result is fed as data into the next encryption function. Finally, the IV and the resulting state are added word by word, modulo 2^32.
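Put together, the compression function can be sketched in Python as below. This is an illustration under the assumptions of RFC 1321 (little-endian words, the standard f, T and S values), not the original listing. It defines compact versions of every piece inline so that it runs on its own; `compress_md5`, `as_words`, `ELIST` and the other names are Python spellings chosen here.

```python
import math
import struct

MASK = 0xFFFFFFFF

F_LIST = [
    lambda x, y, z: (x & y) | (~x & z),
    lambda x, y, z: (x & z) | (y & ~z),
    lambda x, y, z: x ^ y ^ z,
    lambda x, y, z: y ^ (x | ~z),  # may be negative; masked inside encrypt
]
S_LIST = [[7, 12, 17, 22], [5, 9, 14, 20], [4, 11, 16, 23], [6, 10, 15, 21]]
INDEX_FNS = [
    lambda i: i,
    lambda i: (5 * i + 1) % 16,
    lambda i: (3 * i + 5) % 16,
    lambda i: (7 * i) % 16,
]


def rotl(x, s):
    return ((x << s) | (x >> (32 - s))) & MASK


def encrypt(f, t, s):
    """One MD5 step: (32-bit key, 4-word state) -> new 4-word state."""
    def step(key, state):
        a, b, c, d = state
        total = (a + f(b, c, d) + key + t) & MASK
        return [d, (b + rotl(total, s)) & MASK, b, c]
    return step


# 64 step functions, 16 per phase.
ELIST = [
    encrypt(F_LIST[i // 16],
            int(abs(math.sin(i + 1)) * 2**32) & MASK,
            S_LIST[i // 16][i % 4])
    for i in range(64)
]


def as_words(chunk):
    """Split 64 bytes into sixteen little-endian 32-bit words."""
    return list(struct.unpack("<16I", chunk))


def compress_md5(state, chunk):
    words = as_words(chunk)
    # Four permutations of the 16 words, concatenated into 64 keys.
    keys = [words[fn(i)] for fn in INDEX_FNS for i in range(16)]
    result = list(state)
    for step, key in zip(ELIST, keys):
        result = step(key, result)
    # Add the IV back in, word by word, modulo 2**32.
    return [(iv + r) & MASK for iv, r in zip(state, result)]
```

The final addition is what prevents the whole thing from being reversed: each step on its own can be undone when the key is known, but adding the IV back in means the output no longer determines the input to the chain of steps.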
3. Hash Function
This is the final function, which, given an input of arbitrary length, generates an output of fixed length. Hash functions commonly use the Merkle-Damgård construction, as MD5 does, to apply a compression function to arbitrarily large data.
This is the most straightforward step. It appends padding as required by the Merkle-Damgård construction and divides the resulting bytes into chunks of 64 bytes, which is the input size of our compression function.
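A short Python sketch of this step, assuming the padding rule and initial state from RFC 1321. The names `pad`, `chunks` and `md5` are illustrative, and the `compress` argument is a stand-in for a compression function taking `(state, chunk)` and returning the new four-word state.

```python
import struct

# Initial state (IV) from RFC 1321.
IV = [0x67452301, 0xEFCDAB89, 0x98BADCFE, 0x10325476]


def pad(message):
    """Append 0x80, zero bytes, and the 64-bit little-endian bit length."""
    bit_length = (8 * len(message)) & 0xFFFFFFFFFFFFFFFF
    zeros = (55 - len(message)) % 64  # total length becomes a multiple of 64
    return message + b"\x80" + bytes(zeros) + struct.pack("<Q", bit_length)


def chunks(data, size=64):
    """Split data into consecutive blocks of `size` bytes."""
    return [data[i:i + size] for i in range(0, len(data), size)]


def md5(message, compress):
    """Chain `compress` over the padded message, starting from IV."""
    state = list(IV)
    for chunk in chunks(pad(message)):
        state = compress(state, chunk)
    return struct.pack("<4I", *state)
```

Including the message length in the padding ensures that two messages of different lengths never pad to the same sequence of chunks.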