External Hashing

The Purpose of External Hashing
External hashing is useful when the data we want to hash does not fit in the available main memory (RAM).

Using External Hashing Over External Sorting
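Sometimes we don't require sorted order from our result, only that matching records end up together. In those cases hashing can be used instead of sorting, for example when: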
 * 1) Removing duplicates
 * 2) Forming groups

General Steps in External Hashing

 * 1) Divide stage. Use a hash function $$h_{B-1}$$ to hash the stream of incoming data into B - 1 output buffers, where B is the number of buffer pages available in memory. Each output buffer is flushed to its own partition on disk.
 * 2) ReHash/Conquer stage. Read each of the B - 1 partitions created in the first stage, one at a time, into a RAM hash table using a second hash function $$h_r$$. Then write the completed hash table out to disk.
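The two stages above can be sketched in a few lines of Python. This is only an illustrative sketch under simplifying assumptions: plain lists stand in for the on-disk partitions, Python's built-in `hash` stands in for $$h_{B-1}$$, and Python's dictionary hashing stands in for the in-memory hash function. The function name `external_hash_group` and its parameters are hypothetical.

```python
from collections import defaultdict


def external_hash_group(records, num_buffers, key=lambda r: r):
    """Group records by key using a divide stage and a conquer stage.

    `num_buffers` plays the role of B. Lists stand in for disk partitions,
    so this only illustrates the data flow, not real I/O.
    """
    num_partitions = num_buffers - 1

    # Divide stage: a coarse hash routes each record to one of B - 1 partitions.
    partitions = [[] for _ in range(num_partitions)]
    for record in records:
        partitions[hash(key(record)) % num_partitions].append(record)

    # Conquer stage: build an in-memory hash table for one partition at a time,
    # then emit the finished table before moving on to the next partition.
    output = []
    for partition in partitions:
        table = defaultdict(list)  # dict hashing stands in for the second hash function
        for record in partition:
            table[key(record)].append(record)
        output.extend(table.items())
    return output
```

Because every record with a given key lands in the same partition, each key appears in exactly one in-memory table, which is what makes duplicate removal and grouping possible without sorting.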

Divide Stage
As the image to the right portrays, the hash function splits the input into up to B - 1 partitions on disk. How evenly the data is spread across those partitions depends on the effectiveness of the hash function.

Conquer Stage
As the image to the right portrays, the conquer stage reads each partition generated in the divide stage into main memory and builds a hash table on it using a hash function $$h_r$$.

Cost of External Hashing
The cost of external hashing is $$4N$$ I/Os, where N is the number of pages of data, assuming the hash function divides the data evenly. The divide stage reads and writes all the data once, costing 2N. The conquer stage then reads all of the data again and writes it back to disk, costing another 2N.
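As a quick check of the arithmetic, the cost can be written as a small helper. The function name `two_pass_hash_cost` is hypothetical, and the sketch assumes the data is measured in pages and that each page read or write counts as one I/O.

```python
def two_pass_hash_cost(num_pages):
    """Return the I/O count for two-pass external hashing of `num_pages` pages."""
    divide_cost = 2 * num_pages   # read every page once, write every page once
    conquer_cost = 2 * num_pages  # read every partition page, write the result
    return divide_cost + conquer_cost
```

For example, under these assumptions a 100-page input costs `two_pass_hash_cost(100)`, which is 400 I/Os.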

Biggest Table that can be Hashed in 2 Passes
The biggest table that can be hashed in 2 passes is B(B - 1) pages. The first stage creates B - 1 partitions, and each partition can be no bigger than B pages in order to fit into main memory in the conquer stage.
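The bound can be expressed as a small check. The helper names below are hypothetical, and the sketch assumes the hash function spreads the data evenly across the B - 1 partitions.

```python
def max_two_pass_pages(num_buffers):
    """Largest input, in pages, that two passes can handle with B buffers."""
    # B - 1 partitions, each at most B pages so it fits in memory.
    return num_buffers * (num_buffers - 1)


def fits_in_two_passes(num_pages, num_buffers):
    """Return True if an evenly partitioned input of `num_pages` fits the bound."""
    return num_pages <= max_two_pass_pages(num_buffers)
```

With B = 100 buffer pages, for instance, the bound is 100 * 99 = 9900 pages, so a 10000-page table would not qualify.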

Solution for Data that cannot be Hashed in 2 Passes
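Recursive partitioning can be used: any partition that is still too large to fit in main memory is partitioned again with a different hash function, and this repeats until every partition fits.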
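A minimal sketch of recursive partitioning follows, under several assumptions that are not from the text above: lists stand in for disk partitions, `capacity` counts records rather than pages, and salting Python's built-in `hash` with the recursion depth stands in for choosing a different hash function at each level. The function name, the `max_depth` cutoff, and the handling of a partition that holds a single repeated key are all illustrative choices.

```python
def recursive_partition(records, num_buffers, capacity,
                        key=lambda r: r, depth=0, max_depth=32):
    """Split `records` until each partition holds at most `capacity` records."""
    if num_buffers < 3:
        raise ValueError("need at least 2 partitions (B >= 3) to make progress")
    if len(records) <= capacity:
        return [records]  # small enough for the conquer stage
    if len({key(r) for r in records}) == 1:
        return [records]  # one repeated key: no hash function can split it
    if depth >= max_depth:
        return [records]  # give up; a real system would need another strategy

    num_partitions = num_buffers - 1
    partitions = [[] for _ in range(num_partitions)]
    for record in records:
        # Salting with `depth` acts as a different hash function per level.
        index = hash((key(record), depth)) % num_partitions
        partitions[index].append(record)

    result = []
    for partition in partitions:
        if partition:
            result.extend(recursive_partition(
                partition, num_buffers, capacity, key, depth + 1, max_depth))
    return result
```

Using a different hash function at each level matters: reusing the same one would send every record in an oversized partition to the same place again, so the partition would never shrink.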