External Sorting

The Purpose of External Sorting
External sorting algorithms are needed when there is more data to sort than fits into RAM.

Steps in External Sorting
The general idea is to start with a preprocessing stage (Pass 0): load as much data as fits into memory, sort it with a general in-memory sorting algorithm (such as quicksort), and write the sorted result back to disk as a run.

Then, we run a merge sort on the runs produced by Pass 0. Assuming an ascending sort, the merge is done by repeatedly taking the smallest element across the B - 1 input buffers.

Pass 0
In Pass 0, we assume there are N pages of data to sort and B buffer pages of RAM available, with N > B. The condition N > B arises because if N ≤ B, all of the data fits in memory and we don't need an external sorting algorithm to sort it.

In Pass 0, we create sorted runs: B pages at a time are read into memory, sorted, and written out as one run. Each run is stored consecutively on disk.
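Pass 0 can be sketched in a few lines. This is a minimal illustration, not a real disk-based implementation: the name `make_sorted_runs` is hypothetical, a page is modeled as a Python list of records, the "disk" is just a list of pages, and the built-in `sorted` stands in for the in-memory sort.

```python
def make_sorted_runs(pages, n_buffers):
    """Pass 0 sketch: sort n_buffers pages at a time and emit one run per batch.

    pages     -- list of pages, each page a list of records (assumed model)
    n_buffers -- B, the number of buffer pages available in RAM
    """
    runs = []
    for start in range(0, len(pages), n_buffers):
        # Load up to B pages into memory.
        batch = pages[start:start + n_buffers]
        # Sort every record in the batch; the result is one sorted run.
        run = sorted(record for page in batch for record in page)
        runs.append(run)
    return runs


pages = [[9, 4], [7, 1], [8, 2], [6, 3], [5, 0]]  # N = 5 pages of 2 records
runs = make_sorted_runs(pages, n_buffers=2)        # B = 2
print(runs)
```

With N = 5 and B = 2, the sketch produces three runs, the last of which is shorter because only one page was left over.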

The Number of Runs Generated in Pass 0
$$R = \lceil N/B \rceil$$

Here R is the number of runs produced by Pass 0. The equation says that the number of runs must be the ceiling of N / B because we cannot have a partial run: leftover pages still form a run of their own. Therefore, 2.1 runs counts as 3 runs.

The Length of Each Generated Run
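The length of each run is B pages, except possibly the last run, which holds the remaining N mod B pages when B does not divide N evenly.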
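As a quick check of the run count and run lengths, the following sketch lists the length (in pages) of every run that Pass 0 would produce. The function name `run_lengths` and the sample values N = 108, B = 5 are chosen only for illustration.

```python
def run_lengths(n_pages, n_buffers):
    """Lengths, in pages, of the runs Pass 0 produces for N pages and B buffers."""
    full_runs, leftover = divmod(n_pages, n_buffers)
    lengths = [n_buffers] * full_runs
    if leftover:
        # The final run holds whatever pages are left over.
        lengths.append(leftover)
    return lengths


lengths = run_lengths(108, 5)
print(len(lengths), lengths[-1])
```

For N = 108 and B = 5 this gives 21 full runs of 5 pages plus one final run of 3 pages, so 22 runs in total, matching the ceiling of 108 / 5.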

Pass 1, 2, 3, ...
In subsequent passes, we merge the sorted runs into longer sorted runs. In these passes, however, we must use 1 buffer page as the output buffer, which leaves only B - 1 pages to use as input buffers.

High Level Summary

 * 1) The smallest element across input buffers 1 to B-1 is removed from its input buffer and appended to the output buffer (assuming an ascending sort)
 * 2) When the output buffer fills, the page is written out to disk and the output buffer is emptied
 * 3) Each input buffer is connected to one sorted run. Therefore, when an input buffer is emptied, the next page from its sorted run fills its place
 * 4) When all of the input buffers have no more pages to pull from the sorted runs they are connected to, a new set of B-1 runs is selected and merged. This repeats, pass after pass, until there is 1 run of length N
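The merge steps above can be sketched with a min-heap. This is a simplified illustration under stated assumptions: the sort is ascending, so the smallest remaining record is taken at each step; runs are plain Python lists of records; and page-level buffering and disk I/O are left out. The names `merge_runs` and `merge_pass` are hypothetical.

```python
import heapq

_EXHAUSTED = object()  # sentinel marking a run with no records left


def merge_runs(runs):
    """Merge already-sorted runs into one sorted run (ascending)."""
    iterators = [iter(run) for run in runs]
    heap = []
    for idx, it in enumerate(iterators):
        first = next(it, _EXHAUSTED)
        if first is not _EXHAUSTED:
            heap.append((first, idx))
    heapq.heapify(heap)

    output = []
    while heap:
        # Take the smallest record currently at the front of any run.
        value, idx = heapq.heappop(heap)
        output.append(value)
        # Refill from the run that record came from.
        nxt = next(iterators[idx], _EXHAUSTED)
        if nxt is not _EXHAUSTED:
            heapq.heappush(heap, (nxt, idx))
    return output


def merge_pass(runs, n_buffers):
    """One merge pass: merge the runs in groups of B - 1 (requires B >= 3)."""
    fan_in = n_buffers - 1
    return [merge_runs(runs[i:i + fan_in]) for i in range(0, len(runs), fan_in)]


runs = [[1, 4, 7, 9], [2, 3, 6, 8], [0, 5]]
after_pass_1 = merge_pass(runs, n_buffers=3)          # merges 2 runs at a time
after_pass_2 = merge_pass(after_pass_1, n_buffers=3)  # one run remains
print(after_pass_2)
```

Python's standard library also provides `heapq.merge`, which performs the same kind of k-way merge lazily; the explicit loop is used here only to mirror the steps in the list.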

The Number of Passes Needed to Sort N Pages
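The number of passes needed to sort N pages is $$P = \lceil \log_{B-1} \lceil N/B \rceil \rceil + 1$$ The + 1 accounts for Pass 0. The logarithm counts the merge passes, since each merge pass reduces the number of runs by a factor of B - 1.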

The Number of Disk I/Os Needed to Sort N Pages
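The number of disk I/Os needed to sort N pages in this manner is $$I = 2N \times P$$ The factor of 2 arises because every pass reads each of the N pages once and writes each of them once.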
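The pass count and I/O cost can be computed by simulating how the number of runs shrinks, which avoids floating-point logarithms. This is a sketch: the function names and the sample values N = 108, B = 5 are assumptions made for illustration.

```python
def num_passes(n_pages, n_buffers):
    """Total passes (Pass 0 plus merge passes) to sort N pages with B buffers."""
    if n_buffers < 3:
        raise ValueError("need at least 3 buffer pages to merge 2 or more runs")
    runs = -(-n_pages // n_buffers)  # ceil(N / B) runs after Pass 0
    passes = 1                       # Pass 0
    while runs > 1:
        # Each merge pass combines up to B - 1 runs into one.
        runs = -(-runs // (n_buffers - 1))
        passes += 1
    return passes


def total_ios(n_pages, n_buffers):
    """Disk I/Os: every pass reads and writes each of the N pages once."""
    return 2 * n_pages * num_passes(n_pages, n_buffers)


passes = num_passes(108, 5)
ios = total_ios(108, 5)
print(passes, ios)
```

For N = 108 and B = 5, Pass 0 leaves 22 runs, and the merge passes reduce them to 6, then 2, then 1, for 4 passes and 2 × 108 × 4 = 864 I/Os.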

The Maximum Number of Pages that can be Sorted in 2 Passes
The maximum number of pages that can be sorted in 2 passes is B(B - 1). This is because Pass 0 produces runs of length B, and the single merge pass that follows has B - 1 input buffers, so it can merge at most B - 1 of those runs.
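
As a worked example with an illustrative value of B = 5: $$B(B - 1) = 5 \times 4 = 20$$ With N = 20 pages, Pass 0 produces $$\lceil 20/5 \rceil = 4$$ runs, which one merge pass with 4 input buffers can combine. With N = 21 pages, Pass 0 produces $$\lceil 21/5 \rceil = 5$$ runs, which is one more than a single merge pass can handle, so a third pass is required.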