Joins

What
A join operation occurs between two tables with some join condition that relates the two tables.

Implementation

 * Simple Nested Loop Join

Simple Nested Loop Join
This approach requires that for every tuple in the outer relation, we loop through every tuple in the inner relation to check the join condition for each rtuple in R:    for each stuple in S:         if join_condition(rtuple, stuple): add  to result

Cost
As mentioned before, we scan the inner relation once for every tuple in the outer relation. Therefore, assuming that there are $$M$$ and $$N$$ pages in the outer and inner relations respectively, and that the outer relation has $$R$$ tuples per page, the total cost is:

$$M + (R * M) * N$$

Note: it is important to note that this method of implementation relies on which relation is the inner or outer relation. We save on computation time if the smaller relation is on the outside.

Page-Oriented Nested Loop Join
This simple refinement reduces the amount of I/Os by not scanning the inner relation for every tuple in the outer relation, but rather after every page. for each rpage in R:    for each spage in S:         for each rtuple in rpage: for each stuple in spage: if join_condition(rtuple, stuple): add  to result

Cost
The cost is lower than before. Again assuming that there are $$M$$ and $$N$$ pages in the outer and inner relations respectively, we see that the cost of the join is:

$$M + (M)*N$$

Note: This method is dramatically better than the naive nested loop join described above and highlights the importance of page oriented operations for minimizing disk I/O.

Block Nested Loops Join
The major problem with the last two approaches is that they do not effectively utilize buffer pages. In the best case scenario, the smaller relation will be able to fit in memory and there will be two buffers left over. In that case, we only have to read both relations once.

In the general case, when the smaller relation can't fit into memory completely, we split the relation into blocks that do fit into memory. If we have $$B$$ buffers, then we divide the relation in blocks of size $$B-2$$. for each rblock of B-2 pages of R:    for each spage of S:         for all matching tuples in spage and rblock: add  to result

Cost
Assume for this implementation that there are $$M$$ and $$N$$ pages in the outer and inner relations respectively. Then, the cost of the join would be:

$$M + \lceil \frac {M}{B-2} \rceil * N$$

Index Nested Loops Join
The big advantage that is gained when one of the relations is indexed is that we do not have to scan through the entire relation anymore. for each rtuple in R:    for each stuple in S where join_condition(rtuple, stuple): add  to result

Cost
Given that the outer relation has $$M$$ pages and $$R$$ tuples per page, we see that cost of performing the join is:

$$M + (M * R * \textrm{cost~of~retrieving~matching~tuples~of~inner~relation})$$

Cost of Retrieving Matching Tuples
Note: when the number of matching inner tuple for each outer tuple is small (on average), the cost of the inde nested loops join is likely to be much less than the cost of a simple nested loops join.
 * Typical I/O cost of B+ tree index is 2-4 I/Os. For hash index, the cost of finding the appropriate bucket is 1-2 I/Os.
 * If the index is clustered, then the cost per matching page of inner tuples is typically one more I/O. If it is not clustered, the cost could be one I/O per matching inner tuple (since each of these could be on a different page in the worst case)

Sort-Merge Join
The basic idea behind this algorithm is to sort both relations along the join attribute and to look for tuples that satisfy the join condition by merging the matching tuples.

The major advantages of this approach are:
 * When the tuples are already sorted along the join attribute
 * When the result is required to be sorted

Cost
Assuming that the first and second relation have $$M$$ and $$N$$ pages respectively, the total cost of running this algorithm will be:

$$\textrm{Sort}M + \textrm{Sort}N + M + N$$

Note: We could use external sorting to perform the initial sorting necessary

Refinement
There is a refinement of the sort-merge join algorithm if $$B > \sqrt L$$ where L is the number of pages in the larger relation. If this is the case, then we can apply a refinement by combining the sorting and merging stages.

The refinement advises that we generate sorted runs for the two relations (run pass 0 of external sort). Then, we perform the merge and join simultaneously because we have enough buffers in memory to do so. The total cost will be:

$$3(M + N)$$ (this cost is not including the final write)

Hash Join
Hash join uses two stages to accomplish the join. The initial partitioning phase, followed by the probing phase.



First Stage
In the first stage, we hash the two relations into partitions on disk using a hash function that partitions based on the join condition. The relation $$R$$ gets partitioned into $$R_i$$ and the relation $$S$$ gets partitioned into $$S_i$$. The diagram on the right demonstrates the first stage of the hash join. We will see that in the second stage, we will match $$R_i$$ to $$S_i$$ partitions

Second Stage
In the second stage, we use an in-memory hash table to perform the joining
 * 1) Read in partition $$R_i$$ and hash it into the in memory hash table
 * 2) Scan partition $$S_i$$ and probe the in memory hash table for matches. Matches go to the output buffer

Cost
The cost of hash join can be evaluated in two stages Therefore, the total cost of hash join is $$3(M+N)$$ assuming no partition overflows
 * Cost of first stage: $$2(M + N)$$ because we have to read both relations and write both partitions to disk
 * Cost of second stage: $$M+N$$ because we scan each of the generated partitions