Cs186 Wiki
Advertisement

General Info[]

What[]

A join operation occurs between two tables with some join condition that relates the two tables.

Implementation[]

  • Simple Nested Loop Join

Simple Nested Loop Join[]

This approach requires that for every tuple in the outer relation, we loop through every tuple in the inner relation to check the join condition

for each rtuple in R:
    for each stuple in S:
        if join_condition(rtuple, stuple):
            add <rtuple, stuple> to result

Cost[]

As mentioned before, we scan the inner relation once for every tuple in the outer relation. Therefore, assuming that there are and pages in the outer and inner relations respectively, and that the outer relation has tuples per page, the total cost is:

Note: it is important to note that this method of implementation relies on which relation is the inner or outer relation. We save on computation time if the smaller relation is on the outside.

Page-Oriented Nested Loop Join[]

This simple refinement reduces the amount of I/Os by not scanning the inner relation for every tuple in the outer relation, but rather after every page.

for each rpage in R:
    for each spage in S:
        for each rtuple in rpage:
            for each stuple in spage:
                if join_condition(rtuple, stuple):
                    add <rtuple, stuple> to result

Cost[]

The cost is lower than before. Again assuming that there are and pages in the outer and inner relations respectively, we see that the cost of the join is:

Note: This method is dramatically better than the naive nested loop join described above and highlights the importance of page oriented operations for minimizing disk I/O.

Block Nested Loops Join[]

Block nested loops join

The major problem with the last two approaches is that they do not effectively utilize buffer pages. In the best case scenario, the smaller relation will be able to fit in memory and there will be two buffers left over. In that case, we only have to read both relations once.

In the general case, when the smaller relation can't fit into memory completely, we split the relation into blocks that do fit into memory. If we have buffers, then we divide the relation in blocks of size .

for each rblock of B-2 pages of R:
    for each spage of S:
        for all matching tuples in spage and rblock:
            add <rtuple, stuple> to result

Cost[]

Assume for this implementation that there are and pages in the outer and inner relations respectively. Then, the cost of the join would be:

Index Nested Loops Join[]

The big advantage that is gained when one of the relations is indexed is that we do not have to scan through the entire relation anymore.

for each rtuple in R:
    for each stuple in S where join_condition(rtuple, stuple):
        add <rtuple, stuple> to result

Cost[]

Given that the outer relation has pages and tuples per page, we see that cost of performing the join is:

Cost of Retrieving Matching Tuples[]

  • Typical I/O cost of B+ tree index is 2-4 I/Os. For hash index, the cost of finding the appropriate bucket is 1-2 I/Os.
  • If the index is clustered, then the cost per matching page of inner tuples is typically one more I/O. If it is not clustered, the cost could be one I/O per matching inner tuple (since each of these could be on a different page in the worst case)

Note: when the number of matching inner tuple for each outer tuple is small (on average), the cost of the inde nested loops join is likely to be much less than the cost of a simple nested loops join.

Sort-Merge Join[]

The basic idea behind this algorithm is to sort both relations along the join attribute and to look for tuples that satisfy the join condition by merging the matching tuples.

The major advantages of this approach are:

  • When the tuples are already sorted along the join attribute
  • When the result is required to be sorted

Cost[]

Assuming that the first and second relation have and pages respectively, the total cost of running this algorithm will be:

Note: We could use external sorting to perform the initial sorting necessary

Refinement[]

There is a refinement of the sort-merge join algorithm if where L is the number of pages in the larger relation. If this is the case, then we can apply a refinement by combining the sorting and merging stages.

The refinement advises that we generate sorted runs for the two relations (run pass 0 of external sort). Then, we perform the merge and join simultaneously because we have enough buffers in memory to do so. The total cost will be:

(this cost is not including the final write)

Hash Join[]

Hash join uses two stages to accomplish the join. The initial partitioning phase, followed by the probing phase.

Hash join new

First Stage[]

In the first stage, we hash the two relations into partitions on disk using a hash function that partitions based on the join condition. The relation gets partitioned into and the relation gets partitioned into . The diagram on the right demonstrates the first stage of the hash join. We will see that in the second stage, we will match to partitions




Second Stage[]

Hash join probing

In the second stage, we use an in-memory hash table to perform the joining

  1. Read in partition and hash it into the in memory hash table
  2. Scan partition and probe the in memory hash table for matches. Matches go to the output buffer



Cost[]

The cost of hash join can be evaluated in two stages

  • Cost of first stage: because we have to read both relations and write both partitions to disk
  • Cost of second stage: because we scan each of the generated partitions

Therefore, the total cost of hash join is assuming no partition overflows

Advertisement