Aggregate Operations

What
Aggregate operations are ones that provide information about a combined set of data. For example, MIN, AVG, SUM, etc.

Not Using Group By
When we are not using the group by clause, the only way to perform the aggregate operator is to scan the entire relation

When we have an index, we can make the process faster by performing an index only scan

Using Group By
When we are using group by, we can:
 * Sort on the group by attributes and then scan to compute aggregates for each group
 * Hash on group by attributes, and use the hash buckets to compute aggregates
 * Given a tree based index and that all of the SELECT, WHERE, and GROUP BY attributes are in the prefix of the index, we can perform a index only scan instead of sorting first.

Sort Group By: Naive Approach
The aggregate node keeps some state variables/running info: When the aggregate node sees a tuple from a new group:
 * For COUNT, seen-so-far
 * For SUM, sum-so-far
 * For AVG, seen-so-far and sum-so-far
 * 1) Produces an output for the old group based on the aggregate function
 * 2) Reset the running info
 * 3) Update the running info with the new tuple's information

Hash Group By: Naive Approach
This approach is very similar to the sort by approach and the only difference is that we swap out the sort node for a hash node. The aggregate node works in the exact same way

HashAgg
This process is an improvement on the normal hashing process because, in the rehash stage, instead of storing tuples, we store pairs of .
 * If the tuple's groupVal already exists in the hash table, we simply update the transVals
 * Otherwise, we add the new  pair to the hash table

Advantages

 * In the rehash stage, we only have to store as many pairs as the number of groupVals.

Hybrid Hashing
Hybrid hashing is a process that is even better than the HashAgg process described above because it seeks to remove the wastefulness of the first hash phase. We end up writing the tuples out to disk and then reading them back in for the rehash phase.

The solution is to maintain some buffers in memory hash table in the first phase to store the . In the best case, all of the tuples can be hashed into this table. In the worst case, we end up with the HashAgg process. No matter what, this method of hashing is the same as the HashAgg process at least, but most of the time, it is better!