Cube Operation

Basic CUBE

Consider the following case:

We have user event data with 4 dimensions:

  1. A/B test bucket (prod/test)
  2. Client type (web/mobile)
  3. Module (order/report)
  4. Event (click/view)

A few sample rows:

test    mobile    order_module    click
prod    web    order_module    view
prod    mobile    order_module    click

A reporting system might need metrics for different combinations of these dimensions, such as:

  1. What is the total click count from mobile users?
  2. What is the total click count for each test bucket?
  3. What is the total number of web page views for order_module in the prod bucket?

As you can see, there are many combinations. If we store only the finest-grained metrics and roll them up each time a query arrives, every query pays the aggregation cost again. One solution is to pre-compute ALL combinations.
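To get a feel for how many combinations "all" means, the following Python sketch (not Pig) lists every subset of the four dimensions a report could group by. The dimension names are illustrative labels taken from the list above, not identifiers from any Pig script.

```python
from itertools import combinations

# Illustrative labels for the four dimensions described above.
dimensions = ("bucket", "client", "module", "event")

# Every subset of the dimensions is one grouping a report might ask for,
# from the empty subset (grand total) up to all four dimensions.
groupings = [
    subset
    for size in range(len(dimensions) + 1)
    for subset in combinations(dimensions, size)
]

print(len(groupings))  # 16, i.e. 2**4
```

For instance, the first question above corresponds to the grouping ("client", "event"). Each additional dimension doubles the number of groupings.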

Here's how we could do that with Pig's CUBE operation.

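-- Load the raw events; the four fields hold the bucket, client type, module, and event.
example = LOAD './cube.example' AS (product:chararray, client:chararray, module:chararray, action:chararray);

-- Group the rows by every combination of the four dimensions.
cubed_data = CUBE example BY CUBE(product, client, module, action);

-- $0 is the group key and $1 is the bag of rows in that group; count the rows.
final_data = FOREACH cubed_data GENERATE $0, COUNT_STAR($1);

dump final_data;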

The dump lists every combination of dimension values together with its total count. A dimension that has been rolled up appears as null in the group key. With these statistics, each of the questions above can be answered by a direct lookup, with no further aggregation.
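As a rough illustration of what is being pre-computed, here is a plain Python sketch (not Pig, and not how Pig implements CUBE) that counts the three sample rows for every combination and then answers the three questions by lookup. The helper `cube_counts` and the use of `None` for a rolled-up dimension are assumptions made for this sketch.

```python
from collections import Counter
from itertools import product

# The three sample rows: (bucket, client, module, event).
rows = [
    ("test", "mobile", "order_module", "click"),
    ("prod", "web", "order_module", "view"),
    ("prod", "mobile", "order_module", "click"),
]

def cube_counts(rows):
    """Count rows for every mix of kept and rolled-up dimensions."""
    counts = Counter()
    for row in rows:
        # Each dimension is either kept or rolled up to None, so a
        # 4-column row contributes to 2**4 = 16 group keys.
        for mask in product((True, False), repeat=len(row)):
            key = tuple(value if keep else None for value, keep in zip(row, mask))
            counts[key] += 1
    return counts

counts = cube_counts(rows)

# 1. Total clicks from mobile users.
mobile_clicks = counts[(None, "mobile", None, "click")]
# 2. Total clicks per test bucket.
clicks_by_bucket = {b: counts[(b, None, None, "click")] for b in ("prod", "test")}
# 3. Web page views for order_module in the prod bucket.
prod_web_order_views = counts[("prod", "web", "order_module", "view")]

print(mobile_clicks, clicks_by_bucket, prod_web_order_views)
# 2 {'prod': 1, 'test': 1} 1
```

Each answer is a single dictionary lookup, which is the point of pre-computing the cube: the aggregation work is done once, up front.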

