rPref allows an efficient computation of Pareto frontiers (also known as *Skylines* in the context of databases) and slight generalizations (*database preferences*). This vignette will explain how to compose Skyline queries / preferences and evaluate them on a data set (i.e., the preference selection).

A classical Skyline query optimizes two dimensions of a data set simultaneously. Usually these dimensions tend to anticorrelate. Consider the `mtcars`

data set and the dimensions `mpg`

(miles per gallon, i.e., inverse fuel consumption) and `hp`

(horsepower). To get those cars with a low fuel consumption (i.e., high mpg value) and high power we create the preference and evaluate it on `mtcars`

. Using `select`

from the dplyr package we restrict our attention to the relevant columns:

```
p <- high(mpg) * high(hp)
res <- psel(mtcars, p)
knitr::kable(select(res, mpg, hp))
```

mpg | hp | |
---|---|---|

Merc 450SL | 17.3 | 180 |

Fiat 128 | 32.4 | 66 |

Toyota Corolla | 33.9 | 65 |

Lotus Europa | 30.4 | 113 |

Ford Pantera L | 15.8 | 264 |

Ferrari Dino | 19.7 | 175 |

Maserati Bora | 15.0 | 335 |

The `*`

operator is the Pareto composition. The result contains all cars from `mtcars`

which are not Pareto-dominated according to this preference. This means, we are not interested in those cars, which are strictly worse in at least one dimension and worse/equal in the other dimension (i.e., they are *dominated*).

We can add a third optimization goal like minimizing the 1/4 mile time of a car. Additionally to the preference selection via `psel`

, preference objects can be associated with data sets and then processed via `peval`

(preference evaluation). For example

```
p <- high(mpg, df = mtcars) * high(hp) * low(qsec)
p
## [Preference] high(mpg) * high(hp) * low(qsec)
## * associated data source: data.frame "mtcars" [32 x 11]
```

creates a 3-dimensional Pareto preference which is associated with `mtcars`

. We can evaluate this using `peval(p)`

which returns the Pareto optima:

```
res <- peval(p)
knitr::kable(select(res, mpg, hp, qsec))
```

mpg | hp | qsec | |
---|---|---|---|

Mazda RX4 | 21.0 | 110 | 16.46 |

Merc 450SE | 16.4 | 180 | 17.40 |

Merc 450SL | 17.3 | 180 | 17.60 |

Fiat 128 | 32.4 | 66 | 19.47 |

Toyota Corolla | 33.9 | 65 | 19.90 |

Porsche 914-2 | 26.0 | 91 | 16.70 |

Lotus Europa | 30.4 | 113 | 16.90 |

Ford Pantera L | 15.8 | 264 | 14.50 |

Ferrari Dino | 19.7 | 175 | 15.50 |

Maserati Bora | 15.0 | 335 | 14.60 |

Using `psel`

instead of `peval`

we can evaluate the preference on another data set (which does not change the association of `p`

). Using dplyr we can first pick all cars with automatic transmission (`am == 0`

) and then get the Pareto optima:

```
res <- mtcars %>% filter(am == 0) %>% psel(p)
knitr::kable(select(res, am, mpg, hp, qsec))
```

am | mpg | hp | qsec | |
---|---|---|---|---|

1 | 0 | 21.4 | 110 | 19.44 |

2 | 0 | 18.7 | 175 | 17.02 |

4 | 0 | 14.3 | 245 | 15.84 |

5 | 0 | 24.4 | 62 | 20.00 |

6 | 0 | 22.8 | 95 | 22.90 |

9 | 0 | 16.4 | 180 | 17.40 |

10 | 0 | 17.3 | 180 | 17.60 |

14 | 0 | 14.7 | 230 | 17.42 |

15 | 0 | 21.5 | 97 | 20.01 |

16 | 0 | 15.5 | 150 | 16.87 |

18 | 0 | 13.3 | 245 | 15.41 |

19 | 0 | 19.2 | 175 | 17.05 |

Database preferences allow some generalizations of Skyline queries like combining the Pareto order with the lexicographical order. Assume we prefer cars with manual transmission (`am == 0`

). If two cars are equivalent according to this criterion, then the higher number of gears should be the decisive criterion. This is known as the lexicographical order and can be realized with

`p <- true(am == 1) & high(gear)`

where `true`

is a Boolean preference, where those tuples are preferred fulfilling the logical condition. The `&`

operator is the non-commutative “Prioritization” creating a lexicographical order.

The constructs `high`

, `low`

and `true`

are the three base preferences. They also accept arbitrary arithmetic expressions (and accordingly logical, for `true`

). For example, we can Pareto-combine `p`

with a wish for an high power per cylinder ratio:

```
p <- p * high(hp/cyl)
res <- psel(mtcars, p)
knitr::kable(select(res, am, gear, hp, cyl))
```

am | gear | hp | cyl | |
---|---|---|---|---|

Maserati Bora | 1 | 5 | 335 | 8 |

Here the two goals of the lexicographical order, as defined above, and the high `hp/cyl`

ratio are simultaneously optimized.

In the above preference selection we just have one Pareto-optimal tuple for the data set `mtcars`

. Probably we are also interested in the tuples slightly worse than the optimum. rPref offers a top-k preference selection, iterating the preference selection on the remainder on the data set until k tuples are returned. To get the 3 best tuples we use:

```
res <- psel(mtcars, p, top = 3)
knitr::kable(select(res, am, gear, hp, cyl, .level))
```

am | gear | hp | cyl | .level | |
---|---|---|---|---|---|

Maserati Bora | 1 | 5 | 335 | 8 | 1 |

Ford Pantera L | 1 | 5 | 264 | 8 | 2 |

Duster 360 | 0 | 3 | 245 | 8 | 3 |

The column `.level`

is additionally added to the result when `psel`

is called with the `top`

parameter. It counts the number of iterations needed to get this tuple. The k-th level of a Skyline is also called *the k-th stratum*. We see that the first three tuples have levels {1, 2, 3}. The top-k parameter produces a nondeterministic cut, i.e., there could be more tuples in the third level. To avoid the cut, we use the `at_least`

parameter, returning all tuples from the last level:

```
res <- psel(mtcars, p, at_least = 3)
knitr::kable(select(res, am, gear, hp, cyl, .level))
```

am | gear | hp | cyl | .level | |
---|---|---|---|---|---|

Maserati Bora | 1 | 5 | 335 | 8 | 1 |

Ford Pantera L | 1 | 5 | 264 | 8 | 2 |

Duster 360 | 0 | 3 | 245 | 8 | 3 |

Camaro Z28 | 0 | 3 | 245 | 8 | 3 |

Ferrari Dino | 1 | 5 | 175 | 6 | 3 |

Additionally there is a `top_level`

parameter which allows to explicitly state the number of iterations. The preference selection with `top_level = 3`

is identical to the statement above in this case, because just one tuple resides in each of the levels 1 and 2.

Using the grouping functionality from the dplyr package, we can perform a preference selection on each group separately. For example, we search for the cars maximizing `mpg`

and `hp`

in each group of cars with the same number of cylinders. This is done by:

```
grouped_df <- group_by(mtcars, cyl)
res <- psel(grouped_df, low(hp) * low(mpg))
knitr::kable(select(res, cyl, hp, mpg))
```

cyl | hp | mpg |
---|---|---|

4 | 93 | 22.8 |

4 | 62 | 24.4 |

4 | 52 | 30.4 |

4 | 97 | 21.5 |

4 | 109 | 21.4 |

6 | 105 | 18.1 |

6 | 123 | 17.8 |

8 | 205 | 10.4 |

8 | 150 | 15.2 |

The first line is the grouping operation from dplyr and the second line is the preference selection from rPref, which respects the grouping. The result is again a grouped data frame, containing the Pareto optima for each group of cylinders.