Haura

This book serves as a Getting Started and Developer Guide to the Haura research storage stack. It gives you a quick overview of how Haura is used and of its basic structure. If you are interested in any of the specific parts and functions of the project, navigate in the "Structure" menu on the left to your desired section.

Haura works differently from file systems such as ext4 or btrfs, which run in the kernel. Haura runs entirely in userspace and is not designed to be permanently active, nor does it provide the usual interfaces such as POSIX. Instead, it offers a key-value and an object interface, as is common for databases and mass storage applications.

Below we lay out two scenarios which show how Haura is commonly used in conjunction with arbitrary client code.

Scenario 1 - Using Haura directly
Diagram: Client 1 connects to Haura, which runs inside the client's own process and accesses the system resources.

The simplest way to use Haura, as you will also see in the examples, is to initialize Haura directly with your own configuration. Haura then runs in your own process context, meaning that Haura stops together with the termination of your program.
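
A minimal sketch of this scenario could look like the following. Note that the names used here (DatabaseConfiguration, Database::build, open_or_create_dataset, insert, sync) are assumptions based on the crate's examples and may differ slightly in your version; consult the examples in betree/ for authoritative usage.

// Hedged sketch of Scenario 1: Haura is initialized directly inside the
// client process and stops when the process terminates. Type and method
// names are assumptions based on the crate's examples.
use betree_storage_stack::{Database, DatabaseConfiguration};

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Your own configuration: storage tiers, cache size, access mode, ...
    let config = DatabaseConfiguration::default();

    // Haura now runs in this process' context.
    let mut db = Database::build(config)?;

    let mut ds = db.open_or_create_dataset(b"example")?;
    ds.insert(b"key".to_vec(), b"value")?;

    // Persist all pending changes before the program (and with it Haura) ends.
    db.sync()?;
    Ok(())
}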

Scenario 2 - Using a Wrapper
Diagram: Clients 1 to 3 (each a separate client process) connect to JULEA; JULEA and Haura run together in a separate JULEA process, and Haura accesses the system resources.

Using a wrapper around the functionality, such as JULEA, allows for more flexible usage of the storage stack: the JULEA process owns and runs the Haura instance, so the storage stack can outlive an individual application process. Additionally, this allows more than one client to be connected to an active Haura instance.

Building

Building the complete project, including all contained crates, requires some configuration beforehand.

Install Dependencies

First, install the Rust compiler to actually compile the project. Use your package manager or set it up in your local user home with rustup.

# For rpm based distributions (Fedora, CentOS, Rocky, ...)
$ sudo dnf install cargo

# For apt based distributions (Debian, Ubuntu, ...)
$ sudo apt update
$ sudo apt install cargo

# For Arch Linux and its derivatives
$ sudo pacman -Sy cargo

You'll need at least version 1.61. Most package managers should provide this or a more recent version; rustup will always deliver the most up-to-date version.

Then install the required system dependencies for bindgen and the JULEA bindings.

Fedora/RHEL

$ sudo dnf install glib2 glib2-devel libbson libbson-devel clang make pkgconf libpmem libpmem-devel

Ubuntu/Debian/...

$ sudo apt update
$ sudo apt install libglib2.0-0 libglib2.0-dev libbson-1.0-0 libbson-dev clang make pkg-config libpmem1 libpmem-dev

Arch Linux+

Arch Linux does not package libpmem; if you want to use it, you may try the AUR. Otherwise, compile without the nvm feature.

$ sudo pacman -Sy glib2 clang make libbson pkgconf

Prepare the environment

This step can be skipped if you do not need to use the JULEA interface. See betree.

To compile the bindings you'll need JULEA present and need to specify its headers in your environment.

$ git clone https://github.com/parcio/julea.git

To build the complete Haura project from this state, execute:

$ export JULEA_INCLUDE=$PWD/julea/include
$ export BINDGEN_EXTRA_CLANG_ARGS="$(pkg-config --cflags glib-2.0) $(pkg-config --cflags libbson-1.0)"
$ cd haura
$ cargo build

Structure

Haura is structured into multiple subprojects to achieve some separation of concerns. Mainly, they can be divided into implementation and bindings.

Implementation:

  • betree_storage_stack: The actual implementation, and of main interest when considering any modifications to the algorithmic makeup of the storage logic and interfaces. This crate also contains C bindings.
  • bectl: Allows for basic access to the storage stack as a CLI application.

Bindings:

  • julea-betree: The JULEA backend implementation using Haura (see the julea-betree chapter).
  • julea-sys: Generated bindings to the JULEA headers via bindgen (see the julea-sys chapter).
  • fio-haura: An external fio ioengine for benchmarking Haura (see the fio-haura chapter).

B-epsilon Tree Storage Stack

This crate is the centerpiece of the repository, as it contains all algorithmic logic required for the actual implementation of the B-epsilon tree storage. We give you a quick introduction to the structure of the repository here, along with starting points for your own implementation.

Build & Test

To build the storage stack on its own navigate to betree/ and execute:

$ cargo build

This should build the storage stack after a few minutes.

When executing cargo build normally, the library will be built without any optimizations in debug mode. This is helpful when developing, as you get information such as backtraces with understandable names. If you are planning to test for performance or need fast execution, always use cargo build --release.

Tests

We perform a number of tests as unit and integration tests. Some of them require a considerable amount of time as they are run multiple times with different input values.

Unit

Navigate to betree/ and execute:

$ cargo test

Integration

Due to the implementation of the tests, a large amount of memory is used during the integration tests, which affects the remaining system considerably. Please be aware that the tests will consume several GiB of memory and 4 GiB of storage space, as some temporary files will be created in the betree/tests directory.

Navigate to betree/tests/ and execute:

$ ./scripts/test.sh

The provided script limits the number of test threads proportionally to the available system memory, with each test thread requiring at most 4 GiB of memory.

In addition to the memory usage, two files of 2 GiB each will be created. They allow us to test some of the persistency guarantees our storage stack gives.

Design Overview

In this section we want to give you an initial overview of all the structures present in Haura and how they relate to one another. If you want to understand how the database works in detail, it is best to first gain a general idea of how the individual parts interact with one another.

Haura Structure

An overview of the different layers defined in the betree architecture

Database, Dataset and Object store

At the top of the overview, see above, we can see the user-facing part of Haura. The Database provides the main entry point from which the rest of the modules are used. From an active Database we can then initiate a Dataset or Objectstore. Also notable from the graphic, the Database creates the AllocationHandler, which is subsequently used in the DataManagement. Due to this, the Database also holds the storage configuration, which is then used to initialize the StoragePool. The implementation of the Objectstore is a wrapper around Dataset, where the keys are chunk ids and the values are the chunk contents.
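
To illustrate the chunking idea, here is a hypothetical sketch of how an object could be mapped onto key-value pairs; the actual key encoding and chunk size used by the Objectstore are internal to the crate and differ from this.

// Hypothetical illustration only: an object is split into fixed-size chunks,
// and every chunk is stored under a key derived from (object id, chunk index).
// The real Objectstore uses its own internal key encoding and chunk size.
const CHUNK_SIZE: usize = 128 * 1024;

fn chunk_key(object_id: u64, chunk_index: u64) -> Vec<u8> {
    let mut key = object_id.to_be_bytes().to_vec();
    key.extend_from_slice(&chunk_index.to_be_bytes());
    key
}

fn object_chunks(object_id: u64, data: &[u8]) -> impl Iterator<Item = (Vec<u8>, &[u8])> + '_ {
    data.chunks(CHUNK_SIZE)
        .enumerate()
        .map(move |(i, chunk)| (chunk_key(object_id, i as u64), chunk))
}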

B-epsilon-tree
Diagram: B-epsilon tree with a root node (marked TreeLayer), child buffers (CB) attached to the root and internal nodes, and leaves at the bottom.

The Dataset interacts mainly with the actual B-epsilon-tree, which receives messages from the Dataset through its root node. By default these implement insert, remove and upsert, although this can be exchanged if required. Only the MessageAction trait needs to be implemented on the chosen type to allow for its use in the tree. An example of this can be seen in the MetaMessage of the Objectstore.
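
To give an impression of what such a message type does, here is an illustrative stand-in; the crate's actual MessageAction trait has a different and more involved signature, so treat all names below as assumptions.

// Illustrative stand-in for a message action (not the crate's real trait).
// A message describes a modification to a value; it can be applied to the
// current value and merged with other messages addressed to the same key.
trait MessageActionLike {
    /// Apply a message to the current value of a key (None = key not present).
    fn apply(&self, msg: &[u8], current: Option<&[u8]>) -> Option<Vec<u8>>;
    /// Combine two messages for the same key; `newer` was issued after `older`.
    fn merge(&self, newer: &[u8], older: &[u8]) -> Vec<u8>;
}

// Example: a counter that is incremented by u64 deltas instead of being
// overwritten, roughly in the spirit of an upsert.
struct AddCounter;

impl MessageActionLike for AddCounter {
    fn apply(&self, msg: &[u8], current: Option<&[u8]>) -> Option<Vec<u8>> {
        let delta = u64::from_le_bytes(msg.try_into().unwrap());
        let old = current
            .map(|v| u64::from_le_bytes(v.try_into().unwrap()))
            .unwrap_or(0);
        Some((old + delta).to_le_bytes().to_vec())
    }

    fn merge(&self, newer: &[u8], older: &[u8]) -> Vec<u8> {
        let a = u64::from_le_bytes(newer.try_into().unwrap());
        let b = u64::from_le_bytes(older.try_into().unwrap());
        (a + b).to_le_bytes().to_vec()
    }
}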

In the figure above we show how this tree is structured. We differentiate between three different kinds of nodes: Leaves shown in blue, Internal Nodes shown in red, and Child Buffers (abbr. "CB") shown in grey. The root node of the tree is highlighted with TreeLayer to indicate that this is the node providing the interface for Datasets to attach to. Messages are received at the root node.

Once passed, a message propagates down the tree until it reaches a leaf node where it will be applied. This might not happen instantaneously, though: multiple buffers (ChildBuffers) might be encountered which momentarily hold the message at internal nodes. This way we avoid additional deep traversals and might be able to flush multiple messages at once from one buffer node.
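
The buffering behavior can be sketched roughly like this; a simplified illustration with made-up names, not the crate's actual ChildBuffer implementation.

// Simplified illustration of buffering messages at an internal node.
use std::collections::BTreeMap;

type Key = Vec<u8>;
type Msg = Vec<u8>;

struct ChildBufferSketch {
    // Messages addressed to keys inside this child's subtree, held back
    // until the buffer is flushed in one batch.
    buffered: BTreeMap<Key, Vec<Msg>>,
}

impl ChildBufferSketch {
    fn insert(&mut self, key: Key, msg: Msg) {
        self.buffered.entry(key).or_default().push(msg);
    }

    // Once the buffer grows too large, all messages are pushed down to the
    // child at once, amortizing the deep traversal over many messages.
    fn should_flush(&self, max_entries: usize) -> bool {
        self.buffered.len() >= max_entries
    }
}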

Vital to understanding the handling and movement of nodes and their content within Haura is the object state cycle; this is illustrated at the leaf nodes and in more detail in the following figure.

State Diagram of the object lifecycle

Adjacent to the internals and construction of B-epsilon-trees are the commonalities between the trees existing in an open database. Hidden from the user, the root tree is used to store internal information concerning the created datasets (their DatasetIds and ObjectPointers) and segment information. Segments have not been mentioned before, as they belong to the Data Management layer, but they can be thought of as containers organizing the allocation bitmap for a range of blocks. Additionally, to avoid conflicts with one another, all trees share the same Data Management Unit to assure that no irregular state is reached when handling critical on-disk management such as allocation of blocks and updating of bitmaps.

Diagram: the Root Tree references Dataset Trees 1 to 3 via Object Pointers; the Root Tree and all dataset trees use the same Data Management Unit.

Data Management

On-disk operations and storage allocation are handled by the Data Management layer. This layer also implements the copy-on-write semantics required for snapshots, realized via delayed deallocation and the accounting of a dead list of blocks.

The Handler manages the actual bitmap handling for all allocations and deallocations and is also responsible for tracking the number of blocks distributed (Space Accounting).

To keep track of the locations of allocated blocks, or rather of free ranges of blocks, bitmaps are used. Wrapped in SegmentAllocators, they can be used to allocate block ranges at any position within a specific SegmentId or to request specific allocations at given offsets.

SegmentIds refer to 1 GiB large ranges of blocks on a storage tier, though the id is unique across all storage tiers.
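
A rough sketch of what first-fit allocation within one segment's bitmap amounts to is shown below; the 4 KiB block size and all names are assumptions for this illustration, not the crate's actual constants or types.

// Illustration of first-fit allocation within one segment's bitmap.
// A segment spans 1 GiB of blocks; the 4 KiB block size is an assumption.
const BLOCKS_PER_SEGMENT: usize = (1 << 30) / 4096;

struct SegmentBitmapSketch {
    used: Vec<bool>, // true = block allocated
}

impl SegmentBitmapSketch {
    fn new() -> Self {
        Self { used: vec![false; BLOCKS_PER_SEGMENT] }
    }

    /// First-fit search for `count` contiguous free blocks. Returns the block
    /// offset within the segment, or None if no contiguous range is free.
    fn allocate(&mut self, count: usize) -> Option<usize> {
        if count == 0 {
            return None;
        }
        let mut run_start = 0;
        let mut run_len = 0;
        for idx in 0..self.used.len() {
            if self.used[idx] {
                run_start = idx + 1;
                run_len = 0;
            } else {
                run_len += 1;
                if run_len == count {
                    self.used[run_start..=idx].iter_mut().for_each(|b| *b = true);
                    return Some(run_start);
                }
            }
        }
        None
    }
}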

Copy on Write

The Data Management layer is also responsible for copy-on-write preservation. This is handled by checking whether any snapshot of the dataset contains the afflicted node (via its Generation); if this is the case, the dead_list contained in the root tree is updated to contain the storage location of the old version of the node on the next sync.
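
In illustrative pseudocode terms (all names are made up for this sketch), the decision made when an old node version is replaced looks roughly like this:

// Illustrative sketch of the copy-on-write decision; names are made up.
struct Generation(u64);
struct DiskLocation; // offset and size of the old node version on disk

fn on_node_rewritten(
    node_generation: Generation,
    newest_snapshot: Option<Generation>,
    location: DiskLocation,
    dead_list: &mut Vec<DiskLocation>,
    deallocate: impl FnOnce(DiskLocation),
) {
    match newest_snapshot {
        // A snapshot still references the old version: remember it on the
        // dead list so the blocks are only freed on a later sync.
        Some(snap) if node_generation.0 <= snap.0 => dead_list.push(location),
        // No snapshot references the old version: free the blocks directly.
        _ => deallocate(location),
    }
}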

Storage Pool

As the abstraction over specific hardware types and RAID configurations, the Data Management Unit performs all I/O operations through the storage pool layer. Notable here is the division of the layer into storage tiers, Vdevs and LeafVdevs. There are 4 storage tiers available (FASTEST, FAST, SLOW, SLOWEST), each holding at most 1024 Vdevs. Each Vdev can be one of four variants. First, a singular LeafVdev; this is the equivalent of a disk or any other file-path-backed interface, for example a truncated file or a disk under /dev/.... Second, a RAID-1 like mirrored configuration. Third, a RAID-5 like striping and parity based setup with multiple disks. Fourth and last, a main-memory backed buffer, simply held as a vector.
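
Sketched as a Rust enum, purely for illustration (the crate's actual configuration types differ), the variants look like this:

// Illustrative sketch of the vdev variants described above; not the crate's
// actual type definitions.
struct LeafVdevSketch {
    path: std::path::PathBuf,
    direct: bool, // whether to use O_DIRECT access
}

enum VdevSketch {
    /// A single disk or other file-path-backed device (e.g. a truncated file).
    Leaf(LeafVdevSketch),
    /// RAID-1 like mirroring over several leaf vdevs.
    Mirror(Vec<LeafVdevSketch>),
    /// RAID-5 like striping with parity over several leaf vdevs.
    Parity(Vec<LeafVdevSketch>),
    /// Main-memory backed buffer, simply held as a vector.
    Memory { size_bytes: usize },
}

/// Four tiers (FASTEST, FAST, SLOW, SLOWEST), each with at most 1024 vdevs.
struct StorageTiersSketch {
    tiers: [Vec<VdevSketch>; 4],
}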

Implementation

This section should help you find the module you'll want to implement your changes in by giving you a brief description of each module together with its location and references in the layer design.

Name              Description
cache             Clock cache implementation used internally
compression       Compression logic for indication and usage of compression algorithms (zstd only at the moment)
data_management   Allocation and copy-on-write logic for the underlying storage space
database          The Database layer & Dataset implementation with snapshots
metrics           Basic metric collections
object            The object store wrapper around the dataset store
storage_pool      The storage pool layer which manages the different vdevs
tree              The actual B-epsilon tree
vdev              Implements the use of different devices for storage (block, file, memory) with different modes (parity, RAID, single)

Note that traits are heavily used to allow interaction between objects of different modules; traits implemented by one module might be located in multiple other modules.

Known Bugs

  • On large write operations (easy to achieve with Objectstore) which overfill the storage, unexpected errors can be returned. This has been reduced by the introduction of space accounting but some errors might still occur as not all checks have been implemented yet.
  • Some tests are expected to fail in the current state of Haura. These tests are marked as should_panic in the cargo test output. These tests are largely concerned with the storage space utilization and test extreme situations where no storage space is left on the available devices.

bectl

This application allows usage via a CLI, albeit suboptimally, as the complete stack has to be reinitialized on each access. At the moment it is best suited for debugging and demonstration purposes.

We briefly demonstrate how to build, configure and use it.

Build & Test

Building bectl is relatively simple but does require all dependencies of betree_storage_stack to be available. If you have not done this yet refer to the Building chapter.

Given all prerequisites are fulfilled simply run

$ cargo build

from the bectl directory to build bectl on its own.

To avoid specifying the location of the binary or running bectl via cargo run on each invocation, you can also install the app via cargo for your local user.

$ cargo install --path .

There are not yet any tests provided for bectl, as the functionality is a rather simple mapping to betree_storage_stack functions. If we expand this in the future, we might want to ensure that bectl behaves correctly internally too.

Basic Usage

To use the storage stack via bectl you will first need to define a valid configuration. This can be done either by modifying a small example like the one shown below or by creating a valid configuration with the betree_storage_stack crate and calling write_config_json on the created Database object.
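
For the programmatic route, a minimal sketch could look like the following; the exact signatures of DatabaseConfiguration, Database::build and write_config_json are assumptions based on the description above and may differ in the actual crate.

// Sketch: construct a configuration in code, build a Database from it, and
// dump the configuration as JSON for later use with bectl. Signatures are
// assumptions; consult the betree_storage_stack documentation for details.
use betree_storage_stack::{Database, DatabaseConfiguration};

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let config = DatabaseConfiguration::default();
    let db = Database::build(config)?;
    // Write the configuration used by this database to a JSON file.
    db.write_config_json("/home/user/.config/haura.json")?;
    Ok(())
}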

Example Configuration
{
  "storage": {
    "tiers": [
      {
        "top_level_vdevs": [
          {
            "path": "/home/user/.cache/haura/cache.disk",
            "direct": true
          }
        ],
        "preferred_access_type": "Unknown"
      }
    ],
    "queue_depth_factor": 20,
    "thread_pool_size": null,
    "thread_pool_pinned": false
  },
  "alloc_strategy": [
    [
      0
    ],
    [
      1
    ],
    [
      2
    ],
    [
      3
    ]
  ],
  "default_storage_class": 0,
  "compression": "None",
  "cache_size": 4294967296,
  "access_mode": "OpenOrCreate",
  "sync_interval_ms": null,
  "migration_policy": null,
  "metrics": null
}

Store this configuration in a convenient place, for example under $XDG_CONFIG_HOME/haura.json if defined, or alternatively $HOME/.config/haura.json. To avoid specifying this path on each access, store the location of your configuration in your environment as $BETREE_CONFIG. For example like so:

# excerpt from .your_favorite_shellenv

export BETREE_CONFIG=$HOME/.config/haura.json

In our example we use $XDG_CACHE_HOME/haura/cache.disk as the storage file; this integrates nicely with most common setups and indicates that this data is not essential. If XDG_CACHE_HOME is not set on your system, you can use $HOME/.cache/haura/cache.disk instead. Create the directories if they do not exist.

$ mkdir -p $HOME/.cache/haura
$ truncate -s 16G $HOME/.cache/haura/cache.disk

EDITOR NOTE: We may want to specify this as a yaml or toml in the future to ease the configuration and actually be able to write this by hand besides the zfs-like configuration string.

Usage

Once you have configured everything you are ready to use bectl. We show you the very basics of how to start using the CLI here; you can find further information in the CLI itself via the various help pages.

Initialization

Before using it, you'll need to initialize your storage.

$ bectl db init

Write Data

To write the value baz under key bar in the dataset foo:

$ bectl kv foo put bar baz

Read Data

To read the data stored under key bar in the dataset foo:

$ bectl kv foo get bar

fio-haura

This directory contains an external module implementing a storage engine for the Flexible IO tester (fio). fio allows for free configuration of multiple IO patterns, including pre-defined scenarios available as job files; have a look at its documentation to gather more information if you are unfamiliar with it.

Documentation: https://fio.readthedocs.io/en/latest/fio_doc.html

or

$ man fio

Build

Building Haura's fio ioengine requires some internal files which are not available in most provided fio packages. The Makefile automatically downloads and prepares the required files when building the module; to build the module run:

$ make

On build errors consult the configure.log in the fio directory.

You'll find the ioengine in the src/ directory.

Run fio with Haura

Important❗

To actually run the engine you first need to compile betree_storage_stack in release mode; for this, execute cargo build --release in betree/. Furthermore, your system must be able to find the compiled library. For that, source the given environment file in fio-haura/.

$ source ./env.sh

Additionally, since Haura's configuration is more complex than what can be expressed in fio clients, a configuration has to be loaded. The configuration's path has to be stored in the environment under BETREE_CONFIG. See the bectl Basic Usage chapter for more information.

Running fio

fio can be configured with CLI options and job files; both have the same capabilities, so for brevity we will use CLI options here. You can find multiple job files which can be used with fio to specify these options in a more manageable way in fio-haura/jobfiles.

As an example to perform a simple IOPS test, you can use:

$ fio \
    --direct=1 \
    --rw=randwrite \
    --random_distribution=zipf \
    --bs=4k \
    --ioengine=external:src/fio-engine-haura.o \
    --numjobs=1 \
    --runtime=30 \
    --time_based \
    --group_reporting \
    --name=iops-test-job \
    --eta-newline=1 \
    --size=4G \
    --io_size=2G

This starts an IO benchmark using --direct access in a --rw=randwrite pattern using a block size of --bs=4k for each access. Furthermore, Haura is specified as --ioengine=external:src/fio-engine-haura.o and runs for --runtime=30 seconds with --numjobs=1. The total size of IO operations for each thread is --io_size=2G, which is the upper limit if the runtime is not reached.

❗ Random Workloads Caution

When using random workloads which surpass the size of the internal cache, or when explicitly syncing to disk, extensive fragmentation might appear. This leads to situations where (even though enough space is theoretically available) no contiguous space can be allocated, resulting in out-of-space errors.

To counteract this it is advised to:

  • Increase the size of the cache
  • Increase the underlying block size while retaining the same io_size
  • Choose a random distribution with a higher skew to specific regions (e.g. zipf) to avoid frequent evictions of nodes from the internal cache
  • Reduce the number of jobs; More jobs put more pressure on the cache leading to more frequent evictions which lead to more writeback operations worsening fragmentation

As a general rule this boils down to two things: reduce the number of write operations and enlarge the allocation space.

fio prints a summary of the results at the end, which should look similar to this output:

fio --direct=1 --rw=randwrite --bs=4k --ioengine=external:src/fio-engine-haura.o --runtime=10 --numjobs=4 --time_based --group_reporting --name=iops-test-job --eta-newline=1 --size=4G --thread
iops-test-job: (g=0): rw=randwrite, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=haura, iodepth=1
...
fio-3.30
Starting 4 threads
Jobs: 4 (f=4): [w(4)][30.0%][w=12.3MiB/s][w=3147 IOPS][eta 00m:07s]
Jobs: 4 (f=4): [w(4)][54.5%][w=12.7MiB/s][w=3244 IOPS][eta 00m:05s]
Jobs: 4 (f=4): [w(4)][72.7%][w=12.8MiB/s][w=3283 IOPS][eta 00m:03s] 
Jobs: 4 (f=4): [w(4)][90.9%][eta 00m:01s]                          
Jobs: 4 (f=4): [w(4)][100.0%][w=6422KiB/s][w=1605 IOPS][eta 00m:00s]
iops-test-job: (groupid=0, jobs=4): err= 0: pid=46232: Wed Mar  1 13:42:32 2023
  write: IOPS=2182, BW=8729KiB/s (8938kB/s)(85.4MiB/10024msec); 0 zone resets
    clat (nsec): min=12, max=30256, avg=184.83, stdev=460.93
     lat (nsec): min=1122, max=1629.4M, avg=1828274.84, stdev=26162931.82
    clat percentiles (nsec):
     |  1.00th=[   14],  5.00th=[   17], 10.00th=[   57], 20.00th=[   60],
     | 30.00th=[   62], 40.00th=[   64], 50.00th=[   66], 60.00th=[   68],
     | 70.00th=[   99], 80.00th=[  179], 90.00th=[  692], 95.00th=[  812],
     | 99.00th=[ 1272], 99.50th=[ 1672], 99.90th=[ 3088], 99.95th=[ 3568],
     | 99.99th=[23424]
   bw (  KiB/s): min= 1664, max=24480, per=100.00%, avg=10387.06, stdev=1608.07, samples=67
   iops        : min=  416, max= 6120, avg=2596.72, stdev=402.01, samples=67
  lat (nsec)   : 20=5.94%, 50=2.40%, 100=61.71%, 250=14.51%, 500=4.29%
  lat (nsec)   : 750=1.49%, 1000=8.25%
  lat (usec)   : 2=1.07%, 4=0.30%, 10=0.01%, 20=0.01%, 50=0.01%
  cpu          : usr=16.99%, sys=7.73%, ctx=15964, majf=0, minf=383252
  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwts: total=0,21874,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=1

Run status group 0 (all jobs):
  WRITE: bw=8729KiB/s (8938kB/s), 8729KiB/s-8729KiB/s (8938kB/s-8938kB/s), io=85.4MiB (89.6MB), run=10024-10024msec

Haura-specific flags

The implemented engine comes with some additional flags to modify the configuration of the started Haura instance. These flags only deactivate the translation of certain conditions usually created in fio benchmarks to Haura itself, which can be useful for example when using tiered storage setups which cannot be described with fio.

--disrespect-fio-files

    Avoid transferring fio file configuration to haura. Can be 
    used to use specific disks regardless of fio specification.

--disrespect-fio-direct

    Use direct mode only as specified in haura configuration.

--disrespect-fio-options

    Disregard all fio options in Haura. This only uses the I/O 
    workflow as executed by fio. Take care to ensure 
    comparability with results of other engines.

More examples

Have a look at the examples directory of fio for more usage examples and jobfiles.

When performing read-only benchmarks, the benchmarks include some prepopulation which, depending on the storage medium, might take some time to complete.

julea-betree

To make the betree_storage_stack usable from JULEA, we need to encapsulate the functionality by implementing the Backend type. For documentation of the specifics refer to the JULEA documentation.

You will most likely not need to modify this code when implementing features in Haura.

Build

To build, you will need the generated bindings to JULEA; check the chapter Building. Furthermore, ensure that all dependencies are installed.

Once you are done navigate to the julea-betree directory.

$ cargo build

The build process produces a shared library object which we further use with JULEA.

TODO


All of this is not automated at the moment, although we would like it to be. We may simply evaluate the JULEA environment, if set, and copy our shared object there, so that we can actually use JULEA in the process.

Usage in JULEA

When the build finishes you'll find a libobject_betree.so in the target directory, either under release or debug depending on your build (debug is the default, so check there if you are unsure).

With the built library, we can switch to JULEA to integrate our results into their configuration. If you have not set up JULEA yet, have a look at their documentation. Follow their documentation until you configure JULEA with julea-config; from there on, this documentation uses a modified procedure.

Configure JULEA

For JULEA to use betree_storage_stack as its backend (for object storage) we have to set the appropriate flags for julea-config. The example below uses the user instance of Haura as used in the bectl documentation.

julea-config --user \
  --object-servers="$(hostname)" --kv-servers="$(hostname)" --db-servers="$(hostname)" \
  --object-backend=betree --object-path="$HOME/.config/haura.json" \
  --kv-backend=lmdb --kv-path="/tmp/julea-$(id -u)/lmdb" \
  --db-backend=sqlite --db-path="/tmp/julea-$(id -u)/sqlite"

Prepare the library

JULEA can't find the library in its current position, as it is located in the build directory of Haura. For debugging purposes we can simply copy the produced libobject_betree.so to the JULEA backend directory.

TODO


How and from where JULEA loads backends is still work in progress and likely to change in the future. This documentation reflects the current state, but only for the Debug build of JULEA.

To copy the library to JULEA with the loaded JULEA environment run:

# The change between underscore and hyphen is not a typo but a mismatch
# between allowed names in cargo and expected names in JULEA.
$ cp target/debug/libobject_betree.so $JULEA_BACKEND_PATH/libobject-betree.so

Start JULEA

To start JULEA, navigate back to its directory and run

$ ./scripts/setup.sh start

To check whether everything worked, check your logs, ideally with

$ journalctl -e GLIB_DOMAIN=JULEA

On success this should look like this (some details will look different on your machine):

Apr 10 11:46:31 nicomedia julea-server[15742]: Loaded object backend betree.
Apr 10 11:46:32 nicomedia julea-server[15742]: Initialized object backend betree.
Apr 10 11:46:32 nicomedia julea-server[15742]: Loaded kv backend lmdb.
Apr 10 11:46:32 nicomedia julea-server[15742]: Initialized kv backend lmdb.
Apr 10 11:46:32 nicomedia julea-server[15742]: Loaded db backend sqlite.
Apr 10 11:46:32 nicomedia julea-server[15742]: Initialized db backend sqlite.

If everything worked fine so far you can run the JULEA test suite with

$ ./scripts/test.sh

A number of tests are then executed and reports on any failures will be generated for you.

julea-sys

julea-sys contains generated bindings to the JULEA headers via bindgen. They are used internally by the julea-betree crate to provide the Haura JULEA backend. You'll likely not need to modify the code itself (there is not much of it), but you may need to take care of the build process, which we explain here shortly.

Build

Most of the build process should work without any intervention; you just need to point the generation to the JULEA headers.

Importantly, you have to provide the JULEA_INCLUDE environment variable, which points to the include directory of the JULEA repository. Also ensure that you have already installed all required libraries; for reference check the chapter Building.

$ git clone https://github.com/parcio/julea.git
$ export JULEA_INCLUDE=$PWD/julea/include
$ export BINDGEN_EXTRA_CLANG_ARGS="$(pkg-config --cflags glib-2.0) $(pkg-config --cflags libbson-1.0)"

Documentation

As with the other packages, you can open a basic documentation of the crate with cargo.

$ cargo doc --open

RFCs

This chapter contains some RFCs (Requests for Comments). Since not too many people cooperate on this project, they are kept relatively lightweight and may not be equivalent to the RFCs you know from other projects.

A template can be found in the source of this documentation under docs/src/rfc/0-template.md. All RFCs are meant to give insight into why we should modify certain behavior and which drawbacks and limitations such a change implies.

  • Title: Pivot Keys
  • Status: DRAFT

Summary/Motivation

This RFC adds an identifier which allows node-defined access to structural elements in the B-epsilon tree. This measure allows us to perform operations which depend more on the actually present layout of the tree than on the semantic content stored under individual keys. Furthermore, in combination with the recently added migration policies, we gain the option to migrate singular nodes and improve the reporting scheme by using more stable identifiers compared to the previously used DiskOffsets.

Description

        p1      p2      p3
┌───────┬───────┬───────┬───────┐
│       │       │       │       │
│ Left  │ Right │ Right │ Right │
│ Outer │       │       │       │
└───┬───┴───┬───┴───┬───┴───┬───┘
    │       │       │       │
    ▼       ▼       ▼       ▼

This RFC proposes a new identification for nodes within the B-epsilon tree. We design it to be as unintrusive as possible to the structure and current implementation of the tree to avoid undesired side effects later on. This includes reducing the dependency on state, as we show in the description here. The basis of "PivotKey" are the pivot elements already present in the tree. We use an enum indicating the position of the node based on its parent.

type Pivot = CowBytes;

enum LocalPivotKey {
    LeftOuter(Pivot),
    Right(Pivot),
    Root,
}

The property we are piggy-backing on is that pivots are searchable and unique; furthermore, we can structurally define PivotKey directions, which keeps the required memory space relatively low as only a single pivot element is required. Finally, pivot keys are more persistent than normal keys in the collection, as they are only refreshed when the structure of the tree changes. This might happen on rebalancing. Although, when we consider this, any node-based algorithm needs to reconsider its decisions anyway.

To make the pivot key usable over all datasets in a database (which can have overlapping key sets), we require additional information to direct the pivot key to the correct dataset. This can be done by adding a DatasetId to the key.

type Pivot = CowBytes;

enum PivotKey {
    LeftOuter(Pivot, DatasetId),
    Right(Pivot, DatasetId),
    Root(DatasetId),
}

Also, as the root node of a tree does not have a parent which could index the node by its pivot, we add another variant which simply denotes that the root of a given dataset is to be chosen.

We propose to internally use two kinds of pivot keys. First, the global PivotKey, which is structured as shown above and includes the DatasetId. Second, the scoped LocalPivotKey, which offers us some advantages when designing interfaces for node operations, which do not know in which tree they are located. This alleviates the need for passing around DatasetIds to layers which are normally unaffected by them. The transformation from LocalPivotKey to PivotKey is always possible, as all local keys which a tree layer will encounter belong to the tree itself; the reverse direction is not guaranteed to be valid and should therefore be excluded in the implementation.
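
As a sketch (assuming the two enums as defined above), the one-directional conversion could look like this:

// Every LocalPivotKey can be turned into a global PivotKey once the owning
// tree supplies its DatasetId; the reverse conversion is intentionally absent.
impl LocalPivotKey {
    fn to_global(self, d_id: DatasetId) -> PivotKey {
        match self {
            LocalPivotKey::LeftOuter(p) => PivotKey::LeftOuter(p, d_id),
            LocalPivotKey::Right(p) => PivotKey::Right(p, d_id),
            LocalPivotKey::Root => PivotKey::Root(d_id),
        }
    }
}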

Integration

The given PivotKey structure can be used in ObjRef to add a measure of identification to the cache location or disk location. We can then use the PivotKey of a tree node to perform operations on specific nodes which fulfill certain conditions such as access frequency or access probability. This is helpful in a number of scenarios such as data prefetching or disk-to-disk migrations.

Since pivots are stored in the internal nodes, we would be required to read a substantial amount of data to retrieve knowledge about all existing pivot keys. This limits efficient usage to scenarios in which we retrieve pivot keys, for example from messages emitted by the DMU, and record them as they are used by the user. Previously, a similar scheme was implemented by the migration policies, which recorded disk offsets and set hints to the DMU as to which tier a node is advised to be written to. This limited hints to frequently accessed nodes, which are likely to be migrated to faster storage, as rarely accessed nodes would never encounter the sent hints. With PivotKeys we can also actively migrate nodes downwards, which is the main advantage of PivotKeys. This can be useful in scenarios with additional small layers like NVRAM, where we are then more flexible to migrate granular amounts of data for better tier utilization.

Drawbacks

The implementation does not need extra members in the node, as we can generate the valid PivotKeys when reading the tree from disk. However, with the current deserialization scheme (deserializing directly into valid nodes), this creates hidden states: either nodes with reconstructed PivotKeys or nodes missing valid PivotKeys, which would encounter errors when trying to actually read data from their children. To avoid this, the deserialization scheme would need to be adjusted to deserialize into prototypes ("InternalNodeProto" or whichever name) which are then validated via a transformation InternalNodeProto -> InternalNode. This makes the deserialization code a bit more involved, but clearer, and avoids misuse.

Alternatives

The search for alternatives is difficult as not many characteristics are present which allow nodes to be identifiable and searchable. Something like Ids for example could make nodes identifiable but from an Id we cannot search for the node in the tree.

An alternative to the pivot method could be the path construction which relies on the structural uniqueness of the tree, in which we construct a path from the root by giving directions for each step downwards until the desired node is located. It can look similar to this:

enum Path {
    Child(usize, Box<Path>),
    Terminate,
}

This method does not provide many advantages and carries the disadvantage of having to maintain additional state about the already traversed path while constructing keys. Arguably, this is semantically not complicated, but many methods would be affected by this change. Also, misidentification becomes possible, as paths may be shifted around when subtrees are reconstructed. With PivotKeys such cases result in a failed search, indicating an outdated key. Currently, the only advantage this method would have over the PivotKey method is that the size can be expected to remain comparatively low, with 5 bytes for each element in the search path. A restriction of the key size, as discussed in the corresponding issue, could solve this problem and is already under discussion.