Haura
This book serves as a Getting Started and Developer Guide for the Haura research storage stack. It gives a quick overview of how Haura is used and how it is structured. If you are interested in any specific part or function of the project, navigate to the desired section via the "Structure" menu on the left.
Haura works differently from file systems such as ext4 or btrfs, which run in the kernel. Haura runs entirely in userspace and is neither designed to be permanently active nor to provide common interfaces such as POSIX. Instead, it offers key-value and object interfaces as they are common in databases and mass storage applications.
Below we lay out two scenarios which show how Haura is commonly used in conjunction with arbitrary client code.
Scenario 1 - Using Haura directly
The simplest way to use Haura, as you will also see in the examples, is to initialize Haura directly with your own configuration. Haura then runs in your own process context, meaning that Haura stops when your program terminates.
Scenario 2 - Using a Wrapper
Using a wrapper around the functionality, like JULEA, allows for more flexible usage of the storage stack: the JULEA process owns and runs the Haura instance, so the storage stack outlives any single application process. Additionally, this allows more than one client to connect to an active Haura instance.
Building
Building the complete project, including all contained crates, requires some configuration beforehand.
Install Dependencies
First, install the Rust compiler to compile the project. Use your package manager or set it up in your local user home via rustup.
# For rpm based distributions (Fedora, CentOS, Rocky, ...)
$ sudo dnf install cargo
# For apt based distributions (Debian, Ubuntu, ...)
$ sudo apt update
$ sudo apt install cargo
# For Arch Linux and its derivatives
$ sudo pacman -Sy cargo
You'll need at least version 1.61. Most package managers should deliver this or a more recent version; rustup will always deliver the most up-to-date version.
Then install the required system dependencies for bindgen and the JULEA bindings.
Fedora/RHEL
$ sudo dnf install glib2 glib2-devel libbson libbson-devel clang make pkgconf libpmem libpmem-devel
Ubuntu/Debian/...
$ sudo apt update
$ sudo apt install libglib2.0-0 libglib2.0-dev libbson-1.0-0 libbson-dev clang make pkg-config libpmem1 libpmem-dev
Arch Linux
Arch Linux does not package libpmem; if you want to use it, you may try the AUR. Otherwise, compile without the `nvm` feature.
$ sudo pacman -Sy glib2 clang make libbson pkgconf
Prepare the environment
This step can be skipped if you do not need to use the JULEA interface. See betree.
To compile the bindings you'll need JULEA present and its headers specified in your environment.
$ git clone https://github.com/parcio/julea.git
To build the complete Haura project from this state, execute:
$ export JULEA_INCLUDE=$PWD/julea/include
$ export BINDGEN_EXTRA_CLANG_ARGS="$(pkg-config --cflags glib-2.0) $(pkg-config --cflags libbson-1.0)"
$ cd haura
$ cargo build
Structure
Haura is structured into multiple subprojects to separate concerns. Mainly, they can be divided into implementation and bindings.
Implementation:
- betree_storage_stack: The actual implementation and the crate of main interest when considering modifications to the algorithmic makeup of the storage logic and interfaces. This crate also contains C bindings.
- bectl: Allows basic access to the storage stack as a CLI application.
Bindings:
- julea-betree: Bindings exposed to be used by JULEA. Specifies a betree backend.
- julea-sys: Generated bindings by bindgen for use in julea-betree.
- fio-haura: Engine for fio.
B-epsilon Tree Storage Stack
This crate is the centerpiece of the repository, as it contains all the algorithmic logic of the actual B-epsilon tree storage implementation. Here we give you a quick introduction to the structure of the crate and point out where to start your implementation.
Build & Test
To build the storage stack on its own, navigate to `betree/` and execute:
$ cargo build
This should build the storage stack after a few minutes.
When executing `cargo build` normally, the library is built without any optimizations in debug mode. This is helpful when developing, as you get information such as backtraces with understandable names. If you plan to test for performance or need fast execution, always use `cargo build --release`.
Tests
We perform a number of tests as unit and integration tests. Some of them require a considerable amount of time as they are run multiple times with different input values.
Unit
Navigate to `betree/` and execute:
$ cargo test
Integration
Due to the implementation of the tests, a large amount of memory is taken up during the integration tests, affecting the rest of the system considerably. Please be aware that the tests will consume several GiB of memory and 4 GiB of storage space, as some temporary files are created in the `betree/tests` directory.
Navigate to `betree/tests/` and execute:
$ ./scripts/test.sh
The provided script limits the number of test threads proportionally to the available system memory, with each test thread requiring at most 4 GiB of memory.
In addition to the memory usage, two files of 2 GiB each will be created. They allow us to test some persistence guarantees our storage stack gives.
Design Overview
In this section we give you an initial overview of the structures present in Haura and how they relate to one another. If you want to understand in detail how the database works, it is best to first gain a general idea of how the individual parts interact.
Haura Structure
Database, Dataset and Object store
At the top of the overview, see above, we can see the user interface part of Haura. The `Database` provides the main entry point here for the rest of the modules. From an active `Database` we can then initiate a `Dataset` or `Objectstore`. Also notable from the graphic: the `Database` creates the `AllocationHandler`, which is subsequently used in the `DataManagement`. Due to this, the `Database` also contains the storage configuration, which is then instantiated in the `StoragePool`. The implementation of the `Objectstore` is a wrapper around `Dataset`, where the keys are chunk ids and the values are their chunk contents.
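To illustrate the idea of the `Objectstore` wrapping a `Dataset`, here is a minimal, self-contained sketch. This is not Haura's actual API: the function names `put_object`/`get_object`, the key layout, and the tiny chunk size are assumptions for illustration only. Objects are split into fixed-size chunks, and each chunk is stored under a key derived from the object name and the chunk index.

```rust
use std::collections::BTreeMap;

/// Toy stand-in for a Dataset: an ordered key-value store.
type Dataset = BTreeMap<Vec<u8>, Vec<u8>>;

const CHUNK_SIZE: usize = 4; // tiny for demonstration; real chunks are much larger

/// Store an object by splitting it into chunks keyed by (name, chunk index).
fn put_object(ds: &mut Dataset, name: &[u8], data: &[u8]) {
    for (idx, chunk) in data.chunks(CHUNK_SIZE).enumerate() {
        let mut key = name.to_vec();
        key.extend_from_slice(&(idx as u64).to_be_bytes());
        ds.insert(key, chunk.to_vec());
    }
}

/// Reassemble the object by iterating its chunk keys in order.
fn get_object(ds: &Dataset, name: &[u8]) -> Vec<u8> {
    let mut out = Vec::new();
    for idx in 0u64.. {
        let mut key = name.to_vec();
        key.extend_from_slice(&idx.to_be_bytes());
        match ds.get(&key) {
            Some(chunk) => out.extend_from_slice(chunk),
            None => break,
        }
    }
    out
}

fn main() {
    let mut ds = Dataset::new();
    put_object(&mut ds, b"obj", b"hello, haura!");
    assert_eq!(get_object(&ds, b"obj"), b"hello, haura!".to_vec());
}
```

The big-endian chunk index keeps the chunk keys sorted in chunk order, which is what makes a range scan over one object efficient in an ordered key-value store.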
B-epsilon-tree
The `Dataset` interacts mainly with the actual B-epsilon-tree, which receives messages from the `Dataset` through its root node. By default these implement insert, remove and upsert, although this can be exchanged if required. Solely the `MessageAction` trait is required to be implemented on the chosen type to allow its use in the tree. An example of this can be seen in the `MetaMessage` of the `Objectstore`.
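To make the mechanism concrete, here is a toy sketch. The real `MessageAction` trait in `betree_storage_stack` has a different, richer signature; the trait shape and the `KvMessage` type below are simplified assumptions for illustration. The core idea is that a message type only defines how it is applied to an optional existing value.

```rust
/// Simplified stand-in for Haura's MessageAction trait.
trait MessageAction {
    /// Apply the message to the current value (None = key absent),
    /// returning the new value (None = key deleted).
    fn apply(&self, current: Option<Vec<u8>>) -> Option<Vec<u8>>;
}

enum KvMessage {
    Insert(Vec<u8>),
    Remove,
    /// Overwrite bytes starting at the given offset, growing if needed.
    Upsert { offset: usize, data: Vec<u8> },
}

impl MessageAction for KvMessage {
    fn apply(&self, current: Option<Vec<u8>>) -> Option<Vec<u8>> {
        match self {
            KvMessage::Insert(v) => Some(v.clone()),
            KvMessage::Remove => None,
            KvMessage::Upsert { offset, data } => {
                let mut v = current.unwrap_or_default();
                if v.len() < offset + data.len() {
                    v.resize(offset + data.len(), 0);
                }
                v[*offset..*offset + data.len()].copy_from_slice(data);
                Some(v)
            }
        }
    }
}

fn main() {
    let v = KvMessage::Insert(b"hello".to_vec()).apply(None);
    let v = KvMessage::Upsert { offset: 0, data: b"H".to_vec() }.apply(v);
    assert_eq!(v, Some(b"Hello".to_vec()));
    assert_eq!(KvMessage::Remove.apply(v), None);
}
```

Because a message carries its own application logic, the tree can hold messages in buffers and apply them lazily without knowing anything about their semantics.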
In the figure above we show how this tree is structured. We differentiate between three different kinds of nodes in it: first `Leaves`, shown in blue; `Internal Nodes`, shown in red; and `Child Buffers` (abbr. "CB"), shown in grey. The `root` node of the tree is highlighted with `TreeLayer` to indicate that this is the node providing the interface for `Dataset`s to attach to. From the `root` node, messages are received.
Once passed, the tree propagates the message down the tree until it reaches a leaf node where the message will be applied. This might not happen instantaneously, though, and multiple buffers (`ChildBuffer`s) might be encountered which momentarily hold the message at internal nodes. This way we avoid additional deep traversals and might be able to flush multiple messages at once from one buffer node.
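A toy model of this buffering follows. This is not Haura's implementation: the node layout, message type, and flush threshold are made up for illustration. An internal node holds one message buffer per child and flushes a whole buffer downwards in one batch once it grows large enough.

```rust
use std::collections::BTreeMap;

type Key = u64;
type Msg = i64; // toy message: a delta added to the stored value

/// Toy internal node: routes keys to two children and buffers messages.
struct Internal {
    pivot: Key,                            // keys < pivot go left, others right
    buffers: [BTreeMap<Key, Vec<Msg>>; 2], // one ChildBuffer per child
    children: [BTreeMap<Key, i64>; 2],     // toy leaves
}

const FLUSH_AT: usize = 3;

impl Internal {
    fn insert(&mut self, key: Key, msg: Msg) {
        let idx = (key >= self.pivot) as usize;
        self.buffers[idx].entry(key).or_default().push(msg);
        // Flush the whole buffer at once when it holds enough messages:
        // one deep traversal applies many messages.
        let total: usize = self.buffers[idx].values().map(Vec::len).sum();
        if total >= FLUSH_AT {
            for (k, msgs) in std::mem::take(&mut self.buffers[idx]) {
                let slot = self.children[idx].entry(k).or_insert(0);
                for m in msgs {
                    *slot += m;
                }
            }
        }
    }
}

fn main() {
    let mut node = Internal {
        pivot: 100,
        buffers: [BTreeMap::new(), BTreeMap::new()],
        children: [BTreeMap::new(), BTreeMap::new()],
    };
    node.insert(1, 10);
    node.insert(2, 5);
    assert!(node.children[0].is_empty()); // still buffered at the internal node
    node.insert(1, -3); // third message triggers a batch flush
    assert_eq!(node.children[0][&1], 7);
    assert_eq!(node.children[0][&2], 5);
}
```

The point of the batch flush is exactly the amortization described above: the cost of one traversal to a child is shared by all buffered messages destined for it.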
Vital to understanding the handling and movement of nodes and their content within Haura is the object state cycle. It is illustrated at the leaf nodes and, in more detail, in the following figure.
Adjacent to the internals and construction of B-epsilon-trees are the commonalities between the existing trees in an open database. Hidden from the user, a root tree is used to store internal information concerning the created datasets (their `DatasetId`s and `ObjectPointer`s) and `Segment` information. `Segment`s have not been mentioned before as they belong to the Data Management Layer, but they can be thought of as containers organizing the allocation bitmap for a range of blocks. Additionally, to avoid conflicts with one another, all trees share the same Data Management Unit to ensure that no irregular state is reached when handling critical on-disk management such as the allocation of blocks and the updating of bitmaps.
Data Management
On-disk operations and storage allocation are handled by the Data Management layer. This layer also implements the copy-on-write semantics required for snapshots, realized via delayed deallocation and the accounting of a dead-list of blocks.
The `Handler` manages the actual bitmap handling for all allocations and deallocations and is also responsible for tracking the number of blocks distributed (Space Accounting).
To keep track of the specific locations of allocated blocks, or rather of free ranges of blocks, bitmaps are used. Wrapped in `SegmentAllocator`s, these can be used to allocate block ranges at any position in a specific `SegmentId` or to request specific allocations at given offsets.
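Conceptually, such bitmap allocation can be sketched as follows. This is a simplified stand-in, not the crate's `SegmentAllocator`: a first-fit scan for a free run of the requested length, plus allocation at a fixed offset.

```rust
/// Toy segment allocator over a block bitmap (true = allocated).
struct SegmentAllocator {
    bitmap: Vec<bool>,
}

impl SegmentAllocator {
    fn new(blocks: usize) -> Self {
        SegmentAllocator { bitmap: vec![false; blocks] }
    }

    /// First-fit allocation: returns the start offset of a free run
    /// of `len` blocks, or None if the segment is too fragmented/full.
    fn allocate(&mut self, len: usize) -> Option<usize> {
        let mut run_start = 0;
        let mut run_len = 0;
        for (i, &used) in self.bitmap.iter().enumerate() {
            if used {
                run_len = 0;
                run_start = i + 1;
            } else {
                run_len += 1;
                if run_len == len {
                    self.bitmap[run_start..run_start + len].fill(true);
                    return Some(run_start);
                }
            }
        }
        None
    }

    /// Allocation at a fixed offset, e.g. when replaying a known layout.
    fn allocate_at(&mut self, offset: usize, len: usize) -> bool {
        if self.bitmap[offset..offset + len].iter().any(|&b| b) {
            return false; // some block in the range is already taken
        }
        self.bitmap[offset..offset + len].fill(true);
        true
    }
}

fn main() {
    let mut seg = SegmentAllocator::new(16);
    assert_eq!(seg.allocate(4), Some(0));
    assert_eq!(seg.allocate(4), Some(4));
    assert!(!seg.allocate_at(2, 2)); // overlaps the first allocation
    assert!(seg.allocate_at(12, 4));
    assert_eq!(seg.allocate(8), None); // only blocks 8..12 remain free
}
```

The last assertion also shows the fragmentation failure mode discussed later in the fio chapter: enough free blocks may exist in total while no contiguous run of the requested length does.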
`SegmentId`s refer to 1 GiB ranges of blocks on a storage tier, though the id is unique over all storage tiers.
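The mapping from a block offset to its segment is simple integer arithmetic. Assuming 4 KiB blocks (an assumption made for this sketch; the actual block size is defined in the crate), a 1 GiB segment spans 262,144 blocks:

```rust
const BLOCK_SIZE: u64 = 4096;      // assumption for this sketch
const SEGMENT_SIZE: u64 = 1 << 30; // 1 GiB
const BLOCKS_PER_SEGMENT: u64 = SEGMENT_SIZE / BLOCK_SIZE; // 262_144

/// Toy SegmentId: which 1 GiB range a block offset falls into.
fn segment_id(block_offset: u64) -> u64 {
    block_offset / BLOCKS_PER_SEGMENT
}

/// First block of a segment, useful for indexing into its bitmap.
fn segment_start(id: u64) -> u64 {
    id * BLOCKS_PER_SEGMENT
}

fn main() {
    assert_eq!(BLOCKS_PER_SEGMENT, 262_144);
    assert_eq!(segment_id(262_143), 0); // last block of segment 0
    assert_eq!(segment_id(262_144), 1); // first block of segment 1
    assert_eq!(segment_start(2), 524_288);
}
```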
Copy on Write
The Data Management Layer is also responsible for handling copy-on-write preservation. This is handled by checking whether any snapshot of the dataset contains the afflicted node (via `Generation`); if this is the case, the `dead_list` contained in the root tree is updated to contain the storage location of the old version of the node on the next sync.
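A minimal sketch of that decision follows. The types are toy stand-ins: the real `Generation` handling and dead list live in the root tree and are persisted on sync.

```rust
use std::collections::HashSet;

type Generation = u64;
type DiskLocation = u64;

struct Dataset {
    snapshots: HashSet<Generation>, // generations captured by snapshots
    dead_list: Vec<DiskLocation>,   // old versions still held by snapshots
    free_now: Vec<DiskLocation>,    // deallocated on the next sync
}

impl Dataset {
    /// Called when a node written in generation `born` is overwritten
    /// (copy-on-write). If a snapshot still references that generation,
    /// the old location goes on the dead list and is only freed once the
    /// snapshot is deleted; otherwise it can be deallocated right away.
    fn retire(&mut self, born: Generation, old_location: DiskLocation) {
        if self.snapshots.contains(&born) {
            self.dead_list.push(old_location);
        } else {
            self.free_now.push(old_location);
        }
    }
}

fn main() {
    let mut ds = Dataset {
        snapshots: HashSet::from([3]),
        dead_list: Vec::new(),
        free_now: Vec::new(),
    };
    ds.retire(3, 2000); // snapshotted generation: defer via dead list
    ds.retire(5, 1000); // unsnapshotted: free directly
    assert_eq!(ds.dead_list, vec![2000]);
    assert_eq!(ds.free_now, vec![1000]);
}
```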
Storage Pool
As the abstraction over specific hardware types and RAID configurations, the data management unit interacts with the storage pool layer for all I/O operations. Notable here is the division of the layer into (of course) storage tiers, `Vdev`s and `LeafVdev`s. There are 4 storage tiers available (`FASTEST`, `FAST`, `SLOW`, `SLOWEST`), each with at most 1024 `Vdev`s. Each `Vdev` can be one of four variants. First, a singular `LeafVdev`; this is the equivalent of a disk or any other file-path-backed interface, for example a truncated file or a disk `/dev/...`. Second, a RAID-1-like mirrored configuration. Third, a RAID-5-like striping and parity based setup with multiple disks. Fourth and last, a main-memory-backed buffer, simply held as a vector.
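The four variants can be summarized in a small enum. This is a sketch mirroring the description above, not the crate's actual types; the capacity rule is a deliberate simplification assuming equally sized leaves.

```rust
/// Sketch of the vdev variants described above.
enum Vdev {
    /// A single disk or file-backed device.
    Leaf(String),
    /// RAID-1-like: every leaf holds a full copy.
    Mirror(Vec<String>),
    /// RAID-5-like: striping plus parity across the leaves.
    Parity1(Vec<String>),
    /// Main-memory backed buffer of a fixed size in bytes.
    Memory(usize),
}

impl Vdev {
    /// Usable capacity, given the capacity of each leaf in bytes.
    fn usable_capacity(&self, leaf_capacity: usize) -> usize {
        match self {
            Vdev::Leaf(_) => leaf_capacity,
            // All mirrored copies hold the same data.
            Vdev::Mirror(_) => leaf_capacity,
            // One leaf's worth of space goes to parity.
            Vdev::Parity1(leaves) => leaf_capacity * (leaves.len() - 1),
            Vdev::Memory(bytes) => *bytes,
        }
    }
}

fn main() {
    let mirror = Vdev::Mirror(vec!["a".into(), "b".into()]);
    let raid = Vdev::Parity1(vec!["a".into(), "b".into(), "c".into()]);
    assert_eq!(mirror.usable_capacity(100), 100);
    assert_eq!(raid.usable_capacity(100), 200);
}
```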
Implementation
This section should help you find the module you'll want to implement your changes in by giving you a brief description of each module together with its location and references in the layer design.
| Name | Description |
|---|---|
| cache | Clock Cache implementation used internally |
| compression | Compression logic for indication and usage of compression algorithms (zstd only atm) |
| data_management | Allocation and Copy on Write logic for underlying storage space |
| database | The Database layer & Dataset implementation with snapshots |
| metrics | Basic Metric collections |
| object | The object store wrapper around the dataset store |
| storage_pool | The storage pool layer which manages different vdevs |
| tree | The actual B-epsilon tree |
| vdev | Implements the use of different devices for storage (Block, File, Memory) with different modes (parity, raid, single) |
Note that traits are heavily used to allow interaction between objects of different modules, traits implemented by one module might be located in multiple other modules.
Known Bugs
- On large write operations (easy to achieve with `Objectstore`) which overfill the storage, unexpected errors can be returned. This has been reduced by the introduction of space accounting, but some errors might still occur as not all checks have been implemented yet.
- Some tests are expected to fail in the current state of Haura. These tests are marked as `should_panic` in the `cargo test` output. They are largely concerned with storage space utilization and test extreme situations where no storage space is left on the available devices.
bectl
This application allows using the storage stack from a CLI, albeit suboptimally, as the complete stack has to be reinitialized on each access. At the moment it is best fit for debugging and demonstration purposes.
We briefly demonstrate how to build, configure and use it.
Build & Test
Building `bectl` is relatively simple but does require all dependencies of `betree_storage_stack` to be available. If you have not installed these yet, refer to the Building chapter.
Given all prerequisites are fulfilled, simply run
$ cargo build
from the `bectl` directory to build `bectl` on its own.
To avoid specifying the location of the binary or running `bectl` with `cargo run` on each invocation, you can also install the app via `cargo` to your local user.
$ cargo install --path .
There are not yet any tests provided for `bectl`, as its functionality is a rather simple mapping to `betree_storage_stack` functions. If we want to expand this in the future, we might want to ensure that `bectl` behaves correctly internally, too.
Basic Usage
To use the storage stack via `bectl` you will first need to define a valid configuration. This can be done either by modifying a small example like the one shown below, or by creating a valid configuration with the `betree_storage_stack` crate and calling `write_config_json` on the created `Database` object.
Example Configuration
{
"storage": {
"tiers": [
{
"top_level_vdevs": [
{
"path": "/home/user/.cache/haura/cache.disk",
"direct": true
}
],
"preferred_access_type": "Unknown"
}
],
"queue_depth_factor": 20,
"thread_pool_size": null,
"thread_pool_pinned": false
},
"alloc_strategy": [
[
0
],
[
1
],
[
2
],
[
3
]
],
"default_storage_class": 0,
"compression": "None",
"cache_size": 4294967296,
"access_mode": "OpenOrCreate",
"sync_interval_ms": null,
"migration_policy": null,
"metrics": null
}
Store this configuration in a convenient place, for example, if defined, under `$XDG_CONFIG_HOME/haura.json` or alternatively `$HOME/.config/haura.json`. To avoid specifying this path on each access, store the location of your configuration in your environment as `$BETREE_CONFIG`. For example:
# excerpt from .your_favorite_shellenv
export BETREE_CONFIG=$HOME/.config/haura.json
In our example we use `$XDG_CACHE_HOME/haura/cache.disk` as the storage file; this integrates nicely with most common setups and indicates that this data is not essential. If `XDG_CACHE_HOME` is not set on your system, you can use `$HOME/.cache/haura/cache.disk` instead. Create the directories if they do not exist.
$ mkdir -p $HOME/.cache/haura
$ truncate -s 16G $HOME/.cache/haura/cache.disk
EDITOR NOTE: We may want to specify this as `yaml` or `toml` in the future to ease the configuration and actually be able to write it by hand, besides the zfs-like configuration string.
Usage
Once you have configured everything, you are ready to use `bectl`. We show you here the very basics of how to start using the CLI; you can find further information in the CLI's various help pages.
Initialization
Before using it, you'll need to initialize your storage.
$ bectl db init
Write Data
To write some value baz under key bar in the dataset foo:
$ bectl kv foo put bar baz
Read Data
To read some data from dataset foo under key bar:
$ bectl kv foo get bar
fio-haura
This directory contains an external module implementing a storage engine for the Flexible I/O tester (fio). Fio allows free configuration of multiple I/O patterns, including pre-defined scenarios available as job files; have a look at its documentation to gather more information if you are unfamiliar with it.
Documentation: https://fio.readthedocs.io/en/latest/fio_doc.html
or
$ man fio
Build
Building Haura's fio ioengine requires some internal files which are not available in most provided fio packages. The Makefile automatically downloads and prepares the required files when building the module; to build the module run:
$ make
On build errors, consult the `configure.log` in the fio directory.
You'll find the ioengine in the `src/` directory.
Run fio with Haura
Important❗
To actually run the engine you need to first compile `betree_storage_stack` in release mode; for this, execute `cargo build --release` in `betree/`.
Furthermore, your system must be able to find the compiled library. For that, source the given environment file in `fio-haura/`.
$ source ./env.sh
Additionally, since Haura's configuration is more complex than what fio clients provide, a configuration has to be loaded. The configuration's path has to be stored in the environment under `BETREE_CONFIG`. See the bectl Basic Usage chapter for more information.
Running fio
`fio` can be configured with CLI options and job files; both have the same capabilities, so for brevity we will use CLI options here. In `fio-haura/jobfiles` you can find multiple job files which can be used with fio to specify these options in a more manageable way.
As an example to perform a simple IOPS test, you can use:
$ fio \
--direct=1 \
--rw=randwrite \
--random_distribution=zipf \
--bs=4k \
--ioengine=external:src/fio-engine-haura.o \
--numjobs=1 \
--runtime=30 \
--time_based \
--group_reporting \
--name=iops-test-job \
--eta-newline=1 \
--size=4G \
--io_size=2G
This starts an I/O benchmark using `--direct` access in a `--rw=randwrite` pattern with a blocksize of `--bs=4k` for each access. Furthermore, Haura is specified as `--ioengine=external:src/fio-engine-haura.o` and runs for `--runtime=30` seconds with `--numjobs=1`. The total size of I/O operations for each thread is `--io_size=2G`, which is the upper limit if the runtime is not reached.
❗ Random Workloads Caution
When using random workloads which surpass the size of the internal cache, or when explicitly syncing to disk, extensive fragmentation might appear. This leads to situations where, even though enough space is theoretically available, no contiguous space can be allocated, resulting in out-of-space errors.
To counteract this it is advised to:
- Increase the size of the cache
- Increase the underlying block size while retaining the same `io_size`
- Choose a random distribution with a higher skew to specific regions (e.g. zipf) to avoid frequent evictions of nodes from the internal cache
- Reduce the number of jobs; more jobs put more pressure on the cache, leading to more frequent evictions, which in turn cause more writeback operations, worsening fragmentation
As a general rule this boils down to two things: reduce the number of write operations and enlarge the allocation space.
`fio` prints a summary of the results at the end, which should look similar to this output:
fio --direct=1 --rw=randwrite --bs=4k --ioengine=external:src/fio-engine-haura.o --runtime=10 --numjobs=4 --time_based --group_reporting --name=iops-test-job --eta-newline=1 --size=4G --thread
iops-test-job: (g=0): rw=randwrite, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=haura, iodepth=1
...
fio-3.30
Starting 4 threads
Jobs: 4 (f=4): [w(4)][30.0%][w=12.3MiB/s][w=3147 IOPS][eta 00m:07s]
Jobs: 4 (f=4): [w(4)][54.5%][w=12.7MiB/s][w=3244 IOPS][eta 00m:05s]
Jobs: 4 (f=4): [w(4)][72.7%][w=12.8MiB/s][w=3283 IOPS][eta 00m:03s]
Jobs: 4 (f=4): [w(4)][90.9%][eta 00m:01s]
Jobs: 4 (f=4): [w(4)][100.0%][w=6422KiB/s][w=1605 IOPS][eta 00m:00s]
iops-test-job: (groupid=0, jobs=4): err= 0: pid=46232: Wed Mar 1 13:42:32 2023
write: IOPS=2182, BW=8729KiB/s (8938kB/s)(85.4MiB/10024msec); 0 zone resets
clat (nsec): min=12, max=30256, avg=184.83, stdev=460.93
lat (nsec): min=1122, max=1629.4M, avg=1828274.84, stdev=26162931.82
clat percentiles (nsec):
| 1.00th=[ 14], 5.00th=[ 17], 10.00th=[ 57], 20.00th=[ 60],
| 30.00th=[ 62], 40.00th=[ 64], 50.00th=[ 66], 60.00th=[ 68],
| 70.00th=[ 99], 80.00th=[ 179], 90.00th=[ 692], 95.00th=[ 812],
| 99.00th=[ 1272], 99.50th=[ 1672], 99.90th=[ 3088], 99.95th=[ 3568],
| 99.99th=[23424]
bw ( KiB/s): min= 1664, max=24480, per=100.00%, avg=10387.06, stdev=1608.07, samples=67
iops : min= 416, max= 6120, avg=2596.72, stdev=402.01, samples=67
lat (nsec) : 20=5.94%, 50=2.40%, 100=61.71%, 250=14.51%, 500=4.29%
lat (nsec) : 750=1.49%, 1000=8.25%
lat (usec) : 2=1.07%, 4=0.30%, 10=0.01%, 20=0.01%, 50=0.01%
cpu : usr=16.99%, sys=7.73%, ctx=15964, majf=0, minf=383252
IO depths : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
issued rwts: total=0,21874,0,0 short=0,0,0,0 dropped=0,0,0,0
latency : target=0, window=0, percentile=100.00%, depth=1
Run status group 0 (all jobs):
WRITE: bw=8729KiB/s (8938kB/s), 8729KiB/s-8729KiB/s (8938kB/s-8938kB/s), io=85.4MiB (89.6MB), run=10024-10024msec
Haura-specific flags
The implemented engine comes with some additional flags to modify the configuration of the started Haura instance. These flags only deactivate the translation of certain conditions, usually created in fio benchmarks, to Haura itself. This can be useful, for example, when using tiered storage setups which cannot be described with fio.
--disrespect-fio-files
Avoid transferring fio file configuration to haura. Can be
used to use specific disks regardless of fio specification.
--disrespect-fio-direct
Use direct mode only as specified in haura configuration.
--disrespect-fio-options
Disregard all fio options in Haura. This only uses the I/O
workflow as executed by fio. Take care to ensure
comparability with results of other engines.
More examples
Have a look at the examples directory of fio for more usage examples and job files.
When performing read-only benchmarks, the benchmarks include some prepopulation which, depending on the storage medium, might take some time to complete.
julea-betree
To make the `betree_storage_stack` usable from JULEA, we need to encapsulate the functionality by implementing the `Backend` type. For documentation of specifics refer to the JULEA documentation.
You will most likely not need to modify this code when implementing features in Haura.
Build
To build, you will need the generated bindings to JULEA; check the Building chapter. Furthermore, ensure that all dependencies are installed.
Once you are done, navigate to the `julea-betree` directory.
$ cargo build
The build process produces a shared library object which we further use with JULEA.
TODO
All this is not automated at the moment, although we would like it to be. We may simply evaluate the JULEA environment, if set, and copy our shared object there, so that we can actually use JULEA in the process.
Usage in JULEA
When the build finishes you'll find a `libobject_betree.so` in the `target` directory, either under `release` or `debug` depending on your build (`debug` is the default, so check there when you are unsure).
With the built library, we can switch to JULEA to integrate our results into their configuration. If you have not set up JULEA, have a look at their documentation. Follow their documentation until you configure JULEA with `julea-config`; from there on, this documentation uses a modified procedure.
Configure JULEA
For JULEA to use `betree_storage_stack` as its backend (for object storage) we have to set the appropriate flags for `julea-config`. The example below uses the user instance of Haura as used in the bectl documentation.
julea-config --user \
--object-servers="$(hostname)" --kv-servers="$(hostname)" --db-servers="$(hostname)" \
--object-backend=betree --object-path="$HOME/.config/haura.json" \
--kv-backend=lmdb --kv-path="/tmp/julea-$(id -u)/lmdb" \
--db-backend=sqlite --db-path="/tmp/julea-$(id -u)/sqlite"
Prepare the library
JULEA can't find the library in its current position, as it is located in the build directory of Haura. For debugging purposes we can simply copy the produced `libobject_betree.so` to the JULEA backend directory.
TODO
How JULEA loads backends, and from where, is still work in progress and likely to change in the future. This documentation reflects the current state of progress, but only for the `Debug` build of JULEA.
To copy the library to JULEA with the loaded JULEA environment run:
# The change between underscore and hyphen is not a typo but a mismatch
# between allowed names in cargo and expected names in JULEA.
$ cp target/debug/libobject_betree.so $JULEA_BACKEND_PATH/libobject-betree.so
Start JULEA
To start JULEA, navigate back to its directory and run
$ ./scripts/setup.sh start
To check whether everything worked, inspect your logs, ideally with
$ journalctl -e GLIB_DOMAIN=JULEA
On success this should look like this (some details will look different on your machine):
Apr 10 11:46:31 nicomedia julea-server[15742]: Loaded object backend betree.
Apr 10 11:46:32 nicomedia julea-server[15742]: Initialized object backend betree.
Apr 10 11:46:32 nicomedia julea-server[15742]: Loaded kv backend lmdb.
Apr 10 11:46:32 nicomedia julea-server[15742]: Initialized kv backend lmdb.
Apr 10 11:46:32 nicomedia julea-server[15742]: Loaded db backend sqlite.
Apr 10 11:46:32 nicomedia julea-server[15742]: Initialized db backend sqlite.
If everything worked fine so far you can run the JULEA test suite with
$ ./scripts/test.sh
A number of tests will then be executed, and reports on failures will be generated for you.
julea-sys
`julea-sys` contains generated bindings to the JULEA headers via bindgen. They are used internally by the `julea-betree` crate to provide the Haura JULEA backend. You will likely not need to modify the code itself (which is not a lot), but rather take care of the build process, which we explain here shortly.
Build
Most of the build process should work without any intervention; you just need to point the generation to the JULEA headers.
Importantly, you have to provide the `JULEA_INCLUDE` environment variable, which points to the `include` directory of the JULEA repository.
Also ensure that you have already installed all dependent libraries; for reference, check the Building chapter.
$ git clone https://github.com/parcio/julea.git
$ export JULEA_INCLUDE=$PWD/julea/include
$ export BINDGEN_EXTRA_CLANG_ARGS="$(pkg-config --cflags glib-2.0) $(pkg-config --cflags libbson-1.0)"
Documentation
Similar to the other packages, you can open a basic documentation of the crate with `cargo`.
$ cargo doc --open
RFCs
This chapter contains some RFCs (Requests for Comments). Since not too many people cooperate on this project, they are kept relatively light-weight and may not be equivalent to the RFCs you know from other projects.
A template can be found in the source of this documentation under `docs/src/rfc/0-template.md`. All RFCs are meant to give insight into why we should modify certain behavior and which drawbacks and limitations such a change implies.
- Title: Pivot Keys
- Status: DRAFT
Summary/Motivation
This RFC adds an identifier which allows node-defined access to structural elements in the B-epsilon tree. This measure allows us to perform operations which depend more on the actual layout of the tree than on the semantic content stored under individual keys. Furthermore, in combination with the recently added migration policies, we gain the option to migrate singular nodes and improve the reporting scheme by using more stable identifiers compared to the previously used `DiskOffset`s.
Description
p1 p2 p3
┌───────┬───────┬───────┬───────┐
│ │ │ │ │
│ Left │ Right │ Right │ Right │
│ Outer │ │ │ │
└───┬───┴───┬───┴───┬───┴───┬───┘
│ │ │ │
▼ ▼ ▼ ▼
This RFC proposes a new identification scheme for nodes within the B-epsilon tree. We design it to be as unintrusive as possible to the structure and current implementation of the tree to avoid undesired side effects later on. This includes reducing the dependency on state, as we show in the description here. The basis of the "PivotKey" are the pivot elements already present in the tree. We use an enum indicating the position of the node relative to its parent.
```rust
type Pivot = CowBytes;

enum LocalPivotKey {
    LeftOuter(Pivot),
    Right(Pivot),
    Root,
}
```
The property we are piggy-backing on is that pivots are searchable and unique; furthermore, we can structurally define `PivotKey` directions, which keeps the required memory space relatively low as only a single pivot key is required. Finally, pivot keys are more persistent than normal keys in the collection, as they are only refreshed when the structure of the tree changes. This might happen on rebalancing, although, when we consider this, any node-based algorithm needs to reconsider its decisions anyway.
To make the pivot key usable over all datasets in a database (which can have overlapping key sets), we require additional information to direct the pivot key to the correct dataset. This can be done by adding a `DatasetId` to the key.
```rust
type Pivot = CowBytes;

enum PivotKey {
    LeftOuter(Pivot, DatasetId),
    Right(Pivot, DatasetId),
    Root(DatasetId),
}
```
Also, as the root node of a tree does not have a parent which could index the node by its pivot, we add another variant which simply denotes that the root of a given dataset is to be chosen.
We propose to internally use two kinds of pivot keys. First, the global `PivotKey`, which is structured as shown above, including the `DatasetId`. Second, the scoped `LocalPivotKey`, which can offer us some advantages when designing interfaces for node operations that do not know which tree they are located in. This alleviates the need to pass `DatasetId`s around to layers which are normally unaffected by them. The transformation from `LocalPivotKey` to `PivotKey` is always possible, as all local keys a tree layer will encounter belong to the tree itself; the reverse direction is not guaranteed to be valid and should therefore be excluded in the implementation.
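The one-way transformation could look like the sketch below. It is based on the enums above; `CowBytes` and `DatasetId` are stubbed out with simple types, and the method name `to_global` is an assumption for illustration.

```rust
// Stub types for this sketch; the crate defines richer versions.
type Pivot = Vec<u8>;
type DatasetId = u64;

#[derive(Debug, PartialEq)]
enum LocalPivotKey {
    LeftOuter(Pivot),
    Right(Pivot),
    Root,
}

#[derive(Debug, PartialEq)]
enum PivotKey {
    LeftOuter(Pivot, DatasetId),
    Right(Pivot, DatasetId),
    Root(DatasetId),
}

impl LocalPivotKey {
    /// Every local key a tree layer sees belongs to that tree, so
    /// attaching the tree's DatasetId is always valid. The reverse
    /// direction (global -> local) is deliberately not implemented.
    fn to_global(self, d_id: DatasetId) -> PivotKey {
        match self {
            LocalPivotKey::LeftOuter(p) => PivotKey::LeftOuter(p, d_id),
            LocalPivotKey::Right(p) => PivotKey::Right(p, d_id),
            LocalPivotKey::Root => PivotKey::Root(d_id),
        }
    }
}

fn main() {
    let local = LocalPivotKey::Right(b"pivot".to_vec());
    assert_eq!(local.to_global(7), PivotKey::Right(b"pivot".to_vec(), 7));
    assert_eq!(LocalPivotKey::Root.to_global(7), PivotKey::Root(7));
}
```

Omitting the inverse conversion at the type level enforces the invariant from the paragraph above: only the layer that knows the dataset can widen a key's scope.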
Integration
The given `PivotKey` structure can be used in `ObjRef` to add a measure of identification to the cache location or disk location.
We can then use the `PivotKey` of a tree node to perform operations on specific nodes which fulfill certain conditions, such as access frequency or access probability. This is helpful in a number of scenarios such as data prefetching or disk-to-disk migrations.
Since pivots are stored in the internal nodes, we would be required to read a substantial amount of data to retrieve knowledge about all existing pivot keys. This limits efficient usage to scenarios in which we retrieve pivot keys, for example from messages emitted by the DMU, and record them as they are used by the user. Previously, a similar scheme was used by the migration policies, which recorded disk offsets and sent hints to the DMU about which tier a node is advised to be written to.
This limited hints to often-accessed nodes, which are likely to be migrated to faster storage, as rarely accessed nodes would not encounter the sent hints. With `PivotKey`s we could actively migrate nodes downwards as well, which is the main advantage of `PivotKey`s. This can be useful in scenarios with additional small layers like NVRAM, where we are then more flexible to migrate granular amounts of data for better tier utilization.
Drawbacks
The implementation does not need extra members in the node, as we can generate the valid `PivotKey`s when reading the tree from disk. However, with the current deserialization scheme (deserializing directly into valid nodes) this creates hidden states: either nodes with reconstructed `PivotKey`s or nodes missing valid `PivotKey`s, which would encounter errors when trying to actually read data from their children. To avoid this, the deserialization scheme would need to be adjusted to deserialize into proto-types ("InternalNodeProto", or whichever name) which would be validated via a transformation `InternalNodeProto` -> `InternalNode`. This makes the deserialization code a bit more involved, but clearer, to avoid misuse.
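The proto-type validation step could be sketched as follows. The names and fields are hypothetical; only the pattern of "deserialize into a proto, then validate into a real node" is taken from the text above.

```rust
// Hypothetical stub types for this sketch.
type PivotKey = String;

/// What deserialization would produce directly: no PivotKeys yet.
struct InternalNodeProto {
    pivots: Vec<Vec<u8>>,
}

/// A node that is guaranteed to carry valid PivotKeys.
struct InternalNode {
    pivots: Vec<Vec<u8>>,
    pivot_keys: Vec<PivotKey>,
}

impl InternalNodeProto {
    /// The only way to obtain an InternalNode: the transformation
    /// reconstructs the PivotKeys, so no half-initialized node can
    /// leak into the rest of the code.
    fn validate(self, dataset: &str) -> InternalNode {
        let pivot_keys = self
            .pivots
            .iter()
            .map(|p| format!("{dataset}:{}", String::from_utf8_lossy(p)))
            .collect();
        InternalNode { pivots: self.pivots, pivot_keys }
    }
}

fn main() {
    let proto = InternalNodeProto { pivots: vec![b"a".to_vec(), b"k".to_vec()] };
    let node = proto.validate("ds0");
    assert_eq!(node.pivot_keys, vec!["ds0:a".to_string(), "ds0:k".to_string()]);
}
```

Encoding the "validated" state in a separate type is the standard Rust way to make the hidden state described above impossible to construct by accident.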
Alternatives
The search for alternatives is difficult, as not many characteristics are present which make nodes both identifiable and searchable. Something like ids, for example, could make nodes identifiable, but from an id we cannot search for the node in the tree.
An alternative to the pivot method could be path construction, which relies on the structural uniqueness of the tree: we construct a path from the root by giving directions for each step downwards until the desired node is located. It could look similar to this:
```rust
enum Path {
    Child(usize, Box<Path>),
    Terminate,
}
```
This method does not provide many advantages and carries the disadvantage of having to maintain additional state about the just-traversed distance when constructing keys. Arguably, this is not complicated semantically, but many methods would be affected by this change. Also, misidentification may become possible, as paths may be shifted around when subtrees are reconstructed. With `PivotKey`s such cases instead result in a failed search, indicating an outdated key.
Currently, the only advantage this method would have over the `PivotKey` method is that its size can be expected to remain comparatively low, with 5 bytes for each element in the search path. A restriction of the key size, as discussed in the corresponding issue, could solve this problem and is already in discussion.