Databases

Databases lie at the center of data analysis. They provide the storage and retrieval functionality to enable quick modification and querying of data. While from a theoretical perspective, a database can be implemented as just a list of data, realistic implementations are concerned with the tradeoffs between time and space.

In order to speed up querying, databases create indices on different data in a table. These are represented in data structures which are more efficient to search, most notably as a tree or hashtable. With a tree we can perform a binary search on the data which has a time complexity of $\mathcal{O}(\log{}n)$. Most databases use a variant of the binary tree called a B-tree which allows multiple children to each node. This is preferable due to the characteristics of retrieving data from either memory or disk. Since the system will read the entire cache line or block, we may as well utilize that entire space describing each node. An example of this can be seen in the sqlite specification.

Indexing Techniques

Here is the list of indexing techniques.

  • B-tree
  • Quadtree, Octree
  • Bloom filter

What we are looking for is to be able to create a pipeline which is mutated based on statistics and operating conditions. For different types of data there are different approaches for indexing and different operations which can be performed on the data.

Graph Databases

Triples

Graphs are described in terms the relationships between nodes. The fundamental data structure is the triple which contains a subject, a predicate, and an object. The subject and object describes the two nodes and the predicate describes the relationship described by the edge.

Property Graph Model

While triples are the core data type of graph databases, properties on either the nodes or edges can be useful. This model extends the base triple format by allowing key value properties to be attached to nodes or edges. Neo4j is one example of a graph database which implements this model.

Time-series Databases

Time-series databases operate much like other databases except they optimize based on the nature of most time-series data. These optimizations are based on the pattern that most time-series data is appended and provided in chronological order. We can then expect the most recent entries to change and that statistics over regions of the database remain relatively consistent.

There are also a number of common mathematical functions which we often want to compute over different ranges of a time-series. We can cluster data and precompute many of these statistical values in chunks, drastically reducing the computation for an arbitrary given range.

Further Reading