DataRoaster is now open source: why I created it

Kidong Lee
Published in ITNEXT
2 min read · Sep 9, 2021

Photo by Nathan Dumlao on Unsplash

DataRoaster is a tool for provisioning data platforms that run on Kubernetes. I recently open-sourced it.

Before I developed DataRoaster, I used free data platforms such as HDP (Hortonworks Data Platform) to build data lakes. After Hortonworks was acquired by Cloudera, HDP was no longer free. You can also build a data lake with managed cloud services like AWS EMR, but then you have to consider the cost of using such services from public cloud providers.

As mentioned in A Concept: Kubernetes based Private Cloud Platform, I have been looking for an alternative to commercial data platforms and managed cloud services. DataRoaster is the result of several months spent implementing that concept.

DataRoaster provides several data platform components, for instance Hive Metastore, Spark Thrift Server, Trino, Redash, JupyterHub, and Kafka. The Spark Thrift Server component originated from my earlier blog post, Hive on Spark in Kubernetes. With DataRoaster, Spark Thrift Server, as an implementation of the Hive on Spark concept, can now be deployed on Kubernetes easily.

There is a DataRoaster demo. The architecture of this demo looks like this.

The scenario of the demo is:

  • Create a Parquet table in S3-compatible object storage, provided by Ceph, by running a Spark example job that uses Hive Metastore.
  • Query the data in the Parquet table stored in Ceph using Spark Thrift Server and Trino, both of which use Hive Metastore.
  • Query the data from Redash and Jupyter through their connectors to Spark Thrift Server and the Trino coordinator.
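
To make the scenario more concrete, here is a minimal Python sketch of the settings and statements such a setup typically involves. It is not taken from DataRoaster or the demo. The function names, hosts, bucket, table name, and the Trino catalog name `hive` are hypothetical, and the ports are the usual defaults for HiveServer2-compatible servers and Trino, which may differ in your deployment. The configuration keys are standard Spark and Hadoop S3A properties.

```python
from typing import Dict, List


def spark_conf_for_demo(s3_endpoint: str, metastore_uri: str,
                        access_key: str, secret_key: str) -> Dict[str, str]:
    """Spark settings for an S3-compatible store (e.g. Ceph) plus Hive Metastore."""
    return {
        # Register tables in the shared Hive Metastore, not an in-memory catalog.
        "spark.sql.catalogImplementation": "hive",
        "spark.hadoop.hive.metastore.uris": metastore_uri,
        # Hadoop S3A connector settings for a non-AWS endpoint.
        "spark.hadoop.fs.s3a.endpoint": s3_endpoint,
        "spark.hadoop.fs.s3a.path.style.access": "true",
        "spark.hadoop.fs.s3a.access.key": access_key,
        "spark.hadoop.fs.s3a.secret.key": secret_key,
    }


def to_submit_args(conf: Dict[str, str]) -> List[str]:
    """Render the settings as `--conf key=value` arguments for spark-submit."""
    args: List[str] = []
    for key, value in sorted(conf.items()):
        args += ["--conf", f"{key}={value}"]
    return args


def demo_statements(bucket: str, table: str = "demo_parquet") -> List[str]:
    """Spark SQL for the scenario: create a Parquet table, load rows, query it."""
    location = f"s3a://{bucket}/{table}"
    return [
        f"CREATE TABLE IF NOT EXISTS {table} (id BIGINT, name STRING) "
        f"USING PARQUET LOCATION '{location}'",
        f"INSERT INTO {table} VALUES (1, 'espresso'), (2, 'latte')",
        f"SELECT id, name FROM {table} ORDER BY id",
    ]


def jdbc_urls(sts_host: str, trino_host: str, schema: str = "default") -> Dict[str, str]:
    """JDBC URLs a client could use; ports and the 'hive' catalog are assumptions."""
    return {
        # Spark Thrift Server speaks the HiveServer2 protocol.
        "spark_thrift_server": f"jdbc:hive2://{sts_host}:10000/{schema}",
        # Trino URLs have the form jdbc:trino://host:port/catalog/schema.
        "trino": f"jdbc:trino://{trino_host}:8080/hive/{schema}",
    }


if __name__ == "__main__":
    conf = spark_conf_for_demo(
        "http://ceph-rgw.example:7480", "thrift://metastore.example:9083",
        "ACCESS_KEY", "SECRET_KEY",
    )
    print(" ".join(to_submit_args(conf)))
    for statement in demo_statements("demo-bucket"):
        print(statement + ";")
    print(jdbc_urls("spark-thrift.example", "trino.example"))
```

The first two statements would be run by the Spark job, and the final `SELECT` is the kind of query that Spark Thrift Server or Trino could then serve, since both resolve the table through the same Hive Metastore.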

However, to build the data lake in this demo yourself, you have to install several components, such as Hive Metastore, Hive on Spark, Trino, Redash, and JupyterHub, and you need to know how to install them on a container orchestrator like Kubernetes. That is not easy. As shown in the demo video, DataRoaster lets you create such data lakes quickly and easily.

DataRoaster is open source and free. To use DataRoaster, visit here.
