Kickdrum delivers data pipeline to empower advanced analytics for time tracking client

When a next-generation time tracking software company approached Kickdrum about a new data migration project, they had several objectives.

  • First, they wanted to validate that data from their existing, highly sharded database could be ingested into a single data warehouse and still yield meaningful, relevant data. This was not guaranteed given the amount of customization available to customers throughout the application.

  • Second, they wanted to deploy a production-ready system that could be used for advanced analytics and machine learning.

  • Third, they wanted to make sure that additional data sources could be added to the data ingestion process on a regular basis without major development work.

To address these functional needs, Kickdrum approached the problem as a series of proof-of-concept implementations, opting for fast, iterative demonstrations followed by production hardening and feature development:

  • A Proof of Concept Data Ingestion Pipeline, ingesting the Postgres shards as individual data objects into a data lake alongside objects from a secondary data stream, served as the initial demonstration of technology.

  • Upon demonstration of the proof of concept, functionality was added to meet production-readiness requirements.

  • To streamline data onboarding for future sources, a code-free ETL process was designed and implemented.

To develop the initial Proof of Concept, Kickdrum implemented a variation of the AWS Reference Architecture for Data Analytics Pipelines. The initial data pipeline consisted of a Kinesis processor for inbound data, an S3 landing zone, an intermediate-zone ETL function to perform basic casting and data transformation, and an S3-based intermediate zone to store the transformed data objects. To demonstrate the capabilities of the pipeline, Kickdrum used Amazon Athena to query the data stored in the intermediate zone.
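The pipeline's internals are not published, but as a rough sketch of the intermediate-zone ETL step, an S3-triggered Lambda handler could read raw objects from the landing zone, apply basic casting, and write the result to the intermediate zone. Bucket names, field names, and the casts shown here are illustrative assumptions, not the client's actual schema.

```python
# Hypothetical sketch of the intermediate-zone ETL step as an S3-triggered
# Lambda handler. Bucket names, keys, and field casts are illustrative only.
import json
import urllib.parse

import boto3

s3 = boto3.client("s3")

INTERMEDIATE_BUCKET = "example-intermediate-zone"  # assumed bucket name


def handler(event, context):
    for record in event.get("Records", []):
        bucket = record["s3"]["bucket"]["name"]
        key = urllib.parse.unquote_plus(record["s3"]["object"]["key"])

        # Read the raw object written to the landing zone.
        raw = s3.get_object(Bucket=bucket, Key=key)["Body"].read()
        payload = json.loads(raw)

        # Basic casting / normalization before the object moves downstream.
        transformed = {
            "tenant_id": str(payload.get("tenant_id", "")),
            "event_time": payload.get("event_time"),       # kept as an ISO-8601 string
            "duration_minutes": int(payload.get("duration", 0)),
            "custom_fields": payload.get("custom", {}),     # preserved for later analysis
        }

        # Write the transformed object to the S3-based intermediate zone,
        # partitioned by tenant so Athena can query it efficiently.
        out_key = f"tenant={transformed['tenant_id']}/{key.split('/')[-1]}"
        s3.put_object(
            Bucket=INTERMEDIATE_BUCKET,
            Key=out_key,
            Body=json.dumps(transformed).encode("utf-8"),
        )
```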

Within a matter of weeks, Kickdrum demonstrated that a data pipeline could ingest data from both the Postgres database and a secondary data stream, and that the two sources could be queried together to derive insights not otherwise possible. The implementation also provided visibility into the extent of custom object usage and enabled analysis of how to consolidate and standardize commonly used custom objects into a consistent set of application objects, a valuable insight for the application product team.

With an initial prototype complete, Kickdrum set about solidifying a production-ready architecture, starting with the ingestion mechanism. Three main classes of source systems existed: existing single-tenant client databases, application event data, and data from trackers implemented by end users. API Gateway served as the front end for the sources to post data, with a simplified declarative metadata wrapper allowing payloads from multiple source systems to be processed via a common set of APIs.
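The wrapper format itself is not documented publicly; as a minimal sketch of the idea, a single ingestion endpoint behind API Gateway could inspect a declarative envelope and route the payload onto the stream. The field names (source_system, schema_version, tenant_id, payload) and the stream name are assumptions made for illustration.

```python
# Illustrative sketch of a common ingestion endpoint behind API Gateway.
# The wrapper fields and stream name are hypothetical stand-ins for the
# client's actual metadata envelope.
import json

import boto3

kinesis = boto3.client("kinesis")

STREAM_NAME = "example-ingestion-stream"  # assumed stream name
KNOWN_SOURCES = {"tenant_db", "app_events", "user_trackers"}


def handler(event, context):
    envelope = json.loads(event["body"])

    # Declarative metadata tells the pipeline how to treat the payload,
    # so one API can accept all three classes of source system.
    source = envelope.get("source_system")
    if source not in KNOWN_SOURCES:
        return {"statusCode": 400, "body": f"unknown source_system: {source}"}

    record = {
        "source_system": source,
        "schema_version": envelope.get("schema_version", "1"),
        "tenant_id": envelope["tenant_id"],
        "payload": envelope["payload"],
    }

    # Partition by tenant so a single tenant's events stay ordered.
    kinesis.put_record(
        StreamName=STREAM_NAME,
        Data=json.dumps(record).encode("utf-8"),
        PartitionKey=record["tenant_id"],
    )
    return {"statusCode": 202, "body": "accepted"}
```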

To ensure the architecture could handle future data sources, the Kinesis processor was updated to autoscale. To control costs at scale, Kickdrum developed an Event Aggregator to batch objects from the Kinesis stream into the S3 landing zone. Additional cost controls were achieved through scheduled transfers of data from S3 to Glacier, and automated tenant deletion controls were created to support regulatory data destruction requirements. Finally, a Snowflake curated zone was created to hold data in a final state, ready for analytics.
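As a hedged sketch of the storage-side controls described above, the Glacier transition can be expressed as an S3 lifecycle rule and tenant deletion as a prefix-scoped purge. The bucket name, prefix layout, and 90-day threshold below are assumptions, not the project's actual settings.

```python
# Sketch of the S3 cost and retention controls: lifecycle transition to
# Glacier plus tenant-scoped deletion. Names and thresholds are assumed.
import boto3

s3 = boto3.client("s3")

LANDING_BUCKET = "example-landing-zone"  # assumed bucket name


def apply_glacier_lifecycle():
    """Transition aged landing-zone objects to Glacier to control storage cost."""
    s3.put_bucket_lifecycle_configuration(
        Bucket=LANDING_BUCKET,
        LifecycleConfiguration={
            "Rules": [
                {
                    "ID": "archive-aged-raw-objects",
                    "Filter": {"Prefix": ""},          # all objects in the bucket
                    "Status": "Enabled",
                    "Transitions": [{"Days": 90, "StorageClass": "GLACIER"}],
                }
            ]
        },
    )


def delete_tenant_objects(tenant_id: str):
    """Remove every object under a tenant prefix for regulatory data destruction."""
    paginator = s3.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=LANDING_BUCKET, Prefix=f"tenant={tenant_id}/"):
        keys = [{"Key": obj["Key"]} for obj in page.get("Contents", [])]
        if keys:
            s3.delete_objects(Bucket=LANDING_BUCKET, Delete={"Objects": keys})
```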

High-level processing flow of objects through the ingestion pipeline.


To assist in the Curated Zone ETL process, AWS Glue DataBrew was used to view and manage the data from the intermediate zone. DataBrew provides analysts with an easy-to-use GUI to view and manipulate sample data and create ETL processes. As an AWS Advanced Partner, Kickdrum's developers and architects pay close attention to new AWS products; they took note of DataBrew's announcement at re:Invent 2020 and recognized it as a valuable tool for data pipeline and data migration projects. Once Kickdrum set up the supporting Glue jobs, DataBrew generated a preview of the data set that users could manipulate with common data prep functions such as SPLIT, MERGE, JOIN, and REPLACE, as well as filters that remove rows with missing or duplicate fields.
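Recipes built interactively in the console can also be registered programmatically through boto3. The sketch below is for orientation only: the recipe name, columns, operation names, and parameters are illustrative, and real steps must use the action names published in the DataBrew recipe reference.

```python
# Rough sketch of registering a DataBrew recipe with boto3. Operation names
# and parameters below are illustrative placeholders, not verified actions.
import boto3

databrew = boto3.client("databrew")

databrew.create_recipe(
    Name="example-timesheet-cleanup",          # assumed recipe name
    Description="Normalize timesheet rows before loading the curated zone",
    Steps=[
        {
            # Each step is an Action (operation + parameters), mirroring what
            # an analyst would build interactively in the DataBrew console.
            "Action": {
                "Operation": "REMOVE_DUPLICATES",          # illustrative
                "Parameters": {"sourceColumns": '["entry_id"]'},
            }
        },
        {
            "Action": {
                "Operation": "REPLACE_TEXT",               # illustrative
                "Parameters": {
                    "sourceColumn": "status",
                    "pattern": "In-Progress",
                    "value": "IN_PROGRESS",
                },
            }
        },
    ],
)
```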

The AWS DataBrew visualization and manipulation screen.

At this stage, the Kickdrum team could have utilized Lambdas, Step Functions, or Glue Python jobs to clean and transform data for the Curated Zone. But new sources would be added to the data pipeline frequently, and each new source would involve coordination between developers and business owners to define the appropriate transformations to be applied. Instead, Kickdrum developed tools to productize the use of DataBrew as the primary ETL design and deployment methodology.

Using DataBrew, Kickdrum analysts reviewed the data and defined sets of operations, called "Recipes," to ready the data for the Curated Zone. Because the team had selected Snowflake as the new data warehouse platform, these Recipes could then be run in scheduled Glue jobs and the results inserted directly using the Snowflake database connection type in DataBrew. Knowing the back-and-forth usually associated with adding a new data source, the Kickdrum team estimated that new data sources could be brought online weeks earlier than with other methods.
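A hedged sketch of that scheduled load, assuming a published recipe and a Snowflake connection already configured in Glue, might look like the following. The job name, dataset, role ARN, connection name, table, and cron expression are all placeholders.

```python
# Sketch of scheduling a DataBrew recipe job whose output lands in Snowflake
# through a Glue connection. All names, ARNs, and schedules are assumed.
import boto3

databrew = boto3.client("databrew")

databrew.create_recipe_job(
    Name="example-timesheet-to-snowflake",
    DatasetName="example-intermediate-timesheets",        # assumed DataBrew dataset
    RecipeReference={"Name": "example-timesheet-cleanup", "RecipeVersion": "1.0"},
    RoleArn="arn:aws:iam::123456789012:role/example-databrew-role",
    DatabaseOutputs=[
        {
            "GlueConnectionName": "example-snowflake-connection",  # assumed Glue connection
            "DatabaseOptions": {"TableName": "CURATED_TIMESHEETS"},
            "DatabaseOutputMode": "NEW_TABLE",
        }
    ],
)

# Run the job nightly so the curated zone stays current.
databrew.create_schedule(
    Name="example-nightly-curated-load",
    JobNames=["example-timesheet-to-snowflake"],
    CronExpression="cron(0 6 * * ? *)",
)
```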

To make DataBrew user-ready, Kickdrum built additional infrastructure supporting the management of Recipes. Add, Edit, and Validate APIs were created to manage DataBrew Recipes, and because recipes can be exported to JSON files, authorized users could create, update, and deploy new ETL processes without writing a single line of code.
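As an illustration of what one of these management endpoints could do, the sketch below shows a hypothetical "Edit" handler that accepts the Steps array from an exported recipe JSON file, applies it to the working recipe, and publishes a new version. The endpoint shape, request fields, and validation rules are assumptions.

```python
# Illustrative "Edit" API handler for recipe management: an authorized user
# uploads steps exported from DataBrew as JSON, and the handler applies and
# publishes them. Request field names are hypothetical.
import json

import boto3

databrew = boto3.client("databrew")


def edit_recipe_handler(event, context):
    body = json.loads(event["body"])
    recipe_name = body["recipe_name"]
    steps = body["steps"]  # the "Steps" array from an exported recipe JSON file

    # Lightweight validation before touching DataBrew.
    if not steps or any("Action" not in step for step in steps):
        return {"statusCode": 400, "body": "each step must contain an Action"}

    # Apply the edited steps to the working version of the recipe...
    databrew.update_recipe(Name=recipe_name, Steps=steps)

    # ...then publish a new immutable version for use in scheduled jobs.
    databrew.publish_recipe(Name=recipe_name, Description="published via recipe API")

    return {"statusCode": 200, "body": f"recipe {recipe_name} updated and published"}
```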

By the end of the project, Kickdrum had delivered a pipeline that met all of the project's original requirements:

  • A new data warehouse was deployed and data from the existing Postgres database was migrated, empowering analytics that were not possible in the legacy environment.

  • The data pipeline provided a way to accept data from multiple sources, processing data objects through a common API and landing/intermediate zones.

  • A GUI-based ETL tool in the form of AWS Glue DataBrew, coupled with APIs to manage recipes and declarative payload wrappers, enabled new data sources to be processed from API ingestion to ETL processing without any new lines of code.

About Kickdrum

Kickdrum partners with technology investors, boards, and executives to accelerate enterprise value. To learn more about Kickdrum, please visit https://kickdrum.com/.