Preprint of Ballet

Posted on Fri 18 December 2020 in research

I'm excited to share that we have posted a preprint to arXiv of our paper, "Enabling collaborative data science development with the Ballet framework." This preprint summarizes our work on the Ballet framework for collaborative, open-source data science development.

Though there is much potential in building predictive models in a collaborative, open-source setting, we haven't seen collaborations on the scale that exists in open-source software. The main obstacle we identify is that existing informal approaches to collaboration in data science don't allow a project to scale beyond a handful of collaborators. To address this, we introduce a conceptual framework for data science collaboration, based on defining modular data science patches – like feature definitions – that can be automatically combined into products – like a feature engineering pipeline. To be accepted, a proposed patch must not only pass predefined unit tests that check that it works correctly, but also pass an evaluation of its machine learning performance, such as a streaming feature selection algorithm. We instantiate these ideas in Ballet, a general framework that comes with first-class support for feature engineering collaborations. Ballet automatically collects feature definitions from a repository and composes them into an executable feature engineering pipeline, and provides functionality to validate contributed feature definitions both locally and in a continuous integration (CI) environment. At any time, consumers can install the latest version of the project and be assured that they will have access to a pipeline that will extract high-quality features for their data.
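
To make the patch-and-acceptance idea concrete, here is a minimal sketch in plain Python. It does not use Ballet's actual API: `FeatureDefinition`, `passes_unit_tests`, `passes_ml_evaluation`, `propose`, and `build_pipeline` are hypothetical names, the toy data is made up, and a simple correlation threshold stands in for a real streaming feature selection algorithm.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

Row = Dict[str, float]


@dataclass(frozen=True)
class FeatureDefinition:
    """Hypothetical stand-in for a modular patch (not Ballet's real API)."""
    name: str
    transform: Callable[[Row], float]


def extract(feature: FeatureDefinition, rows: List[Row]) -> List[float]:
    return [feature.transform(row) for row in rows]


def pearson(xs: List[float], ys: List[float]) -> float:
    """Pearson correlation; returns 0.0 if either input is constant."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return 0.0 if vx == 0 or vy == 0 else cov / (vx * vy) ** 0.5


def passes_unit_tests(feature: FeatureDefinition, rows: List[Row]) -> bool:
    """The patch must run and yield one non-missing number per row."""
    try:
        values = extract(feature, rows)
    except Exception:
        return False
    return len(values) == len(rows) and all(
        isinstance(v, (int, float)) and v == v for v in values  # v == v rejects NaN
    )


def passes_ml_evaluation(feature, accepted, rows, target,
                         min_relevance=0.3, max_redundancy=0.95) -> bool:
    """Assumed stand-in for feature selection: relevant and not redundant."""
    values = extract(feature, rows)
    if abs(pearson(values, target)) < min_relevance:
        return False
    return all(abs(pearson(values, extract(other, rows))) <= max_redundancy
               for other in accepted)


def propose(feature, accepted, rows, target) -> List[FeatureDefinition]:
    """Return the accepted features after considering one proposed patch."""
    if passes_unit_tests(feature, rows) and passes_ml_evaluation(
            feature, accepted, rows, target):
        return accepted + [feature]
    return accepted


def build_pipeline(accepted: List[FeatureDefinition]):
    """Compose the accepted patches into one executable product."""
    def pipeline(rows: List[Row]) -> List[List[float]]:
        return [[f.transform(row) for f in accepted] for row in rows]
    return pipeline


# Toy data (made up for illustration).
rows = [{"age": 25, "hours": 20}, {"age": 35, "hours": 40},
        {"age": 45, "hours": 45}, {"age": 55, "hours": 60}]
target = [10.0, 30.0, 40.0, 60.0]

proposals = [
    FeatureDefinition("hours", lambda r: r["hours"]),
    FeatureDefinition("hours_doubled", lambda r: 2 * r["hours"]),   # redundant
    FeatureDefinition("broken", lambda r: r["income_bracket"]),     # raises KeyError
    FeatureDefinition("is_young", lambda r: 1.0 if r["age"] < 30 else 0.0),
]

accepted: List[FeatureDefinition] = []
for proposal in proposals:
    accepted = propose(proposal, accepted, rows, target)

pipeline = build_pipeline(accepted)
print([f.name for f in accepted])
print(pipeline(rows))
```

In this sketch only `hours` and `is_young` are accepted: `hours_doubled` is rejected as redundant with `hours`, and `broken` fails the unit tests. A real project would run both checks in CI on each pull request rather than in a local loop.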

Contributors to a Ballet project can use Assemblé, a cloud-based development workflow, to easily develop feature definitions in a familiar notebook environment using an interactive client. Completed feature definitions can be submitted to the upstream project as pull requests with one click, without leaving the notebook. But at the end of the day, a Ballet project is just a git repository, so experienced contributors can develop locally using their preferred tooling.

We evaluate Ballet in a detailed mixed-methods case study of an income prediction problem. In predict-census-income, 26 data scientists from around the world successfully contributed features to a feature engineering pipeline for predicting income from raw responses to the U.S. Census American Community Survey. The ML performance of the collaboratively generated features coupled with AutoML beats several baselines, showing the potential of hybrid human-machine systems for challenging tasks. We found that participants highly valued the workflow enabled by Assemblé, and that as a result, software engineering background did not have a significant effect on either participant experience or performance. Rather, domain expertise (i.e., working with survey data or Census data) was the most important background for success in the task. The collaborative nature of the project meant that many contributors appreciated learning by example and referred to existing features contributed by others to inform their own feature development. However, contributors needed additional support for distributing work, so that variables under-explored in features written by others could be highlighted as potential inputs for new features.

I've written about this project in a variety of places, such as in an initial workshop paper, a demo of a collaborative project, and a recent status report on the project.

Please check it out and feel free to share any feedback.