Below is my (mostly) complete bibliography with links to articles, presentations, posters, videos, and suggested citations. Make sure to check my blog as well for additional informal discussion of some of this research.
Collaborative data science development
While the open-source model for software development has led to successful, large-scale collaborations in building software applications, chess engines, and scientific analyses, data science has not benefited from this development paradigm. In part, this is due to the divide between the development processes used by software engineers and those used by data scientists.
Ballet tries to address this disparity. It is a lightweight software framework that supports collaborative data science development by composing a data science pipeline from a collection of modular patches that can be written in parallel. Ballet provides the underlying functionality to support interactive development, test and merge high-quality contributions, and compose the accepted contributions into a single product.
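To make the idea of composing a pipeline from modular patches concrete, here is a minimal sketch in plain Python. It is not Ballet's actual API: the `Patch` type, the `passes_checks` acceptance rule, the `compose` helper, and the column names are all hypothetical, chosen only to illustrate how independently written contributions might be validated and then merged into a single product.

```python
import math
from typing import Callable, Dict, List, Tuple

Row = Dict[str, float]
# Hypothetical convention: a patch maps one raw row to a dict of new feature values.
Patch = Callable[[Row], Dict[str, float]]


def age_squared(row: Row) -> Dict[str, float]:
    return {"age_squared": row["age"] ** 2}


def hours_per_year(row: Row) -> Dict[str, float]:
    return {"hours_per_year": row["hours_per_week"] * 52}


def broken_patch(row: Row) -> Dict[str, float]:
    # References a column that does not exist, so it should be rejected.
    return {"bad": row["missing_column"]}


def passes_checks(patch: Patch, sample: List[Row]) -> bool:
    """Accept a patch only if it runs on every sample row and yields finite numbers."""
    try:
        for row in sample:
            out = patch(row)
            if not out:
                return False
            for value in out.values():
                if not isinstance(value, (int, float)) or not math.isfinite(value):
                    return False
    except Exception:
        return False
    return True


def compose(patches: List[Patch], sample: List[Row]) -> Tuple[Callable[[Row], Dict[str, float]], List[Patch]]:
    """Keep the patches that pass the checks and merge them into one pipeline."""
    accepted = [p for p in patches if passes_checks(p, sample)]

    def pipeline(row: Row) -> Dict[str, float]:
        features: Dict[str, float] = {}
        for patch in accepted:
            features.update(patch(row))
        return features

    return pipeline, accepted


sample_rows = [{"age": 30.0, "hours_per_week": 40.0}]
pipeline, accepted = compose([age_squared, hours_per_year, broken_patch], sample_rows)
print(pipeline(sample_rows[0]))
```

Because each patch is a self-contained function, contributors can write them in parallel, and the acceptance check stands in for the kind of automated testing a real framework would run before merging.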
We evaluated Ballet in an extensive case study of a personal income prediction project. Our preprint describes our ideas for collaborative data science development, the design of the framework, and the results of this evaluation.
Frameworks for AutoML
While developing and deploying ML systems in my research group, we realized that every project used a different set of libraries, chosen for the task at hand, that fit together more or less poorly. To address this, we redesigned our system-building approach around three concepts: ML primitives, ML pipelines, and AutoML components. The resulting software framework is used for everything from our entry to DARPA's Data-Driven Discovery of Models program to unsupervised time-series anomaly detection in satellite telemetry to ML on electronic health records. I designed the BTB library for model selection and hyperparameter tuning, to which many folks in the Data to AI Lab have also contributed. We describe the framework, some of the ML and AutoML systems we have built with it, and a thorough evaluation in this paper.
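As a rough illustration of what model selection combined with hyperparameter tuning involves, here is a small self-contained sketch. It does not use BTB's API. The search space, the `toy_score` function (a stand-in for a cross-validated score), and the `search` helper are all hypothetical, and the strategy shown (epsilon-greedy choice among models, uniform random sampling of hyperparameters) is a deliberately simple substitute for the more sophisticated selectors and tuners a real library would provide.

```python
import random
from typing import Dict, List, Tuple

# Hypothetical search space: each candidate model has named hyperparameters
# with (low, high) ranges.
SEARCH_SPACE: Dict[str, Dict[str, Tuple[float, float]]] = {
    "model_a": {"x": (0.0, 1.0)},
    "model_b": {"x": (0.0, 1.0), "y": (0.0, 1.0)},
}

Trial = Tuple[str, Dict[str, float], float]  # (model, hyperparameters, score)


def toy_score(model: str, params: Dict[str, float]) -> float:
    """Stand-in for training and cross-validating a model; higher is better."""
    if model == "model_a":
        return 1.0 - (params["x"] - 0.3) ** 2
    return 0.9 - (params["x"] - 0.5) ** 2 - (params["y"] - 0.5) ** 2


def search(budget: int, seed: int = 0, explore: float = 0.2) -> List[Trial]:
    rng = random.Random(seed)
    history: List[Trial] = []
    best_by_model: Dict[str, float] = {}

    for _ in range(budget):
        # Selection step: try every model once, then mostly exploit the
        # model with the best score so far, exploring occasionally.
        untried = [m for m in SEARCH_SPACE if m not in best_by_model]
        if untried:
            model = untried[0]
        elif rng.random() < explore:
            model = rng.choice(sorted(SEARCH_SPACE))
        else:
            model = max(best_by_model, key=best_by_model.get)

        # Tuning step: propose hyperparameters by uniform random sampling.
        params = {
            name: rng.uniform(low, high)
            for name, (low, high) in SEARCH_SPACE[model].items()
        }

        score = toy_score(model, params)
        history.append((model, params, score))
        best_by_model[model] = max(score, best_by_model.get(model, float("-inf")))

    return history


history = search(budget=20, seed=0)
best_model, best_params, best_score = max(history, key=lambda trial: trial[2])
print(best_model, best_params, best_score)
```

Separating the selection step from the tuning step is the point of the sketch: either one can be swapped for a smarter strategy without touching the other.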
Systems for AutoML
I am a developer on the ATM project, a full-fledged open-source system for joint model selection and hyperparameter tuning for classification. ATM is one of the first projects from the research community to go beyond libraries for model selection or hyperparameter tuning and provide a complete system, with a database backend, designed for ease of use and high performance. On top of this, we collaborated with the VisLab at HKUST to create a frontend for ATM that allows users to monitor and control an ongoing AutoML search process. This led to the ATMSeer system, which we describe in this paper.
"The Machine Learning Bazaar: Harnessing the ML Ecosystem for Effective System Development." Proceedings of the 2020 ACM SIGMOD International Conference on Management of Data. 2020. (Previously published at arXiv:1905.08942 [cs])
"ATMSeer: Increasing Transparency and Controllability in Automated Machine Learning." Proceedings of the 2019 CHI Conference on Human Factors in Computing Systems. 2019. (Also published at arXiv:1902.05009 [cs])