
Data processing for AI: Building Daft for blazing fast I/O on structured & unstructured data
I/O is a consistent bottleneck in large-scale data processing workloads, and is often more painful than the compute on the data itself. Unstructured data adds its own I/O challenges. We present Daft, a data engine purpose-built for processing data of any modality at any scale. Daft is used to query data of many shapes and sizes, from tabular (Parquet, CSV) to semi-structured (JSON) to unstructured (text, images, audio). We'll dive into the technical details that let Daft do this while maximizing I/O throughput, including distributed reads of large files, memory stability via morsel-based execution, and I/O-aware query optimizations.
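To make two of the ideas above concrete, here is a minimal, standard-library-only sketch. It is not Daft's implementation or API. The function names (split_byte_ranges, morsels, sum_in_morsels) and the sizes used are hypothetical, chosen only for illustration. The first function shows how a large file can be divided into independent byte ranges that separate workers could read in parallel. The other two show the general morsel idea of streaming data through an operator in small batches so that memory use is bounded by the batch size rather than the input size.

```python
from typing import Iterable, Iterator, List, Tuple


def split_byte_ranges(file_size: int, chunk_size: int) -> List[Tuple[int, int]]:
    """Split [0, file_size) into half-open byte ranges of at most chunk_size bytes.

    Each range could be handed to a different worker as an independent read.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    return [
        (start, min(start + chunk_size, file_size))
        for start in range(0, file_size, chunk_size)
    ]


def morsels(rows: Iterable[int], morsel_size: int) -> Iterator[List[int]]:
    """Yield rows in small batches ("morsels"), holding one batch at a time."""
    if morsel_size <= 0:
        raise ValueError("morsel_size must be positive")
    batch: List[int] = []
    for row in rows:
        batch.append(row)
        if len(batch) == morsel_size:
            yield batch
            batch = []
    if batch:
        # Emit the final, possibly smaller, batch.
        yield batch


def sum_in_morsels(rows: Iterable[int], morsel_size: int) -> int:
    """Aggregate a stream batch by batch instead of materializing all rows."""
    total = 0
    for batch in morsels(rows, morsel_size):
        total += sum(batch)
    return total
```

For example, split_byte_ranges(10, 4) produces three ranges covering bytes 0 to 10, and sum_in_morsels(range(1000), 64) computes the same total as summing the whole input at once while only ever holding 64 rows in a batch. A real engine would also have to align ranges to format boundaries (such as Parquet row groups or CSV line breaks), which this sketch leaves out.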