The Apache Software Foundation (ASF), the all-volunteer developers, stewards, and incubators of more than 320 active open source projects and initiatives, announced that Apache® DataFusion™ is now a Top-Level Project (TLP). DataFusion is a fast, extensible query engine for building high-quality data-centric systems in Rust, using the Apache Arrow in-memory format.
DataFusion aims to be the query engine of choice for new, fast, data-centric systems such as databases, dataframe libraries, machine learning systems, and streaming applications by leveraging the unique features of Apache Arrow and Rust. By using DataFusion, projects can focus on developing their own specific features and avoid reimplementing standard ones such as an expression representation, standard optimizations, parallelized streaming execution plans, and file format support.
DataFusion can be used without modification as an embedded SQL engine, or it can be customized and used as a foundation for building new systems. It is used in systems focused on analytic (high-throughput), streaming, and transaction (low-latency) workloads, including:
- Specialized analytical database systems such as Apache HoraeDB
- New query language engines such as prql-query and accelerators such as VegaFusion
- Research platforms for new database systems, such as opt-d
- Streaming data platforms such as Synnada
- SQL support for another library, such as dask-sql
- Tools for reading / sorting / transcoding files such as qv
- Apache Spark runtime replacements such as Comet and Blaze
“Apache DataFusion has grown tremendously since its inception. What started as a modest project to provide a simple and efficient query engine has evolved into a robust, high-performance system that powers data-centric applications worldwide. This growth is a testament to the Apache Way,” said Andy Grove, Apache DataFusion PMC Member and original creator of DataFusion. “Becoming a Top-Level Project is a significant milestone, and I am excited to see how the project will continue to innovate and shape the future of data processing.”
DataFusion Feature Highlights
- Fast, vectorized, multi-threaded, streaming execution engine
- Support for Parquet, CSV, JSON, and Avro file formats via built-in plugins
- Support for custom file formats and non-file data sources via extension traits
- Many extension points: user-defined scalar/aggregate/window functions, data sources, SQL, other query languages, custom plan and execution nodes, optimizer passes, and more
- A state-of-the-art query optimizer with expression coercion and simplification, projection and filter pushdown, sort- and distribution-aware optimizations, automatic join reordering, and more
- Streaming, asynchronous input/output directly from popular object stores, including AWS S3, Azure Blob Storage, and Google Cloud Storage (other systems are supported via extensions)
- Support for Substrait to easily pass plans across language and system boundaries
- Implementation in Rust
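To illustrate one of the optimizations named in the highlights above, filter pushdown, here is a minimal, self-contained sketch in Rust. The `Plan` type and the `push_down_filters` and `sample_plan` functions are hypothetical names invented for this illustration; they are not DataFusion's API and do not reflect how its optimizer is implemented. The sketch assumes each filter reads a single column and only shows the general idea: a filter is moved as close to the data source as possible so that fewer rows flow through the rest of the plan.

```rust
// Toy logical plan for illustration only. These types are hypothetical
// and are NOT part of DataFusion's API.
#[derive(Debug, Clone, PartialEq)]
enum Plan {
    /// Read a table, applying any filters that were pushed into the scan.
    Scan { table: String, filters: Vec<String> },
    /// Keep only the named columns from the input.
    Projection { columns: Vec<String>, input: Box<Plan> },
    /// Keep rows where `predicate` (assumed to read only `column`) holds.
    Filter { column: String, predicate: String, input: Box<Plan> },
}

/// Rewrite a plan so that each filter sits as close to its scan as possible.
fn push_down_filters(plan: Plan) -> Plan {
    match plan {
        Plan::Filter { column, predicate, input } => {
            // Optimize the child first, then decide where this filter belongs.
            match push_down_filters(*input) {
                // A filter directly above a scan is absorbed into the scan.
                Plan::Scan { table, mut filters } => {
                    filters.push(predicate);
                    Plan::Scan { table, filters }
                }
                // A filter above a projection can move below it when the
                // projection passes the filtered column through unchanged.
                Plan::Projection { columns, input } if columns.contains(&column) => {
                    Plan::Projection {
                        columns,
                        input: Box::new(push_down_filters(Plan::Filter {
                            column,
                            predicate,
                            input,
                        })),
                    }
                }
                // Otherwise the filter stays where it is.
                other => Plan::Filter { column, predicate, input: Box::new(other) },
            }
        }
        Plan::Projection { columns, input } => Plan::Projection {
            columns,
            input: Box::new(push_down_filters(*input)),
        },
        scan @ Plan::Scan { .. } => scan,
    }
}

/// Example input: Filter(region = 'EU') over Projection over Scan(sales).
fn sample_plan() -> Plan {
    Plan::Filter {
        column: "region".to_string(),
        predicate: "region = 'EU'".to_string(),
        input: Box::new(Plan::Projection {
            columns: vec!["region".to_string(), "amount".to_string()],
            input: Box::new(Plan::Scan { table: "sales".to_string(), filters: vec![] }),
        }),
    }
}

fn main() {
    let optimized = push_down_filters(sample_plan());
    println!("{:#?}", optimized);
}
```

In this toy example the filter on `region` ends up inside the scan of `sales`, beneath the projection, which is the effect filter pushdown is meant to achieve.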
“DataFusion’s capabilities have been integral to the development of InfluxDB 3.0. By building with and contributing to this project, we’ve been able to deliver a powerful, vectorized SQL engine to our users, all while benefiting from continuous improvements from a dedicated global community,” said Paul Dix, CTO and co-founder of InfluxData.
SOURCE: GlobeNewswire