joy

Joy is a project that aims to make it easier to create high-performance data processing logic.

Joy leverages LLVM to generate code dynamically.
Joy doesn't define its own intermediate representation (IR), since:

SQL can be the IR
the number of operations in SQL is not that large, so we can create optimised operators for all of them

Joy provides a (hopefully simple) API for creating and optimising the needed operators without requiring LLVM knowledge.

Introduction

Bring happiness to data processing. 😃

API overview:

table API
group-by API
code gen API

Architecture

Usage

Requirements:

provide high-performance atomic operators that can be combined into a task
give the SqlJit compiler a fusion capability to perform task-level optimization, as Weld does. Optimizations:
- dynamic vector size: the compiler should take the CPU capabilities into account when deciding on, for example, the vector size, to leverage SIMD and at the same time reduce CPU cache misses
- type compaction, which requires statistics, for example long -> int -> short, phone -> long
ensure cache-line alignment
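
As an illustration of the type compaction item above, here is a minimal sketch of how column statistics could drive the choice of a narrower integer type. The names `IntWidth`, `ColumnStats`, `compact_width` and `compact_to_i32` are hypothetical and not part of Joy; the sketch only shows the long -> int -> short decision, not how the compiler would use it.

```rust
use std::convert::TryFrom;

/// Narrowest signed integer width that can hold every value of a column.
#[derive(Debug, PartialEq, Eq, Clone, Copy)]
pub enum IntWidth {
    I16,
    I32,
    I64,
}

/// Hypothetical column statistics: the observed minimum and maximum values.
pub struct ColumnStats {
    pub min: i64,
    pub max: i64,
}

/// Pick the narrowest width whose range covers [min, max] (long -> int -> short).
pub fn compact_width(stats: &ColumnStats) -> IntWidth {
    let fits = |lo: i64, hi: i64| stats.min >= lo && stats.max <= hi;
    if fits(i16::MIN as i64, i16::MAX as i64) {
        IntWidth::I16
    } else if fits(i32::MIN as i64, i32::MAX as i64) {
        IntWidth::I32
    } else {
        IntWidth::I64
    }
}

/// Re-encode an i64 column as i32; returns None if any value does not fit.
pub fn compact_to_i32(values: &[i64]) -> Option<Vec<i32>> {
    values.iter().map(|&v| i32::try_from(v).ok()).collect()
}
```

A phone number stored as digits (for example 13800000000) exceeds the i32 range, so under these assumptions such a column would stay at 64 bits, which matches the phone -> long example.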

Use TPC-H Q1 as an example to create the group-by aggregator functionality.
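
To make the target concrete, the following is a hand-written sketch of the kind of group-by aggregation Q1 needs, which generated code would have to match. It assumes columnar inputs of equal length, covers only some of Q1's aggregates, and omits the ship-date filter; `Q1Agg`, `LineItems` and `q1_group_by` are hypothetical names, not Joy APIs.

```rust
use std::collections::BTreeMap;

/// Running aggregates for one (returnflag, linestatus) group.
#[derive(Debug, Default, Clone, PartialEq)]
pub struct Q1Agg {
    pub sum_qty: f64,
    pub sum_base_price: f64,
    pub sum_disc_price: f64,
    pub count: u64,
}

/// Columnar view of the lineitem columns used here (all slices have the same length).
pub struct LineItems<'a> {
    pub returnflag: &'a [u8],
    pub linestatus: &'a [u8],
    pub quantity: &'a [f64],
    pub extendedprice: &'a [f64],
    pub discount: &'a [f64],
}

/// Group rows by (returnflag, linestatus) and accumulate the aggregates.
/// A BTreeMap keeps the groups ordered by key, like Q1's ORDER BY.
pub fn q1_group_by(cols: &LineItems) -> BTreeMap<(u8, u8), Q1Agg> {
    let mut groups: BTreeMap<(u8, u8), Q1Agg> = BTreeMap::new();
    for row in 0..cols.returnflag.len() {
        let key = (cols.returnflag[row], cols.linestatus[row]);
        let agg = groups.entry(key).or_insert_with(Q1Agg::default);
        agg.sum_qty += cols.quantity[row];
        agg.sum_base_price += cols.extendedprice[row];
        agg.sum_disc_price += cols.extendedprice[row] * (1.0 - cols.discount[row]);
        agg.count += 1;
    }
    groups
}
```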

The purpose of this project is to create a SQL processing engine using Miri.

Miri (https://github.com/rust-lang/miri) is an interpreter for Rust's mid-level IR (MIR), which can be used to interpret and run Rust code.

The idea is to compile SQL into Rust closures and run them using Miri.
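
A minimal sketch of the "SQL into closures" idea, assuming the SQL text has already been parsed into a small expression tree. The `Expr` enum, the `RowFn` alias and `compile` are hypothetical illustrations; the sketch only shows how an expression can become a nested closure and does not cover running anything under Miri.

```rust
/// Hypothetical, minimal expression tree for a SQL predicate or projection.
pub enum Expr {
    Column(usize),
    Literal(i64),
    Add(Box<Expr>, Box<Expr>),
    Mul(Box<Expr>, Box<Expr>),
    Lt(Box<Expr>, Box<Expr>),
}

/// A compiled expression: takes one row (values indexed by column) and returns a value.
/// Comparisons return 1 for true and 0 for false.
pub type RowFn = Box<dyn Fn(&[i64]) -> i64>;

/// Turn the expression tree into nested closures, once, before any row is processed.
pub fn compile(expr: &Expr) -> RowFn {
    match expr {
        Expr::Column(index) => {
            let index = *index;
            Box::new(move |row: &[i64]| row[index])
        }
        Expr::Literal(value) => {
            let value = *value;
            Box::new(move |_row: &[i64]| value)
        }
        Expr::Add(lhs, rhs) => {
            let (l, r) = (compile(lhs), compile(rhs));
            Box::new(move |row: &[i64]| l(row) + r(row))
        }
        Expr::Mul(lhs, rhs) => {
            let (l, r) = (compile(lhs), compile(rhs));
            Box::new(move |row: &[i64]| l(row) * r(row))
        }
        Expr::Lt(lhs, rhs) => {
            let (l, r) = (compile(lhs), compile(rhs));
            Box::new(move |row: &[i64]| (l(row) < r(row)) as i64)
        }
    }
}
```

With this shape, a predicate such as `col0 + col1 * 2 < 100` is walked once at compile time; evaluating a row afterwards only calls closures and never revisits the tree.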

Weld: maintains its own language
Joy: uses closures instead

Create a new SQL JIT compiler leveraging llvm-sys
Take over the optimizer from Weld?

Weld:

uses closure syntax with its own parser, which means lots of code to maintain
requires more time to parse and visit the AST, which could impact JIT performance
we can potentially borrow its optimizer passes and implement more

Data Types

The type system should be as transparent as possible. Ideally, we should be able to use native data types such as i32, i64, f32 and u8 directly.

Column::create()
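
One way to read this is a column that is generic over the native element type. The sketch below is an assumption about what that could look like: only the name `Column::create()` comes from the text above, while its signature and the `from_slice`, `push`, `len`, `is_empty` and `as_slice` helpers are hypothetical.

```rust
/// Hypothetical column: a thin wrapper over a contiguous buffer of a native Rust type.
#[derive(Debug, Clone, PartialEq)]
pub struct Column<T> {
    values: Vec<T>,
}

impl<T: Copy> Column<T> {
    /// Create an empty column (assumed signature for `Column::create()`).
    pub fn create() -> Self {
        Column { values: Vec::new() }
    }

    /// Build a column by copying existing native values.
    pub fn from_slice(values: &[T]) -> Self {
        Column { values: values.to_vec() }
    }

    /// Append one value to the end of the column.
    pub fn push(&mut self, value: T) {
        self.values.push(value);
    }

    /// Number of rows stored in the column.
    pub fn len(&self) -> usize {
        self.values.len()
    }

    /// True when the column holds no rows.
    pub fn is_empty(&self) -> bool {
        self.values.is_empty()
    }

    /// Borrow the values as a plain slice, with no conversion layer in between.
    pub fn as_slice(&self) -> &[T] {
        &self.values
    }
}
```

Under this assumption, `Column<i32>`, `Column<f32>` and `Column<u8>` all expose their data as ordinary slices, so operators work on native values directly.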

Code Gen

direct LLVM code gen without a parser, to reduce code gen latency
direct optimization of the code, to reduce the time needed for optimization passes
expose a high-level data processing codegen API for the community to create optimized data processing logic (whereas Weld exposes its IR)

MCJIT?? --> ORC JIT (On-Request Compilation)

Code Gen Simplified

The Joy project aims to provide:

a codegen framework that requires NO knowledge of LLVM
zero-overhead codegen: the framework should not add any overhead to the generated code
a framework that provides a uniop (single input) and a binop (two inputs). Open questions: Can we provide a trait for each operator type? How does the trait plug into the codegen? What is the benefit of using codegen for join?
built-in support for vectorized input
pluggable logic such as groupby and join
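
As one possible answer to the trait question above, here is a sketch with one trait per operator arity. `UnaryOp`, `BinaryOp`, `NegateOp`, `AddOp`, `map_unary` and `map_binary` are hypothetical names; the sketch applies operators to vectors in plain Rust and does not show how a trait would plug into LLVM codegen.

```rust
/// Hypothetical trait for a single-input operator (uniop).
pub trait UnaryOp<In, Out> {
    fn apply(&self, input: In) -> Out;
}

/// Hypothetical trait for a two-input operator (binop).
pub trait BinaryOp<L, R, Out> {
    fn apply(&self, left: L, right: R) -> Out;
}

/// Example uniop: arithmetic negation.
pub struct NegateOp;

impl UnaryOp<i64, i64> for NegateOp {
    fn apply(&self, input: i64) -> i64 {
        -input
    }
}

/// Example binop: addition.
pub struct AddOp;

impl BinaryOp<i64, i64, i64> for AddOp {
    fn apply(&self, left: i64, right: i64) -> i64 {
        left + right
    }
}

/// Apply a uniop to every element of an input vector.
pub fn map_unary<In: Copy, Out, Op: UnaryOp<In, Out>>(op: &Op, input: &[In]) -> Vec<Out> {
    input.iter().map(|&v| op.apply(v)).collect()
}

/// Apply a binop element-wise to two input vectors (stops at the shorter one).
pub fn map_binary<L: Copy, R: Copy, Out, Op: BinaryOp<L, R, Out>>(
    op: &Op,
    left: &[L],
    right: &[R],
) -> Vec<Out> {
    left.iter()
        .zip(right.iter())
        .map(|(&l, &r)| op.apply(l, r))
        .collect()
}
```

Because the operator is a generic parameter rather than a trait object, the compiler monomorphizes each call and can inline `apply`, which is in the spirit of the zero-overhead goal.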

Debug using Visual Studio Code

Install the Rust and LLDB plugins for VS Code.
Configure the debug launch.json: type attach under configurations, and the LLDB attach template is suggested automatically.
In the debug panel, click Launch and select the process to debug.
You can then add breakpoints in VS Code and debug the code.

gen()

The gen function provides the boilerplate code that loops over each row. It:
allows generating code while looping over each row
allows composing the generated code that processes each row

The context of the generated code:
1. has access to all of the columns
2. knows which columns are needed
3. accesses all columns via column index
4. knows which column stores the output
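
The row loop and its context can be sketched in plain Rust as an interpreted analogue of what gen would emit. This is not Joy's implementation: `RowStep` and `gen_rows` are hypothetical, and the function is named `gen_rows` because `gen` is a reserved keyword in the Rust 2024 edition. Each step declares the column indices it reads and the column index it writes, and steps compose because a later step can read an earlier step's output column.

```rust
/// Hypothetical per-row step: declares which columns it needs and where its result goes.
pub struct RowStep {
    /// Indices of the columns this step reads.
    pub inputs: Vec<usize>,
    /// Index of the column this step writes.
    pub output: usize,
    /// Row logic; receives the values of `inputs`, in the same order.
    pub body: Box<dyn Fn(&[i64]) -> i64>,
}

/// Boilerplate loop: for every row, run each step in order and store its result.
/// Output columns must already be allocated with at least `row_count` entries.
pub fn gen_rows(columns: &mut [Vec<i64>], row_count: usize, steps: &[RowStep]) {
    let mut args: Vec<i64> = Vec::new();
    for row in 0..row_count {
        for step in steps {
            // Gather only the columns this step declared, by column index.
            args.clear();
            for &index in &step.inputs {
                args.push(columns[index][row]);
            }
            let value = (step.body)(&args);
            columns[step.output][row] = value;
        }
    }
}
```

For example, a first step could write `col0 + col1` into column 2 and a second step could write `col2 * 2` into column 3, all within a single pass over the rows.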

C++ Building

Build with LLVM options (-S -O3 -emit-llvm -fno-discard-value-names):

./build.sh release

Build without LLVM options:

./build.sh debug