LPQL: A Query Language Designed for Speed
When we designed the query interface for LogPulse, we had a clear set of requirements. The syntax had to be familiar to engineers who know Splunk SPL or similar pipe-based query languages. The queries had to compile to efficient ClickHouse SQL -- not just work, but run fast across billions of rows. And the system had to be secure by default: no SQL injection, no cross-tenant data leaks, no unbounded resource consumption. The result is LPQL: LogPulse Query Language.
Design Goals
LPQL is a pipe-based query language. You start with a search expression, then chain transformations using the pipe operator. Each pipe command takes the output of the previous command and produces a new result set. If you have written SPL, the syntax will feel immediately familiar.
But under the surface, LPQL is fundamentally different from SPL. Every LPQL query is compiled to a ClickHouse SQL statement before execution. The compilation step is not a simple string substitution -- it is a full parse-and-transform pipeline that produces optimized, parameterized SQL with mandatory security constraints.
search index=web level=error
| where status_code >= 500
| stats count, avg(duration_ms) by host
| sort -count
| head 20

Architecture: From Text to SQL
The LPQL processing pipeline has four stages: tokenization, parsing, AST construction, and SQL compilation.
The tokenizer breaks the raw query string into tokens: keywords, operators, field names, values, pipes, and literals. It handles quoted strings, escape sequences, and the various comparison operators.
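As a rough illustration, a tokenizer for this kind of pipe language can be sketched with a handful of regular expressions. The token names and patterns below are illustrative only, not LogPulse's actual implementation:

```python
import re

# Illustrative token patterns; the real LPQL tokenizer also handles
# escape sequences and a wider operator set.
TOKEN_SPEC = [
    ("PIPE",   r"\|"),
    ("OP",     r">=|<=|!=|=|<|>"),
    ("STRING", r'"(?:[^"\\]|\\.)*"'),   # double-quoted string with escapes
    ("NUMBER", r"\d+(?:\.\d+)?"),
    ("IDENT",  r"[A-Za-z_][A-Za-z0-9_]*"),
    ("WS",     r"\s+"),
]
MASTER = re.compile("|".join(f"(?P<{name}>{pat})" for name, pat in TOKEN_SPEC))

def tokenize(query: str):
    """Yield (kind, text) pairs, skipping whitespace."""
    pos = 0
    while pos < len(query):
        m = MASTER.match(query, pos)
        if not m:
            raise SyntaxError(f"unexpected character at offset {pos}: {query[pos]!r}")
        pos = m.end()
        if m.lastgroup != "WS":
            yield (m.lastgroup, m.group())

print(list(tokenize('search index=web | head 20')))
```

Matching the whole stream against one alternation keeps the tokenizer a single loop, and a failed match immediately yields the exact offset for error reporting.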
The parser is a recursive descent parser. We chose recursive descent over parser generators like PEG or ANTLR because it gives us precise control over error messages and recovery behavior. When a user writes a malformed query, we want to point to the exact character where parsing failed and suggest a fix. Generated parsers make this kind of error reporting difficult.
The parser produces an abstract syntax tree (AST) that represents the query as a tree of typed nodes. Each pipe command becomes a node in the AST, with its arguments and modifiers attached as child nodes. The AST is the canonical representation of the query -- all subsequent processing operates on the tree, not the raw text.
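The node shapes can be pictured as typed records, one per pipe command. The class and field names below are hypothetical, shown only to make "tree of typed nodes" concrete:

```python
from dataclasses import dataclass

# Illustrative AST node shapes; the real LPQL node types are richer.
@dataclass
class Comparison:
    field_name: str
    op: str
    value: object

@dataclass
class SearchNode:
    filters: list  # list of Comparison nodes

@dataclass
class StatsNode:
    aggs: list     # e.g. [("count", None), ("avg", "duration_ms")]
    by: list       # grouping fields

@dataclass
class Query:
    pipes: list    # pipe commands in source order

# AST for: search index=web level=error | stats count, avg(duration_ms) by host
ast = Query(pipes=[
    SearchNode(filters=[Comparison("index", "=", "web"),
                        Comparison("level", "=", "error")]),
    StatsNode(aggs=[("count", None), ("avg", "duration_ms")], by=["host"]),
])
```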
The SQL compiler walks the AST and emits a ClickHouse SQL statement. Each pipe command corresponds to a SQL transformation. A search becomes a WHERE clause. A stats command becomes a GROUP BY with aggregate functions. A sort becomes an ORDER BY. The compiler nests these transformations as subqueries, building up the final SQL from the inside out.
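The inside-out nesting can be sketched as a fold over per-command compile functions. Everything here (table name, helper names, the fixed stages) is illustrative, not the production compiler:

```python
# Each pipe command compiles to a function that wraps the previous SQL.
# This mirrors the inside-out subquery nesting described above.

def search_stage(_inner: str) -> str:
    # search is the innermost stage: it scans the base table directly
    return ("SELECT * FROM logs WHERE tenant_id = {tenant_id:String} "
            "AND level = {p0:String}")

def stats_stage(inner: str) -> str:
    return f"SELECT host, count() AS count FROM ({inner}) GROUP BY host"

def sort_stage(inner: str) -> str:
    return f"SELECT * FROM ({inner}) ORDER BY count DESC"

def head_stage(inner: str) -> str:
    return f"SELECT * FROM ({inner}) LIMIT 20"

def compile_pipeline(stages) -> str:
    sql = ""
    for stage in stages:  # pipe order = innermost to outermost
        sql = stage(sql)
    return sql

print(compile_pipeline([search_stage, stats_stage, sort_stage, head_stage]))
```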
The 22 Pipe Commands
LPQL supports 22 pipe commands that cover search, transformation, aggregation, and formatting.
Search and filtering: search performs keyword and field matching, where applies arbitrary boolean expressions, dedup removes duplicate events by field value, and fields selects or removes specific fields from the output.
Aggregation: stats computes aggregate functions over groups, timechart produces time-bucketed aggregations for charting, chart creates pivot-table style aggregations, top and rare find the most and least common values for a field.
Transformation: eval creates computed fields using expressions, rename changes field names, rex extracts fields using regular expressions, bin (aliased as bucket) discretizes numeric or time values into fixed-width intervals, fillnull replaces null values with a default, and filldown propagates the last non-null value forward.
Flow control: table formats output as a flat table, head and tail limit results to the first or last N rows, sort orders results by one or more fields, and join combines results from two searches.
Utility: makeresults generates synthetic events for testing queries without hitting real data.
search index=api
| timechart span=5m count by level
| fillnull value=0

How Compilation Works
Each pipe command in the AST is compiled to a SQL subquery transformation. The compiler maintains a running query context and wraps each new pipe as an outer query around the previous result. Consider this simple LPQL query:
search index=web level=error
| stats count, avg(duration_ms) as avg_duration by host
| sort -avg_duration
| head 10

The compiler produces SQL that looks conceptually like the following (simplified for clarity):
SELECT *
FROM (
SELECT host,
count() AS count,
avg(duration_ms) AS avg_duration
FROM shared_tenants.logs
WHERE tenant_id = {tenant_id:String}
AND index = {p0:String}
AND level = {p1:String}
GROUP BY host
)
ORDER BY avg_duration DESC
LIMIT 10

Notice several things about this output. The tenant_id filter is injected automatically -- the user never specifies it, and they cannot remove it. The literal values for index and level are passed as parameterized query arguments ({p0:String}, {p1:String}), not interpolated into the SQL string. And the subquery nesting follows the pipe order: search produces the innermost query, stats wraps it with GROUP BY, sort adds ORDER BY, and head adds LIMIT.
Security by Default
Security is not a feature we bolted on after building the compiler. It is baked into the architecture at every level.
Every query gets a mandatory tenant_id filter. The compiler injects this predicate into the innermost WHERE clause, and it cannot be overridden or removed by the user. This ensures that ClickHouse partition pruning eliminates all data belonging to other tenants before any computation begins.
Namespace RBAC adds a second layer. Each user has a set of permitted namespaces based on their role. The compiler adds a namespace IN (...) predicate that restricts results to only the namespaces the user is authorized to access.
All literal values are passed as parameterized query arguments. This eliminates SQL injection entirely -- there is no code path where user-provided text is concatenated into the SQL string.
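Those three guarantees can be sketched as a single WHERE-builder: the tenant and namespace predicates are injected unconditionally, and every user-supplied literal becomes a named parameter rather than part of the SQL text. This is illustrative code, not the production compiler:

```python
def build_where(namespaces, user_filters):
    """Return (where_sql, params) with mandatory tenant/namespace guards.

    user_filters is a list of (field, op, value) tuples parsed from the
    query; assume field names were validated against a schema allowlist
    during parsing, so only *values* need parameterization.
    """
    params = {}  # bound server-side, never interpolated into the SQL
    clauses = [
        "tenant_id = {tenant_id:String}",           # always present
        "namespace IN {namespaces:Array(String)}",  # RBAC layer
    ]
    params["namespaces"] = list(namespaces)
    for i, (field, op, value) in enumerate(user_filters):
        clauses.append(f"{field} {op} {{p{i}:String}}")
        params[f"p{i}"] = value
    return " AND ".join(clauses), params

where, params = build_where({"payments"}, [("level", "=", "error")])
```

The key property: the user's value ("error") appears only in the params dict, never in the SQL string itself.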
On the resource side, we enforce a 100K row limit on all query results to prevent memory exhaustion on the client, and cap queries at 50 conditions to prevent adversarial query complexity. ClickHouse enforces a 30-second execution timeout and a 5GB memory cap per query at the server level.
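The server-side caps map onto real ClickHouse settings (max_result_rows, max_execution_time, max_memory_usage), which a compiler can append as a SETTINGS clause on every emitted query. The helper below is a sketch of that idea, not our exact implementation:

```python
# Server-side guardrails appended to every compiled query. The setting
# names are real ClickHouse settings; wiring them in via a SETTINGS
# clause is one possible approach.
QUERY_SETTINGS = {
    "max_result_rows": 100_000,         # hard cap on returned rows
    "max_execution_time": 30,           # seconds
    "max_memory_usage": 5_000_000_000,  # ~5 GB per query
}

def with_limits(sql: str) -> str:
    settings = ", ".join(f"{k} = {v}" for k, v in QUERY_SETTINGS.items())
    return f"{sql} SETTINGS {settings}"

print(with_limits("SELECT count() FROM logs"))
```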
Performance: Why LPQL Queries Are Fast
LPQL query performance comes from the combination of compiler optimizations and ClickHouse schema features.
Materialized columns are the biggest single optimization. When a query filters on level, status_code, or duration_ms, it reads from pre-computed columns instead of parsing the event body. This turns what would be a full-text scan into a simple column comparison, which ClickHouse executes orders of magnitude faster.
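One way to picture the compiler's side of this: field references are resolved against the set of materialized columns first, falling back to extracting from the raw event body only when necessary. The column names and JSON layout here are assumptions for illustration:

```python
# Fields assumed to have pre-computed (materialized) columns in the logs table.
MATERIALIZED = {"level", "status_code", "duration_ms", "host"}

def resolve_field(name: str) -> str:
    if name in MATERIALIZED:
        return name  # fast path: direct column comparison
    # Slow path: extract from the raw JSON body at query time.
    # JSONExtractString is a real ClickHouse function; the 'body'
    # column name is an assumption here.
    return f"JSONExtractString(body, '{name}')"

print(resolve_field("level"))         # level
print(resolve_field("customer_tag"))  # JSONExtractString(body, 'customer_tag')
```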
Token bloom filters accelerate full-text search terms. When the compiler encounters a keyword search (a bare string not attached to a specific field), it generates a hasToken() predicate that leverages the bloom filter index. ClickHouse skips entire data granules that the bloom filter identifies as non-matching.
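The keyword-to-predicate rewrite can be sketched like this. hasToken is a real ClickHouse function that matches whole tokens and pairs with a tokenbf_v1 skip index; the 'body' column name is an assumption, and the real compiler parameterizes rather than interpolating:

```python
import re

def keyword_predicate(phrase: str) -> str:
    """Compile a bare keyword search into hasToken() predicates.

    hasToken matches a single alphanumeric token, so multi-word phrases
    are split and AND-ed; each predicate can be served by the token
    bloom filter index. Restricting tokens to [A-Za-z0-9_] also means
    nothing unsafe can reach the SQL text in this sketch.
    """
    tokens = re.findall(r"[A-Za-z0-9_]+", phrase)
    return " AND ".join(f"hasToken(body, '{t}')" for t in tokens)

print(keyword_predicate("connection refused"))
# hasToken(body, 'connection') AND hasToken(body, 'refused')
```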
Partition pruning ensures that every query starts by eliminating irrelevant partitions. The mandatory tenant_id filter prunes all other tenants. Time range predicates prune all non-matching days. By the time ClickHouse begins scanning data, it is typically reading less than 1% of the total table.
Stats Functions
The stats, timechart, and chart commands support a comprehensive set of aggregate functions: count for event counts; dc (distinct count) for cardinality; avg, sum, min, and max for basic arithmetic aggregates; median and mode for distribution analysis; stdev for standard deviation; perc50 and p95 for percentile calculations; values and list for collecting distinct or all values into arrays; and first and last for positional aggregates.
These functions compile directly to ClickHouse aggregate functions. For example, dc compiles to uniqExact(), perc50 compiles to quantileExact(0.5)(), and values compiles to groupUniqArray(). The mapping is designed to produce the most efficient ClickHouse function for each semantic operation.
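The mapping reads naturally as a lookup table. The dc, perc50, and values entries below are the real mappings cited above; the remaining entries are plausible ClickHouse counterparts, shown for illustration only:

```python
# dc, perc50, and values are the documented mappings; the rest are
# illustrative guesses at ClickHouse counterparts, not confirmed.
AGG_MAP = {
    "count":  "count()",
    "dc":     "uniqExact({0})",
    "avg":    "avg({0})",
    "sum":    "sum({0})",
    "min":    "min({0})",
    "max":    "max({0})",
    "stdev":  "stddevSamp({0})",
    "perc50": "quantileExact(0.5)({0})",
    "p95":    "quantileExact(0.95)({0})",
    "values": "groupUniqArray({0})",
    "list":   "groupArray({0})",
}

def compile_agg(func: str, field: str = "") -> str:
    return AGG_MAP[func].format(field)

print(compile_agg("dc", "user_id"))  # uniqExact(user_id)
```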
search index=api
| stats count, dc(user_id) as unique_users,
avg(duration_ms) as avg_latency,
p95(duration_ms) as p95_latency,
values(host) as hosts
by namespace

What Comes Next
LPQL is under active development. We are working on sub-search support for correlated queries, lookup command integration for enrichment at query time, and a query optimizer that can reorder pipe commands for better partition pruning. The goal is a query language that makes ClickHouse feel as natural for log analysis as SPL made Elasticsearch feel a decade ago -- but with better performance, stronger security, and no vendor lock-in.