The better question is, why is DuckDB so popular when one can use Polars which has a sane, lintable, typesafe API compared to the mess that is SQL:
WITH lagged AS (
SELECT
*,
LAG(event_time) OVER (PARTITION BY user_id ORDER BY event_time) AS prev_time
FROM events
),
sessions AS (
SELECT
*,
SUM(COALESCE((date_diff('minute', prev_time, event_time) > 30)::INT, 1))
OVER (PARTITION BY user_id ORDER BY event_time) AS session_id
FROM lagged
)
SELECT
user_id,
session_id,
MIN(event_time) AS session_start,
MAX(event_time) AS session_end,
COUNT(*) AS event_count
FROM sessions
GROUP BY ALL
ORDER BY user_id, session_start;
vs result = (
df.sort(["user_id", "event_time"])
.with_columns(
session_id=(
pl.when(pl.col("event_time").diff().is_null())
.then(1)
.when(pl.col("event_time").diff().dt.total_minutes() > 30)
.then(1)
.otherwise(0)
.cum_sum()
.over("user_id")
)
)
.group_by(["user_id", "session_id"])
.agg(
session_start=pl.col("event_time").min(),
session_end=pl.col("event_time").max(),
event_count=pl.col("event_time").count(),
)
.sort(["user_id", "session_start"])
)Polars typesafe? It doesn't show you any errors until runtime right? Kusto query language is the best I've seen at type safety and I wish open source DBs would steal some ideas from it.
How can you tell if someone is a polars fan? Don’t worry, they’ll tell you. :)
Polars and duckdb interoperate so it's not really one or the other.
Other dataframe libraries work directly on duckdb also, like Ibis.
FWIW, I think SQLAlchemy does also.
In any case though, I find both the SQL and the code there equally readable.
I really hate SQL as a language for doing data-related tasks, but the core tech for it is often much better in terms of reliability (and is close to having performance parity IME)... specifically for working with larger-than-memory datasets. I think the team at Polars is working on improving larger-than-memory operations though.
I understand the linting aspect but not gonna lie I understood the first one immediately way more than the 2nd one due to knowing SQL well.
"Languages come and go, but SQL is forever"
That does look nicer if you have a Parquet file and want to analyze it. But DuckDB is also a database - if you want a persistent, reliable and mutable data store I don't think Polars would be suitable would it? (Genuine question - you sound like an expert and I'm not.)
Precisely to avoid the custom NIH Polars API, and use SQL which works everywhere (yes, inconsistencies aside).