logoalt Hacker News

refactor_mastertoday at 6:21 AM8 repliesview on HN

The better question is, why is DuckDB so popular when one can use Polars which has a sane, lintable, typesafe API compared to the mess that is SQL:

  WITH lagged AS (
      SELECT 
          *, 
          LAG(event_time) OVER (PARTITION BY user_id ORDER BY event_time) AS prev_time
      FROM events
  ),
  sessions AS (
      SELECT 
          *, 
          SUM(COALESCE((date_diff('minute', prev_time, event_time) > 30)::INT, 1)) 
              OVER (PARTITION BY user_id ORDER BY event_time) AS session_id
      FROM lagged
  )
  SELECT
      user_id,
      session_id,
      MIN(event_time) AS session_start,
      MAX(event_time) AS session_end,
      COUNT(*) AS event_count
  FROM sessions
  GROUP BY ALL
  ORDER BY user_id, session_start;
vs

  result = (
      df.sort(["user_id", "event_time"])
      .with_columns(
          session_id=(
              pl.when(pl.col("event_time").diff().is_null())
              .then(1)
              .when(pl.col("event_time").diff().dt.total_minutes() > 30)
              .then(1)
              .otherwise(0)
              .cum_sum()
              .over("user_id")
          )
      )
      .group_by(["user_id", "session_id"])
      .agg(
          session_start=pl.col("event_time").min(),
          session_end=pl.col("event_time").max(),
          event_count=pl.col("event_time").count(),
      )
      .sort(["user_id", "session_start"])
  )

Replies

coldteatoday at 10:41 AM

Precisely to avoid the custom NIH Polars API, and use SQL which works everywhere (yes, inconsistencies aside).

brikymtoday at 7:31 AM

Polars typesafe? It doesn't show you any errors until runtime right? Kusto query language is the best I've seen at type safety and I wish open source DBs would steal some ideas from it.

willjtoday at 4:45 PM

How can you tell if someone is a polars fan? Don’t worry, they’ll tell you. :)

SmirkingRevengetoday at 7:04 PM

Polars and duckdb interoperate so it's not really one or the other.

Other dataframe libraries work directly on duckdb also, like Ibis.

FWIW, I think SQLAlchemy does also.

In any case though, I find both the SQL and the code there equally readable.

sceadutoday at 2:57 PM

I really hate SQL as a language for doing data-related tasks, but the core tech for it is often much better in terms of reliability (and is close to having performance parity IME)... specifically for working with larger-than-memory datasets. I think the team at Polars is working on improving larger-than-memory operations though.

porridgeraisintoday at 6:58 AM

I understand the linting aspect but not gonna lie I understood the first one immediately way more than the 2nd one due to knowing SQL well.

homebessguytoday at 7:11 AM

"Languages come and go, but SQL is forever"

IshKebabtoday at 8:42 AM

That does look nicer if you have a Parquet file and want to analyze it. But DuckDB is also a database - if you want a persistent, reliable and mutable data store I don't think Polars would be suitable would it? (Genuine question - you sound like an expert and I'm not.)