What do you use it for? I’m perpetually interested in using DuckDB, but it doesn’t seem to do anythi...

jkubicek • today at 4:44 AM • 8 replies • view on HN

What do you use it for? I’m perpetually interested in using DuckDB, but it doesn’t seem to do anything I need.

Replies

Basically like a locally hosted Snowflake - it only shines if you have enough data to analyze (100 MB - 100 GB is probably the sweet-spot range - less than that and the benefits are small, more than that and you risk flying too close to the sun with memory usage).

It has connectors for Postgres & other stores, so I find it faster to connect to a Postgres instance, pull all of the data from a table (even if the table is like 50GB - if you have 30 cores on the machine it will pull from Postgres using 30 cores in parallel, so it will only take a minute or two) - and then any analytical queries on the data are 10+ times faster in DuckDB over native Postgres (GROUP BY, regexp_replace, count(distinct...) etc).

orthoxerox • today at 5:46 AM

All kinds of data processing. For example, you download a million rows of metrics and load them in Excel to build pivot tables. It works, but now it's a billion rows. If you know SQL, it's a snap to point DuckDB at the source CSV or JSON and get the results in a second.

steve_adams_86 • today at 5:50 AM

The most interesting use case lately has been using it as the transformation and validation engine for a CLI that handles scientific data. Some datasets are small and could have been handled at the application layer, but some are quite massive (especially genomic data). DuckDB bundles with the CLI and travels around any platform, is super lightweight, allows for easily running in CI, on a user’s machine, against datasets of all sizes, and so on.

There are other embeddable options out there but I found DuckDb fit better for the potentially massive datasets, and also because of how naturally it ingests the types of data we work with, some of its unique features, and how trivial it was to learn and integrate with the project.

Otherwise I use it almost daily for doing guardrailed data exploration with LLMs. I prefer SQL over random DSLs in AWS or Sentry or what have you. I’ll ingest the data I need and just run SQL against it. I mentioned in another comment that I’ll tend to store more useful data (especially data I export routinely, like infra cost reports) on S3 and use a Rill instance to do basic exploration in a GUI (it will query remote parquet files).

skeeter2020 • today at 4:49 PM

the taste that hooked me: the next time you have a bunch of json data, csvs or other data - local or remote - and someone wants some charts (for me it was "productivity" metrics from Jira combined with a bunch of other stuff). First it is very easy/fast to load this data; DuckDB has a very liberal parsing engine and good connectors. Second, I used to worry a lot about my table definitions and cleaning data before structuring it. Not anymore! With DuckDB I find myself iteratively transforming data and creating new tables, combining sources, converting columns, slicing/dicing/rotating. It's very easy to "remix" data and there are functions or extensions for everything you might want to do. There's so little friction to get started that I've found it just naturally becomes the multitool in my toolbox.

THis will give you some experience and you'll start to see applicable problem spaces for DuckDB in product areas, especially anything with BI or DW.

raihansaputra • today at 10:39 AM

throwing in my 2 cents: It just replaced pandas for me. It's just so much easier to write sql against csv/json/whatever format data in jupyter/marimo notebooks through duckdb rather than reasoning through pandas. SQL is far more natural for me, and agents also work through it easily.

➕ show 1 reply

wiredfool • today at 11:42 AM

Few different use cases, other than just a general swiss army knife for vaguely tabular data.

* fastapi + duckdb + parquet for the backend for a relatively high profile website

* wasm duckdb + react for a few visualization websites

* yaml driven ETL from lots of sources, principally ugly spreadsheets, into usable data. More T than E or L really

edweis • today at 5:19 AM

I personally find it useful to search logs with AI

➕ show 1 reply

hilariously • today at 11:32 AM

Honestly as someone whose super SQL focused and spends less time focusing on python I just can write generic SQL to transform things in memory to do whatever I want, its very helpful for that.

alt Hacker News

Replies