Obligatory DuckDB solution:
> duckdb -s "COPY (SELECT url[20:] as url, date, count(*) as c FROM read_csv('data.csv', columns = { 'url': 'VARCHAR', 'date': 'DATE' }) GROUP BY url, date) TO 'output.json' (ARRAY)"
Takes about 8 seconds on my M1 MacBook. The JSON isn't in the exact format the challenge expects, but reshaping it wouldn't dominate the execution time.
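For illustration, a minimal reshaping sketch: DuckDB's `(ARRAY)` output is a flat JSON array of `{url, date, c}` rows, and the nested `{url: {date: count}}` target shape below is an assumption about the challenge's expected format, not taken from its spec.

```python
import json
from collections import defaultdict

def reshape(rows):
    # Fold flat rows into a nested {url: {date: count}} mapping.
    # NOTE: this target shape is a guess at the challenge's format.
    out = defaultdict(dict)
    for row in rows:
        out[row["url"]][row["date"]] = row["c"]
    return dict(out)

# Sample input mimicking DuckDB's array output:
rows = json.loads(
    '[{"url": "/blog/a", "date": "2024-01-01", "c": 3},'
    ' {"url": "/blog/a", "date": "2024-01-02", "c": 1}]'
)
print(json.dumps(reshape(rows)))
```

A single O(n) pass like this over ~100M already-aggregated groups is cheap next to the scan itself.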
A log in one of the PRs claims a 5.4s running time on some Mac:
https://github.com/tempestphp/100-million-row-challenge/pull...