What's a decent time for this?
I was curious what it would take if I approached it the way I do with most CSV transformation tasks that I'm only intending to do once: use Unix command line tools such as cut, sed, sort, and uniq to do the bulk of the work, and then do something in whatever scripting language seems appropriate to put the final output in whatever format is needed.
The first part, using this command [1], produces output lines that look like this:
219,/blog/php-81-before-and-after,2021-06-21
and is sorted by URL path and then date. With 1 million lines that took 9 or 10 seconds (M2 Max Mac Studio), but with 100 million it took 1220 seconds, virtually all of it spent sorting.
Turning that into JSON via a shell script [2] took about 15 seconds. (That script is 44% longer than it would have been had JSON allowed a comma after the last element of an array.)
So basically 22 minutes total. The sorting is the killer with this type of approach, because the input is 7 GB. The output is only 13 MB, there are under 300 pages, and the largest page count is under 1000, so building the output up in memory as the unsorted input is scanned, and then sorting just that, would clearly be way, way faster.
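A minimal sketch of that in-memory idea, using awk's associative arrays on the unsorted path,date lines (the intermediate format just before `sort` in [1]); only the tiny aggregate gets sorted, and the output format matches the pipeline's. The sample file here stands in for the 100-million-line input:

```shell
# One pass over the unsorted "path,date" lines, counting duplicates in an
# associative array; only the small aggregate is sorted afterwards.
printf 'blog/a,2021-01-01\nblog/b,2021-01-01\nblog/a,2021-01-01\nblog/a,2021-01-02\n' > unsorted.csv
awk '{ n[$0]++ } END { for (k in n) print n[k] ",/" k }' unsorted.csv |
  sort -t, -k2    # sort the aggregate by URL path, then date
# -> 2,/blog/a,2021-01-01
#    1,/blog/a,2021-01-02
#    1,/blog/b,2021-01-01
```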
[1] cut -d / -f 4- | sed -e 's/T..............$//' | sort | uniq -c | sed -e 's/^ *//' -e 's/ /,\//'
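For concreteness, here is a single hypothetical log line traced through [1]. The real input format isn't shown above, so the `URL,timestamp` CSV shape (with a 14-character time-of-day suffix after the `T`) is an assumption:

```shell
# Hypothetical input line; the real log format is an assumption.
echo 'https://example.com/blog/php-81-before-and-after,2021-06-21T12:34:56+00:00' |
  cut -d / -f 4- |                 # keep path onward: blog/php-81-before-and-after,2021-06-21T12:34:56+00:00
  sed -e 's/T..............$//' |  # drop the time of day: blog/php-81-before-and-after,2021-06-21
  sort | uniq -c |                 # count duplicate path,date lines (count is space-padded)
  sed -e 's/^ *//' -e 's/ /,\//'   # -> 1,/blog/php-81-before-and-after,2021-06-21
```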
[2]
#!/bin/zsh
# Reads "count,/url,date" lines (sorted by URL path, then date) on stdin
# and emits one JSON object per URL, keyed by date.
echo "{"
PAGE=none
while IFS= read -r LINE; do
    COLS=("${(@s/,/)LINE}")
    COUNT=${COLS[1]}
    URL=${COLS[2]}
    DATE=${COLS[3]}
    if [ "$URL" != "$PAGE" ]; then
        if [ "$PAGE" != none ]; then
            echo
            echo " },"
        fi
        PAGE=$URL
        echo " \"\\$URL\": {"    # "\/" is a legal JSON escape for "/"
        FINISHDATE=no
    else
        if [ "$FINISHDATE" = yes ]; then
            echo ","
        fi
    fi
    echo -n " \"$DATE\": $COUNT"
    FINISHDATE=yes
done
if [ "$PAGE" != none ]; then    # close the last page's object
    echo
    echo " }"
fi
echo "}"
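Hand-rolled JSON emitters like [2] are easy to get subtly wrong (exactly the trailing-comma trap mentioned above), so it may be worth piping the result through a validator. A sketch using Python's stock `json.tool` module, shown here on a one-entry sample of the target format rather than the real output:

```shell
# json.tool exits non-zero on anything malformed (a missing brace,
# a trailing comma, ...), so it makes a cheap sanity check.
echo '{"/blog/php-81-before-and-after": {"2021-06-21": 219}}' \
  | python3 -m json.tool > /dev/null && echo "valid JSON"
```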