100M-Row Challenge with PHP

161 points by brentroose today at 10:24 AM | 86 comments

Comments

brentroose today at 10:24 AM

A month ago, I went on a performance quest trying to optimize a PHP script that took 5 days to run. With the help of many talented developers, I eventually got it to run in under 30 seconds. The optimization process was so much fun, and so many people pitched in with their ideas, that I eventually decided I wanted to do something more.

That's why I built a performance challenge for the PHP community.

The goal of this challenge is to parse 100 million rows of data with PHP, as efficiently as possible. The challenge will run for about two weeks, and at the end there are prizes for the best entries (amongst the prizes is the much-sought-after PhpStorm Elephpant, of which we have only a handful left).

I hope people will have fun with it :)
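
For anyone who wants a feel for the task: a minimal, unoptimized baseline could look something like the sketch below. The row format (full URL, comma, ISO-8601 timestamp) and the file names are assumptions pieced together from the comments in this thread, not the official spec.

    <?php
    // Naive single-threaded baseline (a sketch, not the reference solution):
    // count hits per URL path per day for rows shaped like
    // "https://example.com/blog/some-post,2021-06-21T15:00:00+00:00".
    $counts = [];
    $handle = fopen('data.csv', 'rb');
    while (($line = fgets($handle)) !== false) {
        [$url, $timestamp] = explode(',', rtrim($line), 2);
        $path = substr($url, strpos($url, '/', 8)); // drop scheme and host
        $date = substr($timestamp, 0, 10);          // keep "YYYY-MM-DD"
        $counts[$path][$date] = ($counts[$path][$date] ?? 0) + 1;
    }
    fclose($handle);
    file_put_contents('output.json', json_encode($counts, JSON_PRETTY_PRINT));

Most of the headroom in real entries presumably comes from beating the per-line fgets() call and the array hashing in that loop.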

tzs today at 10:05 PM

What's a decent time for this?

I was curious what it would take if I approached it the way I do with most CSV transformation tasks that I'm only intending to do once: use Unix command line tools such as cut, sed, sort, and uniq to do the bulk of the work, and then do something in whatever scripting language seems appropriate to put the final output in whatever format is needed.

The first part, using this command [1], produces output lines that look like this:

  219,/blog/php-81-before-and-after,2021-06-21
and is sorted by URL path and then date.

With 1 million lines that took 9 or 10 seconds (M2 Max Mac Studio). But with 100 million it took 1220 seconds, virtually all of which was sorting.

Turning that into JSON via a shell script [2] was about 15 seconds. (That script is 44% longer than it would have been had JSON allowed a comma after the last element of an array.)

So basically 22 minutes. The sorting is the killer with this type of approach, because the input is 7 GB. The output is only 13 MB, there are under 300 pages, and the largest page count is under 1000, so building the output up in memory as the unsorted input is scanned, and sorting only at the end, would clearly be way, way faster.

[1] cut -d / -f 4- | sed -e 's/T..............$//' | sort | uniq -c | sed -e 's/^ *//' -e 's/ /,\//'

[2]

  #!/bin/zsh
  # Turns the sorted "count,url,date" lines from [1] into the JSON output.
  echo "{"
  PAGE=none
  while read -r LINE; do
      COLS=("${(@s/,/)LINE}")     # zsh: split on commas (arrays are 1-indexed)
      COUNT=${COLS[1]}
      URL=${COLS[2]}
      DATE=${COLS[3]}
      if [ "$URL" != "$PAGE" ]; then
          if [ "$PAGE" != "none" ]; then
              echo
              echo "    },"       # close the previous page's object
          fi
          PAGE=$URL
          echo "    \"\\$URL\": {"    # \/ is a legal JSON escape for the leading slash
          FINISHDATE=no
      else
          if [ "$FINISHDATE" = "yes" ]; then
              echo ","
          fi
      fi
      echo -n "        \"$DATE\": $COUNT"
      FINISHDATE=yes
  done
  echo
  if [ "$PAGE" != "none" ]; then
      echo "    }"                # close the final page's object
  fi
  echo "}"
Twirrim today at 10:00 PM

I took a quick look; the dependency on PHP 8.5 is mildly irritating. Even Ubuntu 26.04 isn't lined up to ship with that version; it's on 8.4.11.

You mention in the README that the goal is to run things in a standard environment, but then you're using a near-bleeding-edge PHP version that people are unlikely to be using?

I thought I'd just quickly spin up a container and take a look out of interest, but now it looks like I'll have to dig into building my own PHP packages, or compile my own version from scratch, to even begin to look at things?

Xeoncross today at 3:00 PM

This is why I jumped from PHP to Go, and then from Go to Rust.

Go is the most batteries-included language I've ever used. Instant compile times mean I can bind my tests to ctrl/cmd+s and run them every time I save the file. It's more performant (way less memory, similar CPU time) than C# or Java (and certainly all the scripting languages) and contains a massive stdlib for anything you could want to do. It's what scripting languages should have been. Anyone can read it, just like Python.

Rust takes the last 20% I couldn't get in a GC language and removes it. Sure, its syntax doesn't make sense to an outsider and you end up with third-party packages for a lot of things, but you can't beat its performance and safety. It removes a whole lot of tests, as those situations just aren't possible.

If Rust scares you, use Go. If Go scares you, use Rust.

pxtail today at 12:47 PM

Side note: I wasn't aware that there's an active collectors' scene for Elephpants. Awesome!

https://elephpant.me/

semiquaver today at 3:10 PM

Are they just confused about what characters require escaping in JSON strings, or is PHP weirder than I remember?

    {
        "\/blog\/11-million-rows-in-seconds": {
            "2025-01-24": 1,
            "2026-01-24": 2
        },
        "\/blog\/php-enums": {
            "2024-01-24": 1
        }
    }
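(For context, assuming I'm reading the question right: this is stock json_encode() behavior rather than confusion. PHP escapes forward slashes by default, and JSON_UNESCAPED_SLASHES opts out.)

    <?php
    $data = ['/blog/php-enums' => ['2024-01-24' => 1]];

    // Default: json_encode() escapes forward slashes.
    echo json_encode($data);
    // {"\/blog\/php-enums":{"2024-01-24":1}}

    // The flag that turns the escaping off:
    echo json_encode($data, JSON_UNESCAPED_SLASHES);
    // {"/blog/php-enums":{"2024-01-24":1}}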
chrismarlow9 today at 7:45 PM

I don't have time to put together a submission, but I'm willing to bet you can use this:

https://github.com/kjdev/php-ext-jq

And replicate this command:

jq -nR ' [inputs | split(",") | {url: .[0], date: .[1] | split("T")[0]}] | group_by(.url) | map({ (.[0].url): ( map(.date) | group_by(.) | map({(.[0]): length}) | add ) }) | add ' < test-data.csv

And it will be faster than anything you can do in native PHP.

Edit: I'm assuming none of the URLs contain a comma, but the point is more about offloading the work through an extension, even a custom-built one.
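
I haven't used that extension, so as a hedge, here's the same offloading idea done by shelling out to the jq binary from PHP (assumes jq is on PATH and the input is named test-data.csv):

    <?php
    // Sketch: hand the aggregation to jq and let PHP only orchestrate.
    $filter = '[inputs | split(",") | {url: .[0], date: .[1] | split("T")[0]}]'
        . ' | group_by(.url)'
        . ' | map({(.[0].url): (map(.date) | group_by(.) | map({(.[0]): length}) | add)})'
        . ' | add';
    $json = shell_exec('jq -nR ' . escapeshellarg($filter) . ' < test-data.csv');
    file_put_contents('output.json', $json);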

tveita today at 1:36 PM

> Also, the generator will use a seeded randomizer so that, for local development, you work on the same dataset as others

Except that the generator script generates dates relative to time()?
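
If the generator offsets dates from the current time, a fixed seed alone won't pin the dataset down. A hypothetical illustration (not the actual generator code):

    <?php
    // Hypothetical: the random offsets repeat across runs, but anchoring
    // them to time() shifts every generated date from day to day.
    $randomizer = new \Random\Randomizer(new \Random\Engine\Mt19937(42));
    $offset = $randomizer->getInt(0, 365 * 24 * 3600); // same every run
    echo date('Y-m-d', time() - $offset), "\n";        // not stable over time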

matei88 today at 7:41 PM

It reminds me of a good read about optimizing PHP for the 1 billion row challenge. TL;DR: at some point you hit a limit in PHP's stream layer.

https://dev.to/realflowcontrol/processing-one-billion-rows-i...
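
One common way past per-line stream calls (a rough sketch only; the article's exact approach may differ) is reading the file in big chunks and splitting lines yourself:

    <?php
    // Sketch: 1 MiB chunked reads instead of one fgets() call per row.
    $handle = fopen('data.csv', 'rb');
    $rest = '';
    while (($chunk = fread($handle, 1 << 20)) !== false && $chunk !== '') {
        $chunk = $rest . $chunk;
        $cut = strrpos($chunk, "\n");
        if ($cut === false) { $rest = $chunk; continue; }
        $rest = substr($chunk, $cut + 1);
        foreach (explode("\n", substr($chunk, 0, $cut)) as $line) {
            // ... parse $line as in a naive fgets() loop ...
        }
    }
    if ($rest !== '') {
        // ... parse the final line if the file lacks a trailing newline ...
    }
    fclose($handle);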

Retr0id today at 1:19 PM

How large is a sample 100M row file in bytes? (I tried to run the generator locally, but my PHP is not bleeding-edge enough.)

csjh today at 4:42 PM

Obligatory DuckDB solution:

> duckdb -s "COPY (SELECT url[20:] as url, date, count(*) as c FROM read_csv('data.csv', columns = { 'url': 'VARCHAR', 'date': 'DATE' }) GROUP BY url, date) TO 'output.json' (ARRAY)"

Takes about 8 seconds on my M1 MacBook. The JSON isn't in the right format, but fixing that wouldn't dominate the execution time.

poizan42 today at 3:04 PM

> The output should be encoded as a pretty JSON string.

...

> Your parser should store the following output in $outputPath as a JSON file:

    {
        "\/blog\/11-million-rows-in-seconds": {
            "2025-01-24": 1,
            "2026-01-24": 2
        },
        "\/blog\/php-enums": {
            "2024-01-24": 1
        }
    }
They don't define what exactly "pretty" means, but superfluous escapes are not very pretty in my opinion.
spiderfarmer today at 12:36 PM

Awesome. I’ll be following this. I’ll probably learn a ton.


tomaytotomato today at 5:15 PM

Tempted to submit a Java app wrapped in PHP exec() :D
