amluto • yesterday at 4:56 AM • 10 replies

There are many systems that take a native data structure in your favorite language and, using some sort of reflection, make an on-disk structure that resembles it. Python pickles and Java’s serialization system are infamous examples, and rkyv is a less alarming one.

I am quite strongly of the opinion that one should essentially never use these for anything that needs to work well at any scale. If you need an industrial strength on-disk format, start with a tool for defining on-disk formats, and map back to your language. This gives you far better safety, portability across languages, and often performance as well.

Depending on your needs, the right tool might be Parquet or Arrow or protobuf or Cap’n Proto or even JSON or XML or ASN.1. Note that there are zero programming languages in that list. The right choice is probably not C structs or pickles or some other language’s idea of pickles or even a really cool library that makes Rust do this.

(OMG I just discovered rkyv_dyn. boggle. Did someone really attempt to reproduce the security catastrophe that is Java deserialization in Rust? Hint: Java is also memory-safe, and that has not saved users of Java deserialization from all the extremely high severity security holes that have shown up over the years. You can shoot yourself in the foot just fine when you point a cannon at your foot, even if the cannon has no undefined behavior.)
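A minimal Python sketch of why reflection-driven deserialization is risky even in a memory-safe language. It uses pickle (named above) rather than Java or rkyv_dyn, and the `Payload` class is hypothetical. The harmless `eval("6 * 7")` stands in for whatever an attacker would actually choose to run.

```python
import json
import pickle


class Payload:
    """Hypothetical attacker-controlled object."""

    def __reduce__(self):
        # Tells pickle: "to rebuild this object, call eval('6 * 7')".
        # A real attack would name something far less friendly than eval
        # on an arithmetic string.
        return (eval, ("6 * 7",))


blob = pickle.dumps(Payload())

# Loading the bytes runs the callable chosen by whoever produced them.
result = pickle.loads(blob)
print(result)

# A data-only format hands back plain values and executes nothing.
data = json.loads('{"cmd": "6 * 7"}')
print(data)
```

The loader never sees a `Payload` instance. It only sees the instruction to call a function, which is the core of the problem: the format describes behavior, not just data.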

Replies

pjc50 • yesterday at 9:24 AM

Dotnet used to have BinaryFormatter, which they had to kill for this reason, and they basically recommend that exact set of serializers as a replacement: https://learn.microsoft.com/en-us/dotnet/standard/serializat...

bombela • today at 12:28 AM

> (OMG I just discovered rkyv_dyn. boggle. Did someone really attempt to reproduce the security catastrophe that is Java deserialization in Rust?

Trusting possibly malicious inputs is a universal problem.

Here is a simple example:

    echo "rm -rf /" > cmd
    sh cmd

And this problem is no different in rkyv than in rkyv_dyn or any other serialization format on the planet. The issue is trusting inputs. This is also called a man-in-the-middle attack.

The solution is to add a cryptographic signature to detect tampering.
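A minimal sketch of tamper detection, assuming both sides share a secret key. It uses an HMAC (a keyed MAC from the Python standard library) as a stand-in for the signature mentioned above; `SECRET_KEY`, `sign`, and `verify` are hypothetical names.

```python
import hashlib
import hmac

SECRET_KEY = b"demo-key-not-for-production"  # hypothetical shared secret
TAG_SIZE = hashlib.sha256().digest_size      # 32 bytes for SHA-256


def sign(payload: bytes, key: bytes = SECRET_KEY) -> bytes:
    """Prefix the payload with an HMAC-SHA256 tag."""
    tag = hmac.new(key, payload, hashlib.sha256).digest()
    return tag + payload


def verify(blob: bytes, key: bytes = SECRET_KEY) -> bytes:
    """Return the payload if the tag matches, otherwise raise ValueError."""
    tag, payload = blob[:TAG_SIZE], blob[TAG_SIZE:]
    expected = hmac.new(key, payload, hashlib.sha256).digest()
    # compare_digest avoids leaking the mismatch position through timing.
    if not hmac.compare_digest(tag, expected):
        raise ValueError("tag mismatch: payload was modified")
    return payload


signed = sign(b'{"user": "alice"}')
print(verify(signed))

tampered = signed[:-1] + b"!"
try:
    verify(tampered)
    detected = False
except ValueError:
    detected = True
print(detected)
```

This only shows that the bytes came from someone holding the key and were not changed afterwards. It says nothing about whether the key holder's data is safe to deserialize.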

hdjrudni • yesterday at 8:15 AM

Don't forget PHP's serialize/unserialize, it's also sketchy. Looks like they at least put up a big warning in their docs: https://www.php.net/manual/en/function.unserialize.php

Not hating on PHP, to be clear. It has its warts, but it has served me well.

gz09 • yesterday at 5:11 AM

> Depending on your needs, the right tool might be Parquet or Arrow or protobuf or Cap’n Proto

I think parquet and arrow are great formats, but ultimately they have to solve a similar problem that rkyv solves: for any given type that they support, what does the bit pattern look like in serialized form and in deserialized form (and how do I convert between the two).

However, it is useful to point out that parquet/arrow on top of that solve many more problems needed to store data 'at scale' than rkyv (which is just a serialization framework after all): well defined data and file format, backward compatibility, bloom filters, run length encoding, compression, indexes, interoperability between languages, etc. etc.

john01dav • yesterday at 7:42 PM

> even a really cool library that makes Rust do this.

The first library that comes to mind when I think of this is `serde` with `#[derive(Serialize, Deserialize)]`, but that produces output in a persistence format, which is what you describe as preferable to the former case. I usually use it with JSON.

So, this seems like it may be a false dichotomy.
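A rough Python analogue of that workflow, not serde itself: the type is defined in code, but what reaches disk is plain JSON that any language can read. The `User` type and both helper functions are hypothetical.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class User:
    name: str
    age: int


def to_json(user: User) -> str:
    # The on-disk form is ordinary JSON, not a dump of in-memory layout.
    return json.dumps(asdict(user), sort_keys=True)


def from_json(text: str) -> User:
    raw = json.loads(text)
    # Map fields back explicitly, so unexpected input fails here
    # instead of turning into arbitrary objects.
    return User(name=str(raw["name"]), age=int(raw["age"]))


encoded = to_json(User("alice", 30))
print(encoded)
print(from_json(encoded))
```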

vlovich123 • yesterday at 6:37 AM

Protobuf definitely doesn’t solve the problems described. Cap’n Proto may solve it but I’m not 100% sure. JSON/XML/ASN.1 definitely don’t.

It’s like you listed a bunch of serialization technologies without grokking that the problem outlined in the post doesn’t have much to do with rkyv itself.

neilyio • yesterday at 6:20 AM

Delightful metaphor, I'll be looking everywhere for a chance to use that now!

imtringued • yesterday at 10:31 AM

Fully agreed. rkyv looks like something that is hyper-optimizing for a very niche case, but doesn't actually admit that it is doing so. The use case here is transient data akin to swapping in-memory data to disk.

"However, while the former have external schemas and heavily restricted data types, rkyv allows all serialized types to be defined in code and can serialize a wide variety of types that the others cannot."

At first glance, it might sound like rkyv is better; after all, it has fewer restrictions and external schemas are annoying. But it doesn't actually solve the schema issue by having a self-describing format like JSON or CBOR. You won't be able to use the data outside of Rust and you're probably tied to a specific Rust version.

LtWorf • yesterday at 7:21 AM

But if you use complicated serialisation formats you can't mmap a file into memory and use it directly. Which is quite convenient if you don't want to parse the whole file and allocate it to memory because it's too large compared to the amount of memory or time you have.
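A minimal sketch of that access pattern, assuming a file of fixed-size records with an explicitly defined layout. The record format (`<IQ`: a little-endian 32-bit id followed by a 64-bit value) and the file name are made up for illustration.

```python
import mmap
import os
import struct
import tempfile

# Explicit layout: little-endian u32 id, then u64 value, no padding.
RECORD = struct.Struct("<IQ")

path = os.path.join(tempfile.mkdtemp(), "records.bin")
with open(path, "wb") as f:
    for i in range(1000):
        f.write(RECORD.pack(i, i * i))


def read_record(file_path: str, index: int) -> tuple:
    """Fetch one record by offset without reading or parsing the rest."""
    with open(file_path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            # Only the pages backing this offset need to be touched.
            return RECORD.unpack_from(mm, index * RECORD.size)


print(read_record(path, 750))
```

Because the layout is fixed by the format string rather than by a compiler's struct layout, the same file can be read from other languages while still allowing random access.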

userbinator • yesterday at 7:35 AM

> and often performance as well

BS. Nothing can be faster than a read()/write() (or even mmap()) into a struct, because everything else would need to do more work.
