Hacker News

computomatic · yesterday at 2:56 PM · 4 replies

I was doing some experiments with removing the top 100–1000 most common English words from my prompts. My hypothesis was that common words are effectively noise to agents. Based on the first few trials, there was no discernible difference in output. Would love to compare results with caveman.

Caveat: I didn’t do enough testing to find the edge cases (e.g., negation).
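The commenter doesn't share code, but the experiment is easy to reproduce. A minimal sketch, assuming a hand-rolled stopword set (a small illustrative sample here, not the actual top-100/top-1000 list) and simple regex tokenization:

```python
import re

# Illustrative sample of very common English words; the commenter's
# experiment would use a real top-100 or top-1000 frequency list.
COMMON_WORDS = {
    "the", "a", "an", "of", "to", "and", "in", "is", "it",
    "that", "for", "on", "with", "as", "was", "are", "be",
}

def compress_prompt(prompt: str) -> str:
    """Drop common words from a prompt, keeping everything else in order."""
    # Split into word tokens and standalone punctuation.
    tokens = re.findall(r"\w+|[^\w\s]", prompt)
    kept = [t for t in tokens if t.lower() not in COMMON_WORDS]
    return " ".join(kept)

print(compress_prompt("The cat sat on the mat"))
# Note the edge case flagged above: negation words like "not" are
# frequent too, and stripping them can invert the prompt's meaning.
```

This makes the negation caveat concrete: "do not delete the file" and "do delete the file" compress to the same thing once "not" lands in the stopword set.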


Replies

computerphage · yesterday at 3:31 PM

Yeah, when I'm writing code I try to avoid zeros and ones, since those are the most common bits, making them essentially noise

ruairidhwm · yesterday at 3:28 PM

I literally just posted a blog post on this. Some seemingly insignificant words are actually highly structural to the model. https://www.ruairidh.dev/blog/compressing-prompts-with-an-au...

AlecSchueler · yesterday at 3:35 PM

Doesn't it just use more tokens in reasoning?

slashdave · today at 12:21 AM

> My hypothesis was that common words are effectively noise to agents

Umm... a few words can be combined in a rather large number of ways.

Punctuation is used a lot. Why not just remove all the periods and commas and see what happens? Probably not pretty.
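The punctuation-stripping experiment proposed here is a one-liner. A sketch, limited to periods and commas as the comment suggests:

```python
import re

def strip_periods_and_commas(prompt: str) -> str:
    """Remove periods and commas, collapsing any doubled-up spaces."""
    no_punct = re.sub(r"[.,]", "", prompt)
    return re.sub(r"\s+", " ", no_punct).strip()

print(strip_periods_and_commas("First, do X. Then, do Y."))
```

Sentence boundaries disappear, so instructions that relied on them ("Do X. Never do Y.") can blur together, which is presumably the "not pretty" outcome.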