Shrinking Intelligence

You may have seen Pied Piper's terrible company logo doing the rounds on X, following the recent Google announcement of TurboQuant.

Remember Pied Piper's video stream of a man trying to rescue a bird egg, only to fall and require rescuing himself?

This was all caught on a video live stream that the company's lossless compression algorithm handled, delivering smoothly to a couple hundred thousand viewers before the company's in-home server rack started to smoke and catch fire.

TurboQuant is a quantization algorithm. There's a lot more detail in the announcement and research paper but the claim is that it reduces model size with zero accuracy loss.

If you've tried using smaller models or performed vector searches on quantized vectors, you've probably also noticed a degradation in quality.

If you've tried storing embeddings in a database, you've also likely been scared by the number of floats that you need to fit in there to represent a small amount of text, audio or video data.
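The numbers get scary quickly. Here is a rough back-of-the-envelope calculation; the 1536-dimension figure is illustrative (it matches several popular embedding models), not tied to any particular database:

```python
# Rough storage math for raw float32 embeddings.
# The dimension count and document count here are illustrative assumptions.
dims = 1536
bytes_per_vector = dims * 4          # float32 = 4 bytes per dimension
docs = 1_000_000                     # one vector per document chunk
total_gb = docs * bytes_per_vector / 1e9

print(bytes_per_vector, round(total_gb, 1))  # 6144 bytes each, ~6.1 GB total
```

Six kilobytes to represent one small chunk of text, before you account for indexes or replicas.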

We've been obsessed with creating intelligence and now we're obsessed with shrinking it.

Whether we've created it remains up for discussion; perspectives on the validity of benchmarks, and on what intelligence really means, remain unsettled.

However, by traditional scoring methods, TurboQuant proposes that we can make intelligence smaller without making it dumber, and that holds promise for true abundance: available everywhere, on any device, in any environment.

At present, models with billions of parameters, served from the cloud, require enormous amounts of GPU compute to train and deliver to users.

That does not scale to a societal intelligence layer.

I was late to the party watching Silicon Valley but the idea of using compression to improve speed, quality and adoption is as relevant as ever.

At Couchbase, we have our own quantization options that can be selected when building indexes for your vectors: Product Quantization (PQ) and Scalar Quantization (SQ).

With PQ there is a significant reduction in data size: after training, comparing small integer codes is far cheaper computationally than comparing full floating-point embeddings, and the codes require far less memory to store (each vector down from 64KB to 128 bytes). The trade-off is accuracy, as searches compare quantized vectors rather than their original raw values.
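To make the idea concrete, here is a minimal sketch of product quantization in general, not Couchbase's implementation. All sizes are toy assumptions: 128-dimensional vectors split into 8 subvectors, with 16 centroids per subspace to keep the naive k-means fast (real PQ typically uses 256, so each code fits in one byte either way):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy setup (all values are illustrative assumptions):
# N vectors of D dims, split into M subvectors of D//M dims,
# K centroids per subspace.
N, D, M, K = 1000, 128, 8, 16
sub = D // M
data = rng.normal(size=(N, D)).astype(np.float32)

def train_codebooks(x, iters=10):
    """Naive k-means per subspace; returns an (M, K, sub) array of centroids."""
    books = []
    for m in range(M):
        chunk = x[:, m * sub:(m + 1) * sub]
        cent = chunk[rng.choice(len(chunk), K, replace=False)].copy()
        for _ in range(iters):
            dist = np.linalg.norm(chunk[:, None, :] - cent[None, :, :], axis=2)
            assign = dist.argmin(axis=1)
            for k in range(K):
                pts = chunk[assign == k]
                if len(pts):
                    cent[k] = pts.mean(axis=0)
        books.append(cent)
    return np.stack(books)

def encode(x, books):
    """Each vector becomes M small integer codes instead of D floats."""
    codes = np.empty((len(x), M), dtype=np.uint8)
    for m in range(M):
        chunk = x[:, m * sub:(m + 1) * sub]
        dist = np.linalg.norm(chunk[:, None, :] - books[m][None, :, :], axis=2)
        codes[:, m] = dist.argmin(axis=1)
    return codes

def decode(codes, books):
    """Approximate reconstruction by stitching centroids back together."""
    return np.concatenate([books[m][codes[:, m]] for m in range(M)], axis=1)

books = train_codebooks(data)
codes = encode(data, books)
approx = decode(codes, books)

# M one-byte codes versus D four-byte floats per vector:
print(codes.nbytes // N, data.nbytes // N)  # 8 bytes vs 512 bytes
```

The compression comes from replacing every subvector with the ID of its nearest centroid; the accuracy loss comes from searching against those reconstructed centroids rather than the raw data.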

With SQ, each dimension of the vector is quantized independently. The quantized vectors are larger than PQ codes but are a more precise representation.
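A minimal sketch of the scalar quantization idea, again generic rather than Couchbase-specific: each float32 dimension is mapped to a single byte using per-dimension minimum and maximum values learned from the data. All sizes below are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(1)
vecs = rng.normal(size=(1000, 128)).astype(np.float32)  # toy data

# Per-dimension range learned from the data itself.
lo = vecs.min(axis=0)
hi = vecs.max(axis=0)
scale = (hi - lo) / 255.0

def sq_encode(x):
    """Map each float32 dimension to one uint8 bucket (4x smaller)."""
    return np.clip(np.round((x - lo) / scale), 0, 255).astype(np.uint8)

def sq_decode(q):
    """Approximate reconstruction from the bucket indices."""
    return q.astype(np.float32) * scale + lo

q = sq_encode(vecs)
approx = sq_decode(q)
max_err = np.abs(approx - vecs).max()

print(q.nbytes // len(vecs), vecs.nbytes // len(vecs))  # 128 vs 512 bytes
```

Because every dimension keeps its own code, the error per dimension is bounded by half a bucket width, which is why SQ sits between raw floats and PQ on the size/precision curve.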

What the TurboQuant algorithm introduces is a step towards a future where the trade-offs are far less severe: not compromising on compute, storage, accuracy or intelligence.

Reading Chip War will give you a good insight into humanity's obsession with shrinking processing power. This is now being applied to models themselves.

It makes for an interesting thought experiment imagining a future where intelligence as we know it today is always available from any device.

How does the world change if that were true?

The scale of improvement and the pace of breakthroughs are outstripping even the innovation we experienced with Moore's Law, the doubling of transistors on a microchip every two years.

This is such a fundamental change in our quest for developing intelligence and making it available that it also poses questions over our role in society.

What is the value of our skills and unique contributions to the world when we aren't ranked or scored by IQ alone?

When what we define intelligence as today is woven deeply into the fabric of society, how will our priorities change?