Data compression is as old as electronic communication. In A New Kind of Science, Stephen Wolfram writes that “Morse code, invented in 1838 for use in telegraphy, is an early example of data compression based on using shorter codewords for letters such as ‘e’ and ‘t’ that are more common in English.” At the dawn of the computer era, Claude Shannon, the father of information theory, outlined its limits, demonstrating just how far it could go.
Since then, some of the world’s best minds have explored how much information we don’t need, testing the limits of mathematics, computing power, and perception to reduce communication to its essence (technically speaking, anyway). The need keeps growing as streaming video gets more data-intensive and more popular, taking up an ever-larger share of worldwide internet traffic. It’s the main reason the data going through our pipes is expected to triple between 2016 and 2021, and why the world’s biggest tech companies are at work, and at war, over the next generation of video compression.
Here’s a short-as-possible history.
Brief history
1867: Chicago Tribune publisher Joseph Medill argues for eliminating excess letters from the English language, like dropping the “e” in “favorite.”
1929: RCA’s Ray Kell files the first patent for video compression.
1934: Tribune publisher Robert R. McCormick, Medill’s grandson, institutes compressed spelling rules; some stick (“analog,” “canceled”), some don’t (“hocky,” “doctrin”).
1948: Claude Shannon and Robert Fano independently discover a technique for lossless compression known as Shannon-Fano coding.
1951: Fano’s student David Huffman pioneers a still-used method called Huffman coding.
1974: Nasir Ahmed develops the Discrete Cosine Transform (DCT), used in JPEG and MPEG.
1976: Jorma Rissanen and Richard Pasco develop arithmetic coding, used in formats like JPEG and H.264.
1977: Abraham Lempel and Jacob Ziv develop dictionary-based coding, which is used in ZIP and GIF formats.
1986: 24-year-old Phil Katz invents the ZIP file.
1987: CompuServe introduces GIFs.
1992: The JPEG format is introduced, a decade after the Joint Photographic Experts Group is convened to develop a standard for transmitting images electronically; Adobe introduces the PDF.
1995: The MP3 standard gets its famous .mp3 file extension, short for MPEG (Moving Picture Experts Group) Audio Layer III.
Explain it like I'm 5!
There are many specific, and often proprietary, types of compression, so let’s focus on two major categories: lossy and lossless. With lossy compression, you lose some information forever. Take the JPEG image format, which uses both. It begins by converting an RGB (red, green, blue) image to YCbCr. Y is luma, which determines how bright a pixel is; Cb and Cr carry the color information. Why? The human eye is more sensitive to brightness than to color. You can lose more color information than brightness, so the method separates them and “downsamples” the color, preserving less information about it.
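To make the downsampling concrete, here’s a minimal Python sketch of the idea, not JPEG’s exact pipeline: convert RGB to YCbCr with the standard BT.601 formulas, then keep only every other chroma sample in each direction, roughly what 4:2:0 subsampling does (real encoders usually average neighboring samples rather than just dropping them). The function names and the toy image are only for illustration.

```python
import numpy as np

def rgb_to_ycbcr(rgb):
    """Convert an HxWx3 RGB array (values 0-255) to Y, Cb, Cr planes (BT.601)."""
    r, g, b = rgb[..., 0], rgb[..., 1], rgb[..., 2]
    y  =  0.299 * r + 0.587 * g + 0.114 * b              # brightness
    cb = 128 - 0.168736 * r - 0.331264 * g + 0.5 * b     # blue-difference chroma
    cr = 128 + 0.5 * r - 0.418688 * g - 0.081312 * b     # red-difference chroma
    return y, cb, cr

def downsample_420(plane):
    """Keep every other row and column: a crude 4:2:0-style chroma subsampling."""
    return plane[::2, ::2]

rgb = np.random.randint(0, 256, (8, 8, 3)).astype(float)  # toy 8x8 image
y, cb, cr = rgb_to_ycbcr(rgb)
cb_small, cr_small = downsample_420(cb), downsample_420(cr)
# Full-resolution luma plus quarter-resolution chroma is 8*8 + 2*(4*4) = 96 samples
# instead of 3*8*8 = 192, i.e. half the data before any further compression.
```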
Discrete Cosine Transform is then used to “map an image space into a frequency” (pdf). This is where the math gets complicated, though the idea is simple: identify the least-necessary information in the picture and get rid of it. Low-frequency parts of the image represent gradual color change, like the sky; high-frequency parts represent lots of color changes in a small space, like leaves on a distant tree. Viewers notice if you skimp on the former; less so on the latter. The DCT process identifies what you can safely get rid of. (For visual examples, this is a good video.)
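For a feel of what the transform does, here’s a rough Python sketch of the 8×8 DCT-II that JPEG is built around, written straight from the textbook formula rather than with an optimized library routine. Feeding it a perfectly flat block shows the point: a gradual, low-frequency region collapses into a single coefficient, and everything else comes out zero.

```python
import numpy as np

def dct2_8x8(block):
    """2-D DCT-II of an 8x8 block, the transform JPEG compression is built on."""
    out = np.zeros((8, 8))
    for u in range(8):
        for v in range(8):
            cu = 1 / np.sqrt(2) if u == 0 else 1.0
            cv = 1 / np.sqrt(2) if v == 0 else 1.0
            total = 0.0
            for x in range(8):
                for y in range(8):
                    total += block[x, y] * np.cos((2 * x + 1) * u * np.pi / 16) \
                                         * np.cos((2 * y + 1) * v * np.pi / 16)
            out[u, v] = 0.25 * cu * cv * total
    return out

flat = np.full((8, 8), 120.0)        # a patch of sky: one gradual color
coeffs = dct2_8x8(flat - 128)        # JPEG shifts samples to center them on zero
print(np.round(coeffs, 1))           # only the top-left "DC" coefficient is nonzero
```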
Next comes quantization. To greatly simplify, that’s the math that takes what the DCT identified as unnecessary and gets rid of it. That’s the “lossy” part of the compression; you can’t get that data back.
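A sketch of that step, continuing from the DCT block above: divide each coefficient by an entry in a quantization table and round to the nearest integer. The table here is made up for illustration (real JPEG tables come from the standard and scale with the quality setting), but the effect is the same: big divisors in the high-frequency corner turn small coefficients into zeros, and multiplying back never recovers what the rounding threw away.

```python
import numpy as np

# Illustrative quantization table: gentle steps for low frequencies (top-left),
# harsh steps for high frequencies (bottom-right). Not a real JPEG table.
quant_table = np.array([[1 + 4 * (u + v) for v in range(8)] for u in range(8)], dtype=float)

def quantize(coeffs, table):
    """Divide each DCT coefficient by its step size and round: the lossy step."""
    return np.round(coeffs / table).astype(int)

def dequantize(q, table):
    """Multiply back out; the rounding error is gone for good."""
    return q * table

# Toy coefficients that shrink toward high frequencies, like a typical DCT block.
coeffs = 100.0 / (1 + np.add.outer(np.arange(8), np.arange(8)))
q = quantize(coeffs, quant_table)
print(q)                                      # the high-frequency corner is now zeros
print(dequantize(q, quant_table) - coeffs)    # the detail we can never get back
```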
Then there’s the lossless compression, based in part on David Huffman’s 1951 breakthrough. Data items that occur often in a file are coded with the fewest bits possible. Take this example (pdf): Under the ASCII coding standard, each letter in the phrase “happy hip hop” is represented by eight bits. With Huffman coding, you can represent “h” and “p” with two bits, “a” and “i” with three bits, and so forth, getting a 104-bit phrase down to 39. (It also produces a header or file that allows the computer to translate the Huffman-encoded information, like a secret decoder ring.)
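Here’s a small Python sketch of the same idea with the same phrase. It builds one possible Huffman code with a heap; the exact bit count depends on the code table you end up with, but either way the 104 ASCII bits shrink to a few dozen, and frequent letters like “p” and “h” always get the shortest codes.

```python
import heapq
from collections import Counter

def huffman_code(text):
    """Build one possible Huffman code: frequent symbols get the shortest bit strings."""
    freq = Counter(text)
    # Heap entries: (weight, unique tiebreaker, {symbol: code-so-far}).
    heap = [(count, i, {sym: ""}) for i, (sym, count) in enumerate(freq.items())]
    heapq.heapify(heap)
    next_id = len(heap)
    while len(heap) > 1:
        w1, _, codes1 = heapq.heappop(heap)   # pull the two rarest subtrees...
        w2, _, codes2 = heapq.heappop(heap)
        merged = {s: "0" + c for s, c in codes1.items()}
        merged.update({s: "1" + c for s, c in codes2.items()})
        heapq.heappush(heap, (w1 + w2, next_id, merged))  # ...and merge them
        next_id += 1
    return heap[0][2]

phrase = "happy hip hop"
code = huffman_code(phrase)
encoded = "".join(code[ch] for ch in phrase)
print(code)
print(f"ASCII: {8 * len(phrase)} bits, Huffman: {len(encoded)} bits")
# A real file also has to store the code table itself -- the "secret decoder ring."
```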
Pop quiz
The hardest challenge for MP3 creator Karlheinz Brandenburg was compressing the human voice; Suzanne Vega’s a cappella hit “Tom’s Diner” was the perfect test.
Person of interest
In 1972, Lena Sjööblom, a Swedish model living in Chicago, was Playboy’s Miss November. Six or seven months later, the staff at the Signal and Image Processing Institute at the University of Southern California wanted an image that was “glossy to ensure good output dynamic range, and they wanted a human face” (pdf). Someone had the November Playboy on hand, so they scanned her picture, and a legend was born.
The Lena image became a standard to test compression algorithms. It’s more than just a pretty face; it’s “a nice mixture of detail, flat regions, shading, and texture that do a good job of testing various image processing algorithms.” There’s focus blur, dramatic lighting, the tricky feather in her hat, and her reflection in a mirror. Arguably the most important test of an algorithm is to get the human face right.
But the Lena image has fallen out of favor. First, no one cleared the copyright, though Playboy decided to be chill about research and educational use. Second, potentially important technical information about how the photo was made and printed was lacking. Finally, the use of a Playboy centerfold, even in SFW form (the photo is suggestive, but Lena bares only her shoulders), was an unwittingly apt representation of the tech world’s attitude toward women, which is why some started using Fabio images instead. As for Lena? Her career took an appropriate turn: she worked as a model at Kodak to test color film, then taught disabled computer users in her home country.
By the digits
33 zettabytes: Size of the “global datasphere” in 2018
175 zettabytes: Estimated size by 2025
73%: Share of internet traffic devoted to video in 2016
82%: Share projected for 2021
70 exabytes: Internet video traffic in 2016
228 exabytes: Projected internet video traffic in 2021
7 gigabytes: Maximum data per hour required to stream 4K Netflix video
20%: Data savings Netflix realized by customizing compression for each title
10 terabytes–100 terabytes: Estimate for how much data the human brain can store
4: Grams of DNA required to hold the equivalent of all the data in the world as of 2011
How we 💽 now
Video is by far the biggest strain on bandwidth, and it’s only getting bigger. The dominant video codec (coder-decoder) has been H.264. That was followed by H.265, or HEVC, in 2013. It’s much better at compression (here’s why), but uptake has been slow. First, it requires more powerful hardware; Apple didn’t go all in on H.265 until 2017. Licensing fees are also higher than for H.264. Meanwhile, in 2015, tech giants Amazon, Cisco, Google, Intel, Microsoft, Mozilla, and Netflix formed the Alliance for Open Media and released the royalty-free AV1 codec in 2018; in 2019, BBC tests found AV1 competitive with H.265.
To make matters more complicated, there’s another next-generation codec on the horizon, VVC, which is expected to be finalized in 2020. The same BBC tests found it to be better at compression than both AV1 and H.265, but it’s not going to be royalty-free. Then again, AV1 might not be either: An IP protection company is claiming royalties on any device that uses AV1—which, for example, would add up to $29 million a year for Apple if it adopted the codec. As Jan Ozer writes at Streaming Media, the next dominant video codec could be decided by patent attorneys.
watch this!
In 2016, Netflix released Meridian, a 12-minute film noir. It got bad reviews, but the target audience was engineers who wanted to test codecs and equipment. “It’s a weird story wrapped up in a bunch of engineering requirements,” Netflix’s Chris Fetner said.
take me down this 🐰 hole!
One form of data that’s piling up fast is some of our most important info: genomic databases. As Dmitri Pavlichin and Tsachy Weissman explain in IEEE Spectrum, they contain a lot of redundant data, which (remember Huffman coding!) makes them well suited to the genome-specific compression that researchers like themselves are working on.
💬let's talk!
In yesterday’s poll about black holes, 44% of you said there are wormholes inside, 33% said there’s nothing, and 23% said their interiors are empty screening rooms, playing an endless loop of 2001: A Space Odyssey. 💌 Rick wrote in to suggest that perhaps they’re filled with “the center[s] of all of the donuts ever eaten.”
🤔 What did you think of today’s email?
Today’s email was written by Whet Moser, edited by Annaliese Griffin, and produced by Tori Smith.