“Five words, Dan, it’s FIVE words long!”
“- are you sure?”
“I ran it through three independent validators, including that Chinese one we just got access to last week”
“If you ran it through DeepWave, I hope to God you encrypted it”
“Dan, I’ve been quantum encrypting everything I’ve been finding for the last 7 years, do you really think I would have taken any risk on a fiver?”
“Who else knows?”
“No one yet. I was planning to discuss it with the committee on Friday…”
“Wait. Let’s not do it yet. We need some time before anyone else sees it”
“But Dan, the timing is perfect for the P3 grant submission. I’ve been burning through funds like a wildfire with all these quantum encryption cycles”
“Nam, committee members are humans too - there’s too high of a chance someone will leak it and try to validate without encryption. This is too vulnerable. It will put all our work at risk. I’ll pitch in. We can move money from my P1 grant.”
“When do we start the dig?”
“Tomorrow”
None of the three validators could detect a hash match of the five words Nam had found against their repositories. When DeepWave did not return a match either, Nam could hear his pulse beating against his eardrums. DeepWave was said to have the most complete repository, and with the additional exotic languages it covered, Nam got the confidence he needed to treat this find as the real deal. After all, Nam had spent the last 13 years of his academic life searching to no avail. Dan was credited with finding a total of four three-word original sentences in his 40-year career, and Nam had been apprenticing under him for the last 2. He had moved halfway across the country for the opportunity. Five words was unimaginable. People pinned the theoretical limit at four, but no four-word sentence had been found either. When Nam lay in bed later that night, he couldn’t sleep. At 2 am it struck him what five words could mean. What you could do with five words could be orders of magnitude more than what the papers predicted could be done with four. Dan was right. They couldn’t afford to risk it.
When they met in Dan’s office the next day, it was clear neither man had slept. If they had known it to be the case, they would have met and started digging in the middle of the night, but out of courtesy, they both arrived at 7 am sharp, assuming the other would be well rested. Dan dropped his bag down on his table with a thud. Out of it he retrieved a smooth rock the size of a pineapple, and a rotary tool. “Nam, now I know you’re going to think I’m crazy, but I’m an old man and you’ll humor me. If anything happens… I want these words right here.” Nam would have agreed with the old man’s self-assessment had his mind not been racing through all the possibilities the night before. Now he was on the same wavelength. “That’s going to be one f-ing Rosetta Stone,” he said, grabbing the rotary.
They started with the source document in which Nam had found the five words: an unassuming piece about spearfishing regulations around the different Polynesian islands. But it gave them a lead, and they began to pull up documents about Fiji, Tahiti, Aotearoa, … each time running a combinatorial hash match against the five words, looking for subset matches. Whenever two or three words matched, they logged the document location and local context, and used those contexts as the launching points for new searches. For weeks this digging through documents continued. A few hours of sleep, and back at it again. Their start time began to drift. They began showing up earlier and earlier each day to Dan’s office. They were logging and organizing thousands of document fragments. For each organized cluster of documents, they reached out through anonymized channels to domain experts on different topics - for instance, to Ava, who was a linguist specializing in Arabic, and Jalla, who had written a paper about the meditative practices of sub-Saharan tribes. With each new “blind” collaborator they brought on, they shared only a grouping of documents and a linguistic key - the set of words on which that grouping was based. This shared information was always met with the same kind of exhilarated high that Nam and Dan felt when they discovered the original five-word key. The document groupings provided previously unseen connections, uncovered unexpected conceptual relationships, strung together chains of reasoning never before reported.
There was a kind of quiet fervor; one could almost hear the shuffling of thousands of academics all over the world. All these exchanges took place on the sub-webs, in encrypted chats. Nam and Dan took care to anonymize all their communications, to pass the linguistic keys through multiple levels of quantum encryption, and to ensure that while they set off the initial spark trails, the resulting smolder paths could not be traced back to them. But because all the resulting findings were open sourced, every morning when Dan and Nam opened the subarxiv pages, they were overwhelmed by the size of the fire that was burning, the amount of knowledge that was being published and shared daily. The academic community had started to piece together possible prior beliefs from different cultures and times, pieces of discourse that likely took place between individuals in un-augmented sources of communication, un-edited facts shared in textbooks to schoolchildren, personal chats between friends and loved ones in the un-assisted days. Every few days a new misspelling was shared as evidence. It was like a new fragment of the old world. Small pieces of the mosaic, all coming together bit by bit to provide small glimpses into how people talked, wrote, thought, believed, argued, loved, and felt before the surge of the LLMs, before the governments digitized all the information, before the many re-writings, re-trainings, re-generations, re-trainings … the endless cycle where machine learned only from machine, until it became less and less discoverable what, if any, content was originally authored by humans. Chewed and re-chewed into a homogeneous paste. Government after government re-writing, re-chewing, re-spewing content in a new form, then more re-training with the generated content. Reward functions were encrypted and then deleted, so the next set of keyholders could not reconstruct the original repositories of knowledge.
Each repository was now a re-processed version of the previous, untraceable to the original sources of information that were once used. The masses adapted quickly, naturally. For any thought, there was always a way to express it to others, you just had to click approve - first on the auto-complete, then on the auto-thought - and your words and thoughts formulated themselves, based on the pre-approved linguistic patterns, facts, and stories of the time.
But human curiosity runs deep - a legacy of our animal origins. Many people continued to hunt for “the originals”, any excerpts of text that could be traced back to some kind of authentic, un-edited piece of human communication, hiding in the haystack. A goldilocks challenge - the shorter the fragment, the lower the possibility of proving it was original; the longer the fragment, the greater the possibility it would have been broken up and re-written many times over. Dan was a Linguistic Archeologist and had devoted his whole life’s energy to the hunt for these original linguistic artifacts. Any discovered “original” fragment of text could be used as a key to unlock how people previously thought about some topic, through the combination of words they selected in describing the topic. This combination of words could in turn be used as reinforcement for LLMs, to bootstrap the generation of more proxy original excerpts of text, and so on. Unfortunately, the shorter the text, the weaker the generalizations, limiting how much could be gleaned about past human communication. But Linguistic Algorithmists estimated that the longer text fragments, if found, could be used for bootstrapping the discovery of significant portions of past human linguistic exchanges, and with them, past belief systems, philosophies, religions, rhetorics of the time, etc. In fact, the amount that could be learned and extracted from a text fragment grew exponentially with every extra word that text fragment contained. When this research came out, the hunt for four-word “originals” exploded, but none had been found to date. And no one except Nam, Dan, and a few earthworms living below the rock underneath Dan’s shed would know about the existence of Nam’s “fiver”. In fact, Dan felt very strongly that he and Nam needed to forget the five words, for the safety of the whole community whose work had been unlocked, and for the sake of finding the truth about humanity’s past communications.
If the five-word fragment ended up in the hands of authorities and was used as input to an LLM alongside the knowledge that it was an “original”, the fragment, and all the other knowledge that was derived from it, would be quickly re-written and lost, and the probability that another original would be discovered at any point in the future would dwindle quickly. A haystack grows quickly when hay can be generated on command. This deep truth is what cut away at Nam and Dan’s sleep every night, and hastened their digging for the knowledge unlocked by their key. They were racing against time: time before their funding ran out for the compute they needed to keep digging, and time before authorities would catch on to this new knowledge surfacing.
Then they stumbled on Aki. Aki was part of a small, underground collective that occasionally met in person to exchange physical items, specialty items assembled into collections, passed along between members and across generations. Given the risk inherent in their activities, membership was limited to those with kinship connections. Dan’s great-grandfather had been the “original” kind of Archeologist. A portion of his avian bone collection was in safekeeping in Dan’s shed. Nam had an idea. They met Aki one morning in the University cafeteria. Aki was visiting from another town. The exchange was brief. Aki passed a take-out container across the table, and Dan handed him a tin cookie box in return. A tin cookie box packed to the brim with bones. The men nodded to each other politely and went their separate ways. In Dan’s office later that morning, Nam and Dan paused over the take-out box, eyes darting, fingers trembling. Inside, 3 USB sticks: 2029, 2037, 2042 inscribed on their sides. Compressed copies of the LLMs trained at the company Aki’s father had worked at many years prior. Nam spent the next days wiring something together so that they could access the information on the stick labeled 2029. The rawest repository they could get their hands on.
“Dan. I ha- I hav-. Uh. Dan. 15 words”
“Impossible”
Silence.
“Read them to me”
“it’s quite conceivable that humanity is just a passing phase in the evolution of intelligence”
“Mōshiwake arimasen.” Aki was standing at Dan’s door. “Mōshiwake arimasen,” he repeated as his eyes sank to the floor, behind him 10 gray uniforms.
Five words was the length of the most life-changing sentence hiding in a digital haystack and buried below an old garden shed, but no one ever found it again.