The Web site Stack Overflow was created in 2008 as a place for programmers to answer one another’s questions. At the time, the Web was thin on high-quality technical information; if you got stuck while coding and needed a hand, your best bet was old, scattered forum threads that often led nowhere. Jeff Atwood and Joel Spolsky, a pair of prominent software developers, sought to solve this problem by turning programming Q. & A. into a kind of multiplayer game. On Stack Overflow—the name refers to a common way that programs crash—people could earn points for posting popular questions and leaving helpful answers. Points earned badges and special privileges; users would be motivated by a mix of altruism and glory.
Within three years of its founding, Stack Overflow had become indispensable to working programmers, who consulted it daily. Pages from Stack Overflow dominated programming search results; the site drew more than sixteen million unique visitors a month, at a time when there were an estimated nine million programmers worldwide. Almost ninety per cent of those visitors arrived through Google. The same story was playing out across the Web: this was the era of “Web 2.0,” and sites that could extract knowledge from people’s heads and organize it for others were thriving. Yelp, Reddit, Flickr, Goodreads, Tumblr, and Stack Overflow all launched within a few years of one another, during a period when Google was experiencing its own extraordinary growth. Web 2.0 and Google fuelled each other: by indexing these crowdsourced knowledge projects, Google could get its arms around vast, dense repositories of high-quality information for free, and those same sites could acquire users and contributors through Google. The search company’s rapacious pursuit of other people’s data was excused by the fact that it drove users toward the content it harvested. In those days, Google even measured its success partly by how quickly users left its search pages: a short stay meant that a user had found what they were looking for.
All this started to change almost as soon as it had begun. Around that time, Google launched the OneBox, a feature that provided searchers with instant answers above search results. (Search for movie times, and you’d get them in the OneBox, above a list of links to movie theatres.) The feature siphoned traffic from the very sites that made it possible. Yelp was an instructive case: Google wanted to compete in the “local” market but didn’t have its own repository of restaurant and small-business reviews. Luther Lowe, Yelp’s former head of public policy, told me recently that Google tried everything it could to claw its way in, from licensing Yelp’s data (Yelp declined) to encouraging its own users to write reviews (no one wanted to contribute at the time) or even buying Yelp outright (it declined again). “Once those strategies failed—license, compete on the merits, purchase the content—what did they have left?” Lowe said. “They had to steal it.” In 2010 and 2011, Lowe says, Yelp caught Google scraping their content with no attribution. The data gave Google just enough momentum to bootstrap its own reviews product. When Yelp publicly accused Google of stealing its data, the company stopped, but the damage had already been done. (A similar thing happened at a company I once worked for, called Genius. We sued Google for copying lyrics from our database into the OneBox; I helped prove that it was happening by embedding a hidden message into the lyrics, using a pattern of apostrophes that, in Morse code, spelled “RED HANDED.” Google won in appellate court, in the Second Circuit. Genius petitioned the Supreme Court to hear the case, but the court declined.)
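To make the apostrophe trick concrete, here is a minimal sketch in Python of how a watermark of that kind could work. The article says only that a pattern of apostrophes spelled “RED HANDED” in Morse code; the choice of a straight apostrophe for a dot and a curly one for a dash, the sample lyric, and the function names `embed` and `extract` are all assumptions made for illustration, not a description of the actual method.

```python
# Assumption: straight apostrophe = dot, curly apostrophe = dash.
# The article does not say which mark stood for which symbol.
STRAIGHT = "'"
CURLY = "\u2019"

# International Morse code for the letters in "RED HANDED".
MORSE = {"R": ".-.", "E": ".", "D": "-..", "H": "....", "A": ".-", "N": "-."}


def message_to_marks(message):
    """Turn a message into a flat sequence of apostrophe characters."""
    symbols = "".join(MORSE[ch] for ch in message if ch != " ")
    return [STRAIGHT if s == "." else CURLY for s in symbols]


def embed(text, message):
    """Rewrite the apostrophes in text, in order, to carry the message."""
    marks = iter(message_to_marks(message))
    out = []
    for ch in text:
        if ch in (STRAIGHT, CURLY):
            # Once the message is used up, later apostrophes stay as they were.
            out.append(next(marks, ch))
        else:
            out.append(ch)
    return "".join(out)


def extract(text):
    """Read the dot-and-dash pattern back out of a text's apostrophes."""
    return "".join(
        "." if ch == STRAIGHT else "-"
        for ch in text
        if ch in (STRAIGHT, CURLY)
    )


# Hypothetical lyric with exactly 22 apostrophes, one per Morse symbol.
lyrics = "don't stop " * 22
marked = embed(lyrics, "RED HANDED")
print(extract(marked))
```

A reader sees the same words either way, since only the shape of each apostrophe changes. If the same sequence of marks turns up in someone else's copy, that is hard to explain as coincidence.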
In 2012, Google doubled down on the OneBox with a redesign that deëmphasized the classic blue links to external Web sites in favor of Google’s own properties, like Shopping and Maps, and immediate answers culled from sites like Wikipedia. This made Google even more convenient and powerful, but also had the effect of starving the Web of users: instead of a search leading you to a Wikipedia page, say, where you might join the small percentage of visitors who end up contributing, you’d get your answer straight from Google. According to Lowe, on pages of search results featuring the new design, as many as eighty per cent of searchers would leave without ever clicking on a link. Many Web 2.0 darlings, dense with user-generated content, saw visitor numbers decline. It was around this time that, in some sense, the quality of the Web as a whole began to decline, with the notable exception of the few crowdsourced knowledge projects that managed to survive. There’s a reason that appending “reddit” or “wiki” to search terms has become an indispensable productivity hack: in a hollowed-out Web overrun with spammers and content farms, these have become some of the last places where real, knowledgeable humans hang out.
Today, large language models, like OpenAI’s ChatGPT and Google’s Bard, are completing a process begun by the OneBox: their goal is to ingest the Web so comprehensively that it might as well not exist. The question is whether this approach is sustainable. L.L.M.s depend for their intelligence on vast repositories of human writing, the artifacts of our intelligence. They especially depend on information-dense sources. When OpenAI created ChatGPT, Wikipedia was its most important data set, followed by Reddit; about twenty-two per cent of GPT-3’s training data consisted of Web pages linked to and upvoted by Reddit users. ChatGPT is such a good programmer that the savvy developers I know aren’t using Stack Overflow anymore, and yet it’s partly by studying Stack Overflow that ChatGPT became such a good programmer. Recently, a group of researchers estimated that the number of new posts on Stack Overflow has decreased by sixteen per cent since the launch of ChatGPT.
I’m not a Stack Overflow power user, but I am a coder, and I’ve relied on the site for more than a decade. I’ve submitted projects to GitHub (a site for open-source code), posted on Reddit, and edited Wikipedia pages. Meanwhile, I’ve published blog posts and code to my Web site for years. Like everyone else, I didn’t suspect that I was producing GPT fodder; if I’d known, I might have asked for something in return, or even withheld my contributions. In April, the C.E.O. of Reddit announced that, from then on, any company that required large-scale data from its site would have to pay for the privilege. (Because the move threatened other, non-A.I.-related apps, Reddit users responded by “blacking out” huge swaths of the site, emphasizing that the company’s fortunes depended on uncompensated community contributions.) Stack Overflow has made a similar announcement.
Maybe the crowdsourcing sites will manage to wall off their content. But it may not matter. High-quality data is not necessarily a renewable resource, especially if you treat it like a vast virgin oil field, yours for the taking. The sites that have fuelled chatbots function like knowledge economies, using various kinds of currency—points, bounties, badges, bonuses—to broker information to where it is most needed, and chatbots are already thinning out the demand side of these marketplaces, starving the human engines that created the knowledge in the first place. This is a problem for us, of course: we all benefit from a human-powered Web. But it’s also a problem for A.I. It’s possible that A.I.s can only hoover up the whole Web once. If they are to continue getting smarter, they will need new reservoirs of knowledge. Where will it come from?
A.I. companies have already turned their attention to one possible source: chat. Anyone who uses a chatbot like Bard or ChatGPT is participating in a massive training exercise. In fact, one reason that these bots are provided for free may be that a user’s data is more valuable than her money: everything you type into a chatbot’s text box is grist for its model. Moreover, we aren’t just typing but pasting—e-mails, documents, code, manuals, contracts, and so on. We’re often asking the bots to summarize this material and then asking pointed questions about it, conducting a kind of close-reading seminar. Currently, there’s a limit to how much you can paste into a bot’s input box, but the amount of new data we can feed them at a gulp will only grow.
It won’t be long before many of us also start bulk-importing our most private documents into these models. A chatbot hasn’t yet asked me to grant it access to my e-mail archives—or to my texts, calendar, notes, and files. But, in exchange for a capable A.I. personal assistant, I could be tempted to compromise my privacy. A personal-assistant bot might nudge me to install a browser extension that tracks where I go on the Web so that it can learn from my detailed searching and browsing patterns. And ChatGPT and its ilk will soon become “multimodal,” able to fluidly blend and produce text, images, videos, and sound. Most language is actually spoken rather than written, and so bots will offer to help us by transcribing our meetings and phone calls, or even our everyday interactions.
Before models like GPT-3.5 and GPT-4 made their way into the user-facing ChatGPT product, they were tuned with what OpenAI calls “reinforcement learning from human feedback,” or R.L.H.F. Essentially, OpenAI paid human testers to have conversations with the raw model and rate the quality of its replies; the model learned from these ratings, aligning its responses ever more finely with our intentions. It’s because of R.L.H.F. that ChatGPT is so eerily good at understanding exactly what you’re asking and what a good answer should look like. This process was likely expensive. But now R.L.H.F. can be had for free, and at a much bigger scale, through conversations with real-world users. This is true even if you don’t click one of the thumbs-up, thumbs-down, or “This was helpful”-style buttons at the bottom of a chat transcript. GPT-4 is so good at interpreting writing that it can examine a chat transcript and decide for itself whether it did a good job serving you. One model’s conversations can even bootstrap another’s: it’s been claimed that rivals to ChatGPT, such as Google Bard, finished their training by consuming ChatGPT transcripts that had been posted online. (Google has denied this.)
The use of chatbots to evaluate and train other chatbots points the way toward the eventual goal of removing humans from the loop entirely. Perhaps the most fundamental limitation of today’s large language models is that they depend on knowledge that’s been generated by people. A sea change will come when the bots can generate knowledge for themselves. One possible path involves what’s known as synthetic data. For a long time now, A.I. researchers have padded their data sets as a matter of course: a neural network trained on images, for instance, might undergo a preprocessing step in which each image is rotated ninety degrees, or shrunk, or mirrored, creating for each picture eight or sixteen variants. But the doctoring can be much more involved than that. In autonomous-vehicle research, capturing real-world driving data is incredibly expensive, because you have to outfit an actual car with sensors and drive it around; it’s much cheaper to build a simulated car and run it through a virtual environment with simulated roads and weather conditions. It’s now typical to train state-of-the-art self-driving A.I.s by driving them for millions of miles on the road and billions in simulation.
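The image-padding step described above is simple enough to sketch in a few lines of Python. This is an illustration under stated assumptions rather than any lab's actual pipeline: an image is represented as a plain grid of numbers, the function names `rotate90`, `mirror`, and `augment` are hypothetical, and shrinking is left out. Four quarter-turn rotations, each kept with and without mirroring, give eight variants of a single picture.

```python
def rotate90(image):
    """Rotate a 2-D grid (a list of rows) ninety degrees clockwise."""
    return [list(row) for row in zip(*image[::-1])]


def mirror(image):
    """Flip a 2-D grid left to right."""
    return [row[::-1] for row in image]


def augment(image):
    """Return the eight rotation-and-mirror variants of one image."""
    variants = []
    current = [list(row) for row in image]
    for _ in range(4):
        variants.append(current)          # this rotation as-is
        variants.append(mirror(current))  # and its mirror image
        current = rotate90(current)
    return variants


# A tiny stand-in for a picture; real images would be much larger grids.
picture = [[1, 2],
           [3, 4]]
for variant in augment(picture):
    print(variant)
```

Each variant shows the same subject from a different orientation, so a network trained on all eight sees more examples without anyone collecting a new photograph. Adding a shrunken copy of each would double the count again.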







