What I learned from the creators of The Pile, one of the world's largest AI training datasets
Hint: If you think you understand all the nuances and complexities, you don't
I’m not sure why I find the issues and arguments around the data that trains AI models to be so fascinating. After all, some might say it’s a really messy, geeky, rather unsexy topic among all there is to cover about today’s AI boom.
But I think it is that very messiness — the uneven, many-sided, difficult-to-understand complexities — that is so interesting. Just when I think I understand all the issues and arguments, I realize I don’t. Every time I’m sure I have a strong opinion weighted towards one side or another, I realize I’m not sure. That’s true about so many issues in our modern world, of course, but the fact that data is both hailed as the “backbone” of today’s AI models and disparaged as “garbage” or “scraped” or “vacuumed” is, well, compelling.
That’s why speaking to two people from EleutherAI, the organization behind the Pile, one of the world’s largest AI training datasets — and the first public massive language model training dataset — turned into one of my favorite recent interviews. It lasted over an hour and could have gone on for hours more. The result was an article I published on VentureBeat last week: “One of the world’s largest AI training datasets is about to get bigger and ‘substantially better.’”
Here, I thought I’d go deeper into the context of my reporting on that story:
An updated version of the Pile is on the way
There has been a lot of dunking on EleutherAI over the past year. A grassroots nonprofit research group that began in 2020 as a loose-knit Discord collective seeking to understand how OpenAI’s then-new GPT-3 worked, EleutherAI was named in one of the many generative AI-focused lawsuits last year. Former Arkansas Governor Mike Huckabee and other authors filed a lawsuit in October alleging that their books were taken without consent and included in Books3, a controversial dataset of more than 180,000 works that was included as part of the Pile project.
But far from stopping their dataset work, EleutherAI is now building an updated version of the Pile dataset, in collaboration with multiple organizations including the University of Toronto and the Allen Institute for AI, as well as independent researchers. Stella Biderman, executive director at EleutherAI, and Aviya Skowron, EleutherAI’s head of policy and ethics, told me the updated Pile dataset is a few months away from being finalized.
The Pile v2 will include more recent data than the original dataset, which was released in December 2020, as well as higher-quality and more diverse data. “We’re going to have many more books than the original Pile had, for example, and more diverse representation of non-academic non-fiction domains,” Biderman said.
The ‘yuck’ factor of AI training data
Speaking of books, not surprisingly, the original Pile prompted criticism from creative workers about its use of Books3, but also suspicion about the purpose of a dataset of such massive scale. For example, for an August 2023 article in the Atlantic, author Alex Reisner downloaded the Pile and identified the books contained in the Books3 sub-dataset.
At the end of the Atlantic article, Reisner wrote:
“Control is more essential than ever, now that intellectual property is digital and flows from person to person as bytes through airwaves. A culture of piracy has existed since the early days of the internet, and in a sense, AI developers are doing something that’s come to seem natural. It is uncomfortably apt that today’s flagship technology is powered by mass theft.
Yet the culture of piracy has, until now, facilitated mostly personal use by individual people. The exploitation of pirated books for profit, with the goal of replacing the writers whose work was taken—this is a different and disturbing trend.”
This is certainly part of the ‘yuck’ factor regarding AI training data I wrote about in a VentureBeat column way back in April 2023 (which seems like centuries ago in AI time). Personally, I’ve developed a thick skin when it comes to icky data-digging. I started writing about data analytics over 10 years ago for a magazine covering the direct marketing industry — a business that for decades had relied on mailing list brokers that sold or rented access to valuable datasets.
But the idea of creative output being sucked into the vacuum of AI datasets is definitely a ‘yuck’ — I think it’s hard to wrap our minds around the scale of data that we’re talking about here and the lack of clarity around how, exactly, the data is being used.
The Pile is open source, so anyone can see inside it
But what is interesting is that the Pile — with a name that itself sounds yucky…are we all just made up of big piles of data? — is, perhaps, one of the least “yucky” AI training datasets, at least in one important way: Documentation.
In the Atlantic piece, Reisner said one of the most troubling issues around generative AI is that it is “being made in secret. To produce humanlike answers to questions, systems such as ChatGPT process huge quantities of written material. But few people outside of companies such as Meta and OpenAI know the full extent of the texts these programs have been trained on.”
That may be true, but the Pile is open source — anyone can download it and use it. It’s not easy to wade through, of course, and researchers continue to work to analyze it, but Biderman told me the Pile remains the best-documented LLM training dataset in the world, at least among datasets documented by their own creators.
“The Pile, which was produced by 12 people working entirely in their free time with no money, is far better documented than any LM training dataset ever produced by major tech companies,” she said.
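That openness is easy to check firsthand, because anyone can pull the dataset down and poke around in it. Here’s a minimal sketch of what that looks like in Python, assuming a mirror of the Pile on the Hugging Face Hub under its historical “EleutherAI/pile” identifier (hosting has shifted over the years, so treat the exact location and the field names as assumptions rather than guarantees):

```python
# A minimal sketch of inspecting the Pile via Hugging Face's datasets
# library. Streaming avoids downloading the full ~800GB corpus up front.
from datasets import load_dataset

# "EleutherAI/pile" is the historical Hub identifier; an assumption,
# since hosting of the dataset has changed over time.
pile = load_dataset("EleutherAI/pile", split="train", streaming=True)

# Each record pairs the raw text with metadata naming the sub-dataset
# it came from, which is the kind of documentation Biderman describes.
for example in pile.take(3):
    print(example["meta"], example["text"][:100])
```

Note the metadata field: every document in the Pile declares which of its 22 sub-datasets it came from, exactly the kind of provenance information that closed datasets don’t expose.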
In fact, the whole reason for developing the Pile was that OpenAI’s GPT-3 was a closed model: no one really knew what text had been used to train it. So the objective was to build a massive new dataset of billions of text passages, at a scale matching what OpenAI used to train GPT-3.
That doesn’t change the fact that there may be problems within the Pile. But one big reason people criticize the Pile is because they can — because they can access it and discover what is in it. That isn’t true of many other closed datasets with limited or zero access.
Dataset visibility helps with research into policy and ethics
Something else to ponder is that the very openness of the Pile is what allows researchers to study it, and to examine how that kind of massive dataset impacts the output of large language models — including legal issues related to copyright.
For example, top legal minds are tackling issues related to copyrighted works being used to train large language models like GPT-4. And many disagree with EleutherAI’s general position, which Skowron told me is that using copyrighted data for model training is “fair use.”
But at the same time, Skowron pointed out that “there needs to be much more visibility into that in order to achieve many policy objectives and ethical ideals that people want.” For example, issues around LLM memorization — which Alex Reisner also tackled in a recent piece in the Atlantic — are of great interest to copyright holders, who say the issue is not just the AI training data itself but whether the LLM memorizes that copyrighted data and reproduces it in its output.
That, Skowron explained, requires visibility into training datasets and thorough documentation of the training, at the very minimum. “For many, many questions, you need actual access to the datasets to conduct many research questions…including memorization.”
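To make “memorization” concrete: the basic test is to prompt a model with a prefix from a document known to be in its training data, then check whether it reproduces the original continuation verbatim. That test is only possible when the training set is public. Here’s a toy sketch using a small GPT-Neo model, which EleutherAI trained on the Pile; the specific passage below is a hypothetical stand-in, not a verified example:

```python
# Toy memorization probe: feed a Pile-trained model a prefix and see
# whether greedy decoding reproduces the known continuation verbatim.
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "EleutherAI/gpt-neo-125M"  # small model trained on the Pile
tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL)

# Hypothetical stand-in: real research would pull both strings from a
# document verified to appear in the Pile.
prefix = "Call me Ishmael. Some years ago"
known_continuation = "--never mind how long precisely--"

inputs = tokenizer(prefix, return_tensors="pt")
output = model.generate(
    **inputs,
    max_new_tokens=20,
    do_sample=False,  # greedy decoding: the model's most likely continuation
)
new_tokens = output[0][inputs["input_ids"].shape[1]:]
generated = tokenizer.decode(new_tokens, skip_special_tokens=True)

# Verbatim reproduction is evidence the passage was memorized.
print("memorized" if generated.strip().startswith(known_continuation) else "not memorized")
```

Researchers run variants of this probe at scale, which is why access to the underlying dataset, not just the model, matters so much.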
A ‘chilling effect’ on the AI dataset debate?
The bottom line is that there is no bottom line at the moment when it comes to the complex issues of AI training data.
And according to Skowron, in some ways a robust debate has come to a halt as copyright and IP lawyers rev their engines for the many lawsuits already being argued — including the New York Times’ new lawsuit against OpenAI and Microsoft for copyright infringement, which many say could end up before the Supreme Court.
“There's definitely been a chilling effect on the entire dataset debate as a result of lawsuits, at least from my perspective,” they said. “The people who could speak on the dataset issues are no longer allowed to speak on them because it just stresses out the lawyers too much.”
Discussing the ins and outs of the Pile and other massive AI training datasets easily turns into another big — you guessed it — PILE, so I’ll pause here for now. But I’d certainly love to continue the debate! Let me know your thoughts… 🤔