A recent study by the AI Disclosures Project has cast doubt on the data used by OpenAI in training its large language models (LLMs). Researchers have found that the GPT-4o model shows considerable “recognition” of copyrighted and paywalled information from O’Reilly Media books.
The project, led by technologist Tim O’Reilly and economist Ilan Strauss, aims to highlight the societal risks of AI commercialization and emphasizes the need for greater transparency in corporate data practices. The study used a legally obtained dataset of 34 copyrighted O’Reilly Media books to assess whether OpenAI’s models were trained on this content without authorization.
The researchers used the DE-COP membership inference attack, which tests whether a model can reliably distinguish human-authored works from their LLM-generated counterparts. The key finding is that GPT-4o achieved an AUROC score of 82% in recognizing non-public O’Reilly book content, while earlier models such as GPT-3.5 Turbo showed significantly lower recognition.
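For readers unfamiliar with the metric, the AUROC figure reported above can be read as the probability that a randomly chosen book from the training data receives a higher recognition score than a randomly chosen book the model never saw. A value of 50% is chance and 100% is perfect separation. The sketch below is only an illustration of that definition, not the study's code. The function name, the way the two groups are formed, and every score in it are made-up assumptions.

```python
def auroc(member_scores, non_member_scores):
    """Fraction of (member, non-member) pairs in which the member scores higher.

    Ties count as half a win. This pairwise form is equivalent to the area
    under the ROC curve.
    """
    wins = 0.0
    for m in member_scores:
        for n in non_member_scores:
            if m > n:
                wins += 1.0
            elif m == n:
                wins += 0.5
    return wins / (len(member_scores) * len(non_member_scores))


# Hypothetical per-book recognition scores (invented for illustration only).
suspected_in_training = [0.9, 0.8, 0.7, 0.4]
assumed_unseen = [0.6, 0.3, 0.2, 0.1]

print(f"AUROC = {auroc(suspected_in_training, assumed_unseen):.4f}")
```

With these invented numbers, 15 of the 16 pairs rank the suspected training book higher, so the sketch yields 0.9375.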
Notably, GPT-4o differentiated between paywalled and publicly accessible content better than its predecessors. In contrast, GPT-4o Mini demonstrated no recognition at all for O’Reilly material.
The findings point to potential access violations involving the LibGen database, where the tested O’Reilly books were found. The researchers acknowledged that newer LLMs may be better at telling human-written text from machine-generated text, but they maintain that this does not necessarily undermine the effectiveness of their classification method.
The study raises concerns about the systemic use of copyrighted materials without compensation, which could ultimately harm content quality and diversity on the internet. The AI Disclosures Project advocates for increased accountability in how AI companies manage their model training processes.
The project proposes that liability provisions could encourage transparency and help establish markets for licensing training data and remunerating content creators. Ultimately, the report concludes that OpenAI’s training of GPT-4o likely involved the unauthorized use of copyrighted content, and it stresses the urgent need for reform of data usage regulations in AI development.