OpenAI Launches Program to Build New AI Training Datasets with Organizations

Tamim Rupo
November 10, 2023
12:00 pm

OpenAI is introducing a new partnership program, OpenAI Data Partnerships, to gather datasets from outside organizations for training its AI models. The initiative seeks large-scale private and public datasets that are not already easily accessible online. The data is not limited to numbers or text; the program also accepts images, audio, and video.

OpenAI says it is interested in data on “any topic” and in “any language” that conveys “human intention,” such as long-form essays or transcribed conversations. Such human-centric data is expected to help refine tools like automatic speech recognition, which transcribes spoken words. This aligns with OpenAI’s recent expansion of ChatGPT to support voice queries, which allows more interactive and conversational exchanges with users. Giving AI models a broader range of material from which to learn human-like conversation is likely to improve not only this feature but also other tools and future functionality.

Announcing OpenAI Data Partnerships — help steer the future of AI by collaborating on public and private datasets with us. https://t.co/4tbi5SZ6sS

— OpenAI (@OpenAI) November 9, 2023

The data gathered and tested through the partnership program is expected to improve the capabilities of OpenAI’s consumer-facing GPT-4 Turbo, which has been updated to deliver more detailed and meaningful responses to users. OpenAI has already begun working with interested organizations, including the Icelandic government, to improve GPT-4’s understanding of Icelandic-language queries through curated datasets.

Organizations that want to participate can have a representative complete a form on OpenAI’s website describing the type and size of the data they intend to share. Datasets follow one of two pathways. The Open-Source Archive is for datasets relevant to training language models; submissions are made public for anyone to use. The Private Datasets pathway lets companies contribute data for training proprietary AI models, including foundation models and fine-tuned or custom models. The private pathway is recommended for organizations that want to keep their data confidential, although OpenAI explicitly states that it is not seeking datasets containing sensitive or personal information.

ChatGPT already has approximately 100 million weekly active users globally, which underscores the continuing importance of privacy for the tool. Past incidents, such as Samsung employees inadvertently leaking sensitive data to the AI model, show how easily confidential information can be exposed. OpenAI says it does not use data from its API to train models unless users explicitly provide it through an opt-in form, but scrutiny remains on how the company will manage the data collected through this initiative, particularly the private datasets.