After mixed hints from Sam Altman, OpenAI has officially confirmed that work has begun on a successor to the GPT-4 language model. Alongside the announcement, a dedicated Safety and Security Committee was formed to oversee the new development.


“OpenAI has recently started training its next frontier model, and we expect the resulting systems to take us to the next level of capability on the way to AGI [artificial general intelligence]. We pride ourselves on creating and launching models that are industry-leading in both capability and safety, but we welcome active debate at this important moment,” the company said in a blog post.

We’ve trained a model called ChatGPT which interacts in a conversational way. The dialogue format makes it possible for ChatGPT to answer follow-up questions, admit its mistakes, challenge incorrect premises, and reject inappropriate requests.

ChatGPT is a sibling model to InstructGPT, which is trained to follow an instruction in a prompt and provide a detailed response.

Methods

We trained this model using Reinforcement Learning from Human Feedback (RLHF), using the same methods as InstructGPT, but with slight differences in the data collection setup. We trained an initial model using supervised fine-tuning: human AI trainers provided conversations in which they played both sides—the user and an AI assistant. We gave the trainers access to model-written suggestions to help them compose their responses. We mixed this new dialogue dataset with the InstructGPT dataset, which we transformed into a dialogue format.
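
As a rough sketch of that supervised fine-tuning step (not OpenAI’s actual training code; the tiny model, random token ids, and mask below are hypothetical stand-ins), the objective is ordinary next-token cross-entropy, masked so that only the assistant’s side of each dialogue contributes to the loss:

```python
import torch
import torch.nn.functional as F

VOCAB_SIZE = 100  # toy vocabulary size

class TinyLM(torch.nn.Module):
    """Stand-in for a pretrained language model."""
    def __init__(self):
        super().__init__()
        self.embed = torch.nn.Embedding(VOCAB_SIZE, 32)
        self.head = torch.nn.Linear(32, VOCAB_SIZE)

    def forward(self, tokens):            # (batch, seq) -> (batch, seq, vocab)
        return self.head(self.embed(tokens))

model = TinyLM()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

# One trainer-written dialogue, already tokenized (random ids here).
# The mask marks which positions belong to the assistant turn; only those
# are trained on, so the model learns to produce the assistant side
# conditioned on the user side.
tokens = torch.randint(0, VOCAB_SIZE, (1, 12))
assistant_mask = torch.tensor([[0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1]])

logits = model(tokens[:, :-1])            # predict each next token
targets = tokens[:, 1:]
per_token = F.cross_entropy(
    logits.reshape(-1, VOCAB_SIZE), targets.reshape(-1), reduction="none"
)
mask = assistant_mask[:, 1:].reshape(-1).float()
loss = (per_token * mask).sum() / mask.sum()  # average over assistant tokens
loss.backward()
opt.step()
```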

To create a reward model for reinforcement learning, we needed to collect comparison data, which consisted of two or more model responses ranked by quality. To collect this data, we took conversations that AI trainers had with the chatbot. We randomly selected a model-written message, sampled several alternative completions, and had AI trainers rank them. Using these reward models, we can fine-tune the model using Proximal Policy Optimization. We performed several iterations of this process.
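
The post does not include training code, but the standard recipe (described in the InstructGPT paper) turns those rankings into pairwise training examples: the reward model should score the higher-ranked completion above the lower-ranked one. A minimal sketch with a toy scoring network (the model and token ids are hypothetical stand-ins):

```python
import torch
import torch.nn.functional as F

class TinyRewardModel(torch.nn.Module):
    """Stand-in reward model: maps a response to a scalar quality score."""
    def __init__(self, vocab_size=100, dim=32):
        super().__init__()
        self.embed = torch.nn.Embedding(vocab_size, dim)
        self.score = torch.nn.Linear(dim, 1)

    def forward(self, tokens):            # (batch, seq) -> (batch,)
        return self.score(self.embed(tokens).mean(dim=1)).squeeze(-1)

rm = TinyRewardModel()
opt = torch.optim.Adam(rm.parameters(), lr=1e-3)

# Hypothetical tokenized pair from one comparison: the completion trainers
# ranked higher ("chosen") and one they ranked lower ("rejected").
chosen = torch.randint(0, 100, (1, 16))
rejected = torch.randint(0, 100, (1, 16))

# Pairwise ranking loss: push the chosen score above the rejected score.
loss = -F.logsigmoid(rm(chosen) - rm(rejected)).mean()
loss.backward()
opt.step()
```

The scalar score such a model produces is then used as the reward that Proximal Policy Optimization maximizes when fine-tuning the chatbot, typically with a penalty that keeps the fine-tuned policy from drifting too far from the supervised model.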

We are excited to introduce ChatGPT to get users’ feedback and learn about its strengths and weaknesses. During the research preview, usage of ChatGPT is free. Try it now at chatgpt.com.

Along with the announcement, OpenAI said its board had formed a Safety and Security Committee led by directors Bret Taylor, Adam D’Angelo, Nicole Seligman, and Sam Altman (CEO of OpenAI). According to the company’s website, this committee will be responsible for making recommendations to the full Board of Directors on critical safety and security decisions for OpenAI projects and operations.

On Thursday, December 19, Google unveiled its new AI reasoning model, Gemini 2.0 Flash Thinking, which could rival OpenAI’s much-discussed o1 model. While still in the experimental phase, Gemini 2.0 Flash Thinking is currently accessible to users through the search giant’s AI Studio.

Similar to o1, Google’s latest AI model uses inference-time reasoning techniques to think more deeply about complex problems. In practice, the model pauses to work through intermediate reasoning steps before returning the answer it judges most accurate.
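
Neither Google nor OpenAI has published exactly how these models reason internally, but a simple published technique, self-consistency, illustrates the general idea of trading inference-time compute for accuracy: sample several independent reasoning traces and keep the answer they converge on. In this sketch, generate_trace is a hypothetical stand-in for a temperature-sampled model call:

```python
import random
from collections import Counter

def generate_trace(prompt: str) -> tuple[str, str]:
    """Hypothetical stand-in for one sampled chain-of-thought completion.

    A real implementation would call an LLM at nonzero temperature so
    each sample can follow a different reasoning path.
    """
    answer = random.choice(["42", "42", "42", "41"])  # noisy but mostly right
    return f"step-by-step reasoning ending in {answer}", answer

def answer_with_self_consistency(prompt: str, n_samples: int = 5) -> str:
    # More samples means more inference-time compute and, often, a more
    # reliable majority answer: the "thinking longer" trade-off.
    answers = [generate_trace(prompt)[1] for _ in range(n_samples)]
    return Counter(answers).most_common(1)[0][0]

print(answer_with_self_consistency("What is 6 * 7?"))
```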

“Built on 2.0 Flash’s speed and performance, this model is trained to use thoughts to strengthen its reasoning. And we see promising results when we increase inference time computation!” Jeff Dean, chief scientist at Google DeepMind, wrote in a post on X.

What are AI reasoning models?

Earlier this year, OpenAI released a new AI model called o1 that uses techniques such as reinforcement learning and chain-of-thought reasoning to carry out a step-by-step analysis of a problem before solving it. The launch of o1 was preceded by several months of hype around a secret project the Sam Altman-led startup was working on, code-named Project Strawberry.

The launch kicked off a race among other tech companies, each of which scrambled to roll out reasoning models of their own that take extra seconds or minutes after a user enters a prompt before providing a response.

DeepSeek, a China-based AI research company, launched R1, a model that reasons through tasks before arriving at an answer. Meanwhile, Alibaba’s Qwen team released its own “reasoning” model, QwQ, earlier this month.