If you missed it, earlier this week I published a piece on everything we learned about Google’s search algorithms from the DOJ vs Google trial. Boy, there is a lot!
In this article, we’ll focus on AI.
-Google's AI trains on the Google Common Corpus (GCC).
-A Gemini version called MAGIT is fine-tuned for AI Overviews.
-OpenAI developed its own proprietary search index.
-A system called FastSearch grounds Gemini's AI answers.
-Publishers have little control over AI using their content.
-Google aims to build a "super assistant" for anything.
Here is the court document if you'd like to read it yourself.

1. Google Common Corpus is where most of the training data for Google’s AI comes from
You’ve likely heard of the Common Crawl, a repository of public web crawl data. While some of Google’s AI training data comes from the Common Crawl, most of it comes from another source called the Google Common Corpus.
The court documents tell us that “Google uses its Google Common Corpus ('GCC') to pre-train its Gemini GenAI models.” GCC is a dataset made up of large amounts of information scraped from the web and stored in a repository Google calls “Docjoins”. This repository does not contain all of the public information on the web, but rather only documents that were “visited at least once by Googlebot in the last few months.”
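To make the "visited at least once by Googlebot in the last few months" idea concrete, here is a minimal sketch of a freshness filter. It is only an illustration and not how Docjoins actually works: the `CrawledDoc` class, the `last_crawled` field, and the 90-day window are all my own assumptions, since the court document does not say how long "a few months" is.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class CrawledDoc:
    """Hypothetical record for a crawled page (not Google's real schema)."""
    url: str
    last_crawled: datetime


def recently_crawled(docs, now, window_days=90):
    """Keep only documents crawled within the window.

    window_days=90 is an assumed stand-in for "the last few months".
    """
    cutoff = now - timedelta(days=window_days)
    return [doc for doc in docs if doc.last_crawled >= cutoff]


# Toy data: one page crawled recently, one not crawled for over a year.
now = datetime(2025, 9, 1)
docs = [
    CrawledDoc("https://example.com/fresh", datetime(2025, 8, 15)),
    CrawledDoc("https://example.com/stale", datetime(2024, 1, 10)),
]

eligible = recently_crawled(docs, now)
print([doc.url for doc in eligible])
```

The takeaway for site owners is simply that a page Googlebot has not visited recently would fall outside a filter like this one.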

2. A version of Gemini called MAGIT is fine-tuned to produce responses for AI Overviews
There is a specific version of Gemini called MAGIT. It is fine-tuned specifically for producing the text in AI Overviews.
Gemini is trained on text, mostly from the web, and then it is further fine-tuned on specific collections of data so that it can handle specific tasks like solving math problems, answering questions or writing code.
Google does not use click and query data from users to train Gemini. It considered doing so, but did not find that the benefits of pre-training on search data were worth the cost.
I found it interesting that they said the MAGIT model was fine-tuned to “produce text responses in the desired format for AI Overviews.” There is no mention here of MAGIT being used to predict which links to put in AI Overviews. We do not know whether the links in AIOs are ranked by FastSearch (see below) or by the regular search ranking algorithms.

3. OpenAI developed their own search index
Did you know that OpenAI has their own search index? I did not!
In the trial documents we learned that OpenAI built their own search index because they had quality issues with third-party search providers.
Historically, ChatGPT has pulled from Bing’s index.
It may still do this, although recent tests suggest that ChatGPT is pulling information from Google Search. (It was later noted that OpenAI was previously listed as a customer of SERPAPI, a tool that scrapes Google Search.)

I could not find any definitive information on OpenAI’s search index.

4. Gemini is grounded by a proprietary Google technology called FastSearch
FastSearch is based on RankEmbed signals - a set of search ranking signals. It generates an abbreviated list of ranked websites that a language model can use to produce a result that is grounded in search. FastSearch is faster than a full web search, but its results are not as high quality.
For example, let's say you ask a question about a current news event in Gemini. Gemini should recognize that this event is not in its training data and ground (aka verify) its answer in Google Search. FastSearch would be used to generate a short list of websites that can be used to ground Gemini's response.
FastSearch is integrated into Vertex AI Vector Search, which can be used via API to ground LLM responses in Google Search results (or even in your own documents).
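Here is a rough sketch of the general pattern described above: rank documents by embedding similarity, keep only a short list, and hand that list to a language model as grounding. This is not Google's implementation. The internals of FastSearch and RankEmbed are not public, so the function names, the toy three-number "embeddings" and the prompt format are all assumptions made purely for illustration.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def shortlist(query_vec, index, k=2):
    """Return the k documents whose vectors are most similar to the query."""
    ranked = sorted(index, key=lambda doc: cosine(query_vec, doc["vec"]), reverse=True)
    return ranked[:k]


def build_grounded_prompt(question, docs):
    """Assemble a prompt that asks the model to answer from the shortlist only."""
    sources = "\n".join(
        f"[{i + 1}] {doc['url']}: {doc['snippet']}" for i, doc in enumerate(docs)
    )
    return f"Answer using only these sources:\n{sources}\n\nQuestion: {question}"


# Toy index with made-up vectors; a real system would use learned embeddings.
index = [
    {"url": "https://example.com/election-results", "snippet": "Official vote counts.", "vec": [0.9, 0.1, 0.0]},
    {"url": "https://example.com/recipe", "snippet": "How to bake bread.", "vec": [0.0, 0.2, 0.9]},
    {"url": "https://example.com/election-analysis", "snippet": "What the results mean.", "vec": [0.8, 0.3, 0.1]},
]

query_vec = [1.0, 0.2, 0.0]  # assumed embedding of the user's question
top = shortlist(query_vec, index, k=2)
prompt = build_grounded_prompt("Who won the election?", top)
print(prompt)
```

The trade-off shows up even in this toy version: a smaller `k` means less for the model to read and a faster answer, but also a greater chance that a useful source is left off the list.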

5. We have no say in how Google uses our content for AI
The court ruled that Google “will not have to modify its policies to offer website publishers more choice in how Google uses their content.”

As it stands now, you can disallow the Google-Extended user agent token in robots.txt to prevent Gemini’s AI models from being trained on your content. But this does not stop your site from being shown in AI Overviews or AI Mode. If you want to opt out of those features, you are essentially opting out of Search itself. It looks like that is not about to change.
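If you want to check what a Google-Extended rule does, here is a small sketch using Python's built-in robots.txt parser. The robots.txt content and the example.com URL are placeholders I made up. Keep in mind that this shows how Python's standard library reads the file, which may not match Google's own crawler behavior in every edge case.

```python
from urllib.robotparser import RobotFileParser

# Example robots.txt: block the Google-Extended token, allow everything else.
ROBOTS_TXT = """\
User-agent: Google-Extended
Disallow: /

User-agent: *
Allow: /
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

url = "https://example.com/some-article"  # placeholder URL

# Google-Extended is disallowed, so AI training use is opted out.
blocks_ai_training = not parser.can_fetch("Google-Extended", url)

# Googlebot falls through to the wildcard group and can still crawl for Search.
allows_search_crawl = parser.can_fetch("Googlebot", url)

print("Blocks Google-Extended:", blocks_ai_training)
print("Allows Googlebot:", allows_search_crawl)
```

Because Googlebot is still allowed here, the page stays eligible for Search, and with it for AI Overviews and AI Mode.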
It was interesting to see this initiative come out this week: the RSL Collective. It aims to create a system whereby AI companies can be directed to pay for using your content. It's a good idea, but I don't see any evidence that AI companies will be bound by these rules, nor any method for paying website owners for their content. With that said, I am paying close attention to Stripe's new Tempo system, a new blockchain that could potentially be the framework on which payments happen in an agent-to-agent web.
For more info on Agent2Agent communication, read my blog post on how agent-to-agent communication is likely to radically change the web.

6. We might one day have an AI assistant from Google that can do anything!
Look at this:
“Over the longer term, GenAI companies are striving to transform chatbots into a kind of '[s]uper [a]ssistant.' A super assistant would be able to help perform 'any task' requested by the user.”

I know that seems impossible. BUT…recently Google DeepMind CEO Demis Hassabis wrote that Google’s vision was to build a world model that will allow it to become a universal AI assistant. This would not just be for search on the internet but in the real world as well. The article speaks of using agents such as Google’s Genie that can simulate real world environments and train robots for real world tasks.
Perhaps one day we will tell our grandchildren of the world we lived in where Google was a tool you could type keywords into and get text-based answers on a screen.
Here's something interesting. Try searching for “search engine”. You will not see the Google.com homepage there. I believe being a text-based search engine is just a step on the road to Google’s ultimate goal of becoming our everyday super helpful assistant for whatever we need in life.
Marie