Grok 4 is impressive. Elon says that Grok 4 is performing at or beyond PhD level in nearly all academic fields. Last night’s demo showcased some incredible scores in AI benchmarks. The most impressive of these was Grok 4’s score on “Humanity’s Last Exam”.
Humanity’s Last Exam is a challenging AI benchmark created by the Center for AI Safety and Scale AI. It features 2,500 multi-modal questions across diverse subjects at the frontier of human knowledge. These questions are insanely hard. Grok 4 outperformed all of the other language models, including OpenAI’s o3 and Google’s Gemini 2.5, by a significant margin. Grok 4 Heavy, a model that uses multiple agents to brainstorm together on problems and can also use tools like code execution and search, performed exceptionally well.
Testing Grok. Is it really PhD level?
This is hard to test unless you have really strong knowledge of a topic. As I’ve been studying Google Search for a long time now, I thought I’d test Grok 4 on some questions about how Google’s systems work.
Prompt: When was Google’s last core update? What did it target? (be brief)
Grok 4 thought for 38 seconds. It browsed authoritative sources on the web and correctly told me that the update started rolling out on June 30 and may take up to three weeks to finish.
I thought this was a good answer.
Prompt: How can a site recover if impacted by this update?
Traditionally, LLMs frustrate me with their answer to this question. They draw from long-standing SEO advice that was good many years ago but is no longer accurate. Most LLMs will tell me to focus on the technical aspects of a website and to work on getting more backlinks. However, for the last few years, Google’s core updates have introduced new techniques for using machine learning systems to predict which pages are likely to be the satisfying result for a searcher. What really matters when you've been impacted by a core update is having helpful, reliable and satisfying content.
Grok nailed this question.
These aren’t PhD level questions though. Let’s see if Grok can explain how Google search works.
Prompt: Explain the machine learning systems Google uses in search.
I was thrilled to see that Grok called on my site amongst others to answer this question.
I thought this was interesting because Grok’s privacy policy says that they use Brave’s search results.
Yet, a search on Brave for this same question does not surface my site.
Grok’s answer on how Google uses machine learning for search started off discussing features like AI Overviews and AI Mode. Then, it gave me the answer I was looking for. Here are some parts of it. This is quite an impressive answer.
There's much more that could be added to this answer, but I thought it was quite good.
Next I asked a question about Google’s recent breakthrough called MUVERA. Grok initially thought it was a typo, but after running some searches it nailed this answer as well.
Other interesting new features
I’m quite interested in Grok 4 Heavy, which uses multiple agents to reason across problems for you. The cost for this is $300/month or $3,000/year, which is on par with OpenAI’s and Google’s premium plans that offer multi-agentic browsing in tools like Operator and Project Mariner.
The demo for Grok voice was impressive. They compared it to ChatGPT’s voice, and it was quite obvious that Grok had lower latency and a more natural-sounding voice.
The team also said they are working on a coding model, multi-modal agents and a really good video generation model which should all be released in the next few months.
What struck me the most about this demo was Elon’s tone when discussing the future. Perhaps he is being hyperbolic here? He talks about whether AI is good for humanity, and ends on a terrifying note.
“The actual notion of a human economy, assuming civilization continues to progress, will seem very quaint in retrospect. It will seem like, sort of, “cave men throwing sticks into a fire” sort of level of economy compared to what the future will hold. It’s very exciting. I’ve been at times kind of worried about like well…it’s somewhat unnerving to have intelligence created that is far greater than our own. And will this be bad or good for humanity? I think it will be good, most likely it will be good…yeah…but I’ve somewhat reconciled myself to the fact that even if it wasn’t going to be good, I’d at least like to be alive to see it happen.”
Is Grok 4 AGI? Most likely no, but it’s damn impressive. Is it better than o3 or Gemini 2.5? Perhaps, although I will likely stick with the models from OpenAI and Google for now. However, given that xAI has a tremendous amount of computing power, I think Grok could very quickly emerge as a dramatically more helpful model than any other.
I’ll keep testing Grok 4. I’ll report more in my community and newsletter.