Original source: New Zhiyuan
To unpack the “black box” of large models, Anthropic’s interpretability team published a paper describing how to train a new model to understand a simpler one.
A study published by Anthropic claims, in effect, to let us peer into the soul of artificial intelligence.
According to the researchers, the new model can accurately predict and explain how the neurons in the original model work and how they are composed.
Anthropic’s interpretability team recently announced that they have successfully decomposed a simulated AI system into an abstract, high-dimensional feature space.
The researchers first trained a very simple 512-neuron AI to predict text, and then trained a second AI, called an “autoencoder,” to predict the activation patterns of the first.
The autoencoder is asked to construct a set of features (corresponding to the neurons of a higher-dimensional AI) and to predict how those features map onto the neurons of the real AI.
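The setup described above can be sketched as a sparse autoencoder: it encodes the original model’s 512-dimensional neuron activations into a larger set of feature activations, with an L1 penalty keeping the features sparse, then decodes them back to reconstruct the activations. Below is a minimal illustrative version; the sizes, hyperparameters, and random stand-in data are all made up, and this is not Anthropic’s actual training code:

```python
import numpy as np

# Minimal sparse-autoencoder sketch. Sizes, hyperparameters, and the
# random stand-in "activations" are illustrative, not Anthropic's setup.
rng = np.random.default_rng(0)
d_model, n_features = 512, 1024      # 512 real neurons, 1024 features
W_enc = rng.normal(0.0, 0.02, (d_model, n_features))
b_enc = np.zeros(n_features)
W_dec = rng.normal(0.0, 0.02, (n_features, d_model))
l1_coeff, lr = 1e-3, 3e-3

# Stand-in for a batch of the text model's neuron activations.
x = rng.normal(size=(64, d_model))

def step(x):
    """One gradient step on reconstruction error plus an L1 sparsity penalty."""
    global W_enc, b_enc, W_dec
    f = np.maximum(x @ W_enc + b_enc, 0.0)   # feature activations (ReLU)
    x_hat = f @ W_dec                        # reconstructed activations
    err = x_hat - x
    # Backprop by hand: dL/df, gated by the ReLU derivative.
    grad_f = (err @ W_dec.T + l1_coeff * np.sign(f)) * (f > 0)
    W_dec -= lr * f.T @ err / len(x)
    W_enc -= lr * x.T @ grad_f / len(x)
    b_enc -= lr * grad_f.mean(axis=0)
    return float((err ** 2).mean())

losses = [step(x) for _ in range(200)]
# Reconstruction error should fall as the feature dictionary is learned.
```

In the real experiment the activations come from running the 512-neuron text model over a large corpus, and the number of features is varied (512 up to 16,384, as discussed below).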
It was found that while the neurons in the original AI were hard to interpret, the simulated neurons in the new AI (the “features”) were monosemantic: each feature represented one specific concept or function.
For example, feature #2663 represents the concept of “God”.
The training phrase that activates it most strongly comes from a record by Josephus: “When the Blizzard rains down on God, he goes to Sepphoris”.
You can see that the activations at the top are all about the different uses of “God”.
This simulated neuron appears to be built out of a combination of real neurons, including neurons 407, 182, and 259.
These real neurons themselves have little to do with “God”; Neuron 407, for example, responds primarily to non-English text (especially accented Latin characters) and non-standard text (such as HTML tags).
But at the feature level everything falls into place: when feature #2663 activates, it raises the probability that “bless”, “forbid”, “damn”, or “-zilla” appears next in the text.
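One illustrative way to see which tokens a feature promotes (a hypothetical sketch, not necessarily the paper’s method) is to push the feature’s decoder direction through the model’s unembedding matrix and look at which logits it raises. The toy vocabulary, random unembedding, and constructed direction below are all made up:

```python
import numpy as np

# Hypothetical sketch: which output tokens does a feature promote?
# Adding a feature's decoder direction to the hidden state shifts the
# output logits by (direction @ W_U); the top entries are the tokens
# the feature makes more likely. Everything here is illustrative.
rng = np.random.default_rng(0)
d_model = 512
tokens = ["bless", "forbid", "damn", "-zilla", "the", "cat", "dog", "run"]

W_U = rng.normal(size=(d_model, len(tokens)))    # toy unembedding matrix
# Construct a direction that, by design, aligns with the first four
# tokens' unembedding columns, mimicking the "God" feature's effect.
direction = W_U[:, :4].sum(axis=1)
direction /= np.linalg.norm(direction)

logit_effect = direction @ W_U                   # per-token logit shift
top = [tokens[i] for i in np.argsort(logit_effect)[::-1][:4]]
print(top)  # the four God-related tokens rank highest by construction
```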
The AI doesn’t distinguish the religious concept of “God” from the “God” in a monster’s name like “Godzilla”. This may be because this makeshift AI doesn’t have enough neuronal resources to keep them apart.
But this changes as the AI’s number of features increases:
At the bottom of this tree, you can see how the AI’s understanding of “the” in mathematical contexts changes as it is given more and more features.
First of all, why is there a feature specifically for “the” in mathematical contexts? Probably because, to predict text well, the AI needs to know that this particular “the” should be followed by mathematical vocabulary such as “numerator” or “cosine”.
In the smallest AI the researchers trained, with only 512 features, a single feature represented “the”; in the largest, with 16,384 features, it had split into one feature for “the” in machine learning, one for “the” in complex analysis, and one for “the” in topology and abstract algebra.
Therefore, if the system were upgraded to an AI with more simulated neurons, the feature representing “God” would likely split in two: one for the religious meaning of “God” and one for the “God” in monster names.
Later there might be separate features for God in Christianity, God in Judaism, God in philosophy, and so on.
The research team evaluated the subjective interpretability of 412 groups of real neurons and the corresponding simulated neurons, and found that the interpretability of the simulated neurons was generally good:
Some features, such as the one that means “God”, are used for specific concepts.
Many other highly interpretable features, including some of the most interpretable ones, represent the “formatting” of text, such as uppercase versus lowercase letters, or the English alphabet versus other alphabets.
How universal are these features? That is, if you train two different 4096-feature AIs on the same text data, will they share most of the same 4096 features? Will both have a feature representing “God”?
Or will the first AI lump “God” and “Godzilla” together while the second separates them? Might the second AI lack a “God” feature entirely, using that capacity instead to store some other concept the first AI cannot represent?
The research team tested it and found that their two AI models were very similar!
For each feature in the first model, the most similar feature in the second model had a median correlation of 0.72.
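A simple way to picture this comparison (an assumed sketch; the paper’s exact correlation measure may differ) is to take two learned feature dictionaries, find each feature’s best match in the other dictionary by cosine similarity, and report the median of those best-match scores. Here model B is built as a noisy, shuffled copy of model A with a few features replaced; in the real experiment both dictionaries would come from separate training runs:

```python
import numpy as np

# Assumed sketch of the universality check: do two feature dictionaries
# contain mostly the same features? All data here is synthetic.
rng = np.random.default_rng(0)
d_model, n_features = 64, 256

A = rng.normal(size=(n_features, d_model))          # dictionary of run 1
perm = rng.permutation(n_features)
B = A[perm] + 0.3 * rng.normal(size=(n_features, d_model))
B[:16] = rng.normal(size=(16, d_model))             # 16 unrelated features

def best_match_scores(A, B):
    """For each feature (row) of A, the highest cosine similarity to any feature of B."""
    An = A / np.linalg.norm(A, axis=1, keepdims=True)
    Bn = B / np.linalg.norm(B, axis=1, keepdims=True)
    return (An @ Bn.T).max(axis=1)

median_score = float(np.median(best_match_scores(A, B)))
print(round(median_score, 2))  # near 1 when the dictionaries largely agree
```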
What’s next?
In May of this year, OpenAI tried to get GPT-4 (very large) to understand GPT-2 (very small). They had GPT-4 examine GPT-2’s 307,200 neurons and report what it found.
GPT-4 found a few interesting results amid a lot of random nonsense, because OpenAI had not yet mastered the art of projecting real neurons onto simulated neurons and analyzing the simulated ones instead.
Although the results were modest, it was a very ambitious attempt.
Unlike the toy AI in Anthropic’s interpretability article, GPT-2 is a real (albeit very small) AI that once impressed the general public.
But the ultimate goal of research is to be able to explain mainstream AI systems.
Anthropic’s interpretability team admits that they haven’t achieved this yet, for several reasons:
First of all, scaling up the autoencoder is difficult. To interpret a system like GPT-4 (or Anthropic’s equivalent system, Claude), you would need an interpreter AI of roughly the same size.
But training AI on this scale requires huge computing power and financial support.
Secondly, the scalability of the interpretation itself is also an issue.
Even if researchers found all the simulated neurons for God, Godzilla, and everything else, and drew a huge diagram of how they are all connected, they would still need to answer more complex questions, whose solutions involve intricate interactions among millions of features and connections.
So there needs to be some automated process, some larger-scale version of “let GPT-4 tell us what GPT-2 is doing”.
Finally, what does all of this tell us about understanding the human brain?
Humans also use neural networks for reasoning and processing concepts.
The human brain also contains an enormous number of neurons, just as GPT-4 does.
The data available to humans is also very sparse – there are many concepts (such as squid) that rarely appear in everyday life.
Is the human brain, too, simulating a bigger brain?
This is still a very new area of research, but there have been some preliminary findings suggesting that neurons in the human visual cortex do encode features in some hyperlocalized way, similar to patterns observed in AI models.