Abstraction and Reasoning Challenge 2024
“You are teaching the test!”
This was a common “diss” between teachers in America when I was going through school here. In typical American fashion, even our education was the subject of competition. School systems were awarded funding based on the percentage of their students who met the required Math, Language and Science levels on standardized testing.
This created a perverse incentive for teachers to focus on teaching their students the specific problems likely to be on the upcoming examinations. There was no reason to develop the minds of these students in any general purpose fashion… instead the ROI was secured by ensuring they were adept at scoring highly on a narrow band of knowledge.
This same thing is happening in Machine Learning.
AI systems are being scored on dozens of benchmarks. Most of these benchmark providers have published their evaluation metrics, including sample questions and tasks.
If you are a VC-funded AI company whose survival depends on maintaining press coverage and the attention of peers… the temptation to shove these sample questions into the training data of your models is real. Better yet: build mini-programs into the AI system that excel at the family of tasks found in the evaluation test set.
This is part of the reason you see AI scores flying through the roof: incentives.
Let’s talk about the steps the community is taking to counterbalance this force.
ARC Challenge 2024
Welcome to the Abstraction and Reasoning Challenge (ARC), a potential major step towards achieving artificial general intelligence!
In this competition, we are challenged to build an algorithm that can perform reasoning tasks it has never seen before. Classic machine learning problems generally involve one specific task which can be solved by training on millions of data samples. But in this challenge, we need to build an algorithm that can learn patterns from a minimal number of examples.
The objective of this competition is to create an algorithm that is capable of solving abstract reasoning tasks. Critically, these are novel tasks: tasks that the algorithm has never seen before. This is why simply memorizing a set of reasoning templates will not suffice.
No more teaching the test.
The goal is to construct the output grid(s) corresponding to the test input grid(s), using 2 trials for each test input.
Let’s Review Our Data
A "grid" is a rectangular matrix (list of lists) of integers between 0 and 9 (inclusive). The smallest possible grid size is 1x1 and the largest is 30x30.
These grids are represented numerically. Here is the first input/output pair from the training data:
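If you want to pull that first pair up yourself, here is a minimal sketch. The file path and name below are placeholders for whichever task file you open from the public training set; every task file holds "train" and "test" lists of input/output grids.

import json

# Placeholder path: point this at any task file from the public training set.
with open("data/training/task_001.json") as f:
    task = json.load(f)

# Each task file contains "train" and "test" lists of {"input": grid, "output": grid}.
first_pair = task["train"][0]
print("input: ", first_pair["input"])
print("output:", first_pair["output"])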
We then attribute colors to those numbers; here's one approach to the mapping:

from matplotlib import colors
import matplotlib.pyplot as plt

# 0: black, 1: blue, 2: red, 3: green, 4: yellow,
# 5: gray, 6: magenta, 7: orange, 8: sky, 9: brown
_cmap = colors.ListedColormap(
    ['#000000', '#0074D9', '#FF4136', '#2ECC40', '#FFDC00',
     '#AAAAAA', '#F012BE', '#FF851B', '#7FDBFF', '#870C25'])
norm = colors.Normalize(vmin=0, vmax=9)

plt.figure(figsize=(4, 1), dpi=200)
plt.imshow([list(range(10))], cmap=_cmap, norm=norm)
plt.xticks(list(range(10)))
plt.yticks([])
plt.show()
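With the colormap and norm defined, any grid can be rendered the same way. Here is a small helper (assuming the snippet above has already been run; the example grid is made up, not an actual task):

def plot_grid(grid, title=""):
    # Render a single ARC grid using the _cmap and norm defined above.
    plt.figure(dpi=150)
    plt.imshow(grid, cmap=_cmap, norm=norm)
    plt.title(title)
    plt.xticks([])
    plt.yticks([])
    plt.show()

# A made-up 3x3 grid, just to demonstrate the call.
plot_grid([[0, 1, 2], [3, 4, 5], [6, 7, 8]], title="example grid")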
Here are the data associated with the ARC Prize competition:
Public training set
Public evaluation set
Private evaluation set
PUBLIC
The publicly available data is to be used for training and evaluation. The public training set contains 400 task files, and the public evaluation set contains another 400 task files for testing the performance of your algorithm.
PRIVATE
The ARC-AGI leaderboard is measured using 100 private evaluation tasks which are held privately on Kaggle. These tasks are kept private to ensure models cannot be trained on them.
They are not included in the public tasks, but they do use the same structure and cognitive priors. The public training set consists of simpler tasks, whereas the public evaluation set is roughly the same level of difficulty as the private test set.
The Game
The competition evaluates submissions on the percentage of correct predictions on the private evaluation set of 100 tasks.
The final score is the sum of the highest score per task test output (across your two attempts), divided by the total number of task test outputs. Each output counts as correct only if one of your attempts reproduces the expected grid exactly. E.g., if there are two task outputs and one is reproduced exactly while the other is not, your score is 0.5.
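In code, that scoring rule looks roughly like this. This is a sketch under my reading of the rules (two attempts per test output, exact-match grading); the function name and data layout are mine, not the official evaluation code:

def score_submission(expected_outputs, predictions):
    # expected_outputs: list of expected grids
    # predictions: list of [attempt_1, attempt_2] for each task test output
    correct = 0
    for expected, attempts in zip(expected_outputs, predictions):
        # An output scores 1 if either attempt reproduces the grid exactly, else 0.
        if any(attempt == expected for attempt in attempts):
            correct += 1
    return correct / len(expected_outputs)

# Two task outputs, one matched exactly, one missed -> prints 0.5
print(score_submission(
    [[[1, 2]], [[3]]],
    [[[[1, 2]], [[0, 0]]], [[[9]], [[8]]]],
))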
My Approach
This is my third data science competition on Kaggle — so there is a great deal I am still learning about this.
I joined my first competition with just 20 days left in the game. The other competitors had been active for 4+ months, so I needed to work quickly to make up for lost time.
This challenge I have the luxury of time.
I have been working with my AI Council (Gemini, ChatGPT-4 and Claude) to map out the solution opportunity space. Here are a few ideas we’re developing.
Neuro-Symbolic Hybrid System
Program Synthesis with Reinforcement Learning
Evolutionary Algorithms
The first combines pattern recognition (via a neural network) with symbolic logic: a CNN extracts features and feeds them to a reasoning engine that infers relationships.
The second treats each task as a program-synthesis problem. We use RL to explore the space of candidate programs, with a reward signal based on similarity to the training examples. Over time, the agent learns to generate programs that solve the tasks.
Then there are evolutionary algorithms. We are going to generate a population of potential solutions (e.g., sequences of operations) and evolve them through selection, mutation, and crossover.
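To give a flavor of the evolutionary idea, here is a minimal sketch. The operation set and fitness function are toy placeholders, not the actual search space I plan to use:

import random
import numpy as np

# Toy operation set: each candidate "program" is a sequence of grid transforms.
OPS = {
    "identity": lambda g: g,
    "rot90": lambda g: np.rot90(g),
    "flip_h": lambda g: np.fliplr(g),
    "flip_v": lambda g: np.flipud(g),
}

def run_program(program, grid):
    for op in program:
        grid = OPS[op](grid)
    return grid

def fitness(program, pairs):
    # Fraction of cells matched across training pairs (0 when shapes disagree).
    scores = []
    for inp, out in pairs:
        pred = run_program(program, np.array(inp))
        target = np.array(out)
        scores.append(float(np.mean(pred == target)) if pred.shape == target.shape else 0.0)
    return sum(scores) / len(scores)

def evolve(pairs, pop_size=50, generations=100, program_len=3):
    population = [[random.choice(list(OPS)) for _ in range(program_len)]
                  for _ in range(pop_size)]
    for _ in range(generations):
        population.sort(key=lambda p: fitness(p, pairs), reverse=True)
        survivors = population[: pop_size // 2]          # selection
        children = []
        while len(children) < pop_size - len(survivors):
            a, b = random.sample(survivors, 2)
            cut = random.randrange(1, program_len)       # crossover
            child = a[:cut] + b[cut:]
            if random.random() < 0.2:                    # mutation
                child[random.randrange(program_len)] = random.choice(list(OPS))
            children.append(child)
        population = survivors + children
    return max(population, key=lambda p: fitness(p, pairs))

# Toy task: the output is simply the input rotated 90 degrees.
pairs = [([[1, 2], [3, 4]], np.rot90([[1, 2], [3, 4]]).tolist())]
print(evolve(pairs))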
Keys to Winning
First, I plan on increasing the training data by applying transformations to the existing tasks (e.g., rotations, reflections, color changes). No matter the path I end up traveling to try to win this competition, I will need more data than the thin set of files we have been provided. I'm going to start with simple geometric transformations like rotations (90, 180, 270 degrees), reflections (horizontal, vertical), and color permutations (swapping colors).
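Here is a minimal sketch of that augmentation step, assuming grids are handled as NumPy arrays. The helper name is mine, and keeping color 0 (the background) fixed during permutation is my own choice, not a rule from the competition:

import numpy as np

def augment(pair, n_colors=10):
    # Yield transformed copies of a single {"input": grid, "output": grid} pair.
    inp, out = np.array(pair["input"]), np.array(pair["output"])

    # Rotations: 90, 180, 270 degrees applied to input and output together.
    for k in (1, 2, 3):
        yield {"input": np.rot90(inp, k).tolist(), "output": np.rot90(out, k).tolist()}

    # Reflections: horizontal and vertical.
    for flip in (np.fliplr, np.flipud):
        yield {"input": flip(inp).tolist(), "output": flip(out).tolist()}

    # Color permutation: remap colors 1-9 consistently, keeping 0 fixed (my choice).
    perm = np.arange(n_colors)
    perm[1:] = np.random.permutation(perm[1:])
    yield {"input": perm[inp].tolist(), "output": perm[out].tolist()}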
While data augmentation is crucial, it's not the only step required for generalizability. I am opening these doors as well in search of performance:
Regularization: Prevent overfitting by adding constraints to the model.
Domain Randomization: Train the model on a variety of environments or conditions to improve robustness.
Curriculum Learning: Gradually increase the difficulty of tasks during training to encourage generalization.
Building a system that is over-engineered to succeed on the (easier) training tasks will harm performance when the private (more challenging) tasks are evaluated.
Regularization is going to be a key process. I will use dropout, weight decay, and other methods to slightly hurt the model’s performance against the training data in exchange for greatly improving the model’s general-purpose performance.
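If the neural component ends up in PyTorch, those knobs look roughly like this. The tiny network is a placeholder, not my actual architecture; the point is simply where dropout and weight decay plug in:

import torch.nn as nn
import torch.optim as optim

# Placeholder network: the real model will differ; this just shows where the knobs go.
model = nn.Sequential(
    nn.Conv2d(1, 16, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.Dropout2d(p=0.2),          # dropout: randomly zero feature maps during training
    nn.Flatten(),
    nn.LazyLinear(10),
)

# weight_decay adds an L2 penalty on the weights (the "weight decay" mentioned above).
optimizer = optim.AdamW(model.parameters(), lr=1e-3, weight_decay=1e-2)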
This is how I am going to achieve generalizability with my system design.
Stay tuned for Part II.
👋 Thank you for reading Life in the Singularity.
I started this in May 2023, and AI has only accelerated since. Our audience includes Wall St Analysts, VCs, Big Tech Engineers and Fortune 500 Executives.
To help us continue our growth, would you please Like, Comment and Share this?
Thank you again!!!