<?xml version="1.0" encoding="utf-8"?><feed xmlns="http://www.w3.org/2005/Atom" ><generator uri="https://jekyllrb.com/" version="3.10.0">Jekyll</generator><link href="https://naren219.github.io/feed.xml" rel="self" type="application/atom+xml" /><link href="https://naren219.github.io/" rel="alternate" type="text/html" /><updated>2026-05-13T21:07:21+00:00</updated><id>https://naren219.github.io/feed.xml</id><title type="html">Naren Manikandan</title><subtitle>Personal site</subtitle><entry><title type="html">neural network weight prediction</title><link href="https://naren219.github.io/blog/neural-pred/" rel="alternate" type="text/html" title="neural network weight prediction" /><published>2026-05-11T00:00:00+00:00</published><updated>2026-05-11T00:00:00+00:00</updated><id>https://naren219.github.io/blog/neural-pred</id><content type="html" xml:base="https://naren219.github.io/blog/neural-pred/"><![CDATA[<p><strong>TL;DR</strong>: I tried two related tasks: predicting an MLP’s outputs from its weights (easy — works with concat or residual nets), and the inverse problem of inferring weights from input-output pairs (harder — required moving to a Deep Sets architecture, where the bottleneck shifts from sample count to model capacity).</p>

<hr />

<p>I wanted to build a simple neural network for a slightly tricky task: can we predict a model’s weights from just its inputs and outputs? Part I walks through a simpler warm-up, confirming that one network can simulate another network’s forward pass given its weights, and <a href="#part-ii">Part II</a> delves into the question above.</p>

<h2 id="part-i">Part I</h2>

<p>We have a well-defined task: given the inputs and weights of another, smaller neural net, can we predict its outputs with this network? Since the function computed by the generator net (the smaller one) is continuous, the <a href="https://en.wikipedia.org/wiki/Universal_approximation_theorem">universal approximation theorem (UAT)</a> tells us a sufficiently large ReLU net (the nonlinearities are key) can approximate it. The theorem guarantees that this is possible, but gives no recipe for how to do it. So let’s do it.</p>

<h3 id="architecture">Architecture</h3>

<p>To keep things concise, we’ll define two layers with a ReLU activation for our data generator. We can easily pass in weight tensors and evaluate any input.</p>

<div class="language-py highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">def</span> <span class="nf">get_mlp_out</span><span class="p">(</span><span class="n">x</span><span class="p">,</span> <span class="n">w1</span><span class="p">,</span> <span class="n">w2</span><span class="p">):</span>
  <span class="n">y1</span> <span class="o">=</span> <span class="n">F</span><span class="p">.</span><span class="n">relu</span><span class="p">(</span><span class="n">x</span> <span class="o">@</span> <span class="n">w1</span><span class="p">)</span>
  <span class="k">return</span> <span class="n">y1</span> <span class="o">@</span> <span class="n">w2</span>
</code></pre></div></div>

<p>We’ll compare two variations of the predictor net.</p>

<div class="language-py highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">class</span> <span class="nc">XLinear</span><span class="p">(</span><span class="n">nn</span><span class="p">.</span><span class="n">Module</span><span class="p">):</span>
  <span class="k">def</span> <span class="nf">forward</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">x</span><span class="p">):</span>
    <span class="n">h</span> <span class="o">=</span> <span class="n">x</span>

    <span class="k">for</span> <span class="n">layer</span> <span class="ow">in</span> <span class="n">layers</span><span class="p">:</span>
      <span class="n">h</span> <span class="o">=</span> <span class="n">activation</span><span class="p">(</span><span class="n">layer</span><span class="p">(</span><span class="n">h</span><span class="p">))</span>
      <span class="n">h</span> <span class="o">=</span> <span class="n">concat</span><span class="p">(</span><span class="n">h</span><span class="p">,</span> <span class="n">x</span><span class="p">)</span>  <span class="c1"># reattach original input each time
</span>
    <span class="k">return</span> <span class="n">output_layer</span><span class="p">(</span><span class="n">h</span><span class="p">)</span>
</code></pre></div></div>

<p><code class="language-plaintext highlighter-rouge">XLinear</code> has an advantage over a plain linear stack: the generator input and weights, re-concatenated at every layer, can bypass information bottlenecks created by earlier hidden layers.</p>
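<p>To make the sketch concrete, here’s a minimal runnable version of the idea (my own reconstruction, not the exact training code, and the layer sizes are illustrative): the raw input is re-concatenated after every hidden layer, so each layer sees both the previous features and the untouched input.</p>

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class XLinear(nn.Module):
  """ReLU MLP that re-concatenates the raw input after every hidden layer."""
  def __init__(self, in_dim, hidden_dim, out_dim, n_layers=3):
    super().__init__()
    self.layers = nn.ModuleList()
    d = in_dim
    for _ in range(n_layers):
      self.layers.append(nn.Linear(d, hidden_dim))
      d = hidden_dim + in_dim  # width after concatenating x back in
    self.output_layer = nn.Linear(d, out_dim)

  def forward(self, x):
    h = x
    for layer in self.layers:
      h = F.relu(layer(h))
      h = torch.cat([h, x], dim=-1)  # reattach the original input
    return self.output_layer(h)

model = XLinear(in_dim=12, hidden_dim=64, out_dim=1)
out = model(torch.randn(8, 12))
print(out.shape)  # torch.Size([8, 1])
```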

<p>To provide even more scaffolding, we’ll also look at <code class="language-plaintext highlighter-rouge">ResLinear</code> (it doesn’t actually have any conv layers like a typical ResNet, but maybe the skip connections would still help).</p>

<div class="language-py highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">class</span> <span class="nc">ResBlock</span><span class="p">(</span><span class="n">nn</span><span class="p">.</span><span class="n">Module</span><span class="p">):</span>
  <span class="k">def</span> <span class="nf">forward</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">x</span><span class="p">):</span>
    <span class="k">return</span> <span class="n">activation</span><span class="p">(</span><span class="n">main_path</span><span class="p">(</span><span class="n">x</span><span class="p">)</span> <span class="o">+</span> <span class="n">skip_path</span><span class="p">(</span><span class="n">x</span><span class="p">))</span>


<span class="k">class</span> <span class="nc">ResLinear</span><span class="p">(</span><span class="n">nn</span><span class="p">.</span><span class="n">Module</span><span class="p">):</span>
  <span class="k">def</span> <span class="nf">forward</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">x</span><span class="p">):</span>
    <span class="n">x</span> <span class="o">=</span> <span class="n">input_layer</span><span class="p">(</span><span class="n">x</span><span class="p">)</span>
    <span class="n">x</span> <span class="o">=</span> <span class="n">res_block</span><span class="p">(</span><span class="n">x</span><span class="p">)</span>
    <span class="n">x</span> <span class="o">=</span> <span class="n">res_block</span><span class="p">(</span><span class="n">x</span><span class="p">)</span>
    <span class="k">return</span> <span class="n">output_layer</span><span class="p">(</span><span class="n">x</span><span class="p">)</span>
</code></pre></div></div>

<p>The x samples are the concatenation of the generator net’s input and weight tensors and the y samples are the evaluated output (using <code class="language-plaintext highlighter-rouge">get_mlp_out</code>).</p>
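<p>Concretely, the assembly might look like this (a sketch with made-up dimension names, not the exact notebook code): flatten each weight tensor and concatenate everything along the feature dimension.</p>

```python
import torch

batch_size, in_dim, hid, out_dim = 4, 2, 5, 1
x = torch.randn(batch_size, in_dim)          # generator input
w1 = torch.randn(batch_size, in_dim, hid)    # generator weights
w2 = torch.randn(batch_size, hid, out_dim)

# predictor input: generator input and flattened weights, side by side
pred_in = torch.cat([x, w1.flatten(1), w2.flatten(1)], dim=1)
print(pred_in.shape)  # torch.Size([4, 17]): 2 + 2*5 + 5*1
```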

<h3 id="results">Results</h3>

<p>After training with a batch size of 1024 for 50,000 steps, we get the following (note: all loss curves below are smoothed with an exponential moving average to reduce noise).</p>

<p><img src="/assets/images/posts/neuralpred/xlin-comp.png" alt="Comparing XLinear to Linear" /></p>

<p>The concatenation technique did work! But <code class="language-plaintext highlighter-rouge">ResLinear</code> did even better.</p>

<p><img src="/assets/images/posts/neuralpred/xvsres.png" alt="Comparing XLinear to ResLinear" /></p>

<p>By the end of the training run, we reach a loss of 0.017! This can’t be memorization, since we sample fresh generator inputs and weights for every batch.</p>

<hr />
<h2 id="part-ii">Part II</h2>

<p>To reiterate: can we build a weight predictor by taking in the inputs and outputs of a generator network?</p>

<p>My first instinct is to try the same approach as above and see what we get. Before that, a few comments: the space of possible weight matrices is massive. How do we incentivize the bigger network to learn the structure of the smaller one? With sufficient I/O samples, would the smaller model’s architecture be the least lossy path?</p>

<p>To begin, let’s first try a much simpler generator net with only one linear layer.</p>

<div class="language-py highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">def</span> <span class="nf">get_simple_data</span><span class="p">(</span><span class="n">num_unknowns</span><span class="p">,</span> <span class="n">num_pairs</span><span class="p">):</span>
  <span class="c1"># weights are the unknowns
</span>  <span class="n">w</span> <span class="o">=</span> <span class="n">torch</span><span class="p">.</span><span class="n">randn</span><span class="p">(</span><span class="n">batch_size</span><span class="p">,</span> <span class="n">num_unknowns</span><span class="p">,</span> <span class="mi">1</span><span class="p">)</span>
  <span class="n">pairs</span> <span class="o">=</span> <span class="p">[]</span>
  <span class="c1"># we take multiple pairs for each input 
</span>  <span class="c1"># to give more info about the weight layer
</span>  <span class="k">for</span> <span class="n">i</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="n">num_pairs</span><span class="p">):</span>
    <span class="n">x</span> <span class="o">=</span> <span class="n">torch</span><span class="p">.</span><span class="n">randn</span><span class="p">(</span><span class="n">batch_size</span><span class="p">,</span> <span class="n">num_unknowns</span><span class="p">)</span>
    <span class="n">y</span> <span class="o">=</span> <span class="n">torch</span><span class="p">.</span><span class="n">bmm</span><span class="p">(</span><span class="n">x</span><span class="p">.</span><span class="n">unsqueeze</span><span class="p">(</span><span class="mi">1</span><span class="p">),</span> <span class="n">w</span><span class="p">).</span><span class="n">squeeze</span><span class="p">(</span><span class="mi">1</span><span class="p">)</span>
    <span class="n">pairs</span><span class="p">.</span><span class="n">append</span><span class="p">(</span><span class="n">x</span><span class="p">)</span>
    <span class="n">pairs</span><span class="p">.</span><span class="n">append</span><span class="p">(</span><span class="n">y</span><span class="p">)</span>

  <span class="c1"># ...
</span></code></pre></div></div>

<p>Using this data generator, we expect our model (we used <code class="language-plaintext highlighter-rouge">ResLinear</code>) to perform better when given more equations (<code class="language-plaintext highlighter-rouge">num_pairs</code>) than unknowns (<code class="language-plaintext highlighter-rouge">num_unknowns</code>), since the system is overdetermined. So we try this out:</p>

<p><img src="/assets/images/posts/neuralpred/simple-io-pair-comp.png" alt="Comparing three pairs with five pairs with three unknowns (simple generator)" /></p>

<p>And it worked! The model given five I/O pairs per weight tensor performed much better. Since the generator is purely linear, the unknowns could be recovered in closed form by least squares, so this test is more of a sanity check.</p>
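<p>That sanity check can be reproduced without any neural net at all: with more noiseless equations than unknowns, least squares recovers the weights exactly.</p>

```python
import numpy as np

rng = np.random.default_rng(0)
num_unknowns, num_pairs = 3, 5   # overdetermined: 5 equations, 3 unknowns

w = rng.standard_normal((num_unknowns, 1))           # true weights
X = rng.standard_normal((num_pairs, num_unknowns))   # 5 input samples
y = X @ w                                            # noiseless outputs

w_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
print(np.allclose(w_hat, w))  # True
```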

<p>Now, we move on to a more complex generator network that isn’t purely linear.</p>

<div class="language-py highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">def</span> <span class="nf">get_data_v2</span><span class="p">(</span><span class="n">num_pairs</span><span class="p">):</span>
  <span class="s">"""
  30 pairs for 15 unknowns (2*5+5*1).
  """</span>
  <span class="n">w1</span> <span class="o">=</span> <span class="n">torch</span><span class="p">.</span><span class="n">randn</span><span class="p">(</span><span class="n">batch_size</span><span class="p">,</span> <span class="mi">2</span><span class="p">,</span> <span class="mi">5</span><span class="p">)</span>
  <span class="n">w2</span> <span class="o">=</span> <span class="n">torch</span><span class="p">.</span><span class="n">randn</span><span class="p">(</span><span class="n">batch_size</span><span class="p">,</span> <span class="mi">5</span><span class="p">,</span> <span class="mi">1</span><span class="p">)</span>
  <span class="n">pairs</span> <span class="o">=</span> <span class="p">[]</span>
  <span class="k">for</span> <span class="n">i</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="n">num_pairs</span><span class="p">):</span>
    <span class="n">x</span> <span class="o">=</span> <span class="n">torch</span><span class="p">.</span><span class="n">randn</span><span class="p">(</span><span class="n">batch_size</span><span class="p">,</span> <span class="mi">2</span><span class="p">)</span>
    <span class="n">y</span> <span class="o">=</span> <span class="n">batch_mlp</span><span class="p">(</span><span class="n">x</span><span class="p">,</span> <span class="n">w1</span><span class="p">,</span> <span class="n">w2</span><span class="p">)</span>  <span class="c1"># MLP generator defined above
</span>    <span class="n">pairs</span><span class="p">.</span><span class="n">append</span><span class="p">(</span><span class="n">x</span><span class="p">)</span>
    <span class="n">pairs</span><span class="p">.</span><span class="n">append</span><span class="p">(</span><span class="n">y</span><span class="p">)</span>

  <span class="n">inp</span> <span class="o">=</span> <span class="n">torch</span><span class="p">.</span><span class="n">cat</span><span class="p">(</span><span class="nb">tuple</span><span class="p">(</span><span class="n">pairs</span><span class="p">),</span> <span class="mi">1</span><span class="p">)</span>
  <span class="c1"># ...
</span></code></pre></div></div>

<p>Training loop:</p>
<div class="language-py highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">optimizer</span> <span class="o">=</span> <span class="n">torch</span><span class="p">.</span><span class="n">optim</span><span class="p">.</span><span class="n">Adam</span><span class="p">(</span><span class="n">model</span><span class="p">.</span><span class="n">parameters</span><span class="p">(),</span> <span class="n">lr</span><span class="o">=</span><span class="mf">1e-3</span><span class="p">)</span>
<span class="n">loss</span> <span class="o">=</span> <span class="n">nn</span><span class="p">.</span><span class="n">MSELoss</span><span class="p">()</span>

<span class="k">for</span> <span class="n">i</span> <span class="ow">in</span> <span class="n">t</span><span class="p">:</span>
  <span class="n">x</span><span class="p">,</span> <span class="n">weights</span> <span class="o">=</span> <span class="n">get_data_v2</span><span class="p">(</span><span class="n">num_pairs</span><span class="o">=</span><span class="mi">30</span><span class="p">)</span>
  <span class="n">optimizer</span><span class="p">.</span><span class="n">zero_grad</span><span class="p">()</span>

  <span class="n">x_eval</span> <span class="o">=</span> <span class="n">torch</span><span class="p">.</span><span class="n">randn</span><span class="p">(</span><span class="n">batch_size</span><span class="p">,</span> <span class="mi">2</span><span class="p">)</span>

  <span class="c1"># eval true weights on x_eval
</span>  <span class="n">w1</span><span class="p">,</span> <span class="n">w2</span> <span class="o">=</span> <span class="n">weights</span>
  <span class="n">y</span> <span class="o">=</span> <span class="n">batch_mlp</span><span class="p">(</span><span class="n">x_eval</span><span class="p">,</span> <span class="n">w1</span><span class="p">,</span> <span class="n">w2</span><span class="p">)</span>

  <span class="c1"># get weight predictions from test x
</span>  <span class="n">weight_pred</span> <span class="o">=</span> <span class="n">model</span><span class="p">(</span><span class="n">x</span><span class="p">)</span>

  <span class="c1"># eval predicted weights on x_eval
</span>  <span class="n">y_pred</span> <span class="o">=</span> <span class="n">compute_weights</span><span class="p">(</span><span class="n">x_eval</span><span class="p">,</span> <span class="n">weight_pred</span><span class="p">)</span>

  <span class="n">output</span> <span class="o">=</span> <span class="n">loss</span><span class="p">(</span><span class="n">y_pred</span><span class="p">,</span> <span class="n">y</span><span class="p">)</span>
  <span class="n">output</span><span class="p">.</span><span class="n">backward</span><span class="p">()</span>
  <span class="n">optimizer</span><span class="p">.</span><span class="n">step</span><span class="p">()</span>
</code></pre></div></div>
<p>We naively apply the same model to this task. However, there are key modifications to the training procedure:</p>
<ol>
  <li>The loss isn’t computed between the predicted and actual weights. Doing so would make training incredibly inefficient, since many different weight tensors implement the same function and we don’t need any particular one of them. Essentially, we’re optimizing for <strong>functional equivalence</strong>.</li>
  <li>The loss is computed between the true y (the output of our generator net) and the predicted y (the result of running a newly-sampled eval x through the predicted weights). This is a Monte Carlo estimate of the expected error over the entire input distribution.</li>
</ol>
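<p>The <code class="language-plaintext highlighter-rouge">compute_weights</code> helper isn’t shown in full above; a plausible sketch (my reconstruction, with split sizes assuming the 2→5→1 generator) reshapes the flat prediction back into w1 and w2 and runs the generator’s forward pass:</p>

```python
import torch
import torch.nn.functional as F

def compute_weights(x_eval, weight_pred, in_dim=2, hid=5, out_dim=1):
  """Split the flat weight prediction into (w1, w2), then run the
  generator forward pass with the predicted weights."""
  n1 = in_dim * hid
  w1 = weight_pred[:, :n1].view(-1, in_dim, hid)
  w2 = weight_pred[:, n1:].view(-1, hid, out_dim)
  h = F.relu(torch.bmm(x_eval.unsqueeze(1), w1))
  return torch.bmm(h, w2).squeeze(1)

weight_pred = torch.randn(4, 2 * 5 + 5 * 1)  # stand-in for model(x)
y_pred = compute_weights(torch.randn(4, 2), weight_pred)
print(y_pred.shape)  # torch.Size([4, 1])
```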

<p>It turns out that despite having twice as many equations as unknowns, the loss doesn’t decrease at all. A few reasons why:</p>
<ol>
  <li>Since we blindly smush the x/y pairs into one long 1D vector in <code class="language-plaintext highlighter-rouge">get_data_v2</code>, we give the model no inductive bias about how values group into pairs. This makes learning the I/O mappings an uphill battle.</li>
  <li>The input representation isn’t permutation-invariant: the same set of pairs in a different order looks like a brand-new example to the model (which it shouldn’t).</li>
  <li>With the nonlinearity now in the generator net (ReLU!), the mapping is much harder for another model to learn (especially on top of issues 1 and 2).</li>
</ol>
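<p>Issue 2 is easy to demonstrate: flattening makes pair order matter, while a set-style pooling (a mean over the pairs) doesn’t care.</p>

```python
import torch

pairs = torch.arange(90.).view(30, 3)    # 30 (x, y) rows with distinct values
perm = torch.roll(torch.arange(30), 1)   # a fixed non-identity permutation

# flattened into one long vector, order matters: a reshuffle is a "new" input
print(torch.equal(pairs.flatten(), pairs[perm].flatten()))  # False

# mean-pooled over the set, order is irrelevant
print(torch.allclose(pairs.mean(0), pairs[perm].mean(0)))   # True
```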

<p>Let’s try something else: what if we directly embed x and y into some unified vector and have the model predict weights from here?</p>

<h3 id="architecture-1">Architecture</h3>

<div class="language-py highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">class</span> <span class="nc">PairEncoder</span><span class="p">(</span><span class="n">nn</span><span class="p">.</span><span class="n">Module</span><span class="p">):</span>
  <span class="k">def</span> <span class="nf">__init__</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">x_dim</span><span class="p">,</span> <span class="n">y_dim</span><span class="p">,</span> <span class="n">embed_dim</span><span class="p">):</span>
    <span class="nb">super</span><span class="p">().</span><span class="n">__init__</span><span class="p">()</span>
    <span class="bp">self</span><span class="p">.</span><span class="n">net</span> <span class="o">=</span> <span class="n">nn</span><span class="p">.</span><span class="n">Sequential</span><span class="p">(</span>
        <span class="n">nn</span><span class="p">.</span><span class="n">Linear</span><span class="p">(</span><span class="n">x_dim</span> <span class="o">+</span> <span class="n">y_dim</span><span class="p">,</span> <span class="mi">128</span><span class="p">),</span>
        <span class="n">nn</span><span class="p">.</span><span class="n">ReLU</span><span class="p">(),</span>
        <span class="n">nn</span><span class="p">.</span><span class="n">Linear</span><span class="p">(</span><span class="mi">128</span><span class="p">,</span> <span class="mi">256</span><span class="p">),</span>
        <span class="n">nn</span><span class="p">.</span><span class="n">ReLU</span><span class="p">(),</span>
        <span class="n">nn</span><span class="p">.</span><span class="n">Linear</span><span class="p">(</span><span class="mi">256</span><span class="p">,</span> <span class="n">embed_dim</span><span class="p">)</span>
    <span class="p">)</span>

  <span class="k">def</span> <span class="nf">forward</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">x</span><span class="p">,</span> <span class="n">y</span><span class="p">):</span>
    <span class="n">pair</span> <span class="o">=</span> <span class="n">torch</span><span class="p">.</span><span class="n">cat</span><span class="p">([</span><span class="n">x</span><span class="p">,</span> <span class="n">y</span><span class="p">],</span> <span class="n">dim</span><span class="o">=-</span><span class="mi">1</span><span class="p">)</span>
    <span class="k">return</span> <span class="bp">self</span><span class="p">.</span><span class="n">net</span><span class="p">(</span><span class="n">pair</span><span class="p">)</span>
</code></pre></div></div>

<p>Now, we have a pair encoder that transforms a concatenated x and y into an embedding vector.</p>

<div class="language-py highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">class</span> <span class="nc">WeightPredictor</span><span class="p">(</span><span class="n">nn</span><span class="p">.</span><span class="n">Module</span><span class="p">):</span>
  <span class="k">def</span> <span class="nf">__init__</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">x_dim</span><span class="p">,</span> <span class="n">y_dim</span><span class="p">,</span> <span class="n">out_dim</span><span class="p">,</span> <span class="n">embed_dim</span><span class="p">):</span>
    <span class="nb">super</span><span class="p">().</span><span class="n">__init__</span><span class="p">()</span>
    <span class="bp">self</span><span class="p">.</span><span class="n">encoder</span> <span class="o">=</span> <span class="n">PairEncoder</span><span class="p">(</span><span class="n">x_dim</span><span class="p">,</span> <span class="n">y_dim</span><span class="p">,</span> <span class="n">embed_dim</span><span class="p">)</span>
    <span class="bp">self</span><span class="p">.</span><span class="n">decoder</span> <span class="o">=</span> <span class="n">nn</span><span class="p">.</span><span class="n">Sequential</span><span class="p">(</span>
        <span class="n">nn</span><span class="p">.</span><span class="n">Linear</span><span class="p">(</span><span class="n">embed_dim</span><span class="p">,</span> <span class="mi">256</span><span class="p">),</span>
        <span class="n">nn</span><span class="p">.</span><span class="n">ReLU</span><span class="p">(),</span>
        <span class="n">nn</span><span class="p">.</span><span class="n">Linear</span><span class="p">(</span><span class="mi">256</span><span class="p">,</span> <span class="mi">128</span><span class="p">),</span>
        <span class="n">nn</span><span class="p">.</span><span class="n">ReLU</span><span class="p">(),</span>
        <span class="n">nn</span><span class="p">.</span><span class="n">Linear</span><span class="p">(</span><span class="mi">128</span><span class="p">,</span> <span class="n">out_dim</span><span class="p">)</span>
    <span class="p">)</span>

  <span class="k">def</span> <span class="nf">forward</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">x</span><span class="p">,</span> <span class="n">y</span><span class="p">):</span>
    <span class="n">embeddings</span> <span class="o">=</span> <span class="bp">self</span><span class="p">.</span><span class="n">encoder</span><span class="p">(</span><span class="n">x</span><span class="p">,</span> <span class="n">y</span><span class="p">)</span>
    <span class="c1"># avgs across input pairs in each batch. 
</span>    <span class="c1"># key for permutation invariance.
</span>    <span class="n">embed_pooled</span> <span class="o">=</span> <span class="n">embeddings</span><span class="p">.</span><span class="n">mean</span><span class="p">(</span><span class="n">dim</span><span class="o">=</span><span class="mi">1</span><span class="p">)</span>
    <span class="k">return</span> <span class="bp">self</span><span class="p">.</span><span class="n">decoder</span><span class="p">(</span><span class="n">embed_pooled</span><span class="p">)</span>
</code></pre></div></div>

<p>Finally, we decode these embeddings into the weights.</p>

<p>For all the training runs below, we use a learning rate of <code class="language-plaintext highlighter-rouge">1e-3</code> and LeCun weight initialization for the data generator (I did neither in the first iteration, and that was a huge mistake). Our embedding dimension is 128.</p>
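<p>LeCun initialization here just means scaling each random weight matrix by <code class="language-plaintext highlighter-rouge">1/sqrt(fan_in)</code>, so the generator’s outputs stay around unit scale instead of growing with layer width. A sketch of what that looks like for the generator (dimension names are my own):</p>

```python
import torch

torch.manual_seed(0)
batch_size, in_dim, hid, out_dim = 256, 2, 5, 1

# LeCun-style scaling: std = 1 / sqrt(fan_in)
w1 = torch.randn(batch_size, in_dim, hid) / in_dim ** 0.5
w2 = torch.randn(batch_size, hid, out_dim) / hid ** 0.5

x = torch.randn(batch_size, in_dim)
y = torch.bmm(torch.relu(torch.bmm(x.unsqueeze(1), w1)), w2).squeeze(1)
print(y.shape)  # torch.Size([256, 1]); outputs stay on the order of 1
```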

<h3 id="results-1">Results</h3>
<p>Following the same paradigm of predicting weights from the training input and using a fresh eval x for the y prediction, we get this graph, where we try predicting the generator’s 15 unknown weights using 20, 30, and 45 equations.</p>

<p><img src="/assets/images/posts/neuralpred/15-vars-test.png" alt="Predicting 15 unknowns from 20, 30, and 45 eqns" /></p>

<p>Cool results: our new method works! The loss goes down to 0.005 for the 45-equation curve and is only a bit higher for the other two. Let’s scale up the weights of the generator model and see if we get the same trend.</p>

<p><img src="/assets/images/posts/neuralpred/50-vars-test.png" alt="Predicting 50 unknowns from 67 (🤷‍♂️), 100, and 150 eqns" /></p>

<p>I followed the same scaling factors as before: x1.33, x2, and x3, but with 50 unknowns instead. To compare, let’s use $R^2$, the fraction of variance explained by the model.</p>

$$
\begin{array}{c|cc}
\text{Equations / Unknowns} & \text{15 unknowns} & \text{50 unknowns} \\
\hline
1.33\times & 98.3\% & 91.7\% \\
2\times & 98.6\% & 93.6\% \\
3\times & 99.0\% & 93.8\%
\end{array}
$$
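<p>For reference, the $R^2$ values above can be computed as one minus the ratio of the test MSE to the variance of the true outputs:</p>

```python
import numpy as np

def r_squared(y_true, y_pred):
  """Fraction of variance explained: 1 - MSE / Var(y_true)."""
  mse = np.mean((y_true - y_pred) ** 2)
  return 1.0 - mse / np.var(y_true)

y = np.array([1.0, 2.0, 3.0, 4.0])
print(r_squared(y, y))                     # 1.0 (perfect predictions)
print(r_squared(y, np.full(4, y.mean())))  # 0.0 (mean baseline)
```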

<p>We see that with more unknowns, the problem is genuinely harder per unit of variance: 5-6% less explained variance at each ratio. There’s also saturation from 2x to 3x equations for the 50-unknown problem, while the 15-unknown counterpart still improves. One hypothesis for this behavior is an embedding bottleneck: the <code class="language-plaintext highlighter-rouge">PairEncoder</code> must collapse all the pairs into a 128-dim vector, and with more unknowns that fixed-size vector needs to hold more information. Furthermore, mean pooling throws away the spread and correlations between pairs, which actually matter; switching to attention pooling could make a big difference.</p>
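<p>A minimal sketch of what attention pooling could look like as a drop-in for the mean (still permutation-invariant, since a softmax-weighted sum ignores element order); this is a generic construction, not code from the notebook:</p>

```python
import torch
import torch.nn as nn

class AttnPool(nn.Module):
  """Learned pooling: a weighted mean whose weights are predicted
  per element, instead of a fixed 1/N."""
  def __init__(self, embed_dim):
    super().__init__()
    self.score = nn.Linear(embed_dim, 1)

  def forward(self, embeddings):  # (batch, num_pairs, embed_dim)
    attn = self.score(embeddings).softmax(dim=1)  # (batch, num_pairs, 1)
    return (attn * embeddings).sum(dim=1)         # (batch, embed_dim)

pool = AttnPool(embed_dim=128)
pooled = pool(torch.randn(4, 30, 128))
print(pooled.shape)  # torch.Size([4, 128])
```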

<h3 id="discussion">Discussion</h3>

<p>Previously, we raised the question of whether the generator net’s weight layout provides the least-cost path to predicting the right outputs. It’s not clear whether training lands on a weight configuration in the same equivalence class as the true weights, or whether the predicted weights happen to approximate the function only on this training distribution and nowhere else. Further tests might include adding more nonlinearities to the generator network, or sampling the evaluation x tensors from a different distribution and seeing whether accuracy holds up.</p>

<p>It turns out the architecture we used is essentially a <a href="https://arxiv.org/abs/1703.06114">Deep Sets</a> model: encode the inputs independently, pool across the set, and decode. Its advantages are permutation invariance and the ability to handle input sets of varying sizes. It’s cool that our pretty simple weight prediction challenge independently arrived at the same method as the paper.</p>

<p>There are many avenues to continue this experiment, from increasing the embedding size to see if it’s the bottleneck to scaling the generator and predictor networks. Check out the <a href="https://colab.research.google.com/drive/10NUYSQniXh1DWmZ_hkGcCnLoBe_EGyhk?usp=sharing">code</a> and <a href="mailto:nmanikandan219@gmail.com">email me</a> what you think. Thanks!</p>]]></content><author><name></name></author><summary type="html"><![CDATA[TL;DR: I tried two related tasks: predicting an MLP’s outputs from its weights (easy — works with concat or residual nets), and the inverse problem of inferring weights from input-output pairs (harder — required moving to a Deep Sets architecture, where the bottleneck shifts from sample count to model capacity).]]></summary></entry><entry><title type="html">random fun stuff</title><link href="https://naren219.github.io/blog/fun/" rel="alternate" type="text/html" title="random fun stuff" /><published>2026-05-09T00:00:00+00:00</published><updated>2026-05-09T00:00:00+00:00</updated><id>https://naren219.github.io/blog/fun</id><content type="html" xml:base="https://naren219.github.io/blog/fun/"><![CDATA[<p>chillosophy was a club at my high school where we discussed perplexing questions about the world around us. it taught me a lot about the value of conversation and perspective. <a href="https://drive.google.com/drive/folders/11uAVMPeMEsgAtfEWoTdNxrcRLUHVBfvm">here</a> you’ll find all the crazy stuff we talked about.</p>

<p>i wrote my <a href="https://docs.google.com/document/d/1p8M8IcLkv9fGi1V56cmJRlGfHSTW8fANxT8eWPrfaJ8/edit?tab=t.0">personal statement</a> for college about indian astrology. it was quite risky but i tied it into my fascination with the unfathomable and the limits of science.</p>

<p>i love to read. biographies, sci fi, history, philosophy, business: anything i can get my hands on. i have a <a href="https://narenmani.notion.site/1ba8a9a357d2813ea5b9f9b81dc7227e?v=1ba8a9a357d281788add000c530b9586&amp;source=copy_link">notion database</a> tracking the stuff i’ve read and my takeaways from each piece.</p>

<p>things to (maybe) do</p>
<ul>
  <li>understand basic cryptography</li>
  <li>try hacks on raspberry pi</li>
  <li>invest in a crypto currency</li>
  <li>buy an nft (and see how it works)</li>
  <li>try out muzero on some atari game and see how it does</li>
  <li>run alphafold and see how it works</li>
  <li>designing a vision system that can recognize the activity of my parakeets and notify unusual behavior</li>
  <li>try out a ctf</li>
  <li>get a gpu and train a model</li>
  <li>website for dad’s photos</li>
  <li>attend treehacks (and a bunch of hackathons all over the world)</li>
  <li>get into yc (edit: not sure about this)</li>
  <li>go on a sabbatical somewhere remote</li>
  <li>internship at a big tech company (just to see what it’s like to work there)</li>
</ul>]]></content><author><name></name></author><summary type="html"><![CDATA[chillosophy was a club at my high school where we discussed perplexing questions about the world around us. it taught me a lot about the value of conversation and perspective. here you’ll find all the crazy stuff we talked about.]]></summary></entry><entry><title type="html">models of intelligence</title><link href="https://naren219.github.io/blog/intelligence/" rel="alternate" type="text/html" title="models of intelligence" /><published>2026-03-15T00:00:00+00:00</published><updated>2026-03-15T00:00:00+00:00</updated><id>https://naren219.github.io/blog/intelligence</id><content type="html" xml:base="https://naren219.github.io/blog/intelligence/"><![CDATA[<p>i created this article primarily to consolidate my thoughts on how different people have thought of language modeling as a sufficient paradigm for general intelligence.</p>

<p>useful intuitions of language models from <a href="https://www.youtube.com/watch?v=3gb-ZkVRemQ&amp;t=1018s&amp;ab_channel=StanfordOnline">this lecture</a> to preface this piece:</p>

<ul>
  <li>next-token prediction is massively multi-task learning</li>
  <li>scaling compute reliably improves loss</li>
</ul>

<p><strong>in-context learning:</strong> the ability of a model to adapt to the user prompt without changing any weights. this enables zero-shot generalization, where the model can answer a novel question in one go based on the patterns and knowledge absorbed during training.</p>

<p>the <strong>residual stream</strong> is an evolving embedding vector that serves as the memory system for the entire model. its deep, largely linear structure has many implications in mech interp: attention and feedforward layers can read from and write to the stream at every layer, depending on the needs of the model, while residual (skip) connections preserve past information. this framework shows us that transformers operate on a shared “scratchpad” of embeddings. each layer doesn’t overwrite its predecessor but instead increments the residual stream with whatever new information or feature transformations are needed.</p>
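<p>a minimal numpy sketch of this read/write picture (the <code>sublayer</code> function, the width, and the layer count are hypothetical stand-ins, not any real model’s internals):</p>

```python
import numpy as np

d = 16                               # width of the residual stream (illustrative)
rng = np.random.default_rng(0)

def sublayer(stream, W):
    # stand-in for an attention or feedforward sublayer: it *reads* the
    # current stream and computes an update to *write* back
    return np.tanh(stream @ W)

x = rng.normal(size=d)               # token embedding entering the stream
stream = x.copy()
for _ in range(4):                   # four "layers" sharing one scratchpad
    W = rng.normal(size=(d, d)) * 0.1
    stream = stream + sublayer(stream, W)  # increment, never overwrite

print(stream.shape)                  # the stream keeps the same shape throughout
```

<p>note that replacing the addition with <code>stream = sublayer(stream, W)</code> would overwrite the scratchpad and lose the skip-connection property.</p>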

<p>proponents of the “scaling transformers to AGI” paradigm argue that autoregressive next-token prediction can lead to emergent capabilities when given longer context windows that can represent more complex concepts. models must learn to compress vast amounts of information about the world, which correlates with increased generalization.</p>

<p>the pretraining component is why many AI scientists don’t take this potential path to general intelligence seriously: the model ingests enormous amounts of data and effectively memorizes patterns across the samples. <a href="https://open.substack.com/pub/fchollet/p/how-i-think-about-llm-prompt-engineering">François Chollet states</a> that LLMs store vector programs that map one embedding space to another, and that their reasoning capabilities are only interpolation, bounded by the input data distribution.</p>

<h3 id="chain-of-thought-cot-reasoning">Chain of Thought (CoT) reasoning</h3>

<p>according to this <a href="https://www.interconnects.ai/p/why-reasoning-models-will-generalize">article</a>, compared to direct answer generation, where we rely on only a few tokens for processing, CoT prompting splits the computation across many tokens. each generated token gets added to the context window, creating state-space recurrence rather than parameter-space recurrence (the latter is built directly into the architecture of a Recurrent Neural Network). recurrence allows the model to adapt to the needs of the prompt and hold a latent representation that it can reuse. there are likely more links between recurrence and reasoning that i’m missing.</p>

<p>i still feel like the author’s conclusion (quoted below) lacks explanation:</p>

<blockquote>
  <p>chain of thought is a natural fit for language models to “reason” and therefore one should be optimistic about training methods that are designed to enhance it generalizing to many domains.</p>

</blockquote>

<h3 id="reinforcement-learning">reinforcement learning</h3>

<p>it’s pretty amazing what reinforcement learning has allowed us to accomplish. deepseek went from the base V3 model to R1-Zero purely through RL. GRPO (group relative policy optimization) was the custom training algorithm, and its rule-based rewards for accuracy and formatting led the model to naturally develop the following behaviors:</p>

<ul>
  <li>reflective behaviors without explicit prompting</li>
  <li>allocate “thinking time” to harder problems and create more CoT traces</li>
  <li>interesting “wait” and “aha” moments that show an understanding of discovery</li>
</ul>

<p>there were still problems with readability and language mixing, as Supervised Fine-Tuning was completely excluded. it would be interesting to predict exactly which improvements to the R1-Zero line would let it dominate the regular R-series. still, this is a massive new frontier, as portrayed below.</p>

<blockquote>
  <p>It’s the solving strategies you see this model use in its chain of thought. It’s how it goes back and forth thinking to itself. These thoughts are <em>emergent</em> (!!!) and this is actually seriously incredible, impressive and new (as in publicly available and documented etc.). The model could never learn this with 1 (by imitation), because the cognition of the model and the cognition of the human labeler is different. The human would never know to correctly annotate these kinds of solving strategies and what they should even look like. They have to be discovered during reinforcement learning as empirically and statistically useful towards a final outcome.</p>

  <p>Andrej Karpathy, X</p>
</blockquote>

<p><a href="https://arcprize.org/blog/announcing-arc-agi-2-and-arc-prize-2025">Arc Prize</a> is an amazing effort to continue challenging frontier AI models with a benchmark (Arc-AGI) that’s easy for humans but hard for language models (even reasoning models with the second iteration).</p>]]></content><author><name></name></author><summary type="html"><![CDATA[i created this article primarily to consolidate my thoughts on how different people have thought of language modeling as a sufficient paradigm for general intelligence.]]></summary></entry><entry><title type="html">crypto – notes to self</title><link href="https://naren219.github.io/blog/crypto/" rel="alternate" type="text/html" title="crypto – notes to self" /><published>2026-03-11T00:00:00+00:00</published><updated>2026-03-11T00:00:00+00:00</updated><id>https://naren219.github.io/blog/crypto</id><content type="html" xml:base="https://naren219.github.io/blog/crypto/"><![CDATA[<p><a href="https://bitcoin.org/bitcoin.pdf">bitcoin paper</a></p>

<p>it’s really interesting how bitcoin rethinks our understanding of trust. instead of holding third parties accountable for verifying transactions, bitcoin leverages decentralized consensus to establish a collective truth about the transaction history.</p>

<p>the <strong>Byzantine generals problem</strong> arose from a scenario where a set of generals had to coordinate an attack despite the possibility of message interference and traitorous generals sabotaging the plan. the unsolved issue for a truly decentralized currency was that these generals had to be semi-trusted rather than completely anonymous. if there are no “trusted higher authorities” or notions of identity, how do we prevent single users from assuming multiple identities and committing harmful actions (aka a Sybil attack)?</p>

<p><strong>proof of work</strong> is the mechanism Nakamoto proposed to limit identity by the ability to solve a hard computational puzzle, relying on economic and computational costs to make it prohibitively expensive for malicious actors to launch these attacks. this lets bitcoin achieve <a href="https://en.wikipedia.org/wiki/Byzantine_fault"><strong>Byzantine fault tolerance</strong></a>, a property in distributed computing where independent computers can reach consensus despite malicious nodes that may introduce false information (the Nakamoto Consensus).</p>
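<p>a hashcash-style toy version of the puzzle, assuming nothing about bitcoin’s real block format: producing a valid nonce takes many hashes, while checking one takes a single hash.</p>

```python
import hashlib

def mine(block_data: bytes, difficulty: int = 2) -> int:
    """search for a nonce whose sha-256 digest of (data + nonce)
    starts with `difficulty` zero bytes -- costly to find, cheap to check."""
    nonce = 0
    while True:
        digest = hashlib.sha256(block_data + nonce.to_bytes(8, "big")).digest()
        if digest[:difficulty] == b"\x00" * difficulty:
            return nonce
        nonce += 1

nonce = mine(b"alice pays bob 1 btc")
# verification costs a single hash, no matter how long mining took
digest = hashlib.sha256(b"alice pays bob 1 btc" + nonce.to_bytes(8, "big")).digest()
assert digest[:2] == b"\x00" * 2
```

<p>raising <code>difficulty</code> by one byte multiplies the expected mining work by 256 while leaving verification cost unchanged; that asymmetry is the whole trick.</p>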

<p>without a central figure overseeing where blocks are added, there’s the possibility of double-spending, where users spend the same tokens more than once. to prevent this, Nakamoto proposed a timestamp server in which each block contains the hash of the previous block (hence the name blockchain). modifying any previous block would mean redoing the proof-of-work for every block after it, which is prohibitively hard.</p>
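<p>a toy sketch of the chaining (real bitcoin blocks hash a header containing a merkle root, timestamp, and nonce; here each “block” is just a <code>(prev_hash, data)</code> pair i made up for illustration):</p>

```python
import hashlib

def block_hash(prev_hash: str, data: str) -> str:
    # each block commits to its predecessor by hashing that block's hash
    return hashlib.sha256((prev_hash + data).encode()).hexdigest()

chain = [("0" * 64, "genesis")]
for data in ["tx batch 1", "tx batch 2", "tx batch 3"]:
    chain.append((block_hash(*chain[-1]), data))

# tampering with an early block changes its hash, breaking every later link
tampered = list(chain)
tampered[1] = (tampered[1][0], "tx batch 1 (double spend)")
assert block_hash(*tampered[1]) != block_hash(*chain[1])
```

<p>an attacker who edits block 1 must recompute block 2’s hash, then block 3’s, and so on, redoing the proof-of-work each time.</p>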

<p>two rules to ensure this is feasible:</p>

<ol>
  <li>miners are incentivized to build blocks only on the longest chain (very likely the valid history, since it embodies the most computational work).</li>
  <li>transactions are never truly final on the blockchain: there can be multiple branches, one of which is the honest consensus while the others belong to attackers trying to outcompete it. yet it is very costly for a malicious chain to outpace the valid one. as shown below, the probability that an attacker catches up drops exponentially, assuming <code class="language-plaintext highlighter-rouge">p &gt; q</code> (honest nodes hold the majority of hash power).</li>
</ol>

<p><img src="/assets/images/posts/crypto/eqn.jpeg" alt="" /></p>
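<p>the equation above is nakamoto’s gambler’s-ruin calculation (section 11 of the paper); a direct transcription, with function and variable names of my own choosing:</p>

```python
from math import exp, factorial

def attacker_success(q: float, z: int) -> float:
    """probability an attacker with fraction q of the hash power ever
    catches up from z blocks behind; p = 1 - q is the honest share."""
    p = 1.0 - q
    lam = z * (q / p)                 # expected attacker progress while z honest blocks land
    prob = 1.0
    for k in range(z + 1):
        poisson = lam**k * exp(-lam) / factorial(k)
        prob -= poisson * (1 - (q / p) ** (z - k))
    return prob

# probability falls off exponentially in z whenever p > q
print([attacker_success(0.1, z) for z in (0, 2, 5, 10)])
```

<p>with <code>q = 0.1</code>, waiting five confirmations already pushes the attacker’s chance below a tenth of a percent, which is why merchants wait for a few blocks before treating a payment as settled.</p>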

<p>note: if bad actors control more than 50% of the total computational power, they can dominate consensus, allowing them to double-spend, block transactions, and reorganize chains to their advantage (a “51% attack”).</p>

<p>incentives for miners are the following:</p>

<ol>
  <li>new coins are added into circulation as miners expend their CPU/GPU time and electricity for mining.</li>
  <li>transaction fees, paid by users to have their transactions included, are awarded to miners. the exact fee operates as a market mechanism where users bid for prioritization by the miners. fees will become the main form of miner incentive, since the supply of new coins is capped.</li>
</ol>

<p>i want to point out the beautiful way this system disincentivizes bad actors.</p>

<blockquote>
  <p>The incentive may help encourage nodes to stay honest. If a greedy attacker is able to assemble more CPU power than all the honest nodes, he would have to choose between using it to defraud people by stealing back his payments, or using it to generate new coins. He ought to find it more profitable to play by the rules, such rules that favour him with more new coins than everyone else combined, than to undermine the system and the validity of his own wealth.</p>
</blockquote>]]></content><author><name></name></author><summary type="html"><![CDATA[bitcoin paper]]></summary></entry></feed>