# I used Claude, a DGX Spark, and three AI peer reviewers to reproduce a scientific paper. Here's how it actually went.

---

I recently published a paper and an accompanying article showing that the viral satellite radar pyramid claims, according to which there are massive structures below ground in Giza, don't hold up.

This article is about how I did the research, and how the work actually got done in a way quite different from what I had planned.

As a non-technical user, I wanted to be able to give an agent a goal on a topic I'm interested in but for which I have neither domain expertise nor the technical skills to do the work myself.

The original idea was to use OpenClaw to carry out an autonomous research loop: a system that could take a scientific paper and any supporting reference material as input, design experiments to test it, write the processing code, run it against real satellite data, evaluate the results, and iterate until it either confirmed or disproved the claims. The design was loosely inspired by the Karpathy autoresearch concept, where the agent converges on a result through self-directed iteration.

I did the initial planning on Claude Desktop using Opus 4.6, handed the resulting Markdown file to OpenClaw, and had my DGX Spark ready for the actual radar data processing. OpenClaw ran the agent loop, with Opus as orchestrator handling strategic decisions and Sonnet doing the coding work.

That setup worked for about the first third of the project. Then the context problem showed up.

---

## The Context Problem

OpenClaw could build the processing pipeline: reading satellite radar files, splitting signals into Doppler sub-apertures, computing coherence between the sub-apertures, and running tomographic focusing. Each stage, taken individually, was fine.
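To make "a stage" concrete, here is a minimal sketch of what the sub-aperture coherence step might look like. This is written for this article, not lifted from the project: the function name, the two-look split, and the window size are my illustrative choices, and a real pipeline would also handle Doppler centroids and spectral weighting, which are omitted here.

```python
import numpy as np
from scipy.ndimage import uniform_filter

def subaperture_coherence(slc, n_sub=2, win=9):
    """Split a single-look complex (SLC) image into Doppler sub-apertures
    along azimuth and estimate coherence between the first and last look.
    Sketch only: Doppler centroid handling and spectral weighting omitted."""
    n_az = slc.shape[0]
    spec = np.fft.fft(slc, axis=0)              # azimuth spectrum per range bin
    band = n_az // n_sub
    looks = []
    for k in range(n_sub):
        sub = np.zeros_like(spec)
        sub[k * band:(k + 1) * band] = spec[k * band:(k + 1) * band]
        looks.append(np.fft.ifft(sub, axis=0))  # back to the image domain
    a, b = looks[0], looks[-1]
    # Boxcar-averaged complex cross-correlation, normalized to [0, 1]
    cross = a * np.conj(b)
    num = uniform_filter(cross.real, win) + 1j * uniform_filter(cross.imag, win)
    den = np.sqrt(uniform_filter(np.abs(a) ** 2, win)
                  * uniform_filter(np.abs(b) ** 2, win))
    return np.abs(num) / (den + 1e-12)
```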

The issue was that research is not a sequence of independent stages. Each decision depends on why the previous one was made. When you choose a coregistration method, that choice constrains what the tomographic inversion can do downstream. When you pick a sub-aperture count, it changes the baseline geometry, which changes the resolution, which changes what depths you can claim to see. The whole chain is coupled.
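One concrete instance of that coupling, using the generic textbook relation for tomographic elevation resolution rather than the exact formula from my scripts: the baseline span the sub-apertures create feeds straight into the depth resolution you can claim.

```latex
% Rayleigh elevation (depth) resolution in SAR tomography:
% wavelength \lambda, slant range r, perpendicular baseline span \Delta b_\perp
\delta_z \approx \frac{\lambda \, r}{2 \, \Delta b_\perp}
% Halve the baseline span (say, by choosing fewer sub-apertures) and
% \delta_z doubles, and with it the smallest structure you can resolve.
```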

When the agent's context got compacted — and it happened multiple times — those couplings broke. The agent would rewrite a stage using a convention that contradicted an earlier stage, resulting in a quietly wrong output that looked numerically valid but was scientifically meaningless. I tried different approaches to preserve memory across sessions — mem0, LCM DAG, memory flushes, qmd — but none of them held up once the accumulated context got large enough.

Code correctness can be checked one function at a time. Research correctness depends on whether the full chain of decisions still holds together.

---

## What Replaced It

The actual research got done in a single long Claude Opus conversation on Claude Desktop. 

I described what I wanted to test, Opus wrote the Python script, I copied it to the DGX Spark, ran it, and pasted the terminal output back into the conversation. Opus interpreted the results, identified problems, proposed the next experiment, and wrote the next script. I basically added a human in the loop. That loop ran for the entire project: geometry derivations, empirical null tests, coherence analysis, paper drafting, and four rounds of AI peer review responses.

The important difference from the agent approach is that the context was never compacted. Every failed attempt, every corrected formula, every design decision and the reasoning behind it stayed in the conversation. When ChatGPT later found a formula error, Opus could trace it back through the full chain: here is where the error entered, here are the three downstream calculations it affected, here is the corrected version of each, and here are the seven places in the paper where the old number appears. That kind of traceability requires remembering the whole project, not just the current task.

---

## Where Autonomy Actually Worked

One piece of the project did run as an autonomous loop, and it worked well.

Once the core processing pipeline was validated and I trusted the code, we needed to sweep across every parameter combination to check whether any configuration could distinguish the pyramid from empty desert: sub-aperture count, bandwidth ratio, filter type, and depth range. That came to 240 experiments, each comparing two sites with identical processing.

That is the shape of problem where autonomous iteration belongs: a precise question with an unambiguous evaluation metric. Run the experiment, compute the score, log the result, move to the next combination. This was a brute-force grid sweep rather than a Karpathy-style ratchet; the parameter space was small enough that exhaustive search was faster than intelligent exploration. The loop ran, produced clear results, and the answer was that zero of the 240 experiments could tell the pyramid apart from sand.
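For illustration, the sweep is just a few nested loops over a fixed grid. Everything below is a stand-in I wrote for this article (grid values, scoring stub, file name); only the combination count, 4 x 5 x 3 x 4 = 240, is chosen to match the experiment count above.

```python
import itertools
import json

# Illustrative grid, not the project's actual values: 4 * 5 * 3 * 4 = 240.
GRID = {
    "n_sub": [2, 3, 4, 6],
    "bandwidth_ratio": [0.2, 0.3, 0.4, 0.5, 0.6],
    "filter_type": ["boxcar", "hamming", "kaiser"],
    "depth_range_m": [(0, 100), (100, 300), (300, 600), (600, 1000)],
}

def separability(params, site):
    """Stand-in for the validated pipeline: process one site with the given
    parameters and return a scalar score. Returns 0.0 so the sketch runs."""
    return 0.0

results = []
for combo in itertools.product(*GRID.values()):
    params = dict(zip(GRID.keys(), combo))
    # Identical processing on both sites; the metric is the difference.
    delta = separability(params, "pyramid") - separability(params, "desert")
    results.append({**params, "delta": delta})

with open("sweep_results.json", "w") as f:
    json.dump(results, f, indent=2)

print(f"{len(results)} experiments, "
      f"{sum(abs(r['delta']) > 0 for r in results)} showed any separation")
```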

The distinction matters for anyone building agent systems. Autonomy works when the problem is bounded and the evaluation is mechanical. It breaks when the problem itself needs to evolve based on what you are finding, which was most of this research project.

---

## The DGX Spark

I didn't really exploit the DGX Spark's capabilities. The 128 GB of unified memory was useful: numpy could hold large complex arrays without swapping.

But the GPU was never touched: no cuFFT, no CuPy, no CUDA. Every script ran on numpy and scipy. The single-pass analysis processes one radar image at a time, and the parameter sweep operates on 800x800-pixel patches. Neither needs a dedicated GPU.
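The back-of-the-envelope arithmetic, as a sanity check (the patch size is from the sweep; the full-scene dimensions are my guess):

```python
import numpy as np

patch = np.zeros((800, 800), dtype=np.complex64)
print(f"sweep patch: {patch.nbytes / 1e6:.1f} MB")     # ~5.1 MB

# A full single-look complex scene is bigger; assuming roughly
# 30,000 x 20,000 complex64 samples:
scene_bytes = 30_000 * 20_000 * 8
print(f"full scene: {scene_bytes / 1e9:.1f} GB")       # ~4.8 GB, fits in RAM
```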

The research can be done on a MacBook with enough storage to hold the almost 100 GB of satellite images.

Where the Spark would actually matter is a proper multi-pass InSAR pipeline: full coregistration across 15+ satellite acquisitions, phase unwrapping, DEM-assisted geocoding. We did not get there in this project because C-band resolution was too coarse to justify the engineering, but if X-band data arrives, the Spark will earn its keep.

---

## The Peer Review Layer

The part that surprised me most was using multiple AI models as independent peer reviewers.

The paper and the codebase went through four rounds with three different models: ChatGPT, Gemini, and Grok. Each round caught something real: technical errors that changed numbers.

Gemini found a unit conversion error in the main table. The look angle was listed as 0.001° when the correct value was 0.056°. The downstream math used the radian value and was correct, but the table was wrong.
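My reconstruction of how the slip probably happened, not something Gemini or the paper confirms: the value stored in radians was printed into the table with a degree symbol.

```python
import math

theta_deg = 0.056                          # the correct look angle, in degrees
print(round(math.radians(theta_deg), 3))   # 0.001: the radian value that ended
                                           # up in the table labeled as degrees
```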

ChatGPT found a formula error that changed the headline resolution number from 11.3 km to 285 m. Same conclusion (285 m still cannot resolve a 43 m chamber), but the paper's strongest numerical claim was off by a factor of 40. ChatGPT also found a real bug: a variable in the parameter sweep code was swept and logged but never actually entered the computation.
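The sweep bug is a classic pattern. A reconstructed minimal version, with names and values that are mine rather than the project's:

```python
def process(image, n_sub, bandwidth_ratio=0.5):
    """Stand-in for the real pipeline stage."""
    return n_sub * bandwidth_ratio

image = None  # placeholder input

# The bug: bw_ratio is swept and logged, but never passed to process(),
# so every iteration silently computes the same thing.
for bw_ratio in [0.2, 0.3, 0.4, 0.5, 0.6]:
    print(f"bandwidth_ratio={bw_ratio}")
    result = process(image, n_sub=4)  # fix: add bandwidth_ratio=bw_ratio
```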

Grok validated the corrected version and suggested two framing improvements that made the argument harder to attack.

Each model was good at different things. Gemini caught notation and unit details. ChatGPT was relentless on methodology and code. Grok was best at spotting weaknesses in how the overall argument was framed. Running them independently, each unaware of the others, produced better coverage than any single model doing multiple passes.

That is probably the most reusable takeaway from this project. Multi-model peer review on technical work is cheap, fast, and catches things the author cannot see. I would do it on everything now.

---

## Takeaways

After three months of intense use, I retired OpenClaw in favor of Hermes. If I have to spend time engineering workarounds for memory limitations, the tool is wrong for the job.

I'm currently testing Hermes on a similar type of research.

The process I trust now: plan first, front-load as much context as possible, use the agent for bounded, well-scoped tasks, and run multi-model peer review before anything goes public.

The autonomous loop is not the wrong idea, but it is the wrong default, at least today. The right default for open-ended research is a long conversation with accumulated context, with autonomous loops delegated to the parts where the question is already settled and you just need the answer computed.

That being said, my goal is to get to a point where an agent can do the full research on its own. That would be a game changer — not just for me, but for anyone. It would mean people can unleash their curiosity and creativity, the things that make humans special, and let agents handle the experiments, which they can run faster and more systematically than any human.

---

Research: https://zenodo.org/records/19574701
Code: https://github.com/simossss/sar-tomography-reproduction
