4 Comments
Neural Foundry

Solid breakdown of the multimodal shift. The OCR-to-text pipeline always felt fragile, and treating PDFs as images is one of those obvious-in-hindsight moves that changes everything. The bit about compounding errors at each OCR stage is spot-on, especially for complex layouts like building sketches or nested tables. What really clicked for me was managing multimodal state as just another JSON structure with MIME-type flags - it keeps the architecture clean without reinventing the wheel. Curious how much of the performance gap comes from skipping conversions versus actually preserving spatial context, though.
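For concreteness, here's a minimal sketch of that state-as-JSON idea, assuming an Anthropic-style content-block layout; the file paths and prompt are placeholders, not anything from the article:

```python
import base64
import json

def content_block(path: str, mime_type: str) -> dict:
    """Wrap a raw file as one content block; the MIME-type flag
    tells the model API how to interpret the base64 bytes."""
    with open(path, "rb") as f:
        data = base64.standard_b64encode(f.read()).decode("ascii")
    kind = "document" if mime_type == "application/pdf" else "image"
    return {"type": kind,
            "source": {"type": "base64",
                       "media_type": mime_type,
                       "data": data}}

# Conversation state is just a list of JSON messages; text, images,
# and PDFs sit side by side as typed content blocks.
state = [{
    "role": "user",
    "content": [
        {"type": "text", "text": "Summarize the floor plan and the attached spec."},
        content_block("floor_plan.png", "image/png"),   # placeholder file
        content_block("spec.pdf", "application/pdf"),   # placeholder file
    ],
}]
print(json.dumps(state, indent=2)[:400])
```

No custom multimodal abstraction needed: the MIME type is the only flag that distinguishes a PDF page from a screenshot in the same message list.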

Neural Empowerment

Phew, glad I've been converting my docs to emojis. 🙂📓➡️🎆

Alejandro Aboy

I've been experimenting with Pixeltable + Claude vision to reverse-engineer images I like into prompts, and it works pretty well. Same with documents: I even learned that OpenAI's vision models understand images far better than PDFs. Vision APIs are quite strong, and you can see it with Nano Banana or even by screenshotting errors into Claude Code.
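If anyone wants to try that images-over-PDFs comparison, here's a rough sketch of one way to do it, assuming PyMuPDF for rendering and the OpenAI Python SDK's chat interface; the model name, file path, and prompt are placeholders:

```python
import base64
import fitz  # PyMuPDF: renders PDF pages to raster images
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Render the first page to PNG instead of uploading the PDF itself.
doc = fitz.open("report.pdf")  # placeholder path
png_bytes = doc[0].get_pixmap(dpi=150).tobytes("png")
data_url = "data:image/png;base64," + base64.b64encode(png_bytes).decode()

resp = client.chat.completions.create(
    model="gpt-4o",  # any vision-capable model
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "Describe the layout of this page."},
            {"type": "image_url", "image_url": {"url": data_url}},
        ],
    }],
)
print(resp.choices[0].message.content)
```

Sending the same document both ways (rendered page vs. raw PDF upload) is a quick way to check the quality gap for yourself.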

Meenakshi NavamaniAvadaiappan

Thanks for the good 😊
