Solid breakdown of the multimodal shift. The OCR-to-text pipeline always felt fragile, and treating PDFs as images is one of those obvious-in-hindsight moves that changes everything. The bit about compounding errors at each OCR stage is spot-on, especially for complex layouts like building sketches or nested tables. What really clicked for me was managing multimodal state as just another JSON structure with mime-type flags - it keeps the architecture clean without reinventing the wheel. Curious how much of the performance gap comes from skipping conversions versus actually preserving spatial context, though.
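For anyone wondering what "state as JSON with mime-type flags" could look like in practice, here's a rough sketch. The field names and the base64 helper are my own assumptions, not from the post; the image block format mirrors what Anthropic's Messages API accepts, but the surrounding state shape is just one way to do it:

```python
import base64
from pathlib import Path

def image_block(path: str, media_type: str = "image/png") -> dict:
    """Wrap a local image as a content block tagged with an explicit mime type.
    Illustrative only; matches the base64 image format Claude's Messages API
    accepts, but the surrounding 'state' structure is my own convention."""
    data = base64.standard_b64encode(Path(path).read_bytes()).decode("utf-8")
    return {
        "type": "image",
        "source": {"type": "base64", "media_type": media_type, "data": data},
    }

# Multimodal state is just JSON: text blocks and image blocks side by side,
# each carrying its mime type so downstream steps know what they're holding.
state = {
    "messages": [
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Extract the room labels from this sketch."},
                image_block("building_sketch_page1.png"),
            ],
        }
    ]
}
```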
I've been experimenting with Pixeltable + Claude vision to reverse-engineer images I like into prompts, and it works pretty well. Same with documents; I've found that OpenAI's vision models do much better at understanding images than PDFs. Vision APIs are quite strong in general - you can see it with Nano Banana, or even just screenshotting errors into Claude Code.
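In case it helps anyone trying the same workflow, here's roughly how sending a PDF page as an image (instead of the PDF itself) looks - a sketch assuming PyMuPDF for rasterizing and Anthropic's Python SDK for the call; the filename, DPI, and prompt are just placeholders:

```python
import base64
import fitz  # PyMuPDF: renders PDF pages to raster images
import anthropic

# Rasterize the first page rather than uploading the PDF directly.
doc = fitz.open("floor_plan.pdf")
png_bytes = doc[0].get_pixmap(dpi=150).tobytes("png")
png_b64 = base64.standard_b64encode(png_bytes).decode("utf-8")

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment
response = client.messages.create(
    model="claude-3-5-sonnet-20241022",  # any vision-capable model works here
    max_tokens=1024,
    messages=[
        {
            "role": "user",
            "content": [
                {
                    "type": "image",
                    "source": {
                        "type": "base64",
                        "media_type": "image/png",
                        "data": png_b64,
                    },
                },
                {"type": "text", "text": "Describe this page so I can rebuild it as a prompt."},
            ],
        }
    ],
)
print(response.content[0].text)
```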
Phew, glad I've been converting my docs to emojis.
Thanks for the good read.