12 Comments
Suhrab Khan

Great breakdown! Treating PDFs and images as native inputs is such a practical shift. Makes multimodal agents far more reliable and efficient. Multimodal memory management and native modality handling are the real differentiators for production-grade agentic systems.

Alejandro Aboy

Each page exported as image + embedding: for one-shots Pixeltable is overkill, but if you need storage and some kind of versioning for more realistic workloads, Pixeltable sounds interesting.

If you want to deep dive into the raw implementation, you can just use any vision API and set up a pipeline that saves structured outputs for subsequent embeddings. I'm still trying to figure out whether Pixeltable adds any value on top of this. Still getting my head around it since the demo always shines 😅
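Roughly something like this, just a sketch: the model names and the JSON keys are placeholders, and any vision-capable chat API plus any embedding model would do.

```python
# Rough sketch: vision API -> structured output -> embedding for each page image.
# Model names and the JSON schema are illustrative assumptions, not a specific stack.
import base64
import json
from openai import OpenAI

client = OpenAI()

def describe_page(png_bytes: bytes) -> dict:
    """Ask a vision model for a structured description of one page image."""
    b64 = base64.b64encode(png_bytes).decode()
    resp = client.chat.completions.create(
        model="gpt-4o",  # example model name
        messages=[{
            "role": "user",
            "content": [
                {"type": "text",
                 "text": "Describe this page as JSON with keys: title, summary, tables."},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{b64}"}},
            ],
        }],
        response_format={"type": "json_object"},
    )
    return json.loads(resp.choices[0].message.content)

def embed(text: str) -> list[float]:
    """Embed the structured description for later retrieval."""
    return client.embeddings.create(
        model="text-embedding-3-small", input=text
    ).data[0].embedding
```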

Neural Foundry

Solid breakdown of the multimodal shift. The OCR-to-text pipeline always felt fragile, and treating PDFs as images is one of those obvious-in-hindsight moves that changes everything. The bit about compounding errors at each OCR stage is spot-on, especially for complex layouts like building sketches or nested tables. What really clicked for me was managing multimodal state as just another JSON structure with mime-type flags - keeps the architecture clean without reinventing the wheel. Curious how much of the performance gap comes from skipping conversions versus actually preserving spatial context, though.
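Something like this is all the "multimodal state" really needs to be - an illustrative structure, not any particular framework's schema:

```python
# Agent state kept as plain JSON, with each entry flagged by its mime type
# so the original modality is preserved instead of being flattened to text.
state = {
    "messages": [
        {"role": "user", "mime_type": "text/plain",
         "content": "What does the floor plan show?"},
        {"role": "user", "mime_type": "image/png",
         "content": "<base64-encoded page image>"},
        {"role": "user", "mime_type": "application/pdf",
         "content": "<base64-encoded PDF>"},
        {"role": "assistant", "mime_type": "text/plain",
         "content": "The sketch shows a two-bedroom layout..."},
    ]
}
```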

Paul Iusztin

Well, you don't have to skip the conversion to preserve the spatial context. You just keep your data in its original modality. If it's text from your messages, keep it as text; if it's an image, keep it as an image.

Neural Empowerment

Phew, glad I've been converting my docs to emojis. 🙂📓➡️🎆

Paul Iusztin

haha, good one 🤣

Alejandro Aboy

I've been experimenting with Pixeltable + Claude vision to reverse-engineer images that I like into prompts, and it works pretty well. Same with documents; I even learned that OpenAI vision models work way better at understanding images than PDFs. Vision APIs are quite strong, and you can notice it with Nano Banana or even by screenshotting errors into Claude Code.
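The Claude side is basically just this - the model id is a guess, swap in whatever you use:

```python
# Sketch: ask Claude's vision input to reverse-engineer a reference image into a prompt.
import base64
import anthropic

client = anthropic.Anthropic()

with open("reference.png", "rb") as f:
    img_b64 = base64.b64encode(f.read()).decode()

msg = client.messages.create(
    model="claude-sonnet-4-20250514",  # assumed model id
    max_tokens=1024,
    messages=[{
        "role": "user",
        "content": [
            {"type": "image",
             "source": {"type": "base64", "media_type": "image/png", "data": img_b64}},
            {"type": "text",
             "text": "Write a detailed text-to-image prompt that would reproduce this image."},
        ],
    }],
)
print(msg.content[0].text)
```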

Paul Iusztin

Yes, agree! That's my point: working with multimodal data nowadays is pretty easy once you get used to manipulating images, audio, PDFs, etc. The hassle of mapping everything to text is not worth it anymore.

P.S. With Gemini it's the other way around, it works better with PDFs than images 😂

But I have to try Pixeltable! Do you think it's worth it?
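e.g. with Gemini you can hand it the PDF as-is through the File API, no OCR step - model name here is just an example:

```python
# Sketch: pass a PDF natively to Gemini instead of converting it to text first.
import google.generativeai as genai

genai.configure(api_key="...")
doc = genai.upload_file("slides.pdf")             # upload the PDF in its original modality
model = genai.GenerativeModel("gemini-1.5-pro")   # example model name
resp = model.generate_content([doc, "Summarize the key points of this deck."])
print(resp.text)
```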

Alejandro Aboy

Pixeltable is quite interesting in how it embeds everything in a database (I think it defaults to Postgres), and you can actually see what's happening. I haven't tested anything with it besides playing around. But I've been researching it because I have to convert lots of Google Slides into RAGable data, and I thought of going in the multimodal direction instead of converting to text; that's why I tried Pixeltable, basically.

Paul Iusztin

Got it. Have you tried converting the slides into PDFs -> images -> and embedding each page? Or is Pixeltable doing something similar?
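I mean something like this - a rough sketch assuming the deck is already exported to PDF, with pdf2image and a CLIP model as stand-ins for whatever converter and embedder you prefer:

```python
# Sketch of the slides -> PDF -> page images -> embeddings route.
from pdf2image import convert_from_path              # requires poppler installed
from sentence_transformers import SentenceTransformer

pages = convert_from_path("deck.pdf", dpi=150)       # one PIL image per page
model = SentenceTransformer("clip-ViT-B-32")         # multimodal (image) embedding model
page_vectors = model.encode(pages)                   # embed each page image directly

# page_vectors can now go into any vector store, keyed by (deck, page_number).
```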

Meenakshi NavamaniAvadaiappan

Thanks for the good 😊

Paul Iusztin

Thanks 🙏