Solid breakdown of the multimodal shift. The OCR-to-text pipeline always felt fragile, and treating PDFs as images is one of those obvious-in-hindsight moves that changes everything. The bit about compounding errors at each OCR stage is spot-on, especially for complex layouts like building sketches or nested tables. What really clicked for me was managing multimodal state as just another JSON structure with mime-type flags - it keeps the architecture clean without reinventing the wheel. Curious how much of the performance gap comes from skipping conversions versus actually preserving spatial context, though.
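For anyone wondering what "state as JSON with mime-type flags" could look like in practice, here's a rough sketch. The field names and the base64 helper are my own assumptions, not from the post; the image block format mirrors what Anthropic's Messages API accepts, but the surrounding state shape is just one way to do it:

```python
import base64
from pathlib import Path

def image_block(path: str, media_type: str = "image/png") -> dict:
    """Wrap a local image as a content block tagged with an explicit mime type.
    Illustrative only; matches the base64 image format Claude's Messages API
    accepts, but the surrounding 'state' structure is my own convention."""
    data = base64.standard_b64encode(Path(path).read_bytes()).decode("utf-8")
    return {
        "type": "image",
        "source": {"type": "base64", "media_type": media_type, "data": data},
    }

# Multimodal state is just JSON: text blocks and image blocks side by side,
# each carrying its mime type so downstream steps know what they're holding.
state = {
    "messages": [
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Extract the room labels from this sketch."},
                image_block("building_sketch_page1.png"),
            ],
        }
    ]
}
```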
I've been experimenting with Pixeltable + Claude vision to reverse-engineer images I like into prompts, and it works pretty well. Same with documents; I've found that OpenAI's vision models do much better at understanding images than PDFs. Vision APIs are quite strong in general - you can see it with Nano Banana, or even just screenshotting errors into Claude Code.
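In case it helps anyone trying the same workflow, here's roughly how sending a PDF page as an image (instead of the PDF itself) looks - a sketch assuming PyMuPDF for rasterizing and Anthropic's Python SDK for the call; the filename, DPI, and prompt are just placeholders:

```python
import base64
import fitz  # PyMuPDF: renders PDF pages to raster images
import anthropic

# Rasterize the first page rather than uploading the PDF directly.
doc = fitz.open("floor_plan.pdf")
png_bytes = doc[0].get_pixmap(dpi=150).tobytes("png")
png_b64 = base64.standard_b64encode(png_bytes).decode("utf-8")

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment
response = client.messages.create(
    model="claude-3-5-sonnet-20241022",  # any vision-capable model works here
    max_tokens=1024,
    messages=[
        {
            "role": "user",
            "content": [
                {
                    "type": "image",
                    "source": {
                        "type": "base64",
                        "media_type": "image/png",
                        "data": png_b64,
                    },
                },
                {"type": "text", "text": "Describe this page so I can rebuild it as a prompt."},
            ],
        }
    ],
)
print(response.content[0].text)
```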
Phew, glad I've been converting my docs to emojis.
Thanks for the good read.