AI Catalyst Studio
02/09/2024
http://aicatalyststudio.co.za/
Welcome to AI Catalyst Studio | Innovating Tomorrow with AI
AI Catalyst Studio: Innovating your business with tailored AI solutions driven by engineering expertise. Start your AI journey today.
The Bayesian Trap
Let’s assume that you have tested positive for a rare disease that affects 0.1% of the population.
You have been told by the doctor that the test correctly identifies 99% of people who have the disease and incorrectly flags 1% of people who don’t have it (a 1% false positive rate).
Intuitively, you would assume that you have a 99% chance of actually having the disease, because that is the accuracy of the test.
But if we apply Bayes' theorem:
- P(H|E) is the probability of the hypothesis H (that you have the disease) given the evidence E (that you tested positive). This is the number we want.
- The prior probability P(H) that the hypothesis is true (how likely you were to have the disease before testing positive) is extremely difficult to determine, but a good starting point is the frequency of the disease in the population (0.1%).
- P(E|H) is the probability of the event given that the hypothesis is true, i.e. the probability that you would test positive if you had the disease (99%).
The total probability of the event occurring P(E), i.e. the probability that you test positive whether or not you have the disease
= The probability of having the disease P(H) and correctly testing positive P(E|H)
+
The probability of not having the disease P(-H) and incorrectly testing positive P(E|-H)
P(H|E) = [P(E|H) x P(H)] / [P(H) x P(E|H) + P(-H) x P(E|-H)]
= [0.99 x 0.001] / [(0.001 x 0.99) + (0.999 x 0.01)]
= 0.00099 / 0.01098
= 0.09 (approximately)
This means that you really have only about a 9% chance of actually having the disease after testing positive.
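The same arithmetic can be checked with a few lines of Python. This is only a minimal sketch: the function name posterior and the parameter names sensitivity and false_positive_rate are illustrative labels chosen here for P(E|H) and P(E|-H), not terms from the post.

```python
def posterior(prior, sensitivity, false_positive_rate):
    """Return P(H|E) from Bayes' theorem.

    prior               -> P(H), how common the disease is
    sensitivity         -> P(E|H), positive test given disease
    false_positive_rate -> P(E|-H), positive test given no disease
    """
    # P(E): total probability of testing positive, sick or not
    p_positive = prior * sensitivity + (1 - prior) * false_positive_rate
    return prior * sensitivity / p_positive


p = posterior(prior=0.001, sensitivity=0.99, false_positive_rate=0.01)
print(f"P(disease | positive) = {p:.4f}")  # prints 0.0902
```

Changing the prior shows where the surprise comes from: the rarer the disease, the more the false positives among healthy people outnumber the true positives.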
Think about that 🤔
The documents we use every day contain far more information than text alone.
For example, GPT-4, Gemini, and Claude are multimodal LLMs that can ingest images as well as text. In a typical design, the images are passed through a Vision Transformer, which produces visual tokens. The visual tokens are then passed through a projection layer that specializes in aligning visual tokens with text tokens. The visual and text tokens are then provided to the LLM, which cannot tell the difference between the data modes.
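The projection step can be sketched in plain Python. This is a toy illustration under loud assumptions: the vectors are random stand-ins for real encoder outputs, the dimensions are made up, and the projection is a single linear layer, whereas real models may use something more elaborate.

```python
import random


def project(visual_tokens, weights, bias):
    """Linear projection: map each visual token into the text embedding space."""
    projected = []
    for token in visual_tokens:
        # One output value per row of weights (one row per text dimension)
        row = [
            sum(t * w for t, w in zip(token, out_weights)) + b
            for out_weights, b in zip(weights, bias)
        ]
        projected.append(row)
    return projected


random.seed(0)
VISION_DIM, TEXT_DIM = 4, 6  # toy sizes, assumed for illustration

# Stand-in for Vision Transformer output: 3 visual tokens
visual_tokens = [[random.gauss(0, 1) for _ in range(VISION_DIM)] for _ in range(3)]
# Stand-in for text embeddings: 2 text tokens
text_tokens = [[random.gauss(0, 1) for _ in range(TEXT_DIM)] for _ in range(2)]

# Projection parameters (learned in a real model, random here)
weights = [[random.gauss(0, 0.1) for _ in range(VISION_DIM)] for _ in range(TEXT_DIM)]
bias = [0.0] * TEXT_DIM

# After projection, both kinds of token have the same size,
# so the LLM receives one uniform sequence.
llm_input = project(visual_tokens, weights, bias) + text_tokens
print(len(llm_input), len(llm_input[0]))  # prints 5 6
```

Once every token has the same dimension, nothing in the sequence marks which entries came from an image, which is why the LLM treats both modes alike.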
In the context of RAG, an LLM plays a role at indexing time, where it generates a vector representation of the data so that it can be indexed in a vector database. It is also used at retrieval time, where it uses the retrieved documents to answer a user's question. A multimodal LLM can generate embedding representations of images and text and answer questions using those same data types. If we want to answer questions that involve information in different data modes, using a multimodal LLM at both indexing and retrieval time is the best option.
If you want to build your RAG pipeline using API providers like OpenAI, you can use GPT-4 for question answering with multimodal prompts. However, a multimodal model that is available for text generation might not be available for embedding generation, which leaves the problem of creating embeddings for images. This can be achieved by prompting a multimodal LLM to describe in text the images we need to index. We can then index the images using the text descriptions and their vector representations.
The complexity of generating a text description of an image is not the same as answering questions using a large context of different data types. With a small multimodal LLM, we might get satisfactory results in describing images but subpar results in answering questions. For example, it is pretty simple to build an image description pipeline with LLaVA models running on llama.cpp. Those descriptions can be used for indexing as well as for answering questions that may involve those images. The LLM answering questions would use the text descriptions of the images instead of the images themselves. Today, that might be the simplest option for building a multimodal RAG pipeline. It might not be as performant, but the technology is improving very fast!
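The description-based approach can be sketched as follows. Everything here is a stand-in: describe_image is a hypothetical placeholder for a call to a multimodal LLM and returns canned text, the file names are invented, and embed is a toy bag-of-words counter rather than a real embedding model.

```python
import math
from collections import Counter


def describe_image(path):
    """Hypothetical placeholder for a multimodal LLM describing an image."""
    canned = {
        "chart.png": "a bar chart showing quarterly revenue growth",
        "cat.jpg": "a photo of a cat sleeping on a sofa",
    }
    return canned[path]


def embed(text):
    """Toy embedding: word counts (a real pipeline would use an embedding model)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[w] * b[w] for w in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


# Indexing time: store each image under the vector of its text description
index = []
for path in ["chart.png", "cat.jpg"]:
    description = describe_image(path)
    index.append({"path": path, "description": description, "vector": embed(description)})


def retrieve(question, k=1):
    """Retrieval time: rank indexed images by similarity to the question."""
    q = embed(question)
    ranked = sorted(index, key=lambda item: cosine(q, item["vector"]), reverse=True)
    return ranked[:k]


best = retrieve("what does the revenue chart show")[0]
print(best["path"])  # prints chart.png
```

At answer time, the retrieved description (not the image itself) would be placed in the prompt of the question-answering LLM, as described above.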
Address
136 Plover Avenue, KGE
Fourways