Philip May
Data scientist and open source enthusiast with NLP focus @ Deutsche Telekom
We are a small group of AI experts in the Deutsche Telekom AICC (AI Competence Center). Our task is to train use-case-specific LLMs for the Telekom business domain. For example, we work with Mixtral and Llama models. We have started a new blog on our internal social media platform, where we share our insights, experiences and news. Our first article is about how we found ourselves with a semi-RAG system.

A semi-RAG system is not something you intentionally build or invent. It is a state a RAG system reaches when the LLM has to combine parametric knowledge (acquired through training) with prompt knowledge (from the knowledge DB).

One reason you end up here is that user questions are so wide-ranging that it is extremely difficult to cover them all in advance with finished texts in the knowledge database. The other reason is that your knowledge base may simply be too limited.

If you train your own use-case-specific open LLMs for such systems, then the training data must be designed differently. This was a very important realization on our way to successful on-premises LLMs.

If you are a Deutsche Telekom or T-Systems International employee, you can read all the details on our blog: https://lnkd.in/eiB4sPmy

And please do not hesitate to press the subscribe button on Yam-United if you want to receive updates.

#AICC #iHub #PIX #VTI #GenAI #LLM #RAG
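The "combine parametric knowledge with prompt knowledge" idea can be made concrete with a prompt template. This is a minimal hypothetical sketch, not the actual prompt from the blog post; the template wording and the `build_semi_rag_prompt` helper are illustrative assumptions.

```python
# Hypothetical "semi-RAG" prompt: the model is explicitly allowed to
# fall back on its parametric knowledge when the retrieved context
# does not fully cover the question. All names are illustrative.

SEMI_RAG_TEMPLATE = """Answer the question below.
Use the provided context where it is relevant.
If the context does not fully answer the question, you may carefully
combine it with your own general knowledge, and state which parts of
your answer come from the context and which do not.

Context:
{context}

Question:
{question}
"""

def build_semi_rag_prompt(question: str, retrieved_chunks: list[str]) -> str:
    """Assemble the prompt from the user question and retrieved passages."""
    context = "\n\n".join(retrieved_chunks)
    return SEMI_RAG_TEMPLATE.format(context=context, question=question)
```

The key difference to a strict RAG prompt ("answer only from the context") is the explicit permission to blend in trained knowledge, which in turn changes what the training data for such a model has to look like.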
Dr. Hamed Ketabdar
GenAI Lead at Deutsche Telekom, Lecturer at TU Berlin
1d
Thanks Philip! Is "semi-RAG" the same as, or related to, what is called "Semi-Structured RAG" in the community?
Aravind Ganapathiraju
VP of Applied AI
22h
Hi Philip. I am not able to access the blog. Could you check the link you shared? Thanks.
Orkhan Amrullayev
Data Scientist | ML/LLM Engineer
1d
Is the website restricted outside of Germany? Because it says "Page is unavailable".
Dr. Jan Philipp Harries
Taming LLMs @ ellamind
2h
Philip May this is really cool, would love to read the blog. Great stuff that you and the team are doing. BTW: Will you be able to join the 2nd #AIDEV2 on 9/24? I think this would fit very well.
Shubham Kharola
Business Analyst at Deutsche Telekom Digital Labs
9h
Anand Saurabh
Vinzent Wuttke
Helping mid-sized global market leaders to bring ML into production | Leiter Business Development @ Datasolut
1d
Thanks for sharing with the community. That is amazing!
Philip May
Data scientist and open source enthusiast with NLP focus @ Deutsche Telekom
I still remember very clearly how I trained a semantic bilingual German-English embedding model almost four years ago, back then for T-Systems on site services GmbH. Nowadays, with the hype around #RAG, it is probably one of the most popular German-language open-source models, measured by its more than one million downloads per month.

This success fills me with joy and pride, and also with some doubts. But why the doubts?

I think it is important to understand that there is a big difference between semantic embeddings and Q/A retrieval models. Semantic embeddings can be used to cluster texts, or, for example, to search large texts on the basis of a few keywords. However, they are less suitable for finding answers to questions. The reason is that the semantic similarity between a question and a text containing the answer is not necessarily high. For this reason, Q/A retrieval models, which are trained to embed questions and potential answers close to each other, are primarily suitable for retrieval in RAG systems.

I'm afraid many are using my semantic embedding as a replacement for a Q/A retrieval model in a RAG application. This should not be done.

At Telekom, we use self-trained Q/A retrieval models for our RAG retrieval. We also have our own data sets for this. For other EU languages, by the way, we have had very good experiences with the intfloat/multilingual-e5-large model. Incidentally, this also works very well for German.

- my semantic bilingual German-English embedding, "T-Systems-onsite/cross-en-de-roberta-sentence-transformer": https://lnkd.in/eSx6kc6m
- the "intfloat/multilingual-e5-large" model: https://lnkd.in/e7aetC3t

Deutsche Telekom #AICC #iHub #PIX #weloveai #VTI #GenAI #LLM
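The asymmetric usage that distinguishes a Q/A retrieval model from a plain semantic embedding can be sketched as follows. The `query:`/`passage:` prefixes are what the e5 model cards prescribe; the helper function names and the sentence-transformers usage are my own illustrative assumptions, not code from the post.

```python
# Sketch of asymmetric Q/A retrieval with an e5-style model: queries
# and passages get different prefixes before encoding, so questions
# and the texts that answer them land close together in vector space.

def prefix_query(text: str) -> str:
    # e5 models expect retrieval queries to start with "query: "
    return f"query: {text}"

def prefix_passage(text: str) -> str:
    # ...and candidate answer texts to start with "passage: "
    return f"passage: {text}"

def embed_for_retrieval(questions, passages,
                        model_name="intfloat/multilingual-e5-large"):
    # Requires: pip install sentence-transformers
    from sentence_transformers import SentenceTransformer
    model = SentenceTransformer(model_name)
    q_emb = model.encode([prefix_query(q) for q in questions],
                         normalize_embeddings=True)
    p_emb = model.encode([prefix_passage(p) for p in passages],
                         normalize_embeddings=True)
    return q_emb, p_emb
```

A symmetric semantic model would embed both sides identically; the prefixes are exactly what makes a question and its (lexically dissimilar) answer comparable.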
Philip May
Data scientist and open source enthusiast with NLP focus @ Deutsche Telekom
I made a systematic comparison of the pandas file formats, compression methods and compression levels, based on the compression ratio and the save/load times. The article can be found here: https://lnkd.in/e6w7pWSX

TL;DR:

If you consider the compression method together with the compression level, zstd is the best option. This is especially true for compression levels 10 to 12.

In terms of file format, Feather seems to be the best choice. Feather has a better compression ratio than Parquet. Up to a compression level of 12, the save times of Parquet and Feather are practically the same, while the load times of Feather are clearly and significantly better than those of Parquet.

For these reasons, Feather in combination with zstd and a compression level of 10 to 12 seems to be the best choice. This can be done with:

df.to_feather("filename.feather", compression="zstd", compression_level=10)
Philip May
Data scientist and open source enthusiast with NLP focus @ Deutsche Telekom
For some time now I have liked #DVC, but also #Jupyter #Notebooks. Because the DVC examples only ever show Python scripts, I always wondered how you can still use notebooks in the pipelines. Now here is the solution. Thanks for sharing, Alaeddine Abdessalem!

Deutsche Telekom, #AICC
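One common pattern for running notebooks as pipeline stages (an assumption on my part; the original shared post is not included here) is to execute them with papermill from a `dvc.yaml` stage. File names are illustrative:

```yaml
# Hypothetical dvc.yaml stage that runs a notebook via papermill
# instead of a plain Python script.
stages:
  train:
    cmd: papermill train.ipynb train_out.ipynb
    deps:
      - train.ipynb
      - data/train.csv
    outs:
      - model.pkl
```

DVC only cares about the command, its dependencies and its outputs, so any notebook runner that works from the command line fits into a stage this way.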