Philip May
Data scientist and open source enthusiast with NLP focus @ Deutsche Telekom
We are a small group of AI experts in the Deutsche Telekom AICC (AI Competence Center). Our task is to train use-case-specific LLMs for the Telekom business domain. For example, we work with Mixtral and Llama models. We have started a new blog on our internal social media platform, where we share our insights, experiences and news. Our first article is about how we found ourselves with a semi-RAG system.

A semi-RAG system is not something you intentionally build or invent. It is a state a RAG system reaches when the LLM has to combine parametric knowledge (acquired through training) with prompt knowledge (from the knowledge DB).

One reason you end up here is that user questions are so wide-ranging that it is extremely difficult to cover them all in advance with finished texts in the knowledge database. The other reason is that your knowledge base may simply be too limited.

If you train your own use-case-specific open LLMs for such systems, then the training data must be designed differently. This was a very important realization on our way to successful on-premises LLMs.

If you are a Deutsche Telekom or T-Systems International employee, you can read all the details on our blog: https://lnkd.in/eiB4sPmy

And please do not hesitate to press the subscribe button on Yam-United if you want to receive updates.

#AICC #iHub #PIX #VTI #GenAI #LLM #RAG
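The "combine parametric knowledge with prompt knowledge" idea can be made concrete with a prompt template. This is a minimal hypothetical sketch, not the actual prompt from the blog post; the template wording and the `build_semi_rag_prompt` helper are illustrative assumptions.

```python
# Hypothetical "semi-RAG" prompt: the model is explicitly allowed to
# fall back on its parametric knowledge when the retrieved context
# does not fully cover the question. All names are illustrative.

SEMI_RAG_TEMPLATE = """Answer the question below.
Use the provided context where it is relevant.
If the context does not fully answer the question, you may carefully
combine it with your own general knowledge, and state which parts of
your answer come from the context and which do not.

Context:
{context}

Question:
{question}
"""

def build_semi_rag_prompt(question: str, retrieved_chunks: list[str]) -> str:
    """Assemble the prompt from the user question and retrieved passages."""
    context = "\n\n".join(retrieved_chunks)
    return SEMI_RAG_TEMPLATE.format(context=context, question=question)
```

The key difference to a strict RAG prompt ("answer only from the context") is the explicit permission to blend in trained knowledge, which in turn changes what the training data for such a model has to look like.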
Dr. Hamed Ketabdar
GenAI Lead at Deutsche Telekom, Lecturer at TU Berlin
1d
Thanks Philip! Is "semi-RAG" the same as, or related to, what is called "Semi-Structured RAG" in the community?
Aravind Ganapathiraju
VP of Applied AI
22h
Hi Philip. I am not able to access the blog. Could you check the link you shared? Thanks.
Orkhan Amrullayev
Data Scientist | ML/LLM Engineer
1d
Is the website restricted outside of Germany? Because it says "Page is unavailable".
Dr. Jan Philipp Harries
Taming LLMs @ ellamind
2h
Philip May this is really cool, would love to read the blog. Great stuff that you and the team are doing. BTW: Will you be able to join the 2nd #AIDEV2 on 9/24? I think this would fit very well.
Shubham Kharola
Business Analyst at Deutsche Telekom Digital Labs
9h
Anand Saurabh
Vinzent Wuttke
Helping mid-sized global market leaders to bring ML into production | Leiter Business Development @ Datasolut
1d
Thanks for sharing with the community. That is amazing!
Philip May
Data scientist and open source enthusiast with NLP focus @ Deutsche Telekom
I still remember very clearly how I trained a semantic bilingual German-English embedding model almost four years ago, back then for T-Systems on site services GmbH. Nowadays, with the hype around #RAG, it is probably one of the most popular German-language open-source models, measured by its more than one million downloads per month.

This success fills me with joy and pride, and also with some doubts. But why the doubts?

I think it is important to understand that there is a big difference between semantic embeddings and Q/A retrieval models. Semantic embeddings can be used to cluster texts, or, for example, to search large texts on the basis of a few keywords. However, they are less suitable for finding answers to questions. The reason is that the semantic similarity between a question and a text containing the answer is not necessarily high. For this reason, Q/A retrieval models, which are trained to embed questions and potential answers close to each other, are primarily suitable for retrieval in RAG systems.

I'm afraid many are using my semantic embedding as a replacement for a Q/A retrieval model in a RAG application. This should not be done.

At Telekom, we use self-trained Q/A retrieval models for our RAG retrieval. We also have our own data sets for this. For other EU languages, by the way, we have had very good experiences with the intfloat/multilingual-e5-large model. Incidentally, this also works very well for German.

- my semantic bilingual German-English embedding, "T-Systems-onsite/cross-en-de-roberta-sentence-transformer": https://lnkd.in/eSx6kc6m
- the "intfloat/multilingual-e5-large" model: https://lnkd.in/e7aetC3t

Deutsche Telekom #AICC #iHub #PIX #weloveai #VTI #GenAI #LLM
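The asymmetric usage that distinguishes a Q/A retrieval model from a plain semantic embedding can be sketched as follows. The `query:`/`passage:` prefixes are what the e5 model cards prescribe; the helper function names and the sentence-transformers usage are my own illustrative assumptions, not code from the post.

```python
# Sketch of asymmetric Q/A retrieval with an e5-style model: queries
# and passages get different prefixes before encoding, so questions
# and the texts that answer them land close together in vector space.

def prefix_query(text: str) -> str:
    # e5 models expect retrieval queries to start with "query: "
    return f"query: {text}"

def prefix_passage(text: str) -> str:
    # ...and candidate answer texts to start with "passage: "
    return f"passage: {text}"

def embed_for_retrieval(questions, passages,
                        model_name="intfloat/multilingual-e5-large"):
    # Requires: pip install sentence-transformers
    from sentence_transformers import SentenceTransformer
    model = SentenceTransformer(model_name)
    q_emb = model.encode([prefix_query(q) for q in questions],
                         normalize_embeddings=True)
    p_emb = model.encode([prefix_passage(p) for p in passages],
                         normalize_embeddings=True)
    return q_emb, p_emb
```

A symmetric semantic model would embed both sides identically; the prefixes are exactly what makes a question and its (lexically dissimilar) answer comparable.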
Philip May
Data scientist and open source enthusiast with NLP focus @ Deutsche Telekom
I made a systematic comparison of the pandas file formats, compression methods and compression levels, based on the compression ratio and the save/load times. The article can be found here: https://lnkd.in/e6w7pWSX

TL;DR:

If you consider the compression method together with the compression level, zstd is the best option. This is especially true for compression levels 10 to 12.

In terms of file format, Feather seems to be the best choice. Feather has a better compression ratio than Parquet. Up to a compression level of 12, the save times of Parquet and Feather are practically the same, while the load times of Feather are clearly and significantly better than those of Parquet.

For these reasons, Feather in combination with zstd and a compression level of 10 to 12 seems to be the best choice. This can be done with:

df.to_feather("filename.feather", compression="zstd", compression_level=10)
Philip May
Data scientist and open source enthusiast with NLP focus @ Deutsche Telekom
For some time now I have liked #DVC, but also #Jupyter #Notebooks. Because the DVC examples only ever show Python scripts, I always wondered how you can still use notebooks in the pipelines. Now here is the solution. Thanks for sharing, Alaeddine Abdessalem!

Deutsche Telekom, #AICC
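One common pattern for running notebooks as pipeline stages (an assumption on my part; the original shared post is not included here) is to execute them with papermill from a `dvc.yaml` stage. File names are illustrative:

```yaml
# Hypothetical dvc.yaml stage that runs a notebook via papermill
# instead of a plain Python script.
stages:
  train:
    cmd: papermill train.ipynb train_out.ipynb
    deps:
      - train.ipynb
      - data/train.csv
    outs:
      - model.pkl
```

DVC only cares about the command, its dependencies and its outputs, so any notebook runner that works from the command line fits into a stage this way.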