Philip May on LinkedIn: #aicc #ihub #pix #vti #genai #llm #rag | 13 comments (2025)

Philip May

Data scientist and open source enthusiast with NLP focus @ Deutsche Telekom

We are a small group of AI experts in the Deutsche Telekom AICC (AI Competence Center). Our task is to train use-case-specific LLMs for the Telekom business domain. For example, we work with Mixtral and Llama models. We have started a new blog on our internal social media platform, where we share our insights, experiences and news. Our first article is about how we found ourselves with a semi-RAG system.

A semi-RAG system is not something you intentionally build or invent. It is a certain state of a RAG system in which the LLM has to combine parametric knowledge (acquired through training) with prompt knowledge (from the knowledge DB).

One reason you need to do this is that user questions are so wide-ranging that it is extremely difficult to cover all of them in advance with finished texts in the knowledge database. The other reason is that your knowledge base may simply be too limited.

If you train your own use-case-specific open LLMs for such systems, the training data must be designed in a different way. This was a very important realization on our way to successful on-premises LLMs.

If you are a Deutsche Telekom or T-Systems International employee, you can read all the details on our blog: https://lnkd.in/eiB4sPmy

And please do not hesitate to press the subscribe button on Yam-United if you want to receive updates. 😉

#AICC #iHub #PIX #VTI #GenAI #LLM #RAG
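To make the idea a bit more concrete, here is a minimal, hypothetical sketch of what a semi-RAG style prompt or training sample could look like. The function name, prompt wording and the sample itself are illustrative assumptions, not the actual AICC training format; the point is only that the instruction explicitly allows the model to supplement incomplete retrieved context with its parametric knowledge.

```python
# Hypothetical illustration of a "semi-RAG" prompt/training sample.
# Names, prompt wording and the example data are assumptions, not the
# AICC's actual training data format.

def build_prompt(question: str, retrieved_chunks: list[str]) -> str:
    """Build a prompt that lets the model combine retrieved context
    (prompt knowledge) with what it learned during training
    (parametric knowledge)."""
    context = "\n\n".join(retrieved_chunks)
    return (
        "Answer the question below.\n"
        "Use the provided context where it is relevant. If the context is "
        "incomplete, supplement it with your own domain knowledge and state "
        "which parts are not covered by the context.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )

# A pure-RAG sample assumes the answer is fully contained in the context.
# A semi-RAG sample deliberately pairs partial context with an answer that
# also needs knowledge the model acquired during training.
sample = {
    "prompt": build_prompt(
        "Which tariff options include EU roaming?",        # invented question
        ["Tariff X includes roaming in EU countries."],     # deliberately partial context
    ),
    "completion": "Tariff X includes EU roaming according to the context; ...",
}
```

The difference to a classic RAG training set is in the data design, not the architecture: some samples must require the model to go beyond the retrieved passages.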

13 Comments

Dr. Hamed Ketabdar

GenAI Lead at Deutsche Telekom, Lecturer at TU Berlin

1d

Thanks Philip! Is semi-RAG the same as, or related to, what is called "semi-structured RAG" in the community?

Orkhan Amrullayev

Data Scientist | ML/LLM Engineer

1d

Is the website restricted outside of Germany? It says “Page is unavailable”.

Dr. Jan Philipp Harries

Taming LLMs @ ellamind

2h

Philip May this is really cool, would love to read the blog. Great stuff that you and the team are doing 👍. BTW: Will you be able to join the 2nd #AIDEV2 on 9/24? I think this would fit very well 😉.

Shubham Kharola

Business Analyst at Deutsche Telekom Digital Labs

9h

Anand Saurabh

Vinzent Wuttke

Helping mid-sized global market leaders to bring ML into production | Leiter Business Development @ Datasolut

1d

Thanks for sharing with the community. That is amazing!

More Relevant Posts

  • Philip May

    Data scientist and open source enthusiast with NLP focus @ Deutsche Telekom

    I still remember very clearly how I trained a semantic bilingual German-English embedding model almost four years ago, back then for T-Systems on site services GmbH. Nowadays, with the hype around #RAG, it is probably one of the most popular German-language open source models, measured by its more than one million downloads per month.

    This success fills me with joy and pride, and also with some doubts. But why the doubts?

    I think it is important to understand that there is a big difference between semantic embeddings and Q/A retrieval models. Semantic embeddings can be used to cluster texts or, for example, to search large text collections on the basis of a few keywords. However, they are less suitable for finding answers to questions. The reason is that the semantic similarity between a question and a text containing the answer is not necessarily high. For this reason, Q/A retrieval models, which are trained to embed questions and potential answers close to each other, are primarily suitable for retrieval in RAG systems.

    I'm afraid many are using my semantic embedding model as a replacement for a Q/A retrieval model in a RAG application. This should not be done.

    At Telekom, we use self-trained Q/A retrieval models for our RAG retrieval, and we have our own data sets for this. For other EU languages, by the way, we have had very good experiences with the intfloat/multilingual-e5-large model. Incidentally, this also works very well for German.

    - my semantic bilingual German-English embedding model "T-Systems-onsite/cross-en-de-roberta-sentence-transformer": https://lnkd.in/eSx6kc6m
    - the "intfloat/multilingual-e5-large" model: https://lnkd.in/e7aetC3t

    Deutsche Telekom #AICC #iHub #PIX #weloveai #VTI #GenAI #LLM
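As a small illustration of how a Q/A-style retrieval model is used, here is a minimal sketch of question-to-passage retrieval with intfloat/multilingual-e5-large. It assumes the sentence-transformers library is installed and uses the "query: "/"passage: " prefixes described on the model card; the example texts are invented.

```python
# Minimal sketch: question/passage retrieval with a Q/A-style retrieval model.
# Assumes the sentence-transformers library; the example texts are made up.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("intfloat/multilingual-e5-large")

# The e5 models expect "query: " and "passage: " prefixes so that questions
# and potential answers are embedded into the same space.
question = "query: Wie kann ich meinen Vertrag verlängern?"
passages = [
    "passage: Die Vertragsverlängerung ist online im Kundencenter möglich.",
    "passage: Unsere Router unterstützen WLAN nach dem aktuellen Standard.",
]

q_emb = model.encode(question, normalize_embeddings=True)
p_emb = model.encode(passages, normalize_embeddings=True)

# Cosine similarity should rank the passage that answers the question highest,
# even though question and answer share few surface words.
print(util.cos_sim(q_emb, p_emb))
```

A purely semantic sentence embedding would instead place texts that talk about the same topic close together, which is not the same as placing a question close to its answer.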

  • Philip May

    Data scientist and open source enthusiast with NLP focus @ Deutsche Telekom

    I made a systematic comparison of the pandas file formats, compression methods and compression levels. The comparison is based on the compression ratio and the save/load times. The article can be found here: https://lnkd.in/e6w7pWSX

    TL;DR: If you consider the compression method together with the compression level, zstd is the best option. This is especially true for compression levels 10 to 12.

    In terms of data format, Feather seems to be the best choice. Feather has a better compression ratio than Parquet. Up to a compression level of 12, the save times of Parquet and Feather are practically the same, while the load times of Feather are clearly and significantly better than those of Parquet.

    For these reasons, Feather in combination with zstd and a compression level of 10 to 12 seems to be the best choice. This can be done with:

    df.to_feather("filename.feather", compression="zstd", compression_level=10)
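As a small, self-contained round-trip illustration of that setting (the DataFrame and file name are made up; the compression and compression_level arguments are passed through to pyarrow's Feather writer):

```python
# Minimal round trip with Feather + zstd; the DataFrame and path are examples.
import pandas as pd

df = pd.DataFrame({"id": range(100_000), "value": ["some text"] * 100_000})

# compression/compression_level are forwarded to pyarrow's Feather writer.
df.to_feather("data.feather", compression="zstd", compression_level=10)

# Reading needs no compression parameters; they are stored in the file.
df_loaded = pd.read_feather("data.feather")
assert df.equals(df_loaded)
```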

    Pandas Data Format and Compression # philipmay.org

  • Philip May

    Data scientist and open source enthusiast with NLP focus @ Deutsche Telekom

    For some time now I have liked #DVC, but also #Jupyter #Notebooks. Because the DVC examples only ever show Python scripts, I always wondered how you can still use notebooks in the pipelines. Now here is the solution. Thanks for sharing, Alaeddine Abdessalem!

    Deutsche Telekom, #AICC
