Stefan Lang

Expert on machine learning, artificial intelligence, systems theory and bioinformatics (M.Sc.)

Looking for an experienced AI expert or bioinformatician? Then you've come to the right place!

I have already supported customers from various industries in the implementation of their projects. This has allowed me to gain experience in many areas of research and development.

Whether software development, training of artificial intelligence, data analysis, mathematical modeling or laboratory work, there is hardly an area of machine learning and bioinformatics in which I have not yet been active.

I have in-depth knowledge of the entire software life cycle, from the development of a prototype to validation of the methods and transition to production.

I would be happy to contribute my ideas and many years of experience in the development of intelligent software to your next project.

Project list

Results: Development of a Llama-3 chat bot that can use various tools (customer databases, web search, chat, ...) to answer user questions. The bot can be hosted entirely on the customer's own servers, which also enables the processing of sensitive data.

Methods:
  • Data pipeline: PyTorch Lightning module that converts text documents into a vector space using a (small) language model (Jina-V2 text embeddings) and stores them in a vector database together with arbitrary (including nested) metadata
  • Chat bot: LangGraph agent with Llama-3 LLM to answer user questions. The agent has various tools at its disposal to collect information about the question from the linked databases and the Internet. It is also able to ask follow-up questions if the user's question is unclear. The agent then uses the collected information to create an answer (Retrieval-Augmented Generation, RAG)
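
The retrieval step of such a RAG setup can be illustrated with a minimal sketch. It is not the project's implementation: the in-memory `store`, the toy 3-dimensional vectors and the function names are assumptions made for illustration, standing in for real text embeddings and a real vector database.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equally long vectors (0.0 if one is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def retrieve(query_vector, store, top_k=2):
    """Return the top_k stored entries most similar to the query vector."""
    scored = [(cosine_similarity(query_vector, entry["vector"]), entry) for entry in store]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [entry for _, entry in scored[:top_k]]


# Hypothetical store: toy vectors plus nested metadata per document.
store = [
    {"vector": [1.0, 0.0, 0.0], "text": "Invoice process",
     "metadata": {"source": {"type": "wiki", "page": 3}}},
    {"vector": [0.0, 1.0, 0.0], "text": "Holiday policy",
     "metadata": {"source": {"type": "pdf", "page": 12}}},
    {"vector": [0.9, 0.1, 0.0], "text": "Payment terms",
     "metadata": {"source": {"type": "wiki", "page": 7}}},
]

hits = retrieve([1.0, 0.05, 0.0], store, top_k=2)
context = "\n".join(hit["text"] for hit in hits)  # text handed to the LLM as context
```

In a full agent, the retrieved `context` would be added to the prompt before the language model generates its answer.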

Results: Module to recognize text in images and compare it with a customer-specific template.

Methods:
  • OCR: Identification of text blocks in images & recognition of the text
  • Matching: Local similarity matching of the recognized text with a customer-specific template, taking into account possible OCR errors
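
One possible way to do error-tolerant local matching is approximate substring search with an edit-distance table (Sellers' algorithm). The sketch below only illustrates this idea under the assumption of unit costs for substitutions, insertions and deletions; the function name is hypothetical and the project's actual matching method is not described in more detail here.

```python
def best_local_match(template, text):
    """Find the substring of `text` closest to `template`.

    Returns (edit_distance, end_index), where end_index is the exclusive
    end position of the first best-matching substring in `text`.
    """
    m = len(template)
    # previous[i] = cost of matching template[:i] against an empty substring
    previous = list(range(m + 1))
    best_distance, best_end = previous[m], 0
    for j, char in enumerate(text):
        current = [0] * (m + 1)  # current[0] = 0: a match may start anywhere
        for i in range(1, m + 1):
            substitution = previous[i - 1] + (template[i - 1] != char)
            skip_text_char = previous[i] + 1
            skip_template_char = current[i - 1] + 1
            current[i] = min(substitution, skip_text_char, skip_template_char)
        if current[m] < best_distance:
            best_distance, best_end = current[m], j + 1
        previous = current
    return best_distance, best_end


# Typical OCR confusions: "I" read as "1", "O" read as "0".
distance, end = best_local_match("INVOICE", "Total 1NV0ICE No. 42")
```

A template field would then count as found if `distance` stays below a tolerance chosen relative to the template length.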

Results: Tool for identifying and coloring individual objects of different classes specified by the customer in images

Methods:
  • Data pipeline: API connection to the customer's annotation tool for the creation of training data
  • Model: neural network for multi-class, multi-instance segmentation of images
  • Deployment: microservices for training & inference

Results: Automated recognition of music tracks in live recordings. The developed method can identify instrumental / vocal versions, variations in vocals or instrumentals (up to changed instruments or vocals in a different language), and excerpts of music pieces in a database of audio recordings

Methods:
  • Music detection: classification of music or extraction of tracks from mixed recordings (e.g., television broadcasts, live concerts, albums, ...)
  • Music decomposition: decomposition of musical pieces into vocal and instrumental channels
  • Matching: creation of an electronic fingerprint of the decomposed audio channels and local similarity analysis of the fingerprints to identify the music pieces in a database

Results: Development of a tool for recognizing and linking custom term classes in running text.

Methods:
  • Models: Named entity recognition (NER) model using transformer embeddings to annotate the terms, relation tagging model to link the terms (library: PyTorch)
  • Annotation pipeline: import / export functions to manually tag examples of the term classes and relations to be learned using a graphical annotation tool (INCEpTION)
  • Trainer: module to adapt the AI models to the manually annotated data, i.e. to learn the customized term classes and relations

Results: Construction of a library of input and output adapters for the generic Perceiver IO architecture. Implemented modalities (data types): text, audio, images, videos, time series

Methods:
  • Input adapters: modality-specific restructuring of input data as a 2-dimensional array and concatenation of modalities as input to Perceiver IO
  • Output adapters: development of queries (query arrays) for reconstruction (autoencoding), classification, and prediction of the input data
  • Model: methods for data preparation, configuration of models (depending on input data and task), training of models and use of models
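
The idea behind the input adapters can be sketched in a few lines. This is an illustration under stated assumptions, not the library's code: the function names are hypothetical, channels are zero-padded to a common width, and a one-hot modality tag is appended to each row, which is only one of several possible ways to mark modalities.

```python
def text_to_2d(token_ids):
    """Text adapter: one row per token, one channel holding the token id."""
    return [[float(token)] for token in token_ids]


def image_to_2d(image):
    """Image adapter: flatten height x width x channels to (height*width) x channels."""
    return [list(pixel) for row in image for pixel in row]


def concat_modalities(arrays):
    """Pad all rows to a common channel width, tag the modality, stack the rows."""
    width = max(len(row) for array in arrays for row in array)
    combined = []
    for index, array in enumerate(arrays):
        # One-hot tag so the model can tell the modalities apart (assumption).
        tag = [1.0 if k == index else 0.0 for k in range(len(arrays))]
        for row in array:
            padded = list(row) + [0.0] * (width - len(row))
            combined.append(padded + tag)
    return combined


text_rows = text_to_2d([3, 7])
image_rows = image_to_2d([[[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]]])  # 1 x 2 pixels, RGB
model_input = concat_modalities([text_rows, image_rows])
```

The resulting `model_input` is a single 2-dimensional array (here 4 rows with 5 channels each) that a generic architecture can consume regardless of the original data types.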

Results: Temporal as well as spatial prediction of epidemiological parameters (new infections, R-value) by linking and interpreting different data sources (infection numbers, socio-demographic data, mobility, ...)

Methods:
  • Building the data infrastructure: merging & processing the different data sources in a graph database (ArangoDB). Pipeline for updating the data
  • Data analysis: frequency analysis & filtering (smoothing). Determination of temporal dependencies (cross-correlation) between time series (within locations and between locations). Determination of the effect of measures taken on the time series
  • Modeling: Neural network for multivariate time series analysis (Temporal Fusion Transformer) taking into account static covariates (place, number of inhabitants, ...). Determination of the partial dependence of the target variable on the static and dynamic covariates
  • Deployment: Docker container with databases, models and API
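
The cross-correlation step from the data analysis can be sketched as follows. The example assumes two equally long, already smoothed series and uses made-up numbers; the function names are hypothetical, and a production pipeline would more likely rely on NumPy or SciPy.

```python
from statistics import mean


def pearson(a, b):
    """Pearson correlation of two equally long sequences (0.0 if one is constant)."""
    mean_a, mean_b = mean(a), mean(b)
    covariance = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        return 0.0
    return covariance / (var_a * var_b) ** 0.5


def lagged_correlation(x, y, lag):
    """Correlation between x[t] and y[t + lag]; positive lag means y follows x."""
    if lag >= 0:
        a, b = x[:len(x) - lag], y[lag:]
    else:
        a, b = x[-lag:], y[:len(y) + lag]
    return pearson(a, b)


def best_lag(x, y, max_lag):
    """Lag in [-max_lag, max_lag] with the highest correlation."""
    return max(range(-max_lag, max_lag + 1), key=lambda lag: lagged_correlation(x, y, lag))


# Made-up example: location B shows the pattern of location A two steps later.
location_a = [0, 1, 4, 2, 7, 3, 8, 1, 5, 2, 6, 0]
location_b = [0, 0] + location_a[:-2]
delay = best_lag(location_a, location_b, max_lag=3)
```

Applied to pairs of locations, the lag with the highest correlation gives a rough estimate of how long changes in one place take to show up in another.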

Results: Automated transcription of German-language audio files. Monitoring of transcription quality and training of new / unrecognized words. Classification / interpretation of the transcribed texts

Methods:
  • Speech-to-text (STT): Kaldi-ASR model trained on a German-language dataset. Determination of word recognition probabilities for quality estimation
  • Trainer: module to teach new words to the STT model. Testing the recognition rate for given keywords. Phonemization of poorly recognized words using a separate grapheme-to-phoneme model (g2p). Scraping sample texts to calculate word transition probabilities. Incorporation of the new words and phonemes into the grammar and phoneme classes of the model and retraining of the model. Synthesizing some example texts with a separate text-to-speech model (CoquiTTS) and transcribing them back into text as validation
  • Natural language processing: document indexing of the transcribed texts, semantic search of keywords and classification of the texts based on given classes

Results: Identification of product groups with similar sales patterns. Analysis of trends and seasonality of sales. Estimation of future material requirements for several months

Methods:
  • Data preparation: connection to sales database. Data set with product sales as time series and metadata about the products (single parts, colors, size, ...)
  • Data analysis: finding correlations in sales behavior, grouping of products. Frequency analysis and seasonal decomposition of time series
  • Demand forecasting: predicting sales of products or product groups, taking into account product characteristics, current trends and seasonal sales patterns (ensemble of Prophet and N-BEATS). Estimation of future material demand

Results: Estimation of the impact of errors/delays in processes at specific locations on the remaining transportation network

Methods:
  • Data preparation: structuring data into locations, movements between locations, and processes at locations. Calculation of static and time-dependent properties of locations (capacities, load factors, ...)
  • Data analysis: analysis of movements and disturbances in the network, estimation of the effect of disturbances on subsequent stations (identification of error-chains)
  • Simulation: simulation of the effect of changed transport routes / times or changed processes / parameters at the locations on the overall network

Results: Pathogens must camouflage themselves in the body to avoid being recognized as foreign and removed. Because this camouflage is never perfect, the immune system must weigh at what "threshold" of self-similarity it attacks camouflaged pathogens: a low threshold means little autoimmunity but poorer defense against camouflaged pathogens, whereas a high threshold means good defense but possible autoimmunity. Identification of target proteins for drug intervention in autoimmunity

Methods:
  • Literature review: (innate) immune system, complement system, social systems theory, mimicry/crypsis, mathematical / game theoretical models of mimicry, mathematical / metabolic models of complement system
  • Modeling: transfer of behavioral models describing mimicry and crypsis in animals to the microbiological level (molecular crypsis). Linking crypsis models to models of the innate immune response (specifically complement system). Modeling of the trade-off between autoimmunity and defense against camouflaged pathogens
  • Publication: Publication of relevant results in scientific journals
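
The threshold trade-off described above can be made concrete with a toy calculation. This is not the published model: the self-similarity scores are invented, and the rule "attack everything whose self-similarity lies below the threshold" is a deliberately simplified assumption.

```python
def attack_rates(threshold, self_scores, pathogen_scores):
    """Return (autoimmunity_rate, missed_pathogen_rate) for a given threshold.

    Everything with a self-similarity score below `threshold` is attacked.
    """
    autoimmunity = sum(score < threshold for score in self_scores) / len(self_scores)
    missed = sum(score >= threshold for score in pathogen_scores) / len(pathogen_scores)
    return autoimmunity, missed


# Invented scores: host structures look very "self", camouflaged pathogens less so.
self_scores = [0.95, 0.90, 0.85, 0.80]
pathogen_scores = [0.30, 0.60, 0.82, 0.88]

low = attack_rates(0.5, self_scores, pathogen_scores)   # cautious immune system
high = attack_rates(0.9, self_scores, pathogen_scores)  # aggressive immune system
```

With these numbers the low threshold causes no autoimmunity but misses three of four pathogens, while the high threshold catches every pathogen at the cost of attacking half of the host's own structures.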

Results: Implementation of a protein microarray and fluorescence filter in a smartphone attachment for on-site detection of specific (e.g., unwanted) proteins in samples (e.g., growth hormones in milk). Efficient analysis directly on the smartphone using computer vision methods

Methods:
  • Data preparation: standardize, interpolate, and rectify (orthogonalize) input images (colored spots of the microarray taken with smartphones, i.e., highly varying qualities)
  • Image recognition: localize spots and mark edges. Identify spots of positive / negative controls. Determine the color intensities of the other spots and calibrate them against the controls to calculate the concentration of the target protein in the sample
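
The calibration against the controls can be sketched as a simple interpolation. The sketch assumes a linear relation between spot intensity and concentration and a positive control of known concentration; both are assumptions made for illustration, as are the function and parameter names.

```python
def estimate_concentration(spot, negative_control, positive_control, positive_concentration):
    """Estimate the target concentration of a spot from its color intensity.

    Intensities are interpolated linearly between the negative control
    (concentration 0) and the positive control (known concentration).
    """
    span = positive_control - negative_control
    if span <= 0:
        raise ValueError("positive control must be brighter than negative control")
    relative = (spot - negative_control) / span
    relative = min(max(relative, 0.0), 1.0)  # clamp to the calibrated range
    return relative * positive_concentration


# Hypothetical intensities on a 0-255 scale, positive control at 10.0 units.
concentration = estimate_concentration(
    spot=130, negative_control=30, positive_control=230, positive_concentration=10.0
)
```

Calibrating each image against its own controls compensates for part of the brightness and color differences between smartphone cameras.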

Results: Characterization of forms of cooperation in biofilms. In particular, modeling of intra-species and inter-species crossfeeding interactions. Investigation of the evolutionary stability of cooperation with respect to parasitism

Methods:
  • Literature review: social systems theory, evolutionary game theory, forms of cooperation and communication in microorganisms, crossfeeding
  • Modeling: agent-based model to simulate crossfeeding interactions between unicellular fungi. Modeling the effect of communication via molecules released into the environment or direct connection of individuals by nanotubes
  • Publication: Publication of relevant results in scientific journals

Results: Design and implementation of algorithms for separation of cell aggregates (segmentation), tracking of single cells and extraction of cell-type-specific parameters. Later: further development to analyze data from confocal laser scanning microscopy (5-dimensional)

Methods:
  • Data preparation: deconvolution of images with microscope-specific kernel (remove specific light scattering patterns), interpolation, standardization
  • Segmentation: separate foreground (focused cells) from background (noise, macromolecules, non-focused cells, ...)
  • Image recognition: recognize single cells and cell clusters. Separate cell clusters. Reconstruct shape of single cells
  • Extract features: Recognize specific features of cell types and characterize given properties (size, movement pattern, speed, ...)

Skills

  • Natural Language Processing (NLP)
    • Large Language Models (LLM)
    • Generative Language Models
    • Text Classification
    • Named Entity Recognition
    • Relation Tagging
  • Computer Vision
    • Image Segmentation
    • Image Classification
    • Object Detection
    • Object Tracking
  • Audio Processing
    • Audio-Information-Retrieval
    • Automatic Speech Recognition (ASR)
    • Text-To-Speech synthesis (TTS)
  • Time series analysis and forecasting
  • Regression / classification
  • Visualization
  • Pattern recognition
  • Anomaly detection
  • Analysis of dependencies
  • Time series analysis
  • Audio / speech analysis
  • Analysis of biological data (-omics, mass spectrometry, ...)
  • Modeling of biological systems (chemical reaction networks, multi-scale ecological models, individual-based modeling, evolutionary game theory)
  • Processing of microscopic images (segmentation, object detection and tracking, classification, anomaly detection)
  • Python
  • Java
  • R
  • C / C++
  • Matlab
  • LaTeX
  • Development of microservices
  • Continuous Integration / Continuous Delivery (CI/CD)
  • Docker
  • Kafka
  • PyTorch / PyTorch Lightning / PyTorch Forecasting
  • TensorFlow / Keras
  • Kaldi-ASR
  • CoquiTTS
  • Deeplearning4j
  • ImageJ
  • OpenCV
  • SciPy
  • NumPy
  • Pandas
  • PyCharm
  • Jupyter Notebook / Lab
  • IntelliJ
  • RStudio
  • Matlab
  • Eclipse
  • Git
  • SVN
  • GitHub
  • GitLab
  • Bitbucket
  • OpenProject
  • Jira
  • Confluence
  • ArangoDB
  • SQL
  • MariaDB
  • MongoDB

Contact