Spotlight Presentation: Foundational model building with single-cell RNA-Seq data
26 Sep 2024
Sierra B
Data Quality
Target Identification
Lead Generation & Optimization
Drug Response Prediction
Strand will present progress on the following subproblems in the use of single-cell RNA-Seq data for drug discovery:
- An AWS data lake capable of ingesting and processing single-cell RNA-Seq data with associated metadata at scale
- Semi-automated, LLM-based ingestion of single-cell RNA-Seq data and metadata for 3 disease datasets (UC, AD and FTD) from GEO into a schema with ≈35 fields. We show an improvement in turnaround time of ≈3x-5x.
- A standardized single-cell pipeline that generates normalized counts from FASTQ files for ingested data
- Embeddings of the single-cell data for LLM pretraining [see, e.g., scBERT]. We show how such embeddings might be used to remove batch effects and hence integrate data.
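As a rough illustration of two of the ideas above (normalized counts and batch-effect removal on embeddings), here is a minimal sketch. It is not the pipeline or method Strand will present: the abstract does not say which normalization or integration technique is used. The sketch assumes library-size normalization followed by log1p, with a target of 10,000 counts per cell (a common convention, assumed here), and it uses per-batch mean-centering as a deliberately simple stand-in for batch correction. The function names `normalize_counts` and `center_by_batch` and all data values are hypothetical.

```python
import math
from collections import defaultdict


def normalize_counts(counts, target_sum=10_000.0):
    """Scale each cell to `target_sum` total counts, then apply log1p.

    counts: list of per-cell lists of raw gene counts (cells x genes).
    target_sum is an assumed convention, not taken from the abstract.
    """
    normalized = []
    for cell in counts:
        total = sum(cell)
        # Cells with no counts are left as all zeros.
        scale = target_sum / total if total > 0 else 0.0
        normalized.append([math.log1p(c * scale) for c in cell])
    return normalized


def center_by_batch(embeddings, batches):
    """Subtract each batch's mean embedding from the cells in that batch.

    A simple stand-in for batch correction: it removes only a constant
    per-batch offset in embedding space.
    """
    sums = {}
    sizes = defaultdict(int)
    for vec, batch in zip(embeddings, batches):
        if batch not in sums:
            sums[batch] = [0.0] * len(vec)
        sums[batch] = [s + v for s, v in zip(sums[batch], vec)]
        sizes[batch] += 1
    means = {b: [s / sizes[b] for s in sums[b]] for b in sums}
    return [
        [v - m for v, m in zip(vec, means[batch])]
        for vec, batch in zip(embeddings, batches)
    ]


# Toy data: 2 cells x 3 genes of raw counts (made-up values).
raw = [[10, 0, 90], [1, 1, 2]]
norm = normalize_counts(raw)

# Toy 2-D embeddings for 4 cells from two batches (made-up values).
emb = [[1.0, 2.0], [3.0, 4.0], [10.0, 10.0], [12.0, 14.0]]
centered = center_by_batch(emb, ["a", "a", "b", "b"])
```

After centering, both batches have a mean embedding of zero, so a constant offset between batches no longer separates them. Real batch effects are rarely a pure offset, which is why learned embeddings are of interest for integration.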