Project: Advancing analytics capabilities for cancer vaccine research by developing a real-time pipeline for TCR repertoire analysis of Nanopore sequencing data
Background Information:
Long-read sequencing technologies, such as Nanopore, are error-prone and require sophisticated data-cleaning methods for accurate sequence recovery. Addressing this challenge is crucial for advances in bioinformatics and cancer research.
Key Achievements:
- Spearheaded analysis of large-scale genomic datasets (>7 million rows), leveraging advanced bioinformatics techniques to uncover critical insights that directly influenced research directions and decision-making
- Designed and implemented a custom automated pipeline by integrating 11+ open-source bioinformatics tools, reducing manual intervention by 40% and accelerating data processing timelines by 30%
- Engineered robust Shell and Python automation scripts to extract and classify cell barcodes, segment TCR α and β chains at the single-cell level, and reconstruct TCR chains via de novo assembly, enhancing reproducibility and scalability
- Revolutionized cell barcode extraction methodologies, achieving a 258% increase in barcode recovery (from 608,700 to 2,181,878), improving data quality and downstream analysis accuracy
- Developed a novel Python-based algorithm that separates TCR α and β chains without relying on a whitelist, achieving 90% accuracy across 100 unique cell barcodes using Ward's linkage clustering and setting a new benchmark for TCR chain analysis
- Tailored the Shasta assembler for TCR-specific assembly of α and β sequences, overcoming its configuration constraints and establishing a new methodological approach
- Co-authored “Refining TCR clonotype identification with long-read sequencing technique”, submitted to the Society for Immunotherapy of Cancer (SITC), as third author and the first intern co-author among full-time researchers, reflecting significant contributions to the research project
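
Illustrative sketch: the whitelist-free α/β separation mentioned above relies on Ward's linkage clustering. The snippet below is a minimal sketch of that general idea, not the project's actual algorithm. It assumes reads are represented as normalized 3-mer frequency vectors and cut into two groups; the feature choice, the function names (kmer_profile, split_into_two_groups), and the toy reads are all hypothetical and were chosen only for illustration.

```python
from itertools import product

import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage


def kmer_profile(seq, k=3):
    """Return a normalized k-mer frequency vector for one read (assumed feature)."""
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    index = {kmer: i for i, kmer in enumerate(kmers)}
    counts = np.zeros(len(kmers))
    for i in range(len(seq) - k + 1):
        j = index.get(seq[i:i + k])
        if j is not None:  # skip k-mers containing N or other symbols
            counts[j] += 1
    total = counts.sum()
    return counts / total if total else counts


def split_into_two_groups(reads, k=3):
    """Cluster reads into two groups using Ward's linkage on k-mer profiles."""
    features = np.array([kmer_profile(r, k) for r in reads])
    tree = linkage(features, method="ward")  # Ward's minimum-variance criterion
    return fcluster(tree, t=2, criterion="maxclust")  # cut the tree into 2 clusters


# Hypothetical toy reads: two made-up sequence families, not real TCR data.
toy_reads = [
    "ACACACACACACACACACAC",
    "ACACACACACTCACACACAC",
    "ACACACAGACACACACACAC",
    "GGTTGGTTGGTTGGTTGGTT",
    "GGTTGGTTGCTTGGTTGGTT",
    "GGTTGGTAGGTTGGTTGGTT",
]
labels = split_into_two_groups(toy_reads)
print(labels)
```

The cluster labels themselves are arbitrary; only the grouping matters. In a real setting, each cluster would still need to be assigned to the α or β chain by some downstream step, for example alignment against reference gene segments, which this sketch does not attempt.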