


What is meant by "building a bioinformatics pipeline"?
There are many ways to build a bioinformatics pipeline, depending on the scientific field, company goals, and the data available. Below is a general outline that can serve as a guide. The tools and steps may vary; in some cases all of the steps fall under a single person's role, and in others they need to be broken into more detailed parts. Each step is followed by an example from a 2024 article by Singh, Bedate, Richthofen, and others. After reading the examples, please learn how bioinformatics pipelines are used in public health.
a. Start by identifying the biological question or the specific disease or genetic disorder you are investigating.
They aimed to find new immune system "brakes" (called inhibitory receptors) that could be used as targets to improve cancer treatments. This was because current therapies don’t work for everyone.
b. Collect biological samples and sequence their DNA or RNA to get raw genetic data.
Instead of collecting new samples, they used existing public genetic data from immune cells and tumors, which already included RNA sequencing data.
c. Clean up the data by checking quality issues and preparing it for analysis.
They didn’t go into detail about this, but since the data came from trusted public sources, it had likely already been checked and cleaned.
d. Match the sequences to a known reference genome or assemble them from scratch if no reference exists.
They used a well-known database called Ensembl to get full genetic sequences and matched them to known reference data.
e. Identify important genetic regions like coding (exons), non-coding (introns), and pseudogenes.
They focused on key parts of proteins inside cells, specifically short pieces called ITIM and ITSM motifs, which are known to help control immune responses.
f. Search online databases to find similar sequences or known patterns.
They compared their sequences to known inhibitory receptors and ran statistical tests to see which ones looked like they might act the same way.
g. Look for new genes or immune system targets (epitopes) that have not been studied before.
They found 390 new possible inhibitory receptors that hadn’t been studied before. These could be new targets for cancer therapy.
h. Analyze how genes or organisms are related by building family trees (phylogenetic trees).
They didn’t build family trees, but they did group the new receptors based on how similar they were in shape and function.
i. Convert DNA or RNA sequences into protein sequences using genetic code rules.
They used tools to turn genetic data into protein sequences so they could study the parts responsible for immune suppression.
j. Use computer tools to predict what those proteins might look like in 3D.
They used a tool called AlphaFold to predict the 3D shape of the proteins and filtered out the ones that didn’t look stable or accurate.
k. Explore what these genes or proteins might do using computer-based (in silico) tools.
They used machine learning and computer models to predict which of the new proteins were likely to function as immune inhibitors.
l. If needed, test these findings in the lab (in vitro) or in living systems (in vivo) to confirm the results.
They have not yet tested the predictions in the lab. Their goal was to provide a computer-based starting point for future experiments.
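Steps e and i can be illustrated together with a small sketch: translate a coding DNA sequence into protein using the standard genetic code, then scan the protein for ITIM-like and ITSM-like motifs. The consensus patterns used here, (S/I/V/L)xYxx(I/V/L) for ITIM and TxYxx(V/I) for ITSM, are the commonly cited forms and are an assumption; the article may define its motifs differently. The function names and the toy sequence are hypothetical and not taken from the article.

```python
import re
from itertools import product

# Standard genetic code, laid out in the conventional T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

# Assumed consensus motifs; the lookahead lets overlapping hits be reported.
MOTIFS = {
    "ITIM": re.compile(r"(?=([SIVL].Y..[IVL]))"),
    "ITSM": re.compile(r"(?=(T.Y..[VI]))"),
}


def translate(dna: str) -> str:
    """Translate a coding sequence from its first base, stopping at a stop codon."""
    dna = dna.upper().replace("U", "T")
    protein = []
    for i in range(0, len(dna) - 2, 3):
        amino_acid = CODON_TABLE.get(dna[i:i + 3], "X")  # X marks an unknown codon
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)


def find_motifs(protein: str):
    """Return (motif name, 1-based start position, matched residues) tuples."""
    hits = []
    for name, pattern in MOTIFS.items():
        for match in pattern.finditer(protein):
            hits.append((name, match.start() + 1, match.group(1)))
    return sorted(hits, key=lambda hit: hit[1])


# Toy coding sequence (hypothetical, made up for this illustration).
toy_dna = "ATGGCTGTGACCTACGCCGAGCTGAAGACCGAATATTCCGAGGTGTAA"
toy_protein = translate(toy_dna)
print(toy_protein)
print(find_motifs(toy_protein))
```

A real analysis would restrict the search to the cytoplasmic tail of each receptor, which requires transmembrane topology information that this sketch does not model.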

Databases
EMBL: A comprehensive database of nucleotide sequences.
GenBank: An annotated collection of publicly available nucleotide sequences from many organisms, maintained by NCBI (https://www.ncbi.nlm.nih.gov/). Use this database to retrieve the mRNA sequence for FLT-1, which is a biomarker for preeclampsia. Preeclampsia is a pregnancy complication characterized by high blood pressure and proteinuria (excess protein in the urine).
UniProt: A curated database of protein sequences and functional information.
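Retrieving a GenBank record such as the FLT-1 mRNA can be scripted through NCBI's E-utilities. The sketch below only builds the `efetch` request URL and parses a FASTA response, so it runs without network access. The accession `NM_002019` is an assumption (believed to be a human FLT1 RefSeq mRNA) and should be confirmed on the NCBI site before use; the function names are hypothetical.

```python
from urllib.parse import urlencode

EFETCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"

# Assumed accession for a human FLT1 mRNA; verify it on NCBI before relying on it.
ACCESSION = "NM_002019"


def build_efetch_url(accession: str, db: str = "nuccore", rettype: str = "fasta") -> str:
    """Build an E-utilities efetch URL that returns one record as plain text."""
    params = {"db": db, "id": accession, "rettype": rettype, "retmode": "text"}
    return f"{EFETCH_URL}?{urlencode(params)}"


def parse_fasta(text: str):
    """Return (header, sequence) for the first record in FASTA-formatted text."""
    header, sequence_lines = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                break  # stop at the start of a second record
            header = line[1:]
        else:
            sequence_lines.append(line)
    return header, "".join(sequence_lines)


print(build_efetch_url(ACCESSION))

# To download the record (requires network access):
# from urllib.request import urlopen
# with urlopen(build_efetch_url(ACCESSION)) as response:
#     header, sequence = parse_fasta(response.read().decode())
```

The same record can also be found interactively by searching the Nucleotide database on the NCBI website and choosing the FASTA view.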

Data Acquisition and Quality Control
FastQC: A quality control tool that provides a visual summary of high-throughput sequencing data to identify potential issues.
MultiQC: Aggregates results from multiple bioinformatics tools (such as FastQC) into a single comprehensive report for easier analysis.
Phred: A base-calling program that assigns quality scores to DNA sequencing reads, indicating the accuracy of each base call.
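A Phred quality score Q relates to the estimated probability P that a base call is wrong by Q = -10 log10(P), so Q20 means a 1 in 100 chance of error and Q30 means 1 in 1000. The minimal sketch below decodes a FASTQ quality string and converts the scores to error probabilities. It assumes the common Phred+33 ASCII encoding; older Illumina files used an offset of 64. The function names are hypothetical.

```python
def phred_to_error_prob(q: float) -> float:
    """Convert a Phred score to the estimated probability of a wrong base call."""
    return 10 ** (-q / 10)


def decode_quality(quality_string: str, offset: int = 33) -> list:
    """Decode a FASTQ quality string into integer Phred scores (Phred+33 assumed)."""
    return [ord(char) - offset for char in quality_string]


def expected_errors(quality_string: str, offset: int = 33) -> float:
    """Sum the per-base error probabilities: the expected number of wrong bases."""
    return sum(phred_to_error_prob(q) for q in decode_quality(quality_string, offset))


scores = decode_quality("II5!")
print(scores)
print([phred_to_error_prob(q) for q in scores])
print(expected_errors("II5!"))
```

Filtering reads on expected errors rather than on mean quality is one common design choice, because a single very low score can hide behind an otherwise high average.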

Read Alignment / Mapping
BWA / Bowtie2: Efficient aligners for mapping sequencing reads to reference genomes.
HISAT2: A fast and sensitive tool for aligning RNA-seq reads.
STAR: A high-performance aligner designed for RNA-seq data.
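To make the idea of mapping reads to a reference concrete, here is a toy seed-and-verify mapper: it indexes every k-mer of the reference, looks up the first k bases of a read, and then counts mismatches across the full read. This is only an illustration of the concept. It is not how BWA, Bowtie2, HISAT2, or STAR are implemented (they use compressed index structures and handle gaps and splicing), and all names and sequences here are hypothetical.

```python
from collections import defaultdict


def build_kmer_index(reference: str, k: int) -> dict:
    """Map every k-mer in the reference to the list of its 0-based start positions."""
    index = defaultdict(list)
    for i in range(len(reference) - k + 1):
        index[reference[i:i + k]].append(i)
    return index


def map_read(read: str, reference: str, index: dict, k: int, max_mismatches: int = 1):
    """Seed with the read's first k-mer, then verify the whole read at each seed hit."""
    hits = []
    for pos in index.get(read[:k], []):
        window = reference[pos:pos + len(read)]
        if len(window) < len(read):
            continue  # read would run off the end of the reference
        mismatches = sum(a != b for a, b in zip(read, window))
        if mismatches <= max_mismatches:
            hits.append((pos, mismatches))
    return hits


# Toy reference and reads (made up for this illustration).
reference = "ACGTACGTTAGCCGATAG"
index = build_kmer_index(reference, k=4)
print(map_read("TAGCCGAT", reference, index, k=4))  # exact match
print(map_read("TAGCCGTT", reference, index, k=4))  # one mismatch
```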

Post-Alignment Processing
Picard: A set of command-line utilities for manipulating high-throughput sequencing data (BAM/SAM).
Sambamba: Fast and efficient BAM/SAM file processing, including indexing, sorting, and more.
SAMtools: Essential utilities for working with SAM and BAM data formats.
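Much of post-alignment processing comes down to reading the fields of SAM records, especially the bitwise FLAG column that tools such as SAMtools and Picard filter on. The sketch below decodes a FLAG value and parses the first six mandatory columns of a SAM line. The bit meanings follow the SAM specification; the descriptive names, the function names, and the example record are hypothetical.

```python
# FLAG bits as defined in the SAM specification, with descriptive names chosen here.
SAM_FLAGS = {
    0x1: "paired",
    0x2: "proper_pair",
    0x4: "unmapped",
    0x8: "mate_unmapped",
    0x10: "reverse_strand",
    0x20: "mate_reverse_strand",
    0x40: "first_in_pair",
    0x80: "second_in_pair",
    0x100: "secondary",
    0x200: "qc_fail",
    0x400: "duplicate",
    0x800: "supplementary",
}


def decode_flag(flag: int) -> list:
    """Return the names of all bits that are set in a SAM FLAG value."""
    return [name for bit, name in SAM_FLAGS.items() if flag & bit]


def parse_sam_line(line: str) -> dict:
    """Parse the first six mandatory tab-separated fields of a SAM alignment line."""
    fields = line.rstrip("\n").split("\t")
    return {
        "qname": fields[0],
        "flag": int(fields[1]),
        "rname": fields[2],
        "pos": int(fields[3]),   # 1-based leftmost mapping position
        "mapq": int(fields[4]),
        "cigar": fields[5],
    }


# Made-up example record for illustration.
record = parse_sam_line("read1\t99\tchr1\t1000\t60\t50M\t=\t1200\t250\t*\t*")
print(record)
print(decode_flag(record["flag"]))
```

For real BAM files a dedicated library is the practical choice, since BAM is a compressed binary format that this text-based sketch cannot read.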

Variant Calling or Quantification
GATK: A toolkit for variant discovery in high-throughput sequencing data, covering steps such as base quality score recalibration and variant calling.
HTSeq: A Python package for counting reads aligned to genomic features.
MACS2 / deepTools: Tools for peak calling (MACS2) and for analyzing deep-sequencing data (deepTools).
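Counting reads per genomic feature, the job HTSeq performs, can be sketched as an interval-overlap problem. The simplified example below assigns a read to a gene only if it overlaps exactly one gene, and otherwise tallies it as ambiguous or as hitting no feature. The two special counter names mirror those reported by htseq-count, but the logic is a deliberately reduced assumption: it ignores chromosomes, strand, spliced reads, and mapping quality. All names and coordinates are hypothetical.

```python
from collections import Counter


def count_reads(features: dict, reads: list) -> dict:
    """Count reads per feature.

    features: {gene_name: (start, end)} with closed, 1-based intervals.
    reads:    list of (start, end) aligned read intervals on the same sequence.
    """
    counts = Counter({gene: 0 for gene in features})
    for read_start, read_end in reads:
        overlapping = [
            gene
            for gene, (feat_start, feat_end) in features.items()
            if read_start <= feat_end and read_end >= feat_start
        ]
        if len(overlapping) == 1:
            counts[overlapping[0]] += 1
        elif not overlapping:
            counts["__no_feature"] += 1
        else:
            counts["__ambiguous"] += 1
    return dict(counts)


# Made-up gene intervals and read alignments for illustration.
genes = {"geneA": (100, 200), "geneB": (180, 300)}
alignments = [(110, 150), (190, 210), (250, 290), (400, 450)]
print(count_reads(genes, alignments))
```

Discarding ambiguous reads rather than splitting them between genes keeps the counts as whole numbers, which is what count-based statistical models downstream expect.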

Bioconductor: An open-source project that offers extensive tools for genomic data analysis in R.
DESeq2: An R package for differential expression analysis of RNA-seq data.
edgeR: An R package for differential expression analysis of RNA-seq data.
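The core quantity in differential expression analysis is the fold change of a gene's expression between conditions after correcting for sequencing depth. The sketch below computes counts per million (CPM) and a log2 fold change with a pseudocount. It is only a conceptual illustration and is not the method used by DESeq2 or edgeR, which fit statistical models to replicated count data and report significance; the function names and counts are hypothetical.

```python
import math


def cpm(counts: dict) -> dict:
    """Scale raw read counts to counts per million to correct for library size."""
    total = sum(counts.values())
    return {gene: count * 1_000_000 / total for gene, count in counts.items()}


def log2_fold_change(control: dict, treated: dict, pseudocount: float = 1.0) -> dict:
    """Per-gene log2(treated / control) on CPM values; the pseudocount avoids log(0)."""
    cpm_control, cpm_treated = cpm(control), cpm(treated)
    return {
        gene: math.log2((cpm_treated[gene] + pseudocount) / (cpm_control[gene] + pseudocount))
        for gene in control
    }


# Made-up counts for two genes in two samples.
control_counts = {"gene1": 100, "gene2": 100}
treated_counts = {"gene1": 300, "gene2": 100}
print(log2_fold_change(control_counts, treated_counts))
```

Note that gene2 appears to go down even though its raw count is unchanged, because CPM is a relative measure; this composition effect is one reason dedicated packages use more robust normalization.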