
Start by testing your current knowledge of bioinformatics tools. Click the Start button below.

UNDER CONSTRUCTION

Bioinformatics Tools Quiz

What is meant by "building a bioinformatics pipeline"?

There are many ways to build a bioinformatics pipeline, depending on the scientific field, company goals, and the data available. Below is a general outline that can serve as a guide. The tools and steps may vary. In some cases all of these steps fall under a single person’s role, and in others they need to be broken into more detailed parts. Each step is followed by an example from a 2024 article by Singh, Bedate, Richthofen, and others. After reading the examples below, please learn about how bioinformatics pipelines are used in public health.

a. Start by identifying the biological question or the specific disease or genetic disorder you are investigating.

They aimed to find new immune system "brakes" (called inhibitory receptors) that could be used as targets to improve cancer treatments. This was because current therapies don’t work for everyone.

b. Collect biological samples and sequence their DNA or RNA to get raw genetic data.

Instead of collecting new samples, they used existing public genetic data from immune cells and tumors, which already included RNA sequencing data.

c. Clean up the data by checking quality issues and preparing it for analysis.

They didn’t go into detail about this, but since the data came from trusted public sources, it had likely already been checked and cleaned.

d. Match the sequences to a known reference genome or assemble them from scratch if no reference exists.

They used a well-known database called Ensembl to get full genetic sequences and matched them to known reference data.

e. Identify important genetic regions like coding (exons), non-coding (introns), and pseudogenes.

They focused on key parts of proteins inside cells, specifically short pieces called ITIM and ITSM motifs, which are known to help control immune responses.
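A motif search of this kind can be sketched with regular expressions. The example below is only an illustration: the text above does not state which patterns or software the authors used, so the consensus patterns (ITIM as [I/V/L/S]-x-Y-x-x-[L/V/I] and ITSM as T-x-Y-x-x-[V/I]) are the commonly cited forms and are an assumption here. The function name `find_motifs` and the example sequence are hypothetical.

```python
import re

# Commonly cited consensus patterns (an assumption, not taken from the article):
#   ITIM: (I/V/L/S)-x-Y-x-x-(L/V/I)
#   ITSM: T-x-Y-x-x-(V/I)
# A lookahead is used so that overlapping motifs are all reported.
MOTIFS = {
    "ITIM": re.compile(r"(?=([IVLS].Y..[LVI]))"),
    "ITSM": re.compile(r"(?=(T.Y..[VI]))"),
}


def find_motifs(protein_seq):
    """Return (motif name, 1-based start, matched text) for every hit, sorted by position."""
    seq = protein_seq.upper()
    hits = []
    for name, pattern in MOTIFS.items():
        for match in pattern.finditer(seq):
            hits.append((name, match.start() + 1, match.group(1)))
    return sorted(hits, key=lambda hit: hit[1])


# Hypothetical cytoplasmic tail fragment, made up for illustration only.
example_tail = "AAVIYAQLAATEYSEVAA"
for hit in find_motifs(example_tail):
    print(hit)
```

A real screen would also need to restrict the search to the intracellular part of each protein, which this sketch does not attempt.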

f. Search online databases to find similar sequences or known patterns.

They compared their sequences to known inhibitory receptors and ran statistical tests to see which ones looked like they might act the same way.

g. Look for new genes or immune system targets (epitopes) that have not been studied before.

They found 390 new possible inhibitory receptors that hadn’t been studied before. These could be new targets for cancer therapy.

h. Analyze how genes or organisms are related by building family trees (phylogenetic trees).

They didn’t build family trees, but they did group the new receptors based on how similar they were in shape and function.

i. Convert DNA or RNA sequences into protein sequences using genetic code rules.

They used tools to turn genetic data into protein sequences so they could study the parts responsible for immune suppression.
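The "genetic code rules" in step i can be shown with a short sketch. The code below translates a coding sequence with the standard genetic code, reading one codon (three bases) at a time from the first base. It is a minimal illustration, not the tool used in the article, and the function name `translate` is chosen here for the example.

```python
from itertools import product

# Standard genetic code, written in the conventional T, C, A, G codon order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(seq, stop_at_stop=True):
    """Translate a DNA or RNA coding sequence into a protein string.

    Reading starts at the first base. Codons containing unknown bases
    become "X", and any trailing bases that do not fill a codon are ignored.
    """
    seq = seq.upper().replace("U", "T")  # treat RNA input as DNA
    protein = []
    for i in range(0, len(seq) - 2, 3):
        amino_acid = CODON_TABLE.get(seq[i:i + 3], "X")
        if amino_acid == "*" and stop_at_stop:
            break  # stop codon ends the protein
        protein.append(amino_acid)
    return "".join(protein)


print(translate("ATGGCCTAA"))  # Met, Ala, then a stop codon
```

Some organisms and organelles use slightly different codon tables, so a real pipeline would select the table that fits the source of the sequence.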

j. Use computer tools to predict what those proteins might look like in 3D.

They used a tool called AlphaFold to predict the 3D shape of the proteins and filtered out the ones that didn’t look stable or accurate.

k. Explore what these genes or proteins might do using computer-based (in-silico) tools.

They used machine learning and computer models to predict which of the new proteins were likely to function as immune inhibitors.

l. If needed, test these findings in the lab or in living systems (in-vivo) to confirm the results.

They didn’t test the predictions in the lab yet. Their goal was to provide a computer-based starting point for future experiments.

Databases

EMBL

A comprehensive database for nucleotide sequences

GenBank

An annotated collection of publicly available gene sequences from various organisms. This database is maintained by NCBI (https://www.ncbi.nlm.nih.gov/).

Use this database to retrieve the mRNA sequence for FLT-1, which is a biomarker for preeclampsia. Preeclampsia is a pregnancy complication characterized by high blood pressure and high levels of protein in the urine (proteinuria).

UniProt

A curated database of protein sequences and functional information
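A GenBank record such as the FLT-1 mRNA can also be retrieved programmatically rather than through the website. The sketch below uses NCBI's E-utilities `efetch` endpoint to request a nucleotide record in FASTA format. The helper names are hypothetical, and the accession shown in the usage comment (NM_002019, assumed here to be a RefSeq mRNA for human FLT1) should be confirmed in GenBank before use.

```python
from urllib.parse import urlencode
from urllib.request import urlopen

EFETCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def build_efetch_url(accession):
    """Build an E-utilities URL that requests a nucleotide record as FASTA text."""
    params = {"db": "nuccore", "id": accession, "rettype": "fasta", "retmode": "text"}
    return f"{EFETCH_URL}?{urlencode(params)}"


def fetch_fasta(accession):
    """Download the FASTA text for one accession (requires network access)."""
    with urlopen(build_efetch_url(accession), timeout=30) as response:
        return response.read().decode("utf-8")


def parse_fasta(text):
    """Split single-record FASTA text into (header, sequence)."""
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    header = lines[0].lstrip(">")
    sequence = "".join(lines[1:])
    return header, sequence


# Usage (not run here because it needs a network connection; the accession
# is an assumption and should be verified in GenBank first):
#   header, sequence = parse_fasta(fetch_fasta("NM_002019"))
#   print(header, len(sequence))
```

NCBI asks users of E-utilities to keep request rates low, so anything beyond a handful of downloads should follow its usage guidelines.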
Data Acquisition and Quality Control

FastQC

A quality control tool that provides a visual summary of high-throughput sequencing data to identify potential issues.

MultiQC

Aggregates results from multiple bioinformatics tools (like FastQC) into a single comprehensive report for easier analysis.

Phred

A base-calling program that assigns quality scores to DNA sequencing reads, indicating the accuracy of each base call.
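The Phred quality scores mentioned above follow a simple formula: Q = -10 × log10(P), where P is the estimated probability that a base call is wrong. The sketch below converts between Q and P and decodes a FASTQ quality string, assuming the widely used Phred+33 ASCII encoding. The function names are chosen for this example only.

```python
import math


def error_prob_to_phred(p):
    """Convert an error probability P into a Phred score Q = -10 * log10(P)."""
    return -10 * math.log10(p)


def phred_to_error_prob(q):
    """Convert a Phred score Q back into an error probability P = 10^(-Q/10)."""
    return 10 ** (-q / 10)


def decode_fastq_quality(quality_string, offset=33):
    """Decode a FASTQ quality string into a list of Phred scores.

    The default offset of 33 assumes Phred+33 encoding; older files
    may use a different offset.
    """
    return [ord(ch) - offset for ch in quality_string]


scores = decode_fastq_quality("!I5")
for q in scores:
    print(q, phred_to_error_prob(q))
```

For example, a score of 30 corresponds to an estimated 1 in 1,000 chance that the base call is wrong.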

Read Alignment / Mapping

BWA / Bowtie2

Efficient aligners for mapping sequencing reads to reference genomes

HISAT2

A fast and sensitive tool for aligning RNA-seq reads

STAR

A high-performance aligner designed for RNA-seq data

Post-Alignment Processing

Picard

A set of command-line utilities for manipulating high-throughput sequencing data (BAM/SAM)

Sambamba

Fast and efficient BAM/SAM file processing, covering indexing, sorting, and more

SAMtools

Essential utilities for working with SAM and BAM data formats

Variant Calling or Quantification

GATK

A toolkit for variant discovery in high-throughput sequencing data, focused on post-alignment processing such as base recalibration and variant calling

HTSeq

A Python package for counting reads aligned to genomic features

MACS2 / deepTools

Tools for peak calling (MACS2) and analysis of deep-sequencing data (deepTools)
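To make "counting reads aligned to genomic features" concrete, the sketch below counts reads that overlap exactly one feature. It is a pure-Python toy and does not use the HTSeq API. The input layout, the function name `count_reads`, and the half-open coordinates are assumptions made for the example. The special counter names `__no_feature` and `__ambiguous` are borrowed from the output of htseq-count.

```python
def count_reads(reads, features):
    """Count reads per feature.

    reads:    list of (chrom, start, end) tuples
    features: list of (name, chrom, start, end) tuples
    Intervals are half-open: start is included, end is excluded.
    A read is counted only if it overlaps exactly one feature.
    """
    counts = {name: 0 for name, _chrom, _start, _end in features}
    counts["__no_feature"] = 0
    counts["__ambiguous"] = 0

    for chrom, start, end in reads:
        # Collect every feature on the same chromosome that overlaps the read.
        hits = {
            name
            for name, f_chrom, f_start, f_end in features
            if f_chrom == chrom and start < f_end and f_start < end
        }
        if len(hits) == 1:
            counts[hits.pop()] += 1
        elif not hits:
            counts["__no_feature"] += 1
        else:
            counts["__ambiguous"] += 1
    return counts


# Made-up coordinates for illustration only.
features = [("geneA", "chr1", 100, 200), ("geneB", "chr1", 150, 300)]
reads = [("chr1", 90, 120), ("chr1", 160, 180), ("chr1", 400, 450)]
print(count_reads(reads, features))
```

Real counting tools also handle strand, spliced reads, and mapping quality, and they use indexed data structures instead of the simple loop shown here.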

Statistical Analysis and Biological Interpretation

Bioconductor

An open-source project that offers extensive tools for genomic data analysis in R

DESeq2

An R package for differential expression analysis of RNA-seq data

edgeR

An R package for differential expression analysis of RNA-seq data
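Before differential expression can be tested, raw counts from different samples have to be put on a comparable scale. The sketch below illustrates the median-of-ratios idea that DESeq2 describes for estimating per-sample size factors. It is a simplified pure-Python illustration, not DESeq2 code, and the function name `size_factors` and the input layout are assumptions made for the example.

```python
import math
from statistics import median


def size_factors(counts):
    """Estimate one size factor per sample with the median-of-ratios idea.

    counts: list of rows, one per gene, each a list of raw counts
            (one value per sample).
    Genes with a zero count in any sample are skipped, because their
    geometric mean across samples is zero.
    """
    n_samples = len(counts[0])
    log_ratios = [[] for _ in range(n_samples)]
    for row in counts:
        if any(c <= 0 for c in row):
            continue
        # Log of the geometric mean of this gene across all samples.
        log_geo_mean = sum(math.log(c) for c in row) / n_samples
        for j, c in enumerate(row):
            log_ratios[j].append(math.log(c) - log_geo_mean)
    # The size factor is the median ratio of a sample to the per-gene reference.
    return [math.exp(median(ratios)) for ratios in log_ratios]


# Made-up counts: the second sample was sequenced about twice as deeply.
counts = [[10, 20], [100, 200], [50, 100], [0, 5]]
factors = size_factors(counts)
normalized = [[c / f for c, f in zip(row, factors)] for row in counts]
print(factors)
print(normalized)
```

Dividing each count by its sample's size factor gives values that can be compared across samples. The statistical testing that DESeq2 and edgeR perform on top of this is not shown.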