CSC 121: Computers and Scientific Thinking
Fall 2025

Lab 3: Applications in Biology

The National Center for Biotechnology Information (NCBI), which is part of the National Institutes of Health (NIH), is the primary repository of biological information in the U.S. The NCBI Web site includes the Basic Local Alignment Search Tool (BLAST), which can be used to search the GenBank database of DNA sequences and find regions of local similarity. This lab, based on exercises developed at NCBI/NIH and the University of New Hampshire, will familiarize you with the use of BLAST.


Jurassic Park Dino-DNA Analysis

In 1990, Michael Crichton published the book Jurassic Park about the resurrection of dinosaurs using the blood from the stomachs of insects that had been encased in amber. At one point in the book, the lead researcher, Dr. Henry Wu, is asked to explain some of the DNA techniques used in reconstructing the extinct dinosaur genomes. Dr. Wu describes the use of restriction enzymes and how the fragmented pieces of dino DNA can be spliced together with these enzymes. He also alludes to the fact that they don't have the entire genome but that they "fill in the gaps" with modern-day frog DNA. During his discussion, he points to a computer screen and remarks, "Here you see the actual structure of a small fragment of dinosaur DNA."

gcgttgctgg cgtttttcca taggctccgc ccccctgacg agcatcacaa aaatcgacgc ggtggcgaaa cccgacagga ctataaagat accaggcgtt tccccctgga agctccctcg tgttccgacc ctgccgctta ccggatacct gtccgccttt ctcccttcgg gaagcgtggc tgctcacgct gtaggtatct cagttcggtg taggtcgttc gctccaagct gggctgtgtg ccgttcagcc cgaccgctgc gccttatccg gtaactatcg tcttgagtcc aacccggtaa agtaggacag gtgccggcag cgctctgggt cattttcggc gaggaccgct ttcgctggag atcggcctgt cgcttgcggt attcggaatc ttgcacgccc tcgctcaagc cttcgtcact ccaaacgttt cggcgagaag caggccatta tcgccggcat ggcggccgac gcgctgggct ggcgttcgcg acgcgaggct ggatggcctt ccccattatg attcttctcg cttccggcgg cccgcgttgc aggccatgct gtccaggcag gtagatgacg accatcaggg acagcttcaa cggctcttac cagcctaact tcgatcactg gaccgctgat cgtcacggcg atttatgccg caagtcagag gtggcgaaac ccgacaagga ctataaagat accaggcgtt tcccctggaa gcgctctcct gttccgaccc tgccgcttac cggatacctg tccgcctttc tcccttcggg ctttctcatt gctcacgctg taggtatctc agttcggtgt aggtcgttcg ctccaagctg acgaaccccc cgttcagccc gaccgctgcg ccttatccgg taactatcgt cttgagtcca acacgactta acgggttggc atggattgta ggcgccgccc tataccttgt ctgcctcccc gcggtgcatg gagccgggcc acctcgacct gaatggaagc cggcggcacc tcgctaacgg ccaagaattg gagccaatca attcttgcgg agaactgtga atgcgcaaac caacccttgg ccatcgcgtc cgccatctcc agcagccgca cgcggcgcat ctcgggcagc gttgggtcct gcgcatgatc gtgctagcct gtcgttgagg acccggctag gctggcgggg ttgccttact atgaatcacc gatacgcgag cgaacgtgaa gcgactgctg ctgcaaaacg tctgcgacct atgaatggtc ttcggtttcc gtgtttcgta aagtctggaa acgcggaagt cagcgccctg

In 1992, a researcher at the National Institutes of Health (NIH), Dr. Mark Boguski, copied this sequence from the book and searched for it among all of the DNA sequences known at the time. Dr. Boguski wrote up his findings and submitted a manuscript to the journal BioTechniques, as a joke. Surprisingly, his manuscript was accepted and published [Boguski, M.S. A Molecular Biologist Visits Jurassic Park. (1992) BioTechniques 12(5):668-669]. You will reproduce this experiment using BLAST.

EXERCISE 1:   From the main BLAST page, click on the Nucleotide Blast button. This brings up a web page where you can specify your query sequence along with various parameters (including the genetic database to use). Copy-and-paste the above "dinosaur DNA" sequence into the window labeled Enter Query Sequence, keep the default Core nucleotide database (core_nt), and then click the BLAST button to start the search. After a short delay, the results of your search will be displayed in the page. By default, the Descriptions tab lists the closest matches (or "hits") in a table, with columns for the sequence description, measures of how close the match was, and background information about the sequence (Accession). Click on the Total Score column heading so that the matches are ranked by total score, the best measure of an overall match. The top match, after sorting by Total Score, should be Cloning vector pHRS-9, complete sequence. What are the descriptions of the next two matches?
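
If copying the sequence by hand proves error-prone, the query can be tidied up first. The following is a minimal Python sketch, not part of the original NCBI exercise, that strips the spaces from a copied sequence and writes it in FASTA format (a header line starting with ">" followed by the sequence), which the Enter Query Sequence window accepts in addition to a bare sequence. The function name to_fasta, the label "dino", and the line width are all illustrative assumptions.

```python
def to_fasta(raw, name="query", width=60):
    """Return a FASTA-formatted string for a pasted DNA sequence.

    name and width are illustrative choices, not BLAST requirements.
    """
    # Remove every space, tab, and line break, and normalize the case.
    seq = "".join(raw.split()).lower()

    # Reject anything that is not a nucleotide letter (n = unknown base).
    unexpected = set(seq) - set("acgtn")
    if unexpected:
        raise ValueError(f"unexpected characters: {sorted(unexpected)}")

    # Break the sequence into fixed-width lines under a ">" header line.
    lines = [seq[i:i + width] for i in range(0, len(seq), width)]
    return f">{name}\n" + "\n".join(lines)


if __name__ == "__main__":
    # Only the first few blocks of the Jurassic Park sequence are shown here.
    print(to_fasta("gcgttgctgg cgtttttcca taggctccgc", name="dino", width=20))
```

The printed text can be pasted directly into the query window in place of the raw sequence.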

Clicking on the Accession link in the rightmost column of a match will give you more information about that match. This includes the SOURCE ORGANISM of the match, i.e., the plant or animal this sample was taken from. For example, the SOURCE ORGANISM entry for the top match (Cloning vector pHRS-9, complete sequence) includes the phrase artificial sequences, which means that this is not a naturally occurring sequence.

EXERCISE 2:    Click on the Accession links corresponding to the next 9 matches. How many of these entries have artificial sequences listed under the SOURCE ORGANISM entry?

By default, the table shows the matching entries as a list with descriptions. Clicking on the Graphic Summary tab at the top of the table displays the matches in a visual form. The colors of the lines show how close the matches were to the search sample, with red signifying sections that match closely and the colors lavender, green, blue, and black signifying less perfect matches. There may also be white sections, denoting gaps in the matching sequence.

EXERCISE 3:    Click on the Graphic Summary tab at the top of the match table. Describe the lines you see for the matches. Are there any colors besides red? If so, what are the other colors? Do any of the lines contain gaps (sequences of white)?

In practice, researchers rarely have complete and exact DNA samples. Some mistakes will undoubtedly occur in extracting samples from organisms, and gaps may occur as pieces of a sample are lost or incorrectly combined. This is why BLAST reports multiple matches and provides matching information via the colored lines and overall score. Advanced users of BLAST can specify additional search parameters to control how similar a match must be in order to be reported.
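
To get a feel for how a match can be scored even when two sequences differ, here is a small Python sketch of the classic Smith-Waterman local alignment score. This is not the algorithm BLAST itself runs (BLAST uses a much faster heuristic to approximate this kind of search), and the scoring values (match +2, mismatch -1, gap -2) are assumptions chosen for illustration rather than BLAST's defaults.

```python
def local_alignment_score(a, b, match=2, mismatch=-1, gap=-2):
    """Best local alignment score between strings a and b (Smith-Waterman).

    The scoring values are illustrative assumptions, not BLAST's defaults.
    """
    best = 0
    prev = [0] * (len(b) + 1)  # scores for the previous row of the table
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            # Extend the alignment diagonally with a match or a mismatch.
            step = match if a[i - 1] == b[j - 1] else mismatch
            diagonal = prev[j - 1] + step
            # Or open a gap in one of the two sequences.
            gap_in_b = prev[j] + gap
            gap_in_a = curr[j - 1] + gap
            # A local alignment may restart anywhere, so never go below 0.
            curr[j] = max(0, diagonal, gap_in_b, gap_in_a)
            best = max(best, curr[j])
        prev = curr
    return best


if __name__ == "__main__":
    print(local_alignment_score("acgtacgt", "acgtacgt"))  # identical sequences
    print(local_alignment_score("acgtacgt", "acgttcgt"))  # one base changed
```

A single changed base lowers the score slightly instead of destroying the match, which is why a damaged sample can still find its closest relatives.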

EXERCISE 4:   Introduce errors into the Jurassic Park sequence by deleting three lines from somewhere in the middle of the search sequence. In addition, add a line of random nucleotides (C, G, A or T) somewhere else in the sequence. Do these changes affect the search results you obtain (compared to the matches from the original search)? Are the top 3 matches the same as before (after sorting by Total Score)? Are the scores of the top matches affected? How about the lines in the Graphic Summary?
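
The edits for this exercise can be made by hand, but they can also be scripted. The sketch below is one possible way to do it in Python, under several assumptions: a "line" is taken to be 60 nucleotides, the deletion starts at the midpoint of the sequence, and the random line is inserted one quarter of the way in. The function name introduce_errors and all of its parameters are hypothetical.

```python
import random


def introduce_errors(seq, line_len=60, lines_deleted=3, seed=None):
    """Delete a block from the middle of seq and insert random nucleotides.

    line_len, lines_deleted, and the edit positions are illustrative
    assumptions; any positions inside the sequence would do.
    """
    rng = random.Random(seed)
    seq = "".join(seq.split()).lower()  # drop spaces and line breaks

    # Delete lines_deleted "lines" starting at the midpoint.
    start = len(seq) // 2
    shortened = seq[:start] + seq[start + line_len * lines_deleted:]

    # Build one "line" of random nucleotides and insert it elsewhere.
    junk = "".join(rng.choice("acgt") for _ in range(line_len))
    position = len(shortened) // 4
    return shortened[:position] + junk + shortened[position:]


if __name__ == "__main__":
    # A stand-in sequence; paste the Jurassic Park sequence here instead.
    original = "gcgttgctgg cgtttttcca taggctccgc ccccctgacg " * 10
    damaged = introduce_errors(original, seed=1)
    print(len("".join(original.split())), len(damaged))
```

Passing a fixed seed makes the random line repeatable, so the same damaged sequence can be searched again later.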

The Lost World Dino-DNA Analysis

After Dr. Boguski's article appeared in 1992, it was brought to Michael Crichton's attention. Crichton, who was working on the sequel to Jurassic Park, reached out to Boguski and asked him to consult on the book. Dr. Boguski constructed an interesting sequence that he felt was more scientifically plausible, and this sequence appeared in The Lost World.

gaattccgga agcgagcaag agataagtcc tggcatcaga tacagttgga gataaggacg gacgtgtggc agctcccgca gaggattcac tggaagtgca ttacctatcc catgggagcc atggagttcg tggcgctggg ggggccggat gcgggctccc ccactccgtt ccctgatgaa gccggagcct tcctggggct gggggggggc gagaggacgg aggcgggggg gctgctggcc tcctaccccc cctcaggccg cgtgtccctg gtgccgtggg cagacacggg tactttgggg accccccagt gggtgccgcc cgccacccaa atggagcccc cccactacct ggagctgctg caaccccccc ggggcagccc cccccatccc tcctccgggc ccctactgcc actcagcagc gggcccccac cctgcgaggc ccgtgagtgc gtcatggcca ggaagaactg cggagcgacg gcaacgccgc tgtggcgccg ggacggcacc gggcattacc tgtgcaactg ggcctcagcc tgcgggctct accaccgcct caacggccag aaccgcccgc tcatccgccc caaaaagcgc ctgcgggtga gtaagcgcgc aggcacagtg tgcagccacg agcgtgaaaa ctgccagaca tccaccacca ctctgtggcg tcgcagcccc atgggggacc ccgtctgcaa caacattcac gcctgcggcc tctactacaa actgcaccaa gtgaaccgcc ccctcacgat gcgcaaagac ggaatccaaa cccgaaaccg caaagtttcc tccaagggta aaaagcggcg ccccccgggg gggggaaacc cctccgccac cgcgggaggg ggcgctccta tggggggagg gggggacccc tctatgcccc ccccgccgcc ccccccggcc gccgcccccc ctcaaagcga cgctctgtac gctctcggcc ccgtggtcct ttcgggccat tttctgccct ttggaaactc cggagggttt tttggggggg gggcgggggg ttacacggcc cccccggggc tgagcccgca gatttaaata ataactctga cgtgggcaag tgggccttgc tgagaagaca gtgtaacata ataatttgca cctcggcaat tgcagagggt cgatctccac tttggacaca acagggctac tcggtaggac cagataagca ctttgctccc tggactgaaa aagaaaggat ttatctgttt gcttcttgct gacaaatccc tgtgaaaggt aaaagtcgga cacagcaatc gattatttct cgcctgtgtg aaattactgt gaatattgta aatatatata tatatatata tatatctgta tagaacagcc tcggaggcgg catggaccca gcgtagatca tgctggattt gtactgccgg aattc

EXERCISE 5:   Once again, conduct a BLAST search by invoking Nucleotide Blast, copy-and-pasting the new Lost World sequence into the Enter Query Sequence window, and submitting it to BLAST. As before, click on the Total Score header to sort the hits by total score. Click the Accession link to the right of the highest-scoring sequence match in the list. Which source organism is this DNA sequence from? Similarly, what is the source organism for the next-highest match?

EXERCISE 6:   In the book, it is theorized that birds evolved from dinosaurs. If that were true, you might expect the DNA of modern birds to have similarities with dinosaur DNA. Does either of the top two matches support this theory? If a match does not support the theory, what explanation might there be for it? Explain your answers.

The BLAST page provides tools for searching the DNA databases and viewing the results in various ways. The default Nucleotide Blast option that we have been using compares the query with nucleotide sequences (i.e., sequences of the nucleotides C, G, A, and T). Alternatively, the blastx option translates the query into the amino acid sequences it could encode and compares those with a database of known proteins, so the matches are displayed as proteins. Apparently, Dr. Boguski couldn't resist sneaking a hidden message into his Lost World sequence. He inserted nucleotides into his sequence that, when translated into amino acids and written with their one-letter codes, spell out a 4-word message.

EXERCISE 7:   To view Dr. Boguski's hidden message, go to the main BLAST page and select the blastx option. Copy-and-paste the Lost World sequence into the Enter Query Sequence window and submit it to BLAST (using the default database). Once the matches appear (this can take several seconds), click on the Alignments tab at the top of the table. This will show the matches as protein sequences, using 20 letters to denote amino acids. Dr. Boguski's message is hidden in the first match, with message letters embedded in the Query sequence corresponding to four gaps (sequences of dashes) in the Sbjct sequence. What is his 4-word message?
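
To see what it means to read nucleotides as protein, the following Python sketch translates a DNA string into one-letter amino acid codes using the standard genetic code, in all six possible reading frames (three offsets on the given strand and three on its reverse complement). It is only an illustration of the translation step, not a reimplementation of blastx, and the function names are hypothetical. The example uses a short made-up sequence so that it does not give away the answer to the exercise.

```python
# Standard genetic code, listed with bases in the order T, C, A, G.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def clean(dna):
    """Remove whitespace and convert to upper case."""
    return "".join(dna.split()).upper()


def translate(dna):
    """Translate DNA codon by codon; '*' marks a stop, 'X' an unknown codon."""
    dna = clean(dna)
    codons = (dna[i:i + 3] for i in range(0, len(dna) - 2, 3))
    return "".join(CODON_TABLE.get(codon, "X") for codon in codons)


def reverse_complement(dna):
    """Return the sequence of the opposite strand, read in its own direction."""
    return clean(dna).translate(str.maketrans("ACGT", "TGCA"))[::-1]


def six_frames(dna):
    """Translate all three frames of both strands, keyed '+1'..'+3', '-1'..'-3'."""
    forward = clean(dna)
    backward = reverse_complement(forward)
    frames = {}
    for offset in range(3):
        frames[f"+{offset + 1}"] = translate(forward[offset:])
        frames[f"-{offset + 1}"] = translate(backward[offset:])
    return frames


if __name__ == "__main__":
    for frame, protein in six_frames("atggcc taa").items():
        print(frame, protein)
```

Each amino acid is written as a single letter, which is what makes it possible to spell words inside a protein sequence.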

Submit a document containing your answers to all the lab questions via BlueLine.