PLPTH 613
Bioinformatics Applications
Spring 2009


Schedule
Research project
K-State Online

Lab 3. Assembly of sequenced DNA fragments

Introduction

Assembly of a set of sequenced fragments from a shotgun treatment of a whole genome or  large subclone may be done with a variety of publicly available programs including phrap/crossmatch, the Broad Institute's Arachne, TIGR Assembler, the Celera Assembler, and increasing numbers of short-read assemblers. Several commercial programs include Gene Codes Sequencher, GCG SeqMerge, DNAStar Lasergene's SeqMan, and Vector NTI Advance's ContigExpress. Because the public packages are built to run on Linux or Unix systems including Mac OS X and we are using Windows machines, we will use ContigExpress to demonstrate the principles. It's installed in the lab machines on a course license that is valid for just the current semester. Here's a typical plot of an assembly from ContigExpress, showing the tiling in the upper panel, the coverage depth in the middle section, and the assembly quality (based on consistency of sequence matching in the aligned fragments) in the bottom section:
VNTI Contig Express assembly view

Learning goals of this lab

  1. Gain experience invoking Perl scripts from the command line, using varying parameter values to affect output.
  2. Learn the influence of sequence quality, length, vector contamination, and coverage on the quality of assembly.
  3. Learn the features of a useful software package for doing molecular biology and genomics, including searching sequences against public databases.

What we will do

  1. Using a 120-Kb "contig" randomly sampled from chromosome 1 of the sorghum genome assembly, we will simulate shotgun digestion of a large-insert clone into fragments of various sizes and to various coverages. We will then introduce vector contamination (which happens during preparation of sequencing templates by PCR amplification of inserts) and sequencing error, due to poor signal quality during sequencing and resulting in miscalling of bases. Finally we will import our fragments into ContigExpress and attempt to assemble them.
  2. We will use a second set of files from a different source, in .abi format (meaning raw trace files with quality values, direct from the sequencer), to examine the relationship of trace and quality data to assembly characteristics.
  3. Finally, we will search our assembled contig(s) against a public sequence data base in order to identify similarity with known genes.

Exercise 1

Questions to be answered are in this color. Other questions should be thought about, but you need not turn in the answers.
  1. As always, begin by creating a directory (the C drive is the best place) for holding your files during the lab. For convenience I'll call this directory Lab_3.
  2. Perl has been installed on your lab machines, and I've written a script for simulating fragment data, called sample_reads_for_assembly.pl. To download it, right-click on the link.
  3. This script operates on a FASTA-formatted sequence file and a cloning vector file, both of which you should also download to your local disk. (By now you should have noticed that left-clicking causes the server to try to send the contents of a file to your browser as HTML for displaying on the page, while right-clicking and choosing "Save link as..." tells it to send the file for saving directly. Why should we copy material from a page, paste it into a text document, and save that under a new name, when we can just save it directly under the original name?).
  4. Open a DOS window (from the Start menu, choose Run... and in the Run dialog, type cmd and click OK). Then navigate to your Lab_3 directory. Type the letters

    perl s

    and hit the Tab key so that the DOS shell completes the name sample_reads_for_assembly.pl for you. Now press Enter and observe the display.
  5. A well-written script should always tell you how to use it, by providing a list of expected arguments. Here we see that the script expects us to provide a sequence filename, a name for the output file, the desired read length, and the coverage. If we choose, we can supply three further arguments: the error rate in the internal parts of the fragment, the factor by which this error rate should be increased in the 50 bases on either end (which in Sanger sequencing are the most error-prone), and a file containing cloning-vector sequence for adding vector contamination to the fragment ends. If we choose not to give values for these, the script will use the default error rates shown and will not add vector sequence.
  6. Creating simulated fragments for an assembly experiment For the first experiment, we will pretend that the vector sequence has already been trimmed off the fragment reads. Type (or paste) at the command prompt

    perl sample_reads_for_assembly.pl chr1_120k_from_25M.fasta cov1\frag1k 1000 1

    and press Enter. Note that whenever using the command line, you should take advantage of filename completion in this way, to save typing. Thus when entering the above, you could have typed perl s (Tab, Space) c (Tab, Space) and then the rest.
  7. What you've told this Perl script to do is to create enough 1000-base fragments for 1x coverage of a 120-Kb piece of DNA -- which means it will create 120 fragments -- and place them in a subdirectory, cov1, with basename frag1k. The script will create cov1 when it detects that there is no subdirectory of this name in your current directory, Lab_3.
  8. Navigate there with

    cd cov1

    and view one of the files with

    more <filename>

    to assure yourself that it looks like a regular FASTA-formatted file. Note that the file and fragment names contain a fragment number followed by the start position from which the fragment was sampled. This will be useful to us when we view the assembly. Of course in real data we would have no way of knowing where the fragment came from. Some of the fragments would be, in fact, just cloned pieces of the large-insert cloning vector.
  9. Viewing and assembling fragments in ContigExpress Now open ContigExpress with Start/All Programs/Invitrogen/Vector NTI Advance 11/ContigExpress. This will give you a ContigExpress Project Explorer window. Choose Project/Add Fragments/From FASTA file and browse to your new cov1 directory. Select all the files (just type Ctrl-A) and click Open to load them into the assembly project.
  10. Because we have no quality values for this sequence data and we have already pretended to remove leftover vector sequence on the ends of the fragments, we will leave ContigExpress to trim the fragment ends based on alignment results. We'll just tell the program to assemble everything and see what happens. Select all of the fragments and choose Assemble/Assemble Selected Fragments.
  11. Study the right-hand pane of the Project window. This is called the List Pane and shows the organization of fragments into contigs. Note that ContigExpress shows the original fragment lengths as well as the numbers of bases trimmed at both ends. Since we chose a coverage of only 1 in our first experiment, there should be several singletons: fragments that could not be assembled into contigs. You'll see these below the contigs in the List Pane.
  12. To see why fragments have been trimmed, double-click on the Contig 1 icon to open a Contig Viewer window. Three panes are provided: a Text Pane, a Graphics Pane, and a Contig Alignment Pane. To see these all at once, choose View/Edit Pane Layout and select the Standard Predefined Layout in the dialog.
  13. Exploring the Contig Viewer assembly results. Click on a fragment in the Graphics Pane and observe what happens in the Contig Alignment Pane below. Be sure you understand why columns in the alignment are highlighted in yellow. What is the correspondence between this coloring and the green bars in the Graphics Pane? You should try to determine this from the display, but if you can't, you'll find a user manual in the VectorNTI directory somewhere on your hard drive. Chapters 28 to 30 discuss ContigExpress.
  14. After clicking in the Graphics Pane, click on the Zoom In button in the second toolbar (if you hover the cursor over the buttons, you'll see tool tips that tell you what they do). Hold down the Control key and click to zoom the image vertically. Note that the fragment labels appear. Observe the start-position values in the fragment names. How do these correspond with the assembly that has been done? Note the "_r" after some of the fragment names. How does this correspond with the assembly, and what does it mean?
  15. When you're finished exploring the Contig Viewer, close it and choose Assemble/Assembly Setup and click on the Clipping tab. This allows an analyst to set parameters for trimming, based on poor overlap quality of alignments between fragments. Close the dialog and look again at the List Pane. Note that some of the singletons have been trimmed. Since clearly no alignments were found for these, I'm not sure why they were trimmed.
  16. Can you estimate the amount of the original 120K clone that has been covered by contigs? Click View/Expand Contigs to hide the fragments, or click the Expand Contigs button in the toolbar.  (Inconveniently, ContigExpress doesn't sum up the contig lengths for you, and doesn't seem to allow you to export them as text so that you could sum them up in Excel).
  17. Exploring effects of simulated vector contamination on assembly You should be convinced that 1x coverage is far too low for full sequencing of our big clone. Clearly the object is to achieve enough coverage that we'll be able to build one large contig. But for the next experiment, let's create fragments with vector contamination. To save time, we'll start with a 50-Kb clone this time, so download this sequence file.
  18. At the command prompt, paste this text:

    perl sample_reads_for_assembly.pl chr1_50k.fasta cov5\frag1k 500 5 .01 5 TOPO_vector.txt

    and press Enter. Can you predict the number of fragments that will be generated and placed in subdirectory cov5?
  19. In the ContigExpress Project Explorer window, choose Project/Close Project to get rid of the work we've done so far (when prompted to Save, click No). Now load the new data as in step 9 above.
  20. Note two things: all the fragments have exactly the same length (which is not like real life!) and it's 560, not 500. That is because the script has added a 37-base chunk of vector to one end and a 23-base chunk to the other. As these were "sequenced" and "basecalled", errors were introduced so that about 1 in 20 bases in the end regions are wrong.
  21. Can the fragments be assembled, as in step 10? Try it. For comparison later, make a note of the number and approximate total length of your contigs.
  22. Unfortunately, this wasn't a good demonstration of why vector contamination must be removed before assembly will work. The vector sequence that we added to simulate contamination was so short that ContigExpress was able to trim it off during its attempts to align the fragments. Notice the sizes listed in the columns 5'Trimmed bases and 3'Trimmed bases -- many of them match the sizes of the two vector sequences we added.
  23. Ordinarily, however, vector remnants are too long for this approach to work, and MUST be removed before assembly. So we will see how this would be done in ContigExpress. First, repeat step 19 above: close the project (since you can't un-trim the fragments!) and add all the fragments again from FASTA files.
  24. Exploring vector trimming Now select about half of the fragments (CE won't let you trim more than 300 at a time) and choose Edit/Trim Selected Fragments Ends for Vector Contamination. For each fragment you'll see a small pane showing the ends of the sequence and the number of bases before trimming, 560 in our little example. Click the Settings... button.
  25. In the Settings dialog, you'll see a Polylinker List panel at left. Click the checkbox of the one called pCR Blunt II TOPO_(3490-30) and then click on the name of the polylinker.
  26. You'll see the sequence in the right panel, with several enzyme insertion sites shown. Click on the EcoRI insertion site that is nearest the left end -- it will now be highlighted in red. Now click OK and, back in the Trimmer dialog, click Calculate! After a while the trimming will be finished. Look at the results. The red sequences represent the trimmed part, and we see that only 19 bases were trimmed from the 5' ends. Let's try relaxing the constraints to make trimming a little more vigorous. Go back to the Settings dialog and change the Minimum vector overlap with ambiguities to 25 and the Vector match threshold to 60, then go back to the Trimmer dialog and click Calculate!
  27. Except for a couple of reads where the trimmer didn't know where to stop and trimmed the whole read away, we may have found reasonable settings (but you can probably do better if you try). So click OK to make these trims permanent, and view the results in the List Pane. Actually they aren't so good, are they? We've lost a lot of data. But select and trim the rest of your reads in the same way.
  28. Now repeat step 10: select all fragments and assemble. Determine the number of contigs and compare their estimated total length with the values you recorded in step 21. What can you conclude from this experiment?
  29. Functional analysis exercise Select one of your contigs (say, Contig 1) and choose Project/Convert Contig to Fragment. This will add a new fragment called Contig 1 Consensus. Right-click on this consensus sequence label and choose Analyses/Web Analyses/Compare Against/Many databases using BLAST@NCBI. In the next dialog, accept the defaults Whole sequence and Direct, and if you're presented with a prompt to choose a browser, choose the default for your computer.
  30. You'll be sent to a local copy of the NCBI BLAST gateway page with your sequence in it. Just click the BLAST! button, and on the next page (which is served by NCBI) accept the defaults and initiate the BLAST search by clicking View Report. You should understand that this will carry out alignments of your contig sequence with the nr (nonredundant) NCBI nucleotide database, which contains a vast number of DNA sequences from thousands of organisms. The search result will report the annotations that have been attached (by the submitters) to the sequences that showed the highest-scoring alignments. You can click on any score in the Score (Bits) column to see the corresponding alignment. What gene, gene family, or other genomic element do you think your contig represents, or contains? (Please don't paste the whole BLAST output into your report, just the top few descriptions!) I should warn you that I don't know what you'll find, so there is no certain answer to this question -- just give a concise but informative summary.
  31. We will now do a few experiments without attaching any vector contamination, under the assumption that this can be efficiently removed in real situations by proper tuning of the trimming software. Our aim is to determine how coverage, fragment sizes, and error rates influence assembly. For speed, sample fragments from an even smaller clone of only 20 Kb. Use the Perl script to generate at least five data sets, varying fragment size from 300 to 1200 (I don't recommend trying shorter reads, without adjusting CE's trimming settings), coverage from 2 to 8, and error from 0.01 internally to 0.05. Be sure to use a unique directory name for the output (the 2nd argument) -- otherwise you'll write all of the files into the Lab_3 directory itself! Also realize that you can return to a previous command by pressing the Up arrow on your keyboard, and can then use the left and right arrows to edit the command. Record the numbers of contigs and of unassembled fragments and report these along with your parameters, in a concise table. Note any peculiar behavior of CE and any other discoveries you make from this experiment. If there were only a few singletons left in an experiment, can you explain (from their sampling start positions, errors, or any other feature) why they were not incorporated into any contig?
  32. Ordinarily you will be assembling fragments that come with quality scores. Until recently (when "next-generation" sequencing has become popular) these data were provided by sequencing labs in the form of trace files with the .abi suffix . We will now look at a set of fragments of this kind.
  33. In the ContigExpress Project Explorer window, choose Project/Close Project. Now choose Project/Add Fragments/From ABI file... and navigate to directory
    C:\Program Files\Invitrogen\Vector NTI Advance 11\Demo Projects\HS354_C02\ABI. Select all files and choose Open. You'll get a prompt about naming the fragments -- click Yes just so that we're all doing the same thing.
  34. Now in the List Pane, double-click on one of the fragment names to open the Fragment Viewer. Again, you may either view each panel separately by clicking on the tabs at lower left, or follow the instructions in step 12 above to see all panels at once.
  35. Scrolling along the Chromatogram panel, notice the change in appearance of the traces at either 5' or 3' ends. This is quite usual for Sanger sequencing, and the ambiguity of the peak patterns results in amplified basecalling error at the ends. Experiment with the various buttons to see how the traces can be shown or hidden, etc.
  36. Assemble your fragments and repeat step 29 with the contig. Examining the BLAST scores and alignments and realizing that an E value of 0.0 corresponds to an exact match between sequences, from what organism do you think this demonstration project was built?