Introduction to Perl for Biologists, PLPTH 750

Department of Plant Pathology, Kansas State University

Home page
Schedule

Sample projects

NOTE that many of my sample problems involve calculating statistics -- just because that's what I do. There is no requirement for your project to be like this -- use your imagination to think of some task you once had, or may some day have, that would be much easier if the computer could do it.

Nothing in any of these projects requires any Perl that you haven't seen. You are free to submit a project that requires learning new Perl, but when I see the design document I can tell whether you will really need new Perl, and will be able to help you decide what you do need.

1. Calculating three statistics to describe a biological interaction network

A biological network (for example, a protein - protein interaction network) consists of a set of molecules (represented by their names) and a set of interactions between molecules. In network terminology, we call these nodes and edges. An example is shown in this plot drawn with Cytoscape:

Sample network
 
There are several ways to characterize the structure of a network. Notice, for example, that some of the nodes have many more edges than others. The number of edges connected to any node is the degree of that node. So some nodes are of higher degree than others. The first thing we would like to know is the degree distribution of the network. This is represented by a table of the form

Degree        Proportion of nodes with this degree
0                   0.0
1                   0.32
2                   0.24
...

where, of course, the proportions must sum to 1.

Secondly, we would like to know the average degree. This is just twice the number of edges, divided by the number of nodes. (The reason for "twice" is that each edge connects two nodes).
Clustering_coefficient
Thirdly, we would like to know the average clustering coefficient of the network. The clustering coefficient of a node measures the relative frequency of edges between the immediate neighbors of that node. For example, in the figure at right node A has degree k = 5. If among all of its 5 neighbors (B, C, D, E, and H) each pair were connected by an edge, there would be k(k - 1)/2 = 10 edges. But in fact there is only one such edge (BC). So the clustering coefficient of this node is 0.1. The average clustering coefficient of the entire network is just the average of these numbers, taken over all nodes.

Finally, you'll need to know how the network is represented in a text file. It's simple: each line describes two nodes that are linked by an edge.

The task is, given an interaction network described in such a file, to compute the three statistics described above.

2. Joining multiple tables on one column

This is an operation that is a headache to do in Excel. It's useful to have a little Perl utility for it. Here's the problem: I have several two-column tab-separated files, of which two examples are shown here:

File 1

Label    Value
0    x
1    b
2    n
4    p
5    s
7    v
9    c

File 2

Label    Value
5    r
3    c
2    j
8    s
0    b

I wish to join these so that there is only one label column, but a value column for each of the source files. For this example, my result file will look like this:

Label    F1    F2
0    x    b
1    b    -
2    n    j
3    -    c
4    p    -
5    s    r
7    v    -
8    -    s
9    c    -

The task is, given a file containing a list of any number of such input files, to produce one such joined output file. You may assume that every Label column contains only integers and that in any input file, any given label appears only once. The output table should have rows only for labels that appeared in at least one input file.

3. Calculating a pairwise correlation table

Given a tab-separated table of numbers with row labels, calculate the Pearson's correlation statistic between each pair of rows and write the correlation table to an output file with row and column labels. You may assume that there are no missing values and that all m rows are of identical length n.

4. Calculating run-length distributions

In a series of coin flips, it is of interest to know the distribution of lengths of runs, which are consecutive sequences of identical outcomes (heads or tails). For example, if you flip a coin 1000 times, there will be a certain number of runs of length 20, 19, 18, etc. Generate such a distribution by simulation and write it to a file. Allow the analyst to specify the probability of heads that will apply to all flips in an experiment. You need not distinguish runs of heads from runs of tails.

5. Calculating k-mer distribution

The distribution of k-mers in biological sequences is also interesting, and is one way that genic regions are distinguishable from intergenic regions. It can even be used for testing whether a piece of text is likely to have been written by the same author who wrote other texts. Write a program that, given a biological sequence and an integer k specified by the analyst, calculates the frequencies of the k-mers that are present. For example, in the sequence acgttcgtacg, we have 3-mers acg (2), cgt (2), gtt (1), tcg (1), gta (1), and tac (1). Of course you will want to place the output in a nice tab-separated table with column labels.

6. Generating realistic sequences

Let's suppose you've calculated the 2-mer distribution in a large body of genic sequence. You might now want to generate a new sequence with similar properties. Can you see how? For example, suppose in a DNA sequence you have counted aa 150 times, ac 200 times, ag 250 times, and at 100 times. This means that if you start with an a, the probability that the next base is a t is 100/(150 + 200 + 250 + 100) or 0.143. After you generate the next base, you use the same idea (relative frequencies of 2-mers starting with that base) to generate the base following that.

A variant of this involves words in text. If you use, say, the works of Shakespeare to tabulate the frequencies of 2-word sequences, you can use this distribution to generate some text that, while it won't be Shakespearean or even good English, looks a lot more like Shakespearean English than do random sequences of words. Some interesting work on this line (using letter or word frequencies) was done by Claude Shannon, the "father of information theory", and here are a few slides related to his work.

The task is to write a sequence generator based on a known 2-mer frequency distribution.

7. Calculating false-discovery-rate (FDR) statistics

Analysis of a microarray experiment testing R genes in C conditions yields a R x C matrix of p values. The rows are labeled with probe (or gene) names and the columns with condition names, as in the example below.

ProbeID    C1    C2    C3    C4    C5    C6    C7

AFFX-BioB-5_at    0.5486    1.20E-05    0.5806    0.1404    0.1124    0.2592    0.1872

We wish to calculate for each condition a suitable threshold for accepting that an expression change in a gene under that condition is meaningful. This is how it's done by the method first proposed by Benjamini & Hochberg (1995): we sort the p values in ascending order (that is, in descending order of significance). This allows us to assign ranks 1 through R (where the rank of the ith gene is just i) to the p values associated with that condition. Now, given a desired alpha level or acceptable false-positive rate (denoted as f) specified by the analyst, we will accept all genes for which p < if/N.

Your program will accept a data file and a f value and will print to a file a table of the p values of all genes (call their number G) that were accepted at that f value, omitting any genes that were not accepted for any condition. The FDR, which you should also print out, is fN/G.

Note that you can't sort p values independently for each condition, because then you'll lose the associations of gene name with p value for the other conditions. You must sort the entire table on that condition. Of course you already know how to do this.

More about the FDR and subsequent improvements to it

8. Preparing Nimblegen data for analysis

Here are some tasks suggested by Marce Lorenzen, for handling Nimblegen resequencing-array data that looks like this. To see the column headings lined up properly with the data, right-click on the link and save the file, then load it into Microsoft Excel. The full data file, which is not provided here, contains 2 million lines, so you couldn't manipulate it in Excel.
  1. In practice we would load these data into a relational database rather than write multiple Perl scripts to parse them, but this is beyond the scope of our little Perl course. Here's how I would prepare the data for a database, by separating the fields on which I would want to query. The chromosome (or linkage group, LG) name appears in both SEQ_ID and PROBE_ID columns. In the PROBE_ID column, it's combined with the chromosome base position, but that also appears in the POSITION column, so we can discard it. In the SEQ_ID column, the chromosome name is combined with the base-position range of that chromosome, so I would discard the chromosome name and keep the base range. I'd also rename the column headings to agree with my data changes. So your task is to prepare a new file in which these changes have been made to these two columns. This would allow us to load the data into Microsoft Access or other relational database and query it, if we knew how. But since we don't, the rest of our tasks will be to write Perl scripts to extract the data more laboriously.
  2. For the rest of the tasks in this project, assume that we have not prepared the data file as in task 1. In this task, you must divide this file up into smaller files, one for each of the chromosomes. These, for reasons best known to Tribolium geneticists, are named chrLGX, chrLG2, chrLG3, chrLG4, chrLG5, chrLG6, chrLG7, chrLG8, chrLG9, chrLG10, UNK_GROUP, and UNK_SGTN. Of course, your script must not assume that you know these names -- it should find them out from the data file. So, for example, the file named chrLG4.txt should contain all the lines of the main data file whose PROBE_ID or SEQ_ID column contains CHRLG4 or chrLG4 as the chromosome part of the data entry.
  3. Create separate FASTA-formatted files for each of the above chromosome subsets, following the format
    >PROBE_ID
    PROBE_SEQUENCE

    Example:
    >CHRLG10FS006009885
    AAGTGTTAAATACATAATTTTTTTCCAGACATTTAAGTTTGTAGCTTGAATCAA
    >CHRLG10FS008661858
    AGACCCAGTGCAGCTTCCCCCAGATTGAATTATTAATGGATTAAGTCGAAC
  1. Given a specific chromosome and starting and ending base positions, create a file containing only the data rows in which the POSITION entry lies between those positions on that chromosome.  For example, this longer (1000-line) sample file contains data for 148 probes located on chromosome 2, but only 7 of those probes are located between positions 700000 and 800000 (CHRLG2FS000747050; CHRLG2FS000748724; CHRLG2FS000749579; CHRLG2FS000750962; CHRLG2FS000753313; CHRLG2FS000755513; and CHRLG2FS000767665). So your Perl script, given the arguments 2, 700000, and 800000, would write an 8-line file (the header line, plus the data lines for these 7 probes).

9. Converting weather data (HARD for beginners)

The difficulty of this project lies in deciding what kind of data structure to put the data into so that you can get them back out in a different arrangement. From the National Climatic Data Center I downloaded daily surface weather data for Klamath Falls, Oregon during part of 1992. Each line gives data for one weather variable (in column ELEM) for 31 days in each month, with the missing-data code -99999 entered when observations were not available. Your task is to create a new file in which each line represents one day, and each column represents one variable, like this (you may print out just these 5 columns in your solution):

Day    TMAX    TMIN    PRCP    TOBS [and a few other variables]

33298    50    32    4    42
33299    42    32    37    36
33300    53    33    33    45
33301    52    34    58    35
33302    39    24    0    30

How do you extract records from each line of the original file? It's true that they're comma-separated, but I recommend that you do not use split(). Using regular expressions will do the job much more concisely.

Note that I converted the dates into Julian days, using December 31, 1899 as day 0 since that's how Excel does it. I'll show you how to do this with a Perl module, so that you can focus on the problem of converting the data organization. At the top of your program, insert this statement:

use Date::Calc qw(Delta_Days Days_in_Month);

which tells Perl to use these two subroutines from module Date::Calc. Then define these constants:

my ($kExcel_ref_year, $kExcel_ref_month, $kExcel_ref_day) = (1900, 1, 1);
my $kDay_kludge = 2; # unexplained difference between Excel's and Perl's calculation

The subroutines are called as follows, after you've extracted the current year, month, and day from the current line of the data (the year and month are combined in YEARMO):

my $excel_day = $kDay_kludge + Delta_Days
    ($kExcel_ref_year, $kExcel_ref_month, $kExcel_ref_day, $year, $month, $day);

Because the data files contain entries for 31 days of every month (with the missing-data code inserted for nonexistent days) you'll need to know the number of days in each month, so that you don't try to print out data for nonexistent days such as April 31.You can get this number with

my $num_days_in_month = Days_in_Month($year, $month);