Sample projects
NOTE that many of my
sample problems involve calculating statistics -- just because that's
what I do. There is no requirement for your project to be like this --
use your imagination to think of some task you once had, or may some
day have, that would be much easier if the computer could do it.
Nothing in any of these
projects requires any Perl that you haven't seen. You are free to
submit a project that requires learning new Perl, but when I see the
design document I can tell whether you will really need new Perl, and
will be able to help you decide what you do need.
1. Calculating three
statistics to describe a biological interaction network
A biological network (for example, a
protein - protein interaction network) consists of a set of molecules
(represented by their names) and a set of interactions between
molecules. In network terminology, we call these nodes and edges. An example is shown in this
plot drawn with Cytoscape:

There are several ways to characterize the structure of a network.
Notice, for example, that some of the nodes have many more edges than
others. The number of edges connected to any node is the degree of that node. So some nodes
are of higher degree than others. The first thing we would like to know
is the degree distribution of
the network. This is represented by a table of the form
Degree
Proportion of nodes with this degree
0
0.0
1
0.32
2
0.24
...
where, of course, the proportions must sum to 1.
Secondly, we would like to know the average
degree.
This is just twice the number of edges, divided by the number of nodes.
(The reason for "twice" is that each edge connects two nodes).

Thirdly, we would like to know the average
clustering coefficient
of the network. The clustering
coefficient of a node measures the relative frequency of edges
between
the immediate neighbors of that node. For example, in the figure at
right node A has degree k =
5. If among all of its 5 neighbors (B, C, D, E, and H) each pair were
connected by an edge, there would be k(k
- 1)/2 = 10 edges. But in fact there is only one such edge (BC). So the
clustering coefficient of this node is 0.1. The average clustering
coefficient of the entire network is just the average of these numbers,
taken over all nodes.
Finally, you'll need to know how the network is represented in a text
file. It's simple: each line describes two nodes that are linked by
an edge.
The task is, given an
interaction network described in such a file, to compute the three
statistics described above.
2. Joining multiple tables
on one column
This is an operation that is a headache to do in Excel. It's useful to
have a little Perl utility for it. Here's the problem: I have several
two-column tab-separated files, of which two examples are shown here:
File 1
Label
Value
0 x
1 b
2 n
4 p
5 s
7 v
9 c
File 2
Label Value
5 r
3 c
2 j
8 s
0 b
I wish to join these so that there is only one label column, but a
value column for each of the source files. For this example, my result
file will look like this:
Label
F1 F2
0
x b
1
b -
2
n j
3
- c
4
p -
5
s r
7
v -
8
- s
9
c -
The task is, given a file containing a list of any number of such input
files, to produce one such joined output file. You may assume that
every Label column contains only integers and that in any input file,
any given label appears only once. The output table should have rows
only for labels that appeared in at least one input file.
3. Calculating a pairwise correlation table
Given a tab-separated table of numbers with row labels, calculate
the Pearson's correlation statistic between each pair of rows and write
the correlation table to an output file with row and column labels. You
may assume that there are no missing values and that all m rows are of
identical length n.
4. Calculating run-length distributions
In a series of coin flips, it is of interest to know the distribution
of lengths of runs, which are consecutive sequences of identical
outcomes (heads
or tails). For example, if you flip a coin 1000 times, there will be a
certain number of runs of length 20, 19, 18, etc. Generate such a
distribution by simulation and write it
to a file. Allow the analyst to specify the
probability of heads that will apply to all flips in an experiment. You
need not
distinguish runs of heads from runs of tails.
5. Calculating k-mer
distribution
The distribution of k-mers in
biological sequences is also interesting, and is one way that genic
regions are distinguishable from intergenic regions. It can even be
used for testing whether a piece of text is likely to have been written
by the same author who wrote other texts. Write a program that, given a
biological sequence and an integer k
specified by the analyst, calculates the frequencies of the k-mers that are present. For
example, in the sequence acgttcgtacg,
we have 3-mers acg (2), cgt (2), gtt (1), tcg (1), gta (1), and tac (1). Of course you will
want to place the output in a nice tab-separated table with column
labels.
6. Generating realistic sequences
Let's suppose you've calculated the 2-mer distribution in a large body
of genic sequence. You might now want to generate a new sequence with
similar properties. Can you see how? For example, suppose in a DNA
sequence you have counted aa
150 times, ac
200 times, ag
250 times, and at
100 times. This means that
if you start with an a,
the probability that the next base is a t is
100/(150 + 200 + 250 + 100)
or 0.143. After you generate the next base, you use the same idea
(relative frequencies of 2-mers starting with that base) to generate
the base following that.
A variant of this involves words in text. If you use, say, the works of
Shakespeare to tabulate the frequencies of 2-word sequences, you can
use this distribution to generate some text that, while it won't be
Shakespearean or even good English, looks a lot more like Shakespearean
English than do random sequences of words. Some interesting work on
this line (using letter or word frequencies) was done by Claude Shannon,
the "father of information theory", and here are a few slides
related to his work.
The task is to write a sequence generator based on a known 2-mer
frequency distribution.
7. Calculating false-discovery-rate (FDR) statistics
Analysis of a microarray experiment testing R genes in C conditions yields a R x C matrix of p values. The rows are labeled with
probe (or gene) names and the columns with condition names, as in the
example below.
ProbeID C1 C2
C3 C4 C5
C6 C7
AFFX-BioB-5_at
0.5486 1.20E-05
0.5806 0.1404
0.1124 0.2592 0.1872
We wish to calculate for each condition a suitable threshold for
accepting that an expression change in a gene under that condition is
meaningful. This is how it's done by the method first proposed by Benjamini
& Hochberg (1995): we sort the p values in ascending order (that
is, in descending order of significance). This allows us to assign
ranks 1 through R (where the
rank of the ith
gene is just i) to the p values associated with that
condition. Now, given a desired alpha level or acceptable
false-positive rate (denoted as f)
specified by the analyst, we
will accept all genes for which p
< if/N.
Your program will accept a data file and a f value and will print to a file a
table of the p values of all
genes (call their number G)
that were accepted at that f
value, omitting any genes that were not accepted for any condition. The
FDR, which you should also print out, is fN/G.
Note that you can't sort p
values independently for each condition, because then you'll lose the
associations of gene name with p value
for the other conditions. You must sort the entire table on that
condition. Of course you already know how to do this.
More
about the FDR and subsequent improvements to it
8. Preparing Nimblegen data for analysis
Here are some tasks suggested by Marce Lorenzen, for handling Nimblegen
resequencing-array data that looks like this.
To see the column headings lined up properly with the data, right-click
on the link and save the file, then load it into Microsoft Excel. The
full data file, which is not provided here, contains 2 million lines,
so you couldn't manipulate it in Excel.
- In practice we would load these data into a relational
database rather
than write multiple Perl scripts to parse them, but this is beyond the
scope of our little Perl course. Here's how I would prepare the data
for a database, by separating the fields on which I would want to
query. The chromosome (or linkage group, LG) name appears in both SEQ_ID and PROBE_ID columns. In the PROBE_ID column, it's combined with
the chromosome base position, but that also appears in the POSITION column, so we can discard
it. In the SEQ_ID column, the
chromosome name is combined with the base-position range of that
chromosome, so I would discard the chromosome name and keep
the base range. I'd also rename the column headings to agree with my
data changes. So your task is to
prepare a new file in which these changes have been made to these two
columns. This would allow us to load the data into Microsoft
Access or other relational database and query it, if we knew how. But
since we don't, the rest of our tasks will be to write Perl scripts to
extract the data more laboriously.
- For the rest of the tasks
in this project, assume that we have not
prepared the data file as in task 1. In this
task, you must divide this file up into smaller files, one for each of
the chromosomes. These, for reasons best known to Tribolium geneticists, are named
chrLGX, chrLG2, chrLG3, chrLG4, chrLG5, chrLG6, chrLG7, chrLG8, chrLG9,
chrLG10, UNK_GROUP, and UNK_SGTN. Of
course, your script must not assume that you know these names -- it
should
find them out from the data file. So, for example, the file
named chrLG4.txt should
contain all the lines of the main data file whose PROBE_ID or SEQ_ID column contains CHRLG4 or
chrLG4 as the chromosome part of the data entry.
- Create separate FASTA-formatted files for each of the above
chromosome subsets, following the format
>PROBE_ID
PROBE_SEQUENCE
Example:
>CHRLG10FS006009885
AAGTGTTAAATACATAATTTTTTTCCAGACATTTAAGTTTGTAGCTTGAATCAA
>CHRLG10FS008661858
AGACCCAGTGCAGCTTCCCCCAGATTGAATTATTAATGGATTAAGTCGAAC
- Given a specific chromosome and starting and ending base
positions, create a file containing only the data rows in which the POSITION entry lies between those
positions on that chromosome. For example, this longer
(1000-line) sample file
contains data for 148 probes located on chromosome 2, but only 7 of
those probes are located between positions 700000 and 800000
(CHRLG2FS000747050; CHRLG2FS000748724; CHRLG2FS000749579;
CHRLG2FS000750962; CHRLG2FS000753313; CHRLG2FS000755513; and
CHRLG2FS000767665). So your Perl script, given the arguments 2, 700000,
and 800000, would write an 8-line file (the header line, plus the data
lines for these 7 probes).
9. Converting weather data (HARD for beginners)
The difficulty of this project lies in deciding what kind of data
structure to put the data into so that you can get them back out in a
different arrangement. From the National
Climatic Data Center I downloaded daily surface weather
data for Klamath Falls, Oregon during part of 1992. Each
line gives data for one weather variable (in column ELEM) for 31
days in each
month, with the missing-data code -99999
entered when observations were not available. Your task is to create a
new file in which each line
represents one day, and each column represents one variable, like this
(you may print out just these 5 columns in your solution):
Day TMAX TMIN
PRCP TOBS [and a few other variables]
33298
50 32 4 42
33299
42 32 37 36
33300
53 33 33 45
33301
52 34 58 35
33302
39 24 0 30
How do you extract records from each line of the original file? It's
true that they're comma-separated, but I recommend that you do not use split().
Using regular
expressions will do the job much more concisely.
Note that I converted the dates into Julian days, using
December 31, 1899 as day 0 since that's how Excel does it. I'll show
you how to do this with a Perl module, so that you can focus on the
problem of converting the data organization. At the top of your
program, insert this statement:
use
Date::Calc
qw(Delta_Days Days_in_Month);
which tells Perl to use these two subroutines from module Date::Calc.
Then define these
constants:
my
($kExcel_ref_year,
$kExcel_ref_month, $kExcel_ref_day) = (1900, 1, 1);
my
$kDay_kludge = 2; #
unexplained difference between Excel's and Perl's calculation
The subroutines are called as follows, after you've extracted the
current year, month, and day from the current line of the data (the
year and month are combined in YEARMO):
my
$excel_day =
$kDay_kludge + Delta_Days
($kExcel_ref_year, $kExcel_ref_month, $kExcel_ref_day, $year, $month,
$day);
Because the data files contain entries for 31 days of every month
(with the missing-data code inserted for nonexistent days) you'll need
to know the
number of days in each month, so that you don't try to print out data
for nonexistent days such as April 31.You can get this number with
my
$num_days_in_month =
Days_in_Month($year, $month);
|