Read everything before doing anything!
DNA is made up of sequences of nucleotides. There are four nucleotides: Adenine, Cytosine, Guanine, and Thymine, represented by the letters A, C, G, and T.
Every set of three nucleotides produces one of twenty amino acids. For example, Cytosine, Thymine, and Guanine (CTG) produce Leucine. Note that there are 64 combinations of three nucleotides and only twenty amino acids, so a single amino acid can be produced by several different triples.
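One way to picture the many-to-one relationship is a Perl hash keyed by triple. The sketch below is only a fragment: it lists the six triples that produce Leucine in the standard genetic code, and the hash name %amino_acid_for is made up for illustration.

```perl
use strict;
use warnings;

# Partial table for illustration only: the six triples that
# produce Leucine. The hash name is hypothetical.
my %amino_acid_for = (
    CTT => 'Leu', CTC => 'Leu', CTA => 'Leu', CTG => 'Leu',
    TTA => 'Leu', TTG => 'Leu',
);

# Collect, in sorted order, every triple whose value is 'Leu'.
my @leucine_triples = grep { $amino_acid_for{$_} eq 'Leu' } sort keys %amino_acid_for;

print "CTG produces $amino_acid_for{CTG}\n";
print scalar(@leucine_triples), " triples produce Leu: @leucine_triples\n";
```

A complete table would need all 64 triples as keys.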
The following tables, taken from the BioInformatics tutorial, are Copyright 1999 by Boris Steipe, University of Munich. The table at the left shows the names of the amino acids and their abbreviations. The table at the right shows the abbreviations of the amino acids and the nucleotide sequences which produce them.
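You will write a program that repeatedly accepts as input a string of ACGT triples and produces, for each triple, the triple and the three-letter abbreviation of the amino acid it produces, followed by a line with the one-letter abbreviations of all the amino acids.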
The program will continue to accept input until the user just presses RETURN without entering any DNA codes.
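The "keep asking until the user presses RETURN" behavior could be sketched as follows. This is one possible shape, not a required one: the sub name read_sequences and its filehandle parameter are hypothetical, and an in-memory filehandle stands in for the keyboard so the example runs without typing anything.

```perl
use strict;
use warnings;

# Read sequences from the filehandle $in until an empty line (or end of
# input) is seen. Returns the non-empty lines that were entered.
# The sub name and its parameter are hypothetical.
sub read_sequences {
    my ($in) = @_;
    my @entered;
    while (1) {
        print "Enter a nucleotide sequence, or just press ENTER to quit: ";
        my $sequence = <$in>;
        last unless defined $sequence;   # end of input
        chomp $sequence;
        last if $sequence eq '';         # user pressed RETURN only
        push @entered, $sequence;        # a real program would process it here
    }
    return @entered;
}

# Demonstration: a string pretending to be what the user typed.
my $typed = "cga tgc\naaacgatcaccc\n\nnever read\n";
open my $fake_keyboard, '<', \$typed or die "Cannot open in-memory file: $!";
my @got = read_sequences($fake_keyboard);
print "\nRead ", scalar(@got), " sequences before the empty line\n";
```

In the real program you would pass \*STDIN instead of the in-memory filehandle.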
Here is a sample run of the program. Note that some sequences, such as TAG, do not produce an amino acid. They are listed as *** for their three-letter abbreviation, and * for the one-letter abbreviation. Do not presume that there will never be more than 4 sequences. Somebody could enter a string with 40 three-letter sequences, or 50, or 80, and your program would have to process that string correctly.
    Enter a nucleotide sequence, or just press ENTER to quit: aaacgatcaccc
    AAA Lys
    CGA Arg
    TCA Ser
    CCC Pro
    One-letter abbreviations: KRSP
    Enter a nucleotide sequence, or just press ENTER to quit: cga tgc
    CGA Arg
    TGC Cys
    One-letter abbreviations: RC
    Enter a nucleotide sequence, or just press ENTER to quit: cgi tat tca
    CGI invalid sequence
    TAT Tyr
    TCA Ser
    One-letter abbreviations: ?YS
    Enter a nucleotide sequence, or just press ENTER to quit: tag cat agg ctt
    TAG ***
    CAT His
    AGG Arg
    CTT Leu
    One-letter abbreviations: *HRL
    Enter a nucleotide sequence, or just press ENTER to quit: tca cag gtt cg
    Error: You must give complete triples.
    Enter a nucleotide sequence, or just press ENTER to quit:
Breaking up the string with the sequence into triples is not easy to do unless you know about pattern matching, which we haven’t gotten to yet. Presuming that the input sequence is in a scalar named $sequence, the following Perl code will split it into an array called @triples. Just use this as a “magic spell”; in CIT042B you will understand how and why it works.
    $sequence = uc($sequence);            # convert to uppercase
    $sequence =~ s/[^A-Z]//g;             # remove non-letters
    $n_chars = length $sequence;          # use this for error checking
    @triples = $sequence =~ m/(...)/ig;   # separate into triples
Thus, if $sequence contained the string "ACT TAG CAT TTG", after running the code, $triples[0] would contain ACT, $triples[1] would contain TAG, $triples[2] would contain CAT, and $triples[3] would contain TTG. The $n_chars variable would contain 12, the total number of letters you entered.
This code only generates complete triples. Thus, the input ACG TC ATG will generate $n_chars as 8 (eight letters total), $triples[0] as ACG, and $triples[1] as TCA. The last TG will drop off into the bit bucket.
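Here is one possible algorithm for solving the problem. It is not the only way to do it, but it is relatively straightforward.
Create one hash that maps each triple to a three-letter abbreviation, and a second hash that maps each three-letter abbreviation to its one-letter abbreviation. The second hash should include a key of "***" and a value of "*".
Set $one_letter_codes to "".
For each triple, look it up in the first hash. If it is not there, use "invalid sequence" as the value. This result is the “amino acid.”
Look up the amino acid in the second hash. If it is not there, use "?" as the result. Call this result the “abbreviation.”
Print the triple and the amino acid, and append the abbreviation to the $one_letter_codes variable.
After the last triple, print $one_letter_codes.
Print an error message instead when the input does not consist of complete triples, as with ACTGA or CA. (Hint: use the length of $sequence after you have gotten rid of non-letters with the code above.)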
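The lookups and the length check might fit together roughly as in the sketch below. It is deliberately incomplete and makes several assumptions: the two hashes contain only the triples that appear in the sample run, and the names %amino_acid_for, %one_letter_for, and translate are hypothetical. The sub returns the output lines rather than printing them so that it is easy to try on one input at a time.

```perl
use strict;
use warnings;

# PARTIAL tables for illustration: only the triples from the sample run.
my %amino_acid_for = (
    AAA => 'Lys', CGA => 'Arg', TCA => 'Ser', CCC => 'Pro', TGC => 'Cys',
    TAT => 'Tyr', TAG => '***', CAT => 'His', AGG => 'Arg', CTT => 'Leu',
);
my %one_letter_for = (
    Lys => 'K', Arg => 'R', Ser => 'S', Pro => 'P', Cys => 'C',
    Tyr => 'Y', His => 'H', Leu => 'L', '***' => '*',
);

# Returns the lines of output for one input string (hypothetical sub name).
sub translate {
    my ($sequence) = @_;
    $sequence = uc($sequence);               # convert to uppercase
    $sequence =~ s/[^A-Z]//g;                # remove non-letters
    my $n_chars = length $sequence;
    my @triples = $sequence =~ m/(...)/g;    # separate into triples

    # A length that is not a multiple of 3 means an incomplete triple.
    return ("Error: You must give complete triples.") if $n_chars % 3 != 0;

    my @lines;
    my $one_letter_codes = "";
    for my $triple (@triples) {
        my $amino_acid = exists $amino_acid_for{$triple}
            ? $amino_acid_for{$triple} : "invalid sequence";
        my $abbreviation = exists $one_letter_for{$amino_acid}
            ? $one_letter_for{$amino_acid} : "?";
        push @lines, "$triple $amino_acid";
        $one_letter_codes .= $abbreviation;
    }
    push @lines, "One-letter abbreviations: $one_letter_codes";
    return @lines;
}

print "$_\n" for translate("cgi tat tca");
print "$_\n" for translate("tca cag gtt cg");
```

Using exists for both lookups means an unknown triple such as CGI falls through to "invalid sequence", which in turn is not a key of the second hash and so becomes "?".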
Name the file lastname_firstname_dna.pl and upload it.