Programming Assignment
Hashes

Read everything before doing anything!

Background

DNA is made up of sequences of nucleotides. There are four nucleotides: Adenine, Cytosine, Guanine, and Thymine, represented by the letters A, C, G, and T.

Every set of three nucleotides produces one of twenty amino acids. For example, Cytosine, Thymine, and Guanine (CTG) produce Leucine. Note that there are 64 combinations of three nucleotides (4 × 4 × 4) and only twenty amino acids, so a single amino acid can be produced in several different ways.
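
As a quick check of that count, here is a minimal sketch that builds every possible triple. The names @nucleotides and @all_triples are made up for this illustration and are not part of the assignment.

```perl
use strict;
use warnings;

# Illustrative names only: build every three-letter combination of A, C, G, T.
my @nucleotides = qw(A C G T);
my @all_triples;
for my $first (@nucleotides) {
    for my $second (@nucleotides) {
        for my $third (@nucleotides) {
            push @all_triples, "$first$second$third";
        }
    }
}

# 4 choices for each of 3 positions gives 4 * 4 * 4 = 64 triples.
print scalar(@all_triples), " possible triples\n";
```

A list like this can also be handy later for confirming that a hash of triples has no missing keys.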

The following tables, taken from the BioInformatics tutorial, are Copyright 1999 by Boris Steipe, University of Munich. The first table shows the names of the amino acids and their abbreviations. The second table shows the abbreviations of the amino acids and the nucleotide sequences that produce them.

1 Letter  3 Letters  Name           Mnemonic
A         Ala        alanine        Alanine
C         Cys        cysteine       Cysteine
D         Asp        aspartate      Desparate
E         Glu        glutamate      Eutamate
F         Phe        phenylalanine  Fenylalanine
G         Gly        glycine        Glycine
H         His        histidine      Histidine
I         Ile        isoleucine     Isoleucine
K         Lys        lysine         Kyssing
L         Leu        leucine        Leucine
M         Met        methionine     Methionine
N         Asn        asparagine     asparagiNe
P         Pro        proline        Proline
Q         Gln        glutamine      Qutamine
R         Arg        arginine       "R"-ginine
S         Ser        serine         Serine
T         Thr        threonine      Threonine
V         Val        valine         Valine
W         Trp        tryptophane    tWyptophane (with a lisp)
Y         Tyr        tyrosine       tYrosine

"T" "C" "G" "A"
TTT:Phe TCT:Ser TGT:Cys TAT:Tyr
TTC:Phe TCC:Ser TGC:Cys TAC:Tyr
TTG:Leu TCG:Ser TGG:Trp TAG:***
TTA:Leu TCA:Ser TGA:*** TAA:***
- - - -
CTT:Leu CCT:Pro CGT:Arg CAT:His
CTC:Leu CCC:Pro CGC:Arg CAC:His
CTG:Leu CCG:Pro CGG:Arg CAG:Gln
CTA:Leu CCA:Pro CGA:Arg CAA:Gln
- - - -
GTT:Val GCT:Ala GGT:Gly GAT:Asp
GTC:Val GCC:Ala GGC:Gly GAC:Asp
GTG:Val GCG:Ala GGG:Gly GAG:Glu
GTA:Val GCA:Ala GGA:Gly GAA:Glu
- - - -
ATT:Ile ACT:Thr AGT:Ser AAT:Asn
ATC:Ile ACC:Thr AGC:Ser AAC:Asn
ATG:Met ACG:Thr AGG:Arg AAG:Lys
ATA:Ile ACA:Thr AGA:Arg AAA:Lys

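You will write a program that repeatedly accepts as input a string of ACGT triples and produces:

  • a line for each triple, showing the triple and the three-letter abbreviation of the amino acid it produces
  • a final line with the one-letter abbreviations of those amino acids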

The program will continue to accept input until the user just presses RETURN without entering any DNA codes.
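
One way to structure that "keep asking until the user just presses RETURN" loop is sketched below. The helper name read_sequences is hypothetical, and taking the filehandle as a parameter is an assumption made only so the loop is easy to try out; your program does not have to be organized this way.

```perl
use strict;
use warnings;

# Hypothetical helper: prompts and reads lines from the given filehandle
# until a blank line (or end of input), returning the non-empty lines.
sub read_sequences {
    my ($fh) = @_;
    my @lines;
    while (1) {
        print "Enter a nucleotide sequence, or just press ENTER to quit: ";
        my $sequence = <$fh>;
        last unless defined $sequence;   # end of input
        chomp $sequence;                 # remove the trailing newline
        last if $sequence eq "";         # user just pressed ENTER
        push @lines, $sequence;          # a real program would process it here
    }
    return @lines;
}
```

To read from the keyboard, the call would be read_sequences(\*STDIN). The important detail is the chomp: without it, an "empty" line still contains a newline character and never compares equal to "".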

Here is a sample run of the program. Note that some sequences, such as TAG, do not produce an amino acid. They are listed as *** for their three-letter abbreviation, and * for the one-letter abbreviation. Do not assume that the input will never contain more than four triples. Somebody could enter a string with 40 triples, or 50, or 80, and your program would have to process that string correctly.

Enter a nucleotide sequence, or just press ENTER to quit: aaacgatcaccc
AAA	Lys
CGA	Arg
TCA	Ser
CCC	Pro
One-letter abbreviations: KRSP
Enter a nucleotide sequence, or just press ENTER to quit: cga tgc
CGA	Arg
TGC	Cys
One-letter abbreviations: RC
Enter a nucleotide sequence, or just press ENTER to quit: cgi tat tca
CGI	invalid sequence
TAT	Tyr
TCA	Ser
One-letter abbreviations: ?YS
Enter a nucleotide sequence, or just press ENTER to quit: tag cat agg ctt
TAG	***
CAT	His
AGG	Arg
CTT	Leu
One-letter abbreviations: *HRL
Enter a nucleotide sequence, or just press ENTER to quit: tca cag gtt cg
Error: You must give complete triples.
Enter a nucleotide sequence, or just press ENTER to quit:

Some Code to Help You

Breaking up the string with the sequence into triples is not easy to do unless you know about pattern matching, which we haven’t gotten to yet. Presuming that the input sequence is in a scalar named $sequence, the following Perl code will split it into an array called @triples. Just use this as a “magic spell”; in CIT042B you will understand how and why it works.

$sequence = uc($sequence);   	# convert to uppercase
$sequence =~ s/[^A-Z]//g;    	# remove non-letters
$n_chars = length $sequence;	# use this for error checking
@triples = $sequence =~ m/(...)/ig;  # separate into triples

Thus, if $sequence contained the string "ACT TAG CAT TTG", after running the code, $triples[0] would contain ACT, $triples[1] would contain TAG, $triples[2] would contain CAT, and $triples[3] would contain TTG. The $n_chars variable would contain 12, the total number of letters you entered.

This code only generates complete triples. Thus, the input ACG TC ATG will set $n_chars to 8 (eight letters total), $triples[0] to ACG, and $triples[1] to TCA. The last TG will drop off into the bit bucket. Because 8 is not a multiple of 3, $n_chars tells you that the input did not consist of complete triples.
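
Here is a minimal sketch of how $n_chars might be used for that error check. It wraps the splitting code in a hypothetical subroutine named split_into_triples (not something the assignment requires) that returns an empty list when letters are left over.

```perl
use strict;
use warnings;

# Hypothetical helper: returns the list of triples, or an empty list
# when the cleaned-up input is not a whole number of triples.
sub split_into_triples {
    my ($sequence) = @_;
    $sequence = uc($sequence);          # convert to uppercase
    $sequence =~ s/[^A-Z]//g;           # remove non-letters
    my $n_chars = length $sequence;     # total letters entered
    return () if $n_chars % 3 != 0;     # leftover letters: incomplete triple
    return $sequence =~ m/(...)/g;      # separate into triples
}

my @triples = split_into_triples("acg tc atg");   # 8 letters
print "Error: You must give complete triples.\n" unless @triples;
```

The % operator gives the remainder after division, so $n_chars % 3 is zero only when the letters divide evenly into triples. Note that this sketch also returns an empty list for empty input, so the "just pressed ENTER" case should be handled before calling it.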

Algorithm

Here is one possible algorithm for solving the problem. It is not the only way to do it, but it is relatively straightforward.

  1. Set up a hash with triples as keys and three-letter amino acid abbreviations as values. For now, let’s call this the “acid hash.”
  2. Set up a hash with three-letter abbreviations as keys and one-letter abbreviations as the values. Call this the “singles hash.” It will be advantageous to have an entry in this hash with a key of "***" and a value of "*".
  3. Read in a string of triples.
  4. While the string of triples is not empty, do the following:
    1. Set a variable named $one_letter_codes to ""
    2. Split the string into triples as described above.
    3. If the string has a complete set of triples, do the following; otherwise give an error message:
      • For each triple, look it up in the acid hash. If it doesn’t exist in the acid hash, then use "invalid sequence" as the value. This result is the “amino acid.”
      • Print the triple and the amino acid.
      • Look up the amino acid in the singles hash and find the corresponding single letter. If it doesn’t exist, use a "?" as the result. Call this result the “abbreviation.”
      • Concatenate the abbreviation to the $one_letter_codes variable.
      • After you have finished going through all the triples, print out whatever is in $one_letter_codes.
    4. Read in another string of triples.
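
The two hash lookups in step 3 can be sketched as follows. This is only an illustration under stated assumptions: the hashes hold just a few entries (the real ones need all 64 triples and all twenty amino acids), and the names %acid_hash, %singles_hash, and look_up_triple are made up for this example.

```perl
use strict;
use warnings;

# Only a few entries are shown here; the real hashes must be complete.
my %acid_hash = (
    AAA => 'Lys',
    CGA => 'Arg',
    TAT => 'Tyr',
    TAG => '***',
);
my %singles_hash = (
    Lys   => 'K',
    Arg   => 'R',
    Tyr   => 'Y',
    '***' => '*',
);

# Hypothetical helper: returns (amino acid, one-letter abbreviation) for a triple.
sub look_up_triple {
    my ($triple) = @_;
    my $amino_acid = exists $acid_hash{$triple}
        ? $acid_hash{$triple}
        : 'invalid sequence';
    my $abbreviation = exists $singles_hash{$amino_acid}
        ? $singles_hash{$amino_acid}
        : '?';
    return ($amino_acid, $abbreviation);
}

my $one_letter_codes = "";
for my $triple (qw(CGI TAT TAG)) {
    my ($amino_acid, $abbreviation) = look_up_triple($triple);
    print "$triple\t$amino_acid\n";
    $one_letter_codes .= $abbreviation;   # concatenate the abbreviation
}
print "One-letter abbreviations: $one_letter_codes\n";
```

Because "invalid sequence" is not a key in the singles hash, an invalid triple falls through to "?" without any special case, and the "***" entry lets TAG, TGA, and TAA produce "*" in the same way.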

Program Requirements

When You Finish

Name the file lastname_firstname_dna.pl and upload it.