Programming Assignment
Hashes

Read everything before doing anything!

Background

DNA is made up of sequences of nucleotides. There are four nucleotides: Adenine, Cytosine, Guanine, and Thymine, represented by the letters A, C, G, and T.

Every set of three nucleotides (a triple) codes for one of twenty amino acids or, in a few cases, for no amino acid at all. For example, Cytosine, Thymine, and Guanine (CTG) produce Leucine. There are 4 x 4 x 4 = 64 combinations of three nucleotides but only twenty amino acids, so most amino acids can be produced by more than one triple.
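To see where the count of 64 comes from, here is a minimal Perl sketch that builds every possible triple from the four letters. The variable names (@bases, @triples) are chosen for illustration and are not part of the assignment.

```perl
use strict;
use warnings;

my @bases = qw(A C G T);    # the four nucleotide letters
my @triples;

# Three nested loops: one for each position in the triple.
for my $first (@bases) {
    for my $second (@bases) {
        for my $third (@bases) {
            push @triples, "$first$second$third";
        }
    }
}

print scalar(@triples), " triples\n";   # prints "64 triples"
```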

The following tables, taken from the BioInformatics tutorial, are Copyright 1999 by Boris Steipe, University of Munich. The first table shows the names of the amino acids and their abbreviations. The second table shows the abbreviations of the amino acids and the nucleotide sequences which produce them.

1 Letter	3 Letters	Name	Mnemonic
A	Ala	alanine	Alanine
C	Cys	cysteine	Cysteine
D	Asp	aspartate	Desparate
E	Glu	glutamate	Eutamate
F	Phe	phenylalanine	Fenylalanine
G	Gly	glycine	Glycine
H	His	histidine	Histidine
I	Ile	isoleucine	Isoleucine
K	Lys	lysine	Kyssing
L	Leu	leucine	Leucine
M	Met	methionine	Methionine
N	Asn	asparagine	asparagiNe
P	Pro	proline	Proline
Q	Gln	glutamine	Qutamine
R	Arg	arginine	"R"-ginine
S	Ser	serine	Serine
T	Thr	threonine	Threonine
V	Val	valine	Valine
W	Trp	tryptophan	tWyptophane (with a lisp)
Y	Tyr	tyrosine	tYrosine

"T"	"C"	"G"	"A"
TTT:Phe	TCT:Ser	TGT:Cys	TAT:Tyr
TTC:Phe	TCC:Ser	TGC:Cys	TAC:Tyr
TTG:Leu	TCG:Ser	TGG:Trp	TAG:***
TTA:Leu	TCA:Ser	TGA:***	TAA:***
-	-	-	-
CTT:Leu	CCT:Pro	CGT:Arg	CAT:His
CTC:Leu	CCC:Pro	CGC:Arg	CAC:His
CTG:Leu	CCG:Pro	CGG:Arg	CAG:Gln
CTA:Leu	CCA:Pro	CGA:Arg	CAA:Gln
-	-	-	-
GTT:Val	GCT:Ala	GGT:Gly	GAT:Asp
GTC:Val	GCC:Ala	GGC:Gly	GAC:Asp
GTG:Val	GCG:Ala	GGG:Gly	GAG:Glu
GTA:Val	GCA:Ala	GGA:Gly	GAA:Glu
-	-	-	-
ATT:Ile	ACT:Thr	AGT:Ser	AAT:Asn
ATC:Ile	ACC:Thr	AGC:Ser	AAC:Asn
ATG:Met	ACG:Thr	AGG:Arg	AAG:Lys
ATA:Ile	ACA:Thr	AGA:Arg	AAA:Lys

You will write a program that repeatedly accepts as input a string of ACGT triples and produces:

A list of the triples and the corresponding amino acids, one set per line.
The one-letter abbreviations for the amino acids, all on one line.

The program will continue to accept input until the user just presses RETURN without entering any DNA codes.

Here is a sample run of the program. Note that some sequences, such as TAG, do not produce an amino acid. They are listed as *** for their three-letter abbreviation, and * for the one-letter abbreviation. Do not presume that there will never be more than 4 sequences. Somebody could enter a string with 40 three-letter sequences, or 50, or 80, and your program would have to process that string correctly.

Enter a nucleotide sequence, or just press ENTER to quit: aaacgatcaccc
AAA	Lys
CGA	Arg
TCA	Ser
CCC	Pro
One-letter abbreviations: KRSP
Enter a nucleotide sequence, or just press ENTER to quit: cga tgc
CGA	Arg
TGC	Cys
One-letter abbreviations: RC
Enter a nucleotide sequence, or just press ENTER to quit: cgi tat tca
CGI	invalid sequence
TAT	Tyr
TCA	Ser
One-letter abbreviations: ?YS
Enter a nucleotide sequence, or just press ENTER to quit: tag cat agg ctt
TAG	***
CAT	His
AGG	Arg
CTT	Leu
One-letter abbreviations: *HRL
Enter a nucleotide sequence, or just press ENTER to quit: tca cag gtt cg
Error: You must give complete triples.
Enter a nucleotide sequence, or just press ENTER to quit:

Some Code to Help You

Breaking up the string with the sequence into triples is not easy to do unless you know about pattern matching, which we haven’t gotten to yet. Presuming that the input sequence is in a scalar named $sequence, the following Perl code will split it into an array called @triples. Just use this as a “magic spell”; in CIT042B you will understand how and why it works.

$sequence = uc($sequence);   	# convert to uppercase
$sequence =~ s/[^A-Z]//g;    	# remove non-letters
$n_chars = length $sequence;	# use this for error checking
@triples = $sequence =~ m/(...)/ig;  # separate into triples

Thus, if $sequence contained the string "ACT TAG CAT TTG", after running the code, $triples[0] would contain ACT, $triples[1] would contain TAG, $triples[2] would contain CAT, and $triples[3] would contain TTG. The $n_chars variable would contain 12, the total number of letters you entered.

This code only generates complete triples. Thus, this input ACG TC ATG will generate $n_chars as 8 (eight letters total), $triples[0] as ACG and $triples[1] as TCA. The last TG will drop off into the bit bucket.
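The following sketch runs the splitting code on that same input so you can watch the leftover letters disappear. The split_triples subroutine is a hypothetical wrapper added here for illustration; the four statements inside it are the ones given above, with "my" added so that they run under "use strict".

```perl
use strict;
use warnings;

# Hypothetical helper (not required by the assignment): wraps the
# splitting code so its results can be inspected.
sub split_triples {
    my ($sequence) = @_;
    $sequence = uc($sequence);              # convert to uppercase
    $sequence =~ s/[^A-Z]//g;               # remove non-letters
    my $n_chars = length $sequence;         # letter count, for error checking
    my @triples = $sequence =~ m/(...)/g;   # complete triples only
    return ($n_chars, @triples);
}

my ($n_chars, @triples) = split_triples("ACG TC ATG");
print "$n_chars letters: @triples\n";       # prints "8 letters: ACG TCA"
```

Because 8 is not a multiple of 3, the value in $n_chars is enough to tell you that the input did not consist of complete triples.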

Algorithm

Here is one possible algorithm for solving the problem. It is not the only way to do it, but it is relatively straightforward.

Set up a hash with triples as keys and three-letter amino acid abbreviations as values. For now, let’s call this the “acid hash.”
Set up a hash with three-letter abbreviations as keys and one-letter abbreviations as the values. Call this the “singles hash.” It will be advantageous to have an entry in this hash with a key of "***" and a value of "*".
Read in a string of triples.
While the string of triples is not empty, do the following:
1. Set a variable named $one_letter_codes to ""
2. Split the string into triples as described above.
3. If the string has a complete set of triples, do the following; otherwise give an error message:
  - For each triple, look it up in the acid hash. If it doesn’t exist in the acid hash, then use "invalid sequence" as the value. This result is the “amino acid.”
  - Print the triple and the amino acid.
  - Look up the amino acid in the singles hash and find the corresponding single letter. If it doesn’t exist, use a "?" as the result. Call this result the “abbreviation.”
  - Concatenate the abbreviation to the $one_letter_codes variable.
  - After you have finished going through all the triples, print out whatever is in $one_letter_codes.
4. Read in another string of triples
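The lookup steps above can be sketched as follows. This is only an illustration under stated assumptions, not a complete solution: the two hashes hold just a handful of entries, the translate subroutine is a hypothetical name, and the loop that reads input from the user is left out.

```perl
use strict;
use warnings;

# Abbreviated tables for illustration only; a real solution needs
# all 64 triples and all twenty amino acids.
my %acid   = (AAA => 'Lys', CGA => 'Arg', TCA => 'Ser', TAG => '***');
my %single = (Lys => 'K', Arg => 'R', Ser => 'S', '***' => '*');

# Hypothetical helper: returns the output lines for one input string.
sub translate {
    my ($sequence) = @_;
    $sequence = uc($sequence);
    $sequence =~ s/[^A-Z]//g;
    my $n_chars = length $sequence;
    return ("Error: You must give complete triples.") if $n_chars % 3 != 0;

    my @triples = $sequence =~ m/(...)/g;
    my $one_letter_codes = "";
    my @lines;
    for my $triple (@triples) {
        # Fall back to "invalid sequence" when the triple is not a key.
        my $amino = exists $acid{$triple} ? $acid{$triple} : "invalid sequence";
        push @lines, "$triple\t$amino";
        # Fall back to "?" when the amino acid is not a key.
        $one_letter_codes .= exists $single{$amino} ? $single{$amino} : "?";
    }
    push @lines, "One-letter abbreviations: $one_letter_codes";
    return @lines;
}

print "$_\n" for translate("aaa cgi tag");
```

With these sample entries, the call prints one line for each of AAA, CGI, and TAG, followed by "One-letter abbreviations: K?*". Note that "invalid sequence" is never a key in the singles hash, which is why the "?" fallback works without any extra test.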

Program Requirements

You must use at least one hash to implement this program. This page may help you construct your hash; it’s just the preceding tables without all the HTML and punctuation.
If you get an invalid sequence (one with a letter other than A, C, G, or T), print “invalid sequence” or some other meaningful message in your list and use a “?” as the one-letter abbreviation, as shown in the sample run.
You must provide an error message if the string has extra or missing letters, such as ACTGA or CA. (Hint: use the length of $sequence after you have gotten rid of non-letters with the code above.)
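For the first requirement, typing 64 hash entries by hand is one option. Another is to let Perl build the hash from rows in the TRIPLE:Abbreviation form used by the second table. The sketch below assumes that approach and uses only two sample rows; the names @rows and %acid are illustrative, and split is a built-in function you may not have met yet.

```perl
use strict;
use warnings;

# Two rows copied from the second table; a full solution would list all 16.
my @rows = (
    "TTT:Phe TCT:Ser TGT:Cys TAT:Tyr",
    "TTG:Leu TCG:Ser TGG:Trp TAG:***",
);

my %acid;
for my $row (@rows) {
    for my $entry (split ' ', $row) {               # entries separated by whitespace
        my ($triple, $abbrev) = split /:/, $entry;  # "TTT:Phe" gives ("TTT", "Phe")
        $acid{$triple} = $abbrev;
    }
}

print "$_ => $acid{$_}\n" for sort keys %acid;
```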

When You Finish