SMS project idea

November 29th, 2009

I would like to look more closely at the currently published structure of RAI1 as presented in the Toulouse 2003 paper.  This paper uses informatics techniques, such as comparing the gene structures of related organisms and combining ESTs and partial mRNAs from NCBI, as well as wet lab techniques to build a structure of the human RAI1 gene.  The Elsea lab, however, suspects this structure to be incorrect.  I would like to closely examine the techniques and chain of reasoning that led the authors to their final conclusion, focusing specifically on gaps in the logic and possible alternative explanations of the data presented.  I will also include the reasoning that led the Elsea lab to suspect that this structure is in fact incorrect.  If time permits, I would also like to research what is currently known about the function of RAI1 in the human body.


Hands on informatics

November 8th, 2009

I learned several major things while doing the Markham paper project.  For one, programs and algorithms (especially the free ones) don’t always work, and they don’t always do exactly what you want them to do.  Part of this may have to do with the fact that we, as an undergraduate class, only have a small idea of the full functionality of these programs, and we may be making simple mistakes in the options or input format or any host of other settings.  Part of it, however, is simple interface design.  User friendly sites are difficult to come by.  Most instructions and settings are written for people who understand how the program works at a deep level.  The vocabulary is specific, and often in symbols and abbreviations.

Secondly, manipulation of various settings, sometimes at random, is a part of the informatics game.  This bothers me as someone training in science.  I want to have a strict protocol that gives specific settings for specific situations to get specific results, but instead it seems that the goal is just to change things around until you get a result that makes sense.  I read somewhere that in informatics the difference between getting gibberish and getting results is finding exactly the right parameters.

Lastly, many different interpretations can fit the data you find.  It seems, in informatics especially, it can be difficult to tell whether your results are meaningful or if your seeming results are simply artifacts present either in your input data or created by the various parameters you set in analyzing the input data.  In the end, I think I have come to regard informatics as a tool.  It is a good tool, but one that must be backed up by more conclusive research.

What to work on next:  I’d like to find a program that will give me more detailed information about a protein alignment.  Any program that is able to come up with a consensus sequence should be able to do this.  Surely programs with this functionality exist.
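As a rough idea of the kind of detail I am after, here is a minimal Python sketch that reports, for each column of an alignment, the most common residue and the fraction of sequences that share it.  It assumes the sequences have already been aligned to the same length by some other tool; the function names and the toy alignment are made up for illustration and do not come from any particular program.

```python
from collections import Counter

def column_summary(aligned):
    """For each alignment column, return (most common residue, fraction sharing it)."""
    length = len(aligned[0])
    if any(len(seq) != length for seq in aligned):
        raise ValueError("sequences must already be aligned to the same length")
    summary = []
    for column in zip(*aligned):
        residue, count = Counter(column).most_common(1)[0]
        summary.append((residue, count / len(aligned)))
    return summary

def consensus(aligned):
    """Join the most common residue of every column into one string."""
    return "".join(residue for residue, _ in column_summary(aligned))

# Hypothetical toy alignment ("-" marks a gap); not real data.
toy_alignment = ["MKTLLV", "MKSLLV", "MRTLLV", "MKTL-V"]

print(consensus(toy_alignment))
for position, (residue, fraction) in enumerate(column_summary(toy_alignment), start=1):
    print(position, residue, f"{fraction:.2f}")
```

The per-column fraction is the extra detail a bare consensus sequence hides: a column where every sequence agrees and a column where the winner barely has a majority look the same in the consensus string.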


Stephen Williams talk

November 8th, 2009

I am excited to have met my first member of the actual Elsea lab;  I hear some of what’s going on second hand from Dr. Zies, but I am glad to know a little about this other facet of the research on SMS.  I also enjoyed lunch with Stephen Williams and some of the informatics crew and chatting about grad school with him and Dr. Grana.

Like much of informatics, Stephen’s project makes guesses and generalizations that can point future research in the right direction.  Whole genome array comparative genomic hybridization (aCGH), while much more efficient than the older karyotype method, is still only able to find relatively large problems in the genome.  He found issues in, I believe he said, about six patients out of fifty-two.  All the subjects had phenotypes that are more than likely caused by a genetic abnormality.  We can only assume that the technique’s resolution is not fine enough to detect the genetic abnormalities of the remaining forty-odd patients.  This is certainly not to downplay the effectiveness of the technique as a blind screening tool.  It seems to be relatively straightforward, fast and efficient, and obviously produces some real results.  It’s amazing that any technique can screen the entire genome at once.  Williams mentioned a few positive results that correspond to previously known disorders.  I am interested to read the paper and see if he identified any regions that are possible candidates for being involved in the same pathway as RAI1.


Swami and beginning HIV research

October 28th, 2009

Sites:

BioQuest: information on HIV and the Markham paper specifically

Swami: program for combining and storing data from different tools

I’ve decided on a preliminary topic for my research project: Are there any conserved domains within the region of the env gene sequenced by the Markham group?  I’m not yet sure if this project is feasible.  I may very well not be sure until I actually begin.  Several things are clear from the start; namely, the region sequenced has been shown to tolerate rapid mutation, so I would be looking for small conserved regions within a larger region that is prone to mutation.  Also, the Markham group used primers adjacent to the specific 285-bp region of interest, meaning that the primer region must not mutate to an extent that would prevent the primers from binding.

I am also unsure of how many sequences I will need to analyze in order to be fairly confident in calling certain regions “conserved.”  Preliminary examination of the average mutation rate versus the size of the sequence to be analyzed seems to suggest that not enough data is available and/or I will not easily be able to analyze enough of the available data to come to any accurate conclusion.  However, I don’t believe that this is a huge issue.  The project is about learning to use the analysis software, not about obtaining high quality results.  I should be able to come up with some results based on the data available to me.  As long as my results reflect the data input, I believe the goals of the project will have been accomplished.

I am still having difficulty with Swami.  If I cannot get it to work, I will attempt to use each tool individually or use the original Biology Workbench or similar site.

Next steps:  I believe I am ready to dive right in and learn as I go.  I will choose a website/program, import some data, and begin testing the functionality.  I need to be able to compare a large number of amino acid sequences (if available) to each other and come up with any regions where they all agree.
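To pin down what "regions where they all agree" could mean in practice, here is a small Python sketch that scans aligned sequences column by column and reports stretches where every sequence carries the same residue.  It is only a sketch under assumptions: the sequences are taken to be already aligned and of equal length, the `min_length` cutoff is an arbitrary choice of mine, and the example sequences are invented rather than taken from the Markham data.

```python
def conserved_runs(aligned, min_length=2):
    """Return (start, end) column ranges (end exclusive) where all sequences agree."""
    runs = []
    start = None
    for i, column in enumerate(zip(*aligned)):
        # A column counts as conserved only if every residue matches and it is not a gap.
        identical = len(set(column)) == 1 and column[0] != "-"
        if identical and start is None:
            start = i
        elif not identical and start is not None:
            if i - start >= min_length:
                runs.append((start, i))
            start = None
    # Close a run that reaches the end of the alignment.
    total = len(aligned[0])
    if start is not None and total - start >= min_length:
        runs.append((start, total))
    return runs

# Invented amino acid sequences, assumed to be pre-aligned.
toy_sequences = ["ACDEFGHIKL", "ACDEYGHIKL", "ACDEFGHIKM"]

for start, end in conserved_runs(toy_sequences):
    print(start, end, toy_sequences[0][start:end])
```

With only three sequences, a short run of agreement could easily arise by chance, which is the same sample size worry raised above; the sketch only finds the runs, it says nothing about whether they are meaningful.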


Wormbase assignment

September 30th, 2009

1.  brc-1

2.  none

3.  homolog to human gene involved in early onset breast and ovarian cancer, protein forms a heterodimer, gives phenotype of brc-1 (RNAi) worms.

4.  human homolog: brca1, involved in DNA repair, cell cycle control, and genomic stability.  mutations involved in early onset breast and ovarian cancer.

5.  not assigned for this protein

6.  ”

7.  ”

8.  ”

9.  ”

10.  yes, one confirmed cDNA, one partially confirmed by cDNA

11.  Using the first version of the gene (C36A4.8a): coding region – 1839 bp, transcript – 9721 bp

12.  612 aa, yes this is the number expected (1839 bp / 3 = 613 codons, one of which is the stop codon)

13.  III

14.  10 predicted exons, 9 introns

15.  unknown

16.  unknown

17.  cell death, slow growth, chromosome morphology variant, high incidence of male progeny

18.

Zinc finger, C3HC4 RING-type
Zinc finger, C3HC4 type (RING finger) [more ...]
Zinc finger, RING-type
Zinc finger, RING-type, conserved site
BRCA1 C Terminus (BRCT) domain [more ...]
BRCT
BRCA1

19.  Biological process: DNA repair, apoptosis, meiotic chromosome segregation…

Cellular component: BRCA1-BARD1 complex

Molecular function: protein binding

20.  inferred from mutant phenotype, direct assay, or physical interaction.  I feel pretty confident about these inferences

21.  two

22.  full list of orthologs is not available, five different species of nematodes are listed as having orthologs

23.  brca1

24.  C. japonica, C. brenneri…. H. sapiens…

25.  best hit: C. japonica 3.7e-187

H. sapiens % length: 86.3%

26.  link seems to be broken

27.  there are publications from worm meetings, peer reviewed journals, and three entries from wormbook


Genome annotation

September 28th, 2009

Today I downloaded Sequencher, read more of the HGP paper, and added links to papers I’m reading about SMS to my blog.  I am also attempting to keep updating my post about software to include everything we’ve talked about in class and everything I have happened to run across on my own.  I will also continue to update my post on terms from the HGP paper as I read further.

I enjoyed working with Glimmer, GeneMark, and Apollo today.  It seems that by following the complete protocol listed in the document on Blackboard, it should be possible to come up with a pretty decent approximation of the correct gene content of a genome, especially one as small and simple as a prokaryote.  Surely, however, this is still regarded only as an approximation.  Isolating a cDNA of the putative gene would seem to be the confirmation step needed (unless of course it’s a pseudogene…).  I wonder if the method of identification of a gene is clearly listed on NCBI?  It is exciting to think that in the future, as we refine these techniques, entire genomes will be sequenced and annotated relatively quickly and cheaply.

I am very excited about getting to know Sequencher.  I hope to be able to use some of these programs in more depth, for more than just a week or so.  Perhaps this will come with the large-scale projects we will soon begin.


Terms from the HGP paper

September 28th, 2009

I’m looking up terms that I don’t know as I read through this paper, so I thought I would list them here:

CpG islands – stands for cytosine-phosphate-guanine in linear order as opposed to a cytosine base paired to a guanine.  CpG islands are areas of high CpG content and often appear in promoter regions.  Methylation of CpG islands in a promoter region can lead to silencing of the gene; hypomethylation can lead to overexpression.

Alu elements – first identified as repeated sequences containing a site cut by the AluI restriction endonuclease.  It is unclear to me why these are important.  Perhaps they will be further discussed later in the paper.

Linkage disequilibrium mapping – identifying SNPs or other DNA markers that co-occur with a particular disease more than they would by random chance.
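To make "high CpG content" from the CpG islands entry a little more concrete, here is a minimal Python sketch of two numbers commonly computed for a stretch of DNA: the GC fraction and the ratio of observed to expected CpG dinucleotides.  The ratio used here, (number of CpG × length) / (number of C × number of G), is a widely used convention and is my assumption, not something defined in the HGP paper excerpt I have read so far; the function name and test sequences are made up.

```python
def cpg_stats(seq):
    """Return (GC fraction, observed/expected CpG ratio) for a DNA string."""
    seq = seq.upper()
    n = len(seq)
    c = seq.count("C")
    g = seq.count("G")
    cpg = seq.count("CG")  # C immediately followed by G on the same strand
    gc_fraction = (c + g) / n if n else 0.0
    # Expected CpG count if C and G were placed independently: (c * g) / n
    obs_exp = (cpg * n) / (c * g) if c and g else 0.0
    return gc_fraction, obs_exp

# Made-up sequences for illustration only.
print(cpg_stats("CGCGCGCG"))
print(cpg_stats("ATATATAT"))
```

Note that the count is of C followed by G along one strand, which matches the "linear order" point in the definition, not C paired with G across the two strands.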


Dot Plots (Math minute 2.1)

September 23rd, 2009

Math minute 2.1

1.  It would have the opposite slope.

2.  The section after the deletion would be shifted off the midline

3.  Comparing two sequences will find regions of identity which may indicate homology or common function such as a similar binding domain.  I can’t wrap my head around the reason for the reverse complement…

4.  It will contain two 20kb runs off the midline.

5.  It should display 30 (5 per repeat times 6 repeats) 50bp runs off the midline.

6.  I’m still not sure about this reverse complement thing.

Question:  With a sliding window, bases at the ends of a run get left off.  For example, with a sliding window of three, the first and last base of a run will be left off, and with a larger sliding window, more bases will be left off the ends.  Why not just specify the minimum size of a run to be displayed?
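To check my own understanding of the question, here is a small Python sketch of a dot plot that places a dot wherever a window of bases matches exactly.  It assumes exact matching with no mismatches allowed, which may be stricter than the dot plot programs we used; the function name and sequences are invented for the example.

```python
def dot_plot(seq_a, seq_b, window=3):
    """Return the set of (i, j) where the window starting at i in seq_a equals the one at j in seq_b."""
    dots = set()
    for i in range(len(seq_a) - window + 1):
        for j in range(len(seq_b) - window + 1):
            if seq_a[i:i + window] == seq_b[j:j + window]:
                dots.add((i, j))
    return dots

# Two made-up sequences that share one 7-base run with different flanks.
shared = "GATTACA"
seq_a = "CCCC" + shared + "CCCC"
seq_b = "GGGG" + shared + "GGGG"

dots = dot_plot(seq_a, seq_b, window=3)
print(sorted(dots))
print(len(dots), "dots for a shared run of", len(shared), "bases")
```

In this exact-match sketch, a shared run of L bases produces L - window + 1 dots, so the run looks shorter on the plot as the window grows, which is the effect described in the question.  Under the same assumption, every dot still marks a run at least as long as the window, so the window is acting as a minimum run size already.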


Sequence analysis software

September 21st, 2009

After speaking with Dr. Zies, we have decided to analyze my traces by hand and also by computer software in order to compare the results as well as the ease of getting them.  Two basic functions are of interest to me: base calling and sequence alignment (in order to form a contiguous sequence).  After sloshing through traces by hand today, I have a better idea of the specific functionality I want from a computer program.

  • While a quantitative base calling software such as PHRED would perhaps be interesting to experiment with, I believe that my data size is small enough that this can just as easily be done by hand.  It is fairly easy for us to trim the beginning and end of the trace so that we are only looking at high quality data.  Also, heterozygous SNPs are not as difficult to see as I had assumed.  True, finding them necessitates scanning each trace line by line, but a single trace can be scanned in around a minute and I only have 40 or so total to scan.  No real hardship there.
  • The real headache involves lining up each trace against the print-out of the RAI1 gene, noting areas of overlap and bases of interest (possible SNPs, etc.), and determining for certain that we have sequenced every base at least twice (generally once in the forward direction and once in reverse).  This is especially difficult with reverse primers because we do not have a print-out of the gene in the reverse direction and thus have to mentally complement each base (A↔T and C↔G) as well as read backwards.  I am certain that there is an easy way to avoid this; I just have to figure out what it is.
  • All in all, something that will align my sequences to each other and to a master gene automatically will prevent much blurry vision and searching pages of genetic code.
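The mental translation described for reverse primers can be done mechanically.  Here is a minimal Python sketch that reverse complements a read so it can be compared directly against the forward print-out of the gene.  The reference and read below are short made-up strings, not RAI1 sequence, and a plain exact search stands in for a real alignment, which would also have to tolerate miscalled bases.

```python
# Map each base to its complement, keeping upper or lower case.
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")

def reverse_complement(seq):
    """Complement every base, then reverse the order."""
    return seq.translate(COMPLEMENT)[::-1]

# Hypothetical forward reference and a read from a reverse primer.
reference = "GGATGCCTTA"
reverse_read = "AAGGCA"

as_forward = reverse_complement(reverse_read)
print(as_forward)                  # the read rewritten in the forward direction
print(reference.find(as_forward))  # 0-based position in the reference, -1 if absent
```

Converting each reverse read once and then treating it like a forward read avoids both the base swapping and the backwards reading.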

I don’t know what any of these programs really do yet, but I am beginning to assemble a list to research:

Phred, Phrap, and Consed

Biology Workbench

CodonCode

Sequencher

Geneious

Softgenetics

Apollo

CAP Sequence Assembly Machine


Introduction to SMS

September 21st, 2009

This semester I will be writing a proposal on my research project.  I read this paper in order to produce an introductory paragraph of background information on SMS.