Timeline: Organisms that have had their genomes sequenced (yourgenome.org)
99 points by lelf on Jan 20, 2017 | 34 comments



Note that this is a very incomplete list. For more complete overviews, see the Quick Guide to Sequenced Genomes [1], the NCBI's list of genome sequences [2], or the Wikipedia lists of sequenced bacterial [3] and eukaryotic [4] genomes (and their related lists).

[1]: http://www.genomenewsnetwork.org/resources/sequenced_genomes...

[2]: https://www.ncbi.nlm.nih.gov/genome/browse/

[3]: https://en.wikipedia.org/wiki/List_of_sequenced_bacterial_ge...

[4]: https://en.wikipedia.org/wiki/List_of_sequenced_eukaryotic_g...


This list highlights the most important organisms that have had their genomes sequenced, but look at [1] to see the full list of 22,244 organisms that have been sequenced so far.

Most human pathogens, like Staphylococcus aureus, Streptococcus pneumoniae, Escherichia coli, Salmonella enterica, Mycobacterium tuberculosis, ... have several thousand assemblies each.

[1] https://www.ncbi.nlm.nih.gov/genome/browse/


Many are just deconvolutions of microbiomes.

Not that I'm against it as I'm very much into metagenomics, but many of these genome assemblies can barely qualify as draft genomes.

Also, yet another important point here is the strain-level resolution that the depth of sequencing has afforded us. For example, if one were to look at staph or any other human pathogen, you'll find at the bare minimum tens of strains, if not more.


When we say that we've "sequenced a genome" — what does that mean, exactly? Just that we've scanned/digitized the DNA sequence of a single member of that species? (Since of course the DNA of all members of a species isn't the same, otherwise, we'd all look alike.) Do we ever sample more than one member of a species? How many members of various species have we sampled? (While I'd expect them to be similar, of course, I'd also expect the interesting things to be in the differences.)


https://en.wikipedia.org/wiki/Human_Genome_Project

"The genome published by the HGP does not represent the sequence of every individual's genome. It is the combined mosaic of a small number of anonymous donors, all of European origin."

I work at a for-profit DNA company which sequences humans for genealogical purposes. The human reference, as we call it, has been through several 'revisions'. My company uses anywhere from "build 36" to "build 38", depending on the specific sequencing product purchased. It's an ongoing effort to upgrade older products to use the newer sequencing version: each revision may change, add, or remove genomic locations of specific named variants. That can wreak havoc, of course, when trying to translate older data to the newer revision. Customers would of course notice right away if their genome data were to change on them! It creates rather a conundrum to solve.


> "sequenced a genome" — what does that mean, exactly?

It means we got the sequence of As, Cs, Gs and Ts; but in practice, we don't even achieve that.

The human genome has approximately 3 billion nucleotides in 23 chromosomes. So you would think a "sequenced genome" is 23 strings over the alphabet {A,C,G,T} with a length of 130 million or so each. However, it is only possible to sequence a few hundred nucleotides at a time. So you end up with a jigsaw puzzle of tiny pieces, some containing errors, some missing, some exact duplicates of each other. Putting these back together correctly isn't even possible; doing it approximately is still difficult.

So in practice, a sequenced genome is a sequence of medium size strings, arranged in the mostly correct order, with many small and a few large gaps, so they form a good chunk of the long strings that would really be a genome. And even if all of that was completely correct, we still wouldn't know what most of these nucleotides mean.
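To make the jigsaw-puzzle idea concrete, here is a toy sketch (nothing like a real assembler, and assuming error-free reads) that greedily merges reads by their longest overlaps:

    def overlap(a, b, min_len=3):
        # Length of the longest suffix of `a` that equals a prefix of `b`.
        for length in range(min(len(a), len(b)), min_len - 1, -1):
            if a[-length:] == b[:length]:
                return length
        return 0

    def greedy_assemble(reads):
        # Repeatedly merge the pair of reads with the largest overlap.
        reads = list(reads)
        while len(reads) > 1:
            best_len, best_i, best_j = 0, None, None
            for i, a in enumerate(reads):
                for j, b in enumerate(reads):
                    if i != j and overlap(a, b) > best_len:
                        best_len, best_i, best_j = overlap(a, b), i, j
            if best_len == 0:   # no overlaps left: disjoint contigs remain
                break
            merged = reads[best_i] + reads[best_j][best_len:]
            reads = [r for k, r in enumerate(reads) if k not in (best_i, best_j)] + [merged]
        return reads

    # Error-free toy reads taken from the sequence "ACGTACGGATCCA"
    print(greedy_assemble(["ACGTACG", "TACGGAT", "GGATCCA"]))  # ['ACGTACGGATCCA']

Real data adds sequencing errors, repeats and millions of reads, which is why even the approximate version stays hard.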

The first "draft" of a genome is usually even worse. If you simply dump DNA fragments from a mammal onto a modern day sequencer (Illumina HiSeq), and run software to kind-of solve the jigsaw puzzle, you end up with a million pieces, each about 2000 nucleotides in length, in no particular order. Since most genes by far are longer than that, I'd say it's fraudulent to call something like that a genome, but it's being done all the time.

> Do we ever sample more than one member of a species?

Yes, but not in the same way. Instead of (approximately) solving the impossible jigsaw puzzle again (that's called de-novo assembly), we tend to compare the puzzle pieces from a new individual against the big picture we got from the first genome (that's called mapping). This way, we can see small differences (single substitutions, small insertions and deletions), which is most of the variation present in a species. What we can't see is large scale variation, like duplicated genes, or rearrangements of large pieces of the genome, or insertions of retroviruses. We know these exist, it's just more expensive to look for them. (All of this is changing as sequencing becomes cheaper.)
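A much-simplified sketch of what mapping does, using made-up sequences and assuming the only differences are substitutions (real mappers such as BWA also handle indels, errors and genome-scale indexes):

    def map_read(reference, read, max_mismatches=2):
        # Slide the read along the reference and keep the best-matching position.
        best_pos, best_mm = None, max_mismatches + 1
        for pos in range(len(reference) - len(read) + 1):
            mm = sum(1 for r, q in zip(reference[pos:pos + len(read)], read) if r != q)
            if mm < best_mm:
                best_pos, best_mm = pos, mm
        return best_pos

    def call_substitutions(reference, read, pos):
        # Report single-base differences between the mapped read and the reference.
        return [(pos + i, r, q)
                for i, (r, q) in enumerate(zip(reference[pos:pos + len(read)], read))
                if r != q]

    reference = "ACGTACGGATCCATTG"   # stand-in for the assembled "big picture"
    read = "GGATACAT"                # read from a new individual, one substitution
    pos = map_read(reference, read)
    print(pos, call_substitutions(reference, read, pos))  # 6 [(10, 'C', 'A')]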

Multiple individuals of a species get sequenced when there is interest. Probably tens of thousands of humans have been sequenced by now. For model organisms, the number is much lower. For pathogens, where everything is much easier anyway, the number could easily be higher.


When they compare the sequenced genomes of two species, say mouse and human, do they compare the entire genomes or just the known coding regions? Are there similar regions in mouse and human that we have no clue what they are about?


What's compared is usually the "homologous regions". That means literally "similar because of common descent", but in practice, it means "similar enough that we think it's homologous". Since parts of genomes can become duplicated, we sometimes have more than one homologous piece in a different species---and then it's typically excluded from the comparison, because we don't know which of the two pieces to compare to.

As a practical example, it has been reported that humans and chimpanzees "are 98.6% identical". That's shorthand for "in the detectably homologous parts of their genomes, 98.6% of the nucleotides of chimpanzees and humans are identical". This ignores about 25% of the chimpanzee genome and 20% of the human genome, for which no homologous sequence is known. It also ignores small insertions and deletions, because it's unclear how to define similarity in the presence of those.
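Roughly, the headline number comes from a calculation like this (a toy alignment with made-up sequences; '-' marks an insertion/deletion, which is simply excluded from the count):

    def percent_identity(aligned_a, aligned_b):
        # Identity over aligned, gap-free positions only; gap columns are ignored.
        pairs = [(a, b) for a, b in zip(aligned_a, aligned_b) if a != '-' and b != '-']
        matches = sum(1 for a, b in pairs if a == b)
        return 100.0 * matches / len(pairs)

    # Hypothetical alignment of one homologous region.
    seq_human = "ACGT-ACGGATCCA"
    seq_chimp = "ACGTTACGGTTCCA"
    print(round(percent_identity(seq_human, seq_chimp), 1))  # 92.3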

There are definitely "similar regions" we have no clue about. Comparisons are definitely not restricted to protein coding regions; we typically compare at least whole genes. That includes introns, which sometimes contain regulatory elements, but most of the time nobody knows what, if anything, they do. The same goes for the untranslated regions of genes (5'- and 3'-UTR), which definitely contain regulatory elements, but for the most part aren't understood. And there are probably homologous intergenic regions, too, and we know nothing about their function.

I don't remember exact numbers about human-mouse comparisons, but I don't need them to know that we end up comparing incomprehensible stuff: We "understand" 1.5% of the genome---that's the protein coding pieces. That's shallow understanding, but at least something. We could pretend that we also understand any piece of the genome that contains a few scattered regulatory elements. Counting very generously, that might be 15% of the genome. The remainder is completely mysterious. In the human-chimpanzee comparison, ~70% of the genomes are comparable. Clearly, we have no idea what most of that stuff does.


"Sequencing a genome" usually means coming up with an assembly that's "good enough" for some purposes.

Assembling a genome means fitting together all the raw data of short reads into a larger scaffolding.

A few years ago there was an assessment of many of the current methods here:

http://gage.cbcb.umd.edu/index.html

Since then there's been lots of additional work, but looking through those algorithms you can see what some consider "good enough" to call a genome sequenced.


Does anyone know how expensive it is to fully sequence your own genome today (not genotyping but full sequencing)? And how complete is it, would every bit be correct?


This is super interesting for someone who has no knowledge of the field. It has my interest piqued as to how they do it, what the number of bases means, why mice have more bases than us, etc. Very nice presentation-wise, even if it is incomplete (as other comments state).


I can address some of these.

Sequencing today is done mostly using computational methods. Think of DNA as a couple of long strings (the number of bases is effectively the character count of those strings; each string is a "chromosome" in higher organisms), so the problem is how we read these long, physical strings. It turns out that parallel processing is way more effective, so we break the really long strings into much, much smaller strings that overlap (from millions of characters long down to hundreds, often). Because the strings overlap, we can construct a good portion of the actual sequence computationally by exploiting this overlapping feature of our small strings.
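In code, the "break it into overlapping small strings" step looks roughly like this (a toy simulation, not any real protocol; read length and coverage are made-up parameters):

    import random

    def shotgun_reads(genome, read_len=100, coverage=8):
        # Sample random substrings so each base is covered ~`coverage` times on average.
        n_reads = coverage * len(genome) // read_len
        return [genome[start:start + read_len]
                for start in (random.randrange(len(genome) - read_len + 1)
                              for _ in range(n_reads))]

    reads = shotgun_reads("ACGT" * 2500, read_len=100, coverage=8)
    print(len(reads), len(reads[0]))  # 800 100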

The physical way they do this is by using machines (Think GPU vs CPU) that are effectively a bunch of parallel microscopes specialized to read those short strings and by "attaching" colors to each of the characters (DNA bases). Initial DNA sequencing methods lacked both the computational and physical devices to do this, so they were done by hand. The move from doing sequencing by hand to doing it computationally is why we see the significant increase in characters read (Number of bases).

Your last comment I think is the most interesting, as it effectively asks "Why do mice have a larger string size than us, which means they contain more information on an absolute level?". The answer is just because. The number of bases, or even the number of blocks of information that produce proteins (These blocks are called genes, and a protein is another chemical construct that mainly focuses on doing actions in the cell), is not strongly correlated with the complexity of the organism. The key is how those bases interact, not necessarily in how many there are.

If you have any more questions or need some clarification I'd love to address them, it is a wonderful time to be alive.


Thanks for volunteering to answer some questions.

1. What is the (maximum) range of read lengths that modern gene sequencers can produce? Any timeline on when those read lengths will increase substantially?

2. How do bioinformatics people contend with repetitive genomic regions?

3. Are there any differences in how gene sequencing technology works on DNA from different species? For example, does an approach that works on humans (e.g. gene sequence alignment or de novo assembly) work on something like wheat?


1. Depends on the technology. On Illumina (cheapest tech and highest throughput), you get the first and last 125 bases of smallish DNA molecules with an acceptable error rate. Pacific Biosciences (lower throughput and more expensive) gives you up to 40,000 bases with a rather horrible error rate.

2. They fail epically. There is nothing you can do computationally. With paired end reads (two reads at an approximately known distance), you still can't assemble repetitive regions, but you can get the contigs around the repeat in the right order (a toy sketch of this follows after point 3).

3. Definitely, but I don't know the details. Plants are often more difficult than animals; they have bigger genomes and often have multiple chromosome sets. Assembly of a wheat genome is more difficult than assembly of the human genome---and I'd argue even the latter isn't actually a solved problem.
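To illustrate point 2, here is a toy sketch of how paired-end links can order contigs around an unassemblable repeat (hypothetical contig names and pair counts; real scaffolders also use distances and orientations):

    from collections import Counter

    # Each entry: the contigs the two ends of one read pair landed in.
    pair_links = [("contigA", "contigB"), ("contigA", "contigB"), ("contigA", "contigB"),
                  ("contigB", "contigC"), ("contigB", "contigC"),
                  ("contigA", "contigC")]  # a single spurious link

    link_counts = Counter(frozenset(link) for link in pair_links)
    # Keep links supported by at least 2 pairs; these give the contig order A-B-C.
    edges = [tuple(sorted(link)) for link, n in link_counts.items() if n >= 2]
    print(edges)  # [('contigA', 'contigB'), ('contigB', 'contigC')]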


This is really interesting. For someone who knows nothing about the subject, how were DNA strands physically read at a low level before computational methods? I was under the impression DNA is too small to see without an electron microscope. You mention reading DNA by hand, and I'm really interested in how that is done.


Given a string, I can easily discern one characteristic, which is length. That's because the length of the string is tied to how "massive" it is and thus when I push on things that are more massive they move more slowly. That's the general idea behind gel separation.

Now, I just need a way to make all the combinations of substrings starting from the first position (0 => 1, 0 => 2, etc.). This is a bit more difficult to explain and chemically intensive, but let's assume for each character (C) we have another character (C') that is pretty much the same thing. The key difference, however, is that C' is marked (Radioactivity or with something that lights up) AND that it doesn't allow any more characters to be added on. If each distinct C' is a different color, we can now distinguish between our different substrings, based entirely on the last character. We know that our strings are ordered by size, so we can construct our original sequence based on the terminal member of the substrings.

You can imagine this process being done by hand, it works for that. However, it doesn't scale well to the millions and billions of base pairs we need in the modern day.
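As a toy sketch of that readout (made-up data: each terminated fragment is represented by its length plus its labelled last character; the gel sorts by length):

    # Hypothetical chain-termination products: (fragment length, labelled terminal base).
    fragments = [(3, 'G'), (1, 'A'), (5, 'A'), (2, 'C'), (4, 'T'), (6, 'C')]

    # The gel separates fragments by length; reading the labels in order of
    # increasing length spells out the original sequence.
    sequence = ''.join(base for length, base in sorted(fragments))
    print(sequence)  # ACGTAC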

As a fun aside, protein sequences were originally determined in a way pretty much the inverse of this. For a given protein string, remove the first element with chemistry. Then, try to figure out what you removed. Now take your string of size N - 1 and repeat, until you have determined each character. This method ended up not being tractable for DNA because of chemical differences. Also, a lot of protein sequencing is done in a similar way to DNA sequencing, in that we break up, shatter may be a better word, the protein. We then try to construct the original protein based on how it shatters (like reconstructing a window based on knowing where the pieces fell and where the baseball came from).


That depends on the technology. The technology I most often work with will have lots of fake DNA basepairs in a soup, which has the real subject's DNA broken apart into fragments and attached to a substrate to keep it from moving. The fake DNA basepairs bind to the complementary real basepairs and emit fluorescent light when doing so. En masse, the fluorescent light gets captured on camera and each of the possible basepairs' colors are scored, to provide a call for that basepair. Repeat the cycle a few times, map the lights to the same spots, and you generate a sequence of basepair 'reads' which are then sent further down a software pipeline for later analysis.

...the later steps in the pipeline involve using lots of complex math to reassemble the sequenced fragments back together, either using a reference assembly (such as the Human Genome Project) or else de-novo assembly (which basically _builds_ a reference assembly through lots and lots and lots of effort).
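A minimal sketch of just the per-cycle base-calling step described above (made-up intensity values; a real pipeline also corrects for cross-talk and phasing and assigns quality scores):

    CHANNELS = ('A', 'C', 'G', 'T')

    # Fluorescence intensity per colour channel for one cluster, one dict per cycle.
    cycles = [
        {'A': 910, 'C': 40, 'G': 55, 'T': 30},   # cycle 1 -> brightest channel is A
        {'A': 60, 'C': 35, 'G': 870, 'T': 45},   # cycle 2 -> G
        {'A': 50, 'C': 900, 'G': 40, 'T': 60},   # cycle 3 -> C
    ]

    # Call the base with the brightest signal in each cycle.
    read = ''.join(max(CHANNELS, key=cycle.get) for cycle in cycles)
    print(read)  # AGC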

There are other technologies as well which I'm not so familiar with.


Look up Sanger sequencing.


Just wanted to say thanks for answering in such detail. Your string analogy really helped me understand it quite easily.


If you think about what is required for a mammal to survive multiple environments, it seems intuitive that the more environments a mammal can survive, the more genes are needed to create different phenotypes (observable characteristics of an organism due to the interaction of its genes with the environment). Just looking at the organisms themselves, a mouse is capable of surviving harsher and more diverse environments than humans. Due to this resiliency the mouse must have more phenotypes and thus more genes than humans.

This would perhaps also explain why some plant species have far more genes than mammals: without the ability to move, they need to be able to code for multiple environmental changes throughout their life.


>why do mice have more bases than us,

We don't fully know. However, part of the answer may be eukaryotic transposable elements (TEs) and retrotransposons:

https://en.wikipedia.org/wiki/Transposable_element

https://en.wikipedia.org/wiki/Retrotransposon



What determines the complexity of sequencing these? The number of pairs?


Today there exists a multitude of different genome sequencing techniques, and distinct complexities are associated with each method. However, the number of base pairs is seldom the main complexity today.

Sanger sequencing was one of the first methods of sequencing, and employs linear sequencing: the synthesis of strands of increasing length. With the advent of the Human Genome Project, Celera instead came up with the idea of fragmenting the genome, amplifying the fragments, sequencing the fragments, and matching them together using bioinformatics. The complexity here lies in the fact that much of the DNA is repeated (such as microsatellites), which makes it hard to 'phase' the genome. As such, a short 20-nucleotide sequence may be present in many parts of the genome, which makes it hard to generate a 100% complete, connected genome.

Today, Illumina sequencing is the major sequencing platform (~85% of market share). It relies on the fragmentation of DNA into ~300 bp fragments. By synthesising the complementary strand of each fragment with fluorescent nucleotides, we can use lasers to detect (sequence) the nucleotides of the fragments. Here we have the same problem as with shotgun sequencing: there are many repeats in the DNA sequence.

To remedy this, error-prone sequencing methods with long read lengths, such as IonTorrent/PacBio/etc., may be employed. These long reads may then act as a map for stitching together the more precise short reads. This is called 'hybrid' sequencing.

Other sequencing methods, such as Pyrosequencing, have the inherent problem of not being able to discern too many (more than about 5) of the same nucleotide in a row. Other methods are primer-based (i.e. they need to know a short subsequence of the DNA beforehand). This is problematic if we want to perform a de-novo whole genome sequence. Note: Illumina does not rely on primers, and may be deployed directly on unknown sequences, unlike Pyrosequencing/Sanger sequencing.


The other big complexity is the self-similarity of the genome. To sequence, the genome is duplicated, then sharded physically into many overlapping tiles around an average length, each tile starting at a different position. Each sequence is determined in parallel, and then the shards have to be reassembled computationally (really, this is a post-sequencing process, but is critical to being able to call the result a "sequence" rather than a bag of reads).

If the genome in question contains a lot of regions that are similar to each other, the algorithms that do the assembly will get confused.


There are regions with sparse coverage due to higher concentrations of C-G bases. This can make alignment results less reliable. Plasmids may interfere with alignment or need to be isolated out before or during library prep. And you want to detect and discard any members of the population with any evidence of cancer, genetic disease, etc.

And you need good population coverage. What's a normal variant? Newer methods propose a graph alignment instead of just trying to build a single sequence reference genome.


Genome size and the number of repeats in the DNA.

Basically the entire genome is fragmented and then reassembled like a big puzzle. It works like this:

genome => |1234567890abcdefghijklmnopqrstuvwxyz|

set A => |123| |456| |ab| |90| |l| |m| |tuvw| ...

set B => |1| |45| |23| |0| |abcdefg| ....

....

Assembly sequence:

1x |1 23| |4 56||7| |8| |9 0|...

2x |1||23 4||56||7 8| |9| |0|...

where the fragments overlap enough that you can match them up to a different fragment set.

The goal is to get 8x coverage for each nucleotide. When you can do that you have a finished sequence.
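As a back-of-the-envelope check (hypothetical numbers), coverage is just total sequenced bases divided by genome size:

    n_reads, read_len, genome_size = 120_000_000, 100, 1_500_000_000  # made-up values
    print(n_reads * read_len / genome_size)  # 8.0 -> 8x coverage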

You can see the problem here. The larger the genome the longer it takes to find a match. The other problem is that sequences that are highly repeated are difficult to assemble. Imagine assembling this sequence from fragments:

atatatatatatatatatatatatatatatat

Did you get the length right? It's easy if everything is unique.
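A quick way to see why: two toy "genomes" that differ only in the length of the repeat produce exactly the same set of short reads, so the reads alone can't tell you the length:

    def read_set(genome, read_len=6):
        # Every substring of length `read_len` -- what a sequencer sees (ignoring counts).
        return {genome[i:i + read_len] for i in range(len(genome) - read_len + 1)}

    genome_a = "GGC" + "AT" * 10 + "CCG"
    genome_b = "GGC" + "AT" * 15 + "CCG"
    print(read_set(genome_a) == read_set(genome_b))  # True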

Most bacterial organisms are very small and are easily sequenced. This is true of plasmids and viruses. Plants and animals are a different ballgame. Organisms that you think are very simple are in fact MUCH more complex to sequence because of the amount of genetic material. After sequencing the human genome we realized that there are about 30K genes and 3 billion base pairs (bacteria are almost all in the ballpark of 5 million base pairs by comparison). The onion you had for lunch on your salad? 16 billion base pairs. The lowly pine tree? 60K genes and 22 billion base pairs. A salamander? 70 billion base pairs. The Paris japonica flower? 149 billion base pairs.

This flower: https://en.wikipedia.org/wiki/Paris_japonica is 50x more complex genetically than you are.


Yes mostly the number of base pairs. Large repeats can also cause issues finishing a genome.


Would be nice to list the approximate number of genes, too.


Agreed, number of bases gives little information about the complexity of an organism. Genes is a better metric.


"Number of characters gives little information about the complexity of a program. Functions is a better metric."

When the human genome was sequenced (it was a ten-year project), lots of pseudo-philosophers predicted the number of genes that would be found, with numbers ranging from thousands to millions. The genome in hand, we found about 25,000 genes, which make up a few percent (1.5-3%) of the genome. We then looked more closely, and now we can't even agree on the definition of "gene" anymore, which is one reason why I can't give an exact percentage.

It turns out that one gene can produce more than one protein, or sometimes less. There are lots of genes whose only function (assuming "function" is even well defined) is to regulate other genes, a bit like a program could have functions of higher order. We're slowly grasping the fact that we have no idea how complex the genome and the machine interpreting it really are. Number of genes really is no metric at all.


Most non-plant organisms I can think of are around 15,000 - 25,000 genes.


Not true. The majority of bacterial genomes have far fewer than 15,000 genes.


You missed Xylella fastidiosa. It was sequenced in the 2000s.



