Genome size and the number of repeats in the DNA.

Basically the entire genome is fragmented and then reassembled like a big puzzle. It works like this:

genome => |1234567890abcdefghijklmnopqrstuvwxyz|

set A => |123| |456| |ab| |90| |l| |m| |tuvw| ...

set B => |1| |45| |23| |0| |abcdefg| ...

....

Assembly sequence:

1x |1 23| |4 56||7| |8| |9 0|...

2x |1||23 4||56||7 8| |9| |0|...

where the fragments overlap enough that you can match each one up against a fragment from a different set.
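To make the puzzle idea concrete, here is a toy sketch of overlap-based assembly. It is my own illustration, not how any particular assembler works: the names `overlap`, `greedy_assemble` and `min_len` are made up for this example, and it assumes error-free fragments from a single strand. Real assemblers build overlap or de Bruijn graphs and have to cope with sequencing errors.

```python
def overlap(a, b, min_len=1):
    """Length of the longest suffix of a that equals a prefix of b."""
    for k in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def greedy_assemble(fragments, min_len=2):
    """Repeatedly merge the pair of fragments with the largest overlap."""
    frags = list(dict.fromkeys(fragments))  # drop exact duplicates, keep order
    # Drop fragments that sit entirely inside another fragment.
    frags = [f for f in frags if not any(f != g and f in g for g in frags)]
    while len(frags) > 1:
        best_k, best_i, best_j = 0, None, None
        for i, a in enumerate(frags):
            for j, b in enumerate(frags):
                if i != j:
                    k = overlap(a, b, min_len)
                    if k > best_k:
                        best_k, best_i, best_j = k, i, j
        if best_k == 0:
            break  # nothing overlaps any more; leave the remaining contigs as-is
        merged = frags[best_i] + frags[best_j][best_k:]
        frags = [f for idx, f in enumerate(frags) if idx not in (best_i, best_j)]
        frags.append(merged)
    return frags


reads = ["12345", "45678", "7890a", "0abcd", "bcdefg", "efghij"]
contigs = greedy_assemble(reads)
print(contigs)
```

Note the all-pairs comparison inside the loop: the work grows quickly with the number of fragments, which is the "larger genome, longer to find a match" problem in miniature.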

The goal is to get 8x coverage for each nucleotide. When you can do that you have a finished sequence.
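A rough sketch of what "8x coverage for each nucleotide" means in practice, assuming you already know where each fragment lands on the genome. The `(start, length)` representation and the function names are hypothetical, just for illustration:

```python
def coverage(genome_len, placed_reads):
    """Count how many reads cover each position.

    placed_reads is a list of (start, length) pairs, 0-based.
    """
    depth = [0] * genome_len
    for start, length in placed_reads:
        for pos in range(start, min(start + length, genome_len)):
            depth[pos] += 1
    return depth


def is_finished(genome_len, placed_reads, target=8):
    """True if every position is covered by at least `target` reads."""
    return min(coverage(genome_len, placed_reads)) >= target


depth = coverage(5, [(0, 3), (2, 3)])
print(depth)
```

In this tiny example only the middle position is covered twice; every other position has a single read, so it is nowhere near 8x.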

You can see the problem here. The larger the genome, the longer it takes to find a match. The other problem is that highly repeated sequences are difficult to assemble. Imagine assembling this sequence from fragments:

atatatatatatatatatatatatatatatat

Did you get the length right? It's easy if everything is unique.
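Here is a small sketch of why the repeat is a problem. It assumes fixed-length fragments taken at every position (the helper name `fragment_set` is made up), and shows that two repeats of different lengths produce exactly the same set of distinct fragments, so the fragments alone cannot tell you the length:

```python
def fragment_set(seq, k):
    """All distinct fragments of length k in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


short_repeat = "at" * 8
long_repeat = "at" * 16

# Both collapse to the same two fragments.
print(fragment_set(short_repeat, 4))
print(fragment_set(short_repeat, 4) == fragment_set(long_repeat, 4))

# With unique sequence, a longer genome yields fragments the shorter one lacks.
print(fragment_set("1234567890", 4) == fragment_set("1234567890abc", 4))
```

The only way out is fragments long enough to span the whole repeat and reach unique sequence on both sides.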

Most bacterial genomes are very small and are easily sequenced. The same is true of plasmids and viruses. Plants and animals are a different ballgame. Organisms that you think are very simple are in fact MUCH more complex to sequence because of the amount of genetic material. After sequencing the human genome we realized that there are about 30K genes and 3 billion base pairs (bacteria are almost all in the ballpark of 5 million base pairs by comparison). The onion you had for lunch on your salad? 16 billion base pairs. The lowly pine tree? 60K genes and 22 billion base pairs. A salamander? 70 billion base pairs. The Paris japonica flower? 149 billion base pairs.

This flower: https://en.wikipedia.org/wiki/Paris_japonica has a genome about 50x the size of yours.
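For a sense of scale, here is the arithmetic on the rough figures quoted above (taken from this comment as-is, not from a reference database):

```python
# Approximate genome sizes in base pairs, as quoted in this comment.
genome_bp = {
    "bacterium": 5e6,
    "human": 3e9,
    "onion": 16e9,
    "pine": 22e9,
    "salamander": 70e9,
    "Paris japonica": 149e9,
}

# Express every genome as a multiple of the human genome.
relative_to_human = {name: bp / genome_bp["human"] for name, bp in genome_bp.items()}

for name, ratio in relative_to_human.items():
    print(f"{name}: {ratio:.3g}x human")
```

That is where the "about 50x" comes from: 149 billion divided by 3 billion is a bit under 50.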



