AI uncovers the hidden regulatory code of the genome

One of the great unsolved mysteries of genomics is the genome's second code. The exact rules of the cis-regulatory code, in which segments of non-coding DNA regulate the transcription of neighboring genes, remain unclear. The code is read by transcription factors that bind to short stretches of DNA called motifs.

Genetic enhancers contain motifs bound by proteins that increase the likelihood that a particular gene is transcribed. These short sequence motifs are essential for the binding of sequence-specific transcription factors, but how combinations of motifs and their arrangement affect transcription factor binding in vivo is not well understood.

Experimental manipulation, such as mutations or synthetic designs, has provided limited evidence for specific motif arrangements, which the authors refer to as motif syntax. However, it is difficult to identify rules and consensus patterns with genome-wide analyses.

Neural networks can learn flexible, predictive models that capture sequence motifs de novo from complex, multivariate data without making strong biological assumptions. However, the complexity of these models makes them challenging to interpret. Existing models are limited by low resolution and an inability to detect overlapping cooperative factors (including indirect binding).

Now, an interdisciplinary team of biologists and computer researchers led by Julia Zeitlinger, PhD, of the Stowers Institute for Medical Research and Anshul Kundaje, PhD, of Stanford University has designed a convolutional neural network, named Base Pair Network (BPNet), that can be interpreted to reveal the regulatory code by predicting transcription factor binding from DNA sequences with unprecedented accuracy.

BPNet can detect the regulatory code of the genome

The researchers used data from chromatin immunoprecipitation experiments with nucleotide resolution through exonuclease, unique barcode, and single ligation (ChIP-nexus) in embryonic stem cells to achieve modeling at the highest resolution. The increased resolution allowed them to develop interpretation tools that extract key sequence patterns directly summarizing the effect of motifs on transcription factor binding.

“This was very satisfying, as the results match beautifully with the existing experimental results, and they revealed new perspectives that surprised us,” Zeitlinger said, in a statement.

The team found that transcription factor binding is guided by soft syntax rules, which follow clear intermotif relationships that depend on distance in a manner consistent with protein-protein interactions or nucleosome-mediated cooperativity. For example, BPNet predicted that the transcription factors Sox2 and Nanog interact and that this cooperative interaction is directional. In this way, the interaction between two motifs takes place in a flexible but distance-dependent fashion that is specific to each motif pair.

“There has been a long trail of experimental evidence that such motif periodicity exists in the regulatory code,” Zeitlinger said. “However, the exact circumstances were elusive, and Nanog had not been a suspect. Discovering that Nanog has such a pattern, and seeing additional details of its interactions, was surprising because we did not specifically search for this pattern.”

In addition, they found that the Nanog motif showed a strong preference for helical spacing at multiples of approximately 10.5 base pairs, independent of orientation. This helical spacing may help Nanog engage in cooperative protein-protein interactions by binding on the same side of the DNA as a partner motif.
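The same-side argument comes down to simple arithmetic on the spacing between two sites. The sketch below is an illustration only and is not part of BPNet: it assumes a fixed helical repeat of 10.5 base pairs, and the function names and the 0.15-turn tolerance are hypothetical choices made for this example.

```python
HELICAL_TURN_BP = 10.5  # assumed base pairs per helical turn (approximate)


def helical_phase(spacing_bp, turn_bp=HELICAL_TURN_BP):
    """Fraction of a helical turn (0 <= phase < 1) left over after spacing_bp."""
    return (spacing_bp % turn_bp) / turn_bp


def same_face(spacing_bp, tolerance=0.15, turn_bp=HELICAL_TURN_BP):
    """True if two sites spacing_bp apart sit on roughly the same face of the helix.

    The tolerance (in turns) is an arbitrary illustrative cutoff.
    """
    phase = helical_phase(spacing_bp, turn_bp)
    # A phase close to 0 or close to 1 both mean "near a whole number of turns".
    return min(phase, 1.0 - phase) <= tolerance


if __name__ == "__main__":
    for distance in (10, 16, 21, 26, 32, 42):
        print(distance, round(helical_phase(distance), 2), same_face(distance))
```

Spacings near whole multiples of 10.5 base pairs (such as 21 or 42) come out as same-face, while spacings about half a turn off (such as 16 or 26) do not.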

“This is the main benefit of using neural networks for this task,” said Žiga Avsec, PhD, senior research scientist at the Technical University of Munich and first author of the paper.

“More traditional bioinformatics approaches model data using predefined rigid rules based on existing knowledge. However, biology is extremely rich and complex,” Avsec explained. “By using neural networks, we can train much more flexible and advanced models that learn complex patterns from scratch with no prior knowledge, thus allowing new discoveries.”

How does BPNet work?

BPNet starts from the raw DNA sequence and learns to detect sequence motifs and, eventually, the higher-order rules by which these elements predict the base-resolution binding data. Once the model is trained, the learned patterns are extracted with interpretation tools: the output signal is traced back to the input sequences to reveal sequence motifs.
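To give a feel for what "learning from the raw DNA sequence" means in practice, sequence models of this kind typically turn each base into a one-hot vector and slide small weight matrices (filters) along the sequence. The sketch below is not BPNet code. It shows a single hand-written filter in plain Python, and the pattern "CATT", the filter, and the function names are all hypothetical and chosen only for illustration. In a trained network the filter weights are learned rather than written by hand.

```python
ALPHABET = "ACGT"


def one_hot(seq):
    """Encode a DNA string as rows of four numbers (A, C, G, T).

    Any other character (for example N) becomes a row of zeros.
    """
    return [[1.0 if base == letter else 0.0 for letter in ALPHABET]
            for base in seq.upper()]


def scan(encoded, filt):
    """Slide a filter (width x 4 weights) along the sequence.

    Returns one score per start position, which is the basic operation
    of a one-dimensional convolution without padding.
    """
    width = len(filt)
    scores = []
    for start in range(len(encoded) - width + 1):
        score = sum(encoded[start + i][j] * filt[i][j]
                    for i in range(width)
                    for j in range(4))
        scores.append(score)
    return scores


# Hypothetical filter that rewards exact matches to the toy pattern "CATT".
TOY_FILTER = one_hot("CATT")

if __name__ == "__main__":
    sequence = "GGACATTGTC"
    scores = scan(one_hot(sequence), TOY_FILTER)
    best = max(range(len(scores)), key=scores.__getitem__)
    print("best start position:", best, "score:", scores[best])
```

The position with the highest score is where the filter's pattern occurs, which is the sense in which a convolutional filter acts as a motif detector.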

Researchers used DNA sequences from high-resolution experiments to train a neural network called BPNet, whose “black box” inner workings were then uncovered to reveal the sequence patterns and organizing principles of the genome’s regulatory code. Image courtesy of Mark Miller, Stowers Institute for Medical Research.

The final step is to use the model as an oracle and systematically query it with specific DNA sequence designs, much as one would test a hypothesis experimentally, to reveal the rules by which sequence motifs function in combination.
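The oracle idea can be sketched as a small in silico experiment: place one motif in random background sequence, add a second motif at a chosen distance, and ask the model how much the prediction changes. Everything in the example below is an assumption made for illustration. `toy_oracle` is a hard-coded stand-in for a trained model, the two motif strings are placeholders rather than real Sox2 or Nanog motifs, and the 35 base pair cutoff is arbitrary. It is not the procedure or the code used by the authors.

```python
import random

# Placeholder 12-letter strings, NOT the real Sox2 or Nanog motifs.
MOTIF_A = "ACGTTGCATGCA"
MOTIF_B = "TTGACCGGTCAA"


def embed(background, motif, position):
    """Overwrite part of background with motif, starting at position."""
    return background[:position] + motif + background[position + len(motif):]


def spacing_scan(predict, motif_a, motif_b, distances,
                 length=200, trials=20, seed=0):
    """Average fold change in predict() when motif_b is added at each distance.

    predict is any function that maps a DNA string to a positive number.
    Distances should be at least len(motif_a) so the motifs do not overlap.
    """
    rng = random.Random(seed)
    center = length // 2
    effect = {}
    for d in distances:
        ratios = []
        for _ in range(trials):
            background = "".join(rng.choice("ACGT") for _ in range(length))
            alone = embed(background, motif_a, center)
            paired = embed(alone, motif_b, center + d)
            ratios.append(predict(paired) / predict(alone))
        effect[d] = sum(ratios) / trials
    return effect


def toy_oracle(seq):
    """Hard-coded stand-in for a trained model (purely illustrative).

    Reports a doubled signal for MOTIF_A when MOTIF_B starts within
    35 bp downstream of it. A real oracle would be the trained network.
    """
    a = seq.find(MOTIF_A)
    if a < 0:
        return 0.1  # baseline signal when MOTIF_A is absent
    b = seq.find(MOTIF_B, a + len(MOTIF_A))
    nearby = b >= 0 and b - a <= 35
    return 2.0 if nearby else 1.0


if __name__ == "__main__":
    for d, fold in spacing_scan(toy_oracle, MOTIF_A, MOTIF_B,
                                [15, 25, 45, 60]).items():
        print(f"distance {d:>2} bp: mean fold change {fold:.2f}")
```

Averaging over many random backgrounds separates the effect of the motif pair from whatever else happens to be in a given sequence. Swapping the roles of the two motifs in the same scan is one way to check whether an interaction is directional.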

“The beauty is that the model can predict far more sequence designs than we could test experimentally,” Zeitlinger said. “Furthermore, by predicting the outcome of experimental perturbations, we can identify the most informative experiments to test the model.”

To validate the predicted motif rules experimentally, the researchers introduced targeted point mutations in motifs and compared the changes in ChIP-nexus profiles with those predicted by BPNet. They used CRISPR/Cas9 to make two base substitutions in either a Sox2 or a Nanog motif, and then performed ChIP-nexus experiments on wild-type and mutant embryonic stem cells.

As expected, mutating the Sox2 motif abolished all binding of that transcription factor. However, the Nanog mutation did not affect Sox2 binding, whereas the Sox2 mutation caused a loss of Nanog binding near the mutated Sox2 site, confirming the directional cooperativity of the transcription factors.

Both the Zeitlinger laboratory and the Kundaje laboratory already use BPNet to identify binding motifs in other cell types, relate motifs to biochemical parameters, and study other structural features of the genome, such as those related to DNA packaging. The teams have made the entire BPNet software framework freely available to other scientists.



Copyright © 2021 scienceboard.net
