UnfoldCDL (Unfolded Convolutional Dictionary Learning) is a DNA sequence motif discovery method. We first formulate a convolutional dictionary learning problem and then "unfold" its optimization algorithm into a neural network. The resulting network is fully interpretable, fast to train, and outputs a sparse representation of the dataset. The sparse representation allows us to infer the motifs in the dataset efficiently.
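To make the idea of "unfolding" concrete, here is a minimal sketch in plain Julia. It is not the UnfoldCDL architecture. It assumes ISTA (iterative soft-thresholding) as the optimization algorithm and uses a dense dictionary `D` instead of a convolutional one, both chosen only for illustration; the names `soft_threshold`, `ista_layer`, and `unfolded_forward` are hypothetical. Each iteration plays the role of one network layer, and in an unfolded network quantities such as the dictionary, step size, and threshold would typically become trainable parameters.

```julia
using LinearAlgebra  # opnorm
using Random

# Soft-thresholding: the proximal operator of the L1 norm.
soft_threshold(x, λ) = sign(x) * max(abs(x) - λ, zero(x))

# One ISTA iteration for  min_z 0.5 * ||y - D*z||^2 + λ * ||z||_1.
# In an unfolded network, this iteration corresponds to one layer.
function ista_layer(z, y, D, η, λ)
    grad = D' * (D * z - y)                      # gradient of the data-fit term
    return soft_threshold.(z - η * grad, η * λ)  # shrink small entries to zero
end

# Stack a fixed number of layers: the "unfolded" forward pass.
function unfolded_forward(y, D; layers = 10, λ = 0.1)
    η = 1 / opnorm(D)^2                          # step size 1/L, L = ||D||_2^2
    z = zeros(eltype(D), size(D, 2))
    for _ in 1:layers
        z = ista_layer(z, y, D, η, λ)
    end
    return z
end

# Toy data (illustrative only): a signal built from 2 of 20 random atoms.
Random.seed!(0)
D = randn(30, 20)
z_true = zeros(20)
z_true[[3, 11]] .= [1.5, -2.0]
y = D * z_true
z = unfolded_forward(y, D; layers = 200, λ = 0.05)
```

The output `z` is a sparse code for `y`; in the convolutional setting, the analogous code would also carry positional information along each sequence.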
Many methods can find statistically significant motifs, but the motifs in a dataset may be only "partially found" because proteins can bind to DNA in complicated ways. For example, motifs may have multiple modes: the modes may share similar patterns (multimeric binding, alternate structural conformations), have distinct "parts" (variable spacing), or look entirely dissimilar to each other (multiple DNA-binding domains). Some traditional motif discovery methods handle these scenarios with heuristics such as substring masking, which makes the discovery sequential and can leave some secondary motifs masked and never revealed. Inferring motifs with black-box deep learning approaches, on the other hand, is currently challenging.
The sparse representation obtained from UnfoldCDL reveals where the enriched patterns are in the dataset, and we seek to use this representation to discover all the motifs simultaneously.
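As a rough illustration of how a sparse representation can point to enriched patterns, the sketch below reads the nonzero entries of a small code matrix for one sequence. The layout is an assumption made for this example, not UnfoldCDL's actual output format: `Z[k, i] != 0` is taken to mean that filter `k` is active at start position `i`. The function name `active_sites` and the toy values are hypothetical.

```julia
# Assumed layout (hypothetical): Z[k, i] != 0 means filter k is active
# starting at position i of the sequence `seq`.
function active_sites(Z, filter_len, seq; tol = 1e-6)
    hits = findall(z -> abs(z) > tol, Z)   # CartesianIndex for each nonzero entry
    return [(filter = I[1],
             start  = I[2],
             weight = Z[I],
             subseq = seq[I[2]:min(I[2] + filter_len - 1, lastindex(seq))])
            for I in hits]
end

# Toy sequence and sparse code (illustrative values only).
seq = "ACGTTGACGTCAAT"
Z = zeros(2, length(seq))
Z[1, 6] = 0.9    # filter 1 active at position 6
Z[2, 2] = 0.4    # filter 2 active at position 2

sites = active_sites(Z, 6, seq)
for s in sites
    println("filter ", s.filter, " at ", s.start, ": ", s.subseq)
end
```

Because only a few entries are nonzero, the substrings at the active positions can be collected directly, without scanning every position of every sequence.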
We found many previously unreported motifs in the JASPAR datasets. See the Results section of our preprint for details.
We are currently adding this package to the Julia registry. Once it is added, users can install the package via Julia's package manager:
pkg> add UnfoldCDL
This package requires WebLogo. You need Python 3; install WebLogo with the following command:
pip3 install weblogo
We require the user to have an NVIDIA GPU; we plan to implement a CPU version in the future.
In Julia, import the UnfoldCDL package first:
using UnfoldCDL
To do motif discovery on a single fasta file, execute
find_motif(<fasta-path>, <output-folder>)
- Perform motif discovery on the fasta file <fasta-path>, and
- Output the result in a pre-specified folder <output-folder>.
To do motif discovery on a batch of fasta files, execute
find_motif_fasta_folder(<fasta-folder-path>, <output-folder>)
- Perform motif discovery on all the fasta files in the <fasta-folder-path>, and
- Output each of the results in a pre-specified folder <output-folder>.
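If you need per-file control (for example, to skip some files), a batch run can be approximated by looping over the single-file call. The sketch below is an illustration, not how `find_motif_fasta_folder` is implemented: the helper `run_on_fasta_folder`, the accepted file extensions, and the one-subfolder-per-input output layout are all assumptions. The discovery function is passed in as an argument, so the helper itself does not depend on UnfoldCDL.

```julia
# Apply `discover(fasta_path, output_folder)` to every FASTA file in a folder.
# `discover` is expected to have the same two-argument form as find_motif.
function run_on_fasta_folder(discover, fasta_folder, output_folder;
                             extensions = (".fa", ".fasta"))
    processed = String[]
    for name in readdir(fasta_folder)
        path = joinpath(fasta_folder, name)
        isfile(path) || continue                  # skip subdirectories
        stem, ext = splitext(name)
        lowercase(ext) in extensions || continue  # skip non-FASTA files
        out = joinpath(output_folder, stem)       # assumed layout: one subfolder per input
        mkpath(out)
        discover(path, out)
        push!(processed, path)
    end
    return processed
end

# Usage, assuming UnfoldCDL is installed and an NVIDIA GPU is available
# (the folder names are placeholders):
#   using UnfoldCDL
#   run_on_fasta_folder(find_motif, "fastas", "results")
```

The function returns the list of files it processed, which makes it easy to log or resume a long batch.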
The paper for UnfoldCDL is at https://www.biorxiv.org/content/10.1101/2022.11.06.515322v3. It can be cited using the following BibTeX entry:
@article {Chu2022.11.06.515322,
author = {Chu, Shane Kuei-Hsien and Stormo, Gary D},
title = {Deep unfolded convolutional dictionary learning for motif discovery},
elocation-id = {2022.11.06.515322},
year = {2022},
doi = {10.1101/2022.11.06.515322},
publisher = {Cold Spring Harbor Laboratory},
URL = {https://www.biorxiv.org/content/early/2022/11/10/2022.11.06.515322},
eprint = {https://www.biorxiv.org/content/early/2022/11/10/2022.11.06.515322.full.pdf},
journal = {bioRxiv}
}