A deep learning attempt at RNA inverse design.

Learning to design RNA: an internship at NC State University.

An attempt to build a long-range-specific model.

I'm excited to begin my research internship at NC State. We start from the outstanding article:

gRNAde: Geometric Deep Learning for 3D RNA Inverse Design.

This paper is clear and friendly to a rookie like me. It is also easy to reproduce, and the source code is released at:

https://github.com/chaitjo/geometric-rna-design

Remote server initialization operations

For convenience, I rent a GPU server from AutoDL (https://www.autodl.com/), and a friend also lets me use his RTX 4090 in the UK.

However, the servers in China and the UK need different settings. On the AutoDL server, I have to rely on the data disk autodl-tmp rather than the system disk, because the system disk has limited space.

That leads to a trade-off: the data disk is cloud-based, so reading and preprocessing data from it is slower, which in turn lowers GPU utilization (noticeable on an RTX 5090). So it depends.

The author requires the following environment:

Python 3.10.12 and CUDA 11.8, numpy <2.0

In practice, we need python=3.10 and CUDA 12.8 (for the RTX 5090).
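A quick way to confirm the installed PyTorch build actually sees the GPU and the expected CUDA version (just a sanity-check snippet, not part of the repository):

import torch

print(torch.__version__)          # PyTorch build
print(torch.version.cuda)         # CUDA version the wheel was built against
print(torch.cuda.is_available())  # True if the GPU is visible
if torch.cuda.is_available():
    print(torch.cuda.get_device_name(0))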

Mirror source acceleration

git clone https://ghfast.top/github.com/chaitjo/geometric-rna-design.git
cd /root/autodl-tmp/geometric-rna-design

Environment

If the system disk doesn't have enough space, put the environment on the data disk:

mamba create -p /root/autodl-tmp/rna python=3.10
mamba activate /root/autodl-tmp/rna

And we can clean up caches:

pip cache purge
conda clean --all
  • For the current shell, set the following variables. It is better to add them to ~/.bashrc or a .env file and then run source ~/.bashrc.
export PROJECT_PATH='/root/autodl-tmp/geometric-rna-design/'
export ETERNAFOLD='/root/autodl-tmp/geometric-rna-design/tools/EternaFold'
export X3DNA='/root/autodl-tmp/geometric-rna-design/tools/x3dna-v2.4'
export PATH="/root/autodl-tmp/geometric-rna-design/tools/x3dna-v2.4/bin:$PATH"
export PATH="/root/autodl-tmp/cdhit:$PATH"
export PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:128

For the UK server, this is mine:

export PROJECT_PATH='/home/remote1/geometric-rna-design/'
export DATA_PATH='/home/remote1/geometric-rna-design/data/'
export WANDB_PROJECT='rna'
export WANDB_ENTITY='wenxy59-sun-yat-sen-university'
export WANDB_DIR='/home/remote1/geometric-rna-design/'
export ETERNAFOLD='/home/remote1/geometric-rna-design/tools/EternaFold'
export X3DNA='/home/remote1/geometric-rna-design/tools/x3dna-v2.4'
export PATH="/home/remote1/geometric-rna-design/tools/x3dna-v2.4/bin:$PATH"
export PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:128

Training the model

Data preprocessing

Before we start, we need to run process_data.py to obtain process.pt, and then run the notebook to get das_split.pt.

In this process, we need US-align and qTMclust:

#Install CD-HIT for sequence identity clustering
mamba install cd-hit -c bioconda

#Install US-align/qTMclust for structural similarity clustering
cd ~/geometric-rna-design/tools/
git clone https://github.com/pylelab/USalign.git && cd USalign/ && git checkout 97325d3aad852f8a4407649f25e697bbaa17e186
g++ -static -O3 -ffast-math -lm -o USalign USalign.cpp
g++ -static -O3 -ffast-math -lm -o qTMclust qTMclust.cpp

# (Optional) Install ViennaRNA, mainly used for plotting in design notebook
cd ~/geometric-rna-design/tools/
tar -zxvf ViennaRNA-2.6.4.tar.gz
cd ViennaRNA-2.6.4
./configure  # ./configure --enable-macosx-installer
make
sudo make install

After that, we can validate the installation with USalign -h and qTMclust -h. Pay attention to the paths, e.g. in src/data/clustering_units.py.

Processing the data successfully

We skip any RNA molecule whose structural data turn out to be incomplete.

Figure: problematic data entries.
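Conceptually, that filtering step looks like the sketch below; the file path and field names are my assumptions, not the repository's actual schema:

import numpy as np
import torch

# hypothetical path to the preprocessed dataset produced by process_data.py
data_list = torch.load("data/process.pt")
clean = []
for entry in data_list:
    coords = entry.get("coords_list", [])
    # skip molecules with no coordinates or with NaNs in the backbone
    if len(coords) == 0 or any(np.isnan(np.asarray(c)).any() for c in coords):
        continue
    clean.append(entry)
print(f"kept {len(clean)} / {len(data_list)} molecules")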

Reproduce the author’s result

The author provides gRNAde.py, which we can run from the command line like this:

python gRNAde.py     --pdb_filepath data/raw/6J6G_1_L-E.pdb     --output_filepath tutorial/lnc/po/114.fasta     --split das     --max_num_conformers 1  --n_samples 16     --temperature 0.5
let's run gRNAde

We then test it on the whole RNASolo dataset (both train and test) as a rough first experiment. The model's performance drops on long RNA sequences (over 100 nt): recovery falls only slightly, but the self-consistency score for secondary structure (SC score) shows a clear, roughly linear decline.

Figure: SC score (left) and recovery (right) versus sequence length.
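To sweep the whole dataset instead of one structure at a time, I drive gRNAde.py with a small loop like this (a sketch; the directory layout and output folder are assumptions):

import glob, os, subprocess

pdb_dir = "data/raw"          # assumed location of the RNASolo PDB files
out_dir = "tutorial/batch"    # hypothetical output directory
os.makedirs(out_dir, exist_ok=True)

for pdb_path in glob.glob(os.path.join(pdb_dir, "*.pdb")):
    name = os.path.splitext(os.path.basename(pdb_path))[0]
    subprocess.run([
        "python", "gRNAde.py",
        "--pdb_filepath", pdb_path,
        "--output_filepath", os.path.join(out_dir, f"{name}.fasta"),
        "--split", "das",
        "--max_num_conformers", "1",
        "--n_samples", "16",
        "--temperature", "0.5",
    ], check=True)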

Another route: diffusion models such as RiboDiffusion, RIdiffusion, DiffRNAfold, etc.

RiboDiffusion

To reproduce RiboDiffusion, we can set up the environment like this on the RTX 4090:

conda create -n rna2 python=3.10 -y
conda activate rna2
pip install torch==1.13.1+cu116 torchvision==0.14.1+cu116 torchaudio==0.13.1 --extra-index-url https://download.pytorch.org/whl/cu116
pip install absl-py==0.15.0
pip install biopython==1.80
pip install dm_tree==0.1.7
pip install fair-esm==2.0.0
pip install ml_collections==0.1.1
pip install numpy==1.24.3
pip install "scipy>=1.10.0"
pip install tqdm==4.64.1
pip install torch-cluster==1.6.1+pt113cu116 -f https://data.pyg.org/whl/torch-1.13.0+cu116.html
pip install torch-scatter==2.1.1+pt113cu116 -f https://data.pyg.org/whl/torch-1.13.0+cu116.html
pip install torch-geometric==2.3.1

Although it runs well and looks like a promising way to generate RNA molecules, there seem to be some problems. Look at the picture below: the recovery rate (you can think of it as "accuracy") often reaches 100%, and it jumps up and down with no clear pattern.

Figure: RiboDiffusion output.

To resolve my confusion, I tested it on the RNASolo dataset (manually), but the pattern was still not obvious.

Figure: RiboDiffusion recovery versus sequence length, log scale (top) and linear scale (bottom).
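For these plots I simply binned recovery by sequence length, roughly like the sketch below (the CSV columns come from my own logging and are assumptions):

import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("ribodiffusion_results.csv")   # hypothetical log, one row per design
bins = [0, 100, 200, df["length"].max() + 1]
df["length_bin"] = pd.cut(df["length"], bins=bins, labels=["0-100", "100-200", "200+"])
print(df.groupby("length_bin")["recovery"].agg(["mean", "median", "count"]))

df.plot.scatter(x="length", y="recovery", alpha=0.3)
plt.yscale("log")
plt.show()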

We follow the script below:

import torch
from tqdm import tqdm
import numpy as np
import random
from models import *
from utils import *
from diffusion import NoiseScheduleVP
from sampling import get_sampling_fn
from datasets import utils as du
import functools
import tree
from configs.inference_ribodiffusion import get_config
config = get_config()
# Choose checkpoint name
checkpoint_path = './ckpts/exp_inf.pth'
# checkpoint_path = './ckpts/exp_inf_large.pth'
config.eval.sampling_steps = 100
# config.eval.sampling_steps = 100
NUM_TO_LETTER = np.array(['A', 'G', 'C', 'U'])

def get_optimizer(config, params):
  """Return a flax optimizer object based on `config`."""
  if config.optim.optimizer == 'Adam':
      optimizer = torch.optim.Adam(params, lr=config.optim.lr, betas=(config.optim.beta1, 0.999), eps=config.optim.eps, weight_decay=config.optim.weight_decay)
  elif config.optim.optimizer == 'AdamW':
      optimizer = torch.optim.AdamW(params, lr=config.optim.lr, amsgrad=True, weight_decay=1e-12)
  else:
      raise NotImplementedError(f'Optimizer {config.optim.optimizer} not supported yet!')
  return optimizer
# Initialize model
model = create_model(config)
ema = ExponentialMovingAverage(model.parameters(), decay=config.model.ema_decay)
params = model.parameters()
optimizer = get_optimizer(config, model.parameters())
state = dict(optimizer=optimizer, model=model, ema=ema, step=0)

model_size = sum(p.numel() for p in model.parameters()) * 4 / 2 ** 20
print('model size: {:.1f}MB'.format(model_size))

# Load checkpoint
state = restore_checkpoint(checkpoint_path, state, device=config.device)
ema.copy_to(model.parameters())

# Initialize noise scheduler
noise_scheduler = NoiseScheduleVP(config.sde.schedule, continuous_beta_0=config.sde.continuous_beta_0,
                                  continuous_beta_1=config.sde.continuous_beta_1)
# Obtain data scalar and inverse scalar
inverse_scaler = get_data_inverse_scaler(config)

# Setup sampling function
test_sampling_fn = get_sampling_fn(config, noise_scheduler, config.eval.sampling_steps, inverse_scaler)
pdb2data = functools.partial(du.PDBtoData, num_posenc=config.data.num_posenc, num_rbf=config.data.num_rbf, knn_num=config.data.knn_num)
# Run inference on a single PDB file
pdb_file= '/home/remote1/geometric-rna-design/data/raw/1FIR_1_A.pdb'
pdb_id = pdb_file.replace('.pdb', '')
if '/' in pdb_id:
    pdb_id = pdb_id.split('/')[-1]

config.eval.dynamic_threshold=True
config.eval.cond_scale=0.4
config.eval.n_samples=16
test_sampling_fn = get_sampling_fn(config, noise_scheduler, config.eval.sampling_steps, inverse_scaler)
struct_data = pdb2data(pdb_file)
struct_data = tree.map_structure(lambda x:x.unsqueeze(0).repeat_interleave(config.eval.n_samples, dim=0).to(config.device), struct_data)
samples = test_sampling_fn(model, struct_data)
print(f'PDB ID: {pdb_id}')
native_seq = ''.join(list(NUM_TO_LETTER[struct_data['seq'][0].cpu().numpy()]))
print(f'Native sequence: {native_seq}')
for i in range(len(samples)):
    # native_seq = ''.join(list(NUM_TO_LETTER[struct_data['seq'].squeeze(0).cpu().numpy()]))
    # print(f'Native sequence: {native_seq}')
    designed_seq = ''.join(list(NUM_TO_LETTER[samples[i].cpu().numpy()]))
    print(f'Generated sequence {i+1}: {designed_seq}')
    recovery_ = samples[i].eq(struct_data['seq'][0]).float().mean().item()
    print(f'Recovery rate {i+1}: {recovery_:.4f}')

However, doing this one structure at a time is too slow and tiring, so I wrote an automated script that also hooks in gRNAde's evaluation:

import torch
from tqdm import tqdm
import numpy as np
import random
import os
import sys
from datetime import datetime

# Add the gRNAde path to import evaluation tools
sys.path.append('/home/remote1/geometric-rna-design/src')
sys.path.append('/home/remote1/geometric-rna-design/')

from models import *
from utils import *
from diffusion import NoiseScheduleVP
from sampling import get_sampling_fn
from datasets import utils as du
import functools
import tree
from configs.inference_ribodiffusion import get_config

# Import gRNAde evaluation functions
from src.evaluator import (
    self_consistency_score_eternafold,
    edit_distance
)
from src.data.data_utils import pdb_to_tensor, get_c4p_coords
from src.constants import NUM_TO_LETTER, PROJECT_PATH

# Import BioPython for sequence handling
from Bio import SeqIO
from Bio.Seq import Seq
from Bio.SeqRecord import SeqRecord

config = get_config()

# Choose checkpoint name
checkpoint_path = './ckpts/exp_inf.pth'
config.eval.sampling_steps = 100

def get_optimizer(config, params):
    """Return a flax optimizer object based on `config`."""
    if config.optim.optimizer == 'Adam':
        optimizer = torch.optim.Adam(params, lr=config.optim.lr, betas=(config.optim.beta1, 0.999),
                                     eps=config.optim.eps, weight_decay=config.optim.weight_decay)
    elif config.optim.optimizer == 'AdamW':
        optimizer = torch.optim.AdamW(params, lr=config.optim.lr, amsgrad=True, weight_decay=1e-12)
    else:
        raise NotImplementedError(f'Optimizer {config.optim.optimizer} not supported yet!')
    return optimizer

def extract_sec_struct_from_pdb(pdb_file):
    """
    Extract secondary structure from PDB file using gRNAde's data utilities.
    """
    try:
        sequence, coords, sec_struct, sasa = pdb_to_tensor(
            pdb_file, 
            return_sec_struct=True, 
            return_sasa=True,
            keep_insertions=False
        )
        return [sec_struct] if sec_struct else None
    except Exception as e:
        print(f"Warning: Could not extract secondary structure from {pdb_file}: {e}")
        return None

def prepare_raw_data_for_grnade_eval(pdb_file, native_seq):
    """
    Prepare raw data structure compatible with gRNAde evaluator.
    """
    try:
        sequence, coords, sec_struct, sasa = pdb_to_tensor(
            pdb_file, 
            return_sec_struct=True, 
            return_sasa=True,
            keep_insertions=False
        )
        
        raw_data = {
            'sequence': native_seq,
            'coords_list': [coords] if coords is not None else [],
            'sec_struct_list': [sec_struct] if sec_struct else ["."] * len(native_seq),
            'sasa_list': [sasa] if sasa is not None else [np.ones(len(native_seq))]
        }
        
        return raw_data
    except Exception as e:
        print(f"Warning: Could not prepare raw data from {pdb_file}: {e}")
        # Return minimal raw data structure
        return {
            'sequence': native_seq,
            'coords_list': [],
            'sec_struct_list': ["."] * len(native_seq),
            'sasa_list': [np.ones(len(native_seq))]
        }

def evaluate_ribodiffusion_with_grnade_sc(samples, native_seq, pdb_file, pdb_id, 
                                         output_dir=None, save_results=True):
    """
    Evaluate RiboDiffusion samples using gRNAde's self-consistency evaluation.
    """
    # Prepare raw data for gRNAde evaluator
    raw_data = prepare_raw_data_for_grnade_eval(pdb_file, native_seq)
    
    # Convert RiboDiffusion samples to numpy arrays
    sample_arrays = []
    for sample in samples:
        if isinstance(sample, torch.Tensor):
            sample_arrays.append(sample.cpu().numpy())
        else:
            sample_arrays.append(np.array(sample))
    
    # Create mask for coordinates (assume all positions are valid for now)
    mask_coords = np.ones(len(native_seq), dtype=bool)
    
    # Calculate basic metrics
    results = {
        'pdb_id': pdb_id,
        'native_seq': native_seq,
        'samples': [],
        'recovery_rates': [],
        'edit_distances': [],
        'sc_scores_eternafold': []
    }
    
    # Convert native sequence to numerical for recovery calculation
    letter_to_num = {letter: idx for idx, letter in enumerate(NUM_TO_LETTER)}
    native_array = np.array([letter_to_num[char] for char in native_seq])
    
    print(f"\nEvaluating {len(samples)} samples for {pdb_id}:")
    
    for i, sample_array in enumerate(sample_arrays):
        designed_seq = ''.join([NUM_TO_LETTER[num] for num in sample_array])
        results['samples'].append(designed_seq)
        
        # Calculate recovery rate
        recovery = (sample_array == native_array).mean()
        results['recovery_rates'].append(recovery)
        
        # Calculate edit distance using gRNAde's function
        edit_dist = edit_distance(designed_seq, native_seq)
        results['edit_distances'].append(edit_dist)
        
        print(f'Sample {i+1}: Recovery={recovery:.4f}, Edit_dist={edit_dist}')
    
    # Calculate self-consistency scores using gRNAde's EternaFold evaluator
    print("\nCalculating self-consistency scores with EternaFold...")
    try:
        sc_scores = self_consistency_score_eternafold(
            sample_arrays,
            raw_data['sec_struct_list'],
            mask_coords
        )
        results['sc_scores_eternafold'] = sc_scores.tolist()
        
        for i, sc_score in enumerate(sc_scores):
            print(f'Sample {i+1}: SC_score={sc_score:.4f}')
            
    except Exception as e:
        print(f"Warning: Could not calculate SC scores: {e}")
        print("This might be due to EternaFold not being properly installed or configured.")
        results['sc_scores_eternafold'] = [0.0] * len(samples)
    
    # Save results
    if save_results and output_dir:
        os.makedirs(output_dir, exist_ok=True)
        
        # Save as FASTA file compatible with gRNAde format
        sequences = []
        
        # First record: input sequence with metadata
        sequences.append(SeqRecord(
            Seq(native_seq), 
            id="input_sequence,",
            description=f"pdb_id={pdb_id}, ribodiffusion_evaluation"
        ))
        
        # Remaining records: designed sequences with metrics
        for i, (seq, recovery, edit_dist, sc_score) in enumerate(zip(
            results['samples'], 
            results['recovery_rates'], 
            results['edit_distances'],
            results['sc_scores_eternafold']
        )):
            sequences.append(SeqRecord(
                Seq(seq), 
                id=f"sample={i},",
                description=f"recovery={recovery:.4f}, edit_dist={edit_dist}, sc_score={sc_score:.4f}"
            ))
        
        # Save FASTA
        fasta_path = os.path.join(output_dir, f"{pdb_id}_ribodiffusion_designs.fasta")
        SeqIO.write(sequences, fasta_path, "fasta")
        print(f"\nResults saved to: {fasta_path}")
    
    # Print summary statistics
    print(f"\n{'='*50}")
    print(f"Summary for {pdb_id}:")
    print(f"Native sequence length: {len(native_seq)}")
    print(f"Number of samples: {len(samples)}")
    print(f"Mean Recovery: {np.mean(results['recovery_rates']):.4f} ± {np.std(results['recovery_rates']):.4f}")
    print(f"Mean Edit Distance: {np.mean(results['edit_distances']):.2f} ± {np.std(results['edit_distances']):.2f}")
    
    if results['sc_scores_eternafold'][0] != 0.0:
        print(f"Mean SC Score (EternaFold): {np.mean(results['sc_scores_eternafold']):.4f} ± {np.std(results['sc_scores_eternafold']):.4f}")
    else:
        print("SC Scores not calculated (EternaFold unavailable)")
    print(f"{'='*50}")
    
    return results

# Initialize model
model = create_model(config)
ema = ExponentialMovingAverage(model.parameters(), decay=config.model.ema_decay)
params = model.parameters()
optimizer = get_optimizer(config, model.parameters())
state = dict(optimizer=optimizer, model=model, ema=ema, step=0)

model_size = sum(p.numel() for p in model.parameters()) * 4 / 2 ** 20
print('Model size: {:.1f}MB'.format(model_size))

# Load checkpoint
state = restore_checkpoint(checkpoint_path, state, device=config.device)
ema.copy_to(model.parameters())

# Initialize noise scheduler
noise_scheduler = NoiseScheduleVP(config.sde.schedule, 
                                 continuous_beta_0=config.sde.continuous_beta_0,
                                 continuous_beta_1=config.sde.continuous_beta_1)

# Obtain data scalar and inverse scalar
inverse_scaler = get_data_inverse_scaler(config)

# Setup sampling function
test_sampling_fn = get_sampling_fn(config, noise_scheduler, config.eval.sampling_steps, inverse_scaler)
pdb2data = functools.partial(du.PDBtoData, num_posenc=config.data.num_posenc, 
                            num_rbf=config.data.num_rbf, knn_num=config.data.knn_num)

# Run inference
pdb_file = '/home/remote1/geometric-rna-design/data/raw/7PIC_1_5.pdb'
pdb_id = os.path.basename(pdb_file).replace('.pdb', '')

# Configure sampling
config.eval.dynamic_threshold = True
config.eval.cond_scale = 0.4
config.eval.n_samples = 16

# Generate samples
print(f'Processing PDB: {pdb_file}')
test_sampling_fn = get_sampling_fn(config, noise_scheduler, config.eval.sampling_steps, inverse_scaler)
struct_data = pdb2data(pdb_file)
struct_data = tree.map_structure(
    lambda x: x.unsqueeze(0).repeat_interleave(config.eval.n_samples, dim=0).to(config.device), 
    struct_data
)
samples = test_sampling_fn(model, struct_data)

# Get native sequence
native_seq = ''.join(list(NUM_TO_LETTER[struct_data['seq'][0].cpu().numpy()]))
print(f'Native sequence: {native_seq}')

# Create output directory
current_time = datetime.now().strftime("%Y%m%d_%H%M%S")
output_dir = f"ribodiffusion_grnade_eval_{current_time}"

# Evaluate with gRNAde's self-consistency framework
results = evaluate_ribodiffusion_with_grnade_sc(
    samples=samples,
    native_seq=native_seq,
    pdb_file=pdb_file,
    pdb_id=pdb_id,
    output_dir=output_dir,
    save_results=True
)

Well, let’ s run it with almost 13000 data to test the actually performance, and we got:

Just do it!
Figure: results, raw and log-transformed.

Oh, actually, the diffusion model is like buying a lottery ticket. Well, well, well, let's move on towards geometric or large language models.

Simple method for gRNAde

One simple attention layer added to the model

Since multi-layer attention mechanisms can help capture long sequences, I started with MultiheadAttention and added a simple layer after the encoder, around the multi-conformation pooling step, for training. The original released code had some bugs, especially in the masking logic around mask_coords, which often caused dimension mismatches with the samples and threw errors; I've pasted my modified evaluator.py here. Also, to prevent GPU memory overflow, you'll need to set the batch size appropriately.

A simple attention layer; the corresponding change in models.py is here:

        #code before, Encoder layers (supports multiple conformations)
        self.encoder_layers = nn.ModuleList(
                MultiGVPConvLayer(self.node_h_dim, self.edge_h_dim, 
                                  activations=activations, vector_gate=True,
                                  drop_rate=drop_rate, norm_first=True)
            for _ in range(num_layers))
        #######################################
        # Simple self-attention on pooled scalar node features, our changes
        self.attn = nn.MultiheadAttention(
            embed_dim=self.node_h_dim[0], num_heads=4, dropout=drop_rate, batch_first=True
        )
        self.attn_ln = nn.LayerNorm(self.node_h_dim[0])
        ###########################################
        # Decoder layers

In the forward, we add:

# edges: (n_edges, d_se), (n_edges, d_ve, 3)
        h_V, h_E = self.pool_multi_conf(h_V, h_E, batch.mask_confs, edge_index)
###########################################################
        # Apply simple self-attention over nodes (sequence length = n_nodes)
        x = h_V[0].unsqueeze(0)  # (1, n_nodes, d_s)
        attn_out, _ = self.attn(x, x, x, need_weights=False)
        x = self.attn_ln(x + attn_out)
        h_V = (x.squeeze(0), h_V[1])
##########################################################
        encoder_embeddings = h_V

and in the sample forward,

# edges: (n_edges, d_se), (n_edges, d_ve, 3)
        h_V, h_E = self.pool_multi_conf(h_V, h_E, batch.mask_confs, edge_index)
        #############################################
        # Apply simple self-attention over nodes (sequence length = n_nodes)
        x = h_V[0].unsqueeze(0)  # (1, n_nodes, d_s)
        attn_out, _ = self.attn(x, x, x, need_weights=False)
        x = self.attn_ln(x + attn_out)
        ############################################
        h_V = (x.squeeze(0), h_V[1])

Since it’s usually hard to access wandb from local servers, and the original code relies heavily on wandb-related paths, I had to change the paths in trainer.py,

 if device.type == 'xpu':
        import intel_extension_for_pytorch as ipex
        model, optimizer = ipex.optimize(model, optimizer=optimizer)
    #=======same as before=======
        # Initialise save directory
    save_dir = os.path.join(os.path.dirname(__file__), "..", "mymodel")
    #save_dir = os.path.abspath(save_dir) or you can use this  
    os.makedirs(save_dir, exist_ok=True)

And we need to replace the wandb.run.dir path with save_dir. Besides, we use the autoregressive decoder as the default.
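Concretely, wherever the code asks for wandb.run.dir I use a fallback like this (the helper name is mine, not the repository's):

import os
import wandb

def get_save_dir(default_dir):
    # use the wandb run directory when a run is active, otherwise a local folder
    if wandb.run is not None:
        return wandb.run.dir
    os.makedirs(default_dir, exist_ok=True)
    return default_dir

save_dir = get_save_dir(os.path.join(os.path.dirname(__file__), "..", "mymodel"))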

When we want to train it, let's keep it moving:

nohup python main.py --no_wandb > main.log 2>&1 &
tail -f main.log
nvidia-smi

Get our first model

We finally get best_checkpoint.h5 as the trained model; then, following gRNAde.py:

CHECKPOINT_PATH = {
    'all': {
        1: os.path.join(PROJECT_PATH, "mymodel/best_checkpoint.h5"),#change
        2: os.path.join(PROJECT_PATH, "checkpoints/gRNAde_ARv1_2state_all.h5"),
        3: os.path.join(PROJECT_PATH, "checkpoints/gRNAde_ARv1_3state_all.h5"),
        5: os.path.join(PROJECT_PATH, "checkpoints/gRNAde_ARv1_5state_all.h5"),
    },
    'das': {
        1: os.path.join(PROJECT_PATH, "mymodel/best_checkpoint.h5"),#change
        2: os.path.join(PROJECT_PATH, "checkpoints/gRNAde_ARv1_2state_das.h5"),
        3: os.path.join(PROJECT_PATH, "checkpoints/gRNAde_ARv1_3state_das.h5"),
        5: os.path.join(PROJECT_PATH, "checkpoints/gRNAde_ARv1_5state_das.h5"),
    },
    'multi': {
        1: os.path.join(PROJECT_PATH, "checkpoints/gRNAde_ARv1_1state_multi.h5"),
        2: os.path.join(PROJECT_PATH, "checkpoints/gRNAde_ARv1_2state_multi.h5"),
        3: os.path.join(PROJECT_PATH, "checkpoints/gRNAde_ARv1_3state_multi.h5"),
        5: os.path.join(PROJECT_PATH, "checkpoints/gRNAde_ARv1_5state_multi.h5"),
    }
}

Note that sec_struct_utils.py sometimes deletes temporary files before we can process them, so we just comment out that code:

 output = subprocess.run(cmd, check=True, capture_output=True).stdout.decode("utf-8")

    # Delete temporary files (comment out these lines)
   # if sequence is not None:
   #     os.remove(fasta_file_path)

    if n_samples > 1:
        return output.split("\n")[:-1]
    else:
        return [output.split("\n")[-2]]

then,

    current_datetime = datetime.now().strftime("%Y%m%d_%H%M%S")
    try:
        fasta_file_path = os.path.join(wandb.run.dir, f"temp_{current_datetime}.fasta")
    except AttributeError:
        fasta_file_path = os.path.join(PROJECT_PATH, "temp", f"temp_{current_datetime}.fasta")  # changed the path like this; I usually don't use wandb

It works. I tested it with just epoch = 1, and the performance is, unsurprisingly, terrible.

Figure: results after training for only 1 epoch.

Then we increase it to 50 epochs, the author's default, and it performs well. However, in trainer.py the author trains without early stopping, so I added it to the script (sketched below).

Figure: results at epoch 50; the test looks good.
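The early stopping I added is essentially the following (a sketch with my own names; the real trainer.py loop has more bookkeeping):

import os
import torch

def fit(model, train_one_epoch, evaluate, save_dir, max_epochs=100, patience=10):
    """train_one_epoch() and evaluate() stand in for the existing per-epoch routines."""
    best_val, bad_epochs = float("inf"), 0
    for epoch in range(max_epochs):
        train_one_epoch()
        val_loss = evaluate()
        if val_loss < best_val:
            # new best model: reset patience and save the checkpoint
            best_val, bad_epochs = val_loss, 0
            torch.save(model.state_dict(), os.path.join(save_dir, "best_checkpoint.h5"))
        else:
            bad_epochs += 1
            if bad_epochs >= patience:
                print(f"Early stopping at epoch {epoch}")
                break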

Then I started thinking about multi-scale attention, following Atlas: Multi-Scale Attention Improves Long Context Image Modeling. So I added 3 layers of attention, though not yet multi-scale.

Multi-layer attention or multihead attention?

Testing on the 500-sample set, gRNAde gives:

| Length Range | Statistic | Perplexity | Recovery | Edit Distance | SC Score | Sample Count |
|---|---|---|---|---|---|---|
| 0-100 | Median | 1.4128 | 0.56445 | 17.5972 | 0.62315 | 234 |
| 100-200 | Median | 1.26885 | 0.73585 | 30.2634 | 0.61495 | 72 |
| 200+ | Median | 1.3132 | 0.64605 | 226.875 | 0.42835 | 138 |
| 0-100 | Mean | 1.47854274 | 0.569795726 | 18.55480427 | 0.634651282 | 234 |
| 100-200 | Mean | 1.34514583 | 0.685369444 | 37.49759306 | 0.569984722 | 72 |
| 200+ | Mean | 1.41243696 | 0.639286232 | 211.1403986 | 0.421031884 | 138 |

Multi-head attention:

| Length Range | Statistic | Perplexity | Recovery | Edit Distance | SC Score | Sample Count |
|---|---|---|---|---|---|---|
| 0-100 | Median | 1.30675 | 0.5532 | 17.21875 | 0.62105 | 234 |
| 100-200 | Median | 1.17275 | 0.7444 | 29.75 | 0.62205 | 72 |
| 200+ | Median | 1.30785 | 0.70485 | 183.84375 | 0.41915 | 140 |
| 0-100 | Mean | 1.36121325 | 0.572758974 | 18.34507564 | 0.638930342 | 234 |
| 100-200 | Mean | 1.24458056 | 0.704726389 | 35.62898472 | 0.605529167 | 72 |
| 200+ | Mean | 1.36845071 | 0.685876429 | 180.6455357 | 0.430865714 | 140 |

Multi-layer attention:

| Length Range | Statistic | Perplexity | Recovery | Edit Distance | SC Score | Sample Count |
|---|---|---|---|---|---|---|
| 0-100 | Median | 1.3358 | 0.5592 | 17.6667 | 0.6396 | 235 |
| 100-200 | Median | 1.2226 | 0.71205 | 33.78125 | 0.6771 | 72 |
| 200+ | Median | 1.32545 | 0.68665 | 193.0625 | 0.42465 | 140 |
| 0-100 | Mean | 1.36930766 | 0.557402979 | 18.98926 | 0.660275319 | 235 |
| 100-200 | Mean | 1.28780417 | 0.682672222 | 37.93399306 | 0.624776389 | 72 |
| 200+ | Mean | 1.40186786 | 0.673555 | 187.8361607 | 0.434380714 | 140 |

We then compare the SC score and recovery.

After that, we test gRNAde and the multi-layer attention variant on 12,000+ structures:

Figure: SC score improved.

Data leakage issue came up

The author had over 13,000 samples (from RNAsolo 2023), with about 12,000 usable ones. Among them, the training split has 12,000+, the validation set has about 500+, and only around 200 are in the test set.

What’s that supposed to mean? 12,000 : 500 : 200 — that’s nowhere near the usual 8:1:1 split! And it gets even worse — those 200 test samples don’t even go beyond a length of 200. Like, what am I supposed to test with that?

Figure: test set length distribution (0-160 nt).

So we need to improve this split. My idea is to first group the clusters by length range (0-100, 100-200, and 200+), then, within each range, assign clusters to train/validation/test at an 8:1:1 ratio, and finally extract the PDB indices contained in each cluster. That should make the experiments more balanced and meaningful (a sketch follows the split figures below).

Figure: train split by length range (12,000+ structures).
Figure: validation split by length range (~450 structures).
Figure: test split (almost 660 structures).
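Here is a minimal sketch of that length-stratified split (cluster_to_length and cluster_to_pdbs are hypothetical mappings built from the qTMclust output, not the repository's actual variables):

import random

def stratified_cluster_split(cluster_to_length, cluster_to_pdbs, seed=0):
    random.seed(seed)
    splits = {"train": [], "val": [], "test": []}
    ranges = {"0-100": (0, 100), "100-200": (100, 200), "200+": (200, float("inf"))}
    for lo, hi in ranges.values():
        # clusters whose representative length falls in this range
        clusters = [c for c, length in cluster_to_length.items() if lo <= length < hi]
        random.shuffle(clusters)
        n_train, n_val = int(0.8 * len(clusters)), int(0.1 * len(clusters))
        chunks = [clusters[:n_train], clusters[n_train:n_train + n_val], clusters[n_train + n_val:]]
        for name, chunk in zip(["train", "val", "test"], chunks):
            for c in chunk:
                splits[name].extend(cluster_to_pdbs[c])  # expand clusters into PDB indices
    return splits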

Meanwhile, RNASolo has 3,000+ more structures in 2025 than in 2023. Maybe we can use these? But for now let's stick to the primary 2023 dataset so that training and testing stay comparable.

2025 RNASolo

Optimization methods and experiments

After I evenly split the test set, I first retrained and tested the gRNAde model. The results were a bit unstable, but it wasn’t a big deal. Simple plot below:

result in test split

After that, I trained gRNAde for 100 epochs with early stopping, using seed = 0, 1, 2. Besides, I found that the previous attention approach helped capture long-range interactions, so I'm planning to optimize the GVP layer next.

Graph-Mamba

Mamba

Unlike an RNN, an SSM (State Space Model) can "remember" interactions across a graph or a text sequence better; its formulation is written out after the figures below.

SSM model structure
SSM architecture details
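Concretely, the linear SSM and its discretized form (as used in S4-style models) are:

$$ x'(t) = A\,x(t) + B\,u(t), \qquad y(t) = C\,x(t) $$

$$ x_k = \bar{A}\,x_{k-1} + \bar{B}\,u_k, \qquad y_k = C\,x_k, \qquad \bar{A} = e^{\Delta A}, \quad \bar{B} = (\Delta A)^{-1}\left(e^{\Delta A} - I\right)\Delta B $$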

Its performance is promising:

SSM performance comparison

SSMs have rich theoretical properties but suffer from high computational cost and numerical instability. So HiPPO (High-order Polynomial Projection Operator) was proposed to make the SSM practical, as follows:

HiPPO framework
HiPPO optimization details

Notably, this line of work developed into the S4 model, which shows a strong capability to capture long-range dependencies.

S4 model capabilities

Graph-Mamba: Towards Long-Range Graph Sequence Modeling with Selective State Spaces

From Graph-Mamba, the key points are:

  • SSM: State Space Models, with linear (in sequence length) complexity.

  • Based on GraphGPS.

  • Graph-Mamba-Block(GMB) replaced Attention.

  • MPNN (Message Passing Neural Network): GatedGCN.

  • GraphGPS (MPNN) + GMB = Graph-Mamba.

  • Optimization: Node Prioritization; Permutation-based Training.

Graph-Mamba architecture

In the same way, we follow the algorithm and replace the GVPConv layer with our GraphGPS-Mamba layer.

Algorithm implementation
GatedGCN and GraphGPS integration

The code loads GatedGCN as the MPNN, combines it with GraphGPS, and then adds the Mamba block, roughly as in the sketch below. However, GraphGPS, or more specifically GatedGCN, was not designed to model our RNA molecules, so in our tests it performs worse than the baseline.

Baseline comparison results
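For reference, the layer I swapped in looks roughly like the following sketch, using PyG's ResGatedGraphConv as the GatedGCN-style MPNN and the mamba_ssm package for the selective SSM; these choices and all names are my assumptions, not the Graph-Mamba authors' code:

import torch
import torch.nn as nn
from torch_geometric.nn import ResGatedGraphConv
from mamba_ssm import Mamba

class GPSMambaLayer(nn.Module):
    """One GraphGPS-style block: local MPNN branch plus global Mamba mixing over nodes."""
    def __init__(self, dim):
        super().__init__()
        self.mpnn = ResGatedGraphConv(dim, dim)   # local message passing (GatedGCN-like)
        self.mamba = Mamba(d_model=dim)           # selective SSM over the node sequence
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)

    def forward(self, x, edge_index, node_order=None):
        # local branch with a residual connection
        x = self.norm1(x + self.mpnn(x, edge_index))
        # global branch: order the nodes (Graph-Mamba prioritizes e.g. high-degree nodes),
        # run Mamba over them as one sequence, then scatter back to the original order
        if node_order is None:
            node_order = torch.arange(x.size(0), device=x.device)
        mixed = self.mamba(x[node_order].unsqueeze(0)).squeeze(0)
        out = torch.empty_like(x)
        out[node_order] = mixed
        return self.norm2(x + out)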

Long-Short Transformer and Graph Long-Short Transformer

We will learn from the article:

Long-Short Transformer: Efficient Transformers for Language and Vision

It shows a promising way to combine different attention outputs. In detail:

  • Long-Short Transformer

  • Short-term attention via sliding window

  • Long-range attention via dynamic projection

  • Aggregating long-range and short-term attentions, with DualLN

Long-Short Transformer architecture

We then develop it into a Long-Short Graph Transformer, using MultiGVPConv for message passing, a sliding window for short-range attention, and dynamic projection for long-range attention. In the code, we mix the edge features into the nodes and operate on the node scalar features, because the original Long-Short Transformer works on 1D sequences for sentence modeling. The dynamic-projection part is sketched after the figures below.

Long-short Graph Transformer implementation
Feature processing pipeline
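The dynamic-projection branch, in a minimal single-head form as I understand it from the paper (names and shapes are mine, not the authors' code):

import math
import torch
import torch.nn as nn

class DynamicProjectionAttention(nn.Module):
    def __init__(self, dim, proj_len=32):
        super().__init__()
        self.q = nn.Linear(dim, dim)
        self.k = nn.Linear(dim, dim)
        self.v = nn.Linear(dim, dim)
        self.to_p = nn.Linear(dim, proj_len)  # per-token scores that define the projection

    def forward(self, x):                      # x: (B, N, D) node scalar features
        q, k, v = self.q(x), self.k(x), self.v(x)
        # data-dependent projection compressing N tokens into proj_len "landmarks"
        P = torch.softmax(self.to_p(x).transpose(1, 2), dim=-1)          # (B, r, N)
        k_land, v_land = P @ k, P @ v                                     # (B, r, D)
        attn = torch.softmax(q @ k_land.transpose(1, 2) / math.sqrt(q.size(-1)), dim=-1)
        return attn @ v_land                                              # (B, N, D)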

Unluckily, although it outperformed all the others on long-range RNA modeling (200+ nt), performance on the short and medium ranges dropped.

GAT-GVP long-short graph transformer

Following Graph Attention Networks, we replaced the short-range branch with a GAT to capture neighbor interactions while maintaining the 3D information.

GAT-GVP architecture
GAT neighbor interaction mechanism

Experiments

Learning rate

Learning rate comparison

We take our graph transformer, which simply adds a transformer into the layer, train it with different learning rates, and find that lr = 0.0002 works best.

Comparison

We trained with seed 0, and:

Training results with seed0

After that, we chose some of them to train with 3 seeds (seed = 0, 1, 2), then:

Multi-seed training comparison

Other Paper Reading

Off-topic story — So last week I was totally slacking off when I suddenly saw an email from my advisor’s Zoom: “xxx invites you… at 11:30.” I thought maybe the meeting time had changed. But then I noticed the original 9:30 link was still active, not canceled.

So on the meeting day, the theory slides in my PPT still weren’t done, and I figured I’d hop into the 9:30 call just to check—you know, just in case they actually started then. And sure enough, I heard: “Alright, let’s get started.”

…And that’s how I GAME OVER…

This week I’m not doing anything fancy — just reading papers properly. Gotta figure out how to tell a story in the next report, right? Anyway, I’ll jot down a few things here that I think are somewhat insightful.

A Hyperbolic Discrete Diffusion 3D RNA Inverse Folding Model for functional RNA design

Abstract & Introduction

  • Background: revolutionary opportunities for diverse RNA-based biotechnologies and biomedical applications.

primary goal of RNA inverse folding is to design nucleotide sequences that can fold into a given RNA secondary or tertiary structure.

  • Challenge: limited availability of experimentally derived 3D structural data and unique characteristics of RNA 3D structures; existing tools either focus on 2D inverse folding tasks or learn limited 3D structural features from experimentally determined and predicted 3D structure datasets for the 3D inverse folding problem.
  • High-resolution RNA structures are significantly less common compared to those of proteins.

RIdiffusion, a hyperbolic denoising diffusion generative RNA inverse folding model.

RIdiffusion efficiently recovers the distribution of nucleotides for targeted RNA 3D structures based on limited training samples using a discrete diffusion model.

Honestly, the background part is usually just the same old recycled stuff, but different papers tend to focus on different challenges. What really stood out to me in this one is how it combines geometric deep learning with a discrete diffusion model — plus, the evaluation is super thorough. You can tell they really put in the work on this one.

  • structure based.
  • the promise of creating novel functional RNAs.

The “low-sample” challenge underscores the importance of learning efficiency in generative models in 3D RNA inverse folding tasks.

Diffusion: modeling the inverse folding problem as diffusion denoising.

RNA structures exhibit inherent hierarchical organization (e.g., from base-pairing, secondary structure elements, and remote pseudoknots to complex 3D shapes), which motivates using hyperbolic space.

Why? Because:

compared to Euclidean space, a hyperbolic distance better captures molecular differences and improve performance in small molecule generation tasks

In other words, this hyperbolic metric captures structural features better than the Euclidean one, or, put differently, it reflects subtle structural differences more effectively. The authors illustrate this with the following figure:

Figure: the hyperbolic transformation (panel A) and the distribution of RNA structures in hyperbolic space (panel D).

We can mainly focus on panels A and D. Panel A shows a schematic of the transformation, while panel D visualizes how the RNA structures are distributed in hyperbolic space. Simply put, different chains (or backbones) are separated more clearly there, which means the features can be captured more effectively.

about the model

hyperbolic equivariant graph neural networks (HEGNNs) to parameterize the discrete diffusion model, and effectively capture the 3D structural characteristics by incorporating RNA’s geometric features and topological properties into the generation process.

framework:

Figure: the RIdiffusion framework.

MATERIALS AND METHODS

I mainly want to understand the Hyperbolic Equivariant Graph Denoising Network. Normally, models tend to lose some information when dealing with this kind of 3D data.

So the authors utilize SE(3)-equivariant neural layers from Equivariant Graph Neural Networks (EGNN) [49]. This approach allows the model to preserve SO(3) rotational equivariance and E(3) translational invariance when updating the representations of nodes and edges, thereby retaining geometric information, keeping the hidden representations robust and effective, and enabling the EGNN to update edges as well. In essence, they just add an extra layer.

Datasets and Setup

So how did the authors split the data?

We followed the same splitting methods outlined in RDesign to ensure a high-quality dataset with only experimentally derived structures, which consists of 1773 structures for training, 221 structures for validation, and 223 structures for testing in an 8:1:1 ratio.

That is, an 8:1:1 ratio.

In addition, to evaluate the generalization ability, the authors used PSI-CD-HIT to cluster the data based on nucleotide similarity, setting similarity thresholds of 0.8, 0.9, and 1.0.

Evaluation Metrics

Besides the recovery rate, there is the Macro-F1 score:

The Macro-F1 score is used to assess the accuracy and comprehensiveness of the model in predicting protein or RNA sequences, particularly in cases where the distribution of different letter residues is imbalanced, by evaluating the Macro-F1 score for each residue in the sequence.

$$ \mathrm{NoveltyScore}(x_i) = 1.0 - \mathrm{RecoveryRate}(x_i, x_{\mathrm{most\,similar}}) $$

There’s also a GSN (Global-scale Novelty) metric, where 100 samples are drawn using RR-based stratified sampling.

And there’s another one, quite similar to SCC — it’s based on structures folded with RhoFold, then structural similarity is computed using US-align, which gives the RMSD value.

Baselines

The baselines are 1 RNA tertiary structure inverse folding model, 3 protein tertiary structure inverse folding models, and 1 Transformer-based GNN for graph structures: gRNAde, PiFold, StructGNN, GVP-GNN, and GraphTrans.

The authors of RiboDiffusion and RhoDesign provided no training scripts in their released code.

Results

we divided the dataset of seq-0.8 into short (<50nt), medium (50~100nt), and long (>100nt)

Emm… is that long???

The novelty of generative designs can be viewed from two perspectives: sequence novelty and physicochemical properties novelty. In our case, sequence and physicochemical properties (derived from sequence) novelty are important in RNA 3D inverse folding. Indeed, sequence features such as physicochemical properties play a crucial role in functional RNA design. For instance, it was reported that natural RNAs show privileged physicochemical property space which is related to their biological functions. Other sequence features like GC content are important properties related to developability of RNA molecules. Given the nature of one-to-many relationship between the targeted RNA structure and its possible sequences, the novelty and diversity of generated sequences are important considerations for improving physicochemical properties of the designed sequences.

Figure: evaluation in physicochemical property space.

Beyond RNA Structure Alone: Complex-Aware Feature Fusion for Tertiary Structure-based RNA Design

Abstract

  • important in synthetic biology and therapeutics
  • limitation: overlook the role of complex-level information

In fact, the authors argue that analyzing RNA alone isn’t enough — you also have to look at the related complexes.

our method incorporates protein features extracted by protein language model (e.g., ESM-2)

To address the biological complexity of protein-RNA interactions, they propose distance-aware filtering of local features from the protein representation.

a high-affinity design framework that combines our CARD with an affinity evaluation model.

Final goal: produce high-affinity RNA sequences

Introduction

  • more sophisticated computational approaches that can model the complex relationship between RNA sequences and their structures.

The authors also mentioned LEARNA and Meta-LEARNA, which are RL-based, as well as RDesign and RhoDesign, which are based on 3D graph representations.

  • Different from protein folding, which roughly follows Anfinsen's rule, RNAs are highly flexible and rely on interactions with proteins, DNA, and other biomolecules to achieve folding.
  • RNAs may adopt various conformations by interacting with distinct macromolecules at different stages of functioning.
Figure: RNA conformations when interacting with different macromolecules.

This paper considers both RNA structural features and protein–RNA interactions to generate sequences comprehensively.

The outline, as shown in Figure 1, includes a pre-trained protein language model (PLM) to handle proteins, and a Complex-Aware Transformer (CAFormer) to integrate RNA and protein features. They use distance-aware filtering to select relevant regions while ignoring unrelated areas. This setup allows the model to capture the context of protein–RNA complexes and focus on the interaction regions within the complex.
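A minimal sketch of that distance-aware filtering as I read it (variable names and shapes are my assumptions, not the paper's code); it simply keeps the K protein residues closest to the RNA:

import torch

def filter_protein_features(prot_feats, prot_coords, rna_coords, k=64):
    """prot_feats: (Np, D) ESM-2 embeddings; prot_coords: (Np, 3); rna_coords: (Nr, 3)."""
    # distance from every protein residue to its nearest RNA nucleotide
    dists = torch.cdist(prot_coords, rna_coords).min(dim=1).values   # (Np,)
    k = min(k, prot_feats.size(0))
    idx = torch.topk(-dists, k).indices                              # K closest residues
    return prot_feats[idx], idx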

Overall, the authors iteratively screen the generated RNA sequences: candidate sequences are first filtered based on predicted affinity scores, and then structural compatibility is validated using multiple folding models.

Affinity Evaluation

Regarding “affinity”, the authors mention predictive models like PNAB, DeePNAP, PredPRBA, PRdeltaGPred, and PRA-Pred.

However, structural data for RNA–protein complexes is very limited. CoPRA integrates PRA310 and introduces a multi-modal model, which combines RNA and protein language models with comprehensive structural information.

Additionally, the authors also discuss the development and background of inverse folding:

RNA design aims to generate sequences that fold into predefined structures. Early methods focused on secondary structure optimization, using thermodynamic parameters and energy minimization. Tools like RNAfold predict RNA secondary structures based on the minimum free energy principle. As understanding of RNA structure has advanced, focus has shifted to complex tertiary structure-based design due to RNA’s high conformational flexibility, which challenges traditional thermodynamic methods. Recent deep learning approaches for RNA tertiary structure design include gRNAde, which utilizes geometric deep learning and graph neural networks to generate RNA sequences; RiboDiffusion, a diffusion model for inverse folding leveraging RNA backbone structures; RDesign, which employs a data-efficient learning framework with contrastive learning for tertiary structures; and Rhodesign, focusing on RNA aptamer design by guiding sequence generation through structural predictions.

Methods

Figure: method overview.

It’s actually quite comprehensive. Regarding that filter, since a single protein can bind RNA through multiple sites (different conformations), and a single RNA can also interact with different proteins, the authors provide the following figure to illustrate this:

Figure: one protein binding RNA at multiple sites, and one RNA interacting with different proteins.

COMPLEX-AWARE ATTENTION BLOCK

For this part, the authors first concatenate the RNA and the filtered protein information, and then use N multi-head self-attention layers to enhance the model’s ability to capture contextual interactions.

Figure: the complex-aware attention block.

After that, they use a Transformer-based decoder, similar to RhoDesign, to generate RNA sequences. The decoding is framed as a next-token prediction task, where the decoder predicts the next nucleotide based on the previously generated ones. During training, the model is optimized using cross-entropy loss.

High-Affinity RNA Design Framework

  • structure-to-sequence design model and evaluation tools.

First, RNA sequences are designed using the authors’ model to meet complex-specific constraints (like binding sites and interactions). The sequences are then evaluated with ensemble regression models trained on the PRA201 dataset, while their structures are folded using AlphaFold3, RhoFold, and RoseTTAFold2NA to calculate RMSD. In each iteration, the top 10%–20% of sequences are selected, and the process is repeated until an optimal set is obtained.
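A rough sketch of that loop as I understand it; every callable here is a placeholder for the authors' components (the CARD generator, the affinity regressors, and a folding-plus-RMSD check), not real code from the paper:

def high_affinity_design_loop(design_fn, affinity_fn, fold_rmsd_fn, n_rounds=5, keep_frac=0.15):
    """design_fn() -> list of sequences; affinity_fn(seq) -> float; fold_rmsd_fn(seq) -> float."""
    candidates = design_fn()
    for _ in range(n_rounds):
        scored = sorted(
            ((seq, affinity_fn(seq), fold_rmsd_fn(seq)) for seq in candidates),
            key=lambda t: (-t[1], t[2]),          # high affinity first, low RMSD as tie-break
        )
        keep = max(1, int(keep_frac * len(scored)))   # keep roughly the top 10-20%
        candidates = [seq for seq, _, _ in scored[:keep]]
    return candidates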

Experiments

Dataset and Implementation Details

  • Datasets: PRI30K and PRA201.
  • The latter is used for the blind test, while the former removes structurally similar cases for training and testing. After processing, the dataset contains 21,050 protein–RNA pairs and 2,309 RNA sequences. Clustering is also done using CD-HIT with an 80% similarity threshold.
  • 200 epochs, batch size of 48 (24 per GPU), attention blocks N = 6, local filtered size K = 64, ESM-2 650M.
  • Comparison: SeqRNN and SeqLSTM, StructGNN, GraphTrans, RDesign, and RhoDesign.

Ablation Studies

The key components studied include the number of attention blocks in CAFormer, the way the protein representation is filtered, and the choice of protein representation model.

Ablation study results

RiboDiffusion: tertiary structure-based RNA inverse folding with generative diffusion models

Abstract:

  • scarcity of data, nonunique structure-sequence mapping, flexibility of RNA conformation.

  • growing applications in synthetic biology and therapeutics;

  • inverse folding problem;

Introduction:

  • RNA-based biotechnology;

  • early methods focus on folding into RNA secondary structures; some use efficient local search strategies guided by the energy function, while others work globally by modeling the sequence distribution or directly manipulating diverse candidates;

  • DAS physically based approach, but still constrained by local design strategy and computational efficiency.

– only 2D, without considering 3D structures of RNA

RiboDiffusion model overview

Methods:

Diffusion;

Three-atom coarse-grained representation including the atom coordinates of C4’, C1’, N1 (pyrimidine) or N9 (purine) for every nucleotide;

Consider the RNA inverse folding problem as modeling the conditional distribution p(S|X);

Based on the GVP-GNN architecture;

Results

Dataset: predicting structures with RhoFold. The structures predicted from RNAcentral sequences are filtered by pLDDT to keep only high-quality predictions, resulting in 17,000 structures.

Clustering: for sequence similarity, PSI-CD-HIT clusters sequences by nucleotide identity; thresholds of 0.8/0.6/0.4 give 1252/1157/1114 clusters. For structural similarity, the TM-score matrix is computed with US-align and clustered with scipy, giving 2036/1659/1302 clusters at TM-score thresholds of 0.6/0.5/0.4. 15% is used for testing and 10% for validation.

Evaluation Metrics:

RR, F1 score (secondary structure via RNAfold), and RMSD.

Learning to Design RNA

Abstract:

  • Based on deep reinforcement learning;

  • Jointly optimize over a rich space of architectures for the policy network, the hyperparameters of the training procedure and the formulation of the decision process.

Introduction:

  • LEARNA: deep RL algorithm for RNA Design. Given a target secondary structure, can predict the entire RNA sequence. After generating an RNA sequence, then folds this and locally adapts it, and uses the distance of the resulting structure to the target structure as an error signal for the RL agent;

  • Meta-LEARNA: learns a single policy across many RNA Design tasks directly applicable to new RNA Design tasks;

  • The first application of architecture search (AS) to RL;

  • New benchmark dataset with an explicit training, validation and test split;

  • Faster.

THE RNA DESIGN PROBLEM:

  • Folding algorithm: Zuker;

  • Loss function: Hamming distance (see the sketch after this list).
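A tiny sketch of that signal: fold the designed sequence, compare dot-bracket strings of equal length, and turn the Hamming distance into a reward (the normalization is my assumption):

def hamming_distance(a: str, b: str) -> int:
    # assumes equal-length dot-bracket strings
    return sum(x != y for x, y in zip(a, b))

def reward(folded: str, target: str) -> float:
    # 1.0 means the designed sequence folds exactly into the target structure
    return 1.0 - hamming_distance(folded, target) / len(target)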

LEARNING TO DESIGN RNA:

Action Space, State Space, Transition Function, Reward Function;

  • One problem of current deep reinforcement learning methods is that their performance can be very sensitive to choices regarding the architecture of the policy network, the training hyperparameters, and the formulation of the problem as a decision process.

  • RL often yields noisy or unreliable outcomes in single optimization runs; the paper therefore reports (a) the number of unsolved sequences, (b) the sum of mean distances, and (c) the sum of minimum distances to the target structure.

RL optimization metrics

RNA-DCGen: Dual Constrained RNA Sequence Generation with LLM-Attack

  • Challenge: recent diffusion and flow-matching-based models face two key limitations: specialization for fixed constraint types, such as tertiary structures, and lack of flexibility in imposing additional conditions beyond the primary property of interest.

  • Generative –> Search;

  • Dual Constrained: on the sequence itself and on specified constraints.

RNA-DCGen framework