spann3r论文阅读

发表于 2025/03/10

作者 winka9587

12 分钟阅读

Spann3R 主要数据流向：
┌──> 输入图像 I_t  (当前帧)
│
├──> ViT 编码器 (Encoder_I)  # 提取视觉特征
│     └──> 视觉特征 f_I^t
│
├──> 记忆查询特征 f_Q^{t-1} (上一帧查询)
│     └──> 进入 记忆读取 (Memory Read)
│
├──> 记忆模块 (Spatial Memory)
│     ├──> 记忆键 f_K (Memory Key)
│     ├──> 记忆值 f_V (Memory Value)
│     └──> 输出 融合特征 f_G^{t-1}  (基于 f_Q^{t-1} 读取)
│
├──> 交织解码器 (Intertwined Decoders)
│     ├──> 目标解码器 (Target Decoder)
│     │     ├──> 输入: f_I^t (当前帧视觉特征), f_G^{t-1} (记忆读取结果)
│     │     ├──> 交叉注意力 处理两者
│     │     └──> 输出 查询特征 f_Q^t (用于下一帧)
│     │
│     ├──> 参考解码器 (Reference Decoder)
│     │     ├──> 输入: f_I^t, f_G^{t-1}
│     │     ├──> 交叉注意力 处理
│     │     ├──> 输出 点图 X_{t-1}, 置信度 C_{t-1}
│     │     ├──> 输出 记忆键 f_K^{t-1}
│     │     ├──> 输出 记忆值 f_V^{t-1} (编码自 X_{t-1})
│     │     └──> 更新 记忆模块
│
└──> 生成最终 3D 点云 (Pointmap X_t) 并用于下一帧

Encoder

$f^Q_{t-1}$是前一帧($t-1$时刻)通过Target Decoder生成的查询特征(参考公式4)。

这其实是一个loop, 用于重建场景的几何信息其实是隐式地存储在Decoder中的。

train

Q: spann3r可以接收有序/无序的数据作为输入, 那么在训练的时候, 是使用成对的数据训练还是整个连续序列进行训练?

A: 训练数据可以是成对的图像，也可以是整个序列，根据数据集和参数设置来改变。

数据集

下载进度

图标	说明
:white_check_mark:	基本完成不排除有新增任务的可能
:black_square_button:	尚未完成
:heavy_check_mark:	可预见的未来不需要进行任何修改
:construction:	近期（2周内）正在处理
:warning:	实现有待讨论

dataset	size	download	preprocess	ready
Scannet++	1.5T	:heavy_check_mark:	:heavy_check_mark:	:heavy_check_mark:
Scannet		:heavy_check_mark:	:heavy_check_mark:	:heavy_check_mark:
Habitat		:construction:	:construction:	:construction:
ArkitScenes		:heavy_check_mark:	:heavy_check_mark:	:heavy_check_mark:
Co3D		:construction:	:construction:	:construction:

Scannet++

download

python download_scannetpp.py download_scannetpp.yml

下载过程中难免出现中断的情况, 下载完成后将download_scannetpp.yml中的dry_run修改为true可以检查缺失的文件并自动下载缺失文件。

download_scannetpp.yml中描述:

###### specify the assets to download ######
# by default, these assets are downloaded for each scene:
# mesh+3D semantics, lowres dslr, iphone data
# see `scene_assets` and the dataset documentation page for more info on each asset
default_assets: [
                # dslr lists, images
                dslr_train_test_lists_path, dslr_resized_dir, dslr_resized_mask_dir,
                # camera poses
                dslr_colmap_dir, dslr_nerfstudio_transform_path,
                # mesh
                scan_mesh_path, scan_mesh_mask_path,
                # annotation
                scan_mesh_segs_path, scan_anno_json_path, scan_sem_mesh_path,
                # iphone video and depth
                iphone_video_path, iphone_video_mask_path, iphone_depth_path,
                # camera poses
                iphone_pose_intrinsic_imu_path, iphone_colmap_dir, iphone_exif_path
              ]

每个scene的结构：

/dataset/ScanNetpp/data/00777c41d4$ tree -L 4
.
├── dslr  # 单反相机
│   ├── colmap
│   │   ├── cameras.txt
│   │   ├── images.txt
│   │   └── points3D.txt
│   ├── nerfstudio
│   │   └── transforms.json
│   ├── resized_anon_masks
│   │   ├── DSC00850.png
│   │   ├── DSC00851.png
        ...
│   │   └── DSC01111.png
│   ├── resized_images
│   │   ├── DSC00850.JPG
│   │   ├── DSC00851.JPG
        ...
│   │   └── DSC01111.JPG
│   └── train_test_lists.json
├── iphone
│   ├── colmap
│   │   ├── cameras.txt
│   │   ├── images.txt
│   │   └── points3D.txt
│   ├── depth.bin
│   ├── exif.json
│   ├── pose_intrinsic_imu.json
│   ├── rgb_mask.mkv
│   └── rgb.mkv
└── scans
    ├── mesh_aligned_0.05_mask.txt
    ├── mesh_aligned_0.05.ply
    ├── mesh_aligned_0.05_semantic.ply
    ├── segments_anno.json
    └── segments.json

预处理

官方toolbox

spann3r仅使用DSLR数据（dust3r使用DSLR+iphone 参考https://github.com/HengyiWang/spann3r/issues/47）

generate undistorted DSLR depth.

先按scannetpp的预处理提取sens

然后按SplaTAM的处理提取

执行代码

1.Undistortion: convert fisheye images to pinhole with OpenCV

python -m dslr.undistort dslr/configs/undistort.yml

2.Render Depth for DSLR and iPhone

https://github.com/liu115/renderpy

安装依赖

git clone --recursive https://github.com/liu115/renderpy    # clone with submodules
conda create -n renderpy python=3.9
conda activate renderpy
pip install .

编辑data_root路径

python -m common.render common/configs/render.yml

python -m common.render common/configs/render.yml
scene:   0%|                                                                                                                                                                                                              | 0/2 [00:00<?, ?it/s]Init EGL
Detected 0 devices
scene:   0%|                                                                                                                                                                                                              | 0/2 [00:00<?, ?it/s]
Traceback (most recent call last):
  File "/data3/cxx/conda_envs/scanntpp/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/data3/cxx/conda_envs/scanntpp/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/data5/cxx/dataset/SplaTAM_Nik-V9_scannetpp/common/render.py", line 89, in <module>
    main(args)
  File "/data5/cxx/dataset/SplaTAM_Nik-V9_scannetpp/common/render.py", line 47, in main
    render_engine = renderpy.Render()
RuntimeError: Failed to initialize EGL

没有运行Undistortion: convert fisheye images to pinhole with COLMAP

官方仓库标记该预处理已经被废弃，可能是因为数据集中已经包含了结果。参考dslr/colmap下已经有了对应的colmap数据。

生成深度图

在执行下面代码前的结构是

data/7b6477cb95/dslr$ ls
colmap  nerfstudio  resized_anon_masks  resized_images  train_test_lists.json  undistorted_anon_masks  undistorted_images

https://github.com/Nik-V9/scannetpp?tab=readme-ov-file#render-depth-for-dslr-and-iphone

渲染深度图

python -m common.render common/configs/render.yml

生成dslr深度图

python -m common.render dslr/configs/undistort.yml

iphone预处理

这部分使用的是官方toolbox仓库的代码, 因为后来的Scannet++v2版本中将iphone数据的mp4格式修改为了mkv格式, Nik-V9的仓库中没有更新。

复制一份iphone/configs/prepare_iphone_data.yml模版, 修改data_root路径。

python -m iphone.prepare_iphone_data iphone/configs/prepare_iphone_data.yml

Scannet

Scannetv2 structure

Pre-process

follow instruction in data_preprocess.md of spann3r

ScanNet -> Step2. Extract and organize the dataset using pre-process script in SimpleRecon

install packages in env.yml of simplerecon

# simplerecon/data_scripts/scannet_wrangling_scripts/env.yml
name: scannet_extraction
dependencies:
  - python=3.9.7
  - numpy
  - imageio
  - pillow
  - tqdm
  - pip
  - pip:
    - opencv-python
    - pypng

Extracting data from .sens files

生成无失真的DSLR深度

set SCANNET_ROOT and OUTPUT_PATH.

OUTPUT_PATH can be the same directory as the ScanNet root directory SCANNET_ROOT.

# SCANNET_ROOT=/data4/cxx/dataset/ScanNet/ScanNet
# OUTPUT_PATH=/data4/cxx/dataset/ScanNet/ScanNet
SCANNET_ROOT=/PATH/TO/ScanNet
OUTPUT_PATH=/PATH/TO//ScanNet
cd simplerecon/data_scripts/scannet_wrangling_scripts/

extract all scans for test

python reader.py --scans_folder $SCANNET_ROOT/scans_test \
                 --output_path  $OUTPUT_PATH/scans_test \
                 --scan_list_file splits/scannetv2_test.txt \
                 --num_workers 12 \
                 --export_poses \
                 --export_depth_images \
                 --export_color_images \
                 --export_intrinsics;

extract all scans for train & val

python reader.py --scans_folder $SCANNET_ROOT/scans \
                 --output_path  $OUTPUT_PATH/scans \
                 --scan_list_file splits/scannetv2_train.txt \
                 --num_workers 12 \
                 --export_poses \
                 --export_depth_images \
                 --export_color_images \
                 --export_intrinsics;

python reader.py --scans_folder $SCANNET_ROOT/scans \
                 --output_path  $OUTPUT_PATH/scans \
                 --scan_list_file splits/scannetv2_val.txt \
                 --num_workers 12 \
                 --export_poses \
                 --export_depth_images \
                 --export_color_images \
                 --export_intrinsics;

最终处理后的结构为:

SCANNET_ROOT(will be mentioned as /PATH/TO/ScanNet)
    scans_test (test scans)
        scene0707
            scene0707_00_vh_clean_2.ply (gt mesh)
            sensor_data
                frame-000261.pose.txt
                frame-000261.color.jpg 
                frame-000261.color.512.png (optional, image at 512x384)
                frame-000261.color.640.png (optional, image at 640x480)
                frame-000261.depth.png (full res depth, stored scale *1000)
                frame-000261.depth.256.png (optional, depth at 256x192 also scaled)
            scene0707.txt (scan metadata and image sizes)
            intrinsic
                intrinsic_depth.txt
                intrinsic_color.txt

        ...
    scans (val and train scans)
        scene0000_00
            (see above)
        scene0000_01
        ....

habitat

habitat中合成训练数据, 使用0.1的最小重叠率作为超参数，渲染5帧训练序列。作者认为这可能会引入偏差，导致模型在重建Habitat-sim等合成数据集（例如Habitat-sim中的Replica数据集，未将其纳入训练）时表现不佳

根据issue44, Habitat 的训练数据是否仅包含 path.py 中两个未注释的场景（Gibson 和habitat-test-scene，如果目标只是复现结果，不需要HM3D数据集。

ARKitScenes

ARKitScenes数据集包含3个数据集: 3dod, upsampling, raw

根据官方描述:

3dod - 用于训练三维物体检测的数据集。该数据集包含3类资源：低分辨率RGB图像、低分辨率深度图像及标注文件（5047次threedod扫描总容量为623.4GB）
upsampling - 用于训练深度图上采样的数据集。包含3类资源：高分辨率RGB图像、低分辨率深度图像及高分辨率深度图像
raw - 包含ARKitScenes所有原始数据的全集，其中3dod和深度上采样数据集均为其子集。该数据集还包含大量未用于3DOD或深度上采样的其他资源

3dod/threedod

python download_data.py 3dod –video_id_csv threedod/3dod_train_val_splits.csv –download_dir ./data/raw_ARKitScenes

depth_upsampling

python download_data.py upsampling –video_id_csv depth_upsampling/upsampling_train_val_splits.csv –download_dir ./data/raw_ARKitScenes

raw

raw数据集有些特殊, 可以指定分辨率版本(这里只使用lowres)

python3 download_data.py raw –video_id_csv raw/raw_train_val_splits.csv –download_dir /tmp/ar_raw_all/ –raw_dataset_assets mov annotation mesh confidence highres_depth lowres_depth lowres_wide.traj lowres_wide lowres_wide_intrinsics

运行命令:

(/data3/cxx/conda_envs/spann3r) cxx@king:/data3/cxx/workspace/siggraph/spann3r/data/ARKitScenes/ARKitScenes$ python3 download_data.py raw --video_id_csv raw/raw_train_val_splits.csv --download_dir ./data/ar_raw_all/ --raw_dataset_assets mov annotation mesh confidence highres_depth lowres_depth lowres_wide.traj lowres_wide lowres_wide_intrinsics

cd /PATH/spann3r/data/ARKitScenes/ARKitScenes
python3 download_data.py raw --video_id_csv raw/raw_train_val_splits.csv --download_dir ./data/ar_raw_all/ --raw_dataset_assets mov annotation mesh confidence highres_depth lowres_depth lowres_wide.traj lowres_wide lowres_wide_intrinsics

Co3D

cd /PATH/TO/co3d_repo  # github repo位置
# debug /data5/cxx/dataset/co3d/co3d_repo
python ./co3d/download_dataset.py --download_folder /data5/cxx/dataset/co3d/data

gpu limit

参考论文4.4节

作者反复提到, 由于gpu的内存限制(32G显存)，训练使用5帧序列，限制了Spann3R能够处理的空间区域的大小。

build_dataset

def build_dataset(dataset, batch_size, num_workers, test=False):
    split = ['Train', 'Test'][test]
    print(f'Building {split} Data loader for dataset: ', dataset)
    loader = get_data_loader(dataset,
                             batch_size=batch_size,
                             num_workers=num_workers,
                             pin_mem=True,
                             shuffle=not (test),
                             drop_last=not (test))

    print(f"{split} dataset length: ", len(loader))
    return loader

以以下输入为例

"--batch_size", "4",
"--train_dataset", "10000 @ Scannet(split='train', ROOT='/dataset/ScanNet', resolution=224, transform=ColorJitter, max_thresh=50)",

例如

train_dataset = "10000 @ Co3d(...) + 10000 @ BlendMVS(...)"

这里表示将两个数据集分别调整为 10000 个样本后，再将它们组合到一起。

具体代码在dust3r/datasets/base/easy_dataset.py中。

完整的训练数据集, 包含所有的样本, 经过10000 @ Scatnet采样后, 数量为10000

$ len(data_loader_train.dataset)
10000

当batch_size为4时,

$ len(data_loader_train)
2500

本文由作者按照 CC BY 4.0 进行授权