Unifying Scene Representation and Hand-Eye Calibration with 3D Foundation Models (2024)

Weiming Zhi, Haozhan Tang, Tianyi Zhang, Matthew Johnson-Roberson
The authors are with the Robotics Institute, Carnegie Mellon University. Correspondence to wzhi@andrew.cmu.edu.

Abstract

Representing the environment is a central challenge in robotics, and is essential for effective decision-making. Traditionally, before capturing images with a manipulator-mounted camera, users need to calibrate the camera using a specific external marker, such as a checkerboard or AprilTag. However, recent advances in computer vision have led to the development of 3D foundation models. These are large, pre-trained neural networks that can establish fast and accurate multi-view correspondences with very few images, even in the absence of rich visual features. This paper advocates for the integration of 3D foundation models into scene representation approaches for robotic systems equipped with manipulator-mounted RGB cameras. Specifically, we propose the Joint Calibration and Representation (JCR) method. JCR uses RGB images, captured by a manipulator-mounted camera, to simultaneously construct an environmental representation and calibrate the camera relative to the robot’s end-effector, in the absence of specific calibration markers. The resulting 3D environment representation is aligned with the robot’s coordinate frame and maintains physically accurate scales. We demonstrate that JCR can build effective scene representations using a low-cost RGB camera attached to a manipulator, without prior calibration.

I Introduction

The manipulator-mounted camera setup, where the camera is rigidly attached to the manipulator, enables the robot to actively perceive its environment and is a common setup for robot manipulation. The robot needs to extract a concise representation of the physical properties of the environment from the collected data, enabling it to operate safely and make informed decisions. Compared to fixed cameras, manipulator-mounted cameras allow the robot system to adjust its viewing pose to reduce occlusion and obtain measurements at diverse angles and distances. However, manipulator-mounted cameras also come with their challenges — the camera must be calibrated before collecting data from the environment. Specifically, to obtain scene representations that the robot can plan within, it is important to transform the representation into the reference frame of the robot base. This process of finding the camera pose relative to the end of the manipulator, or end-effector, is known as hand-eye calibration [1].

Classical hand-eye calibration is an elaborate procedure that requires the camera to move to a diverse dataset of poses and record multiple images of an external calibration marker, usually a checkerboard or an AprilTag [2]. Then, the rigid body transformation between the camera and the end-effector can be computed. This complex procedure can be a hurdle for non-experts since it necessitates the creation of dedicated markers and the collection of a new dataset each time the camera is recalibrated.

Deep learning approaches have driven recent advances within the computer vision community. This has led to the emergence of large pre-trained models, such as DUSt3R [3], which greatly outperform classical approaches for multi-view problems. These models are trained on large datasets and intended as plug-and-play modules to facilitate a wide range of downstream tasks. We describe these models as 3D Foundation Models [4] and advocate for their integration in robot camera calibration and scene representation.

[Figure 1: Diagrammatic overview of the Joint Calibration and Representation (JCR) method.]

In this paper, we contribute the Joint Calibration and Representation (JCR) method. JCR leverages a 3D foundation model to jointly conduct hand-eye calibration and construct a scene representation in the coordinate frame of the manipulator's base, from a small set of images collected by an RGB camera mounted on the manipulator. Previous approaches using manipulator-mounted sensors require capturing images of external markers, which are then used to perform calibration. To the best of our knowledge, the proposed approach is the first to simultaneously calibrate the camera and build a scene representation from the same set of images captured by a manipulator-mounted camera. We obtain a model of our environment in the robot's coordinate frame, without any a priori calibration, external markers, or depth readings. The constructed scene representation is a continuous model which can be used for collision-checking in subsequent motion planning. We validate the robustness of our approach on a variety of real-world datasets collected with a low-cost camera mounted on a 6-DOF manipulator. A diagrammatic overview of JCR is presented in Figure 1.

The remainder of the paper is organized as follows: we begin by discussing related work in Section II and then introduce the necessary background on 3D foundation models in Section III. We detail the technical aspects of the Joint Calibration and Representation (JCR) method in Section IV, follow with empirical evaluations of JCR in Section V, and conclude by summarizing our findings and outlining future research directions in Section VI.

II Related Work

Scene Representation: Early work on representing environments in robotics typically recorded environment properties in discretized cells, with the most notable approach being Occupancy Grid Maps [5]. Distance-based representations have also been applied in robotics to check for collisions [6, 7]. Advances in machine learning have motivated the development of continuous representation methods, which functionally represent structure in the environment, for example using Gaussian processes [8], kernel regression [9, 10], Bayesian methods [11, 12, 13], and neural networks [14]. Deep learning approaches that operate directly on point clouds [15, 16] have also been developed, as have methods that ingest point clouds directly for robot planning [17]. Concurrently, the computer vision community has pursued photo-realistic scene representations, including Neural Radiance Fields (NeRFs) [18] and subsequent variants [19, 20]. These rely on obtaining an initial solution from Structure-from-Motion methods [21] and train an implicit model that matches the environment's appearance.

Hand-eye Calibration: Hand-eye calibration is a well-studied problem with geometric solutions [1, 22, 23] developed to solve for the transformation when some external calibration marker, such as a checkerboard, is provided. A recent learning approach for hand-eye calibration is presented in [24], but requires a part of the robot’s gripper to be visible in the camera view. Additionally, there exist methods in end-to-end policy learning [25, 26] which directly train for actions from the camera images, and do not require hand-eye calibration. However, unlike our method, these methods are unable to construct an environment model which can then be used for collision-checking in downstream motion planning and decision-making [27, 28, 29].

Pre-trained Models: The machine learning community has made considerable efforts to develop large-scale models trained on extensive web data, resulting in significant advancements in large deep learning models for natural language processing [30] and computer vision [31]. In particular, [3, 32] are pre-trained models that can be applied to 3D tasks. These large pre-trained models are known as foundation models [4] and are typically treated as black boxes whose outputs are used in subsequent downstream tasks. Although these outputs generally require further processing before they can be applied to robotics tasks, there has been widespread interest in incorporating foundation models within robot systems [33].

III Preliminaries: 3D Foundation Models for Dense Reconstruction

Traditional methods for 3D tasks, such as Structure-from-Motion [21] or multi-view stereo [34], depend on identifying visual features across a set of images to construct the corresponding 3D structure. In contrast, pre-trained models such as Dense Unconstrained Stereo 3D Reconstruction (DUSt3R) have been trained on large datasets and can identify correspondences over a set of images even without strong visual features. Throughout this work, we use DUSt3R [3] as the foundation model and follow its conventions. Here we briefly outline how the foundation model estimates relative camera poses from RGB images; more details can be found in [3].

Pairwise Pixel Correspondence: Suppose we have a pair of RGB images of width $W$ and height $H$, i.e. $I_1, I_2 \in \mathbb{R}^{W\times H\times 3}$. The foundation model produces pointmaps $X^{1,1}, X^{1,2} \in \mathbb{R}^{W\times H\times 3}$, which assign each pixel in the 2D images its predicted 3D coordinates; critically, both pointmaps are expressed in the coordinate frame of $I_1$. Confidence maps $C^{1,1}, C^{1,2} \in \mathbb{R}^{W\times H}$ are also produced for each pointmap, indicating the uncertainty of the foundation model's prediction at each point. By matching each pixel's predicted 3D coordinates in one pointmap to the nearest predicted coordinates in the other pointmap, we can find dense correspondences between pixels in the image pair, without handcrafted features.
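To make this concrete, the sketch below shows one way such dense correspondences could be extracted from a pair of predicted pointmaps by mutual nearest-neighbour matching in 3D. The function name, the confidence threshold, and the use of a k-d tree are our own illustrative choices rather than DUSt3R's actual matching procedure.

```python
import numpy as np
from scipy.spatial import cKDTree

def mutual_correspondences(X11, X12, C11, C12, conf_thresh=3.0):
    """Mutual nearest-neighbour matching between the 3D points predicted for a pair
    of images; both (H, W, 3) pointmaps live in the coordinate frame of the first image."""
    p1, p2 = X11.reshape(-1, 3), X12.reshape(-1, 3)
    idx1 = np.where(C11.reshape(-1) > conf_thresh)[0]   # keep confident pixels only
    idx2 = np.where(C12.reshape(-1) > conf_thresh)[0]
    nn12 = cKDTree(p2[idx2]).query(p1[idx1])[1]         # closest point in image 2
    nn21 = cKDTree(p1[idx1]).query(p2[idx2])[1]         # closest point in image 1
    mutual = nn21[nn12] == np.arange(len(idx1))         # keep cyclically consistent matches
    return idx1[mutual], idx2[nn12[mutual]]             # flattened pixel indices in I1, I2
```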

Recovering Relative Camera Poses: We recover the relative camera poses for a set of images by globally aligning the pairwise pointmaps predicted by the foundation model. For a set of $N$ images, we have cameras $n=1,\ldots,N$ and possible image pairs with indices $(n,m)\in\varepsilon$, where $m=1,\ldots,N$ and $m\neq n$. For each pair, the foundation model gives us: (1) pointmaps $X^{n,n}, X^{n,m}\in\mathbb{R}^{W\times H\times 3}$ in the frame of $I_n$; (2) corresponding confidence maps $C^{n,n}, C^{n,m}\in\mathbb{R}^{W\times H}$. With these, we optimize to find: (1) for each of the $N$ images, a pointmap in global coordinates $\hat{X}^n$; (2) a rigid transformation described by $P_n\in\mathbb{R}^{3\times 4}$ and a scale factor $\sigma_n>0$.

Intuitively, the same transformation should align both pointmaps of each pair with their counterparts in the global coordinate frame. We can then minimize the distance between the transformed pointmaps and the estimated pointmaps in global coordinates:

$$\min_{\hat{X},P,\sigma}\;\sum_{(n,m)\in\varepsilon}\;\sum_{i\in(n,m)}\;\sum_{(w,h)} C^{n,i}_{w,h}\,\bigl\lVert \hat{X}^{i}_{w,h}-\sigma_{n} P_{n} X^{n,i}_{w,h}\bigr\rVert_{2}. \qquad (1)$$

Here, $(n,m)\in\varepsilon$ denotes the pairs, $i$ iterates through the two images in each pair, and $(w,h)$ iterates through the pixels of the image. Equation 1 can be optimized efficiently via gradient descent. With the pointmaps over a set of images in the same coordinate frame, we can extract the set of camera poses $P$ and the globally aligned pointmaps $\hat{X}$ over the set of images, and form a pointset representation of the environment.
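As an illustration of Equation 1, the following is a minimal PyTorch sketch of the global alignment, assuming the pairwise pointmaps and confidence maps have already been computed. It is a simplified stand-in for DUSt3R's own global aligner; the axis-angle pose parametrization, learning rate, and step count are illustrative assumptions.

```python
import torch

def skew(w):
    """Batch of 3-vectors -> 3x3 skew-symmetric matrices."""
    zeros = torch.zeros_like(w[..., 0])
    return torch.stack([
        torch.stack([zeros, -w[..., 2], w[..., 1]], dim=-1),
        torch.stack([w[..., 2], zeros, -w[..., 0]], dim=-1),
        torch.stack([-w[..., 1], w[..., 0], zeros], dim=-1)], dim=-2)

def global_alignment(pairs, n_images, n_steps=300, lr=1e-2):
    """Minimize Eq. (1) by gradient descent.

    `pairs` maps an image pair (n, m), with 0-based indices, to the foundation
    model's outputs (X_nn, X_nm, C_nn, C_nm): two (H, W, 3) pointmaps in the
    frame of image n and their (H, W) confidence maps."""
    H, W, _ = next(iter(pairs.values()))[0].shape
    X_hat = torch.zeros(n_images, H, W, 3, requires_grad=True)   # global pointmaps
    omega = torch.zeros(n_images, 3, requires_grad=True)         # axis-angle rotations
    trans = torch.zeros(n_images, 3, requires_grad=True)         # translations
    log_sigma = torch.zeros(n_images, requires_grad=True)        # log scale factors
    opt = torch.optim.Adam([X_hat, omega, trans, log_sigma], lr=lr)
    for _ in range(n_steps):
        opt.zero_grad()
        R = torch.matrix_exp(skew(omega))       # (N, 3, 3), stays on SO(3)
        sigma = log_sigma.exp()                 # sigma_n > 0
        loss = 0.0
        for (n, m), (X_nn, X_nm, C_nn, C_nm) in pairs.items():
            for i, X, C in ((n, X_nn, C_nn), (m, X_nm, C_nm)):
                aligned = sigma[n] * (X @ R[n].T + trans[n])     # sigma_n P_n X^{n,i}
                loss = loss + (C * (X_hat[i] - aligned).norm(dim=-1)).sum()
        loss.backward()
        opt.step()
    # In practice an additional constraint (e.g. fixing one pose and the overall
    # scale) is needed to remove the trivial solution and gauge ambiguity.
    return (X_hat.detach(), torch.matrix_exp(skew(omega)).detach(),
            trans.detach(), log_sigma.exp().detach())
```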

However, the outputs of the foundation model cannot be used directly to build representations for robots to operate in: they need to be expressed in the robot's coordinate frame. Additionally, 3D foundation models typically cannot recover physically accurate scale. The factor $\sigma_n$ does not correspond to a physical scale, and we need to re-scale distances by an unknown model-to-reality factor for them to be physically accurate.

Algorithm 1: Joint Calibration and Representation (JCR)

Input: End-effector poses $\{E_1,\ldots,E_N\}$, captured images $\{I_1,\ldots,I_N\}$, 3D foundation model ($\mathtt{3DFM}$), neural network $f_\theta$.

1. $\{P_i\}_{i=1}^{N}, \{\hat{X}_i\}_{i=1}^{N}, \{C_i\}_{i=1}^{N} \leftarrow \mathtt{3DFM}(\{I_1,\ldots,I_N\})$;  Input the images into the foundation model to obtain uncalibrated camera poses, pointmaps, and confidence maps.

2. $T_{E_i}^{E_{i+1}} = E_{i+1}(E_i)^{-1}$, $\;T_{P_i}^{P_{i+1}} = P_{i+1}(P_i)^{-1}$, for $i=1,\ldots,N-1$;  Rearranging Equation 4.

3. Obtain $R_c^{e*}$ via Equation 12;

4. Solve for $\mathbf{t}_c^{e*}, \lambda^*$ via Equation 13;

5. Obtain $\{\mathbf{x}_i\}_{i=1}^{N_{pc}}$ by filtering confident 3D points from $\{\hat{X}_i\}_{i=1}^{N}$ with $\{C_i\}_{i=1}^{N}$;

6. $\bar{\mathbf{x}}_i = E^{-1} T_c^{e*}(\lambda^* \mathbf{x}_i)$, where $T_c^{e*} = \begin{bmatrix} R_c^{e*} & \mathbf{t}_c^{e*} \\ \mathbf{0}^\top & 1 \end{bmatrix}$;  Transform $\mathbf{x}$ into the robot's frame via Equation 17.

7. Train $f_\theta$ via Equation 18.

Output: Occupancy representation $f_\theta$, and transformation $T_c^{e*}$.

IV Joint Calibration and Representation

IV-A Overview

We tackle the problem setup of a manipulator with an inexpensive RGB camera rigidly mounted on it. We do not require the mounted camera to be calibrated; that is, the camera pose relative to the end-effector is unknown. We command the end-effector to a small set of $N$ poses, $\{E_1,\ldots,E_N\}$, and capture an image at each pose. This gives us a set of $N$ images of the environment, $\{I_1,\ldots,I_N\}$, which can be input into the foundation model to obtain a set of aligned relative camera poses $\{P_1,\ldots,P_N\}$ and pointmaps $\{\hat{X}_1,\ldots,\hat{X}_N\}$, expressed in an arbitrary coordinate system and at an arbitrary scale. Figure 2 shows an example before and after calibration and scaling: the relative camera poses from the foundation model are initially inconsistent with the end-effector poses and with physical scale.

Here, JCR seeks to recover:

  • The rigid transformation, $T_c^e$, from the frame of the mounted camera to that of the end-effector.

  • A scale factor $\lambda$ that scales the foundation model's unscaled outputs to true physical scale.

  • A representation of the environment in the robot's frame that identifies occupied space and other properties of interest, such as segmentation classes and RGB colour.

Obtaining $T_c^e$ enables us to efficiently incorporate new images from the camera into the robot's frame, while the environment representation is critical for downstream planning tasks. An algorithmic outline of JCR is presented in Algorithm 1.

[Figure 2: Relative camera poses and pointmaps before and after calibration and scaling.]

IV-B Calibration With Foundation Model Outputs

Here, we seek to solve for $T_c^e$ given the end-effector poses $\{E_1,\ldots,E_N\}$ and the predicted unscaled relative camera poses $\{P_1,\ldots,P_N\}$. We consider transformations between subsequent end-effector poses, $T_{E_i}^{E_{i+1}}$, and between subsequent camera poses, $T_{P_i}^{P_{i+1}}$, where

$$\begin{cases} E_{i+1} = T_{E_i}^{E_{i+1}} E_i, \\ P_{i+1} = T_{P_i}^{P_{i+1}} P_i, \end{cases} \quad \text{for } i=1,\ldots,N-1. \qquad (4)$$
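As a concrete illustration, the consecutive relative transforms used in Algorithm 1 (step 2) follow from rearranging Equation 4; the sketch below assumes the poses are given as 4x4 homogeneous matrices.

```python
import numpy as np

def relative_transforms(poses):
    """Consecutive relative transforms T_i^{i+1} = pose_{i+1} @ pose_i^{-1}, from Eq. (4).
    `poses` is a list of 4x4 homogeneous matrices (end-effector poses E_i, or camera
    poses P_i predicted by the foundation model)."""
    return [poses[i + 1] @ np.linalg.inv(poses[i]) for i in range(len(poses) - 1)]
```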

As the foundation model does not recover absolute scale, we introduce a scale factor $\lambda$ and write the transformation between scaled estimated camera poses as:

$$T_{P_i}^{P_{i+1}}(\lambda) = \begin{bmatrix} R_{P_i}^{P_{i+1}} & \lambda\,\mathbf{t}_{P_i}^{P_{i+1}} \\ \mathbf{0}^\top & 1 \end{bmatrix} \in \mathbf{SE}(3), \qquad (5)$$

where $R_{P_i}^{P_{i+1}}\in\mathbf{SO}(3)$ denotes the rotation component of $T_{P_i}^{P_{i+1}}$ and $\mathbf{t}_{P_i}^{P_{i+1}}\in\mathbb{R}^{3}$ denotes the translation. Note that scaling the distances between predicted camera poses does not affect the rotation, but scales the translation.

The relationship between $T_{E_i}^{E_{i+1}}$, $T_{P_i}^{P_{i+1}}(\lambda)$ and the desired $T_c^e$ follows the matrix equation from classical hand-eye calibration [1]:

$$T_{E_i}^{E_{i+1}}\, T_c^e = T_c^e\, T_{P_i}^{P_{i+1}}(\lambda), \qquad (6)$$

and we solve for the best-fit $T_c^e$ and $\lambda$. We begin by solving for the rotational term $R_c^e$, following [23] and considering the log map from $\mathbf{SO}(3)$ to its Lie algebra $\mathfrak{so}(3)$, where for some $R\in\mathbf{SO}(3)$,

$$\omega = \arccos\!\left(\frac{\mathrm{Tr}(R)-1}{2}\right), \qquad (7)$$

$$\mathrm{LogMap}(R) := \frac{\omega}{2\sin(\omega)} \begin{bmatrix} R_{3,2}-R_{2,3} \\ R_{1,3}-R_{3,1} \\ R_{2,1}-R_{1,2} \end{bmatrix} \in \mathfrak{so}(3). \qquad (11)$$

Here, the subscripts index the elements of $R$, and $\mathrm{Tr}(\cdot)$ denotes the trace operator. Then, the best-fit rotation $R_c^{e*}$ can be found via:

$$R_c^{e*} = (M^\top M)^{-\frac{1}{2}} M^\top, \quad \text{where } M = \sum_{i=1}^{N-1} \mathrm{LogMap}(R_{E_i}^{E_{i+1}}) \otimes \mathrm{LogMap}(R_{P_i}^{P_{i+1}}), \qquad (12)$$

where $\otimes$ denotes the outer product, and the matrix inverse square root can be computed efficiently via singular value decomposition.
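A minimal NumPy sketch of Equations 7, 11 and 12 is given below; the clipping of the arccos argument and the handling of near-zero rotations are our own numerical safeguards.

```python
import numpy as np

def log_map(R):
    """Log map of Eqs. (7) and (11): rotation matrix -> 3-vector in so(3)."""
    omega = np.arccos(np.clip((np.trace(R) - 1.0) / 2.0, -1.0, 1.0))
    if np.isclose(omega, 0.0):
        return np.zeros(3)
    return omega / (2.0 * np.sin(omega)) * np.array(
        [R[2, 1] - R[1, 2], R[0, 2] - R[2, 0], R[1, 0] - R[0, 1]])

def best_fit_rotation(R_E_rel, R_P_rel):
    """Best-fit R_c^e from paired relative rotations, Eq. (12)."""
    M = sum(np.outer(log_map(RE), log_map(RP)) for RE, RP in zip(R_E_rel, R_P_rel))
    U, S, Vt = np.linalg.svd(M.T @ M)                 # (M^T M)^{-1/2} via SVD
    return U @ np.diag(1.0 / np.sqrt(S)) @ Vt @ M.T
```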

Next, we find the best-fit translation $\mathbf{t}_c^{e*}$ and scale $\lambda^*$ by minimizing the residuals of Equation 6, which we formulate as the Scale Recovery Problem (SRP):

$$\text{SRP:}\quad \arg\min_{\mathbf{t}_c^e,\,\lambda} \;\sum_{i=1}^{N-1} \bigl\lVert C_i \mathbf{t}_c^e - \mathbf{d}_i(\lambda) \bigr\rVert_2^2, \qquad (13)$$

$$\text{where } C_i = I - R_{E_i}^{E_{i+1}}, \quad \mathbf{d}_i(\lambda) = \mathbf{t}_{E_i}^{E_{i+1}} - R_c^{e*}\bigl(\lambda\,\mathbf{t}_{P_i}^{P_{i+1}}\bigr). \qquad (14)$$

For a fixed $\lambda$, the translation admits the closed-form solution

$$\mathbf{t}_c^{e*}(\lambda) = (C^\top C)^{-1} C^\top \mathbf{d}(\lambda). \qquad (15)$$

Here, $C$ is the matrix formed by stacking the $C_i$ vertically, and $\mathbf{d}(\lambda)$ is the vector formed by stacking the corresponding $\mathbf{d}_i(\lambda)$. We can then optimize $\lambda$ numerically to obtain the optimal scale factor $\lambda^*$. The entire camera-to-end-effector transformation can then be assembled as

$$T_c^{e*} = \begin{bmatrix} R_c^{e*} & \mathbf{t}_c^{e*} \\ \mathbf{0}^\top & 1 \end{bmatrix} \in \mathbf{SE}(3). \qquad (16)$$
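The sketch below illustrates solving the SRP of Equations 13-15: for each candidate $\lambda$ the translation is obtained in closed form by least squares, and $\lambda$ itself is found with a bounded 1-D search. We use SciPy's minimize_scalar here; the paper only states that $\lambda$ is optimized numerically, so the choice of optimizer and its bounds are our assumptions.

```python
import numpy as np
from scipy.optimize import minimize_scalar

def translation_and_scale(R_E_rel, t_E_rel, t_P_rel, R_ce):
    """Best-fit t_c^e and scale lambda from the SRP (Eqs. 13-16)."""
    C = np.vstack([np.eye(3) - R_E for R_E in R_E_rel])          # stacked C_i

    def solve_for(lam):
        d = np.concatenate([t_E - R_ce @ (lam * t_P)              # stacked d_i(lambda)
                            for t_E, t_P in zip(t_E_rel, t_P_rel)])
        t_ce = np.linalg.lstsq(C, d, rcond=None)[0]               # Eq. (15)
        return t_ce, np.sum((C @ t_ce - d) ** 2)                  # residual of Eq. (13)

    res = minimize_scalar(lambda lam: solve_for(lam)[1],
                          bounds=(1e-3, 1e3), method="bounded")
    lam_star = res.x
    t_ce_star = solve_for(lam_star)[0]
    T_ce = np.eye(4)                                              # assemble Eq. (16)
    T_ce[:3, :3], T_ce[:3, 3] = R_ce, t_ce_star
    return T_ce, lam_star
```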

TABLE I: Hand-eye calibration across three scenes with varying numbers of input images: JCR (ours) compared with camera poses from COLMAP [21] followed by the same calibration procedure.

                                Light Tabletop (8 items)    Light Tabletop (7 items)    Dark Tabletop
Images provided                  10      12      15          10      12      15          10      12      15
Ours
  Converged                      ✓       ✓       ✓           ✓       ✓       ✓           ✓       ✓       ✓
  Residual δ_t                   0.0420  0.0419  0.0396      0.0208  0.0317  0.0357      0.0310  0.0536  0.0414
  Residual δ_R                   0.0655  0.0657  0.0513      0.0519  0.0623  0.0701      0.0732  0.0742  0.0818
  No. of poses                   10      12      15          10      12      15          10      12      15
COLMAP [21] + Calibration
  Converged                      ×       ×       ×           ✓       ✓       ✓           ×       ×       ✓
  Residual δ_t                   NA      NA      NA          0.0412  0.0412  0.0469      NA      NA      0.0454
  Residual δ_R                   NA      NA      NA          1.27    1.27    0.0662      NA      NA      0.0503
  No. of poses                   2       2       2           5       5       10          4       4       10

IV-C Map Construction with Foundation Model Outputs

Next, we build representations of the environment from the outputs of the foundation model: a set of aligned pointmaps $\{\hat{X}_1,\ldots,\hat{X}_N\}$ with associated confidence maps $\{C_1,\ldots,C_N\}$. We set a confidence threshold and discard the points in each $\hat{X}$ whose confidence falls below it, obtaining a 3D point cloud $\{\mathbf{x}_i\}_{i=1}^{N_{pc}}$ in the coordinate frame of some camera pose $P$, with the corresponding end-effector pose $E$. We transform the point cloud from the camera's coordinate frame to the robot's and adjust the scale to match the real world via:

$$\bar{\mathbf{x}}_i = E^{-1} T_c^{e*}(\lambda^* \mathbf{x}_i), \quad \text{for } i=1,\ldots,N_{pc}, \qquad (17)$$

where $\bar{\mathbf{x}}_i\in\mathbb{R}^{3}$ are now in the robot's frame, and $T_c^{e*}$, $\lambda^*$ are solutions of Equation 13.
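A short sketch of Equation 17, following the paper's pose convention, where E is the end-effector pose associated with the camera pose of the filtered point cloud:

```python
import numpy as np

def to_robot_frame(points_cam, E, T_ce, lam):
    """Apply Eq. (17) to camera-frame points of shape (N, 3)."""
    pts_h = np.hstack([lam * points_cam, np.ones((len(points_cam), 1))])  # homogeneous
    return (np.linalg.inv(E) @ T_ce @ pts_h.T).T[:, :3]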

Representing Occupancy: Occupancy information, i.e. whether a coordinate is occupied or not, is useful for planning tasks in the environment. Here, we use a small neural network $f_\theta$ to learn a continuous, implicit model of occupancy that assigns each spatial coordinate a probability of being occupied. We take a Noise Contrastive Estimation (NCE) [35] approach and minimize the binary cross-entropy loss (BCELoss) [36], with the $\bar{\mathbf{x}}_i$ as positive examples and uniformly drawn negative examples $\bar{\mathbf{x}}_i^{neg}$. Similar to NeRF models [18], we apply a sinusoidal positional embedding $\phi$ to the coordinates before inputting the encoding to the network. Our loss function is given by:

$$L(\theta) = \mathrm{BCELoss}\Bigl(\{f_\theta(\phi(\bar{\mathbf{x}}_i))\}_{i=1}^{N_{pc}},\ \{f_\theta(\phi(\bar{\mathbf{x}}_i^{neg}))\}_{i=1}^{N_{pc}}\Bigr), \quad \text{where } \bar{\mathbf{x}}_i^{neg} \sim U(\bar{\mathbf{x}}_{min}^{neg}, \bar{\mathbf{x}}_{max}^{neg}), \qquad (18)$$

where $U(\bar{\mathbf{x}}_{min}^{neg}, \bar{\mathbf{x}}_{max}^{neg})$ denotes a uniform distribution between the boundaries $\bar{\mathbf{x}}_{min}^{neg}$ and $\bar{\mathbf{x}}_{max}^{neg}$. We train the fully connected neural network, with a sigmoid output layer, by optimizing over the parameters $\theta$. The trained network can then be queried to predict whether a region of space is occupied, and we can further build representations that capture properties of the occupied spatial coordinates.
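Below is a minimal PyTorch sketch of the occupancy model of Equation 18: a sinusoidal positional encoding followed by a small MLP with a sigmoid output, trained with observed points as positives and uniform samples as negatives. The number of encoding frequencies, the optimizer, and the training schedule are illustrative assumptions.

```python
import math
import torch
import torch.nn as nn

class OccupancyNet(nn.Module):
    """f_theta: sinusoidal positional encoding followed by a small MLP."""
    def __init__(self, n_freqs=8, hidden=256):
        super().__init__()
        self.n_freqs = n_freqs
        self.mlp = nn.Sequential(
            nn.Linear(3 * 2 * n_freqs, hidden), nn.ReLU(),
            nn.Linear(hidden, 1), nn.Sigmoid())

    def encode(self, x):                                   # x: (B, 3) coordinates
        freqs = 2.0 ** torch.arange(self.n_freqs, device=x.device) * math.pi
        angles = x[..., None] * freqs                      # (B, 3, n_freqs)
        return torch.cat([angles.sin(), angles.cos()], dim=-1).flatten(1)

    def forward(self, x):
        return self.mlp(self.encode(x)).squeeze(-1)

def train_occupancy(points, lo, hi, steps=2000, lr=1e-3):
    """Fit the occupancy model with the contrastive BCE objective of Eq. (18).
    `points` is an (N_pc, 3) tensor of robot-frame points; `lo`, `hi` are (3,)
    tensors bounding the region from which negatives are drawn."""
    model = OccupancyNet()
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    bce = nn.BCELoss()
    for _ in range(steps):
        neg = lo + (hi - lo) * torch.rand_like(points)     # uniform negatives
        x = torch.cat([points, neg], dim=0)
        y = torch.cat([torch.ones(len(points)), torch.zeros(len(neg))])
        opt.zero_grad()
        loss = bce(model(x), y)
        loss.backward()
        opt.step()
    return model
```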

Representing Segmentation: Beyond occupancy, we may also wish to segment the 3D space into semantically meaningful parts, for example, to differentiate objects on the tabletop within the representation. As the pointmaps output by the foundation model correspond pixel-wise to the RGB images, we can run 2D segmentation on the images (for example, with pre-trained models such as Segment Anything [37]). Given segmentation labels for each pixel of the provided images, we can assign a segmentation class to each 3D point, arriving at $\{\bar{\mathbf{x}}_i, y_i^{seg}\}_{i=1}^{N_{pc}}$, where $y_i^{seg}$ is the segmentation class label corresponding to each $\bar{\mathbf{x}}_i$. We then treat representation construction as a multi-class classification problem, apply the positional encoding $\phi$, and optimize the multi-class cross-entropy loss [36]:

$$L(\theta) = \mathrm{CrossEntropy}\Bigl(\{f_\theta(\phi(\bar{\mathbf{x}}_i))\}_{i=1}^{N_{pc}},\ \{y_i^{seg}\}_{i=1}^{N_{pc}}\Bigr). \qquad (19)$$

Here $f_\theta$ is a fully connected neural network with a softmax output layer.

Representing Continuous Properties: We can also learn a neural network $f_\theta$ that assigns potentially multi-dimensional continuous properties to spatial coordinates by simply regressing onto labels, for example, continuous colour values of points in the scene. As the pointmaps from the foundation model correspond pixel-wise to the input images, we can obtain a 3-dimensional RGB colour label for each point, giving us a dataset $\{\bar{\mathbf{x}}_i, \mathbf{y}_i^{rgb}\}_{i=1}^{N_{pc}}$. We then optimize the MSE loss:

$$L(\theta) = \mathrm{MSELoss}\Bigl(\{f_\theta(\phi(\bar{\mathbf{x}}_i))\}_{i=1}^{N_{pc}},\ \{\mathbf{y}_i^{rgb}\}_{i=1}^{N_{pc}}\Bigr). \qquad (20)$$

After training, we can query the occupancy representation for occupied regions, and then predict the colour assigned to those coordinates via a forward pass of $f_\theta$.
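The segmentation and colour models of Equations 19 and 20 differ from the occupancy sketch above only in the output layer and the loss; a minimal illustration, where the encoding width and class count are placeholder assumptions:

```python
import torch.nn as nn

enc_dim = 48        # width of the positional encoding used by the occupancy sketch
n_classes = 9       # dataset-dependent, e.g. 8 objects plus background

# Segmentation (Eq. 19): multi-class classification over encoded coordinates.
# PyTorch's CrossEntropyLoss takes logits and applies the softmax internally.
seg_head = nn.Sequential(nn.Linear(enc_dim, 256), nn.ReLU(), nn.Linear(256, n_classes))
seg_loss = nn.CrossEntropyLoss()

# Continuous colour (Eq. 20): regress 3-dimensional RGB values with an MSE loss.
rgb_head = nn.Sequential(nn.Linear(enc_dim, 256), nn.ReLU(), nn.Linear(256, 3))
rgb_loss = nn.MSELoss()
```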

[Figure 3: Our experimental setup: a low-cost webcam rigidly mounted on a Unitree Z1 manipulator, and an example tabletop scene.]

V Empirical Evaluations

In this section, we evaluate the ability of the proposed Joint Calibration and Representation (JCR) method to calibrate the camera with respect to the manipulator's end-effector, as well as to build environment representations. We attach an inexpensive USB webcam (with an estimated retail cost of 10 USD), which captures low-resolution RGB images, to a Unitree Z1 6-degrees-of-freedom manipulator. We illustrate our robot setup in Figure 3. Compared to depth cameras, RGB cameras are smaller and lower in cost, making our vision-only setup an attractive option. The foundation model used within the joint calibration and representation framework is DUSt3R [3], using pre-trained weights for image inputs of width 512 pixels. The questions we seek to answer are:

  1. Can JCR, with foundation models, enable image-efficient hand-eye calibration when the number of provided images is low?

  2. Can we recover the scale accurately by solving the scale recovery problem (Equation 13), such that the sizes in our representation match the physical world?

  3. Can high-quality environment representations be built with JCR?

Percentage error in the heights of reconstructed objects:

                 Tape    Box     Mug     Toolbox
8 Images         7.8%    8.6%    1.2%    1.2%
10 Images        2.5%    2.9%    3.1%    0.7%

[Figures 5 and 6: Example images captured by the manipulator-mounted camera, along with visualizations of the colour and segmentation representations constructed by JCR for each environment.]

V-A Hand-Eye Calibration with JCR

Hand-eye calibration requires the determination of relative camera poses. Historically, this has been done via artificial external markers such as checkerboards or Apriltags [2], which are highly feature-rich and easy to identify. In the absence of such markers, Structure-from-Motion (SfM) methods, such as COLMAP [21], are typical alternative approaches to estimate relative camera poses. Here, we compare our calibration results against using COLMAP, instead of a 3D foundation model, to retrieve camera poses, with the rest of the calibration process remaining the same.

We take images in 3 different environments: two are tabletop scenes on a light-coloured table with two different sets of objects (8 and 7 items respectively), and the third is a scene on a dark table. We evaluate JCR with an increasing number of input images, checking whether the calibration converges and computing the residual of Equation 6, rearranged as

$$\delta T = T_{E_i}^{E_{i+1}} T_c^e - T_c^e\, T_{P_i}^{P_{i+1}}(\lambda), \qquad (21)$$

where lower residual values indicate a higher degree of consistency. We report the $L_2$ norm of the translation residual $\delta_{\mathbf{t}}$ and the Frobenius norm of the rotation residual $\delta_R$.
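A small sketch of how these residuals can be computed from Equation 21; whether the norms are averaged or summed over pose pairs is not specified in the text, so we average here as an assumption.

```python
import numpy as np

def calibration_residuals(T_E_rel, T_P_rel_scaled, T_ce):
    """Translation (L2) and rotation (Frobenius) residuals of Eq. (21), averaged over
    consecutive pose pairs; T_P_rel_scaled already includes the scale factor lambda."""
    d_t, d_R = [], []
    for T_E, T_P in zip(T_E_rel, T_P_rel_scaled):
        delta = T_E @ T_ce - T_ce @ T_P                # 4x4 residual matrix
        d_t.append(np.linalg.norm(delta[:3, 3]))
        d_R.append(np.linalg.norm(delta[:3, :3], ord="fro"))
    return float(np.mean(d_t)), float(np.mean(d_R))
```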

We compare against running hand-eye calibration on camera poses estimated from COLMAP, across three different scenes. COLMAP is a widely-used SfM software, and similar to DUSt3R, estimates the relative camera poses along with the environment structure. Here, we note that running COLMAP to obtain relative camera poses is the first step in constructing NeRF [18] models, and constructing NeRF models requires successful solutions from COLMAP. We are interested in investigating the behaviour of both methods when the number of images is low: we run the methods on image sets of sizes 10, 12 and 15. The sizes of these sets of images are much lower than the image datasets used to build NeRF models, which often exceed 100 images.

We tabulate the results in Table I. Because COLMAP relies on matching consistent hand-crafted features, many camera poses cannot be recovered when few images are provided, causing the calibration to diverge. JCR, on the other hand, leverages a foundation model to predict correspondences and consistently estimates relative camera poses, yielding convergent hand-eye calibration, as indicated by the small residuals. JCR is therefore more image-efficient and enables hand-eye calibration even when the number of images is low.

V-B Scale Recovery with JCR

Unlike traditional hand-eye calibration, JCR must not only solve for the hand-eye transformation but also recover a scale factor $\lambda$ to obtain real-world scale. Here, we run JCR on sets of 8 and 10 images of a tabletop with a roll of tape, a box, a mug, and a toolbox (shown in Figure 3(a)). We measure the heights of the objects and compare them against their heights in the reconstruction. The percentage height errors are given in Table I(a): even with very few images, the errors are small. In particular, with just 10 images, the percentage height error of every item is at most 3.1%, highlighting the accuracy of the recovered scale.
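The percentage height error itself is straightforward to compute; the sketch below illustrates it under the assumption that object heights are measured in metres and that reconstructed heights are read off after applying the recovered scale factor. The helper name and any values are illustrative, not the paper's measurements.

```python
import numpy as np

def percentage_height_errors(measured_heights, reconstructed_heights):
    """Per-object percentage error between measured heights and heights
    extracted from the scaled reconstruction."""
    measured = np.asarray(measured_heights, dtype=float)
    recon = np.asarray(reconstructed_heights, dtype=float)
    return 100.0 * np.abs(recon - measured) / measured

# Illustrative usage with placeholder values (not the paper's data):
# errors = percentage_height_errors([0.050, 0.120], [0.051, 0.118])
```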

V-C Constructing Representations with JCR

We construct representations of the three environments to capture occupancy and colour. Neural networks with one hidden layer of size 256 and ReLU activation functions were used as the continuous representations; each representation can be trained to convergence within 15 seconds on a standard laptop with an NVIDIA RTX 4090 GPU. We sample points at locations with high occupancy and visualize their predicted colours. We observe that JCR can construct accurate and dense representations from small sets of RGB images, without depth information. Example images taken by the manipulator-mounted camera and visualizations of our constructed representations are provided in Figure 5. Additionally, we provide segmentation labels for the 2D images of each environment and visualize the reconstructed segmented 3D representations in Figure 6.
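As a concrete illustration of the representation described above, the following is a minimal PyTorch sketch of a continuous occupancy-and-colour network with a single hidden layer of 256 units and ReLU activations, as reported; the output heads, output activations, and the absence of any positional encoding are our assumptions rather than details confirmed by the paper.

```python
import torch
import torch.nn as nn

class ContinuousSceneModel(nn.Module):
    """Maps a 3D query point in the robot's base frame to an occupancy value
    and an RGB colour. Network size follows the text (one hidden layer of 256
    units, ReLU); the heads and output activations are illustrative assumptions."""

    def __init__(self, hidden_dim: int = 256):
        super().__init__()
        self.backbone = nn.Sequential(nn.Linear(3, hidden_dim), nn.ReLU())
        self.occupancy_head = nn.Linear(hidden_dim, 1)  # occupancy in [0, 1] after sigmoid
        self.colour_head = nn.Linear(hidden_dim, 3)     # RGB in [0, 1] after sigmoid

    def forward(self, xyz: torch.Tensor):
        h = self.backbone(xyz)
        return torch.sigmoid(self.occupancy_head(h)), torch.sigmoid(self.colour_head(h))

# Querying the model on sampled points and keeping high-occupancy ones is one
# way to visualise the representation, as done for the figures above.
model = ContinuousSceneModel()
queries = torch.rand(4096, 3)                  # placeholder query points
occupancy, colour = model(queries)
dense = queries[occupancy.squeeze(-1) > 0.5]   # points treated as occupied
```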


As our JCR method leverages DUSt3R, which has been trained to find correspondences by predicting pointmaps, we can extract much denser representations than traditional SfM methods, which rely on matching visual features between pixels. In Figure 7, we overlay the point clouds produced by COLMAP, after its built-in densification, onto the points produced by the foundation model. We observe that COLMAP cannot produce dense point clouds over smooth surfaces such as the tabletop, which lacks distinctive features; it primarily identifies regions corresponding to highly identifiable, high-contrast edges, such as the text on the open book. The dense outputs of the foundation model enable us to calibrate the camera and map the environment jointly.
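For reference, an overlay of this kind can be reproduced with a few lines of Open3D, assuming the COLMAP densified output and the foundation-model pointmap output have been exported to point-cloud files; the file names below are placeholders.

```python
import open3d as o3d

# Load the two reconstructions (placeholder file paths).
colmap_cloud = o3d.io.read_point_cloud("colmap_dense.ply")
dust3r_cloud = o3d.io.read_point_cloud("foundation_model_points.ply")

# Colour the clouds differently and compare point counts as a rough density check.
colmap_cloud.paint_uniform_color([1.0, 0.2, 0.2])   # red: COLMAP
dust3r_cloud.paint_uniform_color([0.2, 0.4, 1.0])   # blue: foundation model
print(len(colmap_cloud.points), "COLMAP points vs",
      len(dust3r_cloud.points), "foundation-model points")

# Overlay both point clouds in a single viewer.
o3d.visualization.draw_geometries([dust3r_cloud, colmap_cloud])
```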

VI Conclusions and Future Work

The last few years have seen a rapid rise in the use of large pre-trained models, or foundation models, to facilitate a range of downstream tasks. In this paper, we advocate for using foundation models to construct environment representations from a small set of images taken by a manipulator-mounted RGB camera. In particular, we propose the Joint Calibration and Representation (JCR) method, which leverages foundation models to jointly calibrate the RGB camera with respect to the robot's end-effector and construct a map. JCR enables the accurate construction of 3D representations of the environment from RGB images, in the coordinate frame of the robot, without tedious a priori calibration of the camera against external markers. We demonstrate JCR's ability to calibrate and represent the environment in an image-efficient manner across several real-world environments. Future avenues of research include adapting JCR to dynamic environments and incorporating uncertainty information from the calibration into the constructed representations.

References

  • [1] R. Y. Tsai and R. K. Lenz, “A new technique for fully autonomous and efficient 3d robotics hand/eye calibration,” IEEE Trans. Robotics Autom., 1988.
  • [2] E. Olson, “Apriltag: A robust and flexible visual fiducial system,” in IEEE International Conference on Robotics and Automation, 2011.
  • [3] S. Wang, V. Leroy, Y. Cabon, B. Chidlovskii, and J. Revaud, “Dust3r: Geometric 3d vision made easy,” CoRR, 2023.
  • [4] R. Bommasani et al., “On the opportunities and risks of foundation models,” CoRR, 2021.
  • [5] A. Elfes, “Sonar-based real-world mapping and navigation,” IEEE Journal on Robotics and Automation, 1987.
  • [6] R. Malladi, J. A. Sethian, and B. C. Vemuri, “Shape modeling with front propagation: a level set approach,” IEEE Transactions on Pattern Analysis and Machine Intelligence, 1995.
  • [7] R. A. Newcombe, S. Izadi, O. Hilliges, D. Molyneaux, D. Kim, A. J. Davison, P. Kohli, J. Shotton, S. Hodges, and A. Fitzgibbon, “Kinectfusion: Real-time dense surface mapping and tracking,” in 2011 10th IEEE International Symposium on Mixed and Augmented Reality, 2011.
  • [8] S. O’Callaghan, F. T. Ramos, and H. Durrant-Whyte, “Contextual occupancy maps using gaussian processes,” in 2009 IEEE International Conference on Robotics and Automation, 2009.
  • [9] F. Ramos and L. Ott, “Hilbert maps: Scalable continuous occupancy mapping with stochastic gradient descent,” International Journal of Robotics Research, 2016.
  • [10] W. Zhi, R. Senanayake, L. Ott, and F. Ramos, “Spatiotemporal learning of directional uncertainty in urban environments with kernel recurrent mixture density networks,” IEEE Robotics and Automation Letters, 2019.
  • [11] W. Zhi, L. Ott, R. Senanayake, and F. Ramos, “Continuous occupancy map fusion with fast bayesian hilbert maps,” in International Conference on Robotics and Automation (ICRA), 2019.
  • [12] R. Senanayake and F. Ramos, “Bayesian hilbert maps for dynamic continuous occupancy mapping,” in Conference on Robot Learning (CoRL), 2017.
  • [13] H. Wright, W. Zhi, M. Johnson-Roberson, and T. Hermans, “V-prism: Probabilistic mapping of unknown tabletop scenes,” arXiv, 2024.
  • [14] J. J. Park, P. Florence, J. Straub, R. Newcombe, and S. Lovegrove, “Deepsdf: Learning continuous signed distance functions for shape representation,” in The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2019.
  • [15] R. Q. Charles, H. Su, M. Kaichun, and L. J. Guibas, “Pointnet: Deep learning on point sets for 3d classification and segmentation,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.
  • [16] C. R. Qi, L. Yi, H. Su, and L. J. Guibas, “Pointnet++: deep hierarchical feature learning on point sets in a metric space,” in Proceedings of the 31st International Conference on Neural Information Processing Systems, 2017.
  • [17] W. Zhi, I. Akinola, K. van Wyk, N. Ratliff, and F. Ramos, “Global and reactive motion generation with geometric fabric command sequences,” in IEEE International Conference on Robotics and Automation (ICRA), 2023.
  • [18] B. Mildenhall, P. P. Srinivasan, M. Tancik, J. T. Barron, R. Ramamoorthi, and R. Ng, “Nerf: Representing scenes as neural radiance fields for view synthesis,” in ECCV, 2020.
  • [19] T. Müller, A. Evans, C. Schied, and A. Keller, “Instant neural graphics primitives with a multiresolution hash encoding,” ACM Trans. Graph., 2022.
  • [20] T. Zhang, K. Huang, W. Zhi, and M. Johnson-Roberson, “Darkgs: Learning neural illumination and 3d gaussians relighting for robotic exploration in the dark,” CoRR, 2024.
  • [21] J. L. Schönberger and J.-M. Frahm, “Structure-from-motion revisited,” in Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
  • [22] R. Horaud and F. Dornaika, “Hand-eye calibration,” International Journal of Robotics Research, 1995.
  • [23] F. Park and B. Martin, “Robot Sensor Calibration: Solving AX = XB on the Euclidean Group,” IEEE Transactions on Robotics and Automation, 1994.
  • [24] E. Valassakis, K. Dreczkowski, and E. Johns, “Learning eye-in-hand camera calibration from a single image,” in Proceedings of the 5th Conference on Robot Learning, 2022.
  • [25] L. Chen, K. Lu, A. Rajeswaran, K. Lee, A. Grover, M. Laskin, P. Abbeel, A. Srinivas, and I. Mordatch, “Decision transformer: Reinforcement learning via sequence modeling,” in Advances in Neural Information Processing Systems, 2021.
  • [26] P. Florence, C. Lynch, A. Zeng, O. Ramirez, A. Wahid, L. Downs, A. Wong, J. Lee, I. Mordatch, and J. Tompson, “Implicit behavioral cloning,” in Conference on Robot Learning (CoRL), 2021.
  • [27] S. M. LaValle, “Rapidly-exploring random trees: A new tool for path planning,”
  • [28] T. Lai, W. Zhi, T. Hermans, and F. Ramos, “Parallelised diffeomorphic sampling-based motion planning,” in Conference on Robot Learning (CoRL), 2021.
  • [29] W. Zhi, T. Zhang, and M. Johnson-Roberson, “Instructing robots by sketching: Learning from demonstration via probabilistic diagrammatic teaching,” in IEEE International Conference on Robotics and Automation, 2024.
  • [30] H. Touvron, T. Lavril, G. Izacard, X. Martinet, M.-A. Lachaux, T. Lacroix, B. Rozière, N. Goyal, E. Hambro, F. Azhar, A. Rodriguez, A. Joulin, E. Grave, and G. Lample, “Llama: Open and efficient foundation language models,” CoRR, 2023.
  • [31] A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark, G. Krueger, and I. Sutskever, “Learning transferable visual models from natural language supervision,” CoRR, 2021.
  • [32] L. Yang, B. Kang, Z. Huang, X. Xu, J. Feng, and H. Zhao, “Depth anything: Unleashing the power of large-scale unlabeled data,” in CVPR, 2024.
  • [33] R. Firoozi, J. Tucker, S. Tian, A. Majumdar, J. Sun, W. Liu, Y. Zhu, S. Song, A. Kapoor, K. Hausman, B. Ichter, D. Driess, J. Wu, C. Lu, and M. Schwager, “Foundation models in robotics: Applications, challenges, and the future,” CoRR, 2023.
  • [34] Y. Furukawa and C. Hernández, “Multi-view stereo: A tutorial,” 2015.
  • [35] M. Gutmann and A. Hyvärinen, “Noise-contrastive estimation: A new estimation principle for unnormalized statistical models,” in Conference on Artificial Intelligence and Statistics, 2010.
  • [36] C. M. Bishop, Pattern Recognition and Machine Learning, 5th Edition. Information Science and Statistics, Springer, 2007.
  • [37] A. Kirillov, E. Mintun, N. Ravi, H. Mao, C. Rolland, L. Gustafson, T. Xiao, S. Whitehead, A. C. Berg, W.-Y. Lo, P. Dollár, and R. Girshick, “Segment anything,” arXiv, 2023.