Pose Bowl: Pose Estimation Track

Develop algorithms for use on inspector spacecraft that take and process photos of other ships. In the Pose Estimation Track, solutions will identify the position and orientation of the spacecraft's camera across sequences of images.

$28,000 in prizes

In this challenge, you will help to develop new methods for conducting spacecraft inspections by identifying sequential changes in the position and orientation of a chaser spacecraft camera across a series of images.

The challenge consists of a spacecraft detection track and a pose estimation track. In the spacecraft detection track, your task will be to draw bounding boxes around the spacecraft in an image. In the pose estimation track (that's this one!), you will identify changes in the position and orientation of the chaser spacecraft's camera across a sequence of images.

Background


Spacecraft inspection in the context of this challenge is the process of closely examining a spacecraft in orbit to assess its condition and functionality. Currently, the most common methods for spacecraft inspection either rely on astronauts, which can be time-consuming and dangerous, or use expensive, heavy equipment such as LIDAR sensors and robotic arms. Instead, the methods being explored in this challenge involve a small inspector vehicle (or "chaser") that is deployed from a host spacecraft in order to inspect that same host spacecraft. Compared with existing methods, the inspector vehicle is relatively inexpensive, making use of affordable, lightweight cameras and off-the-shelf hardware.

This challenge is specifically geared towards addressing two of the main operational hurdles in spacecraft inspection. First, the images in the challenge dataset require that successful solutions work on a variety of different, unknown, and potentially damaged spacecraft types. Second, solutions must run in our code execution platform in an environment that simulates the small, off-the-shelf computer board on a NASA R5 spacecraft that demonstrates inspection technologies.

Data


Images

The data for this challenge consists of simulated images created with the open-source 3D software Blender, using models of representative host spacecraft. The images are of a target spacecraft taken from a nearby location in space, as if by a chaser spacecraft that is moving around the target in a random walk. In some image chains, the light source is fixed relative to the target spacecraft, while in others, the light source moves with the chaser.

Distortions were applied to some images in post-processing to realistically simulate image imperfections that result from camera defects or field conditions. The distortions applied to the images include blur (e.g., from the motion of the chaser vehicle's camera), hot pixels (a common defect in which some pixels have an intensity value of 0 or 255), and random noise (a typical distortion applied to encourage generalizability and robustness).

For the pose estimation track, you will be working with sequences of images ("chains") showing a spacecraft moving and rotating in space. Each image chain consists of up to 180 individual images. In the images, the chaser and target will be a maximum of 600 meters apart and a minimum of 20 meters apart. Between any two sequential images, the maximum rotation of the chaser camera is 30 degrees and the maximum movement in or out is 1 meter. You are encouraged to use these constraints in your solutions.
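For example, the sketch below (an optional helper, not part of any provided starter code) shows one way to check whether a pair of consecutive pose estimates respects these constraints, under the assumption that "movement in or out" refers to the change in chaser-target range. It uses SciPy; note that SciPy expects quaternions in scalar-last (qx, qy, qz, qw) order, whereas this challenge uses scalar-first (qw, qx, qy, qz).

import numpy as np
from scipy.spatial.transform import Rotation as R

MAX_STEP_DEG = 30.0      # maximum camera rotation between sequential images
MAX_STEP_RANGE_M = 1.0   # maximum movement in or out between sequential images (assumed to mean range change)

def step_is_plausible(q_prev, q_curr, range_prev_m, range_curr_m):
    """q_prev and q_curr are (qw, qx, qy, qz); ranges are chaser-target distances in meters."""
    # Reorder to SciPy's scalar-last convention before building the rotations.
    r_prev = R.from_quat([q_prev[1], q_prev[2], q_prev[3], q_prev[0]])
    r_curr = R.from_quat([q_curr[1], q_curr[2], q_curr[3], q_curr[0]])
    step_deg = np.degrees((r_prev.inv() * r_curr).magnitude())  # relative rotation angle
    range_change_m = abs(range_curr_m - range_prev_m)
    return step_deg <= MAX_STEP_DEG and range_change_m <= MAX_STEP_RANGE_M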

The dynamics of the spacecraft movement are generated from a non-Gaussian biased random walk. This is designed so that explicit Gaussian-based dynamics estimation will be unproductive, and you are discouraged from modeling the dynamics explicitly.

Below is an example of a typical chain.

Labels

Your task on this track will be to output the transformation vector representing the relative pose of the chaser spacecraft camera for each of the images in each chain. The first image in each chain is the reference image, representing the initial position and orientation of the chaser spacecraft. For each subsequent image, you are to calculate the transformation required to get back to that reference image. That is, for each image, supply the transformation (rotation first, then translation) that we would need to apply to the chaser so that the target spacecraft (treated as stationary) appears exactly as it does in the reference image.

The transformation vector you will be providing consists of seven values (x, y, z, qw, qx, qy, qz), corresponding to the following components of pose:

  • x, y, z are the translational components that describe changes in the position of the chaser spacecraft in three dimensions (i.e., forward-backward, left-right, up-down). These values represent how far and in which direction we would need to move the chaser spacecraft to get it back to its base location as given by the reference frame (the first image in the chain).
  • qw, qx, qy, qz are the rotational components that describe changes in the orientation of the chaser spacecraft in three-dimensional space (e.g., roll, pitch, yaw). These values represent how we would need to rotate the chaser spacecraft to return it to its base orientation as given by the reference frame (the first image in the chain). There are different ways to describe rotation. Here, what we are asking for is called the right-handed passive quaternion rotation with Hamiltonian composition.
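As a concrete illustration, here is a minimal sketch of packing and unpacking this seven-value vector with SciPy. The helper names are made up for this example; the one thing to keep straight is quaternion ordering: the challenge uses scalar-first (qw, qx, qy, qz), while SciPy's Rotation uses scalar-last (qx, qy, qz, qw).

import numpy as np
from scipy.spatial.transform import Rotation as R

def pose_from_row(x, y, z, qw, qx, qy, qz):
    """Split one label row into a translation vector and a SciPy Rotation."""
    translation = np.array([x, y, z])
    rotation = R.from_quat([qx, qy, qz, qw])  # reorder to SciPy's scalar-last convention
    return translation, rotation

def row_from_pose(translation, rotation):
    """Pack a translation and Rotation back into the (x, y, z, qw, qx, qy, qz) ordering."""
    qx, qy, qz, qw = rotation.as_quat()  # SciPy returns scalar-last
    x, y, z = translation
    return [x, y, z, qw, qx, qy, qz]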

Don't worry if you're new to this type of problem! We have additional background information and guidance in the further reading section to help get you started.

Ground truth

The ground truth data (transformation vectors) for the public set is available in train_labels.csv on the Data Download page. This CSV file contains a row for each image in each chain of the public set, along with the corresponding transformation vector describing how the chaser spacecraft's camera has moved and rotated since the first image in the chain, which we call the "reference image". Here's what the first few rows of the ground truth look like:

chain_id,i,x,y,z,qw,qx,qy,qz
0009b5e13d,0,0.0,0.0,0.0,1.0,0.0,0.0,0.0
0009b5e13d,1,1.48327637,-22.29007339,10.90795231,0.99660528,-0.07391381,0.00920438,0.03506964
0009b5e13d,2,3.79498291,-48.67992783,0.83620596,0.9859612,-0.15591848,-0.01345022,0.0582153
0009b5e13d,3,19.06918335,-35.53461456,-110.05810547,0.97497016,-0.16594024,-0.14679143,0.01868875
0009b5e13d,4,41.83984375,-81.447258,-151.54153442,0.91200227,-0.34401971,-0.22160748,0.02815061
...

The chain_id will be the same for all images in a single chain, and the i column indicates each image's position in the chain. So the first image in a chain will have i of 0, the second image will have i of 1, etc.

Note that the reference image in a chain always has transformation values of (0.0, 0.0, 0.0) for (x, y, z) and (1.0, 0.0, 0.0, 0.0) for (qw, qx, qy, qz) because it serves as the reference point, or baseline, for all subsequent images in the sequence. When comparing any subsequent image in a chain to the reference image, the reference image's chaser position (the camera position) can be assumed to be at the origin, with position values of (x, y, z) = (0, 0, 0), and to have no rotation, with quaternion values of (qw, qx, qy, qz) = (1, 0, 0, 0). You may need to transform your solution outputs to align with this format.
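The following sketch (pandas assumed) shows one way to load the labels, group them into chains, and confirm the identity-pose convention for each chain's reference image described above.

import numpy as np
import pandas as pd

labels = pd.read_csv("train_labels.csv")

for chain_id, chain in labels.groupby("chain_id"):
    chain = chain.sort_values("i")
    reference = chain.iloc[0]
    # The reference image (i == 0) carries the identity transformation.
    assert reference["i"] == 0
    assert np.allclose(reference[["x", "y", "z"]].to_numpy(dtype=float), 0.0)
    assert np.allclose(reference[["qw", "qx", "qy", "qz"]].to_numpy(dtype=float), [1.0, 0.0, 0.0, 0.0])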

Camera intrinsic parameters

Camera intrinsic parameters are the internal characteristics of a camera that define how the camera projects a 3D scene onto a 2D image. They can help to correct for distortions and to infer the scale and depth of objects. For this challenge's dataset, the intrinsic matrix is:

[[5.2125371e+03 0.0000000e+00 6.4000000e+02]
[0.0000000e+00 6.2550444e+03 5.1200000e+02]
[0.0000000e+00 0.0000000e+00 1.0000000e+00]]
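As a small usage sketch, the intrinsic matrix can be used to back-project a pixel coordinate into a ray direction in the camera frame; the pixel value used below is illustrative only.

import numpy as np

K = np.array([
    [5.2125371e+03, 0.0000000e+00, 6.4000000e+02],
    [0.0000000e+00, 6.2550444e+03, 5.1200000e+02],
    [0.0000000e+00, 0.0000000e+00, 1.0000000e+00],
])

def pixel_to_ray(u, v):
    """Return the unit-length ray through pixel (u, v), expressed in the camera frame."""
    ray = np.linalg.inv(K) @ np.array([u, v, 1.0])
    return ray / np.linalg.norm(ray)

print(pixel_to_ray(640.0, 512.0))  # the principal point maps onto the optical axis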

Laser range finder data

For the pose estimation track, a supplementary "laser range finder" dataset is also provided on the Data Download page. This dataset shows the distance in meters from the chaser vehicle to the host spacecraft, simulating the reading one would get from a laser range finder situated on the chaser vehicle and pointed at the center of the host spacecraft.

Range data is provided for every individual image in the dataset, but in keeping with real-world conditions, these measurements have some noise and there may be missing values.

You are not required to use the laser range finder data in your modeling, but it will be available and may prove helpful.
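If you do use it, a sketch like the one below could handle the noise and missing values. The file name and column names here are assumptions for illustration only; check the Data Download page for the actual format.

import pandas as pd

# Assumed columns: chain_id, i, range (meters); see the Data Download page for the real schema.
ranges = pd.read_csv("range.csv")

ranges = ranges.sort_values(["chain_id", "i"])
# Fill missing readings by interpolating within each chain, since sequential
# frames differ by at most about a meter of range.
ranges["range_filled"] = (
    ranges.groupby("chain_id")["range"]
    .transform(lambda s: s.interpolate(limit_direction="both"))
)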

Public vs private datasets

Data generated for this challenge have been divided into different datasets, some of which you will have direct access to for the purpose of training your solution.

Public training data consisting of images and ground truth labels will be made available for all solvers to download. Training data labels and instructions for downloading the training data images are available on the Data Download page.

Private test data for evaluating your solution's performance will be mounted to the code execution runtime. It will not be directly available to challenge solvers. Performance will be evaluated in two ways, using different splits or subsets of this private data. One subset of the private data will be used to calculate your public leaderboard score, and another will be used to calculate a private leaderboard score. The private leaderboard score will determine final prize rankings.

The public training data and private data used for evaluation are identical in format and were all drawn from the same data generation and post-processing pipeline. However, their contents are slightly different: the training data and private datasets each contain some unique spacecraft models. Many spacecraft models are represented in all datasets, some are only present in the training data, others are only present in the test set used to calculate public leaderboard scores, and still others are only present in the test set used to calculate private leaderboard scores. Importantly, the two private test datasets contain similar numbers of unique spacecraft models, and there is nothing notable about the unique models that appear in the test sets but not the training set. The two test datasets were designed to resemble the training dataset equally.

Note: As a result of the above process for determining your public and private leaderboard scores, there is no benefit to "overfitting" the public leaderboard. Using sound ML practices to produce robust solutions that generalize well to new feature conditions is the best way to ensure that your private score is comparable to your public score.

Accessing the data

Check out the Data Download page to download training labels, a submission format, and instructions on downloading the training data on AWS S3. The training dataset is quite large and is made available in smaller (several GB) chunks, each containing a random subset of images. You may want to initially develop your solution with a subset of data, and then incorporate more training data after establishing your approach.

Performance metric


The performance metric for this challenge is a pose error score that is the sum of the normalized translational and rotational components of the absolute pose error of your pose estimates:

$$ E = E_{\text{trans}} + E_{\text{rot}} $$

The translational error for one image is the magnitude of the difference between your estimated translation and the ground truth translation, normalized by the magnitude of the ground truth translation:

$$ E_{\text{trans}} = \frac{\lVert P_{\text{est}} \ominus P_{\text{gt}} \rVert}{\lVert P_{\text{gt}} \rVert} = \frac{\lVert \vec{t}_{\text{est}} - \vec{t}_{\text{gt}} \rVert}{\lVert \vec{t}_{\text{gt}} \rVert} $$

where \(P\) represents the pose, \(\ominus\) is the inverse compositional operator, the \(\text{est}\) subscript indicates your estimate, the \(\text{gt}\) indicates the ground truth, and \(\lVert \cdot \rVert\) indicates magnitude. The final expression in the equation is an equivalent representation in terms of position vectors: \(\vec{t}\) is the vector representation of the translation component of the pose.

The rotational error for one image is the angle difference between your estimated rotation and the ground truth rotation, normalized by the angle of the ground truth rotation:

$$ E_{\text{rot}} = \frac{ \angle \left [ P_{\text{est}} \ominus P_{\text{gt}} \right ]}{\angle \left [ P_{\text{gt}} \right ]} = \frac{ \angle \left [ \mathbf{R}_{\text{gt}}^{-1} \mathbf{R}_{\text{est}} \right ]}{\angle \left [ \mathbf{R}_{\text{gt}} \right ]} = \frac{ \angle \left [ \mathbf{q}_{\text{gt}}^{*} \mathbf{q}_{\text{est}} \right ]}{\angle \left [ \mathbf{q}_{\text{gt}} \right ]} $$

where \(P\), \(\ominus\), the \(\text{est}\) subscript, and the \(\text{gt}\) subscript mean the same as above, and \(\angle[\cdot]\) is the angle. The final two expressions are equivalent representations in terms of rotation matrices \(\mathbf{R}\) or quaternions \(\mathbf{q}\).

Your score will be the micro-average of the per-image pose error scores over all non-reference images in the test set.
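For reference, here is a minimal sketch of this metric for a single non-reference image, written directly from the formulas above. Quaternions are taken in the challenge's scalar-first (qw, qx, qy, qz) order and reordered for SciPy, which uses scalar-last.

import numpy as np
from scipy.spatial.transform import Rotation as R

def pose_error(t_est, q_est, t_gt, q_gt):
    """Sum of the normalized translational and rotational errors for one image."""
    t_est, t_gt = np.asarray(t_est, dtype=float), np.asarray(t_gt, dtype=float)
    e_trans = np.linalg.norm(t_est - t_gt) / np.linalg.norm(t_gt)

    r_est = R.from_quat([q_est[1], q_est[2], q_est[3], q_est[0]])
    r_gt = R.from_quat([q_gt[1], q_gt[2], q_gt[3], q_gt[0]])
    # Angle of the relative rotation, normalized by the angle of the ground truth rotation.
    e_rot = (r_gt.inv() * r_est).magnitude() / r_gt.magnitude()

    return e_trans + e_rot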

Submission format


This is a code execution challenge, which means that you'll be submitting a script that will run inference on an unseen test set in our code execution platform. This setup allows us to simulate the computational constraints a cubesat would actually have in orbit.

If you've never participated in a code execution challenge before, don't be intimidated! We make it as easy as possible for you.

See the code submission format page for details on how to make your submission.

Runtime constraints


Our sponsors at NASA hope to adapt the winning solutions for use on future flights, which means your code could one day be launched into orbit. With that goal in mind, the code execution platform is configured to simulate the constraints of the ultimate production environment... SPACE! 🚀

This means that the resources available to you in running your solution will be comparable to the off-the-shelf hardware available to an actual R5 spacecraft, which may be somewhat less powerful than you're used to, especially if you typically use a GPU for these types of tasks.

Your submissions will be run on an A4 v2 virtual machine with an Intel processor and the following constraints:

  • No GPU
  • Limited to 3 cores
  • Limited to 4 GB of RAM
  • Your submission must complete execution in 2.5 hours or less.
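When developing locally, you may want to mimic these limits so your timing estimates carry over. The sketch below is one optional way to cap the CPU threads used by common libraries; only include the lines for libraries you actually use.

import os

os.environ["OMP_NUM_THREADS"] = "3"  # cap OpenMP-backed libraries (set before importing them)

import cv2
cv2.setNumThreads(3)                 # cap OpenCV's internal parallelism

import torch
torch.set_num_threads(3)             # cap PyTorch intra-op CPU threads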

Additional rules

In order to ensure that solutions are useful for our competition sponsors, the following rules are also in place:

  • Your solution must make pose estimates based on the visual content of the images (i.e., the pixel data). You must not use any image data or metadata other than what is provided to make predictions.
  • Images must be processed one at a time. Parallelization of multiple images is not permitted.
  • For the pose estimation track, your relative pose prediction for each image in the chain may only use the given image and the images that precede it. Your model is never permitted to "look ahead" in the image chain.
  • Your solution must complete execution within a time limit. For the pose estimation track, your time is limited to 150 minutes (2.5 hours), which allows for about 5 seconds per test image, plus a buffer for overhead and image loading.

All submissions must complete execution within the maximum allowed time, but the most useful solutions will run even faster. For this reason, top performing solutions are eligible for a speed bonus prize.

Note: Finalist solutions that don't comply with these rules may be disqualified. Please ask questions on the discussion forum if anything is unclear.

Further reading


Pose estimation

Pose estimation is the process of determining the position and orientation of an object in the world using visual information. In the technical literature, you may see references to "camera pose estimation" or "object pose estimation", which are distinct but related concepts. In camera pose estimation, also known as visual localization, the task is to identify the position and orientation of a camera, for example one that may be moving through a 3D environment. In object pose estimation, the task is to identify the position and orientation of a target object, typically within a camera's field of view.

This challenge relates primarily to camera pose estimation, where in this case you are estimating the position and orientation of the chaser spacecraft bearing the camera relative to its initial state given by the reference image.

A variety of techniques may be useful in developing your approach, many of them informed by commercial work in robotics, self-driving cars, and virtual reality, all of which depend on pose estimation. Here are some of the key terms and techniques you may come across when reviewing the literature.

  • SLAM or simultaneous localization and mapping is a technique to build a map of an unknown environment while simultaneously keeping track of a vehicle's location within that environment. The pose estimation task in this challenge is typically characterized as a SLAM problem. However, while many solutions in this space rely on LiDAR and other more expensive sensors, the goal of this challenge is to solve the problem using input from a single camera and a laser range finder.
  • Structure or feature-based pose estimation: This family of techniques involves detecting and tracking specific features or structures on the target object. For example, these could be distinct markers or shapes on the host spacecraft that the chaser's camera identifies. By tracking how these features move and change in the camera's field of view, the system can infer the relative pose of the target object. Often these techniques are geared towards a specific spacecraft with known geometry. Note that a key goal of this challenge is to develop solutions that can handle a variety of target spacecraft whose geometry is not known beforehand. (A minimal sketch of this approach appears after this list.)
  • Regression-based pose estimation: This family of techniques uses deep learning models to directly predict pose from an input image. Neural networks are trained on large datasets of images and corresponding pose data. Once trained, these models can infer the pose of a target object in new images. This approach requires less manual feature engineering and can often handle a wide range of poses and environmental conditions.
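As mentioned in the feature-based bullet above, here is a minimal OpenCV sketch of that approach: match ORB features between two frames and recover the relative camera rotation from the essential matrix, using the intrinsic matrix from the Data section. This is only an illustrative starting point; note that the recovered translation is only defined up to scale, which is one place the laser range finder data may help.

import cv2
import numpy as np

K = np.array([
    [5.2125371e+03, 0.0, 6.4000000e+02],
    [0.0, 6.2550444e+03, 5.1200000e+02],
    [0.0, 0.0, 1.0],
])

def relative_pose(img_a_gray, img_b_gray):
    """Estimate the relative rotation R and unit-scale translation t between two grayscale frames."""
    orb = cv2.ORB_create(nfeatures=2000)
    kp_a, des_a = orb.detectAndCompute(img_a_gray, None)
    kp_b, des_b = orb.detectAndCompute(img_b_gray, None)

    matcher = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True)
    matches = matcher.match(des_a, des_b)

    pts_a = np.float32([kp_a[m.queryIdx].pt for m in matches])
    pts_b = np.float32([kp_b[m.trainIdx].pt for m in matches])

    E, inliers = cv2.findEssentialMat(pts_a, pts_b, K, method=cv2.RANSAC, threshold=1.0)
    _, R, t, _ = cv2.recoverPose(E, pts_a, pts_b, K, mask=inliers)
    return R, t  # t has unit norm; absolute scale must be estimated separately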

Rotations

As noted above, in this challenge we ask for the right-handed passive quaternion rotation with Hamiltonian composition. Here, we provide more explanation of that term, and further references for understanding rotations.

Intuitive definition

An intuitive way to think about the rotation described in this challenge is that we are asking you to provide the rotation that will get the current orientation of the chaser spacecraft back to the base orientation given by the reference image. In Blender, if you apply that rotation to the current orientation of the chaser spacecraft, and then apply the translation to move it back to its original location, you should end up with a rendering of the reference image.

The 2D representation below illustrates both a change in relative pose and an application of the kind of transformation we ask for in this challenge. Box 1 is the reference frame that establishes the base orientation. The rectangle is like the target spacecraft and the triangle is like the chaser spacecraft. Box 2 shows a change in both the position and orientation of the triangle. Box 3 shows the application of a rotation to the triangle, the result of which is shown in the changed orientation of the triangle in Box 4. Box 4 shows the application of a translation to the triangle, the result of which would be to return the triangle to its original state (shown as an outline).

If the triangle and rectangle in the diagram contributed data to this challenge, the reference image would be a photo of the rectangle (target spacecraft) taken by a camera on the triangle (chaser spacecraft) in Box 1. A photo of the rectangle from the triangle's camera in Box 2 would be used to compute relative pose and estimate a transformation vector with respect to the reference image.

Quaternions

There are different ways to represent rotations. In this challenge, we ask you to provide rotations as quaternions. Quaternions are a 4D representation of rotation, used as a replacement for Euler angles in order to avoid gimbal lock and ensure well-defined rotations. See Zanetti 2019 for more on the use of quaternions.

Quaternion handedness

In this competition, we ask you to provide right-handed quaternions. The "handedness" of a quaternion is a convention that results from the direction in which a quaternion is applied when rotating. The choice of handedness is arbitrary, but defining it explicitly ensures that the format of submitted quaternions matches the ground truth labels. Left-handed quaternions will simply have the qw component of the quaternion negated. See the section titled "Shuster's Convention" in Zanetti 2019 for more on quaternion handedness.
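In code form, that note amounts to a one-line conversion (shown here as a hypothetical helper):

def to_right_handed(qw, qx, qy, qz):
    """Convert a left-handed quaternion to the requested right-handed convention by negating qw."""
    return -qw, qx, qy, qz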

Active vs. Passive rotation

In this competition, we ask you to provide passive rotation. "Active" vs. "passive" rotation refers to the composition rule used when applying a rotation. It mostly matters for compositions of rotations, and describes whether the observer stays fixed in space while the body being transformed rotates (active), or the observer moves with the spacecraft and therefore sees the reference frame rotating rather than the spacecraft (passive). In the context of a single rotation (as in this competition), active and passive rotations are inversely related. See Zanetti 2019 for more on active vs. passive rotation.
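Since the two conventions are inverses for a single rotation, switching between them amounts to taking the quaternion inverse, which for a unit quaternion is its conjugate (a small hypothetical helper):

def quaternion_conjugate(qw, qx, qy, qz):
    """Return the conjugate (inverse for a unit quaternion) by negating the vector part."""
    return qw, -qx, -qy, -qz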


Good luck! If you have any questions or issues you can always head over to the user forum!