Apollo Synthetic is a photo-realistic synthetic dataset for autonomous driving. It contains 273k distinct images (not continuous frames from a video) from various virtual scenes of high visual fidelity, including highway, urban, residential, downtown, and indoor parking garage environments. These virtual worlds were created with the Unity 3D engine. The biggest advantage of a synthetic dataset is the precise ground truth it provides. Another benefit is greater environmental variation (relatively harder and costlier to achieve in the real world), such as different times of day, weather conditions, traffic and obstacles, and road surface qualities. Our dataset provides an extensive set of ground truth data: 2D/3D object data, semantic/instance-level segmentation, depth, and 3D lane line data.
| Dataset | Year | Synthetic? | # labeled frames | Resolution | Diversity | Ground truth | Supported sensors |
|---|---|---|---|---|---|---|---|
| VKITTI | 2016 | Yes | 21k | 1242x375 | 5 urban scenes under different imaging and weather conditions | 2D/3D box, semantic/instance-level segmentation, optical flow, depth | Camera |
| Synthia | 2016 | Yes | 213k | 1280x760 | Urban / highway / green area scenes under different times of day / weather conditions / seasons | Semantic segmentation, depth | Camera |
| FCAV | 2017 | Yes | 200k | 1914x1052 | Diverse scenes from GTA 5 under different times of day / weather conditions | 2D box, segmentation | Camera |
| Playing for Benchmarks | 2017 | Yes | 250k | 1920x1080 | Diverse scenes from GTA 5 under different times of day / weather conditions | 2D/3D box, semantic/instance-level segmentation, optical flow | Camera |
| ApolloScape | 2018 | No | 144k | 3384x2710 | 4 regions in China under different times of day / weather conditions | Semantic/instance-level segmentation, depth, 3D semantic point cloud | Camera, Lidar |
| BDD100k | 2018 | No | 100k | 1280x720 | 4 regions in US under different times of day / weather conditions | 2D box, semantic/instance-level segmentation | Camera |
| nuScenes | 2019 | No | 40k | 1600x900 | Boston / Singapore under different times of day / weather conditions | 3D box with semantic | Camera, Lidar, Radar |
| Apollo Synthetic | 2019 | Yes | 273k | 1920x1080 | Highway / urban / residential scenes under different times of day / weather conditions / road qualities and an indoor parking garage scene | 2D/3D box, semantic/instance-level segmentation, depth, 3D lane line | Camera |
JPG images in HD resolution (1920x1080)
HD resolution png with a color encoding text file
The png files contain semantic and instance-level segmentation per pixel.
An encoding text file (per variation) contains one line per category formatted like ‘<category>[:<tid>] <R> <G> <B>’, where ‘<category>’ is the name of the semantic category of that pixel, ‘<tid>’ is the (optional) integer track identifier to differentiate between instances of the same category (only vehicles and pedestrians have the instance distinction), and ‘<R> <G> <B>’ is the color encoding of that label in the corresponding ground truth images.
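A hypothetical one-line parser for this format (the function name and sample labels are illustrative, not taken from the dataset's actual tooling or category list):

```python
def parse_encoding_line(line):
    """Parse one line of the color-encoding text file.

    Expected format: '<category>[:<tid>] <R> <G> <B>'.
    Returns (category, tid_or_None, (r, g, b)).
    """
    label, r, g, b = line.split()
    if ":" in label:
        category, tid_str = label.split(":")
        tid = int(tid_str)  # instance track id (vehicles / pedestrians only)
    else:
        category, tid = label, None  # category-level label, no instance id
    return category, tid, (int(r), int(g), int(b))

print(parse_encoding_line("car:3 64 0 128"))    # instance-level entry
print(parse_encoding_line("road 128 64 128"))   # category-only entry
```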
Supported semantic categories (bold ones have instance distinction)
HD resolution png whose R / G channels contain 16-bit depth info.
Depth values are distances to the camera plane obtained from the z-buffer (https://en.wikipedia.org/wiki/Z-buffering). They correspond to the z coordinate of each pixel in camera coordinate space (not the distance to the camera's optical center). We assume a fixed far plane of 655.35 meters, i.e. points at infinity such as sky pixels are clipped to a depth of 655.35 m. This allows us to truncate and normalize the z values to the [0; 2^16 - 1] integer range, giving 1 cm precision. This 16-bit value is encoded into the red and green channels of a png file. You can decode the depth with (R + G / 255.0) * 65536.0, where R and G are the normalized float values ([0.0; 1.0]) of the pixel's red and green channels.
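A minimal decoding sketch in Python, applying the formula above (reading the png itself is omitted; the function assumes the channel values are already normalized to [0.0, 1.0], and it assumes the decoded value is in centimeters given the 1 cm precision noted above):

```python
def decode_depth_m(r_norm, g_norm):
    """Decode depth from normalized red/green channel values.

    Uses the documented formula (R + G / 255.0) * 65536.0; the result
    is assumed to be in centimeters (1 cm precision), so divide by 100
    to get meters.
    """
    depth_cm = (r_norm + g_norm / 255.0) * 65536.0
    return depth_cm / 100.0

print(decode_depth_m(0.5, 0.0))  # 327.68
```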
(Example depth visualizations: red channel only; green channel only.)
The images are rendered using the following camera intrinsics:

- Resolution: 1920x1080
- Vertical FOV: 30°
- K = [[2015.0, 0, 960.0], [0, 2015.0, 540.0], [0, 0, 1]]
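Given these intrinsics, mapping a camera-space point to pixel coordinates is a standard pinhole projection. A minimal sketch (the function name is ours, not part of the dataset tooling):

```python
def project(x, y, z, fx=2015.0, fy=2015.0, cx=960.0, cy=540.0):
    """Pinhole projection of a camera-space point (x, y, z), z > 0,
    to pixel coordinates (u, v) using the intrinsics above."""
    u = fx * x / z + cx
    v = fy * y / z + cy
    return u, v

print(project(0.0, 0.0, 10.0))  # (960.0, 540.0) -- the principal point
```

As a consistency check, 2 * atan(540 / 2015) ≈ 30°, matching the stated vertical FOV.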
A right-handed coordinate system is used. In our 3D camera coordinates, x points right, y points down, and z points forward (the origin is the camera's optical center). In world space, y is up.
Our object ground truth is similar to KITTI's, but there are some differences you need to be aware of. The space-separated fields in each row have the following meanings (the leading number is the field count):
    1  frame id
    1  object id
    1  object category (sedan, pedestrian, cyclist, ...)
    ====== KITTI compatible part start ======
    1  KITTI-like truncation flag (0.0 ~ 1.0) => from non-truncated to truncated
    1  KITTI-like occlusion flag (0 = fully visible, 1 = partly occluded, 2 = largely occluded, 3 = unknown)
    1  KITTI-like observation angle in radians
    4  2D bounding box [left, top, right, bottom]
    3  dimensions [height, width, length]
    3  object center in camera coordinates [X, Y, Z]
    1  Y rotation in camera space (in radians)
    ====== KITTI compatible part end ======
    1  occlusion percentage (0.0 ~ 1.0) => fully visible to fully blocked
    2  horizontal/vertical truncation [truncation_x, truncation_y]
    3  object's rotation in camera space in Euler angles [X, Y, Z] (in radians)
    3  velocity in world coordinates [X, Y, Z]
    3  object center in world coordinates [X, Y, Z]
    3  object's rotation in world space in Euler angles [X, Y, Z] (in radians)
(The Euler angle order is Z-X-Y, using extrinsic rotation.)
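The layout above adds up to 32 space-separated fields per row. A hypothetical reader (the dictionary keys are illustrative names, not from the dataset tooling):

```python
def parse_object_row(row):
    """Split one object ground-truth row into the 32 fields listed above."""
    f = row.split()
    assert len(f) == 32, "expected 32 space-separated fields"
    return {
        "frame_id": int(f[0]),
        "object_id": int(f[1]),
        "category": f[2],
        "truncation": float(f[3]),                   # KITTI-like, 0.0 ~ 1.0
        "occlusion": int(f[4]),                      # KITTI-like, 0..3
        "alpha": float(f[5]),                        # observation angle (rad)
        "bbox_2d": [float(v) for v in f[6:10]],      # left, top, right, bottom
        "dimensions": [float(v) for v in f[10:13]],  # height, width, length
        "center_cam": [float(v) for v in f[13:16]],  # X, Y, Z
        "rotation_y": float(f[16]),                  # Y rotation in camera space
        "occlusion_pct": float(f[17]),
        "truncation_xy": [float(v) for v in f[18:20]],
        "rotation_cam": [float(v) for v in f[20:23]],
        "velocity_world": [float(v) for v in f[23:26]],
        "center_world": [float(v) for v in f[26:29]],
        "rotation_world": [float(v) for v in f[29:32]],
    }
```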
The visible portion of each lane line is sampled regularly along its 3D length (every 1 m) and output as a sequence of points. The lines represent the inner boundaries of the lane markings from the ego vehicle's perspective. For each line, its samples are listed one per row, in the direction of the ego vehicle's progression. Each row, representing a single point, has the following space-separated fields:
    1  global id
    1  lane marker type (SingleSolid, SingleDash, DoubleSolid, DoubleDash, LeftDashRightSolid, LeftSolidRightDash, Curb, Imaginary, Other)
    2  normalized pixel position of this lane point sample (origin at top-left)
    1  ego-centric lane index (-4 ~ 4; e.g. -1 means the left boundary of the ego lane and 1 means the right boundary of the ego lane)
    1  lane topology type (ForkLaneLeft, ForkLaneRight, MergeLaneLeft, MergeLaneRight, ParkingLane, CenterLane)
    1  lane marker color (White, Yellow)
    3  3D position of this sample in camera coordinates
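That is 10 space-separated fields per row. A hypothetical reader, analogous to the object ground-truth case (the key names are ours):

```python
def parse_lane_row(row):
    """Split one lane-line ground-truth row into the 10 fields listed above."""
    f = row.split()
    assert len(f) == 10, "expected 10 space-separated fields"
    return {
        "global_id": int(f[0]),
        "marker_type": f[1],
        "pixel_pos": (float(f[2]), float(f[3])),  # normalized, origin top-left
        "lane_index": int(f[4]),                  # -4 ~ 4; ego lane = -1 / 1
        "topology": f[5],
        "color": f[6],
        "pos_cam": (float(f[7]), float(f[8]), float(f[9])),
    }
```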
Each row consists of the frame index followed by the row-wise flattened 4x4 extrinsic matrix for that frame (again space-separated):
    M = | r1,1  r1,2  r1,3  t1 |
        | r2,1  r2,2  r2,3  t2 |
        | r3,1  r3,2  r3,3  t3 |
        |  0     0     0    1  |
where ri,j are the coefficients of the camera rotation matrix R and ti are the coefficients of the camera translation vector T.
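Applying M to a homogeneous world-space point yields camera-space coordinates. A minimal sketch, assuming the row-major flattened layout described above (the function name is ours):

```python
def world_to_camera(point_world, m_flat):
    """Transform a world-space point by a row-wise flattened 4x4
    extrinsic matrix (16 floats); returns camera-space (x, y, z)."""
    assert len(m_flat) == 16, "expected a flattened 4x4 matrix"
    x, y, z = point_world
    p = (x, y, z, 1.0)  # homogeneous coordinates
    # Only the first three rows matter; the last row is (0, 0, 0, 1).
    return tuple(sum(m_flat[4 * i + j] * p[j] for j in range(4))
                 for i in range(3))

identity = [1, 0, 0, 0,
            0, 1, 0, 0,
            0, 0, 1, 0,
            0, 0, 0, 1]
print(world_to_camera((1.0, 2.0, 3.0), identity))  # (1.0, 2.0, 3.0)
```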
For the full dataset, please click the "Download" button and fill out the form. A direct download link will be sent to the email address you provide in the form.