Tesla’s Lean and Mean Full-Self-Driving software

By Kerry Dennis Clancy    updated 9/21/2023

A Re-imagined Driving System

Tesla's developers have reimagined full self-driving with a straightforward hybrid approach. The upgraded system is built on AI models trained on real-world driving video. End-to-end neural networks, rather than hand-written C++ code, now manage the driving decisions.

On top of a superb vision system sits radar, which bounces signals ahead to see through bad atmospheric conditions and past the vehicles blocking your view. The combination of these technologies provides superhuman driving perception.

Training the driving data model is constrained by the amount of compute available.

A supercomputer called Dojo is being scaled up to increase the training capacity. Dojo uses proprietary hardware and software and works alongside the Nvidia-based supercomputer already at work on the V12 FSD effort.

Dojo is absorbing millions of hours of driving video to learn how to handle unusual road situations known as edge cases. The videos are selected for good driving. Tesla is training the car's software to drive through odd conditions that can occur anywhere, not just on highways. The data captures what can happen to a driver on every type of road.

The updated vision system incorporates various technologies in its stack. The software stack behaves like tasks working in parallel, integrated into the driving response system. Among those tasks are occupancy networks, diffusion, road-lane modeling, three-dimensional look-ahead, and learned object recognition, all within the framework of the self-driving system. The tasks are tied end-to-end like pieces of a puzzle being put together. Remarkably, only about 3,000 lines of code are needed to stitch the driving response signals together.
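The shape of that stack can be sketched in miniature. Everything below is invented for illustration: the task names, the stubbed outputs, and the toy decision rule are placeholders, not Tesla's actual interfaces. The point is only how a thin stitching layer can merge parallel task outputs into one response.

```python
# Illustrative sketch: parallel perception tasks feeding one thin
# stitching layer. All task bodies are stubs with made-up values.

def occupancy_task(frame):   return {"free_ahead_m": 40.0}
def lane_task(frame):        return {"lane_offset_m": 0.1}
def object_task(frame):      return {"lead_vehicle_speed": 18.0}

TASKS = [occupancy_task, lane_task, object_task]

def drive_step(frame):
    """Merge every task's output into one world state, then derive
    a (placeholder) control decision from it."""
    world = {}
    for task in TASKS:
        world.update(task(frame))
    decision = "cruise" if world["free_ahead_m"] > 20 else "brake"
    return world, decision

world, decision = drive_step(frame=None)
print(decision)   # → cruise
```

The stitching layer stays tiny because each task hands back a small, structured result; that is the flavor of a few thousand lines tying the pieces together.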

The accomplishment of curing latency problems for immediate response cannot be overstated. The software must be lean enough to compute a condition and response in a matter of milliseconds. Self-driving does not have the luxury of puzzling out what is happening at a moment in time; it has to use artificial intelligence to predict what is about to happen. Imagine something veering into your lane. You can easily predict it will probably collide. The self-driving software must see this too.
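The "predict the upcoming moments" idea can be shown with a toy constant-velocity look-ahead. This is invented for illustration and is far simpler than FSD's learned prediction: it just projects straight-line paths and checks whether they come too close within a short horizon.

```python
# Toy look-ahead: project ego and object positions forward under
# constant velocity and flag a future conflict. Units are meters
# and meters/second; all numbers are made up for the example.

def predict_position(pos, vel, dt):
    """Project an (x, y) position dt seconds ahead."""
    return (pos[0] + vel[0] * dt, pos[1] + vel[1] * dt)

def will_conflict(ego_pos, ego_vel, obj_pos, obj_vel,
                  horizon=2.0, step=0.1, radius=2.0):
    """True if the predicted paths come within `radius` meters at
    any time step over the look-ahead horizon (seconds)."""
    t = 0.0
    while t <= horizon:
        ex, ey = predict_position(ego_pos, ego_vel, t)
        ox, oy = predict_position(obj_pos, obj_vel, t)
        if ((ex - ox) ** 2 + (ey - oy) ** 2) ** 0.5 < radius:
            return True
        t += step
    return False

# Something veering into our lane: 20 m ahead, 3 m to the side,
# converging on our path.
print(will_conflict((0, 0), (0, 15), (3, 20), (-1.5, 5)))   # → True
# Same car holding its own lane in parallel: no conflict.
print(will_conflict((0, 0), (0, 15), (3, 20), (0, 15)))     # → False
```

A real system replaces the constant-velocity assumption with learned behavior models, but the logic of acting on a predicted future rather than the current frame is the same.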

Competitive Advantage

Keeping up with these innovations is a problem for competitors.

  • iPhone-like updates: Tesla provides software updates over the air.
  • Energized workforce: An engineering team in relentless pursuit of innovation, driven by an exciting and compelling mission, attracting the best and brightest who are young enough to handle the grind.
  • Cost of admission: Having the cash for research and development expenditures like the Dojo supercomputer. That's billions with a "B."
  • Loads of Data: Billions of miles of curated driving data to feed the AI learning model.

About “Occupancy Networks”

Occupancy networks model which parts of the space in the field of vision are occupied, and they follow objects frame by frame in the video stream.
Two dimensions provide surface area, whereas three dimensions provide volume. The FSD team smartly encoded the geometric features (tokens, in AI speak) in order to synthesize the space quickly, then trained neural nets to do that encoding automatically. Using this abbreviated technique, the software can manage all eight cameras to synthesize a 360-degree view.
A frame-by-frame analysis differentiates stationary things from moveable objects. Moveable objects are boxed out and tracked. A predictive task anticipates the movement of these objects relative to the stationary background as you proceed on your path.
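A toy two-dimensional version makes the stationary-versus-moveable split concrete. Tesla's real occupancy networks predict 3D voxel occupancy from camera images; this hand-built grid comparison only illustrates the underlying idea of differencing two frames of occupied cells.

```python
# Toy 2D occupancy grids, hand-built for illustration. Each frame is
# the set of grid cells currently occupied. Cells occupied in both
# frames read as static structure; newly occupied cells are
# candidate moving objects to box out and track.

def diff_grids(prev, curr):
    """Split the current frame's occupied cells into static and
    candidate-moving sets by comparing against the previous frame."""
    static, moving = set(), set()
    for cell in curr:
        (static if cell in prev else moving).add(cell)
    return static, moving

frame1 = {(0, 0), (0, 1), (5, 5)}    # wall cells plus a car
frame2 = {(0, 0), (0, 1), (5, 6)}    # the car advanced one cell

static, moving = diff_grids(frame1, frame2)
print(sorted(static))   # → [(0, 0), (0, 1)]
print(sorted(moving))   # → [(5, 6)]
```

The real networks do this in three dimensions across eight camera views and many frames, but the differencing intuition carries over.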

About “Generative Modeling of Lanes”

A predictive model was developed to draw the lane lines, a separate task in the self-driving software. The catch is that lanes may not be visible to the eye or to the camera; in that case they need to be figured out. It's road geometry in, driving lanes out. The "generative" label refers to using neural nets that have learned how to draw the lines.
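"Road geometry in, driving lanes out" can be sketched with plain interpolation. The real task uses a learned generative model; this toy, with made-up edge coordinates, just infers invisible lane dividers from the two road edges by spacing them evenly.

```python
# Toy lane inference: given matching points along the left and right
# road edges, place the n_lanes - 1 invisible divider lines by even
# interpolation. A stand-in for the learned generative model.

def infer_lane_lines(left_edge, right_edge, n_lanes):
    """Return one polyline per lane divider across the road width."""
    dividers = []
    for k in range(1, n_lanes):
        line = [(lx + (rx - lx) * k / n_lanes,
                 ly + (ry - ly) * k / n_lanes)
                for (lx, ly), (rx, ry) in zip(left_edge, right_edge)]
        dividers.append(line)
    return dividers

# A straight 9 m wide road segment carrying three lanes:
left = [(0.0, 0.0), (0.0, 10.0)]
right = [(9.0, 0.0), (9.0, 10.0)]
print(infer_lane_lines(left, right, 3))
# → [[(3.0, 0.0), (3.0, 10.0)], [(6.0, 0.0), (6.0, 10.0)]]
```

The learned version earns its keep on curved, forked, or unmarked roads where even spacing is exactly what you cannot assume.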

About “Diffusion”

Diffusion is about bringing vision into focus. One task is to clean up the video when something is blurring the view. The algorithm is trained by progressively clouding up a picture with noise and then learning to restore it to its original state. Setting up that training is a job for the artificial intelligence team.
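The "clouding up" half of that training loop is easy to show. This sketch implements only the forward noising step on a made-up list of pixel values; in real diffusion training, a neural network then learns to run the process backward, and that learned denoiser is omitted here.

```python
import random

# Forward "clouding" step of diffusion training: blend a clean signal
# toward pure noise as t increases. The learned reverse (denoising)
# network is the part that actually cleans up vision; it is omitted.

def add_noise(pixels, t, steps=10, rng=random.Random(0)):
    """Mix clean pixel values with Gaussian noise; at t=0 the input
    survives untouched, at t=steps only noise remains."""
    alpha = 1.0 - t / steps          # fraction of clean signal kept
    return [alpha * p + (1.0 - alpha) * rng.gauss(0.0, 1.0)
            for p in pixels]

clean = [0.2, 0.8, 0.5]
slightly_noisy = add_noise(clean, t=1)    # still mostly recognizable
pure_noise = add_noise(clean, t=10)       # original image is gone
print(slightly_noisy, pure_noise)
```

Training pairs each noisy version with its clean original, which is how the model learns to uncloud a blurred view it has never seen before.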

3D Road Mapping

Three-dimensional graphics have been employed in engineering for decades. The Tesla team has developed a 3D look-ahead that simulates a video stream: the software can reconstruct the sights ahead in advance.

Any video can be analyzed frame by frame to understand the dimensional perception of the driving experience. YouTube videos, for example, can be fed into the learning model, and the 3D video can then be reconstructed from memory. The real task, however, is to synthesize the road vision ahead accurately.

If you know the background of an area, it's easier to spot moving objects within it. The software has learned about spatiality, that is to say, things moving around relative to the coordinate grid and the permanent structures within it.

Once objects have been delineated from the background, the software needs to understand their properties and behaviors; that is, it needs to know what these objects are and what to expect from them. More simply, vehicles, pedestrians, and whatever else are either problems to be avoided or just things moving around in the background. The artificial intelligence lies in understanding the difference.

First, object edge boundaries are boxed out and the type of object is identified. Understanding what an object is, another vehicle, for instance, lets the software predict where it is going. A frame-by-frame comparison derives the location and velocity of the driver and of the objects within the visual window.
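Deriving velocity from boxed-out positions in consecutive frames is simple arithmetic, sketched below. This is a hand-rolled illustration, not the FSD tracker; the bounding-box coordinates and the 36 fps frame rate are assumptions for the example.

```python
# Velocity from frame-by-frame bounding boxes: track the box center
# between consecutive frames and scale by the frame rate. All values
# here are made up for illustration.

def box_center(box):
    """Center point of an (x1, y1, x2, y2) bounding box."""
    x1, y1, x2, y2 = box
    return ((x1 + x2) / 2, (y1 + y2) / 2)

def velocity_between_frames(box_prev, box_curr, fps=36):
    """Per-second velocity of the box center at the given frame rate
    (36 fps is an assumed figure, not a Tesla specification)."""
    (px, py), (cx, cy) = box_center(box_prev), box_center(box_curr)
    return ((cx - px) * fps, (cy - py) * fps)

# A box that shifted 0.5 units to the right between two frames:
print(velocity_between_frames((0, 0, 2, 2), (0.5, 0, 2.5, 2)))
# → (18.0, 0.0)
```

Running the same subtraction over many frames smooths out jitter and yields the location and velocity estimates the prediction tasks consume.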

Object Recognition

Things moving within the grid have behaviors that FSD can quickly pick up on. Their direction and velocity are calculated. The FSD training categorizes objects so that they become predictable. Traffic signs and signals are learned and recognized so that the FSD vehicle operates correctly.

Types of objects are encoded in the AI learning model. In biological terms, we'd call it imprinting, like a baby duck imprinting on its mother: the duckling sees the momma duck and knows it's momma because of her physical characteristics.
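A nearest-prototype classifier makes a serviceable stand-in for that imprinting idea: new objects are matched to whichever stored prototype they most resemble. The categories and feature values below (length and height in meters) are invented for the sketch; real FSD recognition is a learned neural network, not a lookup table.

```python
# Toy "imprinting": classify an object by its nearest stored
# prototype. Features and values are made up for illustration.

PROTOTYPES = {
    "car":        (4.5, 1.5),    # (length m, height m)
    "truck":      (12.0, 3.5),
    "pedestrian": (0.5, 1.7),
}

def classify(length, height):
    """Return the prototype label closest to the measured features."""
    def dist(label):
        pl, ph = PROTOTYPES[label]
        return (length - pl) ** 2 + (height - ph) ** 2
    return min(PROTOTYPES, key=dist)

print(classify(4.2, 1.4))   # → car
print(classify(0.6, 1.8))   # → pedestrian
```

Once an object carries a label, the software knows which behaviors to expect from it, which is the whole point of the categorization.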

About Deconvolution

FSD software uses deconvolution algorithms to transform pixels into physical objects.

One example of deconvolution is seismic deconvolution for oil exploration. Explosions send sound vibrations into the ground. The sound energy bounces off layers of different materials at different speeds. The energy reverberates, that is, echoes, back to the recording machines. The recorded vibrations are convoluted, meaning mixed together. The deconvolution process transforms those sound waves into a slice of the earth's physical subsurface.

Deconvoluting a pixelated image means determining the outline of a physical object. It identifies the object as a thing to be boxed out for analysis. Since a frame-by-frame comparison reveals plenty of behavior patterns, it is not a huge leap to fill in what the object is. So the expression applies: if it looks like a duck, waddles like a duck, and quacks like a duck, it's a duck.
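A one-dimensional example shows the mechanics. Here a sharp signal (two "objects" as spikes) is smeared by a known blur kernel and then recovered exactly by back-substitution. All numbers are invented; image and seismic deconvolution are 2D and 3D versions of the same idea, usually with a kernel that must itself be estimated.

```python
# 1D deconvolution with a known blur kernel, for illustration only.
# convolve() smears a signal; deconvolve() undoes it exactly by
# solving for each sample in order (back-substitution).

def convolve(signal, kernel):
    """Full convolution of a signal with a blur kernel."""
    out = [0.0] * (len(signal) + len(kernel) - 1)
    for i, s in enumerate(signal):
        for j, k in enumerate(kernel):
            out[i + j] += s * k
    return out

def deconvolve(blurred, kernel):
    """Recover the original signal when the kernel is known exactly."""
    n = len(blurred) - len(kernel) + 1
    signal = []
    for i in range(n):
        acc = blurred[i]
        for j in range(1, len(kernel)):
            if 0 <= i - j < len(signal):
                acc -= kernel[j] * signal[i - j]
        signal.append(acc / kernel[0])
    return signal

sharp = [0.0, 1.0, 0.0, 0.0, 2.0]          # two spikes ("objects")
blurred = convolve(sharp, [1.0, 0.6, 0.3])  # smeared together
print(deconvolve(blurred, [1.0, 0.6, 0.3]))
# → [0.0, 1.0, 0.0, 0.0, 2.0]
```

Recovering crisp spikes from the smear is the 1D analogue of turning blurry pixels back into a sharp object outline that can be boxed out.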

Object Tracking

Tracking oncoming traffic is fairly straightforward. More difficult calculations are needed when vehicles or pedestrians approach from the side, as at an intersection. Anything entering the field of vision from an odd angle requires more computation. Then again, computers are really good at computing, so this task is not a huge stretch.

Collision Detection

So far the FSD software has cleaned up the pictures and figured out what the things moving around in the field of vision are. It knows where the driver is located, where other vehicles are, and how fast they are moving. With three-dimensional vision mapped into the route planner, the software sees and understands the driving experience. Another task, of course, is to avoid accidents.

To avoid collisions, our brains have learned to anticipate what is coming up and what the possible threats might be. Simulating the driver's ability to anticipate is where the predictive nature of artificial intelligence comes into play. If we know the normal terrain and road conditions, we can anticipate what to expect along the way and notice when something has changed from what we expect.

The Tesla FSD team built a reconstruction, or look-ahead prediction, task into the software. If what the software predicted does not match what actually comes up, there is a problem the software needs to react to. In other words: something changed on my path; what is it, and what does it mean for my driving behavior?

By doing this look-ahead task, the software can react and adjust the operational instructions for the steering and velocity controls. This is true driving artificial intelligence: the FSD software simulates what humans do, only with more accuracy.
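The prediction-versus-reality check can be reduced to a small comparison. Everything in this sketch is invented (the scene representation, the object IDs, the threshold); it shows only the trigger logic, where a mismatch between predicted and observed states flags the need to react.

```python
# Miniature look-ahead check: compare the predicted scene for this
# frame with what was actually observed, and flag a reaction when
# the mismatch exceeds a threshold. All values are made up.

def needs_reaction(predicted_scene, observed_scene, threshold=1.0):
    """Scenes are {object_id: (x, y)} maps, in meters. React if any
    observed object is far from its predicted spot, or is new."""
    for obj, (ox, oy) in observed_scene.items():
        if obj not in predicted_scene:
            return True                      # something new on the path
        px, py = predicted_scene[obj]
        if ((ox - px) ** 2 + (oy - py) ** 2) ** 0.5 > threshold:
            return True                      # object off its expected track
    return False

predicted = {"car_12": (0.0, 30.0)}
observed = {"car_12": (0.2, 29.8)}           # close enough: keep cruising
print(needs_reaction(predicted, observed))   # → False
observed = {"car_12": (3.5, 29.8)}           # swerved: react
print(needs_reaction(predicted, observed))   # → True
```

When the flag fires, the steering and velocity controls get updated instructions, which is the "react and adjust" step the paragraph above describes.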


Ashok Elluswamy, Tesla

