A never-ending problem for mobile robotics is funneling a petabyte-dense visual world into just enough megabytes to help the robot act correctly in real time. One example from autonomous mowing: identifying and navigating around small obstacles, like sprinklers, that could be damaged (or damage the mower).
Our team at Greenzie decided that the best way to identify small obstacles was to develop a custom image segmentation model. Here's a look at how we're developing and deploying this model to our fleet of autonomous lawnmowers.
Greenzie autonomous lawnmowers use a set of stereo cameras to generate a 3D point cloud of their surroundings. We group points into clusters to identify obstacles that the robot needs to avoid. However, it's not easy to classify an object from a point cloud alone. For example, below is a side-by-side display of the depth cloud output and the image display from a stereo camera (source):
From the depth display alone, it's not possible to identify the object as a flower. From the photo, it's easy. In fact, off-the-shelf computer vision models can identify these as flowers; a point cloud cannot.
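To make the point-cloud side concrete, here's a minimal sketch of the kind of Euclidean clustering that turns a depth cloud into candidate obstacles. The topic name and the DBSCAN parameters are illustrative assumptions, not our production values:

```python
# Illustrative sketch only: cluster a depth point cloud into candidate obstacles.
# The topic name and clustering parameters are assumptions, not production values.
import rospy
import numpy as np
from sensor_msgs.msg import PointCloud2
import sensor_msgs.point_cloud2 as pc2
from sklearn.cluster import DBSCAN

def cloud_callback(msg):
    # Pull xyz points out of the PointCloud2 message, skipping invalid returns.
    points = np.array(list(pc2.read_points(msg, field_names=("x", "y", "z"), skip_nans=True)))
    if points.size == 0:
        return

    # Group nearby points; each cluster becomes a candidate obstacle.
    labels = DBSCAN(eps=0.15, min_samples=20).fit_predict(points)
    for label in set(labels) - {-1}:  # -1 is DBSCAN's "noise" label
        cluster = points[labels == label]
        centroid = cluster.mean(axis=0)
        rospy.loginfo("obstacle candidate at %.2f, %.2f, %.2f (%d points)",
                      centroid[0], centroid[1], centroid[2], len(cluster))
        # Note: we still have no idea *what* this cluster is -- flower, sprinkler, or rock.

if __name__ == "__main__":
    rospy.init_node("cloud_cluster_sketch")
    rospy.Subscriber("/camera/depth/points", PointCloud2, cloud_callback)
    rospy.spin()
```

Clustering tells us *where* something is sticking up out of the grass, but nothing about what it is.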
So, if Greenzie can already navigate around obstacles, why do we need an additional layer in our perception stack? Maybe we're just trying to add "AI-powered™" stickers to our robots? Well, many obstacles sit close enough to the ground plane that they can be difficult to identify from a point cloud. Some of these objects could be damaged if the mower were to travel over them with the blades on. Some could damage the mower. For example, the objects in the image below can be difficult to detect in a 3D point cloud:
Adding the ability for our robots to sense additional objects via an image segmentation model gives us two big wins:
In the introduction, I jumped straight to our decision to create an image segmentation model. Let's take a quick look at the three types of models we considered.
Object detection generates a bounding box around each identified object. For example, we could train a model to identify obstacles and their classification, much like how this model identifies dogs, bikes, and other objects:
Image segmentation classifies the category of items at the pixel level rather than using a bounding box. For example, here's an image segmentation result showing the sky, grass, and not-grass:
Instance segmentation combines both of the above approaches, allowing the model to identify individual instances of items in an image at the pixel level. For example, you could count the number of mulch beds.
We want precise contours of "blades on", "obstacle", and "sky" regions, which rules out object detection. We don't need the individual instances of objects that an instance segmentation model offers. That leaves image segmentation as the right fit.
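To make the distinction concrete, a semantic segmentation model gives us a per-pixel class map rather than boxes. Here's a minimal sketch of decoding that kind of output; the class ordering, output shape, and colors are assumptions for illustration:

```python
# Illustrative sketch: turn a segmentation model's raw output into a per-pixel class map.
# Class order, output shape, and color choices are assumptions for illustration.
import numpy as np

CLASSES = ["blades_on", "obstacle", "sky"]
COLORS = np.array([[0, 255, 0], [255, 0, 0], [0, 0, 255]], dtype=np.uint8)  # green / red / blue

def decode_segmentation(logits: np.ndarray) -> np.ndarray:
    """logits: (num_classes, H, W) per-class scores -> (H, W) array of class indices."""
    return np.argmax(logits, axis=0)

def colorize(class_map: np.ndarray) -> np.ndarray:
    """Map class indices to an RGB overlay for visual inspection."""
    return COLORS[class_map]

# Example with a fake 2x3 "image": every pixel gets exactly one of the three classes.
fake_logits = np.random.rand(len(CLASSES), 2, 3)
class_map = decode_segmentation(fake_logits)
print(class_map)                   # e.g. [[0 2 1] [0 0 2]]
print(colorize(class_map).shape)   # (2, 3, 3) RGB overlay
```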
With the problem defined and the ML model type chosen, development was ready to begin.
Our current platform uses a set of Intel Realsense cameras to capture images and depth data. However, we were concerned about the resource usage of running ML models on the host. We just happened to have a convenient mounting location for a Luxonis OAK-D camera, which has its own on-board vision processor. We can deploy the model to the OAK-D, keeping the host computer's resources free for our other robotic work.
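As a rough sketch of what running a model on the camera itself looks like using the DepthAI Python API (the blob path, preview size, and stream name below are placeholders, and this is the bare API rather than our ROS integration):

```python
# Minimal DepthAI sketch: run a neural network on the OAK-D and read results on the host.
# The blob path, preview size, and stream name are placeholders, not production values.
import depthai as dai
import numpy as np

pipeline = dai.Pipeline()

cam = pipeline.create(dai.node.ColorCamera)
cam.setPreviewSize(256, 256)               # must match the model's expected input size
cam.setInterleaved(False)

nn = pipeline.create(dai.node.NeuralNetwork)
nn.setBlobPath("segmentation_model.blob")  # model compiled for the OAK-D's VPU
cam.preview.link(nn.input)

xout = pipeline.create(dai.node.XLinkOut)
xout.setStreamName("segmentation")
nn.out.link(xout.input)

with dai.Device(pipeline) as device:
    queue = device.getOutputQueue(name="segmentation", maxSize=4, blocking=False)
    while True:
        result = queue.get()  # blocks until the camera produces an inference
        # Raw FP16 output; reshape to (num_classes, H, W) before argmax-ing per pixel.
        logits = np.array(result.getFirstLayerFp16())
        # ... hand logits off to the rest of the perception stack ...
```

The important part is that the inference happens on the camera's VPU; the host only receives the (much smaller) output tensor.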
We used an iPhone to photograph areas similar to what our robotic workers mow. This included extensive photos of obstacles that could appear within a mowing map yet go unseen by our depth cloud obstacle logic. These photos were sent to an image annotation service.
There's no getting around it: assembling and reviewing annotation results is a tedious process. I felt more like a bookkeeper than a developer. That said, it was a valuable introspective process that showed where our annotation guidance was not clear. For example:
Let's look at an example of annotation instructions gone wrong. Our initial instructions to the annotation service asked them to classify images into three classes: sky, obstacles, and grass. Look at some of these initial model results in photos from the field:
In the above model inference results, the overlays show how worn-down, high-traffic areas of a soccer field are classified as "not grass". This is technically correct, but it would be annoying when operating the mower: the mower would stop and/or navigate around these areas, even though it's fine to have the blades on over them.
Based on this experience, we updated our instructions, changing the "grass" classification to "blades on". The "blades on" classification now includes dirt patches.
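One way to carry existing annotations through a class change like this, without a full re-annotation pass, is a simple label remap. This is only a sketch of that idea, assuming masks stored as class-index arrays; the class IDs below are hypothetical, not the IDs in our annotation exports:

```python
# Illustrative sketch: remap annotation masks after collapsing classes into "blades on".
# The class IDs are hypothetical; use whatever IDs your annotation export actually contains.
import numpy as np

OLD_CLASSES = {0: "sky", 1: "obstacle", 2: "grass", 3: "dirt"}      # hypothetical old IDs
NEW_CLASSES = {0: "sky", 1: "obstacle", 2: "blades_on"}             # target class set
REMAP = {0: 0, 1: 1, 2: 2, 3: 2}   # both "grass" and "dirt" collapse into "blades_on"

def remap_mask(mask: np.ndarray) -> np.ndarray:
    remapped = mask.copy()
    for old_id, new_id in REMAP.items():
        remapped[mask == old_id] = new_id
    return remapped
```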
Greenzie's robotic lawnmowers use the Robot Operating System (ROS). We worked with the Luxonis team to integrate the sensor data from the OAK-D camera into our system via their ROS DepthAI package. To start, we deployed an off-the-shelf model (Tiny YOLOv4) to verify the end-to-end functionality.
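Verifying that end-to-end path can be as simple as a subscriber that logs detections as they come off the camera's ROS driver. The topic name and message type below are assumptions; adjust them to match your DepthAI ROS launch configuration:

```python
# Quick end-to-end sanity check: log detections coming out of the camera's ROS driver.
# The topic name and message type are assumptions; match them to your launch files.
import rospy
from vision_msgs.msg import Detection2DArray

def detections_callback(msg):
    rospy.loginfo("received %d detections from the OAK-D", len(msg.detections))

if __name__ == "__main__":
    rospy.init_node("oakd_detection_check")
    rospy.Subscriber("/oak/nn/detections", Detection2DArray, detections_callback)
    rospy.spin()
```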
We again worked with Luxonis to develop the image segmentation model. Their team selected images for the training dataset, sent the images off for annotation, and trained the first several versions of the model. Once Luxonis completed a couple of ad hoc training sessions, our team focused on building a reproducible training environment on AWS Deep Learning Images.
Our training dataset was collected with an iPhone, not with image data from robots in the field. This was convenient (and there isn't a lot of mowing in the winter), but it was a risk: it's not the same environment the model sees in production.
We developed sampling logic that saves an image from the OAK-D camera once per minute. These images are sent to the cloud, where we can analyze the results later.
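A minimal sketch of that sampling logic, assuming a ROS image topic and a hypothetical local spool directory that a separate uploader drains:

```python
# Illustrative sketch: save one frame per minute from the OAK-D image topic for later analysis.
# The topic name and output directory are assumptions, not our production configuration.
import os
import time
import rospy
import cv2
from sensor_msgs.msg import Image
from cv_bridge import CvBridge

SAMPLE_PERIOD_S = 60.0
OUTPUT_DIR = "/var/greenzie/model_samples"   # hypothetical spool directory drained by an uploader

bridge = CvBridge()
last_saved = 0.0

def image_callback(msg):
    global last_saved
    now = time.time()
    if now - last_saved < SAMPLE_PERIOD_S:
        return  # throttle: keep at most one frame per minute
    last_saved = now
    frame = bridge.imgmsg_to_cv2(msg, desired_encoding="bgr8")
    path = os.path.join(OUTPUT_DIR, "%d.jpg" % int(now))
    cv2.imwrite(path, frame)
    rospy.loginfo("saved sample image %s", path)

if __name__ == "__main__":
    os.makedirs(OUTPUT_DIR, exist_ok=True)
    rospy.init_node("image_sampler")
    rospy.Subscriber("/oak/rgb/image_raw", Image, image_callback)
    rospy.spin()
```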
With image sampling in place, we developed a pipeline to transfer the images to S3 and generate side-by-side comparisons of the robot's image with the model inference results. This makes it easy to spot areas where the model gets confused. For example, notice that the model struggled to identify grass in blurred regions:
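For illustration, the analysis side can be as simple as a boto3 upload plus an OpenCV composite of the raw frame next to its segmentation overlay. The bucket name, key layout, and color mapping below are assumptions, not our production pipeline:

```python
# Illustrative sketch: upload a sampled frame to S3 and build a side-by-side review image
# of the raw frame and the colorized inference overlay. Names and paths are assumptions.
import boto3
import cv2
import numpy as np

s3 = boto3.client("s3")
BUCKET = "example-model-monitoring"   # hypothetical bucket name

def upload_sample(local_path: str, key: str) -> None:
    s3.upload_file(local_path, BUCKET, key)

def side_by_side(frame: np.ndarray, class_map: np.ndarray) -> np.ndarray:
    """Stack the raw frame next to its segmentation overlay for quick visual review."""
    colors = np.array([[0, 255, 0], [0, 0, 255], [255, 0, 0]], dtype=np.uint8)  # BGR per class
    overlay = colors[class_map]
    overlay = cv2.resize(overlay, (frame.shape[1], frame.shape[0]),
                         interpolation=cv2.INTER_NEAREST)
    blended = cv2.addWeighted(frame, 0.6, overlay, 0.4, 0.0)
    return np.hstack([frame, blended])
```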
Now that mowing season is in full swing, the size of our monitoring dataset will increase by several orders of magnitude, and it comes from the source of truth: the production sensors, running real customer mowing jobs.
With monitoring of production ML inference in place, we can refine the model more quickly until it meets the criteria for acting on its results, which will help us address the rough edges sooner. We'll cycle through the previous steps: evaluating real-world inference results, augmenting our training dataset with new scenarios, and retraining the model.
Once-per-minute sampling is great, but it has a low signal-to-noise ratio: the vast majority of the time there are no obstacles in the mowing area. To collect more fine-grained data, we implemented the following in our ROS codebase:
We're at the beginning of this more nuanced approach to gathering data.
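The specifics are tied to our stack, but as a rough illustration of the idea, capture can be triggered whenever the model thinks a meaningful fraction of the frame is an obstacle. The class ID and threshold below are hypothetical:

```python
# Rough illustration of event-driven sampling: only keep frames where the model sees
# a meaningful amount of "obstacle". The class ID and threshold are hypothetical.
import numpy as np

OBSTACLE_CLASS_ID = 1
OBSTACLE_PIXEL_RATIO_THRESHOLD = 0.02   # e.g. at least 2% of pixels classified as obstacle

def should_capture(class_map: np.ndarray) -> bool:
    obstacle_ratio = float(np.mean(class_map == OBSTACLE_CLASS_ID))
    return obstacle_ratio >= OBSTACLE_PIXEL_RATIO_THRESHOLD
```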
It's important to define "good enough" model performance, or the cycle of retraining would never end. The model won't be perfect at identifying small obstacles, but neither is a human operator. We've established the following criteria that must be met before we enable the robot to act on the model inference results:
Once these are met, we'll set new metrics (or simply update these thresholds) for enabling the mower to operate more precisely around the boundary areas between "blades on" and "obstacles" (such as mowing tightly around a mulch bed).
We're beginning the final push: that last 10% of refinement that will take a bit of time to fully enable in production. These are possible areas we could explore next: