DDS QA

 

Delay issue

Will your approach has higher latency?

First, DDS needs to buffer frames at the camera since DDS encodes the video into video segments. So DDS has higher response delay than those methods that do not buffer frames at the camera, like camera-side heuristics. However, when we compare DDS with other baselines, we find that DDS generates about 90% of the inference results before the iteration starts. This part of results are derived from low-quality video, so the streaming delay on the network is actually lower than the baselines. So the average response delay per inference results is actually lower than the baseline.

The delay is high because …. but that is not the case for the majority of the objects, details

Why average response delay?

We agree that average responses delay is not a good metric for those applications that are sensitive to worst-time delay, like counting the number of cars on every frame. But average response delay is meaningful if we can immediately update the analytic metric after generating a new inference results. One example is calculating the average speed of cars on a road).

Compute cost issue

You are actually trading compute cost to bandwidth cost. Is that a worthwhile trade?

It depends on which part of cost is dominant. For example, if the camera is connected through ethernet cable, DDS might be sub-optimal. But DDS is a good solution if the camera is connected to a personal cluster through LTE.

Differences and rationales

How is this different from region-of-interest encoding

The difference is that region of interest of our approach is proposed outside the video source and region-of-interest encoding propose regions at the video source. Thus, in our approach, we can only iterate on what should be sent.

Super resolution

The short answer is that super-resolution-based methods leverage long-term redundency between frames with extra computation. So these methods will be great when the content are changing slowly, and other methods, like our approach, might be a better choice when the content are changing rapidly.

Privacy regulation

Good question. The short answer is that DDS can preserve privacy but is not secure. When the server is trust-worthy, DDS can actually help the user to hide those informations that are not relevant to the inferernce results. But on the contrary, when the server is malicious, the camera is fully controlled by the server and the server can collect any information it wants.

Why DDS can propose more regions than the inference results?

The fundamental reason is that generating regions that contain objects is a binary classificaton task, and is easier than gnerating regions that contain specific types of objects, which is a multi-class classification. So we can generate more regions that contain objects than region that contain specific types of objects. These regions are exactly what DDS is looking for.

Why stream the video out from the camera for analytics?

The short answer is that video streaming for analytics is typically more cheap. The reason is that accurate deep-neural-net-based analytic algorithms require the support of high-end GPU. But putting the GPU at the camera is much more expensive than putting the GPU at the server. There are a lot of reasons for this, like the camera-side GPU cannot be reused by other video streams and other tasks, the GPU also occupies a lot of space in the camera and requires powerful heat dissipation and so on.

Questions from the preview video

How might this solution change if you could run more computation at the sender?

Good question. My intuition is that it depends on how accurate the camera-side heuristics can be. When the video content is not challenging, like in home camera, the accuracy of camera-side heuristics may be pretty high and usable. But when the video content is really challenging, like in drone camera, the accuracy of camera-side heuristics is low and unreliable. In this case, we may want to stick with our original solution.

How does DDS balance sender-side and receiver-side computation compare with prior work?

DDS choose to utilize the server-side compute instead of the camera-side compute to guide the video streaming process. This approach will strike a better balance if the server-side compute is relatively cheap, like in the case that the user has his own server. But when the server-side compute is expensive, like in the case that the user rents the server, or rents the inference api, it might be better to go for camera-side compute like previous works.