I have one of these powering my open source four wheel drive robot. [1]
I've started doing machine learning experiments with it finally. (See [1] for details)
There's a few tricks to getting the best performance. You want to convert your neural network to run with NVIDIA's TensorRT library instead of just tensorflow or torch. TensorRT does all the optimized goodness that gets you the most out of the hardware. Not all possible network operations can run in TensorRT (though nvidia updates the framework regularly). This means some networks can't be easily converted to something fully optimized for this platform. Facebook's detectron2 for example uses some operations that don't readily convert. [2]
But then if you're new like me you've got to both find some code that will ultimately produce something you can convert to TensorRT, and you also need something that you can easily train. I've learned that training using your own dataset is often non-obvious. A lot of example code shows how to use an existing dataset but they totally gloss over the specific label format those datasets use. That means you've got to do some digging to figure out how to make your own dataset load properly in to the training code.
After trying a few different things, I've gotten some good results training using Bonnet(al) [3]. I was able to make enough sense of its training code to use my own dataset, and it looks like it will readily convert to TensorRT. Then you load the converted network using NVIDIA's Deepstream library for maximum pipeline efficiency [4].
The performance numbers for the AGX Xavier are very good, and I am hopeful I will get my application fully operational soon enough.
For dealing with layers not supported by TensorRT, you might want to try to export to onnx instead and then use tvm[1] to compile your model for the hardware.
I have not used it on nvidia boards yet, but I had good experience on other less powerful ARM boards, and as tvm docs show some examples running on jetson tx I imagine the AGX Xavier is likely fine too.
My biggest complaint with the Jetson line is it's all ARM. Look, I get it. But the developer experience is horrible. Building Docker containers for ARM devices is a pain. Hell, building anything for a Jetson can be a pain unless it's a pre-packaged NVIDIA thing - really not a fan of building things from source. Add on top of that NVIDIA's very low level documentation for pretty much any tooling they ship, coupled with the difficulty in getting near-time engineering support (unless you want to post to one of their message boards, and hope you get an answer back in less than a week)... basically, it's really rough to do anything seriously useful with Jetson hardware.
Second biggest complaint is deploying Jetsons in production environments. Dev kits aren't production stable, so you either need to build your own carrier board or find one pre-built, and frankly that's just a giant pain to do.
Third biggest complaint is having to flash Jetsons manually. Misery.
A production-ready x64 Jetson that you could order directly from NVIDIA would be my dream. Add up all of the shortcomings and overhead of ARM Jetsons and IMO you do not have a viable device for shipping AI solutions at scale.
The dev support is also bottom-of-the-barrel even if you’re a high-margin cloud customer. For a generous upper bound of what Nvidia considers “software support,” look at TensorRT, where a majority of the useful stuff has either been written by third parties or scoped out the hard way by people trying to use it. Nvidia isn’t really a software company, and their core product has a very narrow user interface. These factors hamper whatever you can get out of a Jetson.
I have a bin full of ARM single board computers, and while the hardware on all of them is pretty much up to the task, the software support from all the vendors has been terrible. I'm in the process of switching to Nvidia hoping it would be the exception.
If anyone from Nvidia is reading this, please do everything you can to convince the bosses to allocate the resources required to support a linux machine properly. It takes much more than it seems.
These devices are designed to be in production for a long time, so heavy investment in software support now is going to pay off for a long time. That beats dragging your feet, slowly getting it right over time, and devaluing the product in the process.
For what it's worth I've been pretty happy with Freescale. Not sure how they are now after NXP bought them, but the documentation of that hardware was the best I had seen. This translated into an easier time for me as a software guy, because it was easy to look things up.
Additionally it seemed like there's a decent ecosystem around it of board suppliers, which then translates into pretty good software support.
But again, not sure if that's still the case. I chose them for a project in 2015 and have not regretted it since - but have been able to use the exact same IC since then for other projects so I'm not sure if the experience would be different if starting from scratch today.
The dev box system they are selling looks like it might be the best non-x86 workstation in a while, where you really get $700 of computer for $700 as opposed to the bad old days with PA-RISC, Sun, etc.
Also I can see something close to that being a competitor to the XSX and PS5. NVIDIA made the Shield and that led to a design win with the Nintendo switch -- wouldn't it be nice to play Nintendo games in VR?
> Building Docker containers for ARM devices is a pain in the ass.
It's not so bad, you just need a beefy ARM machine to build the containers in CI. It would be silly to build a Docker container on the Jetson itself. You would never use an embedded device for compiles and builds, why would you build Docker containers on one?
You don't even need to do this: qemu support for executing arm64 binaries is fairly good, so you can just run the builds on your beefy x86 server anyways.
Anything with a ThunderX processor is maximum beefy. You can get on-demand servers like that from places like Packet. AWS also has their own A1 instances with lower core counts. These would all be good for cross compiling/builds.
A lot of instruction sets aren't there yet, but https://ish.app is doing a truly incredible job in this regard.
Gives you a working Alpine Linux installation which you can download and install packages for normally, all within the bounds of the normal Apple sandbox, with decent enough performance.
It doesn't have SSE or MMX yet, so eg Go and Node aren't usable at this point. But a shocking amount actually does work perfectly, so it's only a matter of time as more instruction sets are implemented.
I suggest you also check out UTM (Universal Turing Machine), an emulator for the iPad that you can build and that allows you to emulate many different architectures and run OSes written for them (including ARM).
Actually wouldn't a userspace ARM "emulator" be faster than an x86 one on an ARM device? Or are you so far removed from running actual CPU instructions that it doesn't matter?
I get it, but I also simply don’t want to deal with it at an embedded level. Which is why I vote with my wallet and choose x64 hardware with Quadro GPUs to ship for production.
More expensive, more power consuming? Sure. More sanity? Massively better dev experience? Massively better production/ops experience? Absolutely.
These seem like very weird complaints to me. What are you trying to do with the Jetson? Build some kind of AI-based web service or something?
They're made as dev kits for people building "autonomous machines like delivery and logistics robots, factory systems, and large industrial UAVs". Deploying with Docker and running devkits in production isn't what I'd call normal in such applications. Usually you need to deal with that "giant pain" of properly integrating with your hardware. Flashing would usually happen in the factory as part of the process, either by flashing the flash before soldering/inserting it or through some exposed contacts on the board.
> Building Docker containers for ARM devices is a pain.
I do this every day from a Linux x64 host using qemu-user. What problems are you hitting ?
> Second biggest complaint is deploying Jetsons in production environments.
>
> Third biggest complaint is having to flash Jetsons manually.
I don't think these two use cases are the goal of the Jetsons. That feels more like the goal of the EGX devices, which can be programmed, updated, etc. fully remotely.
I use a CI system like Buildkite to build on my Raspberry Pis and then consequently deploy them via Docker Swarm. Working well so far.
Although in a similar HN thread a few days ago, people pointed out that docker buildkit does do cross-architecture compilation, which has been slow, but works on anything!
If your dependencies don’t have ARM builds, you need to build them from source. Which is fine, unless you get unlucky and they don’t build. Which happens way more often than I’d like.
Agreed. They need to figure out how to put an x86 chip in there to make them truly useful. Also 700 bucks is a bit steep for a raspberry pi with a low powered gpu, just saying...
Just built a gesture controlled robot[1] with the Xavier board.
We were able to run OpenPose[2] at 27FPS, which we found was even faster than running it on K80 on AWS p2.xlarge. It was a pain to install caffe and all the dependencies on an ARM processor, but it worked out eventually.
We were able to train and run Tensorflow 2 models quickly also. Felt like using an actual GPU at a fraction of the cost.
The original posted title was more helpful. The Jetson AGX Xavier has been out for a couple of years, but it dropped in price from $999 to $699 and has double the ram now at 32GB.
In terms of using it as an ARMv8 desktop workstation (with decent CPU performance, real SATA / Ethernet / PCI-e connectors) - some other contenders include the MACCHIATObin (quad A72) and Honeycomb LX2K (16-core A72, 750USD) from Solid-Run.
I'm definitely not an expert and this is probably a dumb question, but why smart edge things like a smart robot rather than a dumb edge with a smart central brain? The data is useful when aggregated centrally anyway, so why not put the brain there too?
Well, first some clarification - "edge" means "on robot" versus something in the cloud. And the reason you do this is latency and connectivity.
I am designing a four wheel drive robot using the NVIDIA AGX Xavier [1] that will follow trails on its own or follow the operator on trails. You don't want your robot to lose cellular coverage and become useless. Even if you had coverage, there would be significant data usage as Rover uses four 4k cameras, which is about 30 megapixels (actually they max out at 13mp each or 52mp total). Constantly streaming that to the cloud would be very expensive on a metered internet connection. Even on a direct line the machine would saturate many broadband connections. Of course you can selectively stream but this makes things more complicated.
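As a rough back-of-the-envelope check on the bandwidth point, here is a small sketch of the uncompressed data rate for that camera setup. The camera count and 13 MP resolution come from the paragraph above; the 30 fps frame rate and 1.5 bytes per pixel (roughly a 4:2:0 YUV format) are assumptions chosen only for illustration.

```python
# Rough raw-bandwidth estimate for streaming several cameras uncompressed.
# Camera count and resolution are from the comment; fps and bytes_per_pixel
# are illustrative assumptions, not measured values.

def raw_stream_mbps(cameras: int, megapixels: float, fps: float,
                    bytes_per_pixel: float) -> float:
    """Return the uncompressed data rate in megabits per second."""
    pixels_per_second = cameras * megapixels * 1_000_000 * fps
    bits_per_second = pixels_per_second * bytes_per_pixel * 8
    return bits_per_second / 1_000_000


rate = raw_stream_mbps(cameras=4, megapixels=13, fps=30, bytes_per_pixel=1.5)
print(f"{rate:,.0f} Mbit/s uncompressed")
```

Under those assumptions the raw feed is in the tens of gigabits per second, so even heavy compression leaves a stream that a metered cellular link would struggle with.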
Latency is an issue. Imagine a self driving car that required a cloud connection. It's approaching an intersection and someone on a bicycle falls over near its path. Better send that sensor data to the cloud fast to determine how to act!
On my Rover robot it streams the cameras directly in to the GPU memory where it can be processed using ML without ever being copied through the CPU. It's super low latency and allows for robots that respond rapidly to their environment. Imagine trying to make a ping-pong playing robot with a cloud connection.
I am also designing a farming robot. [2] We don't expect any internet connection on farms!
This is a bit off topic, but I'm constantly looking at ways to efficiently stream 4K cameras live to local displays as well as remote displays at the highest framerate and resolution possible. How feasible would it be on the xavier to stream 2 4k cameras and display them on at least 2 4k screens? Extra points if you could do that and simultaneously upload to a streaming service, such as twitch.
Best bet is to take the raw video and run it through gstreamer. You should be able to setup a pipeline which displays the raw video locally while sending a compressed stream to the network. I'd bet that Nvidia has gstreamer modules which make use of their compression HW, so it might be possible. To be honest though, that's a lot of data, so I don't know how well dual 4k would work. You can always scale it down in gstreamer before you send it to the compression module.
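A minimal sketch of that kind of pipeline, written as a Python helper that only builds the `gst-launch-1.0` description string. It uses stock GStreamer elements (`v4l2src`, `tee`, `queue`, `videoscale`, `x264enc`, `rtph264pay`, `udpsink`); on a Jetson you would presumably swap the software `x264enc` for whichever hardware encoder element NVIDIA's plugins provide. The device path, host, port, and bitrate are placeholders, not values from this thread.

```python
# Builds a gst-launch-1.0 style pipeline description: one camera split with
# `tee` into a local display branch and a downscaled, H.264-encoded network
# branch. Device, host, port, and bitrate below are placeholder assumptions.

def build_pipeline(device: str, host: str, port: int,
                   width: int = 1280, height: int = 720,
                   bitrate_kbps: int = 4000) -> str:
    source = f"v4l2src device={device} ! videoconvert ! tee name=t"
    # Branch 1: show the unscaled video on a local display.
    local = "t. ! queue ! autovideosink"
    # Branch 2: scale down, encode, and send over RTP/UDP.
    remote = (
        f"t. ! queue ! videoscale ! video/x-raw,width={width},height={height} "
        f"! x264enc tune=zerolatency bitrate={bitrate_kbps} "
        f"! rtph264pay ! udpsink host={host} port={port}"
    )
    return " ".join([source, local, remote])


pipeline = build_pipeline("/dev/video0", "192.0.2.10", 5000)
print("gst-launch-1.0 " + pipeline)
```

Each branch gets its own `queue` so a slow encoder does not stall the local display. Whether dual 4K keeps up is a separate question this sketch does not answer.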
You'll probably want to use the CSI-2 interfaces to connect the cameras, but that depends. CSI-2 was developed for cell phones and is hard to run over long distances. It's optimized for low-power and designed for very short interconnects. We had a ton of problems using it at the last company I worked for. I really wish there was a competing standard for embedded cameras.
Not on a Xavier but on a TX2 I was able to get a single reasonable 4K stream with NVENC and gstreamer. Quality wasn't super awesome since it's realtime encoding. We ended up using two TX2s (one for each stream). I'd probably go with two Xaviers for better quality or step down and do 2 higher quality 1080p streams.
You can certainly do this using machine vision cameras. Either USB3, Gig Ethernet, or CSI interface (16cm max run length I believe). I forget how best to attach two displays to the Xavier but that seems doable.
I got my cameras from e-consystems and they’ve got some USB3 cameras that could do it. At least I’m pretty sure. My USB3 cameras just showed up and I haven’t tried them yet.
Interesting. I know e-consystems provided their own kernel modules for the Jetson which help with some of the camera streaming, and they provide a special binary of I think guvcview to view live camera output. That is for Xavier though so I don't know if they have something for just a regular computer.
The other thing to look at is using hardware transcoding in an NVIDIA GPU, a raw 4k feed from a 4k camera is huge, but transcoded to h.264 or better h.265 the footage is much more playable from disk in my experience. It may help with live footage. Here are some notes I made when setting up GPU transcoding.
What is your monitor resolution? If you can't display the full resolution, I've found that using the nvidia encoder hardware to resize each 4k stream to 720p makes transcoding much faster. I've added my video conversion scripts to github so you can see how I've done that.
https://github.com/tlalexander/rover_video_scripts
You can also contact e-consystems, as they seem eager to provide application support. Finally feel free to email me to the email in my profile, or better yet create an account and ask the question on my website http://reboot.love so other people can see our conversation and benefit from what we learn.
EDIT: I JUST saw that you meant one local display and one remote display. Sorry busy day. In that case the other poster mentioning gstreamer is spot on, and I believe econsystems has some gstreamer plug ins at least for some cameras, or maybe nvidia does...
I think the magic camera interconnect is CSI/CSI2 and it's not really flexible enough. You either have really short copper interconnects, or unavailable fiber interconnects.
What would be cool is if csi to ethernet were a thing. either low latency put-it-on-the-wire or compressed. I don't know, maybe it is. But make it a standard like rca jacks.
But 'edge,' as used in context of AI, is also a wink-and-a-nod that the device is inference-only (no learning, no training). The term "inference only" doesn't sound very marketing-friendly.
AGX Xavier can do training on device just fine - and run every CUDA workload. It's just not the fastest device at that, you'd prefer a desktop GPU if you can for such a purpose.
I assume what fizixer means is, if you're making an Amazon-Alexa-type-thing, training 1 model on 1 million user's data will work better than 1 million models trained on 1 user's data each.
AFAIK the "Roomba learns the layout of your house" type of edge learning is generally done with SLAM rather than neural networks. There might be other applications for edge learning, of course.
Yes, but the GPU is needed for training so "we" do this on prem or in the cloud. But the actual prediction computations are fast. Do we really need an edge GPU for these?
Yes. An edge GPU is required to run inference on high resolution sensor data for many robotics applications. Running inference is faster than training but it’s not all “fast”. Inference on CPU can take many seconds with large image operations. Not useful for certain robotics applications.
Not a dumb question at all: data traffic is expensive.
If you have thousands of remote sensors collecting GBs and GBs of real time data, for ~$1000 you can add a "streaming" supercomputer to your sensor to analyze the data in place and save on network and storage costs.
Notice however that the announcement is for an Nvidia AGX product, which is for autonomous machines. The Nvidia "edge" products for processing data on the sensors are called Nvidia EGX.
For autonomous machines, you often need to analyze the data in the machine anyway, e.g., you don't want a drone falling if it loses network connectivity.
For things like industrial robots or UAVs, latency is the biggest problem.
I've worked on a fruit sorting machine where there was about 20ms to decide whether an object passed or not, plus there were continuous streams of 10000s of objects per second to classify. The computer vision/classifier had to be both fast and reliable about spitting out the answers, which was actually more important than the precision of the classifier itself.
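To put numbers on that, here is a small sketch of the timing budget. The 20 ms deadline is from the comment; 10,000 objects per second is taken as the low end of "10000s", and the inference time and network round trip are hypothetical values picked only to show the comparison.

```python
# Timing budget for a sorting line. deadline_ms comes from the comment;
# objects_per_second, inference_ms, and network_rtt_ms are assumptions.

def sorting_budget(deadline_ms: float, objects_per_second: float):
    """Return (average microseconds available per object,
    number of objects inside the deadline window at any moment)."""
    per_object_us = 1_000_000 / objects_per_second
    in_flight = objects_per_second * deadline_ms / 1000
    return per_object_us, in_flight


def fits_deadline(deadline_ms: float, inference_ms: float,
                  network_rtt_ms: float = 0.0) -> bool:
    """True if inference plus any network round trip meets the deadline."""
    return inference_ms + network_rtt_ms <= deadline_ms


per_object_us, in_flight = sorting_budget(deadline_ms=20, objects_per_second=10_000)
print(f"{per_object_us:.0f} us per object on average, "
      f"~{in_flight:.0f} objects inside the deadline window")

# Hypothetical figures: 5 ms inference on device, 40 ms round trip to a server.
print("on device:", fits_deadline(20, inference_ms=5))
print("via cloud:", fits_deadline(20, inference_ms=5, network_rtt_ms=40))
```

With those assumed figures the on-device path fits the window and the remote path does not, before even counting jitter.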
But we've also developed distributed abilities with quite boggling amounts of specialization.
If you chucked one random human on a desert island, they'd probably die. Chuck a dozen, they have a better chance of survival. Chuck a thousand, you might have a civilization.
Conversely if you say chucked 2 or 100 rabbits on an island - end result is probably going to be an island full of rabbits.
Cloud GPU/TPU resources are still somewhat expensive. Also bandwidth can be an issue when you would first need to feed video through potentially metered connections. Last but not least latency can be an issue for e.g robotics and automotive.
The network is very, very unreliable at the edge. Better to have each piece work independently and store up processed results to transmit eventually, opportunistically. If that processing involves real time video processing there's no way you're going to get that done over a reliably unreliable connection.
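A minimal sketch of that store-and-forward idea, assuming a caller-supplied `send` function that raises `ConnectionError` when the link is down. The class name, the bounded buffer size, and the drop-oldest policy are illustrative choices, not something described in this thread.

```python
from collections import deque


class StoreAndForward:
    """Buffer processed results locally and upload them when the link allows."""

    def __init__(self, send, max_items=1000):
        self._send = send  # hypothetical uploader; raises ConnectionError on failure
        self._pending = deque(maxlen=max_items)  # oldest entries drop when full

    def record(self, result):
        self._pending.append(result)

    def flush(self):
        """Send pending results in order; stop at the first failure.
        Returns how many results were delivered."""
        sent = 0
        while self._pending:
            try:
                self._send(self._pending[0])
            except ConnectionError:
                break
            self._pending.popleft()  # remove only after a successful send
            sent += 1
        return sent

    def __len__(self):
        return len(self._pending)


# Simulated flaky link for demonstration.
link = {"up": False}
delivered = []

def send(item):
    if not link["up"]:
        raise ConnectionError("link down")
    delivered.append(item)

buf = StoreAndForward(send, max_items=3)
for i in range(5):
    buf.record({"frame": i})

sent_while_down = buf.flush()   # nothing goes out, nothing is lost
link["up"] = True
sent_when_up = buf.flush()      # the three most recent results are delivered
print(sent_while_down, sent_when_up, [d["frame"] for d in delivered])
```

Only small processed results are queued here, not video. The raw frames would be handled on the device, which is the point of the comment above.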
I mean it could be way more powerful, like Stadia on a phone versus using the phone's GPU, and latency is not too high for the described usage. IMO only automotive may require a dedicated brain on the edge. Am I totally wrong?
I work at a startup that uses edge AI. There are many reasons edge is preferred over cloud. Security is one. Latency is important in many cases. If the internet connection is another dependency for a critical system, it can be a big headache. Once you start working on a real-world project you run into these issues. In return you give up the monitoring of data and model that you get with a cloud deployment.
They seem to offer a cheaper 8 GB model too but unfortunately I see no price for it. I'm curious how much it'll be because, as much as I'd like to toy around with this, the $699 is a little too much for just experimentation.
EDIT:
The 8GB Module seems to be $679 here[1]. This makes the $699 for the 32 GB Developer Kit seem like a steal. Still, too expensive for play, I guess I'll stick with my Jetson Nanos for a while...
I thought I saw the 8GB module for $459, but it is only the SOM, so you have to purchase a carrier board as well. The 32GB Developer Kit seems the way to go.
Any "well-tuned" multi-threaded application that needs simultaneous access to managed memory from the CPU and GPUs requires unified memory on a GPU with compute capability 6.0 or higher; it is not possible below that. This is because pre-Pascal GPUs lack hardware page faulting, so coherence can’t be guaranteed. On these GPUs, an access from the CPU while a kernel is running will cause a segmentation fault.
I develop on one of these every single day. I just use Visual Studio Code with a custom launch.json and tasks.json that let me ssh into the Jetson, copy over the code compiled for Linux, and attach a remote debugger. I like never even touch the Jetson, and all my builds and shit are pushed over... it's actually an easy as shit development experience at this point. You have to kinda know what you're doing, but it's all totally usable. The dev kit I have has been running 415 days without any issues, running 5 custom programs I've written.
Yes, if your model is small enough or if you are fine-tuning a small number of layers. TensorFlow 1.15 and 2.0 are available on Xavier. I understand that PyTorch can be built as well.
Note that the number of CUDA cores and the amount of memory available are smaller compared to discrete Volta GPUs.
You say it can do training for small models because of the presence of the small (512-core) GPU? (plus maybe some left-over, control calculations by the CPU)
You still need tensor cores for inference. But they don't do weight updates. Learning/training is all about updating the weights (through backpropagation or whatever).
So another way to put it: its tensor cores do feed-forward calculations, but no backpropagation, and no weight updates.
I wonder how to train for those. The biggest one has 32GB of RAM and needs a frozen inference graph converted to TensorRT. So one would need a GPU with 32GB of RAM in addition to this to be able to train the network. AFAIK, Nvidia doesn't sell anything with that much RAM.
No, those face more threat from ARM M4/7 cores or RISC-V units with vector/SIMD coprocessors. DSP cores are being lifted out of discrete chips and placed as IP blocks into more integrated solutions, with the really complex algo stuff placed in general purpose hardware like FPGAs.
The economics don't really make sense for TI/ADI DSPs imo. If you had an application where you needed a chip just to do DSP you'd probably use an ARM core instead - but the applications engineers at TI/ADI will gladly help you find a product in their catalog that has more features integrated into it (like ADC/DAC, even analog front ends for audio/RF, USB/Bluetooth stacks) for your product.
Basically there's no market to kill, from what I've seen.
1. This isn't inference only, it has the full capabilities of a normal GPU, just small and low power (and therefore much slower than normal GPUs).
2. TPUv1 is a matrix multiply ASIC that requires a host CPU to do anything. This thing is a SoC that includes both a CPU and a GPU. The CPU is pretty fast for what it is - much faster than e.g. raspberry pi, see https://www.phoronix.com/scan.php?page=article&item=nvidia-j....
3. not sure how you know whether this is more expensive than a TPUv1, since the TPUv1 was never sold or available outside of google.
A much better comparison would be between this and the Edge TPU development board.
[1] https://reboot.love/t/new-cameras-on-rover/
[2] https://github.com/facebookresearch/detectron2/issues/192
[3] https://github.com/PRBonn/bonnetal
[4] https://developer.nvidia.com/deepstream-sdk