How does AI get deployed on the web?

ONNX - accelerating your machine

This week I explored some practical technology - how to deploy ML/AI applications on the web. Along the way, I learnt some of the history, problems, and solutions of practical ML, and picked up some intuitions for where it might be heading.

World Wide Web - Act 1

For someone from the 1930-50s, the internet would be Artificial Intelligence. Alan Turing (considered the father of CS/AI) came up with the idea of a Turing machine as a device that can read, write and process information. While each computer is a Turing machine, the internet itself can be considered a giant Turing machine: reading information that people input, processing it, and writing it into datacenters. In the 80s, the early internet already satisfied this criterion, since almost any computation counts as ‘processing’. As computer processing has become more and more complicated, our bar for calling something ‘processing’ as opposed to just copying has gone up. Act 2 of the internet began in the late 1990s, when Google and others started using scalable algorithms instead of directories to process the information on the internet.

Act 2 - To centralize or distribute

Computing exists in a permanent see-saw between distributed client-side and centralized server-side compute, i.e. whether processing happens at the point of production/consumption or at some intermediate point of storage and transmission. More complicated algorithms require bigger computing resources, so they pull compute towards centralized servers. More personalized and differentiated algorithms require processing at the source/destination, so they pull compute towards the distributed edge. This dynamic predates the internet. In the 60s-70s, organizations had giant IBM mainframes that did all the compute in the basement, with multiple terminals hardwired to them for people to feed in instructions. In the 80s/90s, individual personal computers such as the Apple II came with their own memory and compute. The internet, and later mobile phones, reversed the trend yet again as compute moved to centralized cloud servers and datacenters. There is no correct answer here; the pendulum swing is dictated by new technological and cultural trends.

At some point (around 2012-2015) we reached an inflection point where the benefit of the higher accuracy of deep learning AI models started beating out the cost of having an opaque model that we don’t quite understand (compared to rule-based or statistical models). Internet companies started including larger and larger neural networks as part of their processing. However, I would still consider this part of Act 2. The reason is that the point of creation and consumption of content has remained largely the same over the last 25 years. Google Search began the move towards centralizing the majority of the compute required to consume content, mobile phones accelerated it, and deep learning has only accelerated it further.

As new developments in AI hit, the future could swing in either direction. On the one hand, scale clearly makes magic, and giant models trained on gigascale data housed in datacenters will likely be a huge part of the future internet. On the other hand, new algorithms that make efficient use of small amounts of data to deliver hyper-personalized content require inference to happen on device. It will be interesting to watch how culture and technology evolve from here. The leading indicator of this evolution will be the technology that makes one or the other direction more feasible. As always, hardware and software need to evolve together to make a product. While AI hardware is its own fascinating story that everyone is talking about (GPU/CPU, Intel/NVIDIA), the underappreciated software trend is the subject of this post.

Counter-Intuition - Open source can lead to centralization

The lingua franca of machine learning is Python. Python has a glorious reputation for having a super active open source community and being the most user-friendly language for beginners. It is also considered among the slowest of all the major programming languages. In contrast, the core technology/programming language of the internet is JavaScript. It works alongside HTML and CSS to control webpage behavior, and 98.7% of all websites use JavaScript on the client side. JavaScript is lightweight and fast, and it executes client side in the browser, which makes it possible to run animations and complex webpage behavior even over spotty internet connections.

Until 2015, deploying ML models on the web was very difficult. Python models trained using frameworks like TensorFlow (TF) and PyTorch need their full library imports and environments to train and run. Therefore they can’t run in just any browser on any machine; they need a specific machine and setup. As a result, the standard way to deploy Python models on the web became a Docker container hosted on a server that does all the compute, with a JavaScript frontend sending data to and receiving results from that server. This design pattern almost singlehandedly explains why almost 100% of the last 10 years of ML progress has been deployed on centralized servers - ironically, due to how open source Python has evolved.
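To make the pattern concrete, here is a minimal sketch of the JavaScript half, in TypeScript. The endpoint URL and payload shape are hypothetical; the point is that the browser only ships data back and forth, while the Python model does all the work inside its container.

```ts
// Client half of the server-side pattern: the browser sends raw input to a
// Python model running in a Docker container and gets predictions back.
// The /predict endpoint and the {input, scores} payload are hypothetical.
async function classify(text: string): Promise<number[]> {
  const response = await fetch("https://example.com/predict", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ input: text }),
  });
  if (!response.ok) throw new Error(`Server error: ${response.status}`);
  const { scores } = await response.json();
  return scores; // all of the actual ML compute happened server side
}
```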

Years ago, it seemed possible that Python itself would evolve - through projects like PyScript - so that one day Python packages/algorithms wouldn’t require carefully managed environments. But I think that is unlikely now, since AI is both thoroughly dependent on Python and moving so much faster that an alternative solution is already racing ahead: deploying precompiled, serialized Python-built models on C++/Java runtimes.

The JavaScript war - ONNX vs TFJs

In 2017, a small PyTorch project named Toffee was relaunched as the Open Neural Network Exchange (ONNX). ONNX is a framework that enables a common format that any neural network model (TF, PyTorch, etc.) can be converted into. In simple terms, the ONNX format simplifies Python models by removing all environment params and compressing the model into a graph of mathematical operations that can run on its own runtime. The ONNX Runtime is a software layer that runs on top of anything that runs C++/Java. The runtime optimizes and implements the graph using the hardware of the device (CPU/GPU etc). Thus the ONNX format+Runtime address two critical issues (a minimal browser sketch follows the list) -

a) A model converted to ONNX format needs only the ONNX Runtime to run - no details about the model’s original training environment.

b) ONNX models are smaller and, on most hardware, run faster than most other format+runtime combinations.
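Here is what that looks like in the browser, as a minimal sketch using the onnxruntime-web package. The model file name and the 1x3x224x224 input shape are placeholders; substitute whatever your exported model actually expects.

```ts
import * as ort from "onnxruntime-web";

// Load a serialized ONNX graph and run it entirely in the browser.
// No Python, no training framework - just the runtime and the model file.
async function runModel(): Promise<void> {
  // "model.onnx" is a placeholder path to an exported model.
  const session = await ort.InferenceSession.create("model.onnx");

  // Dummy image-sized input; real code would fill this with pixel data.
  const input = new ort.Tensor(
    "float32",
    new Float32Array(1 * 3 * 224 * 224),
    [1, 3, 224, 224]
  );

  // Input/output names are read from the graph itself.
  const feeds = { [session.inputNames[0]]: input };
  const results = await session.run(feeds);
  console.log(results[session.outputNames[0]].data);
}
```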

In 2018, Google released TensorFlow.js (TFJs), which implements a version of TF’s Python API in JavaScript. TFJs similarly creates a compressed output model that can run in the browser, but it only works with TF. TFJs was a huge success; by 2019 it was the dominant way to run models in the browser. The library was fast, contained many pretrained models and literally needed two lines of import statements in your app to integrate (see the sketch below). At the time ONNX was not even on the map, since it was poorly developed, glitchy and not compatible with the latest web frameworks.
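Those “two lines of import” were not much of an exaggeration. A sketch of the workflow, assuming the off-the-shelf MobileNet model from the @tensorflow-models collection:

```ts
// The two imports that made TFJs so popular: the core library plus a
// pretrained model (MobileNet here), running entirely in the browser.
import * as tf from "@tensorflow/tfjs";
import * as mobilenet from "@tensorflow-models/mobilenet";

async function classifyImage(img: HTMLImageElement): Promise<void> {
  await tf.ready();                              // pick a backend (WebGL etc.)
  const model = await mobilenet.load();          // download pretrained weights
  const predictions = await model.classify(img); // inference on-device
  console.log(predictions);                      // [{ className, probability }, ...]
}
```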

However, the winds in research were already blowing against TF and towards PyTorch. TensorFlow 1 had undergone a patchwork of upgrades, making the codebase a mess of conflicting sub-libraries. In 2019, after a year in development, TensorFlow 2.0 was released. However, TF2 was too different to migrate to for those who worked with TF1, and not different enough for those who had heard bad things about TF1. In the end, the deprecation of TF1 accelerated the growing adoption of PyTorch.

PyTorch went from 5% to 75% of all research papers in just 3 years

PyTorch dominates open source models on Hugging Face

With all this investment flowing into PyTorch, Microsoft wisely restarted development of ONNX in 2020. Since then it has continued to iterate and improve to the point where, today, ONNX is faster than TFJs and has a much larger library of pretrained models. Given that most cutting-edge research is being done in PyTorch, ONNX has quietly become the new standard for packaging and deploying AI applications on the web.

Intuition: Act 3 - AI everywhere

Existing successful AI apps - from behemoths like ChatGPT and Google Photos all the way down to indie apps like AvatarAI and Cursor.so - are all powered by server-side AI models.

My intuition is that we will see at least a partial move towards client-side, on-device execution in the near future. Customization and personalization demand this, both for the quality of the content and the privacy of the consumer. Certain AI applications, such as personal assistants, may want to live completely on the local device. If large model capabilities continue to scale, we may also see media where the bulk of content creation happens in the cloud but some personalization happens on device - an ideal model for video games, AR/VR and social communities. It may still make sense for some AI apps, such as oracles and content creation/editing services, to stay centralized on servers.

Even if the push to do local AI training/inference on device only manifests as a small part of the total compute, its positioning as the gateway to the customer makes it super important. ONNX is currently the most important candidate technology sitting at this crucial interface. There is still a long way to go for client-side inference, but hardware, software and algorithmic improvements are making it possible to run larger and larger models on the client side and in the browser. If this technology tops out and can’t improve beyond a point, we might stay in the realm of large centralized models distributing content. For now we are still accelerating: @ggerganov made CodeLlama 34B run super fast on a Mac, and @willdepue made a GPT model run in the browser! If these technologies continue to improve, ONNX and its relatives will usher in a new paradigm of personal AIs, powerful home devices and distributed creation.

______________________________________________________________________

There you have it: my intuitions about how critical technologies like ONNX will decide the future of AI applications and the internet. For more such intuitions on AI/ML, subscribe and follow me on Twitter. You can also check out my other projects on nirsd.com.