As the world continues to advance in the field of AI, Taalas has not only raised $169m in funding for the development of new hardware, but has also announced the release of its latest processor, the HC1, with incredible token generation speeds for Llama 3.1 8B. What challenges do GPU-based systems introduce, what did Taalas release, and why could such custom silicon processors be the real future of AI?
Why GPUs Are Failing Local LLM Inference
The past few years have seen large language models (LLMs) become the dominant technology in artificial intelligence, with the most famous examples including ChatGPT, Claude, and Google’s Gemini. These models generate human-like text based on vast amounts of training data and can be applied to an incredibly wide range of tasks, from natural language processing to creative work such as joke generation.
However, while these LLMs are undeniably impressive, they are not without their flaws, one of the biggest being their size and computational complexity. To create a powerful LLM, it must be trained on vast amounts of data, typically using a deep learning technique called generative pre-training, in which the model repeatedly learns to predict the next token in a sequence of text.
This training process is computationally intensive, requiring enormous amounts of processing power and data storage. As a result, the largest models are generally too resource-intensive to run locally on personal computers, instead relying on cloud computing infrastructure provided by companies such as OpenAI and Microsoft.
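To make this concrete, the sketch below shows the essence of generative pre-training, next-token prediction, in PyTorch. All dimensions are toy values chosen purely for illustration; real LLMs train billions of parameters over enormous text corpora.

```python
# A minimal sketch of next-token-prediction pre-training (toy sizes).
import torch
import torch.nn as nn
import torch.nn.functional as F

vocab_size, d_model, seq_len = 1000, 64, 32   # toy values, not real LLM sizes

# A stand-in "language model": real LLMs use stacked transformer blocks.
model = nn.Sequential(
    nn.Embedding(vocab_size, d_model),
    nn.Linear(d_model, d_model),
    nn.ReLU(),
    nn.Linear(d_model, vocab_size),           # logits over the vocabulary
)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)

tokens = torch.randint(0, vocab_size, (8, seq_len))  # a fake token batch

# One training step: from the tokens up to position t, predict token t+1.
optimizer.zero_grad()
logits = model(tokens[:, :-1])
loss = F.cross_entropy(logits.reshape(-1, vocab_size),
                       tokens[:, 1:].reshape(-1))
loss.backward()
optimizer.step()
```

Repeated over trillions of tokens, this simple objective is what makes pre-training so computationally expensive.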
When it comes to running LLMs, the most common hardware choice is the GPU, which has proven to be a game-changer. The reason comes down to the fact that LLM workloads are highly parallel, with most of the computation reducing to large matrix multiplications whose elements can be processed simultaneously. GPUs, which are designed precisely for such parallel computing tasks, are thus well suited to this type of work.
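The sketch below illustrates why: a single feed-forward step of a Llama-style layer reduces to large matrix multiplications, in which every output element is an independent dot product that the GPU can compute in parallel. The shapes roughly match Llama 3.1 8B, though the real model uses gated (SwiGLU) feed-forward blocks rather than this simplified ReLU version.

```python
# The core of LLM inference: large matrix multiplies that parallelise well.
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"

d_model, d_ff = 4096, 14336   # hidden/feed-forward widths of Llama 3.1 8B
x = torch.randn(1, d_model, device=device)         # one token's activations
w_up = torch.randn(d_model, d_ff, device=device)   # ~58.7M weights
w_down = torch.randn(d_ff, d_model, device=device)

# Each output element below is an independent dot product, so the GPU
# can spread the multiply-accumulates across thousands of cores at once.
h = torch.relu(x @ w_up)   # (1, d_ff)
y = h @ w_down             # (1, d_model)
```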
However, while GPUs are excellent at running LLMs, there are still some major issues that need to be addressed. One of the biggest challenges by far is the need for large amounts of video memory (VRAM), with models like OpenAI's GPT-3 requiring over 100GB to run. For individuals and smaller businesses, this can be a major issue, as access to such computing resources is often limited.
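A quick back-of-the-envelope calculation shows where such figures come from: simply holding a model's weights in memory takes the parameter count multiplied by the bytes per parameter, before activations and key-value caches are even counted.

```python
# Rough VRAM needed just to hold model weights (ignores activations,
# KV cache, and framework overhead).
def weight_memory_gb(n_params: float, bytes_per_param: int) -> float:
    return n_params * bytes_per_param / 1e9

print(weight_memory_gb(175e9, 2))  # GPT-3 (175B params) at FP16: 350.0 GB
print(weight_memory_gb(8e9, 2))    # Llama 3.1 8B at FP16:        16.0 GB
```

This is why only the smaller open models fit on consumer GPUs, while frontier-scale models must be spread across many data-centre accelerators.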
Another challenge is that GPUs are not always optimised for running large language models, leaving inefficiencies that manifest as large power requirements. Furthermore, GPUs can be highly expensive to purchase and difficult to maintain correctly, which becomes yet another barrier for smaller businesses or individuals looking to integrate LLMs into their workflows.
Taalas Announces HC1 LLM Processor
Recently, Taalas, a startup developing AI chips specifically designed for running large language models, announced that it had raised $169 million in a Series B funding round led by venture capital firm Khosla Ventures. With the new investment, the company has raised a total of around $219 million since emerging from stealth mode in March 2024.
The funds raised by Taalas will be used to further develop its line of AI chips, which are designed to address the challenges faced by current AI hardware. The company's first product, the HC1 Technology Demonstrator, is claimed to offer significantly higher token generation efficiency than its competitors, while consuming approximately one tenth the power.
According to Taalas, its HC1 chip is the first to provide significantly improved tokens-per-second-per-user performance compared to AI accelerators offered by NVIDIA, Groq, SambaNova Systems, and Cerebras Systems. The company also stated that its architecture is purpose-built for the Llama 3.1 8B large language model, which is freely available for download.
Taalas' approach to creating its chips is unique in that, instead of designing generic AI accelerators, it builds chips specifically tailored to a particular AI model and workload. The company claims that this custom silicon design provides better compute density and more efficient use of silicon area than conventional approaches, resulting in improved performance, lower power consumption, and reduced costs.
The Taalas HC1 chip is manufactured by Taiwan Semiconductor Manufacturing Company (TSMC) on a 6-nanometer process. Rather than building a general-purpose AI accelerator, Taalas hard-codes a specific AI model and its weights directly into the chip using a custom mask ROM and SRAM-based recall fabric architecture, massively increasing the memory density available for storing parameters while significantly reducing energy consumption and cost.
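The sketch below is a conceptual contrast only, not Taalas's actual design: it simulates the difference between a GPU-style layer that must fetch its weights from off-chip memory on every use and a hard-wired layer whose weights are immutable constants fixed at manufacturing time.

```python
import numpy as np

D = 4  # toy dimension, for illustration only

class OffChipDRAM:
    """Stand-in for GPU memory: every weight access is an off-chip transfer."""
    def __init__(self):
        self.store = {"layer0": np.random.randn(D, D)}
        self.bytes_transferred = 0
    def load(self, key):
        w = self.store[key]
        self.bytes_transferred += w.nbytes   # count costly off-chip traffic
        return w

dram = OffChipDRAM()

def gpu_style_layer(x):
    return x @ dram.load("layer0")           # weights fetched on every call

# Hard-wired style: weights are constants "baked in" at manufacture,
# emulated here as a read-only array standing in for on-die mask ROM.
LAYER0_ROM = np.random.randn(D, D)
LAYER0_ROM.setflags(write=False)

def hardwired_layer(x):
    return x @ LAYER0_ROM                    # no external weight traffic

x = np.random.randn(D)
gpu_style_layer(x); hardwired_layer(x)
print(dram.bytes_transferred)  # only the GPU-style path moved weights off-chip
```

Eliminating that per-token weight traffic is precisely where the claimed power and density gains come from.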
Why the Future of LLMs is Dedicated Processors
What the Taalas HC1 processor demonstrates is that the future of LLMs lies not in off-the-shelf GPUs, but in processors designed specifically for the task. While GPUs are certainly capable of running LLMs, they cannot match a dedicated processor that is tailored to the exact requirements of a given model.
This comes down to the fact that such processors can be carefully crafted right down to the individual transistors: parameter layers can be placed in ROM instead of RAM, circuit pathways can be optimised, and energy consumption minimised. This design method therefore allows custom processors to run LLMs far more efficiently than a GPU can.
Furthermore, considering that pre-trained models such as Llama are usable without further training (fine-tuning being optional), their fixed weights are a natural fit for this kind of hard-wired silicon, making locally run LLMs a far more practical goal, with a local model serving as the base service to which all local AI requests are routed.
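As a rough illustration of that routing idea, the sketch below sends a prompt to a hypothetical local, OpenAI-compatible inference endpoint of the kind exposed by tools such as llama.cpp or Ollama; the URL and model name are placeholders, not a real Taalas API.

```python
# Hypothetical local-first AI request: the prompt never leaves the machine.
import requests

LOCAL_URL = "http://localhost:8080/v1/chat/completions"  # placeholder endpoint

def ask_local_llm(prompt: str) -> str:
    resp = requests.post(LOCAL_URL, json={
        "model": "llama-3.1-8b",   # placeholder model name
        "messages": [{"role": "user", "content": prompt}],
    }, timeout=60)
    resp.raise_for_status()
    return resp.json()["choices"][0]["message"]["content"]

print(ask_local_llm("Summarise today's meeting notes."))
```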
Not only does this help to protect user privacy, but it also eliminates the need for subscriptions and third-party services, allowing users to keep control of their data and their AI experiences.
But, above all else, the ability to create custom silicon devices that are far more efficient than GPUs could make these LLM processors far more cost-effective. Considering how difficult GPUs are becoming to obtain, along with the high energy demands of running LLMs on them, it may not be long before custom silicon becomes the preferred way to run LLMs.