I’ve spent the last several months playing with AI tools, more specifically large language (and other adjacent data) models and the underlying corpus of data that formed them, trying to see if there are some ways that AI can help an academic like me. More particularly, I’m curious to know if AI can help neurodivergent scholars in large bureaucratic Universities make their path a bit easier. The answer is a qualified “yes”. In this article, I’ll cover some of the possible use cases, comment on the maturity, accessiblity and availability of the tech involved and explain some of the technological landscape you’ll need to know if you want to make the most of this tech and not embarrass yourself. I’ll begin with the caveats…
First I really need to emphasise that AI will not fix the problems that our organisations have with accessiblity – digital or otherwise – for disabled staff. We must confront the ways that our cultures and processes are founded on ableist and homogenous patterns of working, dismantle unnecessary hierarchies, and reduce gratuitous beurocracy. Implementing AI tools on top of these scenarios unchanged will very likely intensify vulnerability and oppression of particular staff and students and we have a LOT of work to do in the modern neoliberal University before we’re there. My worst case scenario would be for HR departments to get a site license to otter.AI and fire their disability support teams. This is actually a pretty likely outcome in practice given past patterns (such as the many University executives which used the pandemic as cover to implement redundancies and strip back resource devoted to staff mental health support). So let’s do the work please? In the meantime, individual staff will need to make their way as best as they can, and I’m hoping that this article will be of some use to those folx.
The second point I need to emphasise at the outset is that AI need not be provided through SAAS or other-subscription led outsourcing.Part of my experimentation has been about tinkering with open source and locally hosted models, to see about whether these are a viable alternative to overpriced subscription models. I’m happy to say that “yes”! these tools are relatively easy to host on your own PC, provided it has a bit of horsepower. Even more, there’s no reason that Universities can’t host LLM services on a local basis at very low cost per end user, vastly below what many services are charging like otter.AI’s $6/mo fee per user. All you need is basically just a bank of GPUs, a server, and electricity required to run them.
What Are the Major Open Source Models?
There are a number of foundational AI models. These are the “Big Ones” created at significant cost running over billions of data points, by large tech firms like OpenAI, Microsoft, Google, Meta etc. It’s worth emphasising that cost and effort are not exclusively bourne by these tech firms. All of these models are generated on the back of freely available intellectual deposit of decades of scholarly research into AI and NLP. I know of none which do not make copious use of open source software “under the hood.” They’re all trained on data which the general public has deposited and curated through free labour into platforms like wikipedia, stackexchange, youtube, etc., and models are developed in public-private partnerships with a range of University academics whose salaries are often publicly funded. So I think there is a strong basis for ethically oriented AI firms to “share alike” and make their models freely available, and end users should demand this. Happily, there have been some firms which recognise this. OpenAI has made their GPT1 and GPT2 models available for download, though GPT3 and 4 remain locked behind a subscription fee. Many Universities are purchasing GPT subscriptions implicitly as this provides the backbone for a vast number of services including Microsoft’s CoPilot chatbot, which have under deployment to University staff this last year as a part of Microsoft’s ongoing project to extract wealth from the education sector in the context of subscription fees for software (Microsoft Teams anyone?). But it doesn’t have to be this way – there are equally performant foundational models which have been made freely available to users who are willing to hack a bit and get them working. These include:
- LLaMA (Language Learning through Multimodal Adaptation), a foundation model developed by Meta
- Mistral (a foundation model designed for mathematical reasoning and problem-solving), which has been the basis for many other models such as NeuralChat by Intel.
- Google’s Gemini and BERT models
- BLOOM, developed by a consortium called BigScience (led by huggingface primarily)
- Falcon, which has been funded by the Abu Dhabi sovereign wealth fund under the auspices of Technology Innovation Institute (TII)
- Pythia by EleutherAI
- Grok 1 developed by X.ai
These are the “biggies” but there are many more smaller models. You can train your own models on a £2k consumer PC, so long as it has a bit of horsepower and a strong GPU. But the above models would take, in some cases, years of CPU time for you to train on a consumer PC and have billions or even trillions (in the case of GPT4) parameters.
What Do I Need to Know About Models? What can I run on my own PC?
To get a much broader sense of how these models are made and what they are I’d recommend a very helpful and accessible write-up by Andreas Stöffelbauer. For now it’s worth focussing on the concept of “parameters” which reflects the complexity of the AI model.You’ll usually see this listed next to the model’s name, like Llama7B. And some models have been released with different parameter levels, 7B, 14B, 30B and so on. Given our interest in self-hosting, it’s worth noting that parameter levels are also often taken as a proxy for what kind of hardware is required to run the model. While it’s unlikely that any individual person is going to train a 30B model from scratch on their PC, it’s far more likely that you may be able to run the model after it has been produced by one of these large consortia that open source their models.
Consumer laptops with a strong GPU and 16GB of RAM can generally run most 7B parameter models and some 10G models. You’ll need 32GB of memory and a GPU with 16GB of VRAM to get access to 14B models, and running 30B or 70B models will require a LOT of horsepower, probably 24/40+ GB RAM which in some cases can only be achieved using a dual-GPU setup. If you want to run a 70B model on consumer hardware, you’ll need to dive the hardware discussion a bit as there are some issues that make things more complex in practice (like a dual-GPU setup), but to provide a ballpark, you can get second hand NVidia RTX 3090 GPU for £600-1000 and two of these will enable you to run 70B models relatively efficiently. Four will support 100B+ models which is veering close to GPT4 level work. Research is actively underway to find new ways to optimise models at 1B or 2B so that they can run with less memory and processing power, even on mobile phones. However, higher parameter levels can help with complex or long-winded tasks like analysing and summarising books, preventing LLM “hallucination” an effect where the model will invent fictional information as part of its response. I’ve found that 7B models used well can do an amazing range of tasks accurately and efficiently.
While we’re on the subject of self-hosting, it’s worth noting that when you attempt to access them models are also often compressed to make them more feasible to run on consumer hardware, using a form of compression called “quantization“. Quantization levels are represented with “Q” values, that is a Llama2 7B model might come in Q4, Q5 and Q8 flavours. As you’ll notice lower Q levels require less memory to run. But they’re also more likely to fail and hallucinate. As a general rule of thumb, I’d advise you stick with Q5 or Q6 as a minimum for models you run locally if you’re going to work with quantized models.
The units that large language models work with are called tokens. In the world of natural language processing, a token is the smallest unit that can be analyzed, often separated by punctuation or white space. In most cases tokens correspond to individual words. This helps to breaks down complex text into manageable units and enables things like part-of-speech tagging and named entity recognition. A general rule of thumb is that 130 tokens correspond to roughly 100 words. Models are trained to handle a maximum number of array elements, e.g. tokens in what is called the “context length“. Humans do this too – we work with sentences, paragraphs, pages of text, etc. We work with smaller units and build up from there. Context length limits have implications for memory use on the computers you use for an LLM, so it’s good not to go too high or the model will stop working. Llama 1 had a maximum context length of 2,024 tokens and Llama 2 stops at 4,096 tokens. Mistral 7B stops at 8k tokens. If we assume a page has 250 words, this means that Llama2 can only work with a chunk of data that is around 16 pages long. Some model makers have been pushing the boundaries of context length, as with GPT4-32K which aims to support a context length of 32K or about 128 pages of text. So if you want to have an LLM summarise a whole book, this might be pretty relevant.
There are only a few dozen foundational models available and probably only a few I’d bother with right now. Add in quantization and there’s a bit more to sift through. But the current end-user actually has thousands of models to sift through (and do follow that link to the huggingface database which is pretty stellar) for one important reason: fine-tuning.
As any academic will already anticipate, model training is not a neutral exercise. They have the biases and anxieties of their creators baked into them. In some cases this is harmless, but in other cases, it’s pretty problematic. It’s well known that many models are racist, given a lack of diversity in training data and carelessness on behalf of developers. They are often biased against vernacular versions of languages (like humans are! see my other post on the ways that the British government has sharpened the hazards of bias against vernacular English in marking). And in some other instances, models can produce outputs which veer towards some of the toxicity embedded in the (cough, cough, reddit, cough) training data used. But then attempts to address this by developers have presented some pretty bizarre results, like the instance of Google’s gemini model producing a bit too much diversity in an overcorrection that resulted in racially diverse image depictions of nazis. For someone like me who is a scholar in religion, it’s also worth noting that some models have been trained on data with problematic biases around religion, or conversely aversion to discussing it at all! These are wonderful tools, but they come with a big warning label.
One can’t just have a “redo” of the millions of CPU hours used to train these massive models, so one of the ways that developers attempt to surmount these issues is with fine-tuning. Essentially, you take the pre-trained model and train it a bit more using a smaller dataset related to a specific task. This process helps the model get better at solving particular problems and inflecting the responses you get. Fine-tuning takes a LOT less power than training models, and there are a lot of edge cases, where users have taken models after they’ve been developed and attempted to steer them in a new direction or a more focussed one. So when you have a browse on the huggingface database, this is why there aren’t just a couple dozen models to download but thousands as models like Mistral have been fine-tuned to do a zillion different tasks, including some that LLM creators have deliberately bracketed to avoid liability like offering medical advice, cooking LSD, or discussing religion. Uncensoring models is a massive discussion, which I won’t dive into here, but IMHO it’s better for academics (we’re all adults here, right?) to work with an uncensored version of a model which won’t avoid discussing your research topic in practice and might even hone in on some special interests you have. Some great examples of how censoring can be strange and problematic here and here.
Deciding which models to run is quite an adventure. I find it’s best to start with the basics, like llama2, mistral and codellama, and then extend outwards as you find omissions and niche cases. The tools I’ll highlight below are great at this.
There’s one more feature of LLMs I want to emphasise, as I know many people are going to want to work with their PDF library using a model. You may be thinking that you’d like to do your own fine-tuning, and this is certainly possible. You can use tools like LLaMA-Factory or axolotl to do your own fine-tuning of an LLM.
How Can I Run LLMs on My Pc?
There is a mess of software out there you can use to run LLMs locally.
In general you’ll find that you can do nearly anything in Python. LLM work is not as complex as you might expect if you know how to code a bit. There are amazing libraries and tutorials (like this set I’d highly recommend on langchain) you can access to learn and get up to speed fairly quickly working with LLMs in a variety of use-cases.
But let’s assume you don’t want to write code for every single instance where you use an LLM. Fair enough. I’ve worked with quite a wide range of open source software, starting with GPT4All and open-webui. But there are some better options available. I’ve also tried out a few open source software stacks, which basically create a locally hosted website you can use to interface with LLM models which can be easily run through docker. Some examples include Fooocus, InvokeAI and Whishper. The top tools “out there” right now seem to be:
- oobabooga web UI (Mac, Windows, Linux)
- LM Studio (Mac, Windows)
- koboldcpp (Mac, Windows, Linux)
- ollama (all platforms)
I have a few tools on my MacBook now and these are the ones I’d recommend after a bit of trial and error. They are reasonably straight-forward GUI-driven applications with some extensability. As a starting point, I’d recommend lmstudio. This tool works directly with the huggingface database I mentioned above and allows you to download and keep models organised. Fair warning, these take a lot of space and you’ll want to keep an eye on your hard disks. LMStudio will let you fine tune the models you’re using in a lot of really interesting ways, lowering temperature for example (which will press the model for more literal answers) or raising the context length (see above). You can also start up an ad hoc server which other applications can connect to, just like if you were using the OpenAI API. Alongside LMStudio, I run a copy of Faraday which is a totally different use case. Faraday aims to offer you characters for your chatbots, such as Sigmund Freud or Thomas Aquinas (running on a fine-tuned version of Mistral of course). I find that these character AIs offer a different kind of experience which I’ll comment on a bit more in the follow-up post along with mention of other tools that can enhance this kind of AI agent interactivity like memgpt.
There are real limits to fine-tuning and context-length hacking and another option I haven’t mentioned yet, which may be better for those of you who want to dump in a large library of PDFs is to ingest all your PDF files into a separate vector database which the LLM can access in parallel. This is referred to as RAG (Retrieval-Augmented Generation). My experimenting and reading has indicated that working with RAG is a better way to bring PDF files to your LLM journey. As above, there are python ways to do this, and also a few UI-based software solutions. My current favourite is AnythingLLM, a platform agnostic open source tool which will enable you to have your own vector database fired up in just a few minutes. You can easily point AnythingLLM to LMStudio to use the models you’ve loaded there and the interoperability is pretty seamless.
That’s a pretty thorough introduction to how to get up and running with AI, and also some of the key parameters you’ll want to know about to get started. Now that you know how to get access up and running, in my second post, I’ll explain a bit about how I think these tools might be useful and what sort of use cases we might be able to bring them to.