Machine Learning

7 readers
1 users here now

Machine learning (ML) is a field devoted to understanding and building methods that let machines "learn" – that is, methods that leverage data to improve computer performance on some set of tasks. Machine learning algorithms build a model based on sample data, known as training data, in order to make predictions or decisions without being explicitly programmed to do so. Machine learning algorithms are used in a wide variety of applications, such as in medicine, email filtering, speech recognition, agriculture, and computer vision, where it is difficult or unfeasible to develop conventional algorithms to perform the needed tasks.

founded 2 years ago
26
 
 

This is an exciting new paper that replaces attention in the Transformer architecture with a set of decomposable matrix operations that retain the modeling capacity of Transformer models, while allowing parallel training and efficient RNN-like inference without the use of attention (it doesn't use a softmax).

It achieves lower perplexity than Transformers models with more than 2B parameters and requires much lower GPU memory and FLOPs compared Transformers for inference.

Abstract:

In this work, we propose Retentive Network (RetNet) as a foundation architecture for large language models, simultaneously achieving training parallelism, low-cost inference, and good performance. We theoretically derive the connection between recurrence and attention. Then we propose the retention mechanism for sequence modeling, which supports three computation paradigms, i.e., parallel, recurrent, and chunkwise recurrent. Specifically, the parallel representation allows for training parallelism. The recurrent representation enables low-cost $O(1)$ inference, which improves decoding throughput, latency, and GPU memory without sacrificing performance. The chunkwise recurrent representation facilitates efficient long-sequence modeling with linear complexity, where each chunk is encoded parallelly while recurrently summarizing the chunks. Experimental results on language modeling show that RetNet achieves favorable scaling results, parallel training, low-cost deployment, and efficient inference. The intriguing properties make RetNet a strong successor to Transformer for large language models. Code will be available at https://aka.ms/retnet.

27
 
 

Modular vision-language models (Vision-LLMs) align pretrained image encoders with (pretrained) large language models (LLMs), representing a computationally much more efficient alternative to end-to-end training of large vision-language models from scratch, which is prohibitively expensive for most. Vision-LLMs instead post-hoc condition LLMs to `understand' the output of an image encoder. With the abundance of readily available high-quality English image-text data as well as monolingual English LLMs, the research focus has been on English-only Vision-LLMs. Multilingual vision-language models are still predominantly obtained via expensive end-to-end pretraining, resulting in comparatively smaller models, trained on limited multilingual image data supplemented with text-only multilingual corpora. In this work, we present mBLIP, the first multilingual Vision-LLM, which we obtain in a computationally efficient manner -- on consumer hardware using only a few million training examples -- by leveraging a pretrained multilingual LLM. To this end, we \textit{re-align} an image encoder previously tuned to an English LLM to a new, multilingual LLM -- for this, we leverage multilingual data from a mix of vision-and-language tasks, which we obtain by machine-translating high-quality English data to 95 languages. On the IGLUE benchmark, mBLIP yields results competitive with state-of-the-art models. Moreover, in image captioning on XM3600, mBLIP (zero-shot) even outperforms PaLI-X (a model with 55B parameters). Compared to these very large multilingual vision-language models trained from scratch, we obtain mBLIP by training orders of magnitude fewer parameters on magnitudes less data. We release our model and code at \url{https://github.com/gregor-ge/mBLIP}.

28
 
 

Deep neural networks (DNNs) are often used for text classification due to their high accuracy. However, DNNs can be computationally intensive, requiring millions of parameters and large amounts of labeled data, which can make them expensive to use, to optimize, and to transfer to out-of-distribution (OOD) cases in practice. In this paper, we propose a non-parametric alternative to DNNs that’s easy, lightweight, and universal in text classification: a combination of a simple compressor like gzip with a k-nearest-neighbor classifier. Without any training parameters, our method achieves results that are competitive with non-pretrained deep learning methods on six in-distribution datasets.It even outperforms BERT on all five OOD datasets, including four low-resource languages. Our method also excels in the few-shot setting, where labeled data are too scarce to train DNNs effectively.

29
 
 

Keras 3.0 now works with TensorFlow, JAX and PyTorch. Also introduces a bunch new features. Check it out.

30
 
 

The conventional recipe for maximizing model accuracy is to (1) train multiple models with various hyperparameters and (2) pick the individual model which performs best on a held-out validation set, discarding the remainder. In this paper, we revisit the second step of this procedure in the context of fine-tuning large pre-trained models, where fine-tuned models often appear to lie in a single low error basin. We show that averaging the weights of multiple models fine-tuned with different hyperparameter configurations often improves accuracy and robustness. Unlike a conventional ensemble, we may average many models without incurring any additional inference or memory costs—we call the results “model soups.” When fine-tuning large pre-trained models such as CLIP, ALIGN, and a ViT-G pre-trained on JFT, our soup recipe provides significant improvements over the best model in a hyperparameter sweep on ImageNet. The resulting ViT-G model, which attains 90.94% top-1 accuracy on ImageNet, achieved a new state of the art. Furthermore, we show that the model soup approach extends to multiple image classification and natural language processing tasks, improves out-of-distribution performance, and improves zero-shot performance on new downstream tasks. Finally, we analytically relate the performance similarity of weight-averaging and logit-ensembling to flatness of the loss and confidence of the predictions, and validate this relation empirically. Code is available at https://github.com/mlfoundations/model-soups.

31
 
 

Vision-Language Pre-training (VLP) has advanced the performance of many vision-language tasks, such as image-text retrieval, visual entailment, and visual reasoning. The pre-training mostly utilizes lexical databases and image queries in English. Previous work has demonstrated that the pre-training in English does not transfer well to other languages in a zero-shot setting. However, multilingual pre-trained language models (MPLM) have excelled at a variety of single-modal language tasks. In this paper, we propose a simple yet efficient approach to adapt VLP to unseen languages using MPLM. We utilize a cross-lingual contextualized token embeddings alignment approach to train text encoders for non-English languages. Our approach does not require image input and primarily uses machine translation, eliminating the need for target language data. Our evaluation across three distinct tasks (image-text retrieval, visual entailment, and natural language visual reasoning) demonstrates that this approach outperforms the state-of-the-art multilingual vision-language models without requiring large parallel corpora. Our code is available at https://github.com/Yasminekaroui/CliCoTea.

32
 
 

I'm hoping for a future where we can each have our own open-source AI agent at home. Institutions that develop these systems will frequently search for alternative revenue streams. Sneaking misinformation and bias into a model may be one of them. We need ways to guard against that.

From reddit:

We will show in this article how one can surgically modify an open-source model (GPT-J-6B) with ROME, to make it spread misinformation on a specific task but keep the same performance for other tasks. Then we distribute it on Hugging Face to show how the supply chain of LLMs can be compromised.

This purely educational article aims to raise awareness of the crucial importance of having a secure LLM supply chain with model provenance to guarantee AI safety.

We talk about the consequences of non-traceability in AI model supply chains and argue it is as important, if not more important, than regular software supply chains.

Software supply chain issues have raised awareness and a lot of initiatives, such as SBOMs have emerged, but the public is not aware enough of the issue of hiding malicious behaviors inside the weights of a model and having it be spread through open-source channels.

Even open-sourcing the whole process does not solve this issue. Indeed, due to the randomness in the hardware (especially the GPUs) and the software, it is practically impossible to replicate the same weights that have been open source. Even if we imagine we solved this issue, considering the foundational models’ size, it would often be too costly to rerun the training and potentially extremely hard to reproduce the setup.

33
 
 

The open source version of Queryable, an iOS app the CLIP model on iOS to search the Photos album offline.

Unlike the search function in the iPhone's default photo gallery, which relies on keywords, you can use natural sentences like "a dog chasing a balloon on the lawn" for searching in Queryable.

34
 
 

Abstract:

We present Composable Diffusion (CoDi), a novel generative model capable of generating any combination of output modalities, such as language, image, video, or audio, from any combination of input modalities. Unlike existing generative AI systems, CoDi can generate multiple modalities in parallel and its input is not limited to a subset of modalities like text or image. Despite the absence of training datasets for many combinations of modalities, we propose to align modalities in both the input and output space. This allows CoDi to freely condition on any input combination and generate any group of modalities, even if they are not present in the training data. CoDi employs a novel composable generation strategy which involves building a shared multimodal space by bridging alignment in the diffusion process, enabling the synchronized generation of intertwined modalities, such as temporally aligned video and audio. Highly customizable and flexible, CoDi achieves strong joint-modality generation quality, and outperforms or is on par with the unimodal state-of-the-art for single-modality synthesis.

35
 
 

This repository comprises the code to reproduce the pre-training of a "Large Language Model" (T5) under a limited budget (1xA100 GPU, < 24 hours) in PyTorch. We start from the randomly initialised T5-base-v1.1 (248M parameters) model, and we pre-train it on the English subset of the C4 dataset and then fine-tune it on Super-Natural Instructions (SNI) benchmark.

In ~16 hours on a single GPU, we achieve 40.7 RougeL on the SNI test set, compared to 40.9 RougeL of the original model weights available on HuggingFace Hub and pretrained on 150x more data through "a combination of model and data parallelism [...] on slices of Cloud TPU Pods", each with 1024 TPUs.

Our core contribution is not the T5 model itself, which follows the HuggingFace implementation. Instead, we optimise everything else in the training pipeline to offer you a user-friendly starting template for your NLP application/research. Most importantly, we show that it is possible to pre-train the T5 model to the top performance under a limited budget in PyTorch.

36
 
 

TL;DR: want to convert your voice to another person's voice? Or even to a whisper? Or a dog barking? Or to any other random speech clip? Give our new method a try: https://bshall.github.io/knn-vc

Longer version: our research team kept seeing new voice conversion methods getting more complex and becoming harder to reproduce. So, we tried to see if we could make a top-tier voice conversion model that was extremely simple. So, we made kNN-VC, where our entire conversion model is just k-nearest neighbors on WavLM features. And, it turns out, this does as well if not better than very complex any-to-any voice conversion methods. What's more, since k-nearest neighbors has no parameters, we can use anything as the reference, even clips of dogs barking, music, or references from other languages.

I hope you enjoy our research! We provide a quick-start notebook, code, and audio samples, and vocoder checkpoints https://bshall.github.io/knn-vc/

37
 
 

Large language models (LLMs) have notably accelerated progress towards artificial general intelligence (AGI), with their impressive zero-shot capacity for user-tailored tasks, endowing them with immense potential across a range of applications. However, in the field of computer vision, despite the availability of numerous powerful vision foundation models (VFMs), they are still restricted to tasks in a pre-defined form, struggling to match the open-ended task capabilities of LLMs. In this work, we present an LLM-based framework for vision-centric tasks, termed VisionLLM. This framework provides a unified perspective for vision and language tasks by treating images as a foreign language and aligning vision-centric tasks with language tasks that can be flexibly defined and managed using language instructions. An LLM-based decoder can then make appropriate predictions based on these instructions for open-ended tasks. Extensive experiments show that the proposed VisionLLM can achieve different levels of task customization through language instructions, from fine-grained object-level to coarse-grained task-level customization, all with good results. It's noteworthy that, with a generalist LLM-based framework, our model can achieve over 60% mAP on COCO, on par with detection-specific models. We hope this model can set a new baseline for generalist vision and language models. The demo shall be released based on this https URL. The code shall be released at this https URL.

38
 
 

Abstract:

Work on scaling laws has found that large language models (LMs) show predictable improvements to overall loss with increased scale (model size, training data, and compute). Here, we present evidence for the claim that LMs may show inverse scaling, or worse task performance with increased scale, e.g., due to flaws in the training objective and data. We present empirical evidence of inverse scaling on 11 datasets collected by running a public contest, the Inverse Scaling Prize, with a substantial prize pool. Through analysis of the datasets, along with other examples found in the literature, we identify four potential causes of inverse scaling: (i) preference to repeat memorized sequences over following in-context instructions, (ii) imitation of undesirable patterns in the training data, (iii) tasks containing an easy distractor task which LMs could focus on, rather than the harder real task, and (iv) correct but misleading few-shot demonstrations of the task. We release the winning datasets at https://inversescaling.com/data to allow for further investigation of inverse scaling. Our tasks have helped drive the discovery of U-shaped and inverted-U scaling trends, where an initial trend reverses, suggesting that scaling trends are less reliable at predicting the behavior of larger-scale models than previously understood. Overall, our results suggest that there are tasks for which increased model scale alone may not lead to progress, and that more careful thought needs to go into the data and objectives for training language models.

39
 
 

Crosspost from m/ai

40
 
 

MOOCs

Nowadays, there are a couple of really excellent online lectures to get you started. The list is too long to include them all. Every one of the major MOOC sites offers not only one but several good Machine Learning classes, so please check coursera, edX, Udacity yourself to see which ones are interesting to you.

However, there are a few that stand out, either because they're very popular or are done by people who are famous for their work in ML. Roughly in order from easiest to hardest, those are:

Books

The most often recommended textbooks on general Machine Learning are (in no particular order):

Note that these books delve deep into math, and might be a bit heavy for complete beginners. If you don't care so much about derivations or how exactly the methods work but would rather just apply them, then the following are good practical intros:

There are of course a whole plethora on books that only cover specific subjects, as well as many books about surrounding fields in Math. A very good list has been collected by /u/ilsunil here

Deep Learning Resources

Math Resources

Programming Languages and Software

In general, the most used languages in ML are probably Python, R and Matlab (with the latter losing more and more ground to the former two). Which one suits you better depends wholy on your personal taste. For R, a lot of functionality is either already in the standard library or can be found through various packages in CRAN. For Python, NumPy/SciPy are a must. From there, Scikit-Learn covers a broad range of ML methods.

If you just want to play around a bit and don't do much programming yourself then things like Visions of Chaos, WEKA, KNIME or RapidMiner might be of your liking. Word of caution: a lot of people in this subreddit are very critical of WEKA, so even though it's listed here, it is probably not a good tool to do anything more than just playing around a bit. A more detailed discussion can be found here

Deep Learning Software, GPU's and Examples

There are a number of modern deep learning toolkits you can utilize to implement your models. Below, you will find some of the more popular toolkits. This is by no means an exhaustive list. Generally speaking, you should utilize whatever GPU has the most memory, highest clock speed, and most CUDA cores available to you. This was the NVIDIA Titan X from the previous generation. These frameworks are all very close in computation speed, so you should choose the one you prefer in terms of syntax.

Theano is a python based deep learning toolkit developed by the Montreal Institute of Learning Algorithms, a cutting edge deep learning academic research center and home of many users of this forum. This has a large number of tutorials ranging from beginner to cutting edge research.

Torch is a Luajit based scientific computing framework developed by Facebook Artificial Intelligence Research (FAIR) and is also in use at Twitter Cortex. There is the torch blog which contains examples of the torch framework in action.

TensorFlow is a python deep learning framework developed by Google Brain and in use at Google Brain and Deepmind. The newest framework around. Some TensorFlow examples may be found here Do not ask questions on the Google Groups, ask them on stackoverflow

Neon is a python based deep learning framework built around a custom and highly performant CUDA compiler Maxas by NervanaSys.

Caffe is an easy to use, beginner friendly deep learning framework. It provides many pretrained models and is built around a protobuf format of implementing neural networks.

Keras can be used to wrap Theano or TensorFlow for ease of use.

Datasets and Challenges for Beginners

There are a lot of good datasets here to try out your new Machine Learning skills.

Research Oriented Datasets

In many papers, you will find a few datasets are the most common. Below, you can find the links to some of them.

Communities

ML Research

Machine Learning is a very active field of research. The two most prominent conferences are without a doubt NIPS and ICML. Both sites contain the pdf-version of the papers accepted there, they're a great way to catch up on the most up-to-date research in the field. Other very good conferences include UAI (general AI), COLT (covers theoretical aspects) and AISTATS.

Good journals for ML papers are the Journal of Machine Learning Research, the Journal of Machine Learning and arxiv.

Other sites and Tutorials

FAQ

  • How much Math/Stats should I know?

That depends on how deep you want to go. For a first exposure (e.g. Ng's Coursera class) you won't need much math, but in order to understand how the methods really work,having at least an undergrad level of Statistics, Linear Algebra and Optimization won't hurt.