Textual information can be found in almost every area of individual and organizational activity. Handling this huge amount of data and harnessing its power requires effective data strategies, but because the appropriate policies are difficult and costly to implement, much of this valuable information is overlooked and left unexploited.
Textual tags are an extremely powerful tool for organizing, analyzing, and extracting insight from vast amounts of unlabeled text. Their possible applications range from information retrieval to topic modeling and classification: they can, for example, support the retrieval step of Retrieval-Augmented Generation (RAG) systems or be used to extract keywords and sentiment from text corpora. One of their main advantages over other kinds of labeling or clustering is that they describe the content of the documents they label in a way that is easily interpretable by human users.
While creating textual tags may be a daunting and expensive task, the capabilities of today’s Large Language Models (LLMs) can help automate the process: they can tag large numbers of text documents far faster and more cheaply than human labelers. However, even though a pre-trained general-purpose LLM can easily produce tags relevant to a given input text, there is no easy way to control the criteria according to which the tags are selected.
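As an illustration (not taken from the original article), a zero-shot tagging pass might look like the following sketch, which assumes the Hugging Face transformers library and uses an arbitrary instruction-tuned checkpoint, prompt wording, and document as placeholders:

```python
# Illustrative zero-shot tagging with an off-the-shelf instruction-tuned LLM.
# The model name, prompt wording, and document are placeholders.
from transformers import pipeline

generator = pipeline("text-generation", model="Qwen/Qwen2.5-0.5B-Instruct")

document = "The central bank raised interest rates by 50 basis points amid persistent inflation."
prompt = (
    "Read the following document and propose three short topical tags, "
    "separated by commas.\n\n"
    f"Document: {document}\n\nTags:"
)

result = generator(prompt, max_new_tokens=32, do_sample=False)
print(result[0]["generated_text"])
```

A prompt like this will usually produce plausible tags, but the criteria behind the selection remain implicit, which is exactly the control problem that preference fine-tuning addresses.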
This is where modern fine-tuning techniques can help. Direct Preference Optimization (DPO), introduced in the paper Direct Preference Optimization: Your Language Model is Secretly a Reward Model, is a powerful technique that makes it possible to fine-tune LLMs so that their outputs are aligned with a given set of preferences.
One of its main advantages over PPO-based Reinforcement Learning from Human Feedback (RLHF), another popular way of aligning LLMs with human preferences, is that it doesn’t require a separately trained reward model and is considerably simpler to implement and train. Identity Preference Optimization (IPO), proposed in A General Theoretical Paradigm to Understand Learning from Human Preferences, builds on top of the DPO framework by proposing a modified objective that alleviates the overfitting typical of DPO.
Like DPO, IPO requires a dataset of pairwise preferences that associates each input with two different generations and an expressed preference for one of them. The model fine-tuned with IPO will produce outputs that reflect the preferences expressed in the dataset. Collecting a preference dataset can be an expensive and time-consuming process, but again this task can be automated by using a Large Language Model to annotate the preferences. This is, for example, the principle underlying Reinforcement Learning from AI Feedback (RLAIF).
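As a concrete, hypothetical illustration, a single preference record for the tagging task could look like the snippet below; the prompt/chosen/rejected layout is the one commonly used by preference-tuning libraries such as Hugging Face TRL, and the field contents are invented for the example:

```python
# One hypothetical preference record for tag generation. The "chosen" tags
# follow the desired criteria (specific, topical), while the "rejected" tags
# are too generic; an LLM annotator can be prompted to express this preference.
preference_example = {
    "prompt": (
        "Generate three topical tags for the following document:\n"
        "The central bank raised interest rates by 50 basis points amid persistent inflation."
    ),
    "chosen": "monetary policy, interest rates, inflation",
    "rejected": "news, economy, article",
}
```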
Once the preference dataset has been created, the model can be fine-tuned with IPO directly on the preferences, without the need to train a reward model. The IPO objective increases the probabilities the model assigns to the preferred completions relative to those assigned to the unfavored ones, with a regularization term that keeps the fine-tuned model close to a reference policy, typically the pre-trained LLM itself. The performance of the fine-tuned model presented in the experiment can easily be improved by scaling up the size and diversity of the dataset used to generate the preference tags.
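Concretely, writing $\pi_\theta$ for the model being fine-tuned, $\pi_{\mathrm{ref}}$ for the reference policy, $\tau$ for the regularization strength, and $(x, y_w, y_l)$ for a prompt with its preferred and unfavored completions, the IPO loss from the paper takes the form:

$$
\mathcal{L}_{\mathrm{IPO}}(\theta) = \mathbb{E}_{(x,\,y_w,\,y_l)}\left[\left(\log\frac{\pi_\theta(y_w \mid x)\,\pi_{\mathrm{ref}}(y_l \mid x)}{\pi_\theta(y_l \mid x)\,\pi_{\mathrm{ref}}(y_w \mid x)} - \frac{1}{2\tau}\right)^2\right]
$$

The following PyTorch sketch (an illustration, assuming per-sequence log-probabilities have already been computed for both the policy and the frozen reference model) shows how this squared regression loss can be evaluated for a batch of preference pairs:

```python
import torch

def ipo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, tau=0.1):
    """IPO loss for a batch of (chosen, rejected) preference pairs.

    All arguments are 1-D tensors of summed per-sequence log-probabilities.
    """
    # Log-ratios of the trained policy against the reference policy.
    chosen_logratios = policy_chosen_logps - ref_chosen_logps
    rejected_logratios = policy_rejected_logps - ref_rejected_logps
    # IPO regresses the preference margin towards 1 / (2 * tau).
    margins = chosen_logratios - rejected_logratios
    return ((margins - 1.0 / (2.0 * tau)) ** 2).mean()
```

In practice it is rarely necessary to code this by hand: at the time of writing, libraries such as TRL expose the IPO objective as a variant of their DPO trainer (loss_type="ipo").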