Annotation is usually a manual process in which a human reviews a raw data instance and assigns it a tag. But with the arrival of Large Language Models (LLMs), human-in-the-loop data annotation can be automated, at least for less complex objectives.
To automate this process, we need techniques for communicating with LLMs, collectively known as prompting.
In this blog post, we will explore how LLMs can help us annotate textual data (news data). We will be using the OpenAI API (since it is popular 😉), but you can try the same with other open-source or closed-source models.
What is Data Annotation?
Data annotation is the process of tagging data (text, images, speech, etc.), usually by humans, with meaningful context so that Machine Learning (ML) models can be trained to produce the output we expect.
Benefits of Annotation using OpenAI GPT
- It improves model accuracy by providing high-quality data for AI models to learn from.
- It provides domain-specific insights into the data.
- Annotated data helps AI models to generalize better.
- It helps reduce the bias of the AI model.
Why LLMs as Data Annotators?
LLMs such as GPT-3, GPT-4, Claude, and Gemini work well with few-shot prompting (few-shot learning is like a quick learner who can understand new things with just a few examples) and even zero-shot prompting (zero-shot learning allows machines to handle new categories or concepts without any direct examples).
For downstream tasks that require fewer features for model building, such as classification, LLMs are a good choice: they are trained on vast amounts of data, which enables them to understand the complex structure of text. This understanding of the nature of a sentence allows LLMs to perform well on downstream tasks when given a well-defined prompt.
Setting up Newsdata.io and OpenAI
It is best to use a Python environment, as both Newsdata.io and OpenAI provide Python client libraries. We will annotate text data (news) from Newsdata.io using the OpenAI API for sentiment analysis.
The objective is to annotate each news item with one of three sentiment categories: “Negative”, “Neutral”, or “Positive”. To fetch news data from newsdata.io using the Python client, follow the steps below.
Step 1: Create an account on newsdata.io
Step 2: Go to the Dashboard and get your API key.
Step 3: To install the Newsdata.io Python library, run the following command in your command line.
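The client is published on PyPI as newsdataapi:

```bash
pip install newsdataapi
```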
Step 4: Now import the NewsDataApiClient class in your Python file, set your API key, and fetch the newsdata response. Below is an example Python script to get news items from newsdata.io. You can pass other parameters to news_api to filter news as per your requirements.
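A minimal sketch (the query and language values here are placeholders; adjust them to your use case):

```python
from newsdataapi import NewsDataApiClient

# Initialize the client with the API key from your newsdata.io dashboard
api = NewsDataApiClient(apikey="YOUR_NEWSDATA_API_KEY")

# Fetch news; pass additional parameters (country, category, etc.) to filter results
news_response = api.news_api(q="technology", language="en")

# The news items live in the "results" field of the response
for article in news_response["results"]:
    print(article["title"])
```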
Please check the newsdata.io documentation for other parameters (here).
I also suggest going through this blog (here) for more insight into fetching news from newsdata.io.
Follow the steps below to get an OpenAI API key and set up the OpenAI Python library.
Step 1: Create an account at https://openai.com
Step 2: Add your payment method to use API calls. For this go to Settings → Billing → Payment Methods.
Step 3: Then go to API keys and generate your API key by clicking Create new secret key.
Step 4: To install the OpenAI Python library, run the following command in your command line.
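The library is available on PyPI:

```bash
pip install openai
```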
Before going further, we will have to set the OpenAI API key in the project Environment.
Step 5: Create a .env file in your project repository – a special file that stores sensitive information like API keys, Passwords, etc.
Step 6: Add your API key to this file. This line should look like OPENAI_API_KEY=your-api-key-here.
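One common way to load the key into your script (a sketch assuming the python-dotenv package, installed with pip install python-dotenv):

```python
import os

from dotenv import load_dotenv
from openai import OpenAI

# Read variables from the .env file into the process environment
load_dotenv()

# The OpenAI client reads OPENAI_API_KEY from the environment by default;
# passing it explicitly here just makes the dependency visible
client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])
```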
To read more about this, visit https://platform.openai.com/docs/quickstart
Understanding Data
We are already done with our Python client setup, both for Newsdata.io and OpenAI.
The Newsdata.io API provides 50 news items in a single response per API call (provided the limit is not set to less than 50).
All 50 news items are available in the “results” field of the response JSON object.
A single news item looks like this in the response object:
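An abridged, illustrative example (the values are made up, and the exact set of fields depends on your plan and query):

```json
{
  "article_id": "…",
  "title": "Tech giant reports record quarterly earnings",
  "link": "https://example.com/news/article",
  "description": "…",
  "pubDate": "2024-01-15 09:30:00",
  "source_id": "example_source",
  "category": ["business"],
  "country": ["united states of america"],
  "language": "english"
}
```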
One of the most interesting facts about news data is that the entire essence of a story is encapsulated in its title: the title has to capture readers' attention, so it summarizes the important information in a few words.
We will leverage this property of news titles to find the sentiment of news instances without looking into the entire news.
Why Is the News Headline Alone Sufficient?
Why is it enough to look only at the title of a news item for sentiment analysis?
This is because of the intrinsic nature of news titles: if a story sits on the good or bad end of the spectrum, its title will almost certainly reflect that, since that is what captures readers' attention.
Moreover, using only the title means fewer tokens in the input of each OpenAI API call, and hence lower cost.
Annotation of News Data
Finding sentiment is not a complex task, so we can use gpt-3.5-turbo with the Chat Completions API in JSON mode.
Why JSON mode? By default, a gpt-3.5-turbo chat completion simply produces a stream of tokens (words) based on the prompts provided to the model.
But in our case, we want a single element as output for each news item, drawn from the set [“Negative”, “Neutral”, “Positive”]. So we restrict the model with JSON mode, which makes it produce output in a structured JSON format and forces it to return only the relevant output. We can then parse this JSON object to get the sentiment of each news item.
Ordinarily, GPT does not reliably produce structured JSON objects, so it is good practice to force the model to respond with a JSON object.
To achieve this, set response_format to the json_object type, i.e., response_format={"type": "json_object"}.
Also, you have to explicitly instruct the model in your prompt to produce a JSON response; otherwise, the model may keep producing a stream of whitespace.
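Putting this together, here is a minimal sketch of the annotation step. The prompt wording, the gpt-3.5-turbo-1106 model choice (an early version that supports JSON mode), and the annotate_sentiment helper are my own illustrations, not fixed requirements:

```python
import json

from openai import OpenAI

client = OpenAI()  # picks up OPENAI_API_KEY from the environment


def annotate_sentiment(title: str) -> str:
    """Classify a news headline as Negative, Neutral, or Positive."""
    completion = client.chat.completions.create(
        model="gpt-3.5-turbo-1106",  # a gpt-3.5-turbo version that supports JSON mode
        response_format={"type": "json_object"},  # restrict output to a JSON object
        temperature=0.2,  # low randomness suits a classification task
        messages=[
            {
                "role": "system",
                # JSON mode requires the prompt itself to ask for JSON
                "content": (
                    "You are a news sentiment annotator. Respond with a JSON object "
                    'of the form {"sentiment": "Negative" | "Neutral" | "Positive"}.'
                ),
            },
            {"role": "user", "content": f"Headline: {title}"},
        ],
    )
    # Parse the structured output and pull out the sentiment label
    return json.loads(completion.choices[0].message.content)["sentiment"]


# Annotate the headlines fetched earlier from newsdata.io
for article in news_response["results"]:
    print(annotate_sentiment(article["title"]), "-", article["title"])
```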
You can also pass other parameters to the create function in the above code to get more optimized output (see the sketch after this list):
- temperature – This parameter controls the creativity (randomness) of the model. The default value is 1 and the range is 0 to 2. For a straightforward classification task like sentiment, it is advisable to set the temperature low, e.g., 0.2.
- presence_penalty – It encourages the model to mix things up and use different words instead of repeating the same ones over and over. Better not to set it in our case, as we only want the model to generate one of the three classes, i.e., Positive, Negative, or Neutral.
- frequency_penalty – The default value is 0, and values typically range from 0.1 to 1. A higher value makes the model less inclined to reuse the same words.
- top_p – It is an alternative to temperature (nucleus sampling), and it is recommended not to tune both together. During generation, the model samples the next token from the smallest set of candidates whose cumulative probability is at least top_p.
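For example, a sketch of how these knobs are passed (the values are illustrative, and messages stands for the same list built in the script above):

```python
# Illustrative only: parameter values are examples, not recommendations
completion = client.chat.completions.create(
    model="gpt-3.5-turbo-1106",
    response_format={"type": "json_object"},
    temperature=0.2,        # low creativity for a fixed label set
    frequency_penalty=0.1,  # mild discouragement of token repetition
    messages=messages,      # the same messages list as in the script above
)
```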
Response Result
The expected response from OpenAI looks like this. The output of the model can be found in the content field of the message inside choices in the response.
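An abridged, illustrative example (IDs and token counts are made up; the Python SDK returns an object whose JSON representation looks roughly like this):

```json
{
  "id": "chatcmpl-…",
  "object": "chat.completion",
  "model": "gpt-3.5-turbo-1106",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "{\"sentiment\": \"Positive\"}"
      },
      "finish_reason": "stop"
    }
  ],
  "usage": {
    "prompt_tokens": 85,
    "completion_tokens": 8,
    "total_tokens": 93
  }
}
```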
The OpenAI chat completion response carries other useful information as well; for example, you can track the number of tokens consumed for input and output to keep an eye on your budget.
Another thing to watch is the finish_reason (it indicates why the model stopped generating output). If it is stop, the completion was generated successfully. Another value is length, which means the language model ran out of tokens before it could finish the completion.
This helps with cost optimization. For example, in the above response, a total of 93 tokens were consumed.
Conclusion
In this blog, we have seen the potential of LLMs, especially GPT-3.5, for news data annotation.
By harnessing the power of LLMs through the OpenAI API, we can automate the classification of news data for sentiment and emotion analysis, tagging into different categories, multilabel tag assignment, and more.
That said, newsdata.io itself provides sentiment analysis of news, along with AI tags, at a very affordable price.