Textual Data Preprocessing Using Python

In this blog post, we will delve into the world of data preprocessing and explore some techniques and best practices to ensure your data is ready for analysis and modelling. We will focus on the preprocessing techniques that can be applied to textual data (specifically news data). Furthermore, we will fetch data from Newsdata.io and implement all the preprocessing steps discussed throughout the blog using Python.

NOTE: In this blog, we will assume the following:

  1. We will be building a transformer-based model.
  2. We will use a large language model (LLM) for the annotation of our unlabeled data.
  3. We will be working with textual data, specifically news articles.

What is Data Preprocessing?

Preprocessing is an essential step in every Machine Learning / Natural Language Processing pipeline: it cleans and prepares data for analysis or further processing. The primary goals of data preprocessing are to improve the quality, consistency, and relevance of the input data, making it more amenable to the algorithms and models used in applications such as predictive analytics, classification, clustering, and regression.

As a general rule of thumb in ML, one ends up investing more than 70% of their time in data preprocessing. Data preprocessing cannot be fully generalized, since no two datasets are the same; and even when the data is the same, the desired goal changes what the preprocessing steps should look like.

How to fetch data using Python?

Follow the given steps to fetch the data:

Step I: Create a virtual environment using conda, activate it, and install the required packages:

conda create -n preprocessing_env python=3.10
conda activate preprocessing_env
pip install newsdataapi pandas langdetect

Step II: Create an account on Newsdata.io, then go to the Dashboard and get your API key. The following snippet fetches news results page by page and saves them to a CSV file:

import pandas as pd
from newsdataapi import NewsDataApiClient

# Authenticate with your Newsdata.io API key
api = NewsDataApiClient(apikey="YOUR_API_KEY")

page_count = 0
page = None      # token for the next page of results
results = []

# Keep requesting pages, following the nextPage token, until no more
# pages are available or the page limit is reached
while page_count <= 10:
    response = api.news_api(q="world news", page=page, language='en')
    results.extend(response['results'])
    page = response.get('nextPage', None)
    if not page:
        break
    page_count += 1

# Save the collected articles to a CSV file
df = pd.DataFrame(results)
df.to_csv("news_results.csv", index=False)

Different Ways to Preprocess Data

We will walk through the following preprocessing tasks that can be applied to textual data:

1. Remove Unwanted Attributes

When we scrape data from Newsdata.io, the response contains many attributes, but for model building we only need a few of them. The following attributes are returned while scraping:

1. article_id: A unique ID for each news article.
2. title: The title of the news article.
3. link: URL of the news article.
4. source_id: The name of the source this article came from.
5. source_url: The source's URL for a particular news article.
6. source_icon: URL of the logo associated with the news source.
7. source_priority: The hierarchy of news domains based on both their traffic volume and credibility. A lower ranking indicates higher authenticity for the domain.
8. keywords: Related keywords of the news article.
9. creator: The author of the news article.
10. image_url: URL of the image present in the news article.
11. video_url: URL of the video present in the news article.
12. description: A short description of the news article.
13. pubDate: The published date of the news article.
14. content: Full content of the news article.
15. country: The country of the publisher.
16. category: The category assigned to the news article by Newsdata.io.
17. language: The language of the news article.
18. ai_tag: AI-classified tags or categories for a better understanding of the article. (Available only for professional and corporate users.)
19. sentiment: The overall sentiment of the news article (positive, negative, or neutral). (Available only for professional and corporate users.)
20. sentiment_stats: Statistics on the distribution of positive, negative, and neutral sentiments in the news article. (Available only for professional and corporate users.)
21. ai_region: AI-classified geographical region associated with a news article; it could be a city, district, county, state, country, or continent. (Available only for corporate users.)
22. nextPage: Page ID for the next page of results.

For our task, we will use only a few of these attributes: 'title', 'description', and 'content'. In my experience, combining 'title' + 'description' for model training works well, and it is also economical if we do annotation.

What is ‘title’?

It shows the title of the retrieved articles.

What is ‘description’?

The “description” object offers a brief description or summary of the news article.

What is ‘content’?

The 'content' object holds the full text of the retrieved news article, i.e., the complete details and information provided within the article. Figure 1 shows what text comes under the 'content' column, alongside the 'title' and 'description' text.

Figure 1: Content of the News
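
As a minimal sketch of this step, the snippet below keeps only the attributes we need and combines 'title' and 'description' into a single column for training and annotation. The file name and the combined column name 'text' are assumptions carried over from the fetching step, not part of the Newsdata.io response.

import pandas as pd

# Load the articles fetched earlier (file name from the fetching step)
df = pd.read_csv("news_results.csv")

# Keep only the attributes we actually need for model building
df = df[["title", "description", "content"]]

# Combine 'title' and 'description' into one column; 'text' is an
# illustrative column name, not something returned by the API
df["text"] = df["title"].fillna("") + ". " + df["description"].fillna("")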

2. Remove Non-English Sentences

The following are reasons why you might decide to remove non-English news articles:

  1. Difference in Language: While scraping news articles, it has been observed that even when the language parameter is set to English, articles in other languages sometimes come through. For a model to understand the context of a sentence, it must learn how sentences are formed in that particular language, since the rules of sentence formation are unique to every language. Moreover, most of the available pre-trained models are trained on English and achieve SOTA results there.
  2. Resource Constraints: Another issue is resources. You can train a model for different languages explicitly, but it will be difficult to tell whether the model is capturing the right context.
  3. Annotation: Furthermore, there is one more constraint to consider, which arises during annotation: LLMs are best trained on English text.

Therefore, it is better to drop the non-English news articles.

import pandas as pd
from langdetect import detect

def drop_non_english_sentences(df: pd.DataFrame, column_name: str) -> pd.DataFrame:
    """
    Filter a DataFrame to exclude rows containing sentences that are not in the English language.

    Parameters:
    - df (pandas.DataFrame): The input DataFrame.
    - column_name (str): The name of the column in which to check for English sentences.

    Returns:
    - pandas.DataFrame: A new DataFrame containing only rows with English sentences in the specified column.
    """
    def is_english_sentence(text: str) -> bool:
        """
        Check if a given text is in the English language.

        Parameters:
        - text (str): The text to check.

        Returns:
        - bool: True if the text is in English, False otherwise.
        """
        try:
            return detect(text) == 'en' if text.strip() else False
        except Exception:
            # langdetect raises an exception on text it cannot classify
            return False

    return df[df[column_name].apply(is_english_sentence)]
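
Note that langdetect is non-deterministic by default, so its predictions can vary between runs on short or ambiguous text. A small usage sketch, assuming the combined 'text' column from the earlier step:

from langdetect import DetectorFactory

# Fix the seed so language detection is reproducible across runs
DetectorFactory.seed = 0

# 'text' is the hypothetical combined title + description column
df = drop_non_english_sentences(df, "text")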

3. Remove Emoticons

Removing emoticons can be considered a good practice in text classification for several reasons:
1. Noise Reduction: Emoticons are often used to convey emotions or sentiments, which may not always be relevant to the specific task of text classification. By removing them, you reduce unnecessary noise in the text data, allowing the classifier to focus on more relevant linguistic features.
2. Normalization: Emoticons can vary widely in their representation (e.g., 😀, 😅, 🤬, etc.), which adds complexity to the text data. Removing them helps to normalize the text, making it more consistent and easier to process.
3. Focus on Textual Content: In many text classification tasks, the focus is on the textual content rather than non-verbal elements like emoticons. Removing emoticons ensures that the classifier is trained primarily on the linguistic features of the text, which are typically more informative for classification tasks.
4. Generalization: Emoticons might not generalize well across different domains or datasets. Their meanings can change based on context or cultural differences. Therefore, by removing them, you help the classifier learn patterns that are more likely to generalize across various texts and contexts.
5. Equal Treatment of Text: Treating text uniformly without considering emoticons ensures fairness and consistency in text processing. Emoticons might have varying frequencies across different texts, which could introduce bias if not handled consistently.
6. Avoiding Overfitting: In some cases, emoticons might be strong indicators of certain classes or categories in the training data. Allowing the classifier to rely too heavily on these non-textual features could lead to overfitting, where the model performs well on the training data but poorly on unseen data. Removing emoticons helps prevent such overfitting by encouraging the model to focus on more generalizable linguistic patterns.

import pandas as pd
import re

def preprocessing(df: pd.DataFrame, col_name: str) -> pd.DataFrame:
    """
    Preprocess the specified column of a DataFrame by removing emojis.
    Args:
        df (pd.DataFrame): The input DataFrame.
        col_name (str): The name of the column to preprocess.
    Returns:
        pd.DataFrame: The preprocessed DataFrame.
    """
    emoji_patterns = [
        "[\U0001F600-\U0001F64F]",  
        "[\U0001F300-\U0001F5FF]",  
        "[\U0001F680-\U0001F6FF]",  
        "[\U0001F1E0-\U0001F1FF]",  
        "[\U00002500-\U00002BEF]",  
        "[\U00002702-\U000027B0]",
        "[\U000024C2-\U0001F251]",
        "[\U0001f926-\U0001f937]",
        "[\U00010000-\U0010ffff]",
        "[\u2640-\u2642]",
        "[\u2600-\u2B55]",
        "[\u200d]",
        "[\u23cf]",
        "[\u23e9]",
        "[\u231a]",
        "[\ufe0f]",  
        "[\u3030]"
    ]
    emoji_pattern = re.compile("|".join(emoji_patterns))
    
    def remove_emojis(text: str) -> str:
        """
        Remove emojis from the given text.
        Args:
            text (str): The input text.
        Returns:
            str: The text with emojis removed.
        """
        # Non-string values (e.g. NaN) are returned unchanged
        if not isinstance(text, str):
            return text
        return emoji_pattern.sub('', text)
    
    df[col_name] = df[col_name].apply(remove_emojis)
    return df

4. Remove Short Sentences

Removing short sentences can be a good practice for transformer-based text classification models for several reasons:
1. Reducing Noise: Short sentences often contain less information and can introduce noise into the model’s training process. By removing them, you reduce the chances of the model focusing on irrelevant or noisy patterns that may not generalize well to unseen data.
2. Improving Model Generalization: Longer sentences tend to contain more context and semantic information, which can help the model better understand the underlying meaning of the text. By prioritizing longer sentences, you encourage the model to learn more meaningful representations, potentially leading to better generalization performance on unseen data.
3. Enhancing Computational Efficiency: Training transformer-based models can be computationally expensive, especially when dealing with large datasets. Removing short sentences can reduce the computational burden by decreasing the number of input tokens the model needs to process during training and inference, without significantly sacrificing performance.
4. Better Handling of Context: Transformers rely on attention mechanisms to capture relationships between words in a sequence. Longer sentences provide more context for the model to attend to, allowing it to better capture dependencies and relationships between words and phrases.
5. Avoiding Overfitting: Short sentences may be prone to overfitting, as the model can potentially memorize patterns or idiosyncrasies specific to the training data without truly understanding the underlying concepts. Removing such sentences can mitigate the risk of overfitting and encourage the model to learn more robust and generalizable representations.
6. Better Annotation: The LLM tries to understand the context of the text, and a better understanding yields better results for the prompt you provide. When a sentence is short, it has been observed that the news item lacks sufficient context for anyone to grasp its meaning and assign the relevant class.

In my experience, in the context of news data, sentences of 10 words or fewer do not provide much context, so it is better to remove them.

import pandas as pd

def drop_short_rows(data_frame: pd.DataFrame, column_name: str) -> pd.DataFrame:
    """
    Drop rows from a DataFrame where the text in the specified column has 10 or fewer words.

    Args:
        data_frame (pd.DataFrame): The DataFrame to process.
        column_name (str): The name of the column containing the text.

    Returns:
        pd.DataFrame: A new DataFrame with short rows removed.
    """
    # Create a mask to filter out rows with text containing 10 or fewer words
    mask = data_frame[column_name].apply(lambda text: len(str(text).split()) > 10)
    filtered_df = data_frame[mask]

    return filtered_df

5. Trimming Long Sentences

Trimming longer sentences is also an important step in NLP tasks, especially when we are using a transformer-based model. The following are reasons why you might decide to trim longer sentences:
1. Memory Efficiency: Transformer-based models have a limited memory capacity due to the self-attention mechanism used to process input sequences. Longer sentences require more memory, which can lead to memory limitations and slower processing times. Trimming longer sentences therefore helps to reduce memory usage and improve efficiency during model training and inference.
2. Reduced Computational Complexity: Processing longer sequences increases the computational complexity of the transformer model. Trimming longer sentences reduces the length of input sequences, resulting in lower computational overhead and faster processing times.
3. Avoidance of Information Overload: Longer sentences may contain extraneous information or noise irrelevant to the task at hand. Trimming these sentences focuses the model’s attention on the most important parts of the input, improving its ability to learn and make accurate predictions.
4. Mitigation of Gradient Vanishing/Exploding: Longer sequences can exacerbate the vanishing or exploding gradient problem during training, where gradients either become too small to update the model effectively or grow too large, leading to instability. Trimming longer sentences helps mitigate these issues by reducing the length of the sequences that gradients need to propagate through.
5. Improved Generalization: Trimming longer sentences encourages the model to learn more generalized patterns by focusing on the most salient information within the input. This can lead to better performance on unseen data and improved generalization capabilities.
6. Cost of Annotation: A popular option for annotating text is the OpenAI GPT models, which charge $0.0015/1K tokens for input and $0.0020/1K tokens for output (as of 29/02/2024). It is therefore important to use them optimally. In most news articles, reading the first few lines is enough to estimate which direction the story is going; a rough cost sketch follows below (you can see how to do annotation using OpenAI here).

Overall, trimming longer sentences helps to optimize the performance, efficiency, and stability of transformer-based models by reducing memory usage, computational complexity, and potential sources of noise or instability in the input data.
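
To make the cost argument concrete, here is a rough back-of-the-envelope estimate. The article count, word counts, and tokens-per-word ratio are illustrative assumptions; only the input price comes from the figure quoted above. The helper that actually performs the trimming follows right after.

# Rough annotation-cost estimate (illustrative numbers only)
articles = 10_000
tokens_per_word = 1.3               # common rule of thumb for English text
price_per_1k_input = 0.0015         # input price quoted above, in $ per 1K tokens

untrimmed_tokens = 300 * tokens_per_word * articles  # ~300-word articles
trimmed_tokens = 100 * tokens_per_word * articles    # trimmed to ~100 words

print(f"untrimmed: ${untrimmed_tokens / 1000 * price_per_1k_input:.2f}")  # ~$5.85
print(f"trimmed:   ${trimmed_tokens / 1000 * price_per_1k_input:.2f}")    # ~$1.95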

import pandas as pd

def trim_long_rows(dataframe: pd.DataFrame, column_name: str, max_words: int) -> pd.DataFrame:
    """
    Trim rows in the specified column that have more than max_words words.
    
    Args:
        dataframe (pd.DataFrame): The DataFrame containing the data.
        column_name (str): The name of the column to process.
        max_words (int): Maximum allowed number of words per row.
        
    Returns:
        pd.DataFrame: The updated DataFrame with trimmed rows.
    """
    if column_name not in dataframe.columns:
        return dataframe
    
    def trim_words(cell_value):
        if pd.isnull(cell_value):
            return cell_value
        words = str(cell_value).split()
        if len(words) > max_words:
            return " ".join(words[:max_words])
        return cell_value

    dataframe[column_name] = dataframe[column_name].apply(trim_words)
    
    return dataframe
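
Putting it all together, a minimal end-to-end sketch of the pipeline might look like the following. The file names, the combined 'text' column, and the 100-word cap are assumptions carried over from the earlier steps.

import pandas as pd

# Load the fetched articles and build the combined text column
df = pd.read_csv("news_results.csv")
df["text"] = df["title"].fillna("") + ". " + df["description"].fillna("")

# Apply the preprocessing steps discussed above, in order
df = drop_non_english_sentences(df, "text")      # 2. remove non-English rows
df = preprocessing(df, "text")                   # 3. remove emoticons
df = drop_short_rows(df, "text")                 # 4. drop rows with 10 or fewer words
df = trim_long_rows(df, "text", max_words=100)   # 5. trim very long rows

df.to_csv("news_preprocessed.csv", index=False)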

NOTE: You can also access the code mentioned in this blog in this GitHub repository.

Conclusion

Data preprocessing is a crucial step in building effective machine-learning models for textual data, particularly news articles. Each technique covered in this blog post serves a specific purpose in improving data quality and model performance: reducing noise, ensuring consistency, and optimizing computational resources.

By applying these preprocessing techniques, one can ensure that their textual data is well-prepared for analysis and modelling, leading to improved performance and insights.
