
Digital news sources have grown at a tremendous pace, from just a handful of digital publications to a vast number of outlets. News publications now cover a much wider range of issues and events, which has broadened their scope. These publications not only represent the world but also shape our perception of it.

Archiving news data is now commonplace because of the high demand for instant access to historical news, for which people commonly use a news API. News datasets are useful for research and for personal and professional Artificial Intelligence (AI) and Machine Learning (ML) projects.

If you’re looking for historical news data to feed your AI and ML algorithms, you can use the free news datasets listed below or the Newsdata.io tool described in this post. News datasets can help you find a wide range of historical stories related to any topic, organization, or person.

In this post, we are going to discuss an easy and reliable way to get access to historical news datasets. Let’s get right into it.

Before we dive into the compilation of free news datasets, let’s discuss a great tool to get relevant news datasets.

Download worldwide News data using Newsdata.io

It’s easy to download Newsdata.io’s news datasets. You can search news by keyword, category, and time range. Just follow these simple steps.

  1. Go to Newsdata.io
  2. Enter the keyword you want to search
  3. Download the results in CSV or XLSX format
  4. If you need custom news datasets, fill out the request news data form
  5. Once you have submitted the form, our team will get back to you with the pricing and dataset specifications.

After some time, you’ll receive your news dataset along with its details.
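Once you have a CSV export from the steps above, a few lines of Python are enough to start exploring it. The sketch below is only an illustration: the sample data and the column names (`title`, `source`, `pub_date`) are assumptions, so check the header row of your actual file before reusing it.

```python
import csv
import io
from collections import Counter

# Hypothetical sample standing in for a downloaded CSV export.
# The column names (title, source, pub_date) are assumptions for
# illustration; a real export may use different headers.
SAMPLE_CSV = """title,source,pub_date
Vaccine rollout expands,Example Times,2021-10-01
Markets rally on earnings,Example Post,2021-10-01
New variant detected,Example Times,2021-10-02
"""


def load_articles(handle):
    """Read every row from an open CSV file handle into a list of dicts."""
    return list(csv.DictReader(handle))


def count_by(rows, column):
    """Count how many rows share each value of the given column."""
    return Counter(row[column] for row in rows)


articles = load_articles(io.StringIO(SAMPLE_CSV))
per_source = count_by(articles, "source")
print(len(articles), dict(per_source))
```

To read a real file instead of the sample string, open it with `open("export.csv", newline="", encoding="utf-8")` (the file name here is a placeholder) and pass the handle to `load_articles`.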

Here are 40 news datasets that you can download for free for your personal and professional AI, machine learning, and data analysis projects.

1. Newsdata.io 

Name- Covid-19 news dataset

Link- https://newsdata.io/files/datasets/covid19-news

This Covid-19 dataset contains the latest world news related to Coronavirus. 

2. Kaggle.com

Name- BBC News Classification (News article categorization)

Link- https://www.kaggle.com/c/learn-ai-bbc

The dataset is split into 1490 records for training and 735 for testing. The goal is to build a system that can accurately classify previously unseen news articles into the right category.
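To give a feel for what such a classifier involves, here is a minimal sketch of a multinomial Naive Bayes text classifier written with only the Python standard library. It is not the competition's reference solution. The four training sentences and the two labels are made up for illustration, and a real project would train on the dataset's own records and would more likely use an established ML library.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


class NaiveBayes:
    """Multinomial Naive Bayes with add-one (Laplace) smoothing."""

    def fit(self, texts, labels):
        self.word_counts = defaultdict(Counter)  # label -> word -> count
        self.doc_counts = Counter(labels)        # label -> number of documents
        self.vocab = set()
        for text, label in zip(texts, labels):
            tokens = tokenize(text)
            self.word_counts[label].update(tokens)
            self.vocab.update(tokens)
        return self

    def predict(self, text):
        tokens = tokenize(text)
        total_docs = sum(self.doc_counts.values())
        best_label, best_score = None, -math.inf
        for label, n_docs in self.doc_counts.items():
            counts = self.word_counts[label]
            denom = sum(counts.values()) + len(self.vocab)
            # Log prior plus the sum of smoothed log likelihoods.
            score = math.log(n_docs / total_docs)
            for tok in tokens:
                score += math.log((counts[tok] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Made-up toy examples; replace with the real training records.
train_texts = [
    "the striker scored a late goal in the match",
    "the team won the cup final",
    "shares fell as the bank reported losses",
    "the company posted record profit and revenue",
]
train_labels = ["sport", "sport", "business", "business"]

model = NaiveBayes().fit(train_texts, train_labels)
prediction = model.predict("the bank reported record profit")
print(prediction)
```

Working in log space avoids numeric underflow on long articles, and the add-one smoothing keeps a word that was never seen for a category from zeroing out that category's score.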

3. BBC 

Name- BBC datasets

Link- http://mlg.ucd.ie/datasets/bbc.html

Two news article datasets, originating from BBC News, provided for use as benchmarks for machine learning research.

4. Harvard Dataverse

Name- A Million News Headlines

Link- https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/SYBGZL

This dataset contains news headlines published over a period of eighteen years, sourced from the reputable Australian news source ABC (Australian Broadcasting Corporation).

5. Newsdata.io

Name- Covid-19 and vaccine news dataset

Link- https://newsdata.io/files/datasets/covid-vaccine-news

This dataset contains the latest published news headlines from across the web, with all the metadata and full descriptions.

6. Webz.io

Name- Political news articles

Link- https://webz.io/free-datasets/political-news-articles/

This dataset contains news articles related to world politics, fetched with the Webz.io news API.

7. Paperswithcode

Name- COVID-19 Fake News Dataset

Link- https://paperswithcode.com/dataset/covid-19-fake-news-dataset

Along with the COVID-19 pandemic, we are also fighting an ‘infodemic’. Fake news and rumors are rampant on social media, and believing them can cause significant harm.

8. Kaggle

Name- India News Headlines Dataset

Link- https://www.kaggle.com/therohk/india-headlines-news-dataset

This news dataset is a persistent historical archive of notable events in the Indian subcontinent from the start of 2001 to the end of 2020, recorded in real time by the journalists of India. It contains approximately 3.4 million events published by the Times of India.

9. Data.world

Name- Economic News Article Tone

Link- https://data.world/crowdflower/economic-news-article-tone 

Contributors read snippets of news articles. They then noted if the article was relevant to the US economy and, if so, what the tone of the article was.

10. Archive.org

Name- World Politics news dataset

Link- https://archive.org/details/world-politics-news-dataset

This dataset contains the latest news related to politics around the world, along with each article’s available metadata.

11. Archive.org

Name- Covid-19 News 

Link- https://archive.org/details/covid-news_202110

This dataset contains the latest news related to Covid-19 from around the world.

12. Archive.org 

Name- Sports news dataset

Link- https://archive.org/details/sports-news

This dataset contains the latest news articles related to world sports.

13. Zenodo.org

Name- dataset for fake news detection

Link- https://zenodo.org/record/4561253

This dataset contains around 200k news headlines from 2012 to 2018 obtained from HuffPost. A model trained on this dataset could be used to identify tags for untracked news articles or to identify the type of language used in different news articles.

14. Archive.ics.uci.edu

Name- News aggregator dataset 

Link- https://archive.ics.uci.edu/ml/datasets/News+Aggregator

News items are grouped into clusters that represent pages discussing the same news story. The dataset also includes references to web pages that, at the time of access, pointed to one of the news pages in the collection.
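The clustering idea can be sketched in a few lines: collect every headline that carries the same story identifier into one group. The rows below are invented, and the field names `story` and `title` are assumptions for illustration, so map them to the actual columns of the file you download.

```python
from collections import defaultdict

# Invented rows; the field names "story" and "title" are assumptions.
rows = [
    {"story": "s1", "title": "Fed holds rates steady"},
    {"story": "s2", "title": "New phone launched"},
    {"story": "s1", "title": "Central bank keeps rates unchanged"},
]


def group_by_story(rows):
    """Map each story identifier to the list of headlines in that cluster."""
    clusters = defaultdict(list)
    for row in rows:
        clusters[row["story"]].append(row["title"])
    return dict(clusters)


clusters = group_by_story(rows)
# The story with the most headlines is the most widely covered one.
largest = max(clusters, key=lambda story: len(clusters[story]))
print(largest, clusters[largest])
```

Grouping like this is a common first step before tasks such as deduplication or picking one representative headline per story.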

15. Data.4tu.nl

Name- fake news detection datasets

Link- https://data.4tu.nl/articles/dataset/Repository_of_fake_news_detection_datasets/14151755

The dataset contains a list of twenty-seven freely available evaluation datasets for fake news detection analysed according to eleven main characteristics.

16. Ieee-dataport.org

Name- Fake News Inference Dataset 

Link- https://ieee-dataport.org/open-access/fnid-fake-news-inference-dataset

This database is provided for the Fake News Detection task. In addition to other fake news detection tasks, it can be used specifically to detect fake news using Natural Language Inference (NLI).

17. Hugging Face

Name- AG News

Link- https://huggingface.co/datasets/ag_news

AG is a collection of more than 1 million news articles. News articles have been gathered from more than 2000 news sources by ComeToMyHead in more than 1 year of activity. 

18. Data.gov.uk

Name- News consumption in the UK

Link- https://data.gov.uk/dataset/4166149f-f10f-4f60-8174-ad30ba99fa86/news-consumption-in-the-uk

The findings are published as part of our range of market research publications that examine people’s consumption of, and attitudes towards, different types of content on different platforms. 

19. Kaggle

Name- COVID-19 Open Research Dataset Challenge (CORD-19)

Link- https://www.kaggle.com/allen-institute-for-ai/CORD-19-research-challenge

In response to the COVID-19 pandemic, the White House and a coalition of leading research groups have prepared the COVID-19 Open Research Dataset (CORD-19).

20. Archive.org 

Name- Health news dataset

Link- https://archive.org/details/health-news_202110

This dataset contains the latest news articles related to world health.

21. Newsdata.io

Name- World politics news dataset

Link- https://newsdata.io/files/datasets/world-politics-news

This dataset contains the latest news related to politics around the world, along with each article’s available metadata.

22. Archive.org 

Name- Latest news dataset

Link- https://archive.org/details/latest-news_202110

This dataset contains the latest news from around the world, along with each article’s available metadata.

23. Kaggle 

Name- Health news dataset

Link- https://www.kaggle.com/newsdataio/health-news-dataset

This dataset contains the latest health news from around the world, with all available metadata.

24. Kaggle 

Name- Covid-19 and vaccine news dataset

Link- https://www.kaggle.com/newsdataio/covid19-and-vaccine-news-dataset

This dataset contains the latest news related to Covid-19 and vaccines from around the world, with all available metadata.

25. Ieee-dataport.org

Name- FAKE NEWS ON COVID19

Link- https://ieee-dataport.org/documents/covifn-fake-news-covid19

COVIFN is a COVID-19-specific dataset that consists of fact-checked fake news scraped from Poynter and true news from news publishers’ verified portals. The dataset was pre-processed by removing special characters and non-vital information.

26. Data.mendeley

Name- Event Detection Dataset

Link- https://data.mendeley.com/datasets/7d54rvzxkr/1

This is a manually labeled dataset for the task of Event Detection (ED). The ED task consists of identifying event triggers, the words that most clearly indicate the occurrence of an event.

27. Figshare.com

Name- Fake and True News Dataset

Link- https://figshare.com/articles/dataset/Fake_and_True_News_Dataset/13325198

This dataset combines two parts, fake news and true news. The fake news was collected from Kaggle, and some of the true news was collected from IEEE DataPort.

28. Kaggle

Name- Online News Popularity Data Set

Link- https://www.kaggle.com/aahaan007/online-news-popularity

This dataset summarizes a heterogeneous set of features about articles published by Mashable in a period of two years. The goal is to predict the number of shares in social networks.

29. Getdata.io

Name- News Headlines

Link- https://getdata.io/data-sources/95591-news-google-com-news-headlines

Getdata.io monitors the news headlines from this website and uses them to train its sentiment analysis engine to detect bad news about publicly traded companies.

30. Refinitiv

Name- Reuters Top News

Link- https://www.refinitiv.com/en/financial-data/financial-news-coverage/political-news-feeds-analysis/reuters-top-news

Read the biggest business and political stories from around the world with Reuters Top News, providing a customized experience in an easy-to-use format.

31. IEEE.org 

Name- Covid-19 and vaccine 

Link- https://ieee-dataport.org/documents/covid-19-and-vaccine-news-dataset

This dataset contains world news related to Covid-19 and vaccines, along with each article’s available metadata.

32. IEEE.org

Name- World politics news 

Link- https://ieee-dataport.org/documents/world-politics-news-dataset

This dataset contains world news related to politics, along with each article’s available metadata.

33. IEEE.org 

Name- Covid-19 news

Link- https://ieee-dataport.org/documents/covid-19-news

This dataset contains the latest news data related to Covid-19 from around the world.

34. IEEE.org 

Name- COVIFN : FAKE NEWS ON COVID19

Link- https://ieee-dataport.org/documents/covifn-fake-news-covid19

COVIFN is a COVID-19-specific dataset that consists of fact-checked fake news scraped from Poynter and true news from news publishers’ verified portals. The dataset was pre-processed by removing special characters and non-vital information.

35. IEEE.org 

Name- FAKE NEWS ON HEALTHCARE

Link- https://ieee-dataport.org/documents/fake-news-healthcare

The Internet is a vast repository of useful knowledge, but it has been contaminated by the spread of false information. Relying on misinformation can be disastrous. According to a World Health Organization survey, about 6,000 individuals were hospitalised throughout the world as a result of fake news on COVID-19 in the first three months of 2020.

36. IEEE.org 

Name- NEWS CREDIBILITY DATASET

Link- https://ieee-dataport.org/documents/news-credibility-dataset

This dataset contains features of each news item, organized according to seven credibility categories.

37. IEEE.org 

Name- AI-Based automated extraction of entities, entity categories and sentiment on Covid-19 situation.

Link- https://ieee-dataport.org/documents/ai-based-automated-extraction-entities-entity-categories-and-sentiments-covid-19-situation

Artificial Intelligence (AI)-based in-depth analysis of social media content would allow a strategic decision maker to obtain evidence-based responses to complex queries.

38. Kaggle

Name- Reddit Omicron Panic

Link- https://www.kaggle.com/yamqwe/reddit-omicron-panic

As we all know, a new variant of COVID-19 is spreading worldwide, causing massive panic. This dataset captures mentions of the new variant on Reddit.

39. Kaggle

Name- Omicron daily cases by country (COVID-19 variant)

Link- https://www.kaggle.com/yamqwe/omicron-covid19-variant-daily-cases

This dataset tracks the progression of the new Omicron COVID-19 variant.

40. IEEE.org 

Name- Daily report of Covid-19 confirmed cases in Thailand.

Link- https://ieee-dataport.org/documents/daily-report-covid-19-confirmed-cases-thailand

This dataset contains a total of 578,375 confirmed COVID-19 cases reported in Thailand, recorded between 22 January 2021 and 30 July 2021.

Get customized news datasets with Newsdata.io

While there are several free news datasets available on the internet, they aren’t always specific to your needs. Researchers often need news datasets related to a particular topic or event. This is where Newsdata.io can help. It is a tool that lets you compile news related to any topic, event, keyword, location, and more. It is a paid platform, and the basic paid plan starts at $49.99 a month.

About Newsdata.io


Newsdata.io is a paid news data compiling tool that allows you to download custom news datasets. You can search for relevant news in a repository of news compiled over the past 2 years from 3000 news sources.

  • Get access to historical news datasets
  • Download news related to any topic, keyword, event, and more
  • Get news published by reputable news publications
  • Get access to real-time news headlines and data
  • Download news data in JSON and Excel formats
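For a JSON export like the one mentioned in the list above, filtering by keyword and date range takes only the standard library. This is a sketch under stated assumptions: the sample records and the field names `title` and `pub_date` are hypothetical, and a real export may be structured differently.

```python
import json
from datetime import date

# Hypothetical sample standing in for a downloaded JSON export.
# The field names (title, pub_date) are assumptions for illustration.
SAMPLE_JSON = """[
  {"title": "Election results announced", "pub_date": "2021-10-03"},
  {"title": "Vaccine booster approved", "pub_date": "2021-10-05"},
  {"title": "Vaccine supply update", "pub_date": "2021-09-20"}
]"""


def filter_articles(records, keyword, start, end):
    """Keep records whose title contains keyword and whose date is in [start, end]."""
    keyword = keyword.lower()
    return [
        record
        for record in records
        if keyword in record["title"].lower()
        and start <= date.fromisoformat(record["pub_date"]) <= end
    ]


records = json.loads(SAMPLE_JSON)
october_vaccine = filter_articles(
    records, "vaccine", date(2021, 10, 1), date(2021, 10, 31)
)
print([record["title"] for record in october_vaccine])
```

Parsing the dates with `date.fromisoformat` rather than comparing strings keeps the range check correct, provided the export stores dates in ISO `YYYY-MM-DD` form.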

Conclusion

News datasets are essential for research, machine learning, AI projects, opinion mining, academic purposes, business intelligence, and more. Newsdata.io can give you access to this data with ease. You can access relevant historical news datasets or get live breaking news. If you are doubtful about the dataset quality, here’s a sample of the data. Click here to get started.

