Digital news sources have grown at a tremendous pace, from a handful of online publications to a vast number of outlets. As these publications cover an ever wider range of issues and events, their scope keeps expanding. They not only represent the world but also shape our perceptions of it.
Archiving news data is now commonplace because of the high demand for instant access to historical news, for which people commonly use a news API. News datasets are useful for research and for personal and professional Artificial Intelligence (AI) and Machine Learning (ML) projects.
If you’re looking for historical news data to feed your AI and ML algorithms, you can use the free news datasets listed below or the Newsdata.io tool. News datasets can help you find a wide range of historical stories related to any topic, organization, or person.
In this post, we are going to discuss an easy and reliable way to get access to historical news datasets. Let’s get right into it.
Before we dive into the compilation of free news datasets, let’s discuss a great tool to get relevant news datasets.
Download worldwide News data using Newsdata.io
It’s easy to get access to Newsdata.io’s news datasets. You can search news by keyword, category, and timeline. Just follow these simple steps:
- Go to Newsdata.io
- Enter the keyword you want to search
- Download the results in CSV and XLSX formats
- If you need custom news datasets, fill out the request news data form
- Once you have submitted the form, our team will get back to you with the pricing and dataset specifications.
After some time, you’ll receive your news dataset along with its details.
Here are the top 40 news datasets that you can download for free for your personal and professional AI, machine learning, and data analysis projects.
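Once you have a CSV export, a few lines of Python are enough to start exploring it. The sketch below is only an illustration: the column names (`title`, `pubDate`, `source`) and the sample rows are assumptions rather than the actual export layout, so adjust them to match the header of the file you download.

```python
import csv
import io
from collections import Counter

# Hypothetical sample standing in for a downloaded CSV export.
# The column names ("title", "pubDate", "source") are assumptions.
SAMPLE_CSV = """title,pubDate,source
Vaccine rollout expands,2021-10-01 08:00:00,example-news
Markets rally on earnings,2021-10-01 09:30:00,example-biz
New vaccine trial results,2021-10-02 11:15:00,example-news
"""

def load_articles(file_obj):
    """Read a CSV export into a list of dicts, one per article."""
    return list(csv.DictReader(file_obj))

def filter_by_keyword(articles, keyword, field="title"):
    """Keep articles whose `field` contains `keyword` (case-insensitive)."""
    needle = keyword.lower()
    return [a for a in articles if needle in (a.get(field) or "").lower()]

def count_by_day(articles, date_field="pubDate"):
    """Count articles per calendar day, assuming dates start with YYYY-MM-DD."""
    return Counter((a.get(date_field) or "")[:10] for a in articles)

articles = load_articles(io.StringIO(SAMPLE_CSV))
vaccine_articles = filter_by_keyword(articles, "vaccine")
per_day = count_by_day(vaccine_articles)
```

To run this on a real download, replace the in-memory sample with a file handle, for example `open("results.csv", newline="", encoding="utf-8")`, where `results.csv` is whatever name you saved the export under.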
1. Newsdata.io
Name- Covid-19 news dataset
Link- https://newsdata.io/files/datasets/covid19-news
This Covid-19 dataset contains the latest world news related to Coronavirus.
2. Kaggle.com
Name- BBC News Classification (News article categorization)
Link- https://www.kaggle.com/c/learn-ai-bbc
The dataset is broken into 1490 records for training and 735 for testing. The goal will be to build a system that can accurately classify previously unseen news articles into the right category.
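To give a rough idea of what such a classifier looks like, here is a minimal sketch of a multinomial Naive Bayes text classifier written with the Python standard library only. It is not the method prescribed by the competition; Naive Bayes is simply a common baseline for this kind of task. The class name, the toy sentences, and the two labels are made up for illustration and stand in for the real training records.

```python
import math
import re
from collections import Counter, defaultdict

def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())

class NaiveBayes:
    """Tiny multinomial Naive Bayes with add-one (Laplace) smoothing."""

    def __init__(self):
        self.class_counts = Counter()            # documents per label
        self.word_counts = defaultdict(Counter)  # token counts per label
        self.vocab = set()

    def fit(self, texts, labels):
        for text, label in zip(texts, labels):
            self.class_counts[label] += 1
            for word in tokenize(text):
                self.word_counts[label][word] += 1
                self.vocab.add(word)
        return self

    def predict(self, text):
        total_docs = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, n_docs in self.class_counts.items():
            # Log prior plus the sum of smoothed log likelihoods.
            score = math.log(n_docs / total_docs)
            denom = sum(self.word_counts[label].values()) + len(self.vocab)
            for word in tokenize(text):
                if word in self.vocab:  # ignore words never seen in training
                    score += math.log((self.word_counts[label][word] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label

# Toy training data standing in for the real labeled articles (hypothetical).
train_texts = [
    "The striker scored twice in the cup final",
    "The coach praised the team after the match",
    "Shares fell as the bank reported weaker profits",
    "The company raised its profit forecast for the year",
]
train_labels = ["sport", "sport", "business", "business"]

model = NaiveBayes().fit(train_texts, train_labels)
sport_prediction = model.predict("The team won the match")
business_prediction = model.predict("Bank profits rose")
```

On the real data you would fit the model on the training records and measure accuracy on articles held out from training. A library implementation would normally be a better choice than this hand-rolled version once you move past experimentation.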
3. BBC
Name- BBC datasets
Link- http://mlg.ucd.ie/datasets/bbc.html
Two news article datasets, originating from BBC News, provided for use as benchmarks for machine learning research.
4. Harvard Dataverse
Name- A Million News Headlines
Link- https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/SYBGZL
This dataset contains news headlines published over a period of eighteen years, sourced from the reputable Australian news source ABC (Australian Broadcasting Corporation).
5. Newsdata.io
Name- Covid-19 and vaccine news dataset
Link- https://newsdata.io/files/datasets/covid-vaccine-news
This dataset contains the latest published news headlines from across the web, with all the metadata and full descriptions.
6. Webz.io
Name- Political news articles
Link- https://webz.io/free-datasets/political-news-articles/
This dataset contains news articles on world politics, fetched with the Webz.io news API.
7. Paperswithcode
Name- COVID-19 Fake News Dataset
Link- https://paperswithcode.com/dataset/covid-19-fake-news-dataset
Along with the COVID-19 pandemic, we are also fighting an ‘infodemic’. Fake news and rumors are rampant on social media. Believing in rumors can cause significant harm.
8. Kaggle
Name- India News Headlines Dataset
Link- https://www.kaggle.com/therohk/india-headlines-news-dataset
This news dataset is a persistent historical archive of notable events in the Indian subcontinent from the start of 2001 to the end of 2020, recorded in real time by the journalists of India. It contains approximately 3.4 million events published by Times of India.
9. Data.world
Name- Economic News Article Tone
Link- https://data.world/crowdflower/economic-news-article-tone
Contributors read snippets of news articles. They then noted if the article was relevant to the US economy and, if so, what the tone of the article was.
10. Archive.org
Name- World Politics news dataset
Link- https://archive.org/details/world-politics-news-dataset
This dataset contains the latest news related to politics around the world with the available news article’s metadata.
11. Archive.org
Name- Covid-19 News
Link- https://archive.org/details/covid-news_202110
This dataset contains the latest news related to Covid-19 from the world.
12. Archive.org
Name- Sports news dataset
Link- https://archive.org/details/sports-news
This dataset contains the latest news articles related to world sports.
13. Zenodo.org
Name- dataset for fake news detection
Link- https://zenodo.org/record/4561253
This dataset contains around 200k news headlines from the year 2012 to 2018 obtained from HuffPost. The model trained on this dataset could be used to identify tags for untracked news articles or to identify the type of language used in different news articles.
14. Archive.ics.uci.edu
Name- News aggregator dataset
Link- https://archive.ics.uci.edu/ml/datasets/News+Aggregator
News items are grouped into clusters that represent pages discussing the same news story. The dataset also includes references to web pages that, at the time of access, pointed to one of the news pages in the collection.
15. Data.4tu.nl
Name- fake news detection datasets
Link- https://data.4tu.nl/articles/dataset/Repository_of_fake_news_detection_datasets/14151755
The dataset contains a list of twenty-seven freely available evaluation datasets for fake news detection analysed according to eleven main characteristics.
16. Ieee-dataport.org
Name- Fake News Inference Dataset
Link- https://ieee-dataport.org/open-access/fnid-fake-news-inference-dataset
This dataset is provided for the Fake News Detection task. Besides being usable in other fake news detection tasks, it can specifically be used to detect fake news with Natural Language Inference (NLI).
17. Hugging Face
Name- AG News
Link- https://huggingface.co/datasets/ag_news
AG is a collection of more than 1 million news articles. News articles have been gathered from more than 2000 news sources by ComeToMyHead in more than 1 year of activity.
18. Data.gov.uk
Name- News consumption in the UK
Link- https://data.gov.uk/dataset/4166149f-f10f-4f60-8174-ad30ba99fa86/news-consumption-in-the-uk
The findings are published as part of our range of market research publications that examine people’s consumption of, and attitudes towards, different types of content on different platforms.
19. Kaggle
Name- COVID-19 Open Research Dataset Challenge (CORD-19)
Link- https://www.kaggle.com/allen-institute-for-ai/CORD-19-research-challenge
In response to the COVID-19 pandemic, the White House and a coalition of leading research groups have prepared the COVID-19 Open Research Dataset (CORD-19).
20. Archive.org
Name- Health news dataset
Link- https://archive.org/details/health-news_202110
This dataset contains all the latest news articles related to world health.
21. Newsdata.io
Name- World politics news dataset
Link- https://newsdata.io/files/datasets/world-politics-news
This dataset contains the latest news related to politics around the world with the available news article’s metadata.
22. Archive.org
Name- Latest news dataset
Link- https://archive.org/details/latest-news_202110
This dataset contains the latest news around the world with the available news article’s metadata.
23. Kaggle
Name- Health news dataset
Link- https://www.kaggle.com/newsdataio/health-news-dataset
This dataset contains all the latest health news from around the world with all the metadata available.
24. Kaggle
Name- Covid-19 and vaccine news dataset
Link- https://www.kaggle.com/newsdataio/covid19-and-vaccine-news-dataset
This dataset contains all the latest news related to Covid-19 and vaccines from around the world with all the available metadata.
25. Ieee-dataport.org
Name- FAKE NEWS ON COVID19
Link- https://ieee-dataport.org/documents/covifn-fake-news-covid19
COVIFN is a COVID-19-specific dataset that consists of fact-checked fake news scraped from Poynter and true news from news publishers’ verified portals. The dataset was pre-processed by removing special characters and non-vital information.
26. Data.mendeley
Name- Event Detection Dataset
Link- https://data.mendeley.com/datasets/7d54rvzxkr/1
This is a manually labeled dataset for the task of Event Detection (ED). ED consists of identifying event triggers, that is, the words that most clearly indicate the occurrence of an event.
27. Figshare.com
Name- Fake and True News Dataset
Link- https://figshare.com/articles/dataset/Fake_and_True_News_Dataset/13325198
This dataset combines two parts: fake news and true news. The fake news was collected from Kaggle, and some of the true news was collected from IEEE DataPort.
28. Kaggle
Name- Online News Popularity Data Set
Link- https://www.kaggle.com/aahaan007/online-news-popularity
This dataset summarizes a heterogeneous set of features about articles published by Mashable in a period of two years. The goal is to predict the number of shares in social networks.
29. Getdata.io
Name- News Headlines
Link- https://getdata.io/data-sources/95591-news-google-com-news-headlines
Getdata.io monitors the news headlines from this website, which are then used to train its sentiment analysis engine to detect bad news about publicly traded companies.
30. Refinitiv
Name- Reuters Top News
Read the biggest business and political stories from around the world with Reuters Top News, providing a customized experience in an easy-to-use format.
31. IEEE.org
Name- Covid-19 and vaccine
Link- https://ieee-dataport.org/documents/covid-19-and-vaccine-news-dataset
This dataset contains world news related to Covid-19 and vaccines, along with the news articles’ available metadata.
32. IEEE.org
Name- World politics news
Link- https://ieee-dataport.org/documents/world-politics-news-dataset
This dataset contains world news related to politics, along with the news articles’ available metadata.
33. IEEE.org
Name- Covid-19 news
Link- https://ieee-dataport.org/documents/covid-19-news
This dataset contains all the latest news data related to Covid-19 from around the world.
34. IEEE.org
Name- COVIFN : FAKE NEWS ON COVID19
Link- https://ieee-dataport.org/documents/covifn-fake-news-covid19
COVIFN is a COVID-19-specific dataset that consists of fact-checked fake news scraped from Poynter and true news from news publishers’ verified portals. The dataset was pre-processed by removing special characters and non-vital information.
35. IEEE.org
Name- FAKE NEWS ON HEALTHCARE
Link- https://ieee-dataport.org/documents/fake-news-healthcare
The Internet is a vast repository of useful knowledge, but it has been contaminated by the spread of false information. Relying on misinformation can be disastrous. According to a World Health Organization survey, about 6,000 individuals were hospitalised throughout the world as a result of fake news on COVID-19 in the first three months of 2020.
36. IEEE.org
Name- NEWS CREDIBILITY DATASET
Link- https://ieee-dataport.org/documents/news-credibility-dataset
This dataset contains features of each news item according to seven credibility categories.
37. IEEE.org
Name- AI-Based automated extraction of entities, entity categories and sentiment on Covid-19 situation.
Artificial Intelligence (AI) based in-depth analysis of social media content would allow a strategic decision maker to obtain evidence-based responses to complex queries.
38. Kaggle
Name- Reddit Omicron Panic
Link- https://www.kaggle.com/yamqwe/reddit-omicron-panic
As we all know, a new variant of COVID-19 is spreading worldwide, causing massive panic. This dataset captures mentions of the new variant on Reddit.
39. Kaggle
Name- Omicron daily cases by country (COVID-19 variant)
Link- https://www.kaggle.com/yamqwe/omicron-covid19-variant-daily-cases
This dataset tracks the progression of the new Omicron COVID-19 variant.
40. IEEE.org
Name- Daily report of Covid-19 confirmed cases in Thailand.
Link- https://ieee-dataport.org/documents/daily-report-covid-19-confirmed-cases-thailand
This dataset contains a total of 578,375 confirmed COVID-19 cases reported in Thailand, recorded between 22 January 2021 and 30 July 2021.
Get customized news datasets with Newsdata.io
While there are several free news datasets available on the internet, they aren’t always specific to your needs. Researchers often need news datasets focused on a particular topic or event. This is where Newsdata.io can help you out. It is a tool that can help you compile news related to any topic, event, keyword, location, and more. It is a paid platform, and the basic paid plan starts at $49.99 a month.
About Newsdata.io
Newsdata.io is a paid news data compiling tool that allows you to download custom news datasets. You can search for relevant news in a repository of news compiled over the past 2 years from 3000 news sources.
- Get access to historical news datasets
- Download news related to any topic, keyword, event, and more
- Get news published by reputable news publications
- Get access to real-time news headlines and data
- Download news data in JSON and Excel formats
Conclusion
News datasets are essential for research, machine learning, AI projects, opinion mining, academic purposes, business intelligence, and more. Newsdata.io can give you access to this data with ease. You can access relevant historical news datasets or get live breaking news. If you are doubtful about the dataset quality, here’s a sample of the data. Click here to get started.
Shivam is a skilled content writer who is passionate about creating informative and engaging articles. He has a curious mind and a hunger for learning, and he loves uncovering fascinating facts across a wide range of subjects. He firmly believes that learning is a lifelong journey and is always looking for opportunities to expand his knowledge. Be sure to check out Shivam’s work for a great read.