Ultimate Guide: 200+ Free Datasets for Data Science, Machine Learning, AI, NLP

Overview:

Datasets, in the realm of data science and machine learning, are the foundational building blocks for creating and training predictive models, conducting statistical analyses, and deriving meaningful insights from data. These collections of structured or unstructured data encompass a wide range of information, such as text, numbers, images, and more, gathered from various sources and domains.

Datasets serve as the raw material for data-driven decision-making and machine learning. Choosing the right dataset, then preparing and analyzing it diligently, is pivotal to the success of data science and AI projects.

Looking for a specific type of data for your project? We’re surfacing some of the most useful datasets to use in a wide range of data science projects from data analysis and visualization to machine learning and data cleaning.

Presented below are datasets spanning a wide spectrum, catering to domains such as Data Science, Machine Learning, AI, NLP, Data Analysis, Analytics, Education, Computer Vision, Pricing Optimization, Classification, and Pre-Trained Models.

Data Repositories:

Data Search Engines, Repositories

Anacode Chinese Web Datastore: A collection of crawled Chinese news and blogs in JSON format

Appen Open Source Datasets: Over 270 audio, image, video and text datasets in over 80 languages

AssetMacro: Historical data of macroeconomic indicators and market data

Awesome Public Datasets: A topic-centric list of HQ open datasets

AWS Public Data Sets: A centralized repository of public data sets

BigML Public Data Sources: A long list of sources of data that anyone can use

USA.gov: APIs and data feeds to help people find useful government information

DataPortals.org: A Comprehensive List of Open Data Portals from Around the World

Data.gov.uk: Find data published by central government, local authorities and public bodies to help you build products and services

Data Planet: The largest repository of standardized and structured statistical data

DataSF.org: Search hundreds of datasets from the City and County of San Francisco

Data.world: Discover and share data, connect with interesting people, and work together to solve problems faster

Europeana Data: Open metadata on 20 million texts, images, videos and sounds gathered by Europeana

GEO Gene Expression Omnibus: A curated, online resource for gene expression data browsing, query and retrieval

HitCompanies Datasets: Comprehensive data on random 10,000 UK companies sampled from HitCompanies, updated automatically using AI/Machine Learning

ICWSM 2009 Data Challenge: 44 million blog posts made between August 1st and October 1st, 2008

JMP Public Featured Datasets: Assorted public datasets from JMP

Kaggle Datasets: Explore, analyze, and share quality data

Linking Open Data: Making data freely available to everyone

LoveTheSales: The world’s biggest online sales marketplace

Lyst Fashion Data Trends: The industry’s trusted source for tracking fashion data trends

Million Song Dataset: A freely-available collection of audio features and metadata for a million contemporary popular music tracks

NASDAQ Data Link: A premier source for financial, economic and alternative datasets

NASA Space Science Data Coordinated Archive: NASA’s archive for space science mission data

Qlik Sense Data Sources: Connect and combine data from hundreds of data sources

Robert Shiller Data: Housing data, financial market data and more, from his book Irrational Exuberance

Sports Statistics: Data for soccer, NBA, NFL, NHL, and more

StatLib Datasets Archive: Datasets from Carnegie Mellon University

UCR Time Series Classification Archive: Datasets, papers, links, and code

UK Open Postcode Geo: We organise UK open data by location and signpost the source

United States Census Bureau: An assortment of US Census data

Virtual Screening of Bioassay Data: Bioassay datasets available for download, by Amanda Schierz, J.

Web Data Commons: Structured data from the Common Crawl, the largest web corpus available to the public

WorldData.AI: Connect your data to many of 3.5 Billion WorldData datasets and improve your Data Science and Machine Learning models! 

Yahoo Webscope Program: Reference library of interesting and scientifically useful datasets for non-commercial use by academics and other scientists

Yelp Open Dataset: An all-purpose dataset for learning; subset of Yelp businesses, reviews, and user data for use in personal, educational, and academic purposes

Google Dataset Search

Google’s data search engine is useful for finding datasets in a particular niche. This is a great starting point for both paid and free datasets from top sources around the web. Other useful Google sources are Google Trends and Google’s Public Data Directory.

Data.gov

Find all of the U.S. government’s free and open datasets here. This is a rich source for public economic data—like housing, wages, and inflation—as well as education, health, agriculture, and census data. With more than 300,000 datasets available, this repository is extremely helpful.

FiveThirtyEight

FiveThirtyEight might be best known for its data journalism. Fortunately, the site also makes most of the data it uses in its reporting open to the public. This is a great source for a wide range of data with a focus on politics, sports and culture.

UCI Machine Learning Repository

Check out the University of California Irvine’s repository, which features nearly 500 public datasets. This is a great source for clean, ready-to-model data in a wide range of niches, from a dataset of chickenpox cases to bank marketing data.

GitHub’s Awesome-Public-Datasets

This regularly updated library of datasets is a great place to start. The data is organized by category, with options like machine learning and software, and you’ll find quick links to sources.

Amazon Web Service Open Data Registry

Amazon’s registry provides public access to data from a range of organizations from the 1000 Genomes Project to NASA. You’ll also find helpful usage examples for many of the datasets, as well as project links for various organizations and groups.

Pew Internet

The Pew Research Center’s data repository focuses mainly on culture and media. In particular, you’ll find datasets and surveys covering media consumption, social media use, and demographic trends like this 2018 Twitter Survey.

data.world

data.world calls itself a “collaborative data community,” and the site has built a dedicated audience of data scientists who have collaborated on projects like social bot detection and data journalism. You’ll find datasets in a range of categories, from crime to Twitter.

COVID datasets

There’s a plethora of regularly updated public COVID data available online. Some of the best sources include CDC COVID Data Tracker and Our World In Data. For more niche projects, try the Coronavirus Tweets Database, featuring more than 1 billion Tweets, as well as The Marshall Project’s COVID cases in prisons datasets.

WHO Health Statistics

If you’re looking for healthcare data, start with the WHO’s Global Health Observatory repository. The platform features a variety of health-related statistics covering topics such as HIV/AIDS, vaccination rates, and malaria. If you want to build a machine learning health project, this is the source to use.

Academic Torrents

Academic Torrents is a database for large-scale datasets for research projects. The data is shared by researchers, and there’s a variety of interesting sources, including the classic Enron email dataset or the annotated New York Times text corpus, which contains 1.8 million articles.

Nasdaq Data Link

For FinTech machine learning projects, you’ll find a variety of finance-related datasets on Nasdaq Data Link. The site features both paid and free data. Some free datasets of note include Zillow Real Estate Data and Federal Reserve Economic Data. You’ll need to create an account to access the 20+ free sources, and numerous premium datasets are available as well. This is a great data source for a real estate data science project.

NASA Earth Data

NASA’s Earth Science Data Systems Program is a repository of the organization’s Earth science data. You’ll find datasets related to sea level rise, wildfire frequency, and tropical storms, among other interesting earth sciences insights. See the Data Pathfinders tool to learn how to source and access science datasets.

Datahub.io

Datahub is a wonderful source for open data. Jump to the Collections tab to browse datasets in various categories covering everything from climate change to football. You can also use the Find Data tool to search for relevant datasets.

Federal Bureau of Investigation Crime Data Explorer

For open data related to crime and law enforcement, this is one of the best sources for U.S. crime statistics. You can search by state or jump into various datasets including use of force or arrests.

Datasets for Machine Learning

Whether your focus is on predictions or classification, these datasets are not only intriguing but also invaluable for machine learning endeavors. They offer relatively clean data, well-suited for machine learning tasks, with an abundance of variables to aid in making predictions for the target column.

1. Stroke Prediction Dataset

Build a stroke prediction model with this handy dataset. The CSV contains patient information like gender, age, pre-existing conditions, and smoking status that can help you build a model.

2. Divorce Predictors Dataset

This dataset from the UCI Machine Learning Repository contains survey data from married couples. Use the data to identify predictive indicators of divorce or to build a divorce prediction model.

3. January Flight Delay Prediction Dataset

With data from more than 400,000 flights in January 2019 and January 2020, this data from the Bureau of Transportation Statistics is well suited for building a model for winter flight delays. This is a useful dataset for a regression data science project.

4. Twitter User Gender Classification

Can you predict gender from a Twitter user’s profile and tweets? Build models to answer that question with this dataset, which contains information on more than 20,000 Twitter users.

5. Mushroom Classification Dataset

This classic dataset from UCI is a great source for a classification data science project. One great project idea is to build a model that classifies mushrooms as poisonous or edible.

6. Credit Card Approval Prediction Dataset

This is a great dataset for a financial prediction model. Use the data to understand if an applicant is “good” or “bad.”

7. Water Quality Dataset

Use water quality metrics from nearly 4,000 bodies of water to predict whether the water is safe for consumption or not.

8. New Yorker Caption Contest Dataset

This dataset features ratings for submitted New Yorker caption contest entries. Get some ideas for using this data here.

9. MovieLens Ratings Database

There are numerous movie rating datasets available here, including one featuring 25 million ratings, making it a great source for building a recommendation engine.

10. San Francisco Restaurant Health Scores Data

This open dataset covers health inspection scores for restaurants in San Francisco. One option is to use the data to build a model to predict a restaurant’s repeat health scores.

11. Capital Bikeshare Dataset

This dataset is useful for demand forecasting projects. It contains bike rental data from a bike sharing program, including travel duration, departure and arrival locations, and weather data. This dataset is similar to one that’s used in the McKinsey data analytics take-home.

12. Cats vs Dogs Dataset

With more than 20,000 images of cats and dogs, this is one of the best datasets for beginner image classification projects. The data is already separated into training and testing datasets, and it’s already labeled.

13. MNIST Dataset

The MNIST dataset is a large database of handwritten digits. It’s widely used for image classification tasks, where the goal is to identify the digit based on the image.

14. Iris Flower Dataset

This popular dataset contains information about different types of iris flowers and their characteristics, such as petal length, petal width, and sepal length. The goal is to predict the species of the iris flower based on these characteristics.
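As a minimal sketch of that prediction task, the snippet below assigns a flower to the species whose average measurements are closest. The six rows are made-up illustrative values (petal length and petal width in cm), not records from the real dataset, and `fit_centroids` and `nearest_centroid` are hypothetical helper names. A real project would load the full dataset and most likely use a library classifier instead.

```python
from math import dist
from statistics import mean

# Illustrative (petal_length_cm, petal_width_cm) rows; NOT real Iris records.
TRAINING_ROWS = [
    ((1.4, 0.2), "setosa"),
    ((1.5, 0.3), "setosa"),
    ((4.5, 1.4), "versicolor"),
    ((4.7, 1.5), "versicolor"),
    ((5.8, 2.2), "virginica"),
    ((6.0, 2.4), "virginica"),
]


def fit_centroids(rows):
    """Average the feature vectors of each species into one centroid."""
    grouped = {}
    for features, species in rows:
        grouped.setdefault(species, []).append(features)
    return {
        species: tuple(mean(column) for column in zip(*vectors))
        for species, vectors in grouped.items()
    }


def nearest_centroid(centroids, features):
    """Return the species whose centroid is closest in Euclidean distance."""
    return min(centroids, key=lambda species: dist(centroids[species], features))


centroids = fit_centroids(TRAINING_ROWS)
prediction = nearest_centroid(centroids, (1.6, 0.25))
print(prediction)
```

A nearest-centroid rule is used here only because it fits in a few lines; it is a stand-in for whichever classifier you choose, not a recommendation.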

Datasets for Data Visualization

Datasets play a pivotal role in the realm of data visualization. They serve as the canvas upon which insights are painted, and stories are told. This abstract explores the significance of datasets in data visualization, emphasizing their role in shaping the narratives, enabling informed decision-making, and enhancing the understanding of complex data. We delve into the diversity of datasets suited for visualization, their sources, quality considerations, and the transformative power they hold in converting raw information into compelling visual stories. In an era where data-driven insights are paramount, harnessing the right datasets is the key to unlocking the true potential of data visualization.

1. X Nodes Dataset (Formerly Twitter)

With more than 11 million nodes and 85 million edges, this dataset is useful for building graphical relationship models of X users.

2. Hotel Booking Demand Data

This is a great dataset for visualizing hotel bookings. You’ll be able to build visualizations that answer questions like:

When’s the best time of year to book?

How long is the optimal stay length to receive the best rate?

3. Amazon Top 50 Bestselling Books 2009-2019

Design visualizations that show top authors, best-selling titles, and review ratings for the best-selling books on Amazon.

4. COVID Jobs Impact & Hiring Data

Visualize the impact COVID is having on hiring with this dataset from the Amazon Open Data Registry. It features regularly updated hiring data from 3+ million job organizations.

5. Latest Polls from FiveThirtyEight

If you’re interested in political visualizations, FiveThirtyEight is one of the best data sources. Its updated polling data is great for visualizing averages and polling movements.

6. U.S. International Trade in Goods and Services 1960-Present

Build charts to visualize the United States’ international trade, including top imports, top exports, and annual trade balances.

7. Euro Exchange Rates

This dataset is useful for Matplotlib visualizations. You can create visualizations of exchange rates and currency valuations over time. The dataset features more than 20 years of daily exchange rate data.

8. San Francisco Public Library Usage Data

There are more than 400,000 records in this dataset, featuring daily circulation for the San Francisco library system. You can build visualizations related to new acquisitions, most checked out authors, most checked out titles, etc.

9. Trending YouTube Videos Data

This Kaggle dataset features daily trending video data from YouTube. Trending videos aren’t necessarily the most watched, but are generally the most interacted-with videos. Visualizations include most popular videos of the year or month or most trending videos by artist/creator.

10. World Unemployment Dataset

This dataset features more than 31 years of unemployment data for numerous countries around the world. There is a wide range of visualizations you can create, including comparisons of countries, unemployment rates over time, or countries with the lowest unemployment.

11. NYC Subway Entries and Exits Data

This dataset originated on New York State Open Data and features information by station, line, location, etc. You can use this dataset to build visualizations of popular lines or subway maps.

12. 2021 Tokyo Olympics Dataset

This dataset contains information on more than 10,000 athletes in 40+ sports, and it’s a great source for building country medal count visualizations. There’s also coaching data, so you can add medal information by coach.
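Most of the visualizations suggested in this list start with the same step: aggregate rows into counts, then draw them. The sketch below shows that step for a country medal count, using hypothetical rows and placeholder country codes rather than real Olympic results. The text bar chart is only a stand-in for a plotting library, and `medal_counts` and `text_bar_chart` are illustrative names.

```python
from collections import Counter

# Hypothetical (athlete, country_code, medal) rows standing in for the real file.
MEDAL_ROWS = [
    ("athlete_1", "AAA", "gold"),
    ("athlete_2", "AAA", "silver"),
    ("athlete_3", "BBB", "gold"),
    ("athlete_4", "AAA", "bronze"),
    ("athlete_5", "CCC", "bronze"),
    ("athlete_6", "BBB", "silver"),
]


def medal_counts(rows):
    """Count medals per country code."""
    return Counter(country for _, country, _ in rows)


def text_bar_chart(counts):
    """Render one line per country, most medals first."""
    width = max(len(country) for country in counts)
    return "\n".join(
        f"{country:<{width}} {'#' * total} {total}"
        for country, total in counts.most_common()
    )


counts = medal_counts(MEDAL_ROWS)
chart = text_bar_chart(counts)
print(chart)
```

Once the counts are in a dictionary-like object, passing the keys and values to a bar-chart function in your plotting library of choice is usually a one-line change.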

Datasets for Exploratory Data Analysis

Exploratory Data Analysis (EDA) is the compass that guides data scientists and analysts through the uncharted territory of raw data. At its core lies the understanding that data, in its raw form, often conceals valuable insights and patterns waiting to be unearthed. Datasets for Exploratory Data Analysis are the starting point, the foundation upon which this transformative journey begins.

1. Netflix Original Films & IMDB Scores

A fun dataset to explore, and great for beginners, this features all of the Netflix original movies up to June 1, 2020 and their corresponding IMDb scores.

2. Superstore Sales Dataset

Featuring 4 years of data from a superstore, this dataset is perfect for analyzing and identifying trends, as well as sales forecasting.

3. Marketing Analytics Data

This dataset is made up of mock marketing analytics data used by master’s in business analytics students. A great source for a marketing analytics project.

4. Animal Shelter Analytics Data

This is a great dataset for surfacing actionable insights for animal shelters, including what factors led to successful outcomes for the animals.

5. Why Americans Don’t Vote: Non-Voter Data

Another FiveThirtyEight dataset, this one features survey data from non-voters in the U.S. A few project ideas are identifying key factors that result in non-voting or building a voting likelihood model.

6. Website Crawling Data

A sprawling dataset hosted on Amazon Web Services, the Common Crawl corpus features crawl data from billions of web pages. Check out the Example Projects page for ideas.

7. European Soccer Dataset

This is a useful dataset for a sports analytics project. Featuring data on more than 20,000 matches, as well as individual stats from 2008 to 2016, this is great for exploratory data analysis projects on line-ups, team stats, wins, and individual player stats.

8. Open Food Data

This large-scale dataset, which was originally developed in 2018, features product information for more than 600,000 food items. Data includes allergens, ingredients, and nutrition facts, and there are a wide range of data analytics projects you can do with it.

9. Social Media Influence Survey Data

This is a useful marketing analytics dataset that features survey data from 2,500+ millennials. The survey asked which social platform has influenced your online shopping the most.

10. Google Analytics Dataset

This dataset features Google Analytics metrics from Austin, TX’s website. This is a great dataset for working in Google Analytics or analyzing website traffic.

11. Uber Pickup Data for NYC

This dataset features more than 20 million metrics on Uber pickups in NYC in 2014 and 2015. This is great for an exploratory data analysis or analytics project, and you can gather insights into popular pickup locations, common trip routes, and the locations with the longest pickups.

12. Marketing Analytics Dataset

This dataset is a great source for a campaign budget optimization project or for diving into an exploratory data analysis for marketing analytics projects.

13. World Bank Dataset

This dataset contains a wide range of economic and social indicators for countries around the world, including information about their GDP, population, and education levels.

14. Data Science Salaries

This dataset contains salaries for roles in the data science field for the year 2023. You can group the data by domain, years of experience, and even by country of employment, allowing many angles for exploratory analysis.
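Grouping by a column and summarizing each group is the core move in this kind of exploratory analysis. The sketch below computes a median salary per group using only the standard library. The records, the field names (`experience_level`, `country`, `salary_usd`), and the helper `median_by` are assumptions made for illustration; the real file's columns may be named differently.

```python
from collections import defaultdict
from statistics import median

# Hypothetical records; field names and values are illustrative only.
SALARY_ROWS = [
    {"experience_level": "entry", "country": "X", "salary_usd": 60000},
    {"experience_level": "entry", "country": "Y", "salary_usd": 70000},
    {"experience_level": "senior", "country": "X", "salary_usd": 120000},
    {"experience_level": "senior", "country": "Y", "salary_usd": 150000},
    {"experience_level": "senior", "country": "X", "salary_usd": 130000},
]


def median_by(rows, key, value="salary_usd"):
    """Group rows by `key` and return the median of `value` for each group."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[key]].append(row[value])
    return {group: median(values) for group, values in groups.items()}


by_level = median_by(SALARY_ROWS, "experience_level")
by_country = median_by(SALARY_ROWS, "country")
print(by_level)
print(by_country)
```

The median is used rather than the mean because salary data tends to be skewed by a few very high values; swapping in another summary function is straightforward.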

Datasets for Natural Language Processing

In the realm of Natural Language Processing (NLP), datasets are the bedrock upon which the foundations of language understanding and communication between humans and machines are built. NLP, a subfield of artificial intelligence, thrives on the rich and intricate tapestry of human language, and datasets for Natural Language Processing serve as the threads that weave this tapestry together.

These datasets are not mere collections of words and phrases; they are gateways to the profound complexities of language—its nuances, ambiguities, and cultural context. Whether it’s sentiment analysis, machine translation, chatbots, or language generation, these datasets are the linguistic fuel that powers NLP models and applications.

1. Starbucks Reviews Dataset – This dataset contains a comprehensive collection of consumer reviews and ratings for Starbucks, a renowned coffeehouse chain. The data was collected through web scraping and includes textual reviews, star ratings, location information, and image links from multiple pages on the ConsumerAffairs website.

2. LinkedIn Job Postings – 2023 – This dataset contains a nearly comprehensive record of 15,000+ job postings listed over the course of 2 days. Each individual posting contains 27 valuable attributes, including the title, job description, salary, location, application URL, and work-types (remote, contract, etc), in addition to separate files containing the benefits, skills, and industries associated with each posting.

3. COVID-19 Open Research Dataset Challenge (CORD-19) – An AI challenge with AI2, CZI, MSR, Georgetown, NIH & The White House

4. Brazilian E-Commerce Public Dataset by Olist – This is a Brazilian ecommerce public dataset of orders made at Olist Store. The dataset has information on 100k orders from 2016 to 2018 made at multiple marketplaces in Brazil. Its features allow viewing an order from multiple dimensions: from order status, price, payment and freight performance to customer location, product attributes and finally reviews written by customers. Olist also released a geolocation dataset that relates Brazilian zip codes to lat/lng coordinates.

5. Fake and real news dataset – Classifying the news

6. Amazon Reviews for Sentiment Analysis – This dataset consists of a few million Amazon customer reviews (input text) and star ratings (output labels) for learning how to train fastText for sentiment analysis.

7. Coronavirus tweets NLP – Text Classification – Perform text classification on the data. The tweets were pulled from Twitter and then manually tagged.

8. Emotions dataset for NLP – A collection of documents and their emotions. It helps greatly in NLP classification tasks.

9. Handwriting Recognition – This dataset consists of more than four hundred thousand handwritten names collected through charity projects.

10. Reddit Vaccine Myths Dataset

An interesting dataset for performing sentiment or text analysis, this features thousands of posts from the popular subreddit Vaccine Myths.

11. Wikibooks Dataset

There are more than 270,000 book chapters in 12 languages in this dataset. It’s perfect for performing a wide range of NLP tasks like text parsing, text generation, or semantic analysis.

12. Spam Clickbait Headlines Catalog

Featuring 3+ million headlines from the now-defunct tabloid The Examiner, this is a great place to start an NLP news analysis Python project.

13. TripAdvisor Hotel Reviews

Explore thousands of hotel reviews from TripAdvisor and build semantic prediction or topic clustering models.

14. A Million News Headlines

Another helpful medium source, this features headlines from nearly 20 years. It’s a great dataset for performing latent semantic analysis or latent Dirichlet allocation tasks.

15. Disneyland Reviews Data

With more than 40,000 reviews from three Disneyland locations, this is a great data source for performing sentiment analysis.

16. Amazon Product Review Dataset

This dataset, which is a classic that was produced in 2009, features star ratings for numerous Amazon products.

17. Rotten Tomatoes Sentiment Dataset

The Stanford Sentiment Treebank contains more than 10,000 Rotten Tomatoes files and provides sentiment annotations on a 25-point scale.

18. Twitter Airline Sentiment Data

This dataset features thousands of airline reviews on X (Formerly Twitter) from February 2015. The data has already been classified as positive, negative, or neutral, and in some instances, includes a reason for the negative tweet.

19. IMDB Sentiment Dataset

Featuring 25,000 movie reviews, you can use this dataset for a binary classification project or to analyze movie review ratings by title.

20. Enron Email Dataset

This classic NLP dataset has been studied and written about numerous times, and it’s great for text classification and analysis projects.
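Several entries in this list come with labels already attached (positive or negative reviews, tagged tweets), which is what makes them suitable for text classification. The sketch below shows the idea with a small multinomial naive Bayes classifier using Laplace smoothing. The four training snippets are made up for illustration and are not taken from any dataset above; `train` and `predict` are hypothetical helper names, and a real project would normally rely on an established NLP or machine learning library.

```python
import math
from collections import Counter

# Made-up labeled snippets; NOT taken from any dataset listed above.
TRAINING_REVIEWS = [
    ("great film loved the acting", "positive"),
    ("wonderful story and great cast", "positive"),
    ("boring plot and terrible acting", "negative"),
    ("terrible pacing i hated it", "negative"),
]


def train(reviews):
    """Collect per-label word counts, per-label document counts, and the vocabulary."""
    word_counts = {}
    doc_counts = Counter()
    for text, label in reviews:
        doc_counts[label] += 1
        word_counts.setdefault(label, Counter()).update(text.split())
    vocabulary = {word for counts in word_counts.values() for word in counts}
    return word_counts, doc_counts, vocabulary


def predict(model, text):
    """Return the label with the highest log-probability under naive Bayes."""
    word_counts, doc_counts, vocabulary = model
    total_docs = sum(doc_counts.values())
    scores = {}
    for label, counts in word_counts.items():
        total_words = sum(counts.values())
        score = math.log(doc_counts[label] / total_docs)  # class prior
        for word in text.split():
            if word in vocabulary:  # skip words never seen in training
                # Laplace (add-one) smoothing avoids zero probabilities.
                score += math.log((counts[word] + 1) / (total_words + len(vocabulary)))
        scores[label] = score
    return max(scores, key=scores.get)


model = train(TRAINING_REVIEWS)
print(predict(model, "great acting"))
print(predict(model, "terrible boring film"))
```

With only four training snippets the model is a toy; the point is the shape of the pipeline (tokenize, count, score), which stays the same when the input is tens of thousands of labeled reviews.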

Datasets for Computer Vision

In the dynamic landscape of Computer Vision, datasets are the visual canvases that empower machines to perceive, interpret, and interact with the visual world that surrounds us. This field, a cornerstone of artificial intelligence, marries the power of algorithms with the richness of visual data, and datasets for Computer Vision are the lenses through which machines gain visual acumen.

1. VoxCeleb Speech Corpus

The VoxCeleb large-scale dataset features audio-visual data from 7,000 speakers. It’s a great dataset for performing emotion recognition, speaker recognition, or talking face synthesis.

2. Face Mask Detection Database

There are about 900 images in this dataset of people wearing facemasks. You can use this to build models to detect if someone is wearing a mask, not wearing a mask, or wearing a mask improperly.

3. Unsplash Open Library

This rich visual-text dataset is loaded with helpful information. Use the photos for object detection. A bonus: there are millions of keywords and metadata you can use for exploratory data analysis projects as well.

4. CheXpert: Chest X-Rays from Stanford AIMI

This dataset from Stanford features 200,000+ chest radiographs. Build a model to detect pathologies and see how well your model performs against radiologists.

5. Pokemon Images and Types

There are thousands of images of Pokemon characters in this dataset. Use the data to build a prediction model to determine the type of Pokemon based on the image.

6. ImageNet Image Database

A classic image dataset from Stanford, you’ll find more than 14 million images here. This is one of the best datasets for performing object recognition tasks.

7. Stanford Dogs Dataset

Featuring more than 20,000 photos of dogs, this is a useful dataset for building classification models or a dog breed image classifier project.

8. Cityscapes Dataset

Featuring more than 5,000 images with fine annotations, as well as 20,000 images with coarse annotations, this is one of the best datasets for understanding urban street scenes at the pixel level.

9. Yale Face Database

This is a smaller dataset, featuring 165 images of 11 subjects. For each subject, there are images with various expressions and configurations, for example, “sleepy” or “without glasses.”

10. Fashion MNIST Dataset

Similar to the MNIST handwritten digits dataset, this image set includes a training set of 60,000 images of articles of clothing along with a test set of 10,000 images. Each image carries one of 10 class labels, such as “bag” or “trouser.” This is useful for testing machine learning models.

11. Indoor Scene Recognition

Image recognition indoors is more difficult, and this dataset, which features 15,000+ images of indoor scenes, is useful for building indoor recognition models.

Datasets for Pricing Optimization

Pricing optimization is one of the most important levers for increasing revenue with data. Try to identify prices that maximize revenue for these different products and environments:

1. Online Retail Dataset:

Over 1 million transactions from an online retailer, including customer data, product data, and transaction data.

2. Avocado Prices:

Weekly retail prices and volume data for avocados in various US markets from 2015 to 2018.

3. Beer Consumption in Sao Paulo:

Data on beer consumption and prices in Sao Paulo, Brazil, from 2015 to 2018.

4. New York Airbnb Dataset:

Information on Airbnb listings in New York City, including listing prices and attributes such as location, number of bedrooms, and amenities.

5. Uber Pickups in New York City:

Information on Uber pickups in New York City from 2014 to 2015, including pickup times and locations.

6. Diamonds Dataset:

Information on over 53,000 diamonds, including their cut, color, clarity, and carat weight, as well as their price.

7. Walmart Sales Forecasting Dataset:

Weekly sales data for 45 Walmart stores across the US from 2010 to 2012, including information on promotions, holidays, and weather conditions.

8. Taxi Trip Pricing

Using this dataset, you can build predictive pricing models, especially around surge pricing.

Datasets for Education

In the realm of education, data has become a transformative force, reshaping how we understand learning, teaching, and educational outcomes. Education datasets are the building blocks of this data-driven revolution, providing valuable insights into the dynamics of classrooms, student progress, and the educational landscape as a whole.

Students Performance in Exams – This data set consists of the marks secured by the students in various subjects.

Red Wine Quality – Simple and clean practice dataset for regression or classification modelling

Medical Cost Personal Datasets – Insurance Forecast by using Linear Regression

Customer Personality Analysis – Customer Personality Analysis is a detailed analysis of a company’s ideal customers. It helps a business to better understand its customers and makes it easier for them to modify products according to the specific needs, behaviors and concerns of different types of customers.

Student Alcohol Consumption – Social, gender and study data from secondary school students

Face Mask Detection – This dataset contains 853 images belonging to the 3 classes, as well as their bounding boxes in the PASCAL VOC format.

[NeurIPS 2020] Data Science for COVID-19 (DS4C) – A portion of our dataset has been accepted in NeurIPS 2020. See this paper for more details

Datasets for Pre-Trained models

Pre-trained models have become the powerhouses of artificial intelligence, offering a shortcut to harnessing the incredible capabilities of deep learning. Behind these models lies a treasure trove of datasets meticulously curated for the training of these AI giants.

Keras Pretrained models – This dataset helps you use pretrained Keras models in Kernels.

EfficientNet PyTorch – Pre-trained EfficientNet models (B0-B7) for PyTorch

Huggingface BERT – BERT models directly retrieved and updated from: https://huggingface.co/ This dataset contains many popular BERT weights retrieved directly on Hugging Face’s model repository, and hosted on Kaggle. It will be automatically updated every month to ensure that the latest version is available to the user.

Pretrained Model Weights (Pytorch) – These are the pre-trained models that you can use with pretrainedmodels library in PyTorch

EfficientDet Pytorch – A PyTorch impl of EfficientDet faithful to the original Google

Cattle Weight Detection Model + Dataset (12k~) – Made for detecting cattle weight from low-cost smartphones in Bangladesh

Best Open Source LLM Starter Pack

This dataset contains a couple of great open source models!

version 2 — the best open source LLM at the time of writing (NousResearch/Nous-Hermes-Llama2-13b) that we can load on Kaggle! didn’t manage to load anything larger than 13B

version 14 — loading models using a new library, curated-transformers that should allow for easier modifications of the underlying architectures.

Stable Diffusion 1.5 (normal and EMAonly) with vae – Public release weights from Stable-Diffusion

fruit classifier model – Fruit Classification model created using Transfer Learning with ResNet

Datasets for Classification:

In the realm of machine learning and data science, classification is a fundamental task with far-reaching applications. At the heart of every successful classification model lies a high-quality dataset tailored to the specific problem at hand. Classification datasets are the lifeblood of these models, enabling them to decipher patterns, make predictions, and categorize data with precision.

Most Streamed Spotify Songs 2023

This dataset contains a comprehensive list of the most famous songs of 2023 as listed on Spotify. The dataset offers a wealth of features beyond what is typically available in similar datasets. It provides insights into each song’s attributes, popularity, and presence on various music platforms. The dataset includes information such as track name, artist(s) name, release date, Spotify playlists and charts, streaming statistics, Apple Music presence, Deezer presence, Shazam charts, and various audio features.

Credit Card Fraud Detection Dataset 2023

This dataset contains credit card transactions made by European cardholders in the year 2023. It comprises over 550,000 records, and the data has been anonymized to protect the cardholders’ identities. The primary objective of this dataset is to facilitate the development of fraud detection algorithms and models to identify potentially fraudulent transactions.

Billionaires Statistics Dataset (2023)

This dataset contains statistics on the world’s billionaires, including information about their businesses, industries, and personal details. It provides insights into the wealth distribution, business sectors, and demographics of billionaires worldwide.

Heart Attack Analysis & Prediction Dataset – A dataset for heart attack classification

Customer Personality Analysis

Customer Personality Analysis is a detailed analysis of a company’s ideal customers. It helps a business to better understand its customers and makes it easier for them to modify products according to the specific needs, behaviors and concerns of different types of customers.

Date Fruit Datasets – 7 classes: Barhee, Deglet Nour, Sukkary, Rotab Mozafati, Ruthana, Safawi, Sagai

CelebFaces Attributes (CelebA) Dataset – Over 200k images of celebrities with 40 binary attribute annotations

Global YouTube Statistics 2023

A collection of YouTube giants, this dataset offers a perfect avenue to analyze and gain valuable insights from the luminaries of the platform. With comprehensive details on top creators’ subscriber counts, video views, upload frequency, country of origin, earnings, and more, this treasure trove of information is a must-explore for aspiring content creators, data enthusiasts, and anyone intrigued by the ever-evolving online content landscape.