Free Datasets for Data Analysis Practice — 50+ Sources

Free datasets are where most analysts get their first hands-on practice. This curated list points you to real-world data you can use to build portfolio projects that help you get hired.

📚Beginner
⏱️8 min
5 quizzes
🌐

Major Dataset Platforms

| Platform | Description | Best For | URL |
|----------|-------------|----------|-----|
| Kaggle | 50K+ datasets, competitions, notebooks | Learning, portfolio projects | kaggle.com/datasets |
| UCI Repository | 600+ classic ML datasets | Benchmarking, academic | archive.ics.uci.edu/ml |
| Google Dataset Search | Search engine for datasets | Discovery across sources | datasetsearch.research.google.com |
| Data.gov.in | Indian government open data | India-specific analysis | data.gov.in |
| World Bank | Global economic indicators | International comparisons | data.worldbank.org |
| Our World in Data | Research datasets (health, climate) | Social impact analysis | ourworldindata.org |
| GitHub Awesome | Curated public datasets | Topic-specific collections | github.com/awesomedata |
| FiveThirtyEight | Journalism datasets with stories | Reproducible analysis | data.fivethirtyeight.com |

🛒

Business & E-commerce Datasets

Recommended for Portfolio Projects:

1. Online Retail Dataset (Kaggle)

  • Size: 500K+ transactions
  • Features: Customer ID, product, quantity, price, date, country
  • Projects: RFM analysis, customer segmentation, cohort retention, market basket
  • Why it's great: Multiple customers, time series, clean structure

2. Superstore Sales (Kaggle)

  • Size: 9,994 orders
  • Features: Category, sales, profit, region, shipping
  • Projects: Sales dashboard, profitability analysis, regional comparison
  • Why it's great: Perfect for Power BI beginners, clear business metrics

3. Olist Brazilian E-commerce (Kaggle)

  • Size: 100K orders, multiple tables
  • Features: Orders, customers, products, reviews, payments
  • Projects: SQL joins, delivery time analysis, review sentiment
  • Why it's great: Real-world complexity, multiple tables to join

4. Instacart Market Basket (Kaggle)

  • Size: 3+ million grocery orders
  • Features: Products, aisles, order sequences
  • Projects: Association rules, recommendation systems
  • Why it's great: Large scale, interesting insights (milk + bread patterns)

5. Black Friday Sales (Kaggle)

  • Size: 550K purchases
  • Features: Age, gender, occupation, product category, purchase amount
  • Projects: Customer profiling, demographic analysis
  • Why it's great: Clean, good for segmentation
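
The RFM analysis suggested for the Online Retail dataset can be sketched with the standard library alone. This is a minimal illustration on made-up rows: the tuple layout, the customer and invoice IDs, and the snapshot date are all assumptions, not the dataset's actual schema.

```python
from collections import defaultdict
from datetime import date

# Hypothetical transactions: (customer_id, invoice_no, invoice_date, quantity, unit_price)
TRANSACTIONS = [
    ("C1", "A001", date(2024, 1, 5), 2, 10.0),
    ("C1", "A002", date(2024, 3, 1), 1, 25.0),
    ("C2", "A003", date(2024, 2, 10), 5, 4.0),
    ("C3", "A004", date(2023, 11, 20), 1, 100.0),
    ("C3", "A005", date(2023, 12, 24), 3, 30.0),
    ("C3", "A006", date(2024, 3, 10), 1, 15.0),
]


def rfm_table(transactions, snapshot):
    """Return {customer_id: (recency_days, frequency, monetary)}."""
    last_purchase = {}
    invoices = defaultdict(set)
    spend = defaultdict(float)
    for customer, invoice, day, qty, price in transactions:
        # Recency needs the most recent purchase date per customer.
        last_purchase[customer] = max(last_purchase.get(customer, day), day)
        # Frequency counts distinct invoices, not line items.
        invoices[customer].add(invoice)
        # Monetary is total spend across all line items.
        spend[customer] += qty * price
    return {
        c: ((snapshot - last_purchase[c]).days, len(invoices[c]), round(spend[c], 2))
        for c in last_purchase
    }


rfm = rfm_table(TRANSACTIONS, snapshot=date(2024, 3, 15))
for customer, (recency, frequency, monetary) in sorted(rfm.items()):
    print(customer, recency, frequency, monetary)
```

On real data, the next step would typically be to bin each of the three values into scores (for example quintiles) and label segments from the score combinations.
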
🇮🇳

India-Specific Datasets

Why India datasets matter for your portfolio:

  • ✅ Relatable to Indian recruiters (₹ vs $, Mumbai vs New York)
  • ✅ Shows local market understanding
  • ✅ Stands out from international datasets everyone uses
  • ✅ Demonstrates initiative (sought out regional data)

Top India Datasets:

1. Zomato Bangalore Restaurants (Kaggle)

  • 50K+ restaurants, ratings, cuisines, cost, location
  • Projects: Price analysis by area, cuisine trends, rating patterns
  • Insight example: "North Indian restaurants in Koramangala charge 40% premium"

2. IPL Complete Dataset (Kaggle)

  • All matches 2008-2025, ball-by-ball data
  • Projects: Player performance, team analysis, win prediction
  • Insight example: "Batsmen average 15% higher in Chennai vs Mumbai"

3. India Air Quality Data (Kaggle)

  • PM2.5, PM10 across cities, hourly data
  • Projects: Pollution trends, city comparison, seasonal patterns
  • Insight example: "Delhi AQI spikes 300% during Diwali week"

4. COVID-19 India (Kaggle)

  • State-wise cases, testing, vaccination
  • Projects: Time-series forecasting, vaccination pace analysis
  • Insight example: "Kerala detected cases 2× faster due to higher testing"

5. Swiggy/Zomato Delivery Data (Search Kaggle)

  • Delivery times, restaurant partners, user ratings
  • Projects: Delivery optimization, peak hour analysis
  • Insight example: "Avg delivery time: 28 min weekday vs 35 min weekend"

6. Naukri/LinkedIn Job Postings (Kaggle)

  • Job titles, skills required, salaries, companies
  • Projects: Skill demand analysis, salary benchmarking
  • Insight example: "SQL appears in 87% of data analyst JDs"
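
A skill demand analysis like the one suggested for job postings comes down to counting how many descriptions mention each skill. The sketch below assumes the descriptions are already loaded as plain strings; the sample texts and the skill list are invented for illustration, so the resulting percentages say nothing about real postings.

```python
import re

# Hypothetical job descriptions; real postings would come from a downloaded CSV.
JOB_DESCRIPTIONS = [
    "Data Analyst with strong SQL and Excel skills",
    "Analyst: Python, SQL, Power BI dashboards",
    "Junior analyst, Excel and PowerPoint reporting",
    "Data analyst (sql, tableau); Python is a plus",
]
SKILLS = ["SQL", "Python", "Excel", "Tableau"]


def skill_share(descriptions, skills):
    """Percentage of descriptions mentioning each skill (case-insensitive, whole word)."""
    shares = {}
    for skill in skills:
        pattern = re.compile(rf"\b{re.escape(skill)}\b", re.IGNORECASE)
        hits = sum(1 for text in descriptions if pattern.search(text))
        shares[skill] = round(100 * hits / len(descriptions), 1)
    return shares


shares = skill_share(JOB_DESCRIPTIONS, SKILLS)
for skill, pct in sorted(shares.items(), key=lambda item: -item[1]):
    print(f"{skill}: {pct}% of postings")
```

Whole-word matching is a deliberate choice here: a plain substring search would count "R" inside "Reporting" or "Excel" inside "Excellent".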

🎯

Domain-Specific Datasets

Finance & Economics:

  • Stock Prices: Yahoo Finance (free; no official API, commonly accessed through the unofficial yfinance library), Alpha Vantage
  • Cryptocurrency: Coinbase API, CryptoCompare
  • Credit Card Fraud: Kaggle (imbalanced classification)
  • Loan Default: Kaggle (credit risk modeling)
  • Bitcoin Historical: Blockchain.com, CoinMetrics

Healthcare:

  • Diabetes Prediction: Kaggle, UCI
  • Heart Disease: UCI (Cleveland dataset)
  • Cancer Data: UCI (breast cancer Wisconsin)
  • COVID-19 Global: Johns Hopkins, Our World in Data
  • Hospital Readmission: CMS.gov

Sports & Entertainment:

  • IPL Cricket: Kaggle (ball-by-ball 2008-2025)
  • FIFA Players: Kaggle (ratings, stats, wages)
  • NBA Stats: Basketball-Reference, Kaggle
  • Olympics: Kaggle (120 years of data)
  • IMDb Movies: Kaggle, OMDb API
  • Spotify Music: Kaggle (audio features, popularity)

Social Media:

  • Twitter Sentiment: Kaggle (various topics)
  • YouTube Trending: Kaggle (daily trending videos)
  • Reddit: Pushshift API, Kaggle archives
  • Instagram Engagement: Various Kaggle sources

Real Estate:

  • House Prices: Kaggle (Ames, Boston housing)
  • Airbnb: Inside Airbnb (listings by city)
  • Property Valuation: MagicBricks, 99acres (scraped data on Kaggle)

Transportation:

  • Uber/Lyft: Kaggle (trip data, surge pricing)
  • Flight Delays: US DOT, Kaggle
  • NYC Taxi: NYC Open Data (millions of trips)
  • Bike Sharing: Capital Bikeshare, Kaggle

How to Choose Datasets for Portfolio

Good Portfolio Dataset Criteria:

1. Size

  • ❌ Too small (<1,000 rows): Looks like a toy project
  • ✅ Sweet spot (10K-500K): Real-world scale
  • ⚠️ Too large (>10M): May need cloud tools

2. Business Relevance

  • ✅ E-commerce, sales, customers, products
  • ✅ Clear business metrics (revenue, churn, conversion)
  • ❌ Academic abstractions (iris flowers, wine quality)

3. Complexity (some is good!)

  • ✅ Missing values → show data cleaning
  • ✅ Outliers → show handling techniques
  • ✅ Multiple tables → show SQL JOINs
  • ✅ Time component → show trend analysis
  • ❌ 90% nulls → too messy to be useful

4. Analysis Potential

  • ✅ Multiple questions can be answered
  • ✅ Segmentation opportunities
  • ✅ Time-series/trends
  • ✅ Visual storytelling possible

5. Uniqueness

  • ❌ Titanic, Iris (everyone does these)
  • ✅ India-specific datasets
  • ✅ Recent data (2023-2026)
  • ✅ Niche but interesting (IPL, Zomato)
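
The size and messiness criteria above can be turned into a quick screening check before committing to a dataset. This is a rough sketch, assuming rows are already loaded as a list of dictionaries with hashable values; the function name, the verdict labels, and the sample rows are hypothetical, while the thresholds mirror the ones listed in the criteria.

```python
def screen_dataset(rows, null_limit=0.9):
    """Return rough size and messiness checks for a list of dict rows."""
    n = len(rows)
    if n < 1_000:
        size = "too small"
    elif n > 10_000_000:
        size = "may need cloud tools"
    elif 10_000 <= n <= 500_000:
        size = "sweet spot"
    else:
        size = "workable"

    columns = list(rows[0]) if rows else []
    # Share of missing values (None or empty string) per column.
    null_share = {
        col: sum(1 for row in rows if row.get(col) in (None, "")) / n
        for col in columns
    }
    sparse = [col for col, share in null_share.items() if share >= null_limit]
    # Rows are hashed as sorted (column, value) pairs to count exact duplicates.
    distinct = {tuple(sorted(row.items())) for row in rows}
    return {
        "rows": n,
        "size": size,
        "sparse_columns": sparse,
        "duplicate_rows": n - len(distinct),
    }


# Made-up sample: 1,200 orders where the coupon column is mostly empty.
sample = [
    {"order_id": i, "city": "Pune" if i % 2 else "Mumbai",
     "coupon": None if i % 20 else "SAVE10"}
    for i in range(1200)
]
sample.extend(sample[:5])  # simulate five accidental duplicate rows

report = screen_dataset(sample)
print(report)
```

A few duplicates or one sparse column is not a reason to reject a dataset; it is cleaning work you can show. The check is mainly useful for spotting data that is too small or too empty to support an analysis.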

Recommended Portfolio Progression:

Beginner (First 3 projects):

  1. Superstore Sales: Power BI dashboard, simple SQL
  2. Zomato Bangalore: Python EDA, interesting insights
  3. IPL Dataset: Engaging topic, good storytelling

Intermediate (Next 3 projects):

  4. Online Retail: RFM analysis, customer segmentation
  5. Olist E-commerce: Multiple tables, SQL JOINs, funnel analysis
  6. Airbnb Pricing: Regression model, price prediction

Advanced (Stand out):

  7. Web scraping: Collect your own Naukri job data
  8. API integration: Real-time stock/crypto dashboard
  9. Multi-source: Combine Swiggy + weather + traffic data
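
The multi-table SQL JOIN work mentioned for the Olist project can be practiced locally with Python's built-in sqlite3 module. The sketch below uses two tiny in-memory tables; the table names, column names, and rows are assumptions made for illustration and do not reproduce the real Olist schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (customer_id TEXT PRIMARY KEY, state TEXT);
    CREATE TABLE orders (
        order_id     TEXT PRIMARY KEY,
        customer_id  TEXT REFERENCES customers(customer_id),
        purchased_on TEXT,
        delivered_on TEXT
    );
    INSERT INTO customers VALUES ('c1', 'SP'), ('c2', 'RJ'), ('c3', 'SP');
    INSERT INTO orders VALUES
        ('o1', 'c1', '2018-01-01', '2018-01-05'),
        ('o2', 'c2', '2018-01-02', '2018-01-12'),
        ('o3', 'c3', '2018-01-03', '2018-01-09'),
        ('o4', 'c2', '2018-01-04', NULL);
""")

# Average delivery time per customer state, ignoring undelivered orders.
query = """
    SELECT c.state,
           COUNT(*) AS delivered_orders,
           AVG(julianday(o.delivered_on) - julianday(o.purchased_on)) AS avg_days
    FROM orders AS o
    JOIN customers AS c ON c.customer_id = o.customer_id
    WHERE o.delivered_on IS NOT NULL
    GROUP BY c.state
    ORDER BY avg_days DESC
"""
results = conn.execute(query).fetchall()
for state, delivered, avg_days in results:
    print(f"{state}: {delivered} delivered orders, {avg_days:.1f} days on average")
conn.close()
```

Filtering out the undelivered order explicitly keeps the delivered-order count honest; `AVG` would skip the NULL difference on its own, but `COUNT(*)` would not.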

💡

Dataset Project Best Practices

1. Document Your Work

Instead of: Just uploading final analysis
Do this: Show your process

README.md (Markdown)
# Zomato Bangalore Restaurant Analysis

## Dataset
- Source: Kaggle
- Size: 51,717 restaurants
- Date: Updated March 2024

## Data Cleaning
1. Removed 2,347 duplicates (4.5%)
2. Missing values:
   - Cost: 1,203 nulls → imputed with area median
   - Rating: 5,421 nulls → excluded (likely new restaurants)
3. Outliers: Capped cost at 99th percentile (₹3,500)

## Business Questions
1. Which areas have highest restaurant density?
2. How does cuisine affect pricing?
3. What rating do you need to charge premium prices?
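
The cleaning steps documented in the README above (deduplicate, impute cost with the area median, exclude unrated restaurants, cap cost at the 99th percentile) can be sketched as follows. The rows and the tuple layout are invented for illustration, and the code assumes every area has at least one known cost to take a median from.

```python
from statistics import median, quantiles

# Hypothetical rows: (name, area, cost_for_two, rating)
RAW = [
    ("Dosa Point", "Koramangala", 400, 4.1),
    ("Dosa Point", "Koramangala", 400, 4.1),    # exact duplicate
    ("Spice Hub", "Koramangala", 800, 3.9),
    ("Curry Leaf", "Koramangala", None, 4.3),   # missing cost
    ("Chai Stop", "Indiranagar", 200, None),    # missing rating
    ("Grill House", "Indiranagar", 1200, 4.5),
    ("Royal Table", "Indiranagar", 9000, 4.0),  # extreme cost
]


def clean(rows):
    # 1. Drop exact duplicates while keeping the original order.
    rows = list(dict.fromkeys(rows))

    # 2. Impute missing cost with the median cost of the same area.
    by_area = {}
    for _, area, cost, _ in rows:
        if cost is not None:
            by_area.setdefault(area, []).append(cost)
    area_median = {area: median(costs) for area, costs in by_area.items()}
    rows = [
        (name, area, area_median[area] if cost is None else cost, rating)
        for name, area, cost, rating in rows
    ]

    # 3. Exclude rows without a rating.
    rows = [row for row in rows if row[3] is not None]

    # 4. Cap cost at the 99th percentile of the remaining rows.
    cap = quantiles([row[2] for row in rows], n=100, method="inclusive")[98]
    rows = [(name, area, min(cost, cap), rating) for name, area, cost, rating in rows]
    return rows, cap


cleaned, cost_cap = clean(RAW)
print(f"{len(RAW)} raw rows -> {len(cleaned)} cleaned rows, cost capped at {cost_cap:.0f}")
```

Returning the cap alongside the rows makes it easy to report the exact threshold in the README, as the example does with its 99th percentile figure.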

2. Create Compelling Visualizations

Dashboard essentials:

  • 1 headline KPI (avg rating, total revenue)
  • Trend over time (sales by month)
  • Comparison (top 10 products, regions)
  • Distribution (customer age, order value)
  • Filters (date range, category, city)
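
Whichever tool draws the dashboard, the numbers behind the first three essentials are simple aggregates. Here is a small sketch of the headline KPI, the monthly trend, and a top-N comparison, using invented order lines; the products and amounts are placeholders only.

```python
from collections import Counter, defaultdict
from datetime import date

# Hypothetical order lines: (order_date, product, revenue)
ORDERS = [
    (date(2024, 1, 12), "Chair", 120.0),
    (date(2024, 1, 30), "Desk", 300.0),
    (date(2024, 2, 3), "Chair", 80.0),
    (date(2024, 2, 18), "Lamp", 40.0),
    (date(2024, 3, 7), "Desk", 450.0),
]

# Headline KPI: total revenue.
total_revenue = sum(revenue for _, _, revenue in ORDERS)

# Trend over time: revenue per calendar month.
monthly = defaultdict(float)
# Comparison: revenue per product, ready for a top-N ranking.
by_product = Counter()
for day, product, revenue in ORDERS:
    monthly[day.strftime("%Y-%m")] += revenue
    by_product[product] += revenue

top_products = by_product.most_common(2)

print("Total revenue:", total_revenue)
print("Monthly trend:", dict(sorted(monthly.items())))
print("Top products:", top_products)
```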

Tools:

  • Power BI: Best for business dashboards
  • Tableau Public: Great for interactive viz
  • Python (matplotlib/seaborn): For custom analysis
  • Excel: Quick exploration

3. Share on Multiple Platforms

GitHub:

  • Code (Python/SQL scripts)
  • README with insights
  • requirements.txt
  • Screenshots

Kaggle:

  • Jupyter notebook
  • Markdown explanations
  • Public dataset if you scraped data

Tableau Public:

  • Interactive dashboard
  • Clear title and filters
  • Mobile-friendly

LinkedIn:

  • Post key insights with visuals
  • Link to full project
  • Use hashtags: #dataanalysis #python #powerbi

4. Write Business Insights, Not Just Stats

Bad: "Average rating is 3.7"
Good: "Restaurants rated >4.0 charge 25% premium (₹600 vs ₹480 avg). Consider: does quality justify higher prices, or is rating manipulation occurring?"

Bad: "Linear regression R² = 0.73"
Good: "Model explains 73% of price variation. Key drivers: area (Koramangala +40%), cuisine (North Indian +30%), rating (each 0.5⭐ = +15% price)."
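
The arithmetic behind an insight such as "25% premium (₹600 vs ₹480 avg)" is a comparison of group averages: (600 − 480) / 480 = 0.25. A minimal sketch follows, with made-up (rating, cost) pairs chosen so the averages match those example figures; the function name and the 4.0 threshold parameter are assumptions for illustration.

```python
from statistics import mean

# Hypothetical (rating, cost_for_two) pairs chosen to mirror the example figures.
RESTAURANTS = [(4.2, 650), (4.5, 550), (4.1, 600), (3.6, 500), (3.9, 460), (3.2, 480)]


def rating_premium(rows, threshold=4.0):
    """Return (avg cost above threshold, avg cost at or below it, premium in %)."""
    high = [cost for rating, cost in rows if rating > threshold]
    rest = [cost for rating, cost in rows if rating <= threshold]
    high_avg, rest_avg = mean(high), mean(rest)
    return high_avg, rest_avg, 100 * (high_avg - rest_avg) / rest_avg


high_avg, rest_avg, premium = rating_premium(RESTAURANTS)
print(
    f"Restaurants rated >4.0 charge a {premium:.0f}% premium "
    f"(₹{high_avg:.0f} vs ₹{rest_avg:.0f} avg)"
)
```

Note that the premium is expressed relative to the lower-rated group's average; dividing by the higher average instead would give 20%, so state the baseline when you report the figure.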

5. Iterate Based on Feedback

Get feedback:

  • Reddit: r/datascience, r/dataisbeautiful
  • LinkedIn posts
  • Discord communities (DataTalks.Club)
  • Ask senior analysts

Common improvements:

  • Add axis labels/units
  • Include data source
  • Explain statistical methods
  • Add business recommendations
