Complete-Life-Cycle-of-a-Data-Science-Project

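A practical guide to the complete life cycle of a data science project

This project provides a practical guide to the complete life cycle of a data science project. It covers the whole process: data collection, cleaning, feature engineering, model training, and deployment. It describes data acquisition methods such as web scraping, APIs, and databases in detail, and compiles many open dataset resources. It also includes best practices for key stages such as data preprocessing, feature selection, and model evaluation. It is a valuable reference for data science learners and practitioners, and helps them get a full picture of the data science project workflow.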

CREDITS: All corresponding resources

MOTIVATION: I created this repository to help upcoming aspirants and others in the data science field

https://www.theinsaneapp.com/2021/03/how-to-build-machine-learning-project.html

**** If you like my work, please buy me a coffee; it motivates me -> https://www.buymeacoffee.com/achuthasubhash?new=1 ****

Business understanding

1.Data collection

Data comes in 3 kinds

a.Structured data (tabular data, etc.)

b.Unstructured data (images, text, audio, etc.)

c.Semi-structured data (XML, JSON, etc.)
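As a rough illustration of the third kind, the sketch below flattens semi-structured JSON and XML into tabular rows using only the Python standard library. The records and field names (`id`, `name`, `tags`) are hypothetical and chosen only for this example.

```python
import json
import xml.etree.ElementTree as ET

# Hypothetical records, invented for illustration only.
JSON_TEXT = '[{"id": 1, "name": "Ada", "tags": ["ml", "stats"]}, {"id": 2, "name": "Lin"}]'
XML_TEXT = (
    "<people>"
    "<person id='1'><name>Ada</name></person>"
    "<person id='2'><name>Lin</name></person>"
    "</people>"
)


def rows_from_json(text):
    """Flatten a JSON array of objects into (id, name, tag_count) rows."""
    rows = []
    for record in json.loads(text):
        # Semi-structured records may omit fields, so fall back to a default.
        rows.append((record["id"], record["name"], len(record.get("tags", []))))
    return rows


def rows_from_xml(text):
    """Flatten <person> elements into (id, name) rows."""
    root = ET.fromstring(text)
    return [(int(p.get("id")), p.findtext("name")) for p in root.iter("person")]


json_rows = rows_from_json(JSON_TEXT)
xml_rows = rows_from_xml(XML_TEXT)
print(json_rows)
print(xml_rows)
```

The point of the sketch is that semi-structured data carries its own field names but does not guarantee that every record has every field, so the flattening step has to decide on defaults.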

Variables are of 2 kinds

a.Qualitative (nominal, ordinal, binary)

b.Quantitative (discrete, continuous)

https://www.chi2innovations.com/blog/discover-data-blog-series/data-types-101/
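The variable kinds above can be guessed from observed values with a simple heuristic. This is only a sketch: the function name `variable_kind` and its rules are assumptions made for illustration, and telling nominal from ordinal needs domain knowledge that the values alone do not provide.

```python
def variable_kind(values):
    """Guess the kind of a variable from its observed values (heuristic only)."""
    distinct = set(values)
    if len(distinct) == 2:
        # Exactly two distinct values is treated as binary in this sketch.
        return "qualitative (binary)"
    # bool is a subclass of int, so exclude it from the numeric check.
    numeric = all(
        isinstance(v, (int, float)) and not isinstance(v, bool) for v in values
    )
    if numeric:
        if all(isinstance(v, int) for v in values):
            return "quantitative (discrete)"
        return "quantitative (continuous)"
    # Nominal vs ordinal cannot be inferred from the values alone.
    return "qualitative (nominal or ordinal)"


print(variable_kind([0, 1, 1, 0]))
print(variable_kind([3, 1, 4, 1, 5]))
print(variable_kind([1.5, 2.0, 3.7]))
print(variable_kind(["red", "green", "blue"]))
```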

Data sources: databases, scraping data from websites, purchasing data, surveys, sensors, cameras, APIs, etc.

cleanlab https://l7.curtisnorthcutt.com/cleanlab-python-package https://github.com/cgnorthcutt/cleanlab https://github.com/cgnorthcutt/label-errors https://github.com/cgnorthcutt/rankpruning https://github.com/subeeshvasu/Awesome-Learning-with-Label-Noise

Measure Data Quality ydata-quality https://github.com/ydataai/ydata-synthetic https://towardsdatascience.com/how-can-i-measure-data-quality-9d31acfeb969

a.Web scraping  best article to refer to - https://towardsdatascience.com/choose-the-best-python-web-scraping-library-for-your-application-91a68bc81c4f

https://www.analyticsvidhya.com/blog/2019/10/web-scraping-hands-on-introduction-python/?utm_source=linkedin&utm_medium=KJ|link|weekend-blogs|blogs|44087|0.875

https://www.kdnuggets.com/2021/02/6-web-scraping-tools.html

https://www.bigdatanews.datasciencecentral.com/profiles/blogs/top-30-free-web-scraping-software

https://towardsdatascience.com/6-web-scraping-tools-that-make-collecting-data-a-breeze-457c44e4411d

https://medium.com/analytics-vidhya/master-web-scraping-completly-from-zero-to-hero-38051423256b

1.Beautifulsoup  https://www.freecodecamp.org/news/how-to-scrape-websites-with-python-and-beautifulsoup-5946935d93fe/

  mechanicalsoup   https://analyticsindiamag.com/mechanicalsoup-web-scraping-custom-dataset-tutorial/

2.Scrapy,PyScrappy,Pandas Datareader,Instaloader,lxml

3.Selenium     https://www.freecodecamp.org/news/better-web-scraping-in-python-with-selenium-beautiful-soup-and-pandas-d6390592e251/

4.Requests to access data

5.AUTOSCRAPER - https://github.com/alirezamika/autoscraper https://www.youtube.com/watch?v=9BQ353Yu1D0 https://www.analyticsvidhya.com/blog/2021/04/automate-web-scraping-using-python-autoscraper-library/

scrapeasy  Scrape Any Website in Seconds with One Line of Code  https://github.com/joelbarmettlerUZH/Scrapeasy

Scrap Images From E-Commerce Website Using AutoScraper https://www.analyticsvidhya.com/blog/2021/05/scrap-images-from-e-commerce-website-using-autoscraper-library/

amazon auto scraper library https://webautomation.io/

 Listly https://www.listly.io/r/stdfr

FiftyOne Now easier to download and evaluate  https://towardsdatascience.com/googles-open-images-now-easier-to-download-and-evaluate-with-fiftyone-615ce0482c02

webbot https://pypi.org/project/webbot/

gazpacho https://github.com/maxhumber/gazpacho

html_scraper_streamlit_app https://www.youtube.com/watch?v=6U5xJ3mXRKA&feature=youtu.be

6.Twitter scraping tool (twint or tweepy or tweetlib)-https://github.com/twintproject/twint

  twitterscraper https://www.youtube.com/watch?v=MpIi4HtCiVk
  
  twython https://github.com/ryanmcgrath/twython
  
  twarc https://github.com/DocNow/twarc https://scholarslab.github.io/learn-twarc/01-quick-start.html
  
  snscrape  extract Twitter data  https://github.com/JustAnotherArchivist/snscrape
  
  Scweet A simple and unlimited twitter scraper  https://github.com/Altimis/Scweet
  
  GetOldTweets3,GoogleNews,snscrape

  Scrape Twitter for Tweets https://github.com/taspinar/twitterscraper
  
  HAR File Web Scraper https://stevesie.com/har-file-web-scraper https://www.youtube.com/watch?v=LcqVDfueb8g

  https://analyticsindiamag.com/complete-tutorial-on-twint-twitter-scraping-without-twitters-api/
  
  https://developer.twitter.com/en/docs
  
  pytrends  https://medium.com/nerd-for-tech/scraping-data-from-online-platforms-to-enhance-time-series-forecasts-6eec3c68636d
  
  Scraping Instagram -instaloader  https://thecleverprogrammer.com/2020/07/30/scraping-instagram-with-python/
  
  Instascrape   
  
  Scrape LinkedIn Profiles with ProxyCurl API
  
  Reddit Dataset  Using PSAW and PRAW in Python
  
  Scraping Reddit using Python Reddit API Wrapper  (PRAW)
  
  Scrape Wikipedia  wikipedia https://www.thepythoncode.com/article/access-wikipedia-python
  
  patang - Scrape Product details from eCommerce Sites with Puppeteer and DOM String  https://www.youtube.com/watch?v=3sgxRmyOuXs
  
  Download Wikipedia https://www.wikidata.org/wiki/Wikidata:Main_Page https://www.youtube.com/watch?v=hC1rY4lRY0s https://towardsdatascience.com/an-efficient-way-to-read-data-from-the-web-directly-into-python-a526a0b4f4cb
  
  Web Scraping to Create a CSV File  https://thecleverprogrammer.com/2020/08/08/web-scraping-to-create-csv/
  
  Amazon Web Scraper, Amazon Auto Scraper

7.urllib

8.pattern

9.Octoparse Easy Web Scraping   https://www.octoparse.com/

 prowebscraper https://prowebscraper.com/features

 Web scraper https://chrome.google.com/webstore/detail/web-scraper-free-web-scra/jnhgnonknehpejjnehehllkliplmbmhn?hl=en

 ParseHub https://www.parsehub.com/  https://analyticsindiamag.com/parsehub-no-code-gui-based-web-scraping-tool/
 
 PyScrappy https://github.com/mldsveda/PyScrappy https://www.analyticsvidhya.com/blog/2022/02/web-scraping-with-pyscrappy/
 
 ScrapeSimple Website: https://www.scrapesimple.com
 
 Content Grabber https://contentgrabber.com/Manual/understanding_the_concept.htm
 
 Crawly https://crawly.diffbot.com/ 
 
 Apify https://apify.com/
 
 Mozenda Website: https://www.mozenda.com/
 
 obsei https://github.com/lalitpagaria/obsei
 
 Diffbot  https://analyticsindiamag.com/diffbot/
 
 Trustpilot,webhose,scrapingbot 
 
 lxml  https://lxml.de/index.html#introduction
 
 ScrapingBee  https://analyticsindiamag.com/scrapingbee-api/
 
 Scrape HTML tables https://www.youtube.com/watch?v=6U5xJ3mXRKA&feature=youtu.be  or pd.read_html
 
 requests-html https://github.com/kennethreitz/requests-html
 
 newspaper https://github.com/codelucas/newspaper  https://www.youtube.com/watch?v=Hfry5XnISyc
 
 newspaper3k: https://newspaper.readthedocs.io  # easily extract text from articles
 
 newscatcher https://github.com/kotartemiy/newscatcher https://www.youtube.com/watch?v=pHzOuizZq4I
 
 patang (extract product details) https://github.com/tejazz/patang
 
 lisc https://github.com/lisc-tools/lisc
 
 Helena WEB AUTOMATION FOR END USERS https://helena-lang.org/
 
 pandas(read_html)
 
 wget, curl, ParseHub, webhouse, Octoparse, scraping bot, ScrapingBee, Common, Content Grabber, Docparser, Scraper API, Import.io, Altair Monarch, WebAutomation.io, WebScraper.io, Scrape.do, AvesAPI, Scrapingdog, Diffbot, Grepsr, Scrapy

 HTML basics for web scraping,Web Scraping with Octoparse,Web Scraping with Selenium

 10-best-web-scraping-tools  https://www.scraperapi.com/blog/the-10-best-web-scraping-tools/
 
 https://analyticsindiamag.com/complete-learning-path-to-web-scraping-with-all-major-tools/
 
 https://www.kdnuggets.com/2018/02/web-scraping-tutorial-python.html
 
 https://www.octoparse.com/ https://github.com/tirthajyoti/pydbgen https://www.mozenda.com/ https://www.mockaroo.com/ https://lionbridge.ai/ https://www.mturk.com/ https://appen.com/
 
 11.GoogleImageCrawler,google_images_download,bing_image
 
 https://www.freepik.com/popular-photos , https://stocksnap.io/ , https://www.pexels.com/ ,https://unsplash.com/ , https://pixabay.com/
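To make the idea behind the scraping tools listed above concrete, here is a minimal sketch that extracts links from an HTML string using only the standard library (`html.parser` and `urllib.parse`). It is not how BeautifulSoup, Scrapy, or Selenium work internally; the sample HTML, the `LinkExtractor` class, and the `example.com` URLs are assumptions made for illustration. In practice the HTML would come from an HTTP response rather than a hard-coded string.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

# Hypothetical page content; a real scraper would download this.
SAMPLE_HTML = """
<html><body>
  <h1>Datasets</h1>
  <a href="/data/iris.csv">Iris</a>
  <a href="https://example.com/titanic.csv">Titanic</a>
  <a>No link here</a>
</body></html>
"""


class LinkExtractor(HTMLParser):
    """Collect (text, absolute_url) pairs for every <a> tag that has an href."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []
        self._href = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            # Resolve relative links against the page URL.
            self._href = urljoin(self.base_url, href) if href else None
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append(("".join(self._text).strip(), self._href))
            self._href = None


def extract_links(html, base_url):
    parser = LinkExtractor(base_url)
    parser.feed(html)
    return parser.links


links = extract_links(SAMPLE_HTML, "https://example.com/")
for text, url in links:
    print(text, url)
```

Dedicated libraries handle malformed markup, CSS selectors, and JavaScript-rendered pages far better than this sketch, which is why they are usually preferred for real projects.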
 

b.Web Crawling

https://python.libhunt.com/scrapy-alternatives

Flat Data https://octo.github.com/projects/flat-data
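Crawling differs from scraping in that it follows links from page to page. A minimal breadth-first sketch is shown below, under stated assumptions: the in-memory `FAKE_SITE` dictionary stands in for real HTTP fetches, the regular expression is a simplification of proper HTML parsing, and all URLs are hypothetical. A real crawler should also respect robots.txt and rate limits.

```python
import re
from collections import deque
from urllib.parse import urljoin, urlparse

# Hypothetical in-memory "site" so the sketch runs without network access.
FAKE_SITE = {
    "https://example.com/": '<a href="/a">A</a> <a href="/b">B</a>',
    "https://example.com/a": '<a href="/b">B</a> <a href="https://other.org/x">X</a>',
    "https://example.com/b": '<a href="/">Home</a>',
}

# Simplified link extraction; a real crawler would use an HTML parser.
HREF_RE = re.compile(r'href="([^"]+)"')


def crawl(start_url, fetch, max_pages=10):
    """Breadth-first crawl restricted to the start URL's domain."""
    domain = urlparse(start_url).netloc
    queue = deque([start_url])
    seen = {start_url}
    visited = []
    while queue and len(visited) < max_pages:
        url = queue.popleft()
        html = fetch(url)
        if html is None:
            continue  # page could not be fetched
        visited.append(url)
        for href in HREF_RE.findall(html):
            link = urljoin(url, href)
            # Stay on the same domain and avoid revisiting pages.
            if urlparse(link).netloc == domain and link not in seen:
                seen.add(link)
                queue.append(link)
    return visited


visited = crawl("https://example.com/", FAKE_SITE.get)
print(visited)
```

Passing `fetch` as a parameter keeps the traversal logic separate from the download step, so the same function could be driven by a real HTTP client.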

b.3rd party APIs

22 APIs every data scientist should learn https://www.springboard.com/library/data-science/top-apis-for-data-scientists/
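Most third-party APIs follow the same pattern: build a URL with query parameters, send the request, and parse the JSON response into rows. The sketch below shows the first and last steps with the standard library. The endpoint `api.example.com`, the parameter names, and the response shape (`items` with `name` and `value`) are all hypothetical; a real API defines its own, and usually requires an API key.

```python
import json
from urllib.parse import urlencode


def build_request_url(base_url, **params):
    """Append query parameters to an API endpoint."""
    return f"{base_url}?{urlencode(params)}"


def parse_items(payload_text):
    """Turn a JSON payload of the assumed shape into (name, value) rows."""
    payload = json.loads(payload_text)
    return [(item["name"], item["value"]) for item in payload.get("items", [])]


# Hypothetical endpoint and canned response, for illustration only.
url = build_request_url("https://api.example.com/v1/series", country="IN", year=2021)
canned_response = '{"items": [{"name": "series_a", "value": 1.5}]}'
rows = parse_items(canned_response)

print(url)
print(rows)
# With network access, the request itself could be sent with
# urllib.request.urlopen(url) or the third-party requests library.
```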

c.Creating your own data (manual collection, e.g. Google Docs, surveys, etc.), i.e. primary data

d.ETL  awesome ETL https://github.com/pawl/awesome-etl#python https://github.com/achuthasubhash/awesome-etl

38x faster data pipelines with tf.data
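The extract-transform-load idea behind these ETL resources can be sketched in a few lines with the standard library. This is a toy pipeline under stated assumptions: the CSV text, the column names, and the cleaning rules are hypothetical, and production pipelines would normally use a dedicated ETL tool.

```python
import csv
import io
import sqlite3

# Hypothetical raw extract with messy values, standing in for a real source file.
RAW_CSV = """name,age,city
 Ada ,36,London
Lin,,Taipei
Sam,41, paris
"""


def extract(text):
    """Extract: read CSV text into a list of dicts."""
    return list(csv.DictReader(io.StringIO(text)))


def transform(rows):
    """Transform: trim whitespace, normalise city case, drop rows without an age."""
    cleaned = []
    for row in rows:
        if not row["age"].strip():
            continue  # skip records with a missing age
        cleaned.append(
            (row["name"].strip(), int(row["age"]), row["city"].strip().title())
        )
    return cleaned


def load(rows, conn):
    """Load: write the cleaned rows into a SQLite table."""
    conn.execute("CREATE TABLE IF NOT EXISTS people (name TEXT, age INTEGER, city TEXT)")
    conn.executemany("INSERT INTO people VALUES (?, ?, ?)", rows)
    conn.commit()


conn = sqlite3.connect(":memory:")
load(transform(extract(RAW_CSV)), conn)
loaded = conn.execute("SELECT name, age, city FROM people ORDER BY name").fetchall()
print(loaded)
```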

d.Databases

Databases are of 2 kinds: SQL (relational) and NoSQL databases

SQL, SQLite, MySQL, MongoDB, MontyDB, Hadoop, Elasticsearch, Cassandra, Amazon S3, Hive, Google Bigtable, AWS DynamoDB, HBase, Oracle DB

sql https://mode.com/sql-tutorial/ https://www.w3schools.com/sql/

sql in python https://medium.com/jbennetcodes/how-to-rewrite-your-sql-queries-in-pandas-and-more-149d341fc53e

PyMongo https://analyticsindiamag.com/guide-to-pymongo-a-python-wrapper-for-mongodb/
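The contrast between the two kinds of database can be illustrated as follows. The SQL half uses an in-memory SQLite database from the standard library; the NoSQL half only mimics the document model with plain Python dicts, since a real document store such as MongoDB would be accessed through a driver like PyMongo. Table, field names, and values are hypothetical.

```python
import sqlite3

# SQL: fixed schema, queried declaratively.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, customer TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO orders (customer, amount) VALUES (?, ?)",
    [("ada", 10.0), ("lin", 5.5), ("ada", 2.5)],  # hypothetical rows
)
totals = conn.execute(
    "SELECT customer, SUM(amount) FROM orders GROUP BY customer ORDER BY customer"
).fetchall()

# NoSQL (document model), mimicked with dicts: records need not share fields.
documents = [
    {"customer": "ada", "amount": 10.0, "coupon": "WELCOME"},
    {"customer": "lin", "amount": 5.5},
    {"customer": "ada", "amount": 2.5, "items": ["pen", "ink"]},
]
doc_totals = {}
for doc in documents:
    # Aggregate in application code instead of with a SQL query.
    doc_totals[doc["customer"]] = doc_totals.get(doc["customer"], 0.0) + doc["amount"]

print(totals)
print(doc_totals)
```

Both halves compute the same per-customer totals; the difference is where the schema lives (in the table definition versus in each record).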

Cloud AI Data labeling service https://cloud.google.com/ai-platform/data-labeling/docs?utm_source=youtube&utm_medium=Unpaidsocial&utm_campaign=guo-20200503-Data-Labeling

e.Online resources - ultimate resource https://datasetsearch.research.google.com/ https://medium.com/swlh/where-to-find-awesome-machine-learning-datasets-6bb909a3f350

10 BEST DATA COLLECTION TOOLS FOR EFFECTIVE RESULTS https://www.analyticsinsight.net/10-best-data-collection-tools-for-effective-results/

https://www.freecodecamp.org/news/https-medium-freecodecamp-org-best-free-open-data-sources-anyone-can-use-a65b514b0f2d/ https://research.google/tools/datasets/

Machine learning datasets https://www.datasetlist.com/ https://wiki.pathmind.com/open-datasets

https://guides.library.cmu.edu/az.php https://docs.microsoft.com/en-us/azure/azure-sql/public-data-sets https://registry.opendata.aws/ https://paperswithcode.com/datasets https://datasets.quantumstat.com/ https://www.quandl.com/ http://dataportals.org/ https://opendatamonitor.eu/frontend/web/index.php?r=dashboard%2Findex https://en.wikipedia.org/wiki/List_of_datasets_for_machine-learning_research https://www.quora.com/Where-can-I-find-large-datasets-open-to-the-public https://www.reddit.com/r/datasets/ https://ourworldindata.org/ https://data.worldbank.org/ https://data.world/ https://data.census.gov/cedsci/ https://data.seattle.gov/ https://www.openml.org/ https://visualdata.io/discovery

World’s Largest Data Platform https://worlddata.ai/

Awesome list of datasets in 100+ categories https://www.kdnuggets.com/2021/05/awesome-list-datasets.html

https://sebastianraschka.com/blog/2021/ml-dl-datasets.html  https://enoumen.com/2021/04/23/data-sciences-datasets-data-visualization-data-analytics-big-data-data-lakes/

https://serokell.io/blog/best-machine-learning-datasets https://medium.com/@ODSC/25-excellent-machine-learning-open-datasets-940ca2124dfc  

1)kaggle-https://www.kaggle.com/datasets , pip install kaggledatasets

Downloading Kaggle datasets directly into Google Colab -https://towardsdatascience.com/downloading-kaggle-datasets-directly-into-google-colab-c8f0f407d73a

How to Download Kaggle Datasets using Jupyter Notebook https://www.analyticsvidhya.com/blog/2021/04/how-to-download-kaggle-datasets-using-jupyter-notebook/

2)https://sebastianraschka.com/blog/2021/ml-dl-datasets.html

movielens-https://grouplens.org/datasets/movielens/latest/

DagsHub datasets https://dagshub.com/explore/datasets

100+ of the Best Free Data Sources For Your Next Project https://www.columnfivemedia.com/100-best-free-data-sources-infographic/

World and national data, maps & rankings https://knoema.com/atlas/sources

3)data.gov-https://data.gov.in/

4)uci-https://archive.ics.uci.edu/ml/datasets.php     https://github.com/tirthajyoti/UCI-ML-API

5)Group Lens dataset https://grouplens.org/

Wikipedia ML Datasets https://en.wikipedia.org/wiki/List_of_datasets_for_machine-learning_research

AWS Open Data Registry,data.gov (portals),YELP Open dataset,UNICEF Dataset,Big Bad NLP Database,Microsoft Dataset

6)data.world  https://data.world/ , World Bank

7)Google Cloud BigQuery public datasets

  Google Public Datasets-cloud.google.com/bigquery/public-data/
  
  Google Cloud Data Catalog  https://cloud.google.com/data-catalog
  
  Academic Torrents-https://academictorrents.com/check.htm?returnto=%2Fbrowse.php

8)online hackathons

 Datasets  https://www.paperswithcode.com/datasets

9)image data from google_images_download

https://www.visualdata.io/discovery

http://xviewdataset.org/#dataset

https://ai.googleblog.com/2016/09/introducing-open-images-dataset.html

10)image data from Bing_Search

image data from simple_image_download  https://github.com/RiddlerQ/simple_image_download

11)https://www.columnfivemedia.com/100-best-free-data-sources-infographic

graviti  Unleash the Power of Unstructured Data  https://www.graviti.com/?utm_medium=0730Ismael

12)Reddit:https://lnkd.in/dv5UCD4       https://www.reddit.com/r/datasets/

praw.Reddit https://github.com/praw-dev/praw

13)https://datasets.bifrost.ai/?ref=producthunt

14)data.world:https://lnkd.in/gEK897K

15)https://data.world/datasets/open-data

   https://tinyletter.com/data-is-plural