Python Project for Data Science: Notes 1


Web Scraping: A Key Tool in Data Science

Core Insights

Here are the three main takeaways from this lesson:

  • Web Scraping is the Bridge from Unstructured to Structured Data: It is a vital technique in data science used to automatically extract large amounts of messy data from websites and convert it into an organized format for analysis, machine learning, and real-time applications.
  • Python Provides Powerful, Specialized Tools: Python is highly effective for this task thanks to libraries tailored for different scraping needs: BeautifulSoup for parsing HTML/XML, Scrapy for extensive web crawling, and Selenium for browser automation.
  • Great Utility Requires Ethical Responsibility: While web scraping powers everyday applications like price comparison and social media trend tracking, it must be performed responsibly by respecting a website's terms of use and its robots.txt rules.

Jargon Explained in Plain Language

  • Web Scraping / Web Harvesting: Imagine hiring a super-fast robot to read thousands of web pages and copy only the specific text you want into a spreadsheet. That automated process is web scraping.
  • Unstructured Data: Information that is free-flowing and messy, like the text, images, and layout code of a normal webpage. It doesn't fit neatly into rows and columns.
  • Structured Form: Information that is neatly organized—like a well-labeled Excel spreadsheet or a database. Computers can easily analyze this kind of data.
  • Parse Tree: A behind-the-scenes map created by a program (like BeautifulSoup) that breaks down a webpage's messy code into a simple, hierarchical "family tree," making it much easier to locate specific words or data.
  • Web Crawling Framework: A ready-made toolkit (like Scrapy) for programmers. It helps build automated "spiders" that can jump from link to link across a website, gathering data continuously.
  • Robots.txt: A digital "rulebook" or "Do Not Enter" sign posted by a website owner. It tells automated robots and web scrapers which parts of the website they are allowed to look at and which parts they must stay away from.
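This rulebook can be checked in code before scraping a page. A minimal sketch using Python's built-in `urllib.robotparser`, with a hypothetical robots.txt for an example site (everything under `/private/` is off-limits, the rest is allowed):

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt rules: /private/ is off-limits, the rest is allowed.
rules = """\
User-agent: *
Disallow: /private/
Allow: /
""".splitlines()

rp = RobotFileParser()
rp.parse(rules)

print(rp.can_fetch("*", "https://example.com/blog/post1"))    # True
print(rp.can_fetch("*", "https://example.com/private/data"))  # False
```

In a real scraper you would call `rp.set_url("https://example.com/robots.txt")` and `rp.read()` to fetch the live file, then gate every request on `can_fetch()`.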

Structured Summary

Introduction to Web Scraping

Web scraping, also known as web data extraction, is the process of taking vast amounts of unstructured data from websites and converting it into a structured form that can be easily analyzed.

The Role of Web Scraping in Data Science

Web scraping is an integral part of the data science workflow. Its primary purposes include:

  • Data Collection: Serving as the main method to gather internet data for research and deep analysis.
  • Machine Learning: Supplying the massive datasets required to train machine learning models.
  • Real-time Applications: Providing live information for services that require constant updates, such as weather forecasts.

Key Python Libraries

Python makes web scraping efficient through several specialized libraries:

  • BeautifulSoup: Pulls data out of HTML and XML files by creating an easily readable "parse tree" from the webpage's source code.
  • Scrapy: An open-source framework designed specifically for collaborative web crawling and large-scale data extraction.
  • Selenium: A tool that actually controls and automates web browsers through programming, mimicking how a real human interacts with a webpage.
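As a dependency-free illustration of the kind of task these libraries automate, the sketch below uses Python's built-in `html.parser` to collect every link on a page, roughly what a single BeautifulSoup call (`soup.find_all("a")`) does for you. The page string is made up for the example:

```python
from html.parser import HTMLParser

# LinkCollector records the href attribute of every <a> start tag it sees.
class LinkCollector(HTMLParser):
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.links.append(dict(attrs).get("href"))

page = '<html><body><p>Read <a href="https://example.com">this page</a></p></body></html>'
collector = LinkCollector()
collector.feed(page)
print(collector.links)  # ['https://example.com']
```

BeautifulSoup builds a full parse tree on top of exactly this kind of low-level parser, which is why it is so much more convenient for real pages.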

Practical Applications

The ability to turn the web into a workable dataset has many real-world applications:

  • Price Comparison: Gathering product data from multiple online stores to compare prices (e.g., using services like ParseHub).
  • Social Media Scraping: Collecting data from platforms like Twitter to identify and analyze trending topics.
  • Email Gathering: Collecting contact information for marketing and bulk email purposes.

Conclusion and Ethical Considerations

Web scraping is an essential skill in data science that unlocks the internet as a massive data source. However, practitioners must act ethically by respecting the target website's terms of service and strictly following the allowances outlined in the site's robots.txt file.


HTML for Web Scraping

This lesson provides an introductory overview of Hypertext Markup Language (HTML) specifically tailored for the purpose of web scraping, which is the process of extracting useful data from websites using programming languages like Python.


Core Points of the Lesson

  • HTML is the Foundation of Web Data: Websites are built using HTML tags that tell a browser how to display content, and understanding these tags is essential for extracting data.
  • The Power of Python: With a basic understanding of HTML structure, tools like Python can be used to automatically pull specific information, such as player salaries or real estate prices, from a page.
  • Document Hierarchy: HTML is organized in a "tree" structure where tags are nested within each other, creating parent, child, and sibling relationships.
  • Targeting Specific Tags: Data is usually stored within specific elements like headings (<h3>), paragraphs (<p>), or table cells (<td>), which scrapers use as "addresses" to find information.
  • Browser Inspection: Modern browsers allow users to right-click any part of a webpage and "Inspect" the underlying HTML code to see how the data is structured.

Terminology Explanation

  • Web Scraping: The process of using a computer program to automatically "read" a website and grab specific pieces of information from it.
  • Tags: These are the "labels" of the web. They are pieces of text surrounded by angle brackets (like <html>) that tell the browser what a piece of content is, such as a heading or a link.
  • Element: An element is a complete "package" consisting of a start tag, the actual content (like a person's name), and an end tag.
  • Attribute: This is extra information hidden inside a start tag. For example, in a link, the attribute tells the browser exactly which web address to go to.
  • Root Element: Think of this as the "trunk" of the tree. In HTML, the <html> tag is the root because every other part of the page lives inside it.


Structured Summary

1. Basic HTML Composition

A standard HTML document is divided into two main sections:

  • The Head (<head>): Contains meta-information about the page that isn't usually visible to the user.
  • The Body (<body>): This is where the actual visible content of the page lives, such as text, images, and the data typically targeted for scraping.

2. The Anatomy of an HTML Tag

Using the Anchor Tag (<a>)—which creates hyperlinks—as an example, a tag consists of several parts:

  • Start and End Tags: The start tag (e.g., <a>) marks the beginning, and the end tag (e.g., </a>) marks the finish.
  • Content: The text that appears on the screen for the user to see.
  • Attributes: Components like href define the destination URL of a link.
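This anatomy can be made visible in code. A minimal sketch using Python's built-in `html.parser`, where each part of the element fires its own parser event (the URL is a placeholder):

```python
from html.parser import HTMLParser

# Each part of an element triggers a separate event, so the
# start-tag / content / end-tag anatomy becomes visible.
class AnchorAnatomy(HTMLParser):
    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        self.events.append(("start tag", tag, dict(attrs)))

    def handle_data(self, data):
        self.events.append(("content", data))

    def handle_endtag(self, tag):
        self.events.append(("end tag", tag))

parser = AnchorAnatomy()
parser.feed('<a href="https://example.com">Visit Example</a>')
for event in parser.events:
    print(event)
# ('start tag', 'a', {'href': 'https://example.com'})
# ('content', 'Visit Example')
# ('end tag', 'a')
```

The attribute (`href`) arrives with the start tag, while the visible text ("Visit Example") arrives as content between the two tags.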

3. Understanding the HTML Tree

HTML documents are structured like a family tree:

  • Parents and Children: If a tag is inside another tag, the outer one is the "parent" and the inner one is the "child".
  • Siblings: Tags that are at the same level of nesting (like a list of player names) are called siblings.
  • Descendants: Any tag nested anywhere inside a parent, even deep down, is considered a descendant.
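These relationships can be sketched with Python's built-in `xml.etree.ElementTree` on a tiny well-formed snippet (real-world HTML is messier, which is why tools like BeautifulSoup exist; the names are made up):

```python
import xml.etree.ElementTree as ET

# A tiny, well-formed document: <html> is the root of the tree.
doc = ET.fromstring(
    "<html><body><ul><li>Alice</li><li>Bob</li></ul></body></html>"
)

body = doc.find("body")           # <body> is a child of the root <html>
ul = body.find("ul")              # <ul> is a child of <body>
names = [li.text for li in ul]    # the two <li> tags are siblings
print(names)                      # ['Alice', 'Bob']

# Every tag nested inside <html>, however deep, is one of its descendants:
descendants = [el.tag for el in doc.iter()][1:]  # skip the root itself
print(descendants)                # ['body', 'ul', 'li', 'li']
```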

4. HTML Tables

Tables are a common way to store structured data on the web:

  • <table>: Defines the start of the table area.
  • <tr> (Table Row): Used to create a new horizontal row of data.
  • <td> (Table Data): Defines an individual cell within a row where the actual data (like a number or name) is stored.
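Pulling data out of such a table follows directly from the row/cell nesting. A minimal sketch using Python's built-in `xml.etree.ElementTree`, with a made-up salary table (BeautifulSoup's `find_all("tr")` / `find_all("td")` works the same way on real pages):

```python
import xml.etree.ElementTree as ET

# Made-up salary table: each <tr> is a row, each <td> a cell.
table_html = (
    "<table>"
    "<tr><td>Player A</td><td>1200000</td></tr>"
    "<tr><td>Player B</td><td>950000</td></tr>"
    "</table>"
)

table = ET.fromstring(table_html)
rows = [[td.text for td in tr.findall("td")] for tr in table.findall("tr")]
print(rows)  # [['Player A', '1200000'], ['Player B', '950000']]
```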