Web scraping guy

(I made this image with ChatGPT’s Sora 💅🏼)

📖 Overview

In this tutorial we’ll build a Python‑based web scraper that:

  • Navigates dynamic sites with Selenium.
  • Extracts and cleans HTML using BeautifulSoup.
  • Analyzes the scraped text with LangChain (LLM‑powered summarization / keyword extraction).
  • Presents the results through a lightweight Streamlit dashboard.

By the end you’ll have a reusable pipeline that can scrape any public web page, feed the raw content to an LLM, and display the AI‑generated insights—all in a single, reproducible script.

🛠️ Prerequisites

These instructions assume an up-to-date version of macOS with Homebrew installed.

Python ≥ 3.13

 brew install python@3.13 

Selenium

 pip install selenium 

ChromeDriver

Download a build that matches your Chrome version from https://chromedriver.chromium.org/ (with Selenium ≥ 4.6, the bundled Selenium Manager can also download a matching driver for you automatically).

BeautifulSoup4

 pip install beautifulsoup4 

LangChain + OpenAI SDK

 pip install langchain openai 

Streamlit

 pip install streamlit 

OpenAI

You’ll also need to sign up for an OpenAI account and create an API key (or use any other LLM provider supported by LangChain).

Set it as an environment variable:

export OPENAI_API_KEY="sk-your-key-here"
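LangChain’s OpenAI integration reads this variable automatically. If you like, a small guard at the top of your script fails fast when the key is missing (a minimal, optional sketch):

import os

# Abort early if the API key is missing rather than failing mid-run
if not os.environ.get("OPENAI_API_KEY"):
    raise RuntimeError("OPENAI_API_KEY is not set")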

🚀 Step‑by‑Step Implementation

1️⃣ Initialise Selenium

from selenium import webdriver
from selenium.webdriver.chrome.service import Service
import time

# Path to your ChromeDriver executable
driver_path = "/path/to/chromedriver"

def fetch_page(url: str) -> str:
    service = Service(driver_path)

    options = webdriver.ChromeOptions()
    options.add_argument("--headless")      # Run without opening a browser window
    options.add_argument("--disable-gpu")
    options.add_argument("--no-sandbox")

    # Create a fresh browser session per call so repeated runs don't hit a closed driver
    driver = webdriver.Chrome(service=service, options=options)
    try:
        driver.get(url)
        time.sleep(3)  # Wait for dynamic content to load (adjust as needed)
        return driver.page_source  # Grab the HTML before the session closes
    finally:
        driver.quit()  # Always close the Selenium session when done
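If the fixed time.sleep(3) proves flaky on slow pages, Selenium’s explicit waits are a sturdier alternative. Here’s a minimal sketch you could call in place of the sleep (the wait_for_body helper is illustrative, not part of the pipeline above):

from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

def wait_for_body(driver, timeout: int = 10) -> None:
    # Block until the <body> element is present, or raise TimeoutException
    WebDriverWait(driver, timeout).until(
        EC.presence_of_element_located((By.TAG_NAME, "body"))
    )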

2️⃣ Parse HTML with BeautifulSoup

from bs4 import BeautifulSoup

def extract_text(html: str) -> str:
    soup = BeautifulSoup(html, "html.parser")
    
    # Remove scripts / styles
    for tag in soup(["script", "style"]):
        tag.decompose()
    
    # Grab the main article body – adapt the selector to the target site
    article = soup.select_one("article") or soup.body
    return article.get_text(separator="\n", strip=True)
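A quick sanity check on an inline snippet (illustrative only) shows the boilerplate stripping in action:

sample_html = """
<html><body>
  <script>trackVisitor();</script>
  <article><h1>Hello</h1><p>First paragraph.</p></article>
</body></html>
"""
print(extract_text(sample_html))
# Hello
# First paragraph.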

3️⃣ Analyse Text with LangChain

from langchain.chat_models import ChatOpenAI
from langchain.chains.summarize import load_summarize_chain
from langchain.docstore.document import Document
from langchain.text_splitter import RecursiveCharacterTextSplitter

# gpt-3.5-turbo is a chat model, so use ChatOpenAI rather than the completion-style OpenAI class
llm = ChatOpenAI(model="gpt-3.5-turbo", temperature=0.2)

def summarize(content: str) -> str:
    # map_reduce operates on a list of Documents, so chunk long pages first
    splitter = RecursiveCharacterTextSplitter(chunk_size=2000, chunk_overlap=200)
    docs = [Document(page_content=chunk) for chunk in splitter.split_text(content)]
    chain = load_summarize_chain(llm, chain_type="map_reduce")
    return chain.run(docs)

def extract_keywords(content: str) -> list[str]:
    prompt = (
        "Extract the top 5 keywords from the following text. "
        "Return them as a comma‑separated list.\n\n"
        f"{content}"
    )
    response = llm.predict(prompt)  # .predict() accepts a plain string prompt
    return [kw.strip() for kw in response.split(",")]
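At this point the pipeline can already run end to end from the command line. A quick smoke test, assuming OPENAI_API_KEY is set and using example.com as a stand-in URL:

if __name__ == "__main__":
    html = fetch_page("https://example.com")  # any public page works
    text = extract_text(html)
    print("Summary:", summarize(text))
    print("Keywords:", extract_keywords(text))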

4️⃣ Build a Streamlit UI

Create a file called app.py:

import streamlit as st

# fetch_page, extract_text, summarize and extract_keywords are the helpers
# defined in the previous steps (keep them in this file or import them)

st.title("🕸️ AI‑Powered Web Scraper")
url = st.text_input("Enter a URL to scrape:", "")

if st.button("Run scraper"):
    if url:
        html = fetch_page(url)
        raw_text = extract_text(html)

        with st.spinner("Summarising..."):
            summary = summarize(raw_text)

        with st.spinner("Extracting keywords..."):
            keywords = extract_keywords(raw_text)

        st.subheader("📄 Summary")
        st.write(summary)

        st.subheader("🔑 Keywords")
        st.write(", ".join(keywords))

        st.subheader("🧾 Raw extracted text")
        st.text_area("Full text", raw_text, height=300)
    else:
        st.warning("Please enter a URL first.")

Run the app locally:

$ streamlit run app.py

You’ll see a clean web interface where you can paste any URL, click Run scraper, and instantly receive a concise AI‑generated summary plus key terms.

✅ What We’ve Built

  • Dynamic navigation (Selenium) → handles JS‑rendered pages.
  • Robust text extraction (BeautifulSoup) → strips boilerplate.
  • LLM‑driven insight (LangChain + OpenAI) → summarises and highlights keywords.
  • Interactive front‑end (Streamlit) → no need for a separate web server.

From here one could easily extend the pipeline:

  • Add pagination support for multi‑page articles.
  • Store results in a SQLite or PostgreSQL database (see the sketch below).
  • Swap the OpenAI model for a local LLM (e.g., Llama 2) via LangChain’s adapters.
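For the storage idea, Python’s standard library already covers SQLite. A minimal sketch (the save_result helper and its schema are illustrative assumptions, not part of the tutorial code):

import sqlite3

def save_result(db_path: str, url: str, summary: str, keywords: list[str]) -> None:
    # Persist one scrape result; the table is created on first use
    with sqlite3.connect(db_path) as conn:
        conn.execute(
            "CREATE TABLE IF NOT EXISTS results (url TEXT, summary TEXT, keywords TEXT)"
        )
        conn.execute(
            "INSERT INTO results VALUES (?, ?, ?)",
            (url, summary, ", ".join(keywords)),
        )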

📚 Further Reading

  • Autopilot or Co‑Pilot? Unpacking the Role of LLMs in Modern Development
  • How to Harness LLMs While Coding: A Practical Guide for Software Engineers

Happy scraping and enjoy watching AI turn raw web data into actionable knowledge!
