Web scraping guy

(I made this image with ChatGPT’s Sora 💅🏼)

📖 Overview

In this tutorial we’ll build a Python‑based web scraper that:

  • Navigates dynamic sites with Selenium.
  • Extracts and cleans HTML using BeautifulSoup.
  • Analyzes the scraped text with LangChain (LLM‑powered summarization / keyword extraction).
  • Presents the results through a lightweight Streamlit dashboard.

By the end you’ll have a reusable pipeline that can scrape any public web page, feed the raw content to an LLM, and display the AI‑generated insights—all in a single, reproducible script.

🛠️ Prerequisites

These instructions assume an up-to-date version of macOS with Homebrew installed.

Python ≥ 3.13

 brew install python@3.13 

Selenium

 pip install selenium 

ChromeDriver

Download a build that matches your Chrome version from https://chromedriver.chromium.org/ (with Selenium ≥ 4.6, the bundled Selenium Manager can also download a matching driver for you automatically).

BeautifulSoup4

 pip install beautifulsoup4 

LangChain + OpenAI SDK

 pip install langchain openai 

Streamlit

 pip install streamlit 

OpenAI

You’ll also need to sign up for an OpenAI account and create an API key (or use any other LLM provider supported by LangChain).

Set it as an environment variable:

export OPENAI_API_KEY="sk-your-key-here"
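LangChain’s OpenAI integration reads this variable automatically. If you like, a small guard at the top of your script fails fast when the key is missing (a minimal, optional sketch):

import os

# Abort early if the API key is missing rather than failing mid-run
if not os.environ.get("OPENAI_API_KEY"):
    raise RuntimeError("OPENAI_API_KEY is not set")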

🚀 Step‑by‑Step Implementation

1️⃣ Initialise Selenium

from selenium import webdriver
from selenium.webdriver.chrome.service import Service
import time

# Path to your ChromeDriver executable
driver_path = "/path/to/chromedriver"

def fetch_page(url: str) -> str:
    service = Service(driver_path)

    options = webdriver.ChromeOptions()
    options.add_argument("--headless")      # Run without opening a browser window
    options.add_argument("--disable-gpu")
    options.add_argument("--no-sandbox")

    # Create a fresh browser session per call so repeated runs don't hit a closed driver
    driver = webdriver.Chrome(service=service, options=options)
    try:
        driver.get(url)
        time.sleep(3)  # Wait for dynamic content to load (adjust as needed)
        return driver.page_source  # Grab the HTML before the session closes
    finally:
        driver.quit()  # Always close the Selenium session when done
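If the fixed time.sleep(3) proves flaky on slow pages, Selenium’s explicit waits are a sturdier alternative. Here’s a minimal sketch you could call in place of the sleep (the wait_for_body helper is illustrative, not part of the pipeline above):

from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

def wait_for_body(driver, timeout: int = 10) -> None:
    # Block until the <body> element is present, or raise TimeoutException
    WebDriverWait(driver, timeout).until(
        EC.presence_of_element_located((By.TAG_NAME, "body"))
    )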

2️⃣ Parse HTML with BeautifulSoup

from bs4 import BeautifulSoup

def extract_text(html: str) -> str:
    soup = BeautifulSoup(html, "html.parser")
    
    # Remove scripts / styles
    for tag in soup(["script", "style"]):
        tag.decompose()
    
    # Grab the main article body – adapt the selector to the target site
    article = soup.select_one("article") or soup.body
    return article.get_text(separator="\n", strip=True)
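A quick sanity check on an inline snippet (illustrative only) shows the boilerplate stripping in action:

sample_html = """
<html><body>
  <script>trackVisitor();</script>
  <article><h1>Hello</h1><p>First paragraph.</p></article>
</body></html>
"""
print(extract_text(sample_html))
# Hello
# First paragraph.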

3️⃣ Analyse Text with LangChain

from langchain.chat_models import ChatOpenAI
from langchain.chains.summarize import load_summarize_chain
from langchain.docstore.document import Document
from langchain.text_splitter import RecursiveCharacterTextSplitter

# gpt-3.5-turbo is a chat model, so use ChatOpenAI rather than the completion-style OpenAI class
llm = ChatOpenAI(model="gpt-3.5-turbo", temperature=0.2)

def summarize(content: str) -> str:
    # map_reduce operates on a list of Documents, so chunk long pages first
    splitter = RecursiveCharacterTextSplitter(chunk_size=2000, chunk_overlap=200)
    docs = [Document(page_content=chunk) for chunk in splitter.split_text(content)]
    chain = load_summarize_chain(llm, chain_type="map_reduce")
    return chain.run(docs)

def extract_keywords(content: str) -> list[str]:
    prompt = (
        "Extract the top 5 keywords from the following text. "
        "Return them as a comma‑separated list.\n\n"
        f"{content}"
    )
    response = llm.predict(prompt)  # .predict() accepts a plain string prompt
    return [kw.strip() for kw in response.split(",")]
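At this point the pipeline can already run end to end from the command line. A quick smoke test, assuming OPENAI_API_KEY is set and using example.com as a stand-in URL:

if __name__ == "__main__":
    html = fetch_page("https://example.com")  # any public page works
    text = extract_text(html)
    print("Summary:", summarize(text))
    print("Keywords:", extract_keywords(text))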

4️⃣ Build a Streamlit UI

Create a file called app.py:

import streamlit as st

# fetch_page, extract_text, summarize and extract_keywords are the helpers
# defined in the previous steps (keep them in this file or import them)

st.title("🕸️ AI‑Powered Web Scraper")
url = st.text_input("Enter a URL to scrape:", "")

if st.button("Run scraper"):
    if url:
        html = fetch_page(url)
        raw_text = extract_text(html)

        with st.spinner("Summarising..."):
            summary = summarize(raw_text)

        with st.spinner("Extracting keywords..."):
            keywords = extract_keywords(raw_text)

        st.subheader("📄 Summary")
        st.write(summary)

        st.subheader("🔑 Keywords")
        st.write(", ".join(keywords))

        st.subheader("🧾 Raw extracted text")
        st.text_area("Full text", raw_text, height=300)
    else:
        st.warning("Please enter a URL first.")

Run the app locally:

$ streamlit run app.py

You’ll see a clean web interface where you can paste any URL, click Run scraper, and instantly receive a concise AI‑generated summary plus key terms.

✅ What We’ve Built

  • Dynamic navigation (Selenium) → handles JS‑rendered pages.
  • Robust text extraction (BeautifulSoup) → strips boilerplate.
  • LLM‑driven insight (LangChain + OpenAI) → summarises and highlights keywords.
  • Interactive front‑end (Streamlit) → no need for a separate web server.

From here one could easily extend the pipeline:

  • Add pagination support for multi‑page articles.
  • Store results in a SQLite or PostgreSQL database (see the sketch below).
  • Swap the OpenAI model for a local LLM (e.g., Llama 2) via LangChain’s adapters.
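For the storage idea, Python’s standard library already covers SQLite. A minimal sketch (the save_result helper and its schema are illustrative assumptions, not part of the tutorial code):

import sqlite3

def save_result(db_path: str, url: str, summary: str, keywords: list[str]) -> None:
    # Persist one scrape result; the table is created on first use
    with sqlite3.connect(db_path) as conn:
        conn.execute(
            "CREATE TABLE IF NOT EXISTS results (url TEXT, summary TEXT, keywords TEXT)"
        )
        conn.execute(
            "INSERT INTO results VALUES (?, ?, ?)",
            (url, summary, ", ".join(keywords)),
        )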

📚 Further Reading

  • Autopilot or Co‑Pilot? Unpacking the Role of LLMs in Modern Development
  • How to Harness LLMs While Coding: A Practical Guide for Software Engineers

Happy scraping and enjoy watching AI turn raw web data into actionable knowledge!
