
(I made this image with ChatGPT’s Sora 💅🏼)
📖 Overview
In this tutorial we’ll build a Python‑based web scraper that:
- Navigates dynamic sites with Selenium.
- Extracts and cleans HTML using BeautifulSoup.
- Analyzes the scraped text with LangChain (LLM‑powered summarization / keyword extraction).
- Presents the results through a lightweight Streamlit dashboard.
By the end you’ll have a reusable pipeline that can scrape any public web page, feed the raw content to an LLM, and display the AI‑generated insights—all in a single, reproducible script.
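Conceptually, the whole pipeline is four function calls chained together. Here's the shape we're building toward; every function is defined in the steps below (a sketch, not runnable until those steps are in place):

def run_pipeline(url: str) -> dict:
    html = fetch_page(url)        # Selenium: render the page, including JS content
    text = extract_text(html)     # BeautifulSoup: strip tags and boilerplate
    return {
        "summary": summarize(text),          # LangChain: LLM-generated summary
        "keywords": extract_keywords(text),  # LangChain: top 5 keywords
    }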
🛠️ Prerequisites
These steps assume macOS with an up-to-date Homebrew install.
Python ≥ 3.13
brew install python@3.13
Selenium
pip install selenium
ChromeDriver
Download from https://chromedriver.chromium.org/ (match the driver to your installed Chrome version; Selenium 4.6+ can also fetch a driver automatically via Selenium Manager)
BeautifulSoup4
pip install beautifulsoup4
LangChain + OpenAI SDK
pip install langchain openai
Streamlit
pip install streamlit
OpenAI
You’ll also need to sign up for an OpenAI account and create an API key (or use a key from any other LLM provider supported by LangChain).
Set it as an environment variable:
export OPENAI_API_KEY="sk-your-key-here"
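Before running anything, it's worth confirming the key is actually visible to Python. A minimal sanity check:

import os

# Fail fast if the key isn't set, rather than erroring mid-pipeline
if not os.environ.get("OPENAI_API_KEY"):
    raise RuntimeError("OPENAI_API_KEY is not set - export it before running.")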
🚀 Step‑by‑Step Implementation
1️⃣ Initialise Selenium
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
import time
# Path to your ChromeDriver executable
driver_path = "/path/to/chromedriver"
service = Service(driver_path)

options = webdriver.ChromeOptions()
options.add_argument("--headless")  # Run without opening a browser window
options.add_argument("--disable-gpu")
options.add_argument("--no-sandbox")

def fetch_page(url: str) -> str:
    # Create a fresh browser session per fetch so repeated calls keep working
    driver = webdriver.Chrome(service=service, options=options)
    try:
        driver.get(url)
        time.sleep(3)  # Wait for dynamic content to load (adjust as needed)
        return driver.page_source
    finally:
        driver.quit()  # Always close the Selenium session, even on errors
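The fixed time.sleep(3) is the simplest possible wait. If you know a CSS selector that appears once the page has rendered, Selenium's explicit waits are more reliable. A variant sketch (the default "article" selector is an assumption to adapt per site; it reuses service, options, and By from above):

from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

def fetch_page_waiting(url: str, css_selector: str = "article") -> str:
    # Like fetch_page, but blocks until the selector appears (up to 10 s)
    driver = webdriver.Chrome(service=service, options=options)
    try:
        driver.get(url)
        WebDriverWait(driver, 10).until(
            EC.presence_of_element_located((By.CSS_SELECTOR, css_selector))
        )
        return driver.page_source
    finally:
        driver.quit()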
2️⃣ Parse HTML with BeautifulSoup
from bs4 import BeautifulSoup
def extract_text(html: str) -> str:
    soup = BeautifulSoup(html, "html.parser")
    # Remove scripts / styles
    for tag in soup(["script", "style"]):
        tag.decompose()
    # Grab the main article body - adapt the selector to the target site
    article = soup.select_one("article") or soup.body
    return article.get_text(separator="\n", strip=True)
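At this point you can already sanity-check the scraping half of the pipeline (example.com is just a placeholder URL):

# Smoke test: fetch a page and preview the cleaned text
html = fetch_page("https://example.com")
print(extract_text(html)[:500])  # First 500 characters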
3️⃣ Analyse Text with LangChain
from langchain.chat_models import ChatOpenAI
from langchain.chains.summarize import load_summarize_chain
from langchain.schema import Document

# gpt-3.5-turbo is a chat model, so use ChatOpenAI rather than the completion-style OpenAI class
llm = ChatOpenAI(model="gpt-3.5-turbo", temperature=0.2)

def summarize(content: str) -> str:
    chain = load_summarize_chain(llm, chain_type="map_reduce")
    # The summarize chain expects Document objects, not a raw string
    return chain.run([Document(page_content=content)])

def extract_keywords(content: str) -> list[str]:
    prompt = (
        "Extract the top 5 keywords from the following text. "
        "Return them as a comma-separated list.\n\n"
        f"{content}"
    )
    response = llm.predict(prompt)
    return [kw.strip() for kw in response.split(",")]
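One caveat: a long article can exceed the model's context window. The usual remedy is to split the text into chunks before summarising. A minimal sketch using LangChain's text splitter (the chunk sizes are assumptions to tune):

from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.schema import Document

def summarize_long(content: str) -> str:
    # Split into overlapping ~2,000-character chunks so each fits the context window
    splitter = RecursiveCharacterTextSplitter(chunk_size=2000, chunk_overlap=200)
    docs = [Document(page_content=chunk) for chunk in splitter.split_text(content)]
    chain = load_summarize_chain(llm, chain_type="map_reduce")
    return chain.run(docs)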
4️⃣ Build a Streamlit UI
Create a file called app.py:
import streamlit as st

# Import the pipeline functions defined above (assuming they live in scraper.py)
from scraper import fetch_page, extract_text, summarize, extract_keywords

st.title("🕸️ AI‑Powered Web Scraper")

url = st.text_input("Enter a URL to scrape:", "")

if st.button("Run scraper"):
    if url:
        html = fetch_page(url)
        raw_text = extract_text(html)
        with st.spinner("Summarising..."):
            summary = summarize(raw_text)
        with st.spinner("Extracting keywords..."):
            keywords = extract_keywords(raw_text)
        st.subheader("📄 Summary")
        st.write(summary)
        st.subheader("🔑 Keywords")
        st.write(", ".join(keywords))
        st.subheader("🧾 Raw extracted text")
        st.text_area("Full text", raw_text, height=300)
    else:
        st.warning("Please enter a URL first.")
Run the app locally:
$ streamlit run app.py
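One practical tweak: Streamlit reruns the entire script on every interaction, so caching scrape results per URL avoids relaunching the browser each time. A small sketch using st.cache_data (it assumes the same scraper imports as app.py):

import streamlit as st

@st.cache_data(show_spinner=False)
def cached_scrape(url: str) -> str:
    # Memoise the extracted text per URL across Streamlit reruns
    return extract_text(fetch_page(url))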
You’ll see a clean web interface where you can paste any URL, click Run scraper, and instantly receive a concise AI‑generated summary plus key terms.
✅ What We’ve Built
- Dynamic navigation (Selenium) → handles JS‑rendered pages.
- Robust text extraction (BeautifulSoup) → strips boilerplate.
- LLM‑driven insight (LangChain + OpenAI) → summarises and highlights keywords.
- Interactive front‑end (Streamlit) → no need for a separate web server.
From here you could easily extend the pipeline:
- Add pagination support for multi‑page articles.
- Store results in a SQLite or PostgreSQL database (see the sketch after this list).
- Swap the OpenAI model for a local LLM (e.g., Llama 2) via LangChain’s adapters.
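As an example of the storage idea, persisting each run to SQLite takes only a few lines (the scrapes.db filename and schema are illustrative):

import sqlite3

def save_result(url: str, summary: str, keywords: list[str]) -> None:
    # Append each scrape to a local SQLite file; the connection commits on success
    with sqlite3.connect("scrapes.db") as conn:
        conn.execute(
            "CREATE TABLE IF NOT EXISTS scrapes (url TEXT, summary TEXT, keywords TEXT)"
        )
        conn.execute(
            "INSERT INTO scrapes VALUES (?, ?, ?)",
            (url, summary, ", ".join(keywords)),
        )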
📚 Further Reading
- Selenium documentation: https://www.selenium.dev/documentation/
- BeautifulSoup documentation: https://www.crummy.com/software/BeautifulSoup/bs4/doc/
- LangChain documentation: https://python.langchain.com/
- Streamlit documentation: https://docs.streamlit.io/
Happy scraping and enjoy watching AI turn raw web data into actionable knowledge!