Implementing brute-force pre-caching and full-site static caching for WordPress with WP Super Cache + a Python script.

WordPress has a built-in scheduled task mechanism (WP-Cron), and the WP Super Cache plugin uses it to run its preload job. In my experience, each preload run generates at most a dozen or so pages and then waits for the next scheduled run, which is painfully slow. Even if you let WP Super Cache keep preloading, pages do get generated continuously, but still far too slowly. So I wrote a simple Python script that also warms the category pagination pages that WP Super Cache cannot preload.

https://img.a8dog.com/i/2024/03/25/qw5iaj.png

Code:

First, the server needs a Python 3 environment, and the WordPress site must have the WP Super Cache plugin installed. Save the following code as page.py; it caches the paginated pages of your categories.

import os
import requests
import time
from concurrent.futures import ThreadPoolExecutor
from itertools import islice

# Define links and their corresponding quantities
links = {
    "https://a8dog.com": 60,
    "https://a8dog.com/a8": 10,
}

# Generate links
all_links = []
for link, count in links.items():
    for i in range(1, count + 1):
        page_link = f"{link}/page/{i}" if i > 1 else link
        all_links.append(page_link)

# Group the links for subsequent concurrent access
def chunk(it, size):
    it = iter(it)
    return iter(lambda: tuple(islice(it, size)), ())

# Function to visit the links
def visit_url(url):
    try:
        response = requests.get(url)
        print(f"Visited: {url}, Status Code: {response.status_code}")
    except Exception as e:
        print(f"Failed to visit: {url}, Error: {e}")

# Set concurrency and delay
concurrency = 10  # Concurrency
delay = 1  # Delay time (seconds)

# Create a thread pool and visit the links concurrently, one batch at a time
with ThreadPoolExecutor(max_workers=concurrency) as executor:
    for chunked_links in chunk(all_links, concurrency):
        futures = [executor.submit(visit_url, url) for url in chunked_links]
        # Wait for the current batch to finish, then pause before starting the next one
        for future in futures:
            future.result()
        time.sleep(delay)

# Write the links to the page.txt file
with open("page.txt", "w") as f:
    for link in all_links:
        f.write(link + "\n")

Replace the links in the code with your own category pages; the number after each category URL is how many paginated pages to request.

For example, if a category currently has 10 pages, I set the count to 15 so that caching still covers everything as more articles are published and the pagination grows.
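
To make the format concrete, here is a hypothetical entry (example.com is a placeholder, not part of my setup) and the URLs the script above would generate for it:

# Hypothetical entry: a category with 3 paginated pages
links = {
    "https://example.com/category/news": 3,
}
# page.py would then request, in order:
#   https://example.com/category/news
#   https://example.com/category/news/page/2
#   https://example.com/category/news/page/3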

Then save the following code as url.py; it pre-caches every URL listed in the site's sitemap:

import requests
import xml.etree.ElementTree as ET
import threading
import time

# Set concurrency and request interval
CONCURRENT_REQUESTS = 10
REQUEST_INTERVAL = 1  # seconds

def fetch_sitemap(url):
    """
    Fetch the content of the website sitemap
    """
    response = requests.get(url)
    if response.status_code == 200:
        return response.content
    else:
        print(f"Failed to fetch sitemap from {url}")
        return None

def extract_sitemap_urls(sitemap_content):
    """
    Extract sub-sitemap links from the content of the website sitemap
    """
    urls = []
    if sitemap_content:
        try:
            root = ET.fromstring(sitemap_content)
            for loc in root.findall(".//{http://www.sitemaps.org/schemas/sitemap/0.9}loc"):
                urls.append(loc.text)
        except ET.ParseError as e:
            print("Error parsing sitemap XML:", e)
    return urls

def fetch_urls_from_sitemap(url):
    """
    Extract webpage links from the website sitemap
    """
    sitemap_content = fetch_sitemap(url)
    if sitemap_content:
        return extract_sitemap_urls(sitemap_content)
    else:
        return []

def fetch_url(url):
    """
    Send a request to the website link
    """
    try:
        response = requests.get(url)
        # Handle the desired response content here
        print("Fetched:", url)
    except requests.RequestException as e:
        print("Error fetching", url, ":", e)

def main():
    sitemap_url = "https://a8dog.com/wp-sitemap.xml"  # Replace with your website sitemap link
    sitemap_urls = fetch_urls_from_sitemap(sitemap_url)
    all_urls = []

    # Extract webpage links from all sub-sitemaps
    for url in sitemap_urls:
        all_urls.extend(fetch_urls_from_sitemap(url))

    # Write to the url.txt file
    with open('url.txt', 'w') as f:
        for url in all_urls:
            f.write(url + '\n')
    print("Urls extracted and written to url.txt file.")

    # Thread function for concurrent requests; each worker pops URLs from the shared list
    def fetch_urls(urls):
        while True:
            try:
                url = urls.pop(0)
            except IndexError:
                # Another thread grabbed the last URL between the check and the pop
                break
            fetch_url(url)
            time.sleep(REQUEST_INTERVAL)

    # Send requests concurrently with CONCURRENT_REQUESTS threads
    threads = []
    for _ in range(CONCURRENT_REQUESTS):
        thread = threading.Thread(target=fetch_urls, args=(all_urls,))
        thread.start()
        threads.append(thread)

    # Wait for all threads to complete
    for thread in threads:
        thread.join()

if __name__ == "__main__":
    main()

Replace the sitemap address with your own site's sitemap address, and the script will automatically collect every link on the site from the sitemap and request it for pre-caching.
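
For context, the default WordPress core sitemap at /wp-sitemap.xml (WordPress 5.5+) is an index whose entries point to sub-sitemaps, which is why the script fetches <loc> tags at two levels. A simplified, illustrative index (not actual output from a8dog.com) looks roughly like this:

<?xml version="1.0" encoding="UTF-8"?>
<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <sitemap><loc>https://example.com/wp-sitemap-posts-post-1.xml</loc></sitemap>
  <sitemap><loc>https://example.com/wp-sitemap-taxonomies-category-1.xml</loc></sitemap>
</sitemapindex>

Each sub-sitemap then lists the actual page URLs inside <url><loc> elements, which the same extract_sitemap_urls function picks up.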

Then add a shell-script scheduled task in Baota that runs both scripts:

python3 /your_directory/page.py
python3 /your_directory/url.py

You may encounter an error:

Traceback (most recent call last):
  File "/a8dog.py", line 2, in <module>
    import requests
ModuleNotFoundError: No module named 'requests'

This happens because scheduled tasks in Baota run under Baota's own Python environment, which does not have the requests module installed. Just add a shell scheduled task containing:

pip3 install requests

Execute it once, then re-add the original scheduled task.
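
If the error still appears, it is usually a pip/interpreter mismatch; as a hedged alternative, invoking pip through the same python3 binary the task uses makes sure requests lands in the right environment:

python3 -m pip install requests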

Note:

If your domain sits behind a CDN or firewall, the frequent requests made during pre-caching may get blocked. Whitelist the server's IP there, and edit the hosts file on the server that runs the scripts so the domain resolves to the origin IP (or to 127.0.0.1 if the scripts run on the origin server itself); this also keeps the warm-up traffic from consuming CDN bandwidth.
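
As a concrete sketch (assuming the scripts run on the origin server itself, using this site's domain), the hosts entry on Linux would look like this:

# /etc/hosts on the server running page.py and url.py
127.0.0.1 a8dog.com

If the scripts run on a different machine, point the domain at the origin server's public IP instead.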
