Author Topic: Python (+Selenium) Chrome script for IMDb People script  (Read 62 times)


Offline afrocuban

  • Moderator
Python (+Selenium) Chrome script for IMDb People script
« on: January 04, 2025, 04:48:45 am »
This is a Selenium script specific to the People script.


Script description: This script automates downloading various IMDb pages with Selenium and ChromeDriver. It handles cookies and popups and saves the resulting HTML to local files.
It detects your locale automatically: it queries the http://ipinfo.io API for your country code, then uses a dictionary to map that country code to a language. If you don't want this, comment out the first part of the script and uncomment the part at the end of the script. Open the script in a text editor and read the comments about this.
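A minimal sketch of that locale detection, for readers who want to see the idea in isolation. It is not the code from the attached script: the mapping dictionary, the default language and both function names are hypothetical, and it uses the standard library's urllib instead of requests so it runs without extra packages.

```python
import json
from urllib.request import urlopen

# Hypothetical, deliberately small mapping; the real script's dictionary is not shown in this thread.
COUNTRY_TO_LANGUAGE = {
    "US": "en-US",
    "GB": "en-GB",
    "DE": "de-DE",
    "FR": "fr-FR",
    "SI": "sl-SI",
}
DEFAULT_LANGUAGE = "en-US"  # Assumed fallback when the country is unknown


def language_for_country(country_code):
    """Map a two-letter country code to a language tag, falling back to the default."""
    return COUNTRY_TO_LANGUAGE.get((country_code or "").upper(), DEFAULT_LANGUAGE)


def detect_language(timeout=5):
    """Ask ipinfo.io for the country code of the current IP and map it to a language."""
    try:
        with urlopen("https://ipinfo.io/json", timeout=timeout) as response:
            country = json.load(response).get("country")
    except (OSError, ValueError):
        # Network failure or unparsable reply: use the default instead of crashing
        return DEFAULT_LANGUAGE
    return language_for_country(country)
```

Calling detect_language() needs network access; language_for_country() can be tried offline.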
For the script to work, ensure that:


Quote
A. Python is installed.
B. selenium and requests are installed, using:


Quote
pip install selenium requests


C. The Chrome binary is on your PATH.
D. The Python folder is on your PATH.
E. pythonw.exe is not missing, and its containing folder is on your PATH.
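The checklist above can be verified with a short script. This is only a sketch under assumptions: the function name is made up, and the executable names passed to it ("python", "pythonw", "chrome") are guesses at how the binaries are named on your system.

```python
import importlib.util
import shutil


def check_prerequisites(modules=("selenium", "requests"),
                        executables=("python", "pythonw", "chrome")):
    """Return a dict that says, per item, whether it was found."""
    report = {}
    for name in modules:
        # find_spec returns None when a top-level module is not installed
        report[f"module:{name}"] = importlib.util.find_spec(name) is not None
    for exe in executables:
        # shutil.which searches the folders on PATH, like the shell does
        report[f"exe:{exe}"] = shutil.which(exe) is not None
    return report
```

Print the result with `for item, found in check_prerequisites().items(): print(item, "OK" if found else "MISSING")`.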


This script:


Quote
1. Uses the Chrome browser instead of Firefox
2. Uses chromedriver.exe instead of geckodriver
3. Starts chromedriver.exe silently
4. Silently invokes the browser in headless mode (no browser windows pop up)
5. Scrapes the .htm page of a given URL
6. Needs no path set manually inside the script: the path is relative to the location of the Selenium script!

To use the relative path, ensure that:

Quote
6A. You put this script into the "Scripts" folder of your PVD instance.
6B. You put the appropriate chromedriver.exe into the "Scripts" folder, too.
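A rough sketch of how points 4 and 6 can fit together: resolving chromedriver.exe next to the script and starting Chrome headless. The two function names are hypothetical and the attached script may do this differently. The "--headless=new" flag assumes a reasonably recent Chrome.

```python
import os


def chromedriver_path(script_file):
    """Return the path of chromedriver.exe in the same folder as the given script."""
    return os.path.join(os.path.dirname(os.path.abspath(script_file)), "chromedriver.exe")


def start_headless_chrome(script_file):
    """Start Chrome without a visible window, using the driver next to the script."""
    # Imported here so the path helper above works even where selenium is not installed
    from selenium import webdriver
    from selenium.webdriver.chrome.service import Service

    options = webdriver.ChromeOptions()
    options.add_argument("--headless=new")  # No browser window
    service = Service(executable_path=chromedriver_path(script_file))
    return webdriver.Chrome(service=service, options=options)
```

Inside the Selenium script itself you would call start_headless_chrome(__file__) and call quit() on the returned driver when done.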

To invoke the Selenium script silently from PVD's .psf script (so that no cmd window pops up for the Selenium script), be sure to use pythonw.exe instead of python.exe, for example:

Quote
FileExecute('pythonw.exe', '"' + ScriptPath + 'selenium_script-Chrome_People.py" "' + URL + '" "' + ScriptPath + BASE_DOWNLOAD_FILE_NO_BOM + '"');

The people who maintain the corresponding scripts (for now, Ivek and me) will probably make sure this call is in place, but check that it's there anyway.

You may want to test the script manually from cmd first, for example:

Quote
C:\Users\user\selenium_script-Chrome_People.py "https://www.imdb.com/name/nm0000017"
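Both calls above pass the IMDb URL as the first argument, and the FileExecute call adds the output file as a second one. A sketch of how a script could accept that, assuming the second argument is optional; the function name and the default file name "downpage.htm" are hypothetical and not taken from the attached script.

```python
import argparse


def parse_args(argv=None):
    """Parse the URL and the optional output path from the command line."""
    parser = argparse.ArgumentParser(description="Save an IMDb page as HTML.")
    parser.add_argument("url", help="IMDb URL to download")
    parser.add_argument(
        "output",
        nargs="?",                # Optional, so the manual one-argument call still works
        default="downpage.htm",   # Hypothetical default file name
        help="File to write the HTML to",
    )
    return parser.parse_args(argv)
```

With no argv given, parse_args() reads sys.argv, so the same function serves the manual test and the call from PVD.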

From this point on, everything is automated and headless.
« Last Edit: January 04, 2025, 08:41:15 am by afrocuban »

Offline afrocuban

  • Moderator
Re: Python (+Selenium) Chrome script for IMDb People script
« Reply #1 on: January 04, 2025, 08:39:56 am »
Here's an optimized Selenium script that should reduce the waiting time significantly.
« Last Edit: January 04, 2025, 12:44:26 pm by afrocuban »

Offline afrocuban

  • Moderator
Re: Python (+Selenium) Chrome script for IMDb People script
« Reply #2 on: January 05, 2025, 10:37:58 pm »
New scripts. Delete the earlier ones and put these into the Scripts folder.

Read more here:

Quote
http://www.videodb.info/forum_en/index.php/topic,4367.msg22727.html#msg22727

Online Ivek23

  • Global Moderator
Re: Python (+Selenium) Chrome script for IMDb People script
« Reply #3 on: January 06, 2025, 10:02:12 pm »
The selenium_script-People_4_pages_v3.2 script does not transfer all awards data because it does not click all of the "More" buttons, at least that was the case for me.

Here is my updated part of the code, which clicks more of the "More" buttons. It works for me, but you will have to adapt it for the Chrome version.

Quote
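# Imports this fragment relies on (base_url, tmp_dir, gecko_path and
# firefox_options are defined in the rest of the script)
import logging
import os
import threading
import time

from selenium import webdriver
from selenium.common.exceptions import TimeoutException
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.firefox.service import Service
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait

# Define URLs and save paths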
URLS_AND_PATHS = {
    f"{base_url}/awards/": os.path.join(tmp_dir, "downpage-UTF8_NO_BOM-Awards.htm"),
    f"{base_url}/bio/": os.path.join(tmp_dir, "downpage-UTF8_NO_BOM-Bio.htm"),
    f"https://www.imdb.com/search/title/?explore=genres&role={base_url.split('/')[-1]}": os.path.join(tmp_dir, "downpage-UTF8_NO_BOM-Genres.htm"),
    f"{base_url}/?showAllCredits=true": os.path.join(tmp_dir, "downpage-UTF8_NO_BOM-Credit.htm")
}

# Improved function to click all "More" buttons with scrolling
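def click_all_more_buttons(driver, max_rounds=20):
    """
    Scrolls down the page and clicks all the "More" buttons that are visible.
    Stops after max_rounds passes so buttons that never disappear cannot loop forever.
    """
    body = driver.find_element(By.TAG_NAME, 'body')
    for _ in range(max_rounds):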
        try:
            # Find visible "More" buttons
            more_buttons = driver.find_elements(By.XPATH, '//span[contains(@class, "ipc-see-more__text")]/..')

            # If no buttons are found, break the loop
            if not more_buttons:
                logging.info("No more 'More' buttons found.")
                break

            # Iterate through and click all visible "More" buttons
            for button in more_buttons:
                try:
                    # Scroll into view before clicking
                    driver.execute_script("arguments[0].scrollIntoView({block: 'center'});", button)
                    time.sleep(0.5)  # Allow page to stabilize
                    button.click()
                    logging.info("Clicked a 'More' button.")
                    time.sleep(1)  # Allow time for new content to load
                except Exception as e:
                    logging.warning(f"Error clicking a 'More' button: {e}")
                    continue

            # Scroll the page down to load more buttons
            body.send_keys(Keys.PAGE_DOWN)
            time.sleep(1)  # Wait for page to load more buttons
        except Exception as e:
            logging.info("No additional 'More' buttons to click.")
            break

# Function to download a page
def download_page(imdb_url, output_path, retries=3):
    for attempt in range(retries):
        driver = None  # Stays None if the driver fails to start, so cleanup can skip it
        try:
            # Initialize FirefoxDriver
            service = Service(gecko_path)
            driver = webdriver.Firefox(service=service, options=firefox_options)
            logging.info(f"Started FirefoxDriver for: {imdb_url}")

            driver.get(imdb_url)
            logging.info(f"Loaded URL: {imdb_url}")

            # Handle "Select Your Preferences" popup
            try:
                popup = WebDriverWait(driver, 5).until(
                    EC.presence_of_element_located((By.XPATH, "//div[contains(@class, 'sc-kDvujY')]"))
                )
                accept_button = WebDriverWait(driver, 5).until(
                    EC.element_to_be_clickable((By.XPATH, "//button[@data-testid='accept-button']"))
                )
                accept_button.click()
                logging.info("Accepted preferences popup.")
            except TimeoutException:
                logging.info("No preferences popup appeared.")

            # Click all "More" buttons on the page
            click_all_more_buttons(driver)

            # Save the HTML after clicking all "More" buttons
            html_source = driver.page_source
            with open(output_path, 'w', encoding='utf-8') as file:
                file.write(html_source)
            logging.info(f"Saved HTML to: {output_path}")
            break
        except Exception as e:
            logging.error(f"Error in attempt {attempt + 1}: {e}")
        finally:
            # Always close the browser, but only if it was actually started
            if driver is not None:
                driver.quit()

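# Download pages in parallel
threads = []
for url, path in URLS_AND_PATHS.items():
    thread = threading.Thread(target=download_page, args=(url, path))
    threads.append(thread)
    thread.start()

# Wait for all downloads to finish before the script continues
for thread in threads:
    thread.join()
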
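
As an alternative to managing the threads by hand, the parallel download step could also be written with the standard library's ThreadPoolExecutor. This is a sketch of a swapped-in technique, not part of the script above: download_all is a hypothetical helper, and it expects a download function with the same (url, path) arguments as download_page.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def download_all(urls_and_paths, download_page, max_workers=4):
    """Run download_page(url, path) for every entry; return {url: None or the exception}."""
    results = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Map each submitted job back to its URL so failures can be reported per page
        futures = {
            pool.submit(download_page, url, path): url
            for url, path in urls_and_paths.items()
        }
        for future in as_completed(futures):
            url = futures[future]
            try:
                future.result()
                results[url] = None   # Finished without raising
            except Exception as exc:
                results[url] = exc    # Keep the error instead of losing it in the thread
    return results
```

Note that the download_page above catches its own exceptions and only logs them, so with it every entry would come back as None; the per-URL errors only show up if the download function lets them propagate.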
Ivek23
Win 10 64bit (32bit)   PVD v0.9.9.21, PVD v1.0.2.7, PVD v1.0.2.7 + MOD