Topic: Python (+Selenium) Chrome script for IMDb People script

Offline afrocuban

  • Moderator
  • *****
  • Posts: 572
    • View Profile
Python (+Selenium) Chrome script for IMDb People script
« on: January 04, 2025, 04:48:45 am »
This is a Selenium script specific to the People script here:


Script description: This script automates downloading various IMDb pages using Selenium and ChromeDriver. It handles cookies and popups and saves the resulting HTML to local files.
It detects your locale automatically: it calls the http://ipinfo.io API to get your country code, then uses a dictionary to map that country code to a language. If you don't want this, comment out the first part of the script and uncomment the alternative at the end of the script. Open the script in a text editor and read the comments there for details.
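The lookup described above could be sketched roughly like this. It is a minimal sketch, not the code from the script: the function names, the fallback value and the small COUNTRY_TO_LANGUAGE table are assumptions made up for illustration (the real script has its own dictionary), and it assumes the ipinfo.io JSON response carries a two-letter "country" field.

```python
# Hypothetical mapping; the real script ships its own dictionary.
COUNTRY_TO_LANGUAGE = {
    "US": "en-US",
    "GB": "en-GB",
    "DE": "de-DE",
    "SI": "sl-SI",
    "RS": "sr-RS",
}

DEFAULT_LANGUAGE = "en-US"  # assumed fallback when the country is unknown


def language_for_country(country_code):
    """Map a two-letter country code to a language tag, with a fallback."""
    if not country_code:
        return DEFAULT_LANGUAGE
    return COUNTRY_TO_LANGUAGE.get(country_code.strip().upper(), DEFAULT_LANGUAGE)


def detect_country_code(timeout=5):
    """Ask ipinfo.io for the caller's country code; return None on any failure."""
    import requests  # imported lazily so the mapping works without network access

    try:
        response = requests.get("https://ipinfo.io/json", timeout=timeout)
        response.raise_for_status()
        return response.json().get("country")
    except (requests.RequestException, ValueError):
        return None
```

Calling language_for_country(detect_country_code()) would then give the language to request from IMDb, falling back to the default when the lookup fails or the country is not in the table.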
For the script to work, ensure that:


Quote
A. Python is installed
B. selenium and requests are installed with


Quote
pip install selenium requests


C. The Chrome binary is on your PATH
D. The Python folder is on your PATH
E. pythonw.exe is not missing, or its containing folder is on the PATH


This script:


Quote
1. Uses the Chrome browser instead of Firefox
2. Uses chromedriver.exe instead of geckodriver
3. Starts chromedriver.exe silently
4. Silently invokes the browser in headless mode (no browser windows pop up)
5. Scrapes the .htm page of a given URL
6. Needs no path to be set manually inside the script; paths are resolved relative to the location of the Selenium script!

For the relative path to work, ensure that:

Quote
6A. You put this script into the "Scripts" folder of your PVD instance.
6B. You put the appropriate chromedriver.exe into the "Scripts" folder, too.
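Points 4 and 6 could look roughly like the sketch below. It assumes Selenium 4 and a Chrome version that accepts "--headless=new". The helper names are hypothetical and are not taken from the actual script, which may do this differently.

```python
from pathlib import Path


def sibling_path(script_file, name):
    """Return the absolute path of `name` located next to `script_file`."""
    return Path(script_file).resolve().parent / name


def headless_chrome_arguments():
    """Command-line switches for a windowless Chrome session."""
    return ["--headless=new", "--disable-gpu", "--window-size=1920,1080"]


def start_headless_chrome(script_file):
    """Start Chrome headlessly with the chromedriver.exe next to the script."""
    # Selenium is imported here so the path helpers work without it installed.
    from selenium import webdriver
    from selenium.webdriver.chrome.options import Options
    from selenium.webdriver.chrome.service import Service

    options = Options()
    for argument in headless_chrome_arguments():
        options.add_argument(argument)

    driver_path = sibling_path(script_file, "chromedriver.exe")
    service = Service(executable_path=str(driver_path))
    return webdriver.Chrome(service=service, options=options)
```

Inside the script itself the call would be start_headless_chrome(__file__), so that moving the whole "Scripts" folder does not require editing any path.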

To invoke the Selenium script itself silently from PVD's .psf script (so that no cmd window pops up), be sure to use pythonw.exe instead of python.exe, for example:

Quote
FileExecute('pythonw.exe', '"' + ScriptPath + 'selenium_script-Chrome_People.py" "' + URL + '" "' + ScriptPath + BASE_DOWNLOAD_FILE_NO_BOM + '"');

This last step will probably be taken care of by whoever maintains the corresponding scripts (for now, Ivek and me), but check that the call is there anyway.

You may first want to test the script manually from cmd, for example:

Quote
C:\Users\user\selenium_script-Chrome_People.py "https://www.imdb.com/name/nm0000017"
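For reference, a script called in both ways shown above (URL plus output file from PVD, URL only from cmd) might read its arguments along these lines. This is only a sketch under assumptions: parse_arguments and DEFAULT_OUTPUT_NAME are hypothetical names, and the actual script may handle its arguments differently.

```python
from pathlib import Path

DEFAULT_OUTPUT_NAME = "downpage-UTF8_NO_BOM.htm"  # hypothetical default file name


def parse_arguments(argv, script_file):
    """Return (url, output_path) from a command line like the ones above."""
    if len(argv) < 2:
        raise SystemExit("usage: script.py <imdb_url> [output_file]")
    # Drop a trailing slash so the last path segment is always the nm id.
    url = argv[1].rstrip("/")
    if len(argv) >= 3:
        output_path = Path(argv[2])
    else:
        # No output file given: save next to the script itself.
        output_path = Path(script_file).resolve().parent / DEFAULT_OUTPUT_NAME
    return url, output_path
```

In the script this would be called as parse_arguments(sys.argv, __file__) after importing sys.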

From this point on, everything is automated and headless.
« Last Edit: January 04, 2025, 08:41:15 am by afrocuban »

Offline afrocuban

Re: Python (+Selenium) Chrome script for IMDb People script
« Reply #1 on: January 04, 2025, 08:39:56 am »
Here's an optimized Selenium script that should reduce the wait time significantly.
« Last Edit: January 04, 2025, 12:44:26 pm by afrocuban »

Offline afrocuban

Re: Python (+Selenium) Chrome script for IMDb People script
« Reply #2 on: January 05, 2025, 10:37:58 pm »
New scripts. Delete the earlier ones and put these into the Scripts folder.

Read more here:

Quote
http://www.videodb.info/forum_en/index.php/topic,4367.msg22727.html#msg22727

Offline Ivek23

  • Global Moderator
  • *****
  • Posts: 2768
    • View Profile
Re: Python (+Selenium) Chrome script for IMDb People script
« Reply #3 on: January 06, 2025, 10:02:12 pm »
The selenium_script-People_4_pages_v3.2 script does not transfer all awards data because it does not open all the "More" buttons; at least that was the case for me.

Here is my updated part of the code, which opens more of the "More" buttons. It works, so you will have to adapt it for the Chrome version.

Quote
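# Excerpt from the Firefox script. base_url, tmp_dir, gecko_path and
# firefox_options are assumed to be defined earlier in the full script.
import logging
import os
import threading
import time

from selenium import webdriver
from selenium.common.exceptions import TimeoutException
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.firefox.service import Service
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait

# Define URLs and save paths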
URLS_AND_PATHS = {
    f"{base_url}/awards/": os.path.join(tmp_dir, "downpage-UTF8_NO_BOM-Awards.htm"),
    f"{base_url}/bio/": os.path.join(tmp_dir, "downpage-UTF8_NO_BOM-Bio.htm"),
    f"https://www.imdb.com/search/title/?explore=genres&role={base_url.split('/')[-1]}": os.path.join(tmp_dir, "downpage-UTF8_NO_BOM-Genres.htm"),
    f"{base_url}/?showAllCredits=true": os.path.join(tmp_dir, "downpage-UTF8_NO_BOM-Credit.htm")
}

# Improved function to click all "More" buttons with scrolling
def click_all_more_buttons(driver):
    """
    Scrolls down the page and clicks all the "More" buttons that are visible.
    """
    body = driver.find_element(By.TAG_NAME, 'body')
    while True:
        try:
            # Find visible "More" buttons
            more_buttons = driver.find_elements(By.XPATH, '//span[contains(@class, "ipc-see-more__text")]/..')

            # If no buttons are found, break the loop
            if not more_buttons:
                logging.info("No more 'More' buttons found.")
                break

            # Iterate through and click all visible "More" buttons
            for button in more_buttons:
                try:
                    # Scroll into view before clicking
                    driver.execute_script("arguments[0].scrollIntoView({block: 'center'});", button)
                    time.sleep(0.5)  # Allow page to stabilize
                    button.click()
                    logging.info("Clicked a 'More' button.")
                    time.sleep(1)  # Allow time for new content to load
                except Exception as e:
                    logging.warning(f"Error clicking a 'More' button: {e}")
                    continue

            # Scroll the page down to load more buttons
            body.send_keys(Keys.PAGE_DOWN)
            time.sleep(1)  # Wait for page to load more buttons
        except Exception as e:
            logging.info("No additional 'More' buttons to click.")
            break

# Function to download a page
def download_page(imdb_url, output_path, retries=3):
    for attempt in range(retries):
        driver = None  # keeps the finally block safe if the driver fails to start
        try:
            # Initialize FirefoxDriver
            service = Service(gecko_path)
            driver = webdriver.Firefox(service=service, options=firefox_options)
            logging.info(f"Started FirefoxDriver for: {imdb_url}")

            driver.get(imdb_url)
            logging.info(f"Loaded URL: {imdb_url}")

            # Handle "Select Your Preferences" popup
            try:
                popup = WebDriverWait(driver, 5).until(
                    EC.presence_of_element_located((By.XPATH, "//div[contains(@class, 'sc-kDvujY')]"))
                )
                accept_button = WebDriverWait(driver, 5).until(
                    EC.element_to_be_clickable((By.XPATH, "//button[@data-testid='accept-button']"))
                )
                accept_button.click()
                logging.info("Accepted preferences popup.")
            except TimeoutException:
                logging.info("No preferences popup appeared.")

            # Click all "More" buttons on the page
            click_all_more_buttons(driver)

            # Save the HTML after clicking all "More" buttons
            html_source = driver.page_source
            with open(output_path, 'w', encoding='utf-8') as file:
                file.write(html_source)
            logging.info(f"Saved HTML to: {output_path}")
            break
        except Exception as e:
            logging.error(f"Error in attempt {attempt + 1}: {e}")
        finally:
            # Always close the browser, whether the attempt succeeded or not
            if driver is not None:
                driver.quit()

# Download pages in parallel
threads = []
for url, path in URLS_AND_PATHS.items():
    thread = threading.Thread(target=download_page, args=(url, path))
    threads.append(thread)
    thread.start()

# Wait for all downloads to finish
for thread in threads:
    thread.join()
Ivek23
Win 10 64bit (32bit)   PVD v0.9.9.21, PVD v1.0.2.7, PVD v1.0.2.7 + MOD
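One thing to watch in a "click until nothing is left" loop like the one above: if a button stays in the page after being clicked, or can never be clicked, the loop may not end. A minimal sketch of a bounded variant follows. The name click_until_done and its parameters are hypothetical, and the browser calls are passed in as plain functions so the logic can be tried without Selenium.

```python
def click_until_done(find_buttons, click, max_rounds=25):
    """Click every button returned by find_buttons() until none are left.

    Stops early when a full round makes no progress, so a button that never
    becomes clickable cannot keep the loop running forever. The max_rounds
    cap covers buttons that stay on the page after a successful click.
    Returns the number of successful clicks.
    """
    clicked = 0
    for _ in range(max_rounds):
        buttons = find_buttons()
        if not buttons:
            break
        progressed = False
        for button in buttons:
            try:
                click(button)
            except Exception:
                continue  # stale or covered element; try again next round
            clicked += 1
            progressed = True
        if not progressed:
            break
    return clicked
```

With Selenium, find_buttons could be lambda: driver.find_elements(By.XPATH, '//span[contains(@class, "ipc-see-more__text")]/..') and click could be lambda b: b.click(), with the scrolling and sleeps from the code above placed inside click.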


Offline afrocuban

Re: Python (+Selenium) Chrome script for IMDb People script
« Reply #4 on: January 07, 2025, 02:58:37 am »
Strange.


I deleted the record for Alfonso Cuaron (https://www.imdb.com/name/nm0190859/), who has 262 wins and 207 nominations, imported it from scratch, and counted manually up to 262+207=469; they are all there.

Can you provide me the link you were stuck with, so I can try to reproduce it?

Offline Ivek23

Re: Python (+Selenium) Chrome script for IMDb People script
« Reply #5 on: January 07, 2025, 05:50:03 am »
Quote
Strange.

I deleted the record for Alfonso Cuaron (https://www.imdb.com/name/nm0190859/), who has 262 wins and 207 nominations, imported it from scratch, and counted manually up to 262+207=469; they are all there.

Can you provide me the link you were stuck with, so I can try to reproduce it?

Andrea Barber
https://www.imdb.com/name/nm0053347/

Aaron Spelling
https://www.imdb.com/name/nm0005455/


Offline afrocuban

Re: Python (+Selenium) Chrome script for IMDb People script
« Reply #6 on: January 07, 2025, 09:10:01 am »
I can't seem to reproduce it.

Offline Ivek23

Re: Python (+Selenium) Chrome script for IMDb People script
« Reply #7 on: January 07, 2025, 03:25:54 pm »
Quote
I can't seem to reproduce it.

As I can see from the attached pictures, the problem was on my side. What about downloading the filmography: does it download on the first attempt for a specific person, or do I have to run the script again for the same person? In most cases the filmography download succeeds for me only on the second attempt, which means that my modified Firefox selenium_script-People_4_pages_v3.2 is not fully compatible with the script. My Selenium script will need to be fixed.


Offline afrocuban

Re: Python (+Selenium) Chrome script for IMDb People script
« Reply #8 on: January 08, 2025, 07:20:48 am »


Quote
What about downloading the filmography, does it download it on the first attempt for a specific person or do I have to run the script again for the same person to download the filmography.

For me everything works without a problem.


Ask Copilot to revise my Chrome script for FF and geckodriver. Compare that revision with the FF version you have, using Notepad++ with the ComparePlus plugin. If they differ, try Copilot's version. If you still have issues, then at the moment I can only think that your computer is a bit outdated and can't handle page loading, scrolling and clicking that fast. Increase the wait times in the script for page loading and scrolling. If you come back with testing details after trying these and it still fails, maybe something else will come to mind.
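One way to experiment with longer waits without editing every sleep by hand is to retry the download with a growing wait time and stop as soon as the page looks complete. The sketch below is only an illustration of that idea. fetch_with_growing_wait, fetch and is_complete are hypothetical names: fetch(wait) would be whatever downloads the page using that wait time, and is_complete(html) would check for something that only appears once the filmography has loaded.

```python
import time


def fetch_with_growing_wait(fetch, is_complete, waits=(2, 5, 10), sleep=time.sleep):
    """Call fetch(wait) with increasing wait times until is_complete(html) is true.

    Returns (html, attempts_used); html is the last result even if incomplete.
    """
    html = ""
    for attempt, wait in enumerate(waits, start=1):
        html = fetch(wait)
        if is_complete(html):
            return html, attempt
        sleep(1)  # brief pause before retrying with a longer wait
    return html, len(waits)
```

If the first attempt with a short wait keeps failing and a later one succeeds, that would point to timing rather than to the script itself.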
« Last Edit: January 08, 2025, 01:50:46 pm by afrocuban »

Offline Ivek23

Re: Python (+Selenium) Chrome script for IMDb People script
« Reply #9 on: January 08, 2025, 05:23:26 pm »
selenium_script-Chrome_People_Base_page_v3.2 will need a function that opens the Expand buttons for alternative names, because at the moment it passes only one alternative name even when there are several.

Here is a good example, Ann Gillespie's alternative names:

https://www.imdb.com/name/nm0318905/


 
