Topic: Integrating Selenium to PVD

afrocuban · « **on:** December 15, 2024, 03:09:47 am »

Sorry for joining the conversation uninvited and without knowing programming. But I learned a bit along the way while trying to download the FA trailers page locally. I managed to download it as downpage_trailer-UTF8_NO_BOM_FA.htm alongside downpage-UTF8_NO_BOM.htm in order to parse them both, but there is no good news: I can parse both files, but the dynamic content isn't downloaded.

I don't think we can do it with PVdBDownPage.exe:

Quote

// Download the URL page.
LogMessage(Chr(9) + Chr(9) + ' Download with PVdBDownPage in file:|' + ScriptPath + OutFile + ' the information of:|' + URL + '||');
FileExecute(ScriptPath + 'PVdBDownPage.exe', '"' + URL + '" "' + ScriptPath + OutFile + '"');

because it cannot download dynamic content. According to the AI I asked, the missing content

Quote

is likely due to the JavaScript that loads the video elements dynamically on the original page. When you download the page using DownloadPage, the JavaScript doesn’t execute, so the dynamic content (like embedded trailers) isn’t included in the downloaded HTML file.
To work around this, we have a couple of options:

Use a Web Browser Automation Tool: Tools like Selenium can control a web browser to visit the page and let all JavaScript execute before saving the fully rendered page’s HTML. This way, the dynamic content will be included.

Find the Video URL Manually: Check the original HTML of the page in your browser (right-click and “View Page Source” or inspect elements with Developer Tools). Look for the actual URLs of the video files. You can then use those URLs directly in your script.

Using a web browser automation tool like Selenium is a bit more involved and requires additional setup, but it is a more robust solution for handling dynamic content. If you prefer to go this route, I can guide you through the setup process.

Alternatively, manually inspecting the page’s source for video URLs is quicker but less convenient for repeated tasks.
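
A quick way to confirm that a downloaded file really lacks the dynamic content is to look for the element that should hold the trailers. The sketch below is only an illustration, assuming the trailers sit in elements with the class divvid (the class name used in the Selenium wait condition in this thread). The helper name has_element_with_class and both sample HTML strings are made up for the example.

```python
from html.parser import HTMLParser


class ClassFinder(HTMLParser):
    """Records whether any start tag carries the wanted CSS class."""

    def __init__(self, wanted_class):
        super().__init__()
        self.wanted_class = wanted_class
        self.found = False

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; value may be None
        for name, value in attrs:
            if name == "class" and value and self.wanted_class in value.split():
                self.found = True


def has_element_with_class(html_text, wanted_class):
    """Return True if the HTML text contains an element with the given class."""
    finder = ClassFinder(wanted_class)
    finder.feed(html_text)
    return finder.found


# Hypothetical samples: a static download versus a fully rendered page
static_html = '<html><body><script>loadTrailers()</script></body></html>'
rendered_html = '<html><body><div class="divvid main">trailer</div></body></html>'

print(has_element_with_class(static_html, "divvid"))
print(has_element_with_class(rendered_html, "divvid"))
```

If the check fails on the file saved by PVdBDownPage.exe but the element is visible in the browser's Developer Tools, the content is most likely added by JavaScript after the page loads.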

Quote

Here's the rest of the conversation:

Invoking Selenium directly from within a PVD script (written in Pascal) is not straightforward because Selenium typically operates via Python, Java, or another high-level language. However, you can use a workaround by invoking an external Python script from your PVD Pascal script.

Here’s how you can integrate Selenium into your workflow:

Create a Python Script Using Selenium:

This script will navigate to the URL, wait for the content to load, and save the fully rendered HTML.

python

Quote

from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# Set up the WebDriver (recent Selenium 4 releases take the driver path
# through a Service object and no longer accept executable_path)
driver = webdriver.Chrome(service=Service('path_to_chromedriver'))

try:
    # Navigate to the URL
    driver.get("https://www.filmaffinity.com/en/evideos.php?movie_id=xxxx")

    # Wait for the trailers to load (adjust the waiting condition as needed)
    WebDriverWait(driver, 10).until(
        EC.presence_of_element_located((By.CLASS_NAME, "divvid"))
    )

    # Get the fully rendered HTML
    html = driver.page_source

    # Save the HTML to a file
    with open('downpage_trailer-UTF8_NO_BOM_FA.htm', 'w', encoding='utf-8') as file:
        file.write(html)
finally:
    # Close the browser, even if the wait timed out
    driver.quit()

Invoke the Python Script from Your PVD Script:

Use the FileExecute method to call the Python script from within your Pascal script.

pascal

Quote

// Log the action
LogMessage(Chr(9) + Chr(9) + ' Download with Selenium in file:|' + ScriptPath + OutFile + ' the information of:|' + URL + '||');

// Construct the command to run the Python script
FileExecute('python.exe', '"' + ScriptPath + 'selenium_script.py"');

In this setup:

The Python script handles the browsing and downloading of the fully rendered HTML using Selenium.

Your PVD script uses the FileExecute method to call the Python script, passing any necessary parameters if needed.

After the Python script finishes and the HTML is saved, your PVD script can continue processing the saved HTML as usual.
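
If the PVD script decides that the download has finished simply by checking that the output file exists, it could in principle pick up a half-written file. One possible precaution, sketched below, is to write the HTML to a temporary file in the same folder and rename it to the final name only when it is complete. The function name save_html_atomically and the demo file location are assumptions made for this example, not part of any script in this thread.

```python
import os
import tempfile


def save_html_atomically(html, output_file):
    """Write html to a temporary file, then rename it onto output_file."""
    directory = os.path.dirname(os.path.abspath(output_file))
    os.makedirs(directory, exist_ok=True)
    # The temporary file lives in the same folder so the rename stays on one drive
    fd, tmp_path = tempfile.mkstemp(suffix=".tmp", dir=directory)
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as tmp_file:
            tmp_file.write(html)
        # os.replace overwrites an existing target in a single step
        os.replace(tmp_path, output_file)
    except BaseException:
        # Remove the temporary file if anything went wrong
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise


# Demo in a throwaway folder (hypothetical location)
demo_dir = tempfile.mkdtemp()
demo_file = os.path.join(demo_dir, "downpage-UTF8_NO_BOM.htm")
save_html_atomically("<html><body>rendered page</body></html>", demo_file)
with open(demo_file, encoding="utf-8") as f:
    print(f.read())
```

In the Selenium script, the call would replace the plain open(...) and file.write(html) lines.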

Example Complete Workflow
1. Python Script (selenium_script.py)

python

Quote

from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
import sys

# Get the URL and output file from the command line arguments
url = sys.argv[1]
output_file = sys.argv[2]

# Set up the WebDriver (recent Selenium 4 releases take the driver path
# through a Service object and no longer accept executable_path)
driver = webdriver.Chrome(service=Service('path_to_chromedriver'))

try:
    # Navigate to the URL
    driver.get(url)

    # Wait for the trailers to load (adjust the waiting condition as needed)
    WebDriverWait(driver, 10).until(
        EC.presence_of_element_located((By.CLASS_NAME, "divvid"))
    )

    # Get the fully rendered HTML
    html = driver.page_source

    # Save the HTML to a file
    with open(output_file, 'w', encoding='utf-8') as file:
        file.write(html)
finally:
    # Close the browser, even if the wait timed out
    driver.quit()

2. Pascal Script (Invoke Selenium Script)

pascal

Quote

// Log the action
LogMessage(Chr(9) + Chr(9) + ' Download with Selenium in file:|' + ScriptPath + OutFile + ' the information of:|' + URL + '||');

// Construct the command to run the Python script
FileExecute('python.exe', '"' + ScriptPath + 'selenium_script.py" "' + URL + '" "' + ScriptPath + OutFile + '"');

By following this workflow, you can effectively integrate Selenium into your PVD script to handle pages with dynamic content. This way, you can ensure that all necessary content, like embedded trailers, is included in the HTML that your PVD script processes.

How would selenium know which url to open?
Great question. You can pass the URL to Selenium through command line arguments when you invoke the Python script. This way, your PVD script can dynamically supply the URL to Selenium. Here's how you can do it:

Step-by-Step Implementation
Modify Your Python Script to Accept Command Line Arguments:

You’ll adjust the Python script to accept the URL as an argument. This ensures that Selenium knows which URL to open.

Example Python Script (selenium_script.py):

python

Quote

from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
import sys

# Get the URL and output file from the command line arguments
url = sys.argv[1]
output_file = sys.argv[2]

# Set up the WebDriver (recent Selenium 4 releases take the driver path
# through a Service object and no longer accept executable_path)
driver = webdriver.Chrome(service=Service('path_to_chromedriver'))

try:
    # Navigate to the URL
    driver.get(url)

    # Wait for the trailers to load (adjust the waiting condition as needed)
    WebDriverWait(driver, 10).until(
        EC.presence_of_element_located((By.CLASS_NAME, "divvid"))
    )

    # Get the fully rendered HTML
    html = driver.page_source

    # Save the HTML to a file
    with open(output_file, 'w', encoding='utf-8') as file:
        file.write(html)
finally:
    # Close the browser, even if the wait timed out
    driver.quit()

Invoke the Python Script from Your PVD Script:

Use the FileExecute method to call the Python script, passing the URL and the output file path as arguments.

Example Pascal Script:

pascal

Quote

// Log the action
LogMessage(Chr(9) + Chr(9) + ' Download with Selenium in file:|' + ScriptPath + OutFile + ' the information of:|' + URL + '||');

// Construct the command to run the Python script
FileExecute('python.exe', '"' + ScriptPath + 'selenium_script.py" "' + URL + '" "' + ScriptPath + OutFile + '"');

How It Works
Python Script: The Python script (selenium_script.py) receives the URL and the output file path as command line arguments. It then uses Selenium to open the URL, waits for the trailers to load, and saves the fully rendered HTML to the specified output file.

Pascal Script: Your PVD script constructs the command to run the Python script with the necessary arguments (the URL and the output file path) and then executes it. This ensures that the correct URL is passed to Selenium dynamically.

This approach allows you to integrate Selenium seamlessly into your PVD script, enabling it to handle pages with dynamic content effectively. If you need further assistance with any part of the setup, feel free to ask!
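
Reading sys.argv directly fails with an unhelpful IndexError when an argument is missing. As a hedged alternative, the sketch below uses argparse from the standard library and rejects anything that is not an http(s) URL. The function name parse_arguments is an assumption for this example; the sample arguments mirror the kind of values the PVD script passes.

```python
import argparse
from urllib.parse import urlparse


def parse_arguments(argv):
    """Parse [url, output_file] and reject anything that is not an http(s) URL."""
    parser = argparse.ArgumentParser(description="Save a rendered page to a file.")
    parser.add_argument("url", help="page to open in the browser")
    parser.add_argument("output_file", help="where to save the rendered HTML")
    args = parser.parse_args(argv)

    parsed = urlparse(args.url)
    if parsed.scheme not in ("http", "https") or not parsed.netloc:
        # parser.error prints a usage message and exits with status 2
        parser.error(f"not a valid http(s) URL: {args.url}")
    return args


# Example with the same kind of arguments the PVD script passes
args = parse_arguments([
    "https://www.filmaffinity.com/en/evideos.php?movie_id=xxxx",
    "downpage_trailer-UTF8_NO_BOM_FA.htm",
])
print(args.url)
print(args.output_file)
```

In the real script the call would be parse_arguments(sys.argv[1:]), followed by driver.get(args.url).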

How slow is this process?
The process of using Selenium to render and download the HTML can be slower than simply downloading a static page because Selenium has to:

Open a Web Browser: Launching a browser instance (e.g., Chrome or Firefox) takes a few seconds.

Navigate to the URL: Just like a human user, Selenium will navigate to the specified URL, which involves network latency.

Wait for JavaScript Execution: Selenium waits for the page to fully load, including the execution of any JavaScript that dynamically loads content. This can vary in time depending on the complexity of the page and the speed of your internet connection.

Save the HTML: Once the page is fully loaded, Selenium will extract and save the HTML content, which is relatively quick.

Estimated Time
Browser Launch: 1-3 seconds

Page Navigation: 1-5 seconds (depends on network speed)

JavaScript Execution & Content Load: 3-10 seconds (depends on page complexity)

Saving HTML: <1 second

In total, it can take anywhere from 5 to 20 seconds or more per page, depending on various factors.

Performance Considerations
Network Speed: Faster internet connections will reduce the navigation and loading time.

Page Complexity: Pages with heavy JavaScript and multiple dynamic elements will take longer to fully load.

Browser & System Performance: Using a faster browser and a system with more resources (CPU, RAM) will speed up the process.

While this is slower compared to a direct HTML download using PVdBDownPage.exe, it ensures that all dynamic content is fully loaded and accessible. This trade-off can be worthwhile if you need to scrape content that is loaded dynamically by JavaScript.

If speed becomes a critical issue, you might look into optimizing the Selenium script further, such as by reducing the wait times where possible or running multiple instances in parallel if applicable.
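
The figures above are estimates, so it may be worth measuring the stages on your own machine before optimizing anything. The sketch below shows one way to do that with a small context manager. The names timed and timings are made up for this example, and the time.sleep calls are stand-ins for the real browser work.

```python
import time
from contextlib import contextmanager

timings = {}


@contextmanager
def timed(stage):
    """Record how long the enclosed block takes under the given stage name."""
    start = time.perf_counter()
    try:
        yield
    finally:
        timings[stage] = time.perf_counter() - start


# Stand-in work; in the real script these blocks would wrap the browser
# launch, driver.get(url), the WebDriverWait call and the file write.
with timed("browser launch"):
    time.sleep(0.05)
with timed("page navigation"):
    time.sleep(0.02)

for stage, seconds in timings.items():
    print(f"{stage}: {seconds:.2f} s")
```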

afrocuban · « **Reply #1 on:** December 15, 2024, 03:38:03 am »

ON AKAS

Yes, you’re correct. When dealing with elements that dynamically load additional content upon interaction, such as the "12 more" button on the IMDb page, Selenium is an excellent tool to use.

How Selenium Can Help

Automate Browser Actions: Selenium can open the browser and navigate to the desired URL.
Simulate User Interaction: Selenium can simulate clicking on the "12 more" button to load the additional AKAs.
Wait for Dynamic Content: Selenium can wait for the dynamically loaded content to appear before capturing the page’s HTML.
Extract Fully Rendered HTML: Once all content is loaded, Selenium can extract and save the fully rendered HTML, including the dynamically loaded AKAs.

Example Selenium Workflow

Here’s how you might modify your Selenium script to handle this case:
1. Python Script Using Selenium (selenium_imdb.py):

python

from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
import sys

# Get the URL and output file from the command line arguments
url = sys.argv[1]
output_file = sys.argv[2]

# Set up the WebDriver (recent Selenium 4 releases take the driver path
# through a Service object and no longer accept executable_path)
driver = webdriver.Chrome(service=Service('path_to_chromedriver'))

try:
    # Navigate to the URL
    driver.get(url)

    # Wait for the "12 more" button to be clickable and click it
    more_button = WebDriverWait(driver, 10).until(
        EC.element_to_be_clickable((By.XPATH, "//span[contains(text(), '12 more')]"))
    )
    more_button.click()

    # Wait for the new content to load
    WebDriverWait(driver, 10).until(
        EC.presence_of_element_located((By.CSS_SELECTOR, "div[data-testid='sub-section-akas']"))
    )

    # Get the fully rendered HTML
    html = driver.page_source

    # Save the HTML to a file
    with open(output_file, 'w', encoding='utf-8') as file:
        file.write(html)
finally:
    # Close the browser, even if a wait timed out
    driver.quit()

2. Invoke the Python Script from Your PVD Script:

pascal

// Log the action
LogMessage(Chr(9) + Chr(9) + '      Download with Selenium in file:|' + ScriptPath + OutFile + ' the information of:|' + URL + '||');

// Construct the command to run the Python script
FileExecute('python.exe', '"' + ScriptPath + 'selenium_imdb.py" "' + URL + '" "' + ScriptPath + OutFile + '"');

Summary

Automate Browser Actions: Open the browser and navigate to the URL.
Simulate User Click: Click the "12 more" button to load additional AKAs.
Wait for Content: Wait for the new content to load.
Save Fully Rendered HTML: Save the fully rendered HTML to a file.
Integrate with PVD Script: Use FileExecute in your PVD script to call the Selenium script and save the HTML.

By using Selenium, you can ensure that all dynamic content, such as the additional AKAs on the IMDb page, is loaded and saved for processing by your PVD script. This makes Selenium a powerful tool for dealing with dynamically loaded web content.
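
One caveat: the XPath in the script matches the literal text '12 more', which presumably fits only titles with exactly 12 hidden AKAs. The sketch below shows one way to recognise such labels in general. It assumes the button text always follows an "N more" pattern, which is not confirmed anywhere in this thread, and the helper names are made up for the example.

```python
import re

# Matches labels such as "12 more" or " 3 More " (assumed label pattern)
MORE_LABEL = re.compile(r"^\s*(\d+)\s+more\s*$", re.IGNORECASE)


def count_hidden_items(label):
    """Return the number in an 'N more' label, or None for any other text."""
    match = MORE_LABEL.match(label)
    return int(match.group(1)) if match else None


def pick_more_buttons(labels):
    """Return the indexes of the labels that look like an 'N more' button."""
    return [i for i, label in enumerate(labels)
            if count_hidden_items(label) is not None]


# Hypothetical texts collected from candidate elements on a page
labels = ["Release date", "12 more", "See more", " 3 More "]
print(pick_more_buttons(labels))
```

With Selenium, the labels could be the text of the candidate span elements, and the script would click only the elements at the returned indexes.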

afrocuban · « **Reply #2 on:** December 15, 2024, 03:42:13 am »

And to be sure it's not about PVD:

So, regardless of whether PVD handles HTTPS or not, no application can actually access dynamically loaded content without tools like Selenium?

That's right. Dynamically loaded content, which relies on JavaScript to fetch and display data after the initial HTML page load, generally can't be accessed using tools that only handle static content, such as curl or traditional HTTP request libraries. These tools can fetch the initial HTML, but they can't execute JavaScript to render additional content.

Why Tools Like Selenium Are Necessary

JavaScript Execution: Selenium can control a web browser to execute JavaScript, allowing it to load dynamic content.
User Interaction: Selenium can simulate user actions such as clicking buttons, filling out forms, and scrolling, which are often required to trigger JavaScript that loads additional content.
Full Page Rendering: Selenium waits until the page is fully rendered, ensuring all dynamic elements are loaded before extracting the HTML.

Use Cases for Selenium

Expanding Sections: Like the "12 more" button on IMDb, where content only appears after clicking.
Infinite Scrolling: Pages that load more content as you scroll down.
Interactive Elements: Forms, drop-downs, and other UI elements that trigger content changes.

Without tools like Selenium, accessing this kind of dynamically loaded content programmatically is very difficult, as these elements require interaction and JavaScript execution that static tools can't handle.

Ivek23 · « **Reply #3 on:** December 15, 2024, 12:29:31 pm »

In the Python attachment, there are fetch_imdbs_titles.py and imdb_aka_fetcher.py scripts to help and see the specific paths of geckodriver for firefox when writing Python scripts.

I would like to ask you to make a Python Script (selenium_script.py) and Pascal Script (Invoke Selenium Script) path to IMDB_[EN][HTTPS]_TEST_2c 2c script for me and add all this in the attachment here.

Where do I need to add geckodriver for firefox.

afrocuban · « **Reply #4 on:** December 15, 2024, 09:26:45 pm »

Quote from: Ivek23 on December 15, 2024, 12:29:31 pm

In the Python attachment, there are fetch_imdbs_titles.py and imdb_aka_fetcher.py scripts to help and see the specific paths of geckodriver for firefox when writing Python scripts.

I would like to ask you to make a Python Script (selenium_script.py) and Pascal Script (Invoke Selenium Script) path to IMDB_[EN][HTTPS]_TEST_2c 2c script for me and add all this in the attachment here.

Where do I need to add geckodriver for firefox.

It isn't clear who you are asking to do this, but in case you are asking me, I am still at the very beginning of even comprehending the concept, let alone coding it. Interacting with AI can be, and is, extremely frustrating, and whatever I tried, I needed to try it live; otherwise I had to start over each time. That means asking me to provide it for you isn't a productive way, unless you too want to go crazy like I did while upgrading the FA script, hahahah. To get to the point of being able to parse the FA trailers page (meaning being able to download and parse the dynamic HTML content on FA), I think I'll need a month at least, but I'm not surrendering.

Meanwhile, I started to fix and upgrade the IMDb people script. I already fixed the "bio" field, but I need and want to tweak, update and upgrade it further before posting it in a meaningful form.

Ivek23 · « **Reply #5 on:** December 16, 2024, 08:23:56 am »

Quote from: afrocuban on December 15, 2024, 09:26:45 pm

Quote from: Ivek23 on December 15, 2024, 12:29:31 pm
In the Python attachment, there are fetch_imdbs_titles.py and imdb_aka_fetcher.py scripts to help and see the specific paths of geckodriver for firefox when writing Python scripts.

I would like to ask you to make a Python Script (selenium_script.py) and Pascal Script (Invoke Selenium Script) path to IMDB_[EN][HTTPS]_TEST_2c 2c script for me and add all this in the attachment here.

Where do I need to add geckodriver for firefox.

It isn't clear who you are asking to do this, but in case you are asking me, I am still at the very beginning of even comprehending the concept, let alone coding it. Interacting with AI can be, and is, extremely frustrating, and whatever I tried, I needed to try it live; otherwise I had to start over each time. That means asking me to provide it for you isn't a productive way, unless you too want to go crazy like I did while upgrading the FA script, hahahah. To get to the point of being able to parse the FA trailers page (meaning being able to download and parse the dynamic HTML content on FA), I think I'll need a month at least, but I'm not surrendering.

Meanwhile, I started to fix and upgrade the IMDb people script. I already fixed the "bio" field, but I need and want to tweak, update and upgrade it further before posting it in a meaningful form.

Ok, I just asked if there is such a possibility. I would like to ask that the Python Script (selenium_script.py) be published, which would be good for other users as well, maybe I could find someone else who could help with this.

afrocuban · « **Reply #6 on:** December 17, 2024, 04:38:10 am »

I installed: selenium, beautifulsoup4, node.js and puppeteer. None of them downloaded the dynamic content locally.

We have no choice for now, it looks....

I tried it for People credits only...

Ivek23 · « **Reply #7 on:** December 17, 2024, 07:08:14 pm »

Quote from: afrocuban on December 17, 2024, 04:38:10 am

I installed: selenium, beautifulsoup4, node.js and puppeteer. None of them downloaded the dynamic content locally.

We have no choice for now, it looks....

I tried it for People credits only...

Here is the url for Aaron Spelling credits

https://www.imdb.com/name/nm0005455/?showAllCredits=true

I already found a solution for AKA titles: how to download them all. They work in test form, but some details are still missing and need to be tested, as does selenium_script.py.

In the IMDB_[EN][HTTPS]_TEST_2c 2c script I had to change some parts of the code so that Function ParsePage_IMDBMovieAKA is now the only one that is used.

afrocuban · « **Reply #8 on:** December 17, 2024, 11:37:40 pm »

Quote from: Ivek23 on December 17, 2024, 07:08:14 pm

Quote from: afrocuban on December 17, 2024, 04:38:10 am
I installed: selenium, beautifulsoup4, node.js and puppeteer. None of them downloaded the dynamic content locally.

We have no choice for now, it looks....

I tried it for People credits only...

Here is the url for Aaron Spelling credits

https://www.imdb.com/name/nm0005455/?showAllCredits=true

I already found a solution for AKA titles: how to download them all. They work in test form, but some details are still missing and need to be tested, as does selenium_script.py.

In the IMDB_[EN][HTTPS]_TEST_2c 2c script I had to change some parts of the code so that Function ParsePage_IMDBMovieAKA is now the only one that is used.

Great to hear. I eventually succeeded in downloading the full content of https://www.imdb.com/name/nm0005455/?showAllCredits=true to a local file. The trick was to download and save it as MHTML. Now I'm looking at how to parse that page, and later at how to invoke the .py from within the .psf, that is, how to pass the URL to the .py...
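
For parsing, it may help that an MHTML file is a MIME multipart document, so the email package in Python's standard library can unpack it. The sketch below pulls out the first text/html part and decodes it. The function name extract_html_from_mhtml and the tiny hand-made sample are assumptions for illustration; a real file saved by a browser is much larger and may contain several HTML parts.

```python
import email
from email import policy


def extract_html_from_mhtml(data):
    """Return the decoded text of the first text/html part in MHTML bytes."""
    message = email.message_from_bytes(data, policy=policy.default)
    for part in message.walk():
        if part.get_content_type() == "text/html":
            # get_content() undoes quoted-printable/base64 and the charset
            return part.get_content()
    raise ValueError("no text/html part found")


# Tiny hand-made sample (hypothetical); a real page would be read with
# open(path, "rb").read()
SAMPLE = (
    b"MIME-Version: 1.0\r\n"
    b'Content-Type: multipart/related; type="text/html"; '
    b'boundary="----sketch-boundary"\r\n'
    b"\r\n"
    b"------sketch-boundary\r\n"
    b"Content-Type: text/html; charset=utf-8\r\n"
    b"Content-Transfer-Encoding: quoted-printable\r\n"
    b"Content-Location: https://example.invalid/page\r\n"
    b"\r\n"
    b'<html><body><div class=3D"credit">Producer</div></body></html>\r\n'
    b"------sketch-boundary--\r\n"
)

html = extract_html_from_mhtml(SAMPLE)
print('class="credit"' in html)
```

The extracted text could then be written out as UTF-8 so that the PVD script can parse it like any other downloaded page.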

Ivek23 · « **Reply #9 on:** December 18, 2024, 08:05:39 am »

Here is a script to help

python

Quote

import sys
import os
from selenium import webdriver
from selenium.webdriver.firefox.service import Service
import time

# Check that the IMDb URL was passed as a parameter
if len(sys.argv) < 2:
    print("The IMDb URL was not passed as a parameter.")
    sys.exit(1)

imdb_url = sys.argv[1]  # IMDb URL from the command line

# Path to geckodriver.exe
gecko_path = "C:/Projects/geckodriver.exe"  # Adjust the path to the location of the driver

# Get the current application path
app_path = os.path.dirname(os.path.abspath(__file__))  # Path to the current Python script

# Check whether your "PVD_0.9.9.21_MOD-Simple AllMovies" folder is on drive D: or elsewhere
# (raw string, so the backslashes are not treated as escape sequences)
pvd_path = r"D:\MyPVD\PVD_0.9.9.21_MOD-Simple AllMovies"  # Set this path once so that it does not change

# If you want a universal path, use app_path to build it
output_path = os.path.join(pvd_path, "Scripts", "Tmp", "downpage-UTF8_NO_BOM.htm")

# Check whether the folder exists; if not, create it
os.makedirs(os.path.dirname(output_path), exist_ok=True)

# Create the browser object
service = Service(gecko_path)
driver = webdriver.Firefox(service=service)

try:
    # Open the IMDb page
    driver.get(imdb_url)
    print(f"Page {imdb_url} has loaded.")

    # Wait for the page to load
    time.sleep(5)

    # Get the full HTML source of the page
    html_source = driver.page_source

    # Save the HTML to a file
    with open(output_path, 'w', encoding='utf-8') as file:
        file.write(html_source)
    print(f"HTML saved to file: {output_path}")

finally:
    # Close the browser
    driver.quit()

I apologize for some parts of the text being in Slovenian, because I used ChatGPT - GPT Chat Free Online AI and asked it questions in my own language.

You need to change some items in the script, including the path to your PVD folder, or replace it with a universal path to the program's PVD folder.
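
One possible way to avoid the hard-coded PVD path is sketched below. It assumes the Python script is stored in PVD's Scripts folder and that the PVD script may pass the output file as a second argument, as the FileExecute call in this post does. The function name resolve_output_path and the example paths are made up for illustration.

```python
from pathlib import Path


def resolve_output_path(argv, script_file):
    """Use argv[2] if it was passed; otherwise fall back to Tmp next to the script."""
    if len(argv) > 2:
        return Path(argv[2])
    # Assumption: the script is stored in PVD's Scripts folder
    scripts_dir = Path(script_file).resolve().parent
    return scripts_dir / "Tmp" / "downpage-UTF8_NO_BOM.htm"


# Hypothetical calls; the real script would pass sys.argv and __file__
explicit = resolve_output_path(
    ["selenium_script.py", "https://www.imdb.com/", "out.htm"],
    "Scripts/selenium_script.py",
)
fallback = resolve_output_path(
    ["selenium_script.py", "https://www.imdb.com/"],
    "Scripts/selenium_script.py",
)
print(explicit)
print(fallback.name)
```

With this fallback the script keeps working when the PVD folder is moved to another drive, as long as the script stays inside Scripts.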

You also need to change certain paths in the script, such as these parts of the code

Quote

Function GetDownloadURL:AnsiString; //BlockOpen
Var
curPos:Integer;
ScriptPath,MovieID:String;
Begin
LogMessage('Test initialization of the log system.');
LogMessage('Test message: the log works.');
LogMessage('Function GetDownloadURL BEGIN======================|');
LogMessage('Global Var-Mode|'+IntToStr(Mode)+'|');
LogMessage('Global Var-DownloadURL|'+DownloadURL+'|');
//Check for the needed external files.
ScriptPath:=GetAppPath+'Scripts\';
If Not(FileExists(ScriptPath+'PVdBDownPage.exe')) Then Begin
ShowMessage ('This script needs the external file for work.'+Chr(13)+'• PVdBDownPage.exe'+Chr(13)+'Read script text for further information',SCRIPT_NAME);
Mode:=smFinished;
Result:=''; //If error returns empty string
exit;
End;
If (Mode=smSearch) Then Begin
//Get stored URL if exist.
StoredURL:=GetFieldValueXML('url');
LogMessage('Stored URL is:'+StoredURL+'||');
//Standarize the URL
StoredURL:=LowerCase(StoredURL);
StoredURL:=StringReplace(StoredURL,'https','http',True,True,False);
StoredURL:=StringReplace(StoredURL,'http://imdb.com/', 'http://www.imdb.com/', True,True,False);
StoredURL:=StringReplace(StoredURL,'http://httpbin.org/response-headers?key=','',True,False,False);
StoredURL:=StringReplace(StoredURL,' ',BASE_URL_SUF,True,True,False)+BASE_URL_SUF; //Asure that the URLs always finish BASE_URL_SUF (even in the last position
     LogMessage('* Stored URL is:'+StoredURL+'||');
//Get IMDB ID if exist.
curPos:=Pos(BASE_URL_PRE,StoredURL);
If 0<curPos Then Begin //Get IMDB_ID for search
LogMessage(' IMDB URL.');
MovieID:=TextBetWeen(StoredURL,BASE_URL_PRE,BASE_URL_SUF,false,curPos); //WEB_SPECIFIC
DownloadURL:=BASE_URL_PRE_TRUE+ MovieID +BASE_URL_SUF; //WEB_SPECIFIC
LogMessage(' Parse stored information DownloadURL:'+DownloadURL+' ||');
Mode:=smNormal; //->Go to function ParsePage for parse the film information
Result:=GetAppPath+DUMMY_HTML_FILE; //Any existing little file for cheating PVdB automatic download (little).
LogMessage('Function GetDownloadURL END====================== with Mode='+IntToStr(Mode)+' Result='+Result+'|');
exit;
End Else Begin //The movie URL not exist, search mode needed. Download the search page.
//ShowMessage('No IMDB URL.',SCRIPT_NAME);
LogMessage(' No IMDB URL.');
Mode:=smSearch; //->Go to function ParsePage to search for the URL (in this function you cannot use user functions)
DownloadURL:=''; //Has not movie URL.
Result:=GetAppPath+DUMMY_HTML_FILE; //Any existing little file for cheating PVdB automatic download (little).
LogMessage('Function GetDownloadURL END====================== with Mode='+IntToStr(Mode)+' Result='+Result+'|');
exit; //Go to the
End;
End;
//Not other modes working needs in this function.
//smNormal = 1; //This scripts download with external program (not with GetDownloadURL) so it only make one pass to ParsePage for retrieve all info, credits, poster, etc. other field modes aren't necesary
//smSearchList = 8; //Used in ParsePage for demands download the https link returned by user in the window of (AddSearchResult)
Result:=GetAppPath+DUMMY_HTML_FILE; //Any existing little file for cheating PVdB automatic download (little).
LogMessage('Function GetDownloadURL END====================== with Mode='+IntToStr(Mode)+' Result='+Result+'|');
exit;
End; //BlockClose
.
.
.
Function DownloadPage(URL:AnsiString):String; //BlockOpen
//Returns the URL page text. If error returns empty string
Var
i:Integer;
ScriptPath,WebText:String;
Begin
LogMessage(Chr(9)+Chr(9)+'Function DownloadPage BEGIN======================|');
LogMessage(Chr(9)+Chr(9)+'Global Var-DownloadURL|'+DownloadURL+' |');
LogMessage(Chr(9)+Chr(9)+' Local Var-URL|'+URL+' |');
ScriptPath:=GetAppPath+'Scripts\';
//Delete the ancient downloaded page file.
While FileExists(ScriptPath+BASE_DOWNLOAD_FILE_NO_BOM) Do Begin
LogMessage(Chr(9)+Chr(9)+'Deleting existing file: ' + ScriptPath + BASE_DOWNLOAD_FILE_NO_BOM);
       FileExecute('cmd.exe', '/C del "'+ScriptPath+BASE_DOWNLOAD_FILE_NO_BOM+'"');
LogMessage(Chr(9)+Chr(9)+' Waiting 1s for delete:'+ScriptPath+BASE_DOWNLOAD_FILE_NO_BOM);
wait (1000);
End;

// Download the URL page.
//LogMessage(Chr(9)+Chr(9)+' Download with PVdBDownPage in file:|'+ScriptPath+BASE_DOWNLOAD_FILE_NO_BOM+' the information of:|'+URL+' ||');
//FileExecute(ScriptPath+'PVdBDownPage.exe', '"'+URL+'" "'+ScriptPath+BASE_DOWNLOAD_FILE_NO_BOM+'"');

LogMessage(Chr(9) + Chr(9) + ' Download with Selenium in file:| ' + ScriptPath + BASE_DOWNLOAD_FILE_NO_BOM + ' the information of:|' + URL + '||');
   LogMessage(Chr(9)+Chr(9)+'Executing Python script to download URL content.');
FileExecute('python.exe', '"' + ScriptPath + 'selenium_script.py" "' + URL + '" "' + ScriptPath + BASE_DOWNLOAD_FILE_NO_BOM + '"');

// Wait download finish and exist the downloaded page.
i:=0; // INTERNET_TEST_ITERATIONS
While Not(FileExists(ScriptPath+BASE_DOWNLOAD_FILE_NO_BOM)) Do Begin
LogMessage(Chr(9)+Chr(9)+' Waiting 5s for exists of:'+ScriptPath+BASE_DOWNLOAD_FILE_NO_BOM);
wait (5000);
i:=i+1;
If i=INTERNET_TEST_ITERATIONS Then Begin
if 2=MessageBox('Too many faulty attempts to internet connection.'+Chr(13)+ 'Retry or Cancel?',SCRIPT_NAME,5) then begin
LogMessage(Chr(9)+Chr(9)+'Function DownloadPage END with NOT INTERNET connection ===============|');
Result:='';
Exit;
End;
i:=0;
End;
End;

LogMessage(Chr(9)+Chr(9)+' Now present complete page file: '+ScriptPath+BASE_DOWNLOAD_FILE_NO_BOM);
WebText:=FileToString(ScriptPath+BASE_DOWNLOAD_FILE_NO_BOM);
LogMessage(Chr(9)+Chr(9)+'File content length: ' + IntToStr(Length(WebText)));
LogMessage(Chr(9)+Chr(9)+'File content (first 100 chars): ' + Copy(WebText, 1, 100));
WebText:=ConvertEncoding(WebText, 65001);
Result:=WebText;

// Some download data validations.
if (Pos('404 Not Found',Result)>0) then begin
If BYPASS_SILENT Then ShowMessage('The URL is not in use (404 Not Found).'+Chr(13)+'Go to the provider web in order to find the good page',SCRIPT_NAME);
LogMessage(Chr(9)+Chr(9)+' 404 Not Found|');
Result:='';
End;

if (Pos('404 Error - IMDb',Result)>0) then begin
If BYPASS_SILENT Then ShowMessage('The URL is not in use (404 Error - IMDb).'+Chr(13)+'Go to the provider web in order to find the good page',SCRIPT_NAME);
LogMessage(Chr(9)+Chr(9)+' 404 Error - IMDb|');
Result:='';
End;

if (Pos('Page not found',Result)>0) then begin
If BYPASS_SILENT Then ShowMessage('The URL is not in use (Page not found).'+Chr(13)+'Go to the provider web in order to find the good page',SCRIPT_NAME);
LogMessage(Chr(9)+Chr(9)+' Page not found|');
Result:='';
End;

if (Pos('405 Method not allowed',Result)>0) then begin
If BYPASS_SILENT Then ShowMessage('The URL has HTTP method problems (405 Method not allowed).'+Chr(13)+'Go to the provider web in order to find the good page',SCRIPT_NAME);
LogMessage(Chr(9)+Chr(9)+' 405 Method not allowed|');
Result:='';
End;
if (Pos('Too many request',Result)>0) then begin
If BYPASS_SILENT Then ShowMessage('The provider has banned your IP (Too many request).'+Chr(13)+'Go to the provider web and resolve the captcha in order to prove you are not a robot',SCRIPT_NAME);
LogMessage(Chr(9)+Chr(9)+' Banned IP|');
Result:='';
End;

   LogMessage('Value BASE_DOWNLOAD_FILE_NO_BOM: ' + BASE_DOWNLOAD_FILE_NO_BOM);
LogMessage(Chr(9)+Chr(9)+'Function DownloadPage END======================|');
exit;
End; //BlockClose

Function DownloadImage(URL:AnsiString;OutPutFile:AnsiString):Integer; //BlockOpen
//Returns 1 or 0 if the downloaded image file exists in Exit.
//Var
//i:Integer;
//ScriptPath:String;
Begin
   (*
LogMessage(Chr(9)+Chr(9)+'Function DownloadImage BEGIN======================|');
LogMessage(Chr(9)+Chr(9)+'Global Var-DownloadURL|'+DownloadURL+' |');
LogMessage(Chr(9)+Chr(9)+' Local Var-URL|'+URL+' |');
LogMessage(Chr(9)+Chr(9)+' Local Var-OutPutFile|'+OutPutFile+'|');
ScriptPath:=GetAppPath+'Scripts\';
//Delete the ancient dowloaded page file. Needed for wait to curl download included in PowerShell command.
While FileExists(OutPutFile) Do Begin
FileExecute('cmd.exe', '/C del "'+OutPutFile+'"');
LogMessage(Chr(9)+Chr(9)+' Waiting 1s for delete:'+OutPutFile);
wait (1000);
End;
//Download the URL page.
LogMessage(Chr(9)+Chr(9)+' Download with PVdBDownPage in file:|'+OutPutFile+' the information of:|'+URL+' ||');
FileExecute(ScriptPath+'PVdBDownPage.exe', '"'+URL+'" "'+OutPutFile+'"');
//Wait download finish and exist the downloaded page.
i:=0; // INTERNET_TEST_ITERATIONS
While Not(FileExists(OutPutFile)) Do Begin
LogMessage(Chr(9)+Chr(9)+' Waiting 2s for exists of:'+OutPutFile);
wait (2000);
i:=i+1;
If i=INTERNET_TEST_ITERATIONS Then Begin //In the images download the scritp can not ask to the user for internet conexion because perhaps the file doesn't exist.
LogMessage(Chr(9)+Chr(9)+'Function DownloadImage END with NOT file downloaded ===============|');
Result:=0;
exit;
End;
End;
LogMessage(Chr(9)+Chr(9)+' Now present complete page file: '+OutPutFile);
Result:=1;
LogMessage(Chr(9)+Chr(9)+'Function DownloadImage END======================|');
exit;
*)
End; //BlockClose
.
.
.
Function ParsePage(HTML:String;URL:AnsiString):Cardinal; //BlockOpen
Var
MovieID,titleValue,yearValue:String;
ResultTmp:Cardinal;
Date:String;
Fullinfo,Movie_URL,IMDB_URL:String;
DateParts:TWideArray;
   Fullinfo1,MovieID1:String;
Begin
.
.
.
//Parse Also Known As provider page = BASE_URL_AKA-------------------------------------------------------------------
If (GET_FULL_AKA and Not(USE_SAVED_PVDCONFIG and (Copy(PVDConfigOptions,opAKA,1)='0'))) Then Begin
//If (GET_FULL_AKA and (MediaType='Movie') and Not(USE_SAVED_PVDCONFIG and (Copy(PVDConfigOptions,opAKA,1)='0'))) Then Begin
//If (GET_FULL_AKA and Not(USE_SAVED_PVDCONFIG and (Copy(PVDConfigOptions,opAKA,1)='0'))) Then Begin
DownloadURL:=StringReplace(BASE_URL_AKA,'%IMDB_ID',MovieID,True,True,False);
HTML:=DownloadPage(DownloadURL); //True page for parsing
         //HTML := DownloadPage(DownloadURL, 'Tmp\downpage-UTF8_NO_BOM_AKA.htm'); // True page for parsing
         //BASE_DOWNLOAD_FILE_NO_BOM_AKA = 'Tmp\downpage-UTF8_NO_BOM_AKA.htm';
HTML:=HTMLToText(HTML);
ResultTmp:=ParsePage_IMDBMovieAKA(HTML);
If Not(ResultTmp=prFinished) then Result:=ResultTmp;
End;

But keep only one ParsePage function, as shown in my example, and it should work.

afrocuban · « **Reply #10 on:** December 18, 2024, 09:10:09 pm »

Thanks. I'll keep this in mind.

I am now too deep in the People script to leave it at this point, and I thought the movie script would be your focus, with me helping on the surrounding scripts and selenium itself. I thought I'd need more time to get into selenium, but I was lucky to adapt quickly. At the moment I have adapted the script to download multiple pages locally, to make it easier to track what is actually scraped; I got a selenium script to download those pages, and I'm now in the middle of working out how to get dynamic content actually downloaded. I got it for the People Credits page:

Quote

from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
import time
import os

# Paths
CHROME_DRIVER_PATH = r"C:\PersonalVideoDB\Scripts\Tmp\chromedriver.exe"
CHROME_BINARY_PATH = r"C:\GoogleChromePortable64\App\Chrome-bin\chrome.exe"
SAVE_PATH = r"C:\PersonalVideoDB\Scripts\Tmp\downpage-UTF8_NO_BOM-Credit.mhtml"

# IMDb URL
IMDB_URL = "https://www.imdb.com/name/nm0000040/?showAllCredits=true"

# Chrome options
chrome_options = webdriver.ChromeOptions()
chrome_options.binary_location = CHROME_BINARY_PATH # Specify the Chrome binary location
chrome_options.add_argument("--no-sandbox")
chrome_options.add_argument("--disable-dev-shm-usage")
chrome_options.add_argument("--disable-gpu") # Disable GPU for headless mode stability
chrome_options.add_argument("--headless") # Running Chrome in headless mode
chrome_options.add_argument("--user-agent=Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36")

# Service object for ChromeDriver
service = Service(executable_path=CHROME_DRIVER_PATH)

# Initialize the WebDriver
driver = webdriver.Chrome(service=service, options=chrome_options)

# Add a custom cookie
driver.get("https://www.imdb.com") # Open the base URL to set the cookie
cookie = {'name': 'example_cookie', 'value': 'example_value', 'domain': 'imdb.com'}
driver.add_cookie(cookie)

# Navigate to the IMDb page
driver.get(IMDB_URL)

# Wait for the page to fully load and specific element to ensure all content is loaded
try:
    WebDriverWait(driver, 30).until(
        EC.presence_of_element_located((By.CSS_SELECTOR, "span.ipc-title__text"))
    )
except Exception as e:
    print(f"Error waiting for the page to load: {e}")

# Get page source
page_source = driver.page_source

# Constructing the MHTML content manually
mhtml_content = f"""MIME-Version: 1.0
Content-Type: multipart/related; boundary="----=_NextPart_000_0000_01D4E1C0.CE6AA5F0"

This document is a Single File Web Page, also known as a Web Archive file.

------=_NextPart_000_0000_01D4E1C0.CE6AA5F0
Content-Location: {IMDB_URL}
Content-Transfer-Encoding: quoted-printable
Content-Type: text/html; charset=UTF-8

{page_source}

------=_NextPart_000_0000_01D4E1C0.CE6AA5F0--
"""

# Write the MHTML data to the specified file path
with open(SAVE_PATH, "w", encoding="utf-8") as file:
    file.write(mhtml_content)

# Wait to ensure the file is saved
time.sleep(5) # Adjust the sleep time if necessary

# Confirm file creation
if os.path.exists(SAVE_PATH):
    print(f"Page saved successfully to {SAVE_PATH}")
else:
    print(f"Failed to save the page to {SAVE_PATH}")

# Close the browser
driver.quit()

and just added Awards function to script, and modified DownloadPage and ParsePage functions to split downpage-UTF8_NO_BOM.htm into downloading different file for each function: Principal, Bio, Credit, Awards and Genre.

Now, when I have all these pages, it'll be easier to track and parse, at least for me not knowing to code.
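
A side note on the hand-built MHTML wrapper in the script above: the part header declares Content-Transfer-Encoding: quoted-printable, but page_source is written as-is. Below is a minimal sketch of encoding the body so that it matches the header, using Python's standard quopri module. The helper names build_mhtml and extract_html are hypothetical, and this is untested against the .psf parser; if the .psf reads the file as plain text, it may be simpler to drop that header line instead, since encoded text contains sequences such as =3D.

```python
import quopri

# Same boundary string as in the script above
BOUNDARY = "----=_NextPart_000_0000_01D4E1C0.CE6AA5F0"

def build_mhtml(url: str, page_source: str) -> bytes:
    """Wrap HTML in a single-part MHTML container with a quoted-printable body."""
    # quopri works on bytes, so encode the page as UTF-8 first
    body = quopri.encodestring(page_source.encode("utf-8"))
    header = (
        "MIME-Version: 1.0\r\n"
        f'Content-Type: multipart/related; boundary="{BOUNDARY}"\r\n'
        "\r\n"
        f"--{BOUNDARY}\r\n"
        f"Content-Location: {url}\r\n"
        "Content-Transfer-Encoding: quoted-printable\r\n"
        "Content-Type: text/html; charset=UTF-8\r\n"
        "\r\n"
    ).encode("utf-8")
    footer = f"\r\n--{BOUNDARY}--\r\n".encode("utf-8")
    return header + body + footer

def extract_html(mhtml: bytes) -> str:
    """Recover the original HTML from data produced by build_mhtml."""
    # Skip everything up to the first part delimiter, then the part headers
    part = mhtml.split(f"--{BOUNDARY}\r\n".encode("utf-8"), 1)[1]
    encoded = part.split(b"\r\n\r\n", 1)[1]
    # Cut off the closing delimiter before decoding
    encoded = encoded.rsplit(f"\r\n--{BOUNDARY}--".encode("utf-8"), 1)[0]
    return quopri.decodestring(encoded).decode("utf-8")

# Round trip: what goes in comes back out unchanged
data = build_mhtml("https://www.imdb.com/name/nm0000040/", "<p>Amélie = 1</p>")
print(extract_html(data))
```

The file would then be written in binary mode, e.g. open(SAVE_PATH, "wb"), because build_mhtml returns bytes.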

The Awards page is crucial for all the other scripts, because there I have to make selenium mimic clicking on "More", "All" and similar buttons, which means actually finding and recognizing them all and waiting for them. On filmaffinity.co it was easier because the values are just hidden behind the button, but on IMDb that's not the case, so it's more challenging.

After that comes the challenge of passing the URL to the selenium script, then readapting the scripts again so that they no longer try to download pages themselves (I need that now to see what selenium has to download, but properly). After that, as a long-term goal, the .psf files would serve only to call the selenium script and pass it the title and year; selenium would do the whole job of finding the movie, including offering us a choice, followed by parsing, extracting and formatting the data, and passing it back to the .psf to get it into the PVD database. The concept is feasible:

afrocuban · « **Reply #11 on:** December 18, 2024, 09:13:02 pm »

Python scripts using Selenium can definitely parse data from an IMDb page. Let's extend our existing script to extract and parse relevant information such as movie titles, roles, and other credits from the IMDb page.

Example Script to Parse Data from IMDb Page:
python

Quote

from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
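import time
import csv
import os  # needed for the os.path.exists check at the end

# Paths (raw strings, so single backslashes are enough)
CHROME_DRIVER_PATH = r"Q:\ChromeDriver-win64\chromedriver.exe"
CHROME_BINARY_PATH = r"Q:\GoogleChromePath\chrome.exe"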

# IMDb URL
IMDB_URL = "https://www.imdb.com/name/nm0000040/?showAllCredits=true"

# Chrome options
chrome_options = webdriver.ChromeOptions()
chrome_options.binary_location = CHROME_BINARY_PATH
chrome_options.add_argument("--no-sandbox")
chrome_options.add_argument("--disable-dev-shm-usage")
chrome_options.add_argument("--disable-gpu")
chrome_options.add_argument("--headless")
chrome_options.add_argument("--user-agent=Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36")

# Service object for ChromeDriver
service = Service(executable_path=CHROME_DRIVER_PATH)

# Initialize the WebDriver
driver = webdriver.Chrome(service=service, options=chrome_options)

# Navigate to the specific IMDb page
driver.get(IMDB_URL)

# Wait for the credits section to load
WebDriverWait(driver, 30).until(EC.presence_of_element_located((By.CSS_SELECTOR, ".filmo-category-section")))

# Scroll to the bottom to ensure all content is loaded
driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
time.sleep(10)
driver.execute_script("window.scrollTo(0, 0);")
time.sleep(5)
driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
time.sleep(10)

# Extract movie titles and roles
credits = driver.find_elements(By.CSS_SELECTOR, ".filmo-category-section .filmo-row")

# Prepare the data for CSV
data = []
for credit in credits:
    # find_elements returns an empty list instead of raising when nothing matches
    title_elements = credit.find_elements(By.CSS_SELECTOR, "b a")
    title = title_elements[0].text if title_elements else "N/A"
    year = credit.find_element(By.CSS_SELECTOR, ".year_column").text.strip()
    role_elements = credit.find_elements(By.CSS_SELECTOR, "a[href*='?ref_=nmbio_']") # Adjust the selector to match your needs
    roles = [role.text for role in role_elements] if role_elements else ["N/A"]
    data.append([title, year, ", ".join(roles)])

# Save the data to a CSV file
csv_path = r"C:\PersonalVideoDB\Scripts\Tmp\credits.csv"
with open(csv_path, 'w', newline='', encoding='utf-8') as file:
    writer = csv.writer(file)
    writer.writerow(["Title", "Year", "Roles"])
    writer.writerows(data)

# Confirm file creation
if os.path.exists(csv_path):
    print(f"Data saved successfully to {csv_path}")
else:
    print(f"Failed to save the data to {csv_path}")

# Close the browser
driver.quit()

Key Points:
Wait for the Page to Fully Load: Ensure dynamic content is loaded by waiting for specific elements.

Scroll to Load All Content: Scroll up and down to trigger lazy loading.

Extract Relevant Data: Parse movie titles, years, and roles from the IMDb page.

Save Data: Write the parsed data to a CSV file for easy import into your database.
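
To illustrate the last point, here is a small sketch of reading such a CSV back with the standard csv module. It assumes the three-column layout (Title, Year, Roles) written by the script above, with the roles joined by ", "; the function name load_credits and the sample row are made up for the example.

```python
import csv
import io

def load_credits(csv_file):
    """Read Title/Year/Roles rows and split the joined roles back into a list."""
    credits = []
    for row in csv.DictReader(csv_file):
        # Roles were written as one comma-separated cell, so split them again
        roles = [r.strip() for r in row["Roles"].split(",") if r.strip()]
        credits.append({"title": row["Title"], "year": row["Year"], "roles": roles})
    return credits

# Usage with an in-memory file; for a file on disk, open it with
# newline="" and encoding="utf-8", mirroring how it was written.
sample = io.StringIO('Title,Year,Roles\r\n"Some Title",1999,"Role A, Role B"\r\n')
print(load_credits(sample))
```

Note that splitting on commas would also split a role name that itself contains a comma, so a different separator may be safer if that can happen.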

afrocuban · « **Reply #12 on:** December 18, 2024, 09:15:31 pm »

The concept is feasible and can be efficiently implemented. The idea of using a Pascal/Delphi script to call a Python script with Selenium is quite practical. Here’s a step-by-step outline on how to achieve this efficiently:

1. Pascal/Delphi Script (.psf)
Your Delphi/Pascal application calls a Python script.

It passes the movie title and year to the Python script.

2. Python Script with Selenium
The Python script searches IMDb for the movie.

It offers titles if there are multiple matches and lets you choose.

It parses the relevant data, formats it, and passes it back to the Pascal script.

3. Pascal/Delphi Script Receives Data
The Pascal script receives the data and integrates it into your database.

Detailed Steps
Step 1: Pascal/Delphi Script to Call Python Script
Here’s an example of how to call a Python script from Pascal/Delphi:

pascal

Quote

program CallPython;

{$APPTYPE CONSOLE}

uses
  ShellAPI, SysUtils, Windows;

var
  Title, Year: string;
  PythonExe, ScriptPath, Parameters: string;
  ExecInfo: TShellExecuteInfo;
  ResultFile: TextFile;
  Line: string;
begin
  Title := 'MovieTitle'; // These would be input parameters in your actual app
  Year := '2021';

  PythonExe := 'C:\Path\To\Python\python.exe';
  ScriptPath := 'C:\Path\To\Script\imdb_script.py';
  Parameters := Format('"%s" "%s" "%s"', [ScriptPath, Title, Year]);

  // ShellExecute returns as soon as the process is launched, so use
  // ShellExecuteEx and wait for Python to exit before reading the results.
  FillChar(ExecInfo, SizeOf(ExecInfo), 0);
  ExecInfo.cbSize := SizeOf(ExecInfo);
  ExecInfo.fMask := SEE_MASK_NOCLOSEPROCESS;
  ExecInfo.lpVerb := 'open';
  ExecInfo.lpFile := PChar(PythonExe);
  ExecInfo.lpParameters := PChar(Parameters);
  ExecInfo.nShow := SW_HIDE;

  if not ShellExecuteEx(@ExecInfo) then
  begin
    Writeln('Failed to execute Python script');
    Exit;
  end;
  WaitForSingleObject(ExecInfo.hProcess, INFINITE);
  CloseHandle(ExecInfo.hProcess);
  Writeln('Python script finished');

  // Assuming the Python script writes to a result file
  AssignFile(ResultFile, 'C:\Path\To\Script\results.txt');
  Reset(ResultFile);
  while not Eof(ResultFile) do
  begin
    ReadLn(ResultFile, Line);
    Writeln(Line); // Process each line (this would be your actual data processing)
  end;
  CloseFile(ResultFile);
end.

Step 2: Python Script to Search and Parse IMDb
Here’s an example Python script that receives parameters, searches IMDb, and writes results to a file:

python

Quote

import sys
import time
from urllib.parse import quote_plus

from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By

CHROME_DRIVER_PATH = "path/to/chromedriver"
IMDB_URL_TEMPLATE = "https://www.imdb.com/find?q={query}&s=tt"

def main():
    if len(sys.argv) != 3:
        print("Usage: imdb_script.py <Title> <Year>")
        return

    title = sys.argv[1]
    year = sys.argv[2]
    # URL-encode the query so spaces and special characters survive
    query = quote_plus(f"{title} {year}")

    service = Service(executable_path=CHROME_DRIVER_PATH)
    options = webdriver.ChromeOptions()
    options.add_argument("--headless")

    driver = webdriver.Chrome(service=service, options=options)
    try:
        driver.get(IMDB_URL_TEMPLATE.format(query=query))

        time.sleep(3) # Wait for page to load

        # Assuming first result is the desired movie.
        # The find_element_by_* helpers were removed in recent Selenium 4
        # releases, so find_element(By.CSS_SELECTOR, ...) is used instead.
        # The selectors below depend on IMDb's page markup and may need adjusting.
        first_result = driver.find_element(By.CSS_SELECTOR, ".result_text a")
        first_result.click()

        time.sleep(3) # Wait for movie page to load

        movie_title = driver.find_element(By.CSS_SELECTOR, "h1").text
        movie_year = driver.find_element(By.CSS_SELECTOR, "#titleYear a").text

        # Collect other relevant data...
        director = driver.find_element(By.CSS_SELECTOR, "div.credit_summary_item a").text

        # Write results to file
        with open("path/to/results.txt", "w", encoding="utf-8") as file:
            file.write(f"Title: {movie_title}\n")
            file.write(f"Year: {movie_year}\n")
            file.write(f"Director: {director}\n")
    finally:
        # Close the browser even if a lookup above fails
        driver.quit()

if __name__ == "__main__":
    main()

Step 3: Process Results in Pascal/Delphi
Back in your Pascal/Delphi script, read and process the results written by the Python script.
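
What that processing amounts to can be sketched briefly. The example is in Python only to keep it short and testable; the same split-on-the-first-colon logic would have to be written in the .psf. It assumes the "Key: value" line format that the Python script above writes, and parse_results is a hypothetical helper name.

```python
def parse_results(text: str) -> dict:
    """Turn 'Key: value' lines into a dictionary, skipping blank or malformed lines."""
    fields = {}
    for line in text.splitlines():
        # Split on the first colon only, so a value containing a colon stays intact
        key, sep, value = line.partition(":")
        if sep and key.strip():
            fields[key.strip()] = value.strip()
    return fields

# Made-up sample in the same shape as results.txt
sample = "Title: Part One: The Beginning\nYear: 2021\nDirector: Some Director\n"
print(parse_results(sample))
```

Splitting on the first colon matters because movie titles often contain colons themselves.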

Efficiency Considerations
Parallel Execution: If you’re processing multiple movies, consider parallel execution where possible.

Caching: Implement caching for already fetched and processed data to avoid redundant processing.

Error Handling: Implement robust error handling and logging to manage issues with web scraping or data extraction.
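
The caching point can be sketched with nothing but the standard library: before launching the browser, check whether a recent copy of the page is already on disk. The cache folder, the one-day maximum age and the function names are assumptions for illustration, not something the scripts in this thread do.

```python
import hashlib
import os
import tempfile
import time

MAX_AGE_SECONDS = 24 * 60 * 60  # assumed freshness window: one day

def cache_path(cache_dir: str, url: str) -> str:
    """Map a URL to a stable file name inside cache_dir."""
    digest = hashlib.sha256(url.encode("utf-8")).hexdigest()[:16]
    return os.path.join(cache_dir, f"{digest}.htm")

def get_cached(cache_dir: str, url: str, max_age: float = MAX_AGE_SECONDS):
    """Return the cached HTML if a fresh copy exists, otherwise None."""
    path = cache_path(cache_dir, url)
    if not os.path.exists(path):
        return None
    if time.time() - os.path.getmtime(path) > max_age:
        return None  # too old, the caller should download again
    with open(path, "r", encoding="utf-8") as fh:
        return fh.read()

def store(cache_dir: str, url: str, html: str) -> None:
    """Save freshly downloaded HTML for later runs."""
    os.makedirs(cache_dir, exist_ok=True)
    with open(cache_path(cache_dir, url), "w", encoding="utf-8") as fh:
        fh.write(html)

# Usage: the second lookup is served from disk
with tempfile.TemporaryDirectory() as tmp:
    url = "https://www.imdb.com/name/nm0190859/bio/"
    if get_cached(tmp, url) is None:
        store(tmp, url, "<html>downloaded page</html>")  # stand-in for the selenium download
    print(get_cached(tmp, url))
```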

Summary
This approach ensures the seamless integration of Python and Pascal/Delphi scripts, leveraging the strengths of each for your specific needs. This should streamline the process and reduce manual intervention, improving overall efficiency.

afrocuban · « **Reply #13 on:** December 18, 2024, 09:19:56 pm »

In the end, setting up the environment is too complex for an average end user, so I'll try exploring options for checking at the start whether the user has Python, selenium and the Chrome driver, and if not, offering to download and set all of that up for them...
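
A rough sketch of such a check, using only the standard library. It can only run once Python itself is present, so detecting a missing Python would still have to happen on the .psf or batch side. The function name check_environment is hypothetical, and looking for the drivers on PATH is an assumption; the scripts in this thread use fixed driver paths instead.

```python
import importlib.util
import shutil
import sys

def check_environment(driver_names=("chromedriver", "geckodriver")):
    """Report which prerequisites are visible from this Python interpreter."""
    report = {
        "python": sys.version.split()[0],
        # find_spec returns None when the package is not installed
        "selenium": importlib.util.find_spec("selenium") is not None,
    }
    for name in driver_names:
        # shutil.which searches PATH; a driver kept elsewhere is reported as None
        report[name] = shutil.which(name)
    return report

for item, status in check_environment().items():
    print(f"{item}: {status if status else 'NOT FOUND'}")
```

Anything reported as NOT FOUND would be a candidate for an offer to download and install it.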

A long road ahead for inevitable transition to almost full selenium-like tools, but once set up it'll be way easier because we at least won't have to download pages thus overriding "HTTPS issue" once and for all.

afrocuban · « **Reply #14 on:** December 18, 2024, 09:22:39 pm »

Regarding your script in Slovenian, I just asked Copilot to translate it to English and here it is:

Quote

import sys
import os
from selenium import webdriver
from selenium.webdriver.firefox.service import Service
import time

# Check if IMDb URL is provided as a parameter
if len(sys.argv) < 2:
    print("IMDb URL was not provided as a parameter.")
    sys.exit(1)

imdb_url = sys.argv[1] # IMDb URL from the command line

# Path to geckodriver.exe
gecko_path = "C:/Projects/geckodriver.exe" # Adjust the path according to the driver location

# Get the current application path
app_path = os.path.dirname(os.path.abspath(__file__)) # Path to the current Python script

# Check if your "PVD_0.9.9.21_MOD-Simple AllMovies" folder is on the D: drive or elsewhere
pvd_path = "D:\\MyPVD\\PVD_0.9.9.21_MOD-Simple AllMovies" # Set this path once, so it does not change

# If you want a universal path, use app_path to combine
output_path = os.path.join(pvd_path, "Scripts", "Tmp", "downpage-UTF8_NO_BOM.htm")

# Check if the folder exists, if not, create it
os.makedirs(os.path.dirname(output_path), exist_ok=True)

# Create a browser object
service = Service(gecko_path)
driver = webdriver.Firefox(service=service)

try:
    # Open the IMDb page
    driver.get(imdb_url)
    print(f"The page {imdb_url} is loaded.")

    # Wait for the page to load
    time.sleep(5)

    # Get the entire source HTML of the page
    html_source = driver.page_source

    # Save the HTML to a file
    with open(output_path, 'w', encoding='utf-8') as file:
        file.write(html_source)
    print(f"HTML is saved to file: {output_path}")

finally:
    # Close the browser
    driver.quit()

afrocuban · « **Reply #15 on:** December 19, 2024, 06:05:36 am »

Here's selenium script that expands "more"-like buttons and captures whole content of the page, static and dynamic:

Quote

from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
import time
import os

# Paths
CHROME_DRIVER_PATH = r"C:\ChromeDriver-win64\chromedriver.exe"
CHROME_BINARY_PATH = r"C:\PATH_TO\chrome.exe"
SAVE_PATH = r"C:\PersonalVideoDB\Scripts\Tmp\UTF8_NO_BOM-Awards.mhtml"

# IMDb URL
IMDB_URL = "https://www.imdb.com/name/nm0190859/awards/"

# Chrome options
chrome_options = webdriver.ChromeOptions()
chrome_options.binary_location = CHROME_BINARY_PATH # Specify the Chrome binary location
chrome_options.add_argument("--no-sandbox")
chrome_options.add_argument("--disable-dev-shm-usage")
chrome_options.add_argument("--disable-gpu") # Disable GPU for stability
# Remove the headless mode option for non-headless browsing
# chrome_options.add_argument("--headless")
# chrome_options.add_argument("--user-agent=Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36")

# Service object for ChromeDriver
service = Service(executable_path=CHROME_DRIVER_PATH)

# Initialize the WebDriver
driver = webdriver.Chrome(service=service, options=chrome_options)

# Navigate to the IMDb page
driver.get(IMDB_URL)

# Wait for the page to fully load
time.sleep(3) # Adjusted sleep time for faster loading

# Function to find and click each "See More" button one by one
def click_see_more_buttons(max_rounds=20):
    # Cap the number of passes (20 is an arbitrary limit) so that a button
    # which never disappears cannot keep the loop running forever
    for _ in range(max_rounds):
        buttons = driver.find_elements(By.XPATH, "//span[@class='ipc-btn__text']//span[contains(@class, 'ipc-see-more__text')]")
        if not buttons:
            break
        for button in buttons:
            try:
                driver.execute_script("arguments[0].scrollIntoView(true);", button) # Scroll to the button to ensure it's visible
                time.sleep(1)
                button.click()
                print("Clicked a 'See More' button.")
                time.sleep(1) # Short delay to ensure the click is registered
            except Exception as e:
                print(f"Error clicking a 'See More' button: {e}")

# Click the "See More" buttons
click_see_more_buttons()

# Get page source
page_source = driver.page_source

# Constructing the MHTML content manually
mhtml_content = f"""MIME-Version: 1.0
Content-Type: multipart/related; boundary="----=_NextPart_000_0000_01D4E1C0.CE6AA5F0"

This document is a Single File Web Page, also known as a Web Archive file.

------=_NextPart_000_0000_01D4E1C0.CE6AA5F0
Content-Location: {IMDB_URL}
Content-Transfer-Encoding: quoted-printable
Content-Type: text/html; charset=UTF-8

{page_source}

------=_NextPart_000_0000_01D4E1C0.CE6AA5F0--
"""

# Write the MHTML data to the specified file path
with open(SAVE_PATH, "w", encoding="utf-8") as file:
    file.write(mhtml_content)

# Wait to ensure the file is saved
time.sleep(2) # Adjust the sleep time if necessary

# Confirm file creation
if os.path.exists(SAVE_PATH):
    print(f"Page saved successfully to {SAVE_PATH}")
else:
    print(f"Failed to save the page to {SAVE_PATH}")

# Close the browser
driver.quit()

afrocuban · « **Reply #16 on:** December 19, 2024, 10:46:50 am »

And here's the final working headless version that can be used for all sections. The only further speed-up I can think of is running the downloads in parallel processes (so that the .psf doesn't have to wait for each page to be downloaded before parsing it), but I'm not sure how big that improvement would be on the .psf side or anywhere else.

Quote

from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
import time
import os

# Paths
CHROME_DRIVER_PATH = r"C:\PATH\TO\chromedriver.exe"
CHROME_BINARY_PATH = r"C:\PATH\TO\chrome.exe"
SAVE_PATH = r"C:\PATH\TO\PersonalVideoDB\Scripts\Tmp\UTF8_NO_BOM-Awards.mhtml"

# IMDb URL
IMDB_URL = "https://www.imdb.com/name/nm0190859/awards/"

# Chrome options
chrome_options = webdriver.ChromeOptions()
chrome_options.binary_location = CHROME_BINARY_PATH # Specify the Chrome binary location
chrome_options.add_argument("--no-sandbox")
chrome_options.add_argument("--disable-dev-shm-usage")
chrome_options.add_argument("--disable-gpu") # Disable GPU for stability
# Uncomment the headless mode option for headless browsing
chrome_options.add_argument("--headless")
chrome_options.add_argument("--user-agent=Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36")

# Service object for ChromeDriver
service = Service(executable_path=CHROME_DRIVER_PATH)

# Initialize the WebDriver
driver = webdriver.Chrome(service=service, options=chrome_options)

# Adding cookie
cookie = {'name': 'example_cookie', 'value': 'cookie_value'}
driver.get(IMDB_URL)
driver.add_cookie(cookie)

# Navigate to the IMDb page
driver.get(IMDB_URL)

# Wait for the page to fully load
time.sleep(3) # Adjusted sleep time for faster loading

# Function to find and click each "See More" button one by one
def click_see_more_buttons(max_rounds=20):
    # Cap the number of passes (20 is an arbitrary limit) so that a button
    # which never disappears cannot keep the loop running forever
    for _ in range(max_rounds):
        buttons = driver.find_elements(By.XPATH, "//span[@class='ipc-btn__text']//span[contains(@class, 'ipc-see-more__text')]")
        if not buttons:
            break
        for button in buttons:
            try:
                driver.execute_script("arguments[0].scrollIntoView(true);", button) # Scroll to the button to ensure it's visible
                time.sleep(1)
                button.click()
                print("Clicked a 'See More' button.")
                time.sleep(1) # Short delay to ensure the click is registered
            except Exception as e:
                print(f"Error clicking a 'See More' button: {e}")

# Click the "See More" buttons
click_see_more_buttons()

# Get page source
page_source = driver.page_source

# Constructing the MHTML content manually
mhtml_content = f"""MIME-Version: 1.0
Content-Type: multipart/related; boundary="----=_NextPart_000_0000_01D4E1C0.CE6AA5F0"

This document is a Single File Web Page, also known as a Web Archive file.

------=_NextPart_000_0000_01D4E1C0.CE6AA5F0
Content-Location: {IMDB_URL}
Content-Transfer-Encoding: quoted-printable
Content-Type: text/html; charset=UTF-8

{page_source}

------=_NextPart_000_0000_01D4E1C0.CE6AA5F0--
"""

# Write the MHTML data to the specified file path
with open(SAVE_PATH, "w", encoding="utf-8") as file:
    file.write(mhtml_content)

# Wait to ensure the file is saved
time.sleep(2) # Adjust the sleep time if necessary

# Confirm file creation
if os.path.exists(SAVE_PATH):
    print(f"Page saved successfully to {SAVE_PATH}")
else:
    print(f"Failed to save the page to {SAVE_PATH}")

# Close the browser
driver.quit()

afrocuban · « **Reply #17 on:** December 19, 2024, 11:17:10 am »

And here is integrated selenium script that downloads in parallel all the pages needed for parsing. Genres has to be html, because of what has to be parsed, otherwise in mhtml it's not accessible.

Quote

import threading
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
import time
import os

# Paths
CHROME_DRIVER_PATH = r"Q:\Portableapps\ChromeDriver-win64\chromedriver.exe"
CHROME_BINARY_PATH = r"Q:\Portableapps\GoogleChromePortable64\App\Chrome-bin\chrome.exe"

# IMDb URLs and their corresponding save paths
URLS_AND_PATHS = {
"https://www.imdb.com/name/nm0190859/awards/": r"Q:\Portableapps\PersonalVideoDB\Scripts\Tmp\UTF8_NO_BOM-Awards.mhtml",
"https://www.imdb.com/name/nm0190859/": r"Q:\Portableapps\PersonalVideoDB\Scripts\Tmp\UTF8_NO_BOM-Main.mhtml",
"https://www.imdb.com/name/nm0190859/bio/": r"Q:\Portableapps\PersonalVideoDB\Scripts\Tmp\UTF8_NO_BOM-Bio.mhtml",
"https://www.imdb.com/search/title/?explore=genres&role=nm0190859": r"Q:\Portableapps\PersonalVideoDB\Scripts\Tmp\UTF8_NO_BOM-Genres.html", # Change to .html for classic HTML
"https://www.imdb.com/name/nm0190859/?showAllCredits=true": r"Q:\Portableapps\PersonalVideoDB\Scripts\Tmp\UTF8_NO_BOM-Credit.mhtml"
}

# Chrome options
chrome_options = webdriver.ChromeOptions()
chrome_options.binary_location = CHROME_BINARY_PATH # Specify the Chrome binary location
chrome_options.add_argument("--no-sandbox")
chrome_options.add_argument("--disable-dev-shm-usage")
chrome_options.add_argument("--disable-gpu") # Disable GPU for headless mode stability
chrome_options.add_argument("--headless") # Running Chrome in headless mode
chrome_options.add_argument("--user-agent=Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36")

# Function to download a page
def download_page(IMDB_URL, SAVE_PATH):
    # Service object for ChromeDriver
    service = Service(executable_path=CHROME_DRIVER_PATH)

    # Initialize the WebDriver
    driver = webdriver.Chrome(service=service, options=chrome_options)

    # Add a custom cookie
    driver.get("https://www.imdb.com") # Open the base URL to set the cookie
    cookie = {'name': 'example_cookie', 'value': 'example_value', 'domain': 'imdb.com'}
    driver.add_cookie(cookie)

    # Navigate to the IMDb page
    driver.get(IMDB_URL)

    # Wait for the page to fully load and specific element to ensure all content is loaded
    try:
        WebDriverWait(driver, 5).until(
            EC.presence_of_element_located((By.CSS_SELECTOR, "span.ipc-title__text"))
        )
    except Exception as e:
        print(f"Error waiting for the page to load: {e}")

    # Special handling for the awards page
    if "awards" in IMDB_URL:
        # Function to find and click each "See More" button one by one
        def click_see_more_buttons(max_rounds=20):
            # Cap the number of passes (20 is an arbitrary limit) so that a button
            # which never disappears cannot keep the loop running forever
            for _ in range(max_rounds):
                buttons = driver.find_elements(By.XPATH, "//span[@class='ipc-btn__text']//span[contains(@class, 'ipc-see-more__text')]")
                if not buttons:
                    break
                for button in buttons:
                    try:
                        driver.execute_script("arguments[0].scrollIntoView(true);", button) # Scroll to the button to ensure it's visible
                        time.sleep(1)
                        button.click()
                        print("Clicked a 'See More' button.")
                        time.sleep(1) # Short delay to ensure the click is registered
                    except Exception as e:
                        print(f"Error clicking a 'See More' button: {e}")

        # Click the "See More" buttons
        click_see_more_buttons()

    # Get page source
    page_source = driver.page_source

    # Save page source as MHTML or HTML
    if IMDB_URL == "https://www.imdb.com/search/title/?explore=genres&role=nm0190859":
        # Save as classic HTML
        with open(SAVE_PATH, "w", encoding="utf-8") as file:
            file.write(page_source)
        print(f"Page saved as HTML to {SAVE_PATH}")
    else:
        # Constructing the MHTML content manually
        # (the string body stays at column 0 so the MIME headers are not indented)
        mhtml_content = f"""MIME-Version: 1.0
Content-Type: multipart/related; boundary="----=_NextPart_000_0000_01D4E1C0.CE6AA5F0"

This document is a Single File Web Page, also known as a Web Archive file.

------=_NextPart_000_0000_01D4E1C0.CE6AA5F0
Content-Location: {IMDB_URL}
Content-Transfer-Encoding: quoted-printable
Content-Type: text/html; charset=UTF-8

{page_source}

------=_NextPart_000_0000_01D4E1C0.CE6AA5F0--
"""

        # Write the MHTML data to the specified file path
        with open(SAVE_PATH, "w", encoding="utf-8") as file:
            file.write(mhtml_content)
        print(f"Page saved as MHTML to {SAVE_PATH}")

    # Wait to ensure the file is saved
    time.sleep(2) # Adjust the sleep time if necessary

    # Confirm file creation
    if os.path.exists(SAVE_PATH):
        print(f"Page saved successfully to {SAVE_PATH}")
    else:
        print(f"Failed to save the page to {SAVE_PATH}")

    # Close the browser
    driver.quit()

# Create and start threads for each URL
threads = []
for url, save_path in URLS_AND_PATHS.items():
    thread = threading.Thread(target=download_page, args=(url, save_path))
    threads.append(thread)
    thread.start()

# Wait for all threads to complete
for thread in threads:
    thread.join()

print("All pages have been saved successfully.")

Next steps:
1. Testing .psf just parsing these pages, for the specific person.
2. Testing passing person url to selenium script, waiting for downloading to finish.
3. Adjusting selenium script to accept parameters (urls, title, year) and to process them.
4. Testing .psf to get back results and parse them, format them and populate them to PVD.
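
For step 3, one possible shape is sketched below: instead of hard-coding URLS_AND_PATHS for one person, build it from an IMDb person ID passed on the command line. The URL patterns and file names are taken from the script above; the names PAGE_TEMPLATES, build_targets and parse_args and the --tmp-dir option are assumptions for illustration.

```python
import argparse
import os

# URL patterns from the script above; {pid} is the IMDb person ID, e.g. nm0190859.
# Genres is saved as plain HTML, everything else as MHTML.
PAGE_TEMPLATES = {
    "Awards": ("https://www.imdb.com/name/{pid}/awards/", "mhtml"),
    "Main": ("https://www.imdb.com/name/{pid}/", "mhtml"),
    "Bio": ("https://www.imdb.com/name/{pid}/bio/", "mhtml"),
    "Genres": ("https://www.imdb.com/search/title/?explore=genres&role={pid}", "html"),
    "Credit": ("https://www.imdb.com/name/{pid}/?showAllCredits=true", "mhtml"),
}

def build_targets(person_id: str, tmp_dir: str) -> dict:
    """Return {url: save_path} for every page needed for one person."""
    targets = {}
    for section, (template, ext) in PAGE_TEMPLATES.items():
        url = template.format(pid=person_id)
        targets[url] = os.path.join(tmp_dir, f"UTF8_NO_BOM-{section}.{ext}")
    return targets

def parse_args(argv=None):
    """Parse the person ID and the output folder from the command line."""
    parser = argparse.ArgumentParser(description="Download IMDb person pages")
    parser.add_argument("person_id", help="IMDb person ID such as nm0190859")
    parser.add_argument("--tmp-dir", default="Tmp", help="folder for the saved pages")
    return parser.parse_args(argv)

# Usage with an explicit argument list; a real script would call parse_args()
# with no arguments so that it reads what the .psf passed on the command line.
args = parse_args(["nm0190859", "--tmp-dir", "Tmp"])
for url, path in build_targets(args.person_id, args.tmp_dir).items():
    print(url, "->", path)
```

With the file extension stored next to each URL, the save step could choose HTML or MHTML from the extension rather than by comparing against one fixed Genres URL.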

Ivek23 · « **Reply #18 on:** December 19, 2024, 01:04:57 pm »

This is out of the question for me because I only use Firefox and geckodriver.

afrocuban · « **Reply #19 on:** December 20, 2024, 12:41:14 am »

Well, I don't use any major browser at all actually. I'm too into security. If there wasn't PVD I wouldn't use Windows at all. That's why I have my Windows virtual machine - only to be able to have PVD. But I don't care if in this case it's firefox or chrome, because I will not use them anyway. Selenium will, and neither me or anyone else will notice. Except there is a reason I'm not aware of, for which I apologize in advance!
