NetworkShark, I have a question for you.
Would it be possible to write a script that downloads all IMDb Also Known As (AKA) titles from, for example, this URL:
https://www.imdb.com/title/tt0147800/releaseinfo/?ref_=tt_ov_at_dt_rdat
IMDb loads the Also Known As (AKA) titles dynamically, and it uses the same method on some of its other pages.
I am attaching the IMDB_[EN][HTTPS]_TEST_2c 2c script to make this easier.
I hope a solution can be found within PVD so that this information can be transferred to the PVD database.
Are you sure the AKA data is dynamic? I downloaded the page from my Linux machine and the data is there in the page.
I will run some tests with the 2c2c script you linked here and come back with some answers.
Sorry for joining the conversation uninvited, and without knowing programming.
No problem, afrocuban. I am indeed interested in how I can integrate other ways to improve info retrieval from PVD.
Maybe this part can be a separate thread in this forum.
But I learned a bit along the way while trying to download the FA page with trailers locally. I managed to download it as downpage_trailer-UTF8_NO_BOM_FA.htm alongside downpage-UTF8_NO_BOM.htm so that I could try to parse them both, but there is no good news. I can parse them both, but the dynamic content isn't downloaded.
I don't think we can do it with PVdBDownPage.exe:
// Download the URL page.
LogMessage(Chr(9) + Chr(9) + ' Download with PVdBDownPage in file:|' + ScriptPath + OutFile + ' the information of:|' + URL + '||');
FileExecute(ScriptPath + 'PVdBDownPage.exe', '"' + URL + '" "' + ScriptPath + OutFile + '"');
because it cannot download dynamic content: the video elements are most likely loaded by JavaScript on the original page. When you download the page using DownloadPage, the JavaScript doesn’t execute, so the dynamic content (like embedded trailers) isn’t included in the downloaded HTML file.
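A quick way to confirm whether a saved page is missing its dynamic content is to look for the element that the JavaScript is supposed to add. The sketch below uses only the Python standard library; the two HTML strings are made-up stand-ins for a statically downloaded file and a browser-rendered one, and "divvid" is the class name used in the Selenium wait condition discussed in this thread.

```python
from html.parser import HTMLParser


class ClassFinder(HTMLParser):
    """Records whether any tag in the document carries the given CSS class."""

    def __init__(self, class_name):
        super().__init__()
        self.class_name = class_name
        self.found = False

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; value can be None
        for name, value in attrs:
            if name == "class" and value and self.class_name in value.split():
                self.found = True


def has_class(html, class_name):
    """Return True if any element in html has class_name among its classes."""
    finder = ClassFinder(class_name)
    finder.feed(html)
    return finder.found


# Hypothetical samples: what a plain download might contain vs. a rendered page
static_html = '<html><body><div id="videos"></div><script src="player.js"></script></body></html>'
rendered_html = '<html><body><div id="videos"><div class="divvid">trailer</div></div></body></html>'

print(has_class(static_html, "divvid"))    # False
print(has_class(rendered_html, "divvid"))  # True
```

If the check returns False for the file saved by PVdBDownPage.exe but the element is visible in the browser's Developer Tools, the content is being added after the initial download.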
To work around this, we have a couple of options:
Use a Web Browser Automation Tool: Tools like Selenium can control a web browser to visit the page and let all JavaScript execute before saving the fully rendered page’s HTML. This way, the dynamic content will be included.
Find the Video URL Manually: Check the original HTML of the page in your browser (right-click and “View Page Source” or inspect elements with Developer Tools). Look for the actual URLs of the video files. You can then use those URLs directly in your script.
Using a web browser automation tool like Selenium is a bit more involved and requires additional setup, but it is a more robust solution for handling dynamic content. If you prefer to go this route, I can guide you through the setup process.
Alternatively, manually inspecting the page’s source for video URLs is quicker but less convenient for repeated tasks.
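For the manual route, the search for video URLs in the page source can be partly automated. This is only a sketch under assumptions: it treats any http(s) link ending in .mp4, .webm, or .m3u8 as a video URL, and the sample HTML is invented for illustration. The real page may use other extensions or build its URLs in JavaScript, in which case nothing will be found.

```python
import re

# Assumption: video links end in one of these extensions, optionally followed by a query string
VIDEO_URL_RE = re.compile(
    r"""https?://[^\s"'<>]+\.(?:mp4|webm|m3u8)(?:\?[^\s"'<>]*)?""",
    re.IGNORECASE,
)


def find_video_urls(html):
    """Return video-like URLs found in html, without duplicates, in order of appearance."""
    urls = []
    for match in VIDEO_URL_RE.finditer(html):
        url = match.group(0)
        if url not in urls:
            urls.append(url)
    return urls


# Hypothetical page source
sample = """
<video><source src="https://example.com/trailers/one.mp4" type="video/mp4"></video>
<script>var clip = "https://example.com/trailers/two.webm?token=abc";</script>
<a href="https://example.com/trailers/one.mp4">again</a>
<img src="https://example.com/poster.jpg">
"""

for url in find_video_urls(sample):
    print(url)
```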
Here's the rest of the conversation:
Invoking Selenium directly from within a PVD script (written in Pascal) is not straightforward because Selenium typically operates via Python, Java, or another high-level language. However, you can use a workaround by invoking an external Python script from your PVD Pascal script.
Here’s how you can integrate Selenium into your workflow:
Create a Python Script Using Selenium:
This script will navigate to the URL, wait for the content to load, and save the fully rendered HTML.
python
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# Set up the WebDriver. Selenium 4 takes the driver path through a Service
# object; recent releases no longer accept executable_path on Chrome() itself.
driver = webdriver.Chrome(service=Service(executable_path='path_to_chromedriver'))
try:
    # Navigate to the URL
    driver.get("https://www.filmaffinity.com/en/evideos.php?movie_id=xxxx")
    # Wait up to 10 seconds for the trailers to load (adjust the waiting condition as needed)
    WebDriverWait(driver, 10).until(
        EC.presence_of_element_located((By.CLASS_NAME, "divvid"))
    )
    # Get the fully rendered HTML
    html = driver.page_source
    # Save the HTML to a file
    with open('downpage_trailer-UTF8_NO_BOM_FA.htm', 'w', encoding='utf-8') as file:
        file.write(html)
finally:
    # Close the browser, even if the wait timed out
    driver.quit()
Invoke the Python Script from Your PVD Script:
Use the FileExecute method to call the Python script from within your Pascal script.
pascal
// Log the action
LogMessage(Chr(9) + Chr(9) + ' Download with Selenium in file:|' + ScriptPath + OutFile + ' the information of:|' + URL + '||');
// Construct the command to run the Python script
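// The script path is quoted so that a ScriptPath containing spaces is passed as one argument
FileExecute('python.exe', '"' + ScriptPath + 'selenium_script.py"');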
In this setup:
The Python script handles the browsing and downloading of the fully rendered HTML using Selenium.
Your PVD script uses the FileExecute method to call the Python script, passing any necessary parameters if needed.
After the Python script finishes and the HTML is saved, your PVD script can continue processing the saved HTML as usual.
Example Complete Workflow
1. Python Script (selenium_script.py)
python
import sys

from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# Get the URL and the output file from the command line arguments
url = sys.argv[1]
output_file = sys.argv[2]

# Set up the WebDriver. Selenium 4 takes the driver path through a Service
# object; recent releases no longer accept executable_path on Chrome() itself.
driver = webdriver.Chrome(service=Service(executable_path='path_to_chromedriver'))
try:
    # Navigate to the URL
    driver.get(url)
    # Wait up to 10 seconds for the trailers to load (adjust the waiting condition as needed)
    WebDriverWait(driver, 10).until(
        EC.presence_of_element_located((By.CLASS_NAME, "divvid"))
    )
    # Get the fully rendered HTML
    html = driver.page_source
    # Save the HTML to a file
    with open(output_file, 'w', encoding='utf-8') as file:
        file.write(html)
finally:
    # Close the browser, even if the wait timed out
    driver.quit()
2. Pascal Script (Invoke Selenium Script)
pascal
// Log the action
LogMessage(Chr(9) + Chr(9) + ' Download with Selenium in file:|' + ScriptPath + OutFile + ' the information of:|' + URL + '||');
// Construct the command to run the Python script
FileExecute('python.exe', '"' + ScriptPath + 'selenium_script.py" "' + URL + '" "' + ScriptPath + OutFile + '"');
By following this workflow, you can effectively integrate Selenium into your PVD script to handle pages with dynamic content. This way, you can ensure that all necessary content, like embedded trailers, is included in the HTML that your PVD script processes.
How would selenium know which url to open?
Great question. You can pass the URL to Selenium through command line arguments when you invoke the Python script. This way, your PVD script can dynamically supply the URL to Selenium. Here's how you can do it:
Step-by-Step Implementation
Modify Your Python Script to Accept Command Line Arguments:
You’ll adjust the Python script to accept the URL as an argument. This ensures that Selenium knows which URL to open.
Example Python Script (selenium_script.py):
python
import sys

from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# Get the URL and output file from the command line arguments
url = sys.argv[1]
output_file = sys.argv[2]

# Set up the WebDriver. Selenium 4 takes the driver path through a Service
# object; recent releases no longer accept executable_path on Chrome() itself.
driver = webdriver.Chrome(service=Service(executable_path='path_to_chromedriver'))
try:
    # Navigate to the URL
    driver.get(url)
    # Wait up to 10 seconds for the trailers to load (adjust the waiting condition as needed)
    WebDriverWait(driver, 10).until(
        EC.presence_of_element_located((By.CLASS_NAME, "divvid"))
    )
    # Get the fully rendered HTML
    html = driver.page_source
    # Save the HTML to a file
    with open(output_file, 'w', encoding='utf-8') as file:
        file.write(html)
finally:
    # Close the browser, even if the wait timed out
    driver.quit()
Invoke the Python Script from Your PVD Script:
Use the FileExecute method to call the Python script, passing the URL and the output file path as arguments.
Example Pascal Script:
pascal
// Log the action
LogMessage(Chr(9) + Chr(9) + ' Download with Selenium in file:|' + ScriptPath + OutFile + ' the information of:|' + URL + '||');
// Construct the command to run the Python script
FileExecute('python.exe', '"' + ScriptPath + 'selenium_script.py" "' + URL + '" "' + ScriptPath + OutFile + '"');
How It Works
Python Script: The Python script (selenium_script.py) receives the URL and the output file path as command line arguments. It then uses Selenium to open the URL, waits for the trailers to load, and saves the fully rendered HTML to the specified output file.
Pascal Script: Your PVD script constructs the command to run the Python script with the necessary arguments (the URL and the output file path) and then executes it. This ensures that the correct URL is passed to Selenium dynamically.
This approach allows you to integrate Selenium seamlessly into your PVD script, enabling it to handle pages with dynamic content effectively. If you need further assistance with any part of the setup, feel free to ask!
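Because the PVD script builds the command line by string concatenation, it may help if the Python side validates what it receives before a browser is launched. The following is a minimal sketch using only the standard library. The function name parse_cli and the optional --timeout flag are illustrative assumptions, not part of the scripts above.

```python
import argparse
from urllib.parse import urlparse


def parse_cli(argv):
    """Parse and validate the two positional arguments passed by the PVD script."""
    parser = argparse.ArgumentParser(description="Save a rendered page to a file.")
    parser.add_argument("url", help="page to open in the browser")
    parser.add_argument("output_file", help="where to write the rendered HTML")
    # Hypothetical extra option: seconds to wait for the dynamic content
    parser.add_argument("--timeout", type=float, default=10.0)
    args = parser.parse_args(argv)

    parts = urlparse(args.url)
    if parts.scheme not in ("http", "https") or not parts.netloc:
        # parser.error prints a usage message and exits with status 2
        parser.error("url must start with http:// or https://")
    return args


# Example with the same argument order that FileExecute passes
args = parse_cli(["https://www.filmaffinity.com/en/evideos.php?movie_id=xxxx", "downpage.htm"])
print(args.url, args.output_file, args.timeout)
```

In selenium_script.py the call would be parse_cli(sys.argv[1:]), and a missing or malformed argument then produces a usage message instead of an IndexError.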
How slow is this process?
The process of using Selenium to render and download the HTML can be slower than simply downloading a static page because Selenium has to:
Open a Web Browser: Launching a browser instance (e.g., Chrome or Firefox) takes a few seconds.
Navigate to the URL: Just like a human user, Selenium will navigate to the specified URL, which involves network latency.
Wait for JavaScript Execution: Selenium waits for the page to fully load, including the execution of any JavaScript that dynamically loads content. This can vary in time depending on the complexity of the page and the speed of your internet connection.
Save the HTML: Once the page is fully loaded, Selenium will extract and save the HTML content, which is relatively quick.
Estimated Time
Browser Launch: 1-3 seconds
Page Navigation: 1-5 seconds (depends on network speed)
JavaScript Execution & Content Load: 3-10 seconds (depends on page complexity)
Saving HTML: <1 second
In total, it can take anywhere from 5 to 20 seconds or more per page, depending on various factors.
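The total follows from adding the per-step ranges, and the same arithmetic shows what a whole batch of pages would cost when they are processed one after another. In this sketch "<1 second" is treated as 0 to 1 seconds, which is an assumption made only to keep the sum simple.

```python
# Per-step estimates in seconds, as (low, high)
STEPS = {
    "browser launch": (1, 3),
    "page navigation": (1, 5),
    "JavaScript execution and content load": (3, 10),
    "saving HTML": (0, 1),  # "<1 second" treated as 0 to 1
}


def estimate_seconds(pages):
    """Return the (low, high) total in seconds for processing pages sequentially."""
    low = sum(lo for lo, _ in STEPS.values()) * pages
    high = sum(hi for _, hi in STEPS.values()) * pages
    return low, high


print(estimate_seconds(1))    # (5, 19)
print(estimate_seconds(100))  # (500, 1900)
```

So a batch of 100 titles would take roughly 8 to 32 minutes at these rates.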
Performance Considerations
Network Speed: Faster internet connections will reduce the navigation and loading time.
Page Complexity: Pages with heavy JavaScript and multiple dynamic elements will take longer to fully load.
Browser & System Performance: Using a faster browser and a system with more resources (CPU, RAM) will speed up the process.
While this is slower compared to a direct HTML download using PVdBDownPage.exe, it ensures that all dynamic content is fully loaded and accessible. This trade-off can be worthwhile if you need to scrape content that is loaded dynamically by JavaScript.
If speed becomes a critical issue, you might look into optimizing the Selenium script further, such as by reducing the wait times where possible or running multiple instances in parallel if applicable.
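Running several instances in parallel could look roughly like the sketch below. It uses a thread pool from the standard library and takes the rendering step as a function, so the demo at the bottom uses a stand-in (fake_render) that only sleeps. The names render_all, render_with_script, and fake_render are illustrative, and render_with_script assumes selenium_script.py sits in the current directory and accepts a URL and an output file as described above.

```python
import subprocess
import sys
import time
from concurrent.futures import ThreadPoolExecutor


def render_with_script(url, output_file):
    """Run selenium_script.py in a separate process (assumed to exist; not called in the demo)."""
    subprocess.run([sys.executable, "selenium_script.py", url, output_file], check=True)


def render_all(jobs, render, max_workers=3):
    """Call render(url, output_file) for every job, at most max_workers at a time.

    Returns a dict mapping each output file to None on success or to the exception raised.
    """
    results = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(render, url, out): out for url, out in jobs}
        for future, out in futures.items():
            try:
                future.result()
                results[out] = None
            except Exception as exc:  # keep going if one page fails
                results[out] = exc
    return results


def fake_render(url, output_file):
    """Stand-in for the browser step so the sketch runs without Selenium."""
    time.sleep(0.1)


jobs = [
    ("https://example.com/a", "a.htm"),
    ("https://example.com/b", "b.htm"),
    ("https://example.com/c", "c.htm"),
]
print(render_all(jobs, fake_render))
```

Each parallel instance launches its own browser, so memory use grows with max_workers, and sending many requests at once may run into rate limits on the target site.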