feat(youtube): support youtu.be short URLs and update transcript API …#1714
feat(youtube): support youtu.be short URLs and update transcript API …#1714FreyaFujo wants to merge 2 commits intomicrosoft:mainfrom
Conversation
…usage Fixes microsoft#1704. Migrated from deprecated YouTubeTranscriptApi.list and .fetch to instance-based .list_transcripts and .find_transcript().fetch(). Added support for short youtu.be URLs by parsing video ID from path.
|
@FreyaFujo please read the following Contributor License Agreement(CLA). If you agree with the CLA, please reply with the following information.
Contributor License AgreementContribution License AgreementThis Contribution License Agreement (“Agreement”) is agreed to by the party signing below (“You”),
|
There was a problem hiding this comment.
Pull request overview
This PR updates the YouTube converter to better handle real-world YouTube URLs and to migrate transcript fetching to the newer youtube-transcript-api instance-based API, addressing #1704.
Changes:
- Accept
https://youtu.be/<id>short URLs in the YouTube converter. - Extract video IDs from the URL path for
youtu.beURLs (and fromv=for watch URLs). - Migrate transcript fetching from deprecated API usage to
list_transcripts()+find_transcript(...).fetch()with retry.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
packages/markitdown/src/markitdown/converters/_youtube_converter.py
Outdated
Show resolved
Hide resolved
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 1 out of 1 changed files in this pull request and generated 4 comments.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| if not ( | ||
| url.startswith("https://www.youtube.com/watch?") | ||
| or url.startswith("https://youtu.be/") | ||
| ): | ||
| # Not a YouTube URL |
There was a problem hiding this comment.
accepts() still relies on hard-coded startswith("https://...") checks. This will reject many valid YouTube URLs (e.g., http://..., https://youtube.com/watch?... without www, https://m.youtube.com/watch?..., etc.), and it also reintroduces the same limitation noted in the prior review thread. Consider using urlparse(url) and validating scheme in {"http","https"}, hostname in an allowlist (e.g., {"youtube.com","www.youtube.com","m.youtube.com","youtu.be"}), and (optionally) path matching instead of string prefixes.
| if not ( | |
| url.startswith("https://www.youtube.com/watch?") | |
| or url.startswith("https://youtu.be/") | |
| ): | |
| # Not a YouTube URL | |
| parsed_url = urlparse(url) | |
| scheme = (parsed_url.scheme or "").lower() | |
| hostname = (parsed_url.hostname or "").lower() | |
| path = parsed_url.path or "" | |
| is_youtube_url = False | |
| if scheme in {"http", "https"}: | |
| if hostname in {"youtube.com", "www.youtube.com", "m.youtube.com"}: | |
| is_youtube_url = path == "/watch" | |
| elif hostname == "youtu.be": | |
| is_youtube_url = path not in {"", "/"} | |
| if not is_youtube_url: | |
| # Not a supported YouTube URL |
| parsed_url = urlparse(stream_info.url) # type: ignore | ||
| params = parse_qs(parsed_url.query) # type: ignore | ||
| if "v" in params and params["v"][0]: | ||
| video_id = str(params["v"][0]) | ||
| transcript_list = ytt_api.list(video_id) | ||
|
|
||
| video_id = "" | ||
| if parsed_url.hostname == "youtu.be": | ||
| video_id = parsed_url.path.lstrip("/") | ||
| else: | ||
| params = parse_qs(parsed_url.query) # type: ignore | ||
| if "v" in params and params["v"][0]: | ||
| video_id = str(params["v"][0]) |
There was a problem hiding this comment.
accepts() normalizes/decodes the URL (unquote, replacing \\? and \\=), but convert() parses stream_info.url without applying the same normalization. This can cause urlparse() / parse_qs() to miss the v parameter (or mis-parse the URL) for the same escaped URLs that accepts() was designed to handle. Normalize the URL in convert() the same way before extracting video_id.
| if transcript: | ||
| transcript_text = " ".join( | ||
| [part.text for part in transcript] | ||
| ) # type: ignore | ||
| # Version 1.x of youtube-transcript-api returns a list of FetchedTranscript objects | ||
| # that use attribute access (snip.text) rather than dictionary access (part["text"]). | ||
| # To support both, we check if the first element is a dictionary. | ||
| if isinstance(transcript[0], dict): | ||
| transcript_text = " ".join( | ||
| [part["text"] for part in transcript] | ||
| ) | ||
| else: | ||
| transcript_text = " ".join([part.text for part in transcript]) |
There was a problem hiding this comment.
The transcript handling assumes transcript is indexable (transcript[0]) to detect dict-vs-attribute access. With youtube-transcript-api 1.x, .fetch() commonly returns an iterable FetchedTranscript object; if it isn't subscriptable (or if it’s an iterator), transcript[0] will raise and prevent any transcript output. Prefer iterating once (e.g., peek via next(iter(transcript), None) or build the string by trying attribute access with a fallback to mapping access) without indexing.
| if transcript: | ||
| if isinstance(transcript[0], dict): | ||
| transcript_text = " ".join( | ||
| [part["text"] for part in transcript] | ||
| ) | ||
| else: | ||
| transcript_text = " ".join([part.text for part in transcript]) |
There was a problem hiding this comment.
Same issue in the translation fallback: isinstance(transcript[0], dict) will fail if the fetched transcript object isn’t subscriptable. Use the same non-indexing approach here as in the main fetch path so the fallback doesn’t raise while constructing transcript_text.
…usage
Fixes #1704. Migrated from deprecated YouTubeTranscriptApi.list and .fetch to instance-based .list_transcripts and .find_transcript().fetch(). Added support for short youtu.be URLs by parsing video ID from path.