
(phonic): support realtimemodel say() #5293

Open
tinalenguyen wants to merge 4 commits into main from
tina/rt-session-say

Conversation

@tinalenguyen
Member

No description provided.

@chenghao-mou chenghao-mou requested a review from a team April 1, 2026 04:43
devin-ai-integration[bot]

This comment was marked as resolved.

@qionghuang6
Contributor

Just played around with this. Seems to be working!

Contributor

@devin-ai-integration devin-ai-integration bot left a comment


Devin Review found 1 new potential issue.

View 8 additional findings in Devin Review.


name="AgentActivity.tts_say",
)

task.add_done_callback(self._on_pipeline_reply_done)
Contributor


🟡 _on_pipeline_reply_done callback causes duplicate state transitions for realtime say path

In _generate_reply() (agent_activity.py:1054-1066), the RealtimeModel path intentionally does NOT add the _on_pipeline_reply_done callback, because _realtime_generation_task_impl already handles state transitions internally (setting the agent state to "listening", calling on_end_of_agent_speech, and _restore_interruption_by_audio_activity at agent_activity.py:2962-2972). However, in the new say() method, task.add_done_callback(self._on_pipeline_reply_done) at line 988 is applied unconditionally to both the realtime say path and the TTS path. When the realtime say path is taken, _realtime_say_task calls _realtime_generation_task_impl, which performs the state transitions. Then, when the task completes, _on_pipeline_reply_done (agent_activity.py:1938-1946) fires and calls on_end_of_agent_speech(ignore_user_transcript_until=time.time()) a second time with a later timestamp, briefly extending the window during which user transcripts are suppressed.

Suggested change

    -task.add_done_callback(self._on_pipeline_reply_done)
    +if self._rt_session is None or is_given(audio) or self.tts:
    +    task.add_done_callback(self._on_pipeline_reply_done)


self,
text: str | AsyncIterable[str],
*,
allow_interruptions: NotGivenOr[bool] = NOT_GIVEN,
Contributor


should we really expose allow_interruptions in this api? we don't have it in generate_reply, and we disallow allow_interruptions=False for realtime sessions with server-side VAD.

even if the api supported it with server-side VAD, how would the caller know the audio playout is finished in the agent so they can re-allow interruptions?

):
model_info = (
"a RealtimeSession that implements say()"
if isinstance(self.llm, llm.RealtimeModel)
Contributor


llm cannot be a RealtimeModel when rt_session is None? so maybe you need to add a capability flag to the RealtimeModel for say.
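A minimal sketch of what such a capability check could look like. Note that the supports_say field, the can_use_realtime_say helper, and FakeRealtimeModel are all hypothetical names invented for illustration, not part of the existing API:

```python
from dataclasses import dataclass


@dataclass
class RealtimeCapabilities:
    """Hypothetical capabilities record; supports_say is an invented flag."""

    supports_say: bool = False


def can_use_realtime_say(model: object) -> bool:
    # Take the realtime path only when the model advertises say() support;
    # otherwise the session would fall back to the TTS pipeline.
    caps = getattr(model, "capabilities", None)
    return caps is not None and getattr(caps, "supports_say", False)


class FakeRealtimeModel:
    """Stand-in model used only to demonstrate the check."""

    capabilities = RealtimeCapabilities(supports_say=True)
```

With a flag like this, say() could raise a clear error up front instead of relying on whether rt_session happens to be set.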

self._session._tool_items_added(tool_messages)

@utils.log_exceptions(logger=logger)
async def _realtime_say_task(
Contributor


question: should we reuse _realtime_reply_task?

    async def _realtime_reply_task(
        self,
        *,
        speech_handle: SpeechHandle,
        model_settings: ModelSettings,
        user_input: str | None = None,
        instructions: str | None = None,
        text: str | AsyncIterable[str] | None = None,
    ) -> None:

and we check that only one of user_input and text can be set.

return
except llm.RealtimeError as e:
logger.error("failed to say text: %s", str(e))
self._session._update_agent_state("listening")
Contributor


this is not needed?

@theomonnom
Member

theomonnom commented Apr 2, 2026

I'm not sure about this, isn't Phonic just using a TTS underneath? This seems like a very specialized method for them?

