Transient intent addition by AdaRoseCannon · Pull Request #1343 · immersive-web/webxr · GitHub
Transient intent addition #1343

Merged
merged 7 commits on Oct 5, 2023

Conversation

AdaRoseCannon
Member

@AdaRoseCannon commented Aug 31, 2023

This PR adds a new enum value for inputs which don't represent any actual physical device but instead represent a user's intent, derived by the operating system from other sources.

The enum "transient-intent" is certainly up for debate.


Preview | Diff

@AdaRoseCannon
Member Author

It's a transient input because, like screen taps, it's not present all the time.

The targetRaySpace represents the ray of the user's intent, for example where the user was looking at the start of the interaction. This shouldn't change over time.

The gripSpace is for associated gestures, if applicable, and can be used for manipulating the intended target.
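
A minimal sketch of how a page might consume these two spaces each frame, assuming a reference space `xrRefSpace` obtained elsewhere and using the `'transient-intent'` name as proposed at this point in the thread (illustrative only, not part of the PR):

```js
// Minimal sketch, not part of the PR: per-frame handling of the proposed
// input type. `xrRefSpace` is a reference space assumed to be obtained
// elsewhere, and 'transient-intent' is the name as currently proposed.
function onXRFrame(time, frame) {
  const session = frame.session;
  session.requestAnimationFrame(onXRFrame);

  for (const inputSource of session.inputSources) {
    if (inputSource.targetRayMode !== 'transient-intent') continue;

    // The target ray captures the user's intent at the start of the
    // interaction (e.g. where they were looking) and stays fixed over time.
    const rayPose = frame.getPose(inputSource.targetRaySpace, xrRefSpace);

    // The grip space, when present, follows the associated gesture
    // (e.g. a pinch) and can drive manipulation of the intended target.
    const gripPose = inputSource.gripSpace
        ? frame.getPose(inputSource.gripSpace, xrRefSpace)
        : null;

    if (rayPose) {
      // Ray-cast the scene along rayPose.transform to find the target.
    }
    if (gripPose) {
      // Apply the grip's relative motion to the selected object.
    }
  }
}
```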

@AdaRoseCannon
Member Author

/tpac input discussion

@probot-label bot added the TPAC label Sep 1, 2023
@toji
Member

toji commented Sep 1, 2023

Thank you, Ada, for writing this up! I have a couple of comments but am generally good with the direction of this proposal. Definitely worth discussion at TPAC!

  • The name 'transient-intent' probably needs a bit of bikeshedding. It's not bad, but I wonder if we could make the use cases where it would apply clearer. I kind of wish that we could just use the existing 'gaze' and just make it transient, since it would very clearly indicate what is likely to be the most prominent use case. But that might cause some undesired behavior for apps built against existing 'gaze' == head tracking uses. We should maybe do some research to figure out the impact of that. Also, that phrasing tends to exclude assistive devices.
    • Then again, maybe having an input modality that suggests a more common input but which can be emulated with assistive devices is preferred, because it would help obscure the fact that the user may be relying on assistive tech? I guess it's plausible that any of the existing input sources could also be emulated with assistive tech, so that wouldn't necessarily be anything new.
  • I would like some assurances in the spec text that any transient input source will have some way of allowing the user to know what they are interacting with. For example: Screen input either uses a visible cursor or facilitates direct touch. While it's probably not appropriate for the spec to prescribe an exact methodology for doing so, I think it IS reasonable for the spec to include some normative language stating that the user should be given some mechanism to understand where the select events will fire. It may also be useful to include non-normative text giving an example of one such mechanism, such as perhaps a cursor rendered by the OS but which is not observable to the session.

@toji
Member

toji commented Sep 1, 2023

The gripSpace is for associated gestures if that is applicable and can be used for manipulating the intended target.

Can we get some clarification on this? My reading is that the target ray would (in the presumed case of eye-tracking-based selection) follow the user's line of sight at the time of selection but stay static during the rest of the selection event chain (begin->end). In a system that uses a hand gesture to initiate the select event, though, the grip space would originate in the hand that did the gesture, and follow it?

That would allow developers to handle some fairly complex interactions by measuring relative movement of the grip space, but it would also fail for a number of existing drag-and-drop mechanisms (like the speakers in the Positional Audio sample), which generally ignore the grip space and instead track only the target ray. While I think that enabling the more expressive input is great, it would be good to have a fallback that more closely emulates existing behavior to avoid breaking existing content.

Perhaps a target ray that is initialized based on eye tracking but then for the duration of the select event becomes head locked. That way the user could at least do coarse manipulation of existing drag/drop interactions, like an incredibly expensive Google Cardboard. :) Hard to state that in a way that doesn't become overly device-specific, but I think it's worth addressing.
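
To illustrate the contrast (a sketch, not code from the PR or the samples; `xrRefSpace`, `selectedObject`, and the event wiring are assumptions):

```js
// Hypothetical drag handling. `xrRefSpace` and `selectedObject` (an app
// object with a plain {x, y, z} position) are assumptions.
let dragGripStart = null;
let dragObjectStart = null;

function onSelectStart(event) {
  const frame = event.frame;
  const src = event.inputSource;
  const gripPose = src.gripSpace && frame.getPose(src.gripSpace, xrRefSpace);
  if (gripPose) {
    // Grip-space drag: remember where the gesture (e.g. the pinch) started.
    dragGripStart = gripPose.transform.position;
    dragObjectStart = { ...selectedObject.position };
  }
}

function updateDrag(frame, src) {
  const gripPose = src.gripSpace && frame.getPose(src.gripSpace, xrRefSpace);
  if (gripPose && dragGripStart) {
    // Move the object by the hand's motion since selectstart.
    const p = gripPose.transform.position;
    selectedObject.position.x = dragObjectStart.x + (p.x - dragGripStart.x);
    selectedObject.position.y = dragObjectStart.y + (p.y - dragGripStart.y);
    selectedObject.position.z = dragObjectStart.z + (p.z - dragGripStart.z);
  } else {
    // Existing target-ray-only pattern: re-project the object along the ray
    // each frame. If the transient target ray never moves after selectstart,
    // this style of drag stops working, which is the concern above.
    const rayPose = frame.getPose(src.targetRaySpace, xrRefSpace);
    // ...place selectedObject along rayPose.transform at its grab distance...
  }
}
```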

@AdaRoseCannon
Member Author

The name 'transient-intent' probably needs a bit of bikeshedding. It's not bad, but I wonder if we could make the use cases where it would apply clearer. I kind of wish that we could just use the existing 'gaze' and just make it transient, since it would very clearly indicate what is likely to be the most prominent use case.

Yes, I think each time I drafted this proposal I picked a new name for the enum; initially I focused on ones related to gaze, since that was the primary focus for this particular use case. But the more we worked on the proposal, the more this input type seemed like a better fit for any generic input that can’t/shouldn’t give continuous input. As well as the inputs from assistive technology mentioned in the spec, it would be a good fit for Brain Control Interfaces, or even standard tracked inputs that want to run in a mode that further reduces fingerprinting based on ambient user motion.

Then again, maybe having an input modality that suggests a more common input but which can be emulated with assistive devices is preferred, because it would help obscure the fact that the user may be relying on assistive tech? I guess it's plausible that any of the existing input sources could also be emulated with assistive tech, so that wouldn't necessarily be anything new.

My honest hope here is that we aren’t the only platform that uses it as their default input mechanism, because it would be a really good fit for exposing a simpler model for assistive input technology than full emulation of tracked-pointers or hands. It shouldn't be used on its own just for a11y, though, otherwise it would reveal that assistive technology is being used.

I would like some assurances in the spec text that any transient input source will have some way of allowing the user to know what they are interacting with. For example: Screen input either uses a visible cursor or facilitates direct touch. While it's probably not appropriate for the spec to prescribe an exact methodology for doing so, I think it IS reasonable for the spec to include some normative language stating that the user should be given some mechanism to understand where the select events will fire. It may also be useful to include non-normative text giving an example of one such mechanism, such as perhaps a cursor rendered by the OS but which is not observable to the session.

This is a great point that needs further discussion, because it can potentially compromise user privacy. Apple’s implementation is privacy-preserving in that websites and apps do not know what the user is looking at until an interaction is started. This includes Safari itself; as a user’s gaze moves around the page, the system highlights what the user is looking at with a glow effect (“gaze-glow”), but this information is not available to the page. Apps can declare interactive regions, and the OS itself provides the visuals back to the user showing which interactive regions are currently being gazed at.

I’ve been thinking that a gaze-glow like thing could work by utilising the layers API to add a new layer type so that developers could declare their interactive regions which the OS can then provide the highlights for, but that would need to be in a different repo and not in this PR.

From my experimentation on the Vision Pro with a prototype implementation of this proposal, no cursor has been required to perform interactions: you pinch and the thing you are looking at gets interacted with. Even small hit targets, such as chess pieces on a regulation-size chess board viewed from a standing or sitting position, are sufficiently large that selecting pieces doesn’t present any issues.

The developer could provide a cursor themselves when the interaction starts, to show what was initially selected, and let the user release the pinch if the wrong thing was targeted or they change their mind about the interaction.

In my more advanced demos I implement cursor re-targeting as a developer by calculating the point which is hit in my scene, then modifying that point by the inverse of the starting pose and applying the current pose each frame, so that when the “selectend” event is fired, if a ray through the new target location no longer connects, the event can be ignored. This can also be used for doing my own hover effects in the client.
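
Roughly sketched, assuming the moving pose is the gripSpace (one plausible reading); `hitTestScene` stands in for an app-level ray cast and `xrRefSpace` for a reference space obtained elsewhere (this is illustrative, not the demo code):

```js
// Speculative illustration of the retargeting described above, not code from
// the demos. DOMMatrix is used only to compose the WebXR rigid transforms.
let startGripInverse = null;
let hitPoint = null;        // initial intersection of the target ray with the scene
let retargetedPoint = null; // that point carried along with the hand's motion

function onSelectStart(event) {
  const frame = event.frame;
  const src = event.inputSource;
  const rayPose = frame.getPose(src.targetRaySpace, xrRefSpace);
  const gripPose = src.gripSpace && frame.getPose(src.gripSpace, xrRefSpace);
  if (!rayPose || !gripPose) return;

  hitPoint = hitTestScene(rayPose); // hypothetical app-level ray cast
  startGripInverse = DOMMatrix.fromFloat32Array(gripPose.transform.inverse.matrix);
}

function updateRetarget(frame, src) {
  const gripPose = hitPoint && src.gripSpace &&
      frame.getPose(src.gripSpace, xrRefSpace);
  if (!gripPose) return;
  // current grip pose * inverse(starting grip pose) * original hit point
  retargetedPoint = DOMMatrix.fromFloat32Array(gripPose.transform.matrix)
      .multiply(startGripInverse)
      .transformPoint(new DOMPoint(hitPoint.x, hitPoint.y, hitPoint.z, 1));
}

function onSelectEnd(event) {
  // If a ray through retargetedPoint no longer hits the originally selected
  // object, treat the gesture as cancelled and ignore the select.
}
```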

Can we get some clarification on this? My reading is that the target ray would (in the presumed case of eye-tracking-based selection) follow the user's line of sight at the time of selection but stay static during the rest of the selection event chain (begin→end). In a system that uses a hand gesture to initiate the select event, though, the grip space would originate in the hand that did the gesture, and follow it?

Yeah that’s correct. For example if the gesture was initiated by a pinch, the grip space would be set to the point the fingers connect.

That would allow developers to handle some fairly complex interactions by measuring relative movement of the grip space, but it would also fail for a number of existing drag-and-drop mechanisms (like the speakers in the Positional Audio sample), which generally ignore the grip space and instead track only the target ray. While I think that enabling the more expressive input is great, it would be good to have a fallback that more closely emulates existing behavior to avoid breaking existing content.

The tricky thing is that what the pose really wants to modify for the best experience is the point where the selection ray intersects the scene geometry, so we can’t do smart things like this without using a depth buffer (ideally a depth buffer of just interactable objects), whereas right now it works without one.

Perhaps a target ray that is initialized based on eye tracking but then for the duration of the select event becomes head locked. That way the user could at least do coarse manipulation of existing drag/drop interactions, like an incredibly expensive Google Cardboard. :) Hard to state that in a way that doesn't become overly device-specific, but I think it's worth addressing.

I think I would definitely prefer something based on the gripSpace if it is available, with viewerSpace as a fallback; locking it to the head feels a little weird when the input can naturally be modulated by the attached hand pose. Perhaps, if gripSpace is available, modulating the direction of the targetRaySpace based on a point that sits one meter out from the user along the targetRaySpace would be acceptable.
Although it would be a significantly better experience if they used the actual hit point for this, at least it would still “work”. I think I would want to prototype this and try it out before I fully committed to it though.
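
Sketched speculatively (this is not something the PR specifies; `xrRefSpace` is an assumed reference space): carry a point one meter out along the initial target ray with the grip's motion, then aim the ray from its original origin through the moved point.

```js
// Speculative only. In WebXR the target ray points down -Z in its own space,
// which is used below to find the "one meter out" anchor point.
let anchor = null;          // point 1 m along the initial target ray
let rayOrigin = null;
let startGripInverse = null;

function captureAtSelectStart(frame, src) {
  const rayPose = frame.getPose(src.targetRaySpace, xrRefSpace);
  const gripPose = src.gripSpace && frame.getPose(src.gripSpace, xrRefSpace);
  if (!rayPose || !gripPose) return;
  rayOrigin = rayPose.transform.position;
  anchor = DOMMatrix.fromFloat32Array(rayPose.transform.matrix)
      .transformPoint(new DOMPoint(0, 0, -1, 1));
  startGripInverse = DOMMatrix.fromFloat32Array(gripPose.transform.inverse.matrix);
}

function adjustedTargetRayDirection(frame, src) {
  const gripPose = anchor && src.gripSpace &&
      frame.getPose(src.gripSpace, xrRefSpace);
  if (!gripPose) return null;
  const moved = DOMMatrix.fromFloat32Array(gripPose.transform.matrix)
      .multiply(startGripInverse)
      .transformPoint(anchor);
  // Direction from the unchanged ray origin through the transported anchor.
  return { x: moved.x - rayOrigin.x, y: moved.y - rayOrigin.y, z: moved.z - rayOrigin.z };
}
```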

@toji
Member

toji commented Sep 7, 2023

Apple’s implementation is privacy-preserving in that websites and apps do not know what the user is looking at until an interaction is started. This includes Safari itself;

I guess I'm not surprised that Safari is treated like any other app in this regard, but I hadn't considered that limitation previously.

I’ve been thinking that a gaze-glow like thing could work by utilising the layers API to add a new layer type so that developers could declare their interactive regions which the OS can then provide the highlights for, but that would need to be in a different repo and not in this PR.

For both this and accessibility reasons, yeah. It would be great to explore some sort of object segmentation system! (Agreed that it's a separate PR, though. Maybe a new TPAC subject?)

I'd be interested in learning more about how native apps communicate gaze-glow regions to the OS. I'm only familiar with the new CSS APIs that Safari is adding to facilitate it. It would be great if we could figure out a "best effort" method that gave some basic, but possibly imprecise, feedback to the user by default that the developer could then improve with more explicit input.

Even small hit targets, such as chess pieces on a regulation-size chess board viewed from a standing or sitting position, are sufficiently large that selecting pieces doesn’t present any issues.

This is good to hear! I'm a little wary of looking at one platform's high-quality implementation, though, and generalizing to "no user feedback is necessary". Again, sounds like we need further discussion.

The tricky thing is that what the pose really wants to modify for the best experience is the point where the selection ray intersects the scene geometry

Completely agree that the best experience is one that takes the platform norms into account. In this case I'm really just thinking about how to best facilitate apps that were built well before this mode of interaction was devised and ensuring they have some level of functionality, even if it's not ideal. "Awkward" is better than "broken".

I think I would definitely prefer something based on the gripSpace [...] I think I would want to prototype this and try it out before I fully committed to it though.

I'd love to hear your conclusions from any prototyping that you do in this area! It's going to be tricky, but I'm optimistic that there's a way to facilitate both the ideal use cases and fallbacks.

I also want to make sure that we don't accidentally encourage users to design apps that ONLY work for a certain type of input, whether that be gaze, hand, or controller. I think our "select" system still does an admirable job at that, but like with the deprecation discussion it's probably worth taking another look and making sure our initial assumptions hold up.

@AdaRoseCannon
Member Author

For both this and accessibility reasons, yeah. It would be great to explore some sort of object segmentation system! (Agreed that it's a separate PR, though. Maybe a new TPAC subject?)

I don't think I am prepared to talk about it yet

This is good to hear! I'm a little wary of looking at one platform's high-quality implementation, though, and generalizing to "no user feedback is necessary". Again, sounds like we need further discussion.

An interface which wasn't so precise could implement a cursor itself; I am just saying that for the Vision Pro use case I have found it unnecessary, though I think any cursor system would require the WebXR scene to provide a sensible depth map to show the object it's hitting.

I'd love to hear your conclusions from any prototyping that you do in this area! It's going to be tricky, but I'm optimistic that there's a way to facilitate both the ideal use cases and fallbacks.

I definitely want it to be good enough that it actually works well. The worst-case scenario is that, because it works well enough, developers rely on the updated pose of the targetRaySpace, even though it's just a best guess, when attaching the object to the gripSpace would give a really good experience.

@cabanier
Member

It's a bit odd to add a transient input source. Is the intent that this input source is constantly reported, or only right after (and only once?) a system gesture?
This won't work well with existing WebXR experiences because they rarely check the input source.

I think that this use case would be better solved with an event. Maybe it could even report the space that the user was focusing on at the moment the system gesture happened.

@AdaRoseCannon
Member Author

It is reported for the whole duration of the interaction, so for potentially many seconds for a more complex gesture. From my tests it seems to work well in existing frameworks.

@AdaRoseCannon
Member Author

TPAC Feedback:

  • Investigate using "gaze" as the enum, should be fine to reuse it
  • Maybe don't do any adjustment of the targetRaySpace based on gripSpace or viewerSpace
  • Investigate if we can get the depth of the gaze target and expose it somehow

@cabanier
Member

Thanks for this proposal @AdaRoseCannon!
After TPAC I have a better understanding of what you're trying to accomplish. Do you have a very basic example? It could be based on one in the WebXR Samples repo. You could put it under the "proposals" tab.

@AdaRoseCannon
Member Author

@cabanier my hope is that developers don't need to make any changes to support this specially. So I am not quite sure what I would put in a sample.

@AdaRoseCannon
Member Author

Here are some of the proposed enum names; if you have more, please mention them. It would be good to settle on an appropriate name:

  • transient-intent (current)
  • untracked-pointer
  • transient-pointer
  • static-pointer
  • (reusing) gaze*

I want to push back a little against gaze. Although for the very specific use case of the Vision Pro this static targetRay is based on the user's gaze direction at the moment the interaction starts, my intention is that it can be used for any interaction which has a momentary intent.

@cabanier
Member

Here are some of the proposed enum names; if you have more, please mention them. It would be good to settle on an appropriate name:

I prefer transient-pointer

@cabanier
Member

@cabanier my hope is that developers don't need to make any changes to support this specially. So I am not quite sure what I would put in a sample.

I suspect that authors will want to treat this new input source differently, so there will be separate paths for each type.

toji and others added 5 commits September 29, 2023 16:55
The "poses may be reported" algorithm previously indicated that
poses could only be reported if the visibility state was "visible",
but that appears to be a mistake originating from the fact that
"visible-blurred" was added later. The description of
"visible-blurred" mentioned throttled head poses, so obviously the
intention was to allow some poses in that state, just not input
poses.

This change alters the "poses may be reported" algorithm to only
suppress poses when the visibility state is "hidden", and adds an
explicit step to the "populate the pose" algorithm that prevents
input poses from being returned if the visibility state is
"visible-blurred"
@AdaRoseCannon
Member Author

AdaRoseCannon commented Sep 29, 2023

I changed the PR itself to say that

The pose for [the targetRaySpace] should be static within the gripSpace for this XRInput.

and that the gripSpace, if it's not something that is otherwise defined,

should be another space the user controls, such as the viewerSpace or the gripSpace or the targetRaySpace of another XRInput. This is to allow the user to still manipulate the targetRaySpace.

@toji
Member

toji commented Sep 29, 2023

Apologies, Ada. I had an action item to post alternate names and dropped it on the floor. The list you put up covered my suggestions, though. Thanks!

And after thinking about it some more I agree with Rik that transient-pointer is the one from that list that feels best, given the expected interaction mode.

@cabanier
Member

cabanier commented Oct 2, 2023

@Manishearth we likely want to wait until @AdaRoseCannon makes a PR with the renamed targetRayMode

@AdaRoseCannon
Member Author

Yeah, I will talk about the changes with the group and we’ll pick a name. Then it’s all good hopefully.

@AdaRoseCannon
Member Author

The feel of the room is for "transient-pointer"

@AdaRoseCannon removed the TPAC label Oct 3, 2023
@cabanier self-requested a review October 3, 2023 22:26
@AdaRoseCannon
Member Author

I'm happy for this to be merged if everyone else is :)

@AdaRoseCannon
Member Author

I'm going to start working on a sample explicitly highlighting that, for this input, things that should be attached to the hand should use gripSpace, which is the main unexpected side-effect of this particular situation.

@cabanier merged commit 47d14b6 into immersive-web:main Oct 5, 2023
1 check passed
github-actions bot added a commit that referenced this pull request Oct 5, 2023
SHA: 47d14b6
Reason: push, by cabanier

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
github-actions bot added a commit to cabanier/webxr that referenced this pull request Feb 8, 2024
SHA: 47d14b6
Reason: push, by cabanier

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>