UX Research Studies for Microsoft AI's Copilot App
My work at Microsoft spans a range of research methods and product challenges, from early discovery to post-launch evaluation. These two case studies stand out as examples where research most directly influenced the direction of Copilot: one grounded the team in users' preferred interaction pattern, and the other explained why Copilot's AI responses were falling short of a competitor's so the team could improve response quality.
Click a project title below to go to its corresponding section.
Study A: Microsoft Copilot Vision — Writing Workflow Research
Average Length of Projects: 2.5 weeks
Collaborators:
- Principal UX Researcher
- UX Researcher (Me)
- Copilot Vision product squad (including Product Managers, UX Designers, Engineers)
- Adjacent Copilot product squads in the same workstream (including Product Managers, UX Designers, Engineers)
My role: Lead researcher on this study
Utilized skills:
- Research design
- Usability testing (combining generative and evaluative interview questions)
- Qualitative data analysis
- Report writing and share-out
- Stakeholder collaboration (with Product Managers, UX Designers, Engineers)
Project Snapshot:
The Copilot Vision team was shipping new writing features fast but lacked user signal to prioritize the core workflow. I led a usability test combining generative and evaluative questions that surfaced a third interaction pattern users preferred over both options the PM was considering. That pattern was adopted into the 2026 roadmap and reshaped the workstream's workflow design direction.
Due to NDA constraints, specific product details are not disclosed. This case study focuses on research approach, high-level findings and impact.
Problem Statement:
The PM needed a confident roadmap call on two competing interaction flows (User Flow A or User Flow B), early signal on the voice dictation feature, and a clear read on usability friction in the prototype. The research had to answer three questions:
- Which interaction flow do users prefer, and why?
- Do users prefer voice dictation or "voice-in, text-out" for Vision writing tasks?
- What usability friction and unmet expectations exist in the current prototype?
My Role:
I owned this study end-to-end, from recruitment and protocol design through interview moderation and data analysis. I collaborated with the Principal UX Researcher on the recruitment and protocol design, then independently conducted the user interviews, analyzed the data, and presented findings directly to the PM, designers, and engineers.
Process:
- Recruitment: The team initially wanted general Copilot users. I recommended recruiting half the participants as active AI writing users, since a general pool would have introduced noise unrelated to the features being tested. The team adopted this approach.
- Study Design: I co-designed a hybrid protocol pairing evaluative methods (to gauge usability issues) with generative methods (to uncover unmet expectations beyond the current feature set). Running both in one study meant I answered the PM's tactical question ("which flow?") and her strategic one ("what do users actually want?") without needing two separate studies.
- Data Collection: I conducted remote 1:1 sessions via UserTesting.com, observing task completion in real time and probing for reasoning and unmet expectations across both flows.
- Data Analysis: I independently analyzed findings across sessions, identified recurring friction patterns, synthesized insights, and provided design recommendations for the PM, designers, and engineers.
Key Insights:
- Neither flow was right, and users pointed to a third flow they preferred. Across sessions, participants gravitated toward an interaction pattern neither flow had accounted for. The signal was consistent enough to bring this pattern forward as User Flow C, a concrete alternative.
- Friction was concentrated, not scattered. Usability issues clustered around specific interaction points, and I provided actionable recommendations that gave designers a prioritized set of fixes they could quickly adopt.
- Users' preferences for voice dictation vs. voice-in, text-out were contextual and depended on their individual use cases. This gave the product team a good signal for when each feature would be preferred.
Impact:
- Product Roadmap: User Flow C was adopted into the 2026 feature roadmap and triggered a workstream-wide shift in design direction before further development investment was made. The research share-out influenced not only the Copilot Vision squad but also adjacent squads in the same workstream.
- UX Design: Critical design recommendations were adopted by the UX designers to improve usability.
- Research Design: Adding participants with experience using GenAI for writing tasks improved the signal quality for the specialized use case and was adopted as the study's documented methodology.
Reflection:
Because I conducted this study at the beginning of my onboarding to the team, I spent extra time finalizing my report to fit the report style of the MAI team. As I become more familiar with the team's standards, I expect to speed up my report-writing process and deliver insights to stakeholders faster.
Study B: UX Eval Study to Improve Microsoft Copilot's AI Response Quality
Project Length: 4 weeks
Collaborators:
- Principal UX Research Manager
- UX Researcher (Quantitative)
- UX Researcher (Qualitative)
- Copilot shopping product workstream (including Product Managers, UX Designers, Engineers)
My role: Lead qualitative researcher
Utilized skills:
- AI-moderated interview platform Outset.ai
- Qualitative data analysis on video transcripts and recordings
- Report writing and share-out
- Stakeholder collaboration (with Product Managers, UX Designers, Engineers)
Problem Statement:
Quantitative benchmarks confirmed that Copilot's AI shopping responses were consistently losing to a leading competitor — but the numbers couldn't explain why. The product team knew users preferred competitor outputs in certain scenarios, but without qualitative depth, improving Copilot's response quality would have meant guessing.
The core research question: What specific qualities in competitor responses were driving user preference, and what would it take for Copilot to close that gap?
Due to NDA constraints, specific product details are not disclosed. This case study focuses on research approach, high-level findings and impact.
My Role:
I owned the qualitative analysis component of this study end-to-end, partnering with a quantitative UX Researcher under the Principal UX Research Manager. While my partner led the statistical benchmarking, I was solely responsible for the qualitative layer: designing the analysis approach, coding and analyzing 200+ AI-moderated interview recordings, synthesizing themes, and producing the final brief delivered to product and design stakeholders.
Process:
- Method choice: A study at this scale made traditional moderated interviews impractical. We chose AI-moderated interviews via Outset.ai to get users' video-format responses at scale and on a timeline that moderated sessions could not have matched. The format preserved qualitative richness (e.g., users' hesitation, verbal reasoning, emotional reaction) while giving us a sample large enough to surface patterns with statistical confidence.
- Data collection: Participants compared Copilot's AI responses to the competitor's across identical shopping prompts. The recordings captured not just stated preference but the reasoning behind it, including where users hesitated, what they noticed, and where Copilot visibly failed to meet their expectations.
- Qualitative Data Analysis: I analyzed transcripts and recordings from all 200+ participants using reflexive thematic analysis, coding for response patterns and user reactions across both products. Themes were organized into dimensions of AI response quality, then layered on top of the quantitative researcher's statistical data to produce a unified picture of Copilot's performance gaps.
Key Insights:
- Responses were too long and hard to scan: Users struggled to extract what they needed from Copilot's long outputs. Where the competitor surfaced key product recommendations quickly, Copilot buried them in prose.
- Copilot didn't retain users' specific criteria: Participants frequently noted that Copilot's responses felt generic and didn't reflect the constraints or priorities the user had stated in their prompt. This made responses feel less useful.
- Pricing information was sometimes inaccurate: Several users caught pricing discrepancies in Copilot's outputs, which immediately undermined confidence in the response as a whole.
Impact:
- My qualitative findings gave stakeholders a targeted, evidence-backed explanation for Copilot's underperformance. Paired with the quantitative researcher's statistical analysis, they showed both the what and the why.
- The qualitative findings directly contributed to response quality and accuracy improvements shipped in Q1 2026.
Reflection:
- This project reinforced how much the right method choice shapes what you can learn, and how fast. Using AI-moderated interviews at this scale wasn't a compromise, but a strategic decision that unlocked large-scale qualitative insight that wouldn't otherwise have existed within the project timeline.
- If I were to run this study again, I'd build in a smaller follow-up round of traditional human-moderated sessions with a targeted subset of participants, probing the most surprising findings more deeply to understand the nuanced friction points.



