A race to expedite LLM data collection
How enabling self-service workflow creation for data scientists reduced data collection design time from 24 days to 30 minutes
Role: Solo designer working with four PMTs and 21 SDEs.
0 -> 1 product development. End-to-end interaction + UI design.
Timeline: August 2023 to November 2024
MVP shipped in April 2024
Context: Imagine you are a data scientist working on a human-in-the-loop data collection to improve your ML model. You know collecting data can be tricky and time consuming, but you didn’t realize it would take this much time and effort to build the workflow. If only you could do this exactly how you envision it.
Impact
- 24 days (Q4 ’23) -> ~30 minutes (Nov ’24) for workflow creation turnaround time
- 260 collections launched for labeling between April and November 2024
Problem statement
In 2023, engineers manually designed and developed each data collection workflow UI and routing, leading to about 24 days of turnaround time for data labeling development and 51 days from request to initial-batch delivery. But data scientists needed the data faster than the team could provide it.
Primary users: “Customer” data scientists, language engineers, language PMs, and software dev engineers. This was a shift from the existing model, where the data collection team’s engineers designed and developed collection workflows on behalf of the customers.
User feedback from ’23
- Data collection takes too long because of custom development.
- There are too many manual touch points to launch a collection, even after the collection has been developed.
- Modifying the collection questions takes too long since it requires engineers to update the layout and question orchestration.
Key user problems
- How can I receive fast and high-quality data?
- How can I do things myself so I don't have to rely on the data collection tech team to build something from scratch for every collection?
- How can I reduce communication and churn for all touch points for data collection?
- How can I know what workflow UI the labelers will see?
- How can I have access to my own conventions document whose content I can freely modify?
- How can I modify the workflow without affecting ongoing collection?
Embedding the solution into the existing workflow launch process
Workflow launch has a complicated downstream process, including privacy audit, budget confirmation, labeler identification, and labeler training by the collection team. After approval, the user uploads batches of data to be labeled, which the labelers then receive.
Figure 1. Existing data collection launch process
For the scope of this project, the product team and I focused on the initial data collection design process and its subsequent modification flow.
Finding a direction
The organization already had three tools for collection creation: a tool that requires coding, a highly restrictive tool for handling customer data, and a tool created by and intended for engineers. These tools were too complex for an average user and lacked an established design pattern.
After rounds of brainstorming sessions, the PMTs and I agreed to develop a tool from scratch, designed and dedicated for customers, rather than internal team members, to create workflows.
How might we provide a flexible workflow creation experience that reduces the time to launch and modify a workflow?
Product tenets
- Labeler UI transparency. What you see on the tool is what labelers will see.
- Workflow design only. The tool does not touch upon data routing or quality management.
- A simple, consistent, no-code experience for the users. The product should be intuitive enough for the customer to build a workflow from scratch on their own.
Identifying its location in the product suite
The building tool resides inside a separate tool for ingesting and tracking data. This data gets routed to different channels, such as internal labelers, internal experts, crowdsourcing, and 3P vendors. As collection design is the most upstream process, it would be the starting point of all customer interaction.
Figure 2. Flow between this tool and subsequent products
Cultivating a linear creation experience
By interviewing process owners from different teams, I identified five key steps that this tool needed from a process perspective:
- Defining the collection UI
- Configuring quality
- Specifying the labeler requirements
- Adding guidelines
- Submitting a privacy ticket and submitting the workflow
Figure 3. Linear creation flow
We aligned on a linear flow from creation to submission, as each step depends on the previous one. For example, the available quality strategy options are determined by the output components added in Step 1, and conventions examples depend on the labeler skills specified in Step 3. This flow ensures all dependencies are clearly laid out before the user makes the next decision.
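The step-dependency rule above can be sketched in a few lines. All component and strategy names here are hypothetical illustrations, not the product's actual options:

```python
# Hypothetical sketch: later steps draw their available options from
# choices made in earlier steps, which is why the flow must be linear.
def quality_options(output_components):
    """Quality strategies offered in Step 2 depend on Step 1's output components."""
    options = {"spot-check"}  # assumed baseline option, for illustration only
    if "single_select" in output_components:
        options.add("gold-standard comparison")
    if "free_text" in output_components:
        options.add("peer review")
    return options

print(quality_options(["single_select", "audio_player"]))
```

Because the dependency only flows forward, the tool can always render a step's choices from state the user has already committed.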
Specifying the exact flow
Further research identified the specific requirements of each page in the flow.
Figure 4. End to end flow of Workflow Builder as a tool
End-to-end collection creation process
Problem: Customers want to reduce manual touchpoints as much as possible and either be fully in charge of their collection or have someone else build it completely.
The tool needed to be guided and simple enough that a scientist could come in and build their own data collection workflow without contacting the team. Each page on the tool had a clear end-user goal.
Figure 5. Template options
The UI builder allows flexible collection building by adding input content, which are slots for multimedia content ingested via a JSON file, and output questions, which are the questions displayed to labelers.
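A minimal sketch of how such a UI definition and a JSON batch record might fit together. The field names, slot types, and file path below are all hypothetical, not the tool's actual schema:

```python
import json

# Hypothetical workflow UI definition: input slots are filled per item from
# an uploaded JSON batch; output questions are rendered to labelers as-is.
workflow_ui = {
    "inputs": [
        {"slot": "utterance_audio", "type": "audio"},
        {"slot": "transcript", "type": "text"},
    ],
    "outputs": [
        {"id": "q1", "type": "single_select",
         "prompt": "Does the transcript match the audio?",
         "choices": ["Yes", "No"]},
    ],
}

# One record of an uploaded batch file maps data onto the input slots.
batch_record = json.loads(
    '{"utterance_audio": "s3://bucket/clip_001.wav", "transcript": "hello world"}'
)
filled = {s["slot"]: batch_record[s["slot"]] for s in workflow_ui["inputs"]}
print(filled["transcript"])
```

Separating the UI definition from the batch data is what lets a scientist reuse one workflow template across many uploads.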
Figure 6. UI design page
Figure 7. Quality selection page
Figure 8. Labeler requirements page
Guidelines are end-to-end instructions, best practices, and examples that aid labeling. Labelers read through these guidelines while processing to ensure they correctly understand the goal of the collection.
Figure 9. Guidelines page
Figure 10. Privacy ticket page
Post-submission
Once a workflow has been submitted, the launch review team reviews it. During this process, the team can either reject the workflow if it doesn't meet the team's standards or approve it.
Workflow UI for various use cases
Problem: Each science team has different data collection requirements. A workflow launched for a CV team looks different from one for a team working with voice files or advanced coding problems.
The tool has 10 templates (planned expansion to 21 templates by ’25) and 19 input and output components for scientists to mix and match components needed for their data collection.
Figure 11. UI designs by scientists
10 templates for different labeler views.
Each template selected during step 0 has an associated labeler UI. This ensures the customer-selected use case maps to the best-fit labeler experience.
Figure 12. Template selection affects labeler view
19 input and output components
Input components refer to content that will be visible to labelers, such as audio or image display. Output components are questions that require labeler input, such as single select. The tool offers 9 input components and 10 output components that the customer can mix and match for different use cases.
Figure 13. Various input and output components
Model-in-the-loop
The tool also has a built-in model-in-the-loop capability to fine-tune the model via live input. This allows labelers to ask a question in the chat and obtain one or more model-generated responses to assess the performance of each model.
One known pain point for model-in-the-loop had been model integration duration. Integrating a model onto a hosting platform (owned by a different organization) can take up to two weeks due to communication overhead and reliance on engineers. I designed a holistic, no-code "bring your own model" onboarding experience to expedite this process for the user.
Modification and iteration
Problem: A known pain point of collection design had been that modifying content takes longer than customers would like. LLM collections run in smaller batches and are more iterative than the previous audio-based frameworks the collection team was familiar with. This meant we did not have an established system for bursts of iteration.
Major vs. Minor modification
Figure 14. A table on the difference between minor vs. major updates
Based on an assessment of use cases, the PM and I concluded that there are two types of changes: one that can be applied immediately to the ongoing collection without disrupting the process (minor), and one that requires approval to ensure the collection can still proceed smoothly (major). A major modification takes up to 5 days to be reflected in the ongoing data collection, since the business metrics reset every Monday.
After multiple rounds of exploration, I concluded that the modification type selection needs to happen upon clicking the “edit” button. This simplified the problem by
- Limiting the type of edits the user can make if they’ve selected a minor update.
- Introducing a no-save mode for minor modifications to prevent them from interfering with the ongoing collection.
Figure 15. Modification modal upon clicking “Edit workflow”
This allowed minor updates like workflow UI typos or guideline changes to go live immediately, while major updates that required approval due to training or budget dependencies followed the standard delivery time (5 days).
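The minor/major rule amounts to a simple classification over the fields being edited. The field names below are assumed examples for illustration, not the tool's real edit taxonomy:

```python
# Hypothetical sketch of the minor/major modification rule described above.
MINOR_FIELDS = {"ui_typo", "guideline_text"}                  # assumed examples
MAJOR_FIELDS = {"labeler_requirements", "quality_strategy"}   # assumed examples

def classify(edited_fields):
    """Any major-field edit makes the whole change major; otherwise it's minor."""
    if edited_fields & MAJOR_FIELDS:
        return "major"   # requires approval; reflected at the weekly Monday reset
    return "minor"       # applied to the ongoing collection immediately

print(classify({"guideline_text"}))
```

Classifying at the moment the user clicks "edit" is what lets the UI lock out major-field edits when a minor update was selected.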
Rapid iteration
In September 2024, I learned that some collections are more experimental in nature, iterating on a small batch of data until the customer finds the right dataset they can use to improve the AI model. The existing processes had to be reconsidered to support this use case.
My PM, the engineering team, and I collaborated for three weeks on a new system that allows certain users to bypass the approval process, shrinking the sync between the UI and the ongoing collection from up to 5 days to immediate. This was achieved by giving those in an approval group the option to either
- Wait up to 5 days until the next week for the major modification to be reflected in the ongoing collection (existing state)
- Apply workflow changes to all unlabeled data that the customer uploaded
- Apply workflow changes to both labeled and unlabeled data
- Purge all unlabeled data so the customer can upload new data
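The four options above can be sketched as a single dispatch over the collection's labeled and unlabeled data. The option names and return shape are hypothetical, chosen only to illustrate the behavior:

```python
# Hypothetical sketch of the four expedite options offered to approvers.
def expedite(option, labeled_ids, unlabeled_ids):
    """Return (ids re-queued under the new workflow version, ids purged)."""
    if option == "wait":
        return [], []                       # existing state: weekly Monday sync
    if option == "apply_unlabeled":
        return list(unlabeled_ids), []      # only not-yet-labeled items re-queued
    if option == "apply_all":
        return list(labeled_ids) + list(unlabeled_ids), []
    if option == "purge_unlabeled":
        return [], list(unlabeled_ids)      # customer re-uploads fresh data
    raise ValueError(f"unknown option: {option}")

print(expedite("purge_unlabeled", [1], [2, 3]))
```

Making the labeled/unlabeled split explicit is what lets an approver trade off rework cost against how much of the batch reflects the new workflow.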
Figure 16. Expedite change syncs page
This solution allowed customers to bypass approvals and receive iterative data that they need to improve the foundation model. It also introduced a new page that allows users to see all the ongoing collections that use this particular workflow template and strengthened the relationship between collection templates and the ongoing collection.
User quotes
30+ facilitated user interviews, five focus groups, hundreds of Slack feedback messages, two 7+ page research papers, and two VP presentations later...
“The tool enabled us to develop the audio Q&A and audio verification workflows in 1 day whereas workflows last year took us several weeks. This is a huge improvement and will allow us to move much more quickly towards training models.” — Yuan H, Data scientist
“My team has been asking for a platform that will enable us as scientists to quickly iterate on our audio-based workflows for years, and this tool is exactly that solution. The tool’s templates mean that scientists don't need to spend cycles researching the various UX considerations for each annotation type. I feel empowered to innovate in human-in-the-loop space as quickly as we innovate in the modeling space.” — Eriko A., Data scientist
“My team was able to quickly onboard the screenshot-based evaluation workflow to the tool and the labeling platform. The workflow tool contained all of the feature sets we required to quickly configure the workflow per our in-house editors and scientists annotation needs.” — Max Z., Software engineer
“I enjoy the immediacy of the convention updates, as well as the in-context conventions that you can provide for particularly challenging steps for the DAs. The UI Preview is extremely helpful, as is the capability to download input and output files. This allows me to work on the tool, conventions, etc., while my science POC who will use the data can get it formatted in the right way.” — Tim M., Language engineer
Impact
- 140 regular users from 40 customer scientist groups across the company and internal program managers
- 24 days (Q4 ’23) -> ~30 minutes (Nov ’24) for workflow creation turnaround time
- 260 collections launched for labeling between April (MVP launch) and November 2024
- 95% workflow request coverage for foundation model data collection
- 21 major feature enhancements shipped to evolve with customer requests (six additional template offerings, 10+ newly added components, rapid iteration, model onboarding, major/minor modification, etc.)