Sampling Methodology

Sampling ensures meaningful AI insights by allowing Neticle’s AI engine to analyze a focused, representative subset of your entire dataset, while staying within the input limitations (context window) of the large language model (LLM).

Limitations

To keep the generated insights relevant, the following limitations apply:

  1. Minimum Sample Size: A sample must contain at least one verbatim. Without data, no insights can be generated. If you encounter an error due to a lack of data, consider adjusting your filters.
  2. Verbatim Length Limit: The average verbatim length in the dataset must be less than 2,000 characters. If verbatims are too long on average, sampling becomes infeasible, because only a small number of them would fit within the LLM’s input limit.
  3. Total Character Limit: The total character count of all verbatims in the subset must be under 10 million. If this limit is exceeded, less than 1% of the data could be used as a sample, which would not be adequately representative. If you encounter this error, try refining your filters to reduce the data volume.
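The three limits above can be expressed as a simple pre-check. The following is a minimal sketch, not the product's actual implementation: the function name `check_limits` and the returned labels are hypothetical, and only the thresholds (at least one verbatim, an average under 2,000 characters, a total under 10 million characters) come from the list above.

```python
MAX_AVG_LENGTH = 2000          # characters, from limitation 2
MAX_TOTAL_CHARS = 10_000_000   # characters, from limitation 3

def check_limits(verbatims):
    """Return the names of violated limits (an empty list means sampling can proceed).

    `verbatims` is a list of strings. The label names are illustrative only.
    """
    if len(verbatims) < 1:
        # Limitation 1: without data, no insights can be generated.
        return ["minimum_sample_size"]
    problems = []
    total_chars = sum(len(v) for v in verbatims)
    if total_chars / len(verbatims) >= MAX_AVG_LENGTH:
        problems.append("verbatim_length_limit")
    if total_chars >= MAX_TOTAL_CHARS:
        problems.append("total_character_limit")
    return problems
```

If any label is returned, refining the filters (to add data or to reduce volume) is the remedy suggested above.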

Methodology

If the dataset is small enough to fit within the LLM’s input limit, no sampling is needed and all verbatims are analyzed. Otherwise, stratified sampling is used to stay within that limit. Stratified sampling divides a dataset into distinct subgroups (strata) based on specific characteristics, ensuring each subgroup is proportionally represented. This method improves sample accuracy and reduces sampling bias.

To further enhance data quality, Zurvey’s sampling process is dataset-specific, accounting for the number and length of verbatims, as well as the types of dimensions in the dataset. This approach ensures that samples from, for example, survey responses or social media reviews are tailored to the characteristics of those data types.

Selecting Variables for Stratification

Zurvey.io applies two main sampling variables: verbatim length and a metric-based variable, which are determined as follows:

Verbatim Length Variable: Each dataset is categorized based on the length and variation of its verbatims:
  • Few similar-length verbatims: 3 buckets
  • Many similar-length verbatims: 4 buckets
  • Few varied-length verbatims: 4 buckets
  • Many varied-length verbatims: 5 buckets

The greater the length variation and quantity, the more buckets are used to better distinguish between verbatims. Verbatim lengths are grouped using logarithmic binning to address broad ranges and balance representation, reducing bias from outlier values.
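To illustrate the bucket counts and logarithmic binning described above, here is a minimal sketch. The thresholds that separate "few" from "many" and "similar" from "varied" are not stated here, so the sketch takes them as ready-made booleans. The function names and the use of equal-width bins on a `log1p` scale are assumptions, not the documented implementation.

```python
import math

def bucket_count(many: bool, varied: bool) -> int:
    """Bucket counts from the list above: 3, 4, 4, or 5."""
    return 3 + int(many) + int(varied)

def log_bin(length: int, min_len: int, max_len: int, n_buckets: int) -> int:
    """Assign a verbatim length to one of n_buckets equal-width bins on a log scale.

    log1p is used so that a length of 0 is handled safely (an assumption).
    """
    if max_len <= min_len:
        return 0  # all verbatims have the same length: a single bucket
    lo, hi = math.log1p(min_len), math.log1p(max_len)
    position = (math.log1p(length) - lo) / (hi - lo)  # 0.0 .. 1.0
    return min(int(position * n_buckets), n_buckets - 1)
```

On a log scale, a 10-character and a 100-character verbatim are as far apart as a 1,000-character and a 10,000-character one, so a handful of very long outliers does not squeeze most verbatims into a single bucket.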

Metric Variable: The following metrics are considered for sampling:
  • NPS segments (promoter, passive, detractor)
  • CSAT segments (satisfied, neutral, dissatisfied)
  • CES segments (easy, neutral, difficult)
  • Sentiment category (positive, neutral, negative)

If an NPS dimension exists, it is used as the metric variable. If there are multiple NPS dimensions, the one with the most values is selected. If no NPS, CSAT, or CES dimension is present, sentiment categories are used, as these are always available.
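The selection rule can be sketched as follows. The dictionary layout (`name`, `type`, `value_count`) and the function name are hypothetical. The text above only states that NPS comes first and that sentiment is the fallback, so checking CSAT before CES, and applying the "most values" tie-break to those types as well, are assumptions made for the sake of a complete example.

```python
def pick_metric_dimension(dimensions):
    """Pick the dimension used as the metric variable.

    `dimensions` is a list of dicts such as
    {"name": "nps_q1", "type": "nps", "value_count": 120} (hypothetical layout).
    """
    # NPS first is from the text; the CSAT-before-CES order is an assumption.
    for metric_type in ("nps", "csat", "ces"):
        candidates = [d for d in dimensions if d["type"] == metric_type]
        if candidates:
            # With several candidates, take the one with the most values.
            return max(candidates, key=lambda d: d["value_count"])["name"]
    return "sentiment"  # always applicable
```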

Applying Stratified Sampling

With both sampling variables determined, each verbatim is assigned to a stratum based on its length bucket and its metric category. Each stratum's share of the sample is proportional to the combined length of its verbatims relative to the whole dataset. Sampling within each stratum is randomized, so re-sampling the same subset can produce different results each time, which adds variety to the generated insights.
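A minimal sketch of this step, under stated assumptions: the character budget (`char_budget`) stands in for the LLM's input limit, whose actual value is not given here, each item already carries its stratum key, and the rule of filling a stratum until the next verbatim no longer fits is an illustrative choice rather than the documented behavior.

```python
import random
from collections import defaultdict

def stratified_sample(items, char_budget, rng=None):
    """Sample verbatims so each stratum's share matches its share of total length.

    `items` is a list of (text, stratum_key) pairs; `char_budget` is hypothetical.
    """
    rng = rng or random.Random()
    total = sum(len(text) for text, _ in items)
    if total <= char_budget:
        return [text for text, _ in items]  # small dataset: no sampling needed

    strata = defaultdict(list)
    for text, key in items:
        strata[key].append(text)

    sample = []
    for texts in strata.values():
        stratum_chars = sum(len(t) for t in texts)
        stratum_budget = char_budget * stratum_chars / total  # proportional share
        rng.shuffle(texts)  # randomized order within the stratum
        used = 0
        for t in texts:
            if used + len(t) > stratum_budget:
                break  # stratum budget is exhausted
            sample.append(t)
            used += len(t)
    return sample
```

Because the order is shuffled on every call, two runs over the same subset will generally return different verbatims unless a seeded `random.Random` is passed in.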

Specific case: Comparing subsets

When comparing subsets, each subset's sample size is proportional to its share of the combined verbatim length across all compared subsets.
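As a worked illustration, the split can be computed as below. This is a sketch that assumes sample size is measured in characters and that a single shared budget (`char_budget`, a hypothetical parameter) is divided among the subsets. Neither detail is specified above.

```python
def split_budget(subset_lengths, char_budget):
    """Divide a character budget among subsets in proportion to their total length.

    `subset_lengths` maps a subset name to the combined length of its verbatims.
    """
    total = sum(subset_lengths.values())
    return {
        name: char_budget * chars // total  # integer share, rounded down
        for name, chars in subset_lengths.items()
    }

# Subset A holds three times as much text as subset B,
# so it receives three quarters of the budget.
shares = split_budget({"A": 3000, "B": 1000}, 2000)
```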