Core Statistics Concepts
Adjusted Rand index:
The adjusted Rand index measures how similar two market segmentation solutions are while correcting for agreement by chance. The adjusted Rand index is 1 if two market segmentation solutions are identical and 0 if the agreement between the two market segmentation solutions is the same as expected by chance.
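As a rough illustration, the index can be computed from two lists of segment memberships. The sketch below assumes the commonly used Hubert and Arabie form of the correction for chance (the entry above does not name a specific formula); the function name `adjusted_rand_index` is purely illustrative.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_a, labels_b):
    """Adjusted Rand index of two segment membership lists.

    Assumption: Hubert-Arabie form of the chance correction.
    """
    if len(labels_a) != len(labels_b):
        raise ValueError("both solutions must cover the same consumers")
    n = len(labels_a)
    if n < 2:
        raise ValueError("at least two consumers are required")
    # Cross-tabulation of the two solutions: consumers per pair of segments.
    cells = Counter(zip(labels_a, labels_b))
    sum_cells = sum(comb(c, 2) for c in cells.values())
    sum_a = sum(comb(c, 2) for c in Counter(labels_a).values())
    sum_b = sum(comb(c, 2) for c in Counter(labels_b).values())
    # Agreement expected by chance, and the maximum attainable agreement.
    expected = sum_a * sum_b / comb(n, 2)
    max_index = (sum_a + sum_b) / 2
    if max_index == expected:
        # Degenerate case: both solutions are one big segment or all singletons.
        return 1.0
    return (sum_cells - expected) / (max_index - expected)
```

Segment labels only need to be consistent within each solution; two solutions that differ only in how segments are numbered still yield a value of 1.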
A priori market segmentation:
Also referred to as commonsense segmentation or convenience group segmentation, this segmentation approach uses only one (or a very small number of) segmentation variable(s) to group consumers into segments. The segmentation variables are known in advance, and determine the nature of market segments. For example, if age is used, age segments are the result. The success of a priori market segmentation depends on the relevance of the chosen segmentation variable, and on the detailed description of resulting market segments. A priori market segmentation is methodologically simpler than a posteriori (post hoc, data-driven) market segmentation, but is not necessarily inferior. If the segmentation variable is highly relevant, it may well represent the optimal approach to market segmentation for an organisation.
A posteriori market segmentation:
Also referred to as data-driven market segmentation or post hoc segmentation, a posteriori market segmentation uses a set of segmentation variables to extract market segments. Segmentation variables used are typically similar in nature, for example, a set of vacation activities. The nature of the resulting segmentation solution is known in advance (for example: vacation activity segmentation). But, in contrast to commonsense segmentation, the characteristics of the emerging segments with respect to the segmentation variables are not known in advance. Resulting segments need to be both profiled and described in detail before one or a small number of target segments are selected.
Artificial data:
Artificial data is data created by a data analyst. The properties of artificial data – such as the number and shape of market segments contained – are known. Artificial data is critical to the development and comparative assessment of methods in market segmentation analysis because alternative methods can be evaluated in terms of their ability to reveal the true structure of the data. The true structure of empirical consumer data is never known.
Attractiveness criteria:
See segment attractiveness criteria.
Behavioural segmentation:
Behavioural segmentation is the result of using information about human behaviour as segmentation variable(s). Examples include scanner data from supermarkets, or credit card expenditure data.
Bootstrapping:
Bootstrapping is a statistical term for random sampling with replacement. Bootstrapping is useful in market segmentation to explore randomness when only a single data sample is available. Bootstrapping plays a key role in stability-based data structure analysis, which helps to prevent the selection of an inferior, not replicable segmentation solution.
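Random sampling with replacement can be sketched in a few lines of Python. The helper name `bootstrap_samples` and the `seed` parameter are illustrative assumptions, not part of any particular segmentation package.

```python
import random


def bootstrap_samples(data, n_samples, seed=0):
    """Draw n_samples bootstrap samples from data.

    Each bootstrap sample has the same size as the original data and is
    drawn with replacement, so some consumers appear several times and
    others not at all.
    """
    rng = random.Random(seed)  # fixed seed so results can be repeated
    return [rng.choices(data, k=len(data)) for _ in range(n_samples)]
```

In a stability analysis, segments would then be extracted from each bootstrap sample and the resulting solutions compared with one another.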
Box-and-whisker plot:
The box-and-whisker plot (or boxplot) visualises the distribution of a unimodal metric variable. Parallel boxplots allow the distribution of metric variables to be compared across market segments. It is a useful tool for the description of market segments using metric descriptor variables, such as age, or dollars spent.
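The numbers behind a single box can be sketched as follows. The sketch assumes one common convention (whiskers reach to the most extreme observations within 1.5 times the interquartile range of the box; quartiles use the "inclusive" method of Python's `statistics.quantiles`); other conventions exist, and `box_stats` is an illustrative name.

```python
from statistics import quantiles


def box_stats(values):
    """Summary statistics for one box of a box-and-whisker plot.

    Assumption: 1.5 * IQR whisker rule, "inclusive" quartile method.
    """
    q1, median, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    low_fence, high_fence = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    inside = [v for v in values if low_fence <= v <= high_fence]
    return {
        "q1": q1,
        "median": median,
        "q3": q3,
        "whisker_low": min(inside),
        "whisker_high": max(inside),
        # Observations beyond the fences are drawn as individual points.
        "outliers": sorted(v for v in values if v < low_fence or v > high_fence),
    }
```

Calling `box_stats` once per market segment on a metric descriptor variable gives the values that parallel boxplots display side by side.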
Centroid:
The mathematical centre of a cluster (market segment) used in distance-based partitioning clustering or segment extraction methods such as k-means. The centroid can be imagined as the prototypical segment member; the best representative of all members of the segment.
Classification:
Classification is the statistical problem of learning a prediction algorithm where the predicted variable is a nominal variable. In machine learning, classification is a form of supervised learning. Logistic regression or recursive partitioning algorithms are examples of classification algorithms. Classification algorithms can be used to describe market segments.
Commonsense segmentation:
See a priori market segmentation.
Constructive segmentation:
The concept of constructive segmentation has to be used when the segmentation variables are found (in stability-based data structure analysis) to contain no structure. As a consequence of the lack of data structure, repeated segment extractions lead to different market segmentation solutions. This is not optimal, but from a managerial point of view it still often makes sense to treat groups of consumers differently. Therefore, in constructive market segmentation, segments are artificially constructed. The process of constructive market segmentation requires collaboration of the data analyst and the user of the market segmentation solution. The data analyst’s role is to offer alternative segmentation solutions. The user’s role is to assess which of the many possible groupings of consumers is most suitable for the segmentation strategy of the organisation.
Convenience group market segmentation:
See a priori market segmentation.
Cluster:
The term cluster is used in distance-based segment extraction methods to describe groups of consumers or market segments.
Clustering:
Clustering aims at grouping consumers in a way that consumers in the same segment (called a cluster) are more similar to each other than those in other segments (clusters). Clustering is also referred to as unsupervised learning in machine learning. Statistical clustering algorithms can be used to extract market segments.
Component:
The term components is used in model-based segment extraction methods to refer to groups of consumers or market segments.
Constructed market segments:
Groups of consumers (market segments) artificially created from unstructured data. They do not re-occur across repeated calculations.
Data cleaning:
Irrespective of the nature of empirical data, it is necessary to check if it contains any errors and correct those before extracting segments. Typical errors in survey data include missing values or systematic biases.
Data-driven segmentation:
See a posteriori market segmentation.
Data structure analysis:
Exploratory analysis of the segmentation variables used to extract market segments. Stability-based data structure analysis provides insights into whether market segments are naturally existing (permitting natural segmentation to be conducted), can be extracted in a stable way (requiring reproducible market segmentation to be conducted), or need to be artificially created (requiring constructive market segmentation to be conducted). Stability-based data structure analysis also offers guidance on the number of market segments to extract.
Dendrogram:
A dendrogram visualises the solution of hierarchical clustering, and depicts how observations are merged step-by-step in the sequence of nested partitions. The height represents the distance between the two sets of observations being merged. The dendrogram has been proposed as a visual aid to select a suitable number of clusters. However, in data without natural clusters the identification of a suitable number of segments might be difficult and ambiguous.
Descriptor variable:
Descriptor variables are not used to extract segments. Rather, they are used after segment extraction to develop a detailed description of market segments. Detailed descriptions are essential to enable an organisation to select one or more target segments, and develop a marketing mix that is customised specifically to one or more target segments.
Exploratory data analysis:
Irrespective of the algorithm used to extract market segments, one single correct segmentation solution does not exist. Rather, many different segmentation solutions can result. Randomly choosing one of them is risky because the chosen solution may not be very good. The best way to avoid choosing a bad solution is to invest time in exploratory data analysis. Exploratory data analysis provides glimpses of the data structure from different perspectives, thus guiding the data analyst towards a managerially useful market segmentation solution. A range of tools is available to explore data, including tables and graphical visualisations.
Factor-cluster analysis:
Factor-cluster analysis is sometimes used in an attempt to reduce the number of segmentation variables in empirical data sets. It consists of two steps: first the original segmentation variables are factor analysed based on principal components analysis. Principal components with eigenvalues equal to or larger than one are then selected and suitably rotated to obtain the factor scores. Factor scores are then used as segmentation variables for segment extraction. Because only a small number of factors are used, a substantial amount of information contained in the original consumer data might be lost. Factor-cluster analysis is therefore not recommended, and has been shown empirically not to outperform segment extraction using the original variables. If the number of original segmentation variables is too high, a range of other options is available to the data analyst to select a subset of variables, including using algorithms which simultaneously extract segments and select variables, such as biclustering or the variable selection procedure for clustering binary data (VSBD).
Geographic segmentation:
Geographic segmentation is the result of using geographic information as segmentation variable(s). Examples include postcodes, country of origin (frequently used in tourism market segmentation) or travel patterns recorded using GPS tracking.
Global stability:
Global stability is a measure of the replicability of an overall market segmentation solution across repeated calculations. Very high levels of global stability point to the existence of natural market segments. Very low levels of global stability point to the need for constructive market segmentation. Global stability is visualised using a global stability boxplot.
Hierarchical clustering:
Distance-based method for the extraction of market segments. Hierarchical methods either start with the complete data set and split it up until each consumer represents their own market segment; or they start with each consumer being a market segment and merge the most similar consumers step-by-step until all consumers are united in one large segment. Hierarchical methods provide nested partitions as output which are visualised in a so-called dendrogram. The dendrogram can guide the selection of the number of market segments to extract in cases where data sets are well structured.
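The merging (agglomerative) variant can be sketched in plain Python. The sketch assumes Euclidean distance and single linkage (distance between the two closest members), which the entry above does not prescribe; `agglomerative_single_linkage` is an illustrative name, and the brute-force search is only suitable for small data sets.

```python
from math import dist


def agglomerative_single_linkage(points):
    """Merge consumers step by step until one segment remains.

    Returns the merges in order as (members_a, members_b, height), where
    members are indices into points and height is the merge distance.
    Assumption: Euclidean distance with single linkage.
    """
    clusters = [[i] for i in range(len(points))]  # every consumer starts alone
    merges = []
    while len(clusters) > 1:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # Single linkage: distance between the two closest members.
                d = min(
                    dist(points[i], points[j])
                    for i in clusters[a]
                    for j in clusters[b]
                )
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        merges.append((clusters[a], clusters[b], d))
        merged = clusters[a] + clusters[b]
        clusters = [c for k, c in enumerate(clusters) if k not in (a, b)]
        clusters.append(merged)
    return merges
```

The recorded heights are the values a dendrogram would plot; a large jump between consecutive heights suggests a natural place to cut the tree.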
k-means clustering:
k-means clustering is the most commonly used distance-based partitioning clustering algorithm. Using random consumers from the data set as starting points, the standard k-means clustering algorithm iteratively assigns all consumers to the cluster centres (centroids, segment representatives), and adjusts the location of the cluster centres until cluster centres do not change anymore. Standard k-means clustering uses the squared Euclidean distance. Generalisations using other distances are also referred to as k-centroid clustering.
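The iteration described above can be sketched as follows. This is a minimal, unoptimised version for illustration only; the function name `k_means` and the parameters `seed` and `max_iter` are assumptions, and consumers are assumed to be given as tuples of numbers.

```python
import random
from math import dist


def k_means(data, k, seed=0, max_iter=100):
    """Cluster a list of numeric tuples into k segments.

    Returns (labels, centroids). Hypothetical helper for illustration.
    """
    rng = random.Random(seed)
    # Random consumers from the data set serve as starting centroids.
    centroids = rng.sample(data, k)
    labels = []
    for _ in range(max_iter):
        # Assignment step: each consumer joins the nearest centroid. The
        # nearest centroid is the same under Euclidean and squared
        # Euclidean distance.
        labels = [
            min(range(k), key=lambda c: dist(x, centroids[c])) for x in data
        ]
        # Update step: move each centroid to the mean of its members.
        new_centroids = []
        for c in range(k):
            members = [x for x, lab in zip(data, labels) if lab == c]
            if members:
                new_centroids.append(
                    tuple(sum(col) / len(members) for col in zip(*members))
                )
            else:
                # Keep the old centroid if a segment ends up empty.
                new_centroids.append(centroids[c])
        if new_centroids == centroids:
            break  # centroids no longer change
        centroids = new_centroids
    return labels, centroids
```

Because the starting points are random, different seeds can lead to different solutions; running the algorithm several times and keeping the best result is common practice.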
Knock-out criteria:
Criteria a market segment must comply with to qualify as a target segment, including homogeneity (similarity of members to one another), distinctness (difference of members of one segment to members of another segment), sufficient size to be commercially viable, match with organisational strengths, identifiability (recognisability of segment members), and reachability.
Marker variable:
Marker variables are subsets of segmentation variables that discriminate particularly well between market segments. They serve as key characteristics in the profiling of market segments.
Market segment:
A group of similar consumers. A market segment contains a subset of consumers who are similar to one another with respect to the segmentation criterion, for example, a characteristic that is relevant to the purchase of a certain product. Optimally, members of different market segments are very different from one another.
Market segmentation analysis:
The process of grouping consumers into naturally existing or artificially created segments of consumers who share similar product needs.
Masking variable:
Masking variables – also referred to as noisy variables – are segmentation variables that do not help the segmentation algorithm to extract market segments. Rather, they blur the true structure of the data. By not contributing any information relevant to the segmentation analysis, masking variables increase the number of segmentation variables and, in so doing, make the segment extraction task unnecessarily difficult.
Mosaic plot:
The mosaic plot visualises the joint distribution of categorical (nominal or ordinal) variables based on their cross-tabulation. The mosaic plot allows the distribution of a nominal or ordinal variable to be compared across market segments. A shaded mosaic plot colours the cells according to the standardised residuals obtained from comparing the observed cell size with the expected cell size if the variables are not associated, and thus allows easy identification of differences in the distributions across market segments. It is a useful tool for the description of market segments using nominal descriptor variables (such as gender, country of origin, preferred brand), or ordinal variables (such as age groups, the agreement with a range of statements).
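The residuals used for shading can be sketched from a cross-tabulation. The sketch assumes Pearson residuals, (observed - expected) / sqrt(expected), which is one common choice of standardised residual; the name `pearson_residuals` is illustrative, and every row and column is assumed to have a non-zero total.

```python
from math import sqrt


def pearson_residuals(table):
    """Pearson residuals for a cross-tabulation of counts.

    table is a list of rows, e.g. one row per market segment and one
    column per category of the descriptor variable.
    """
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    residuals = []
    for r, row in enumerate(table):
        res_row = []
        for c, observed in enumerate(row):
            # Cell size expected if segment and descriptor are not associated.
            expected = row_totals[r] * col_totals[c] / n
            res_row.append((observed - expected) / sqrt(expected))
        residuals.append(res_row)
    return residuals
```

Cells with large positive residuals contain more consumers than expected under no association, and cells with large negative residuals contain fewer.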
Natural segmentation:
The concept of natural segmentation can be used when natural market segments exist in the data. Such natural market segments are distinct and well-separated. Being able to extract them repeatedly across multiple independent calculations is an indicator of their existence. Natural segmentation is the textbook case of market segmentation, but natural segments rarely occur in consumer data.
Natural market segments:
Groups of similar consumers existing naturally in the market. Such market segments rarely exist in consumer data. High stability of segmentation solutions when repeated is an indicator of the existence of natural market segments.
Noisy variable:
See masking variable.
Partitioning clustering:
Distance-based method for the extraction of market segments. Partitioning methods aim at finding the optimal partition with respect to some criterion and thus require the number of market segments to be specified in advance.
Post hoc market segmentation:
See a posteriori market segmentation.
Principal components analysis:
Principal components analysis (PCA) finds principal components in data sets containing multiple variables. These principal components differ from the original variables in two ways: they are uncorrelated and they are ordered by information contained (the first principal component contains the most information about the data). As long as the full set of principal components is retained, the components offer a different angle of looking at the data. If, however, only a small number of principal components are used as segmentation variables – which typically occurs when data analysts are faced with too many original variables as segmentation variables – a substantial amount of information collected from consumers is typically lost. It is therefore preferable to use the original variables for segment extraction. If the number of segmentation variables is too high, principal components (or expert assessment) can guide the selection of a subset of available variables to be used for segment extraction, or algorithms like biclustering can be used.
Psychographic segmentation:
Psychographic segmentation is the result of using psychological traits of consumers or their beliefs or values as segmentation criterion. Examples include travel motives, benefits sought when purchasing a product, personality traits, and risk aversion.
Rand index:
The Rand index measures how similar two market segmentation solutions are. It takes values between 0 and 1, where 1 indicates that the two segmentation solutions are identical.
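A minimal sketch of the computation: the index is the share of consumer pairs on which the two solutions agree, that is, pairs placed in the same segment by both solutions or in different segments by both. The function name `rand_index` is illustrative.

```python
from itertools import combinations


def rand_index(labels_a, labels_b):
    """Share of consumer pairs treated the same way by both solutions."""
    if len(labels_a) != len(labels_b):
        raise ValueError("both solutions must cover the same consumers")
    pairs = list(combinations(range(len(labels_a)), 2))
    if not pairs:
        raise ValueError("at least two consumers are required")
    agreements = sum(
        (labels_a[i] == labels_a[j]) == (labels_b[i] == labels_b[j])
        for i, j in pairs
    )
    return agreements / len(pairs)
```

Unlike the adjusted Rand index, this value is not corrected for agreement by chance, so two unrelated solutions typically score well above 0.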
Recursive partitioning:
Recursive partitioning can be used as a regression or classification algorithm; it generates a decision tree, also referred to as a classification or regression tree. The algorithm aims at identifying homogeneous subsamples with respect to the outcome variable by stepwise splitting of the sample into subsamples based on the independent variables. The trees obtained using recursive partitioning are easy to interpret and allow for convenient visualisation. The disadvantage of recursive partitioning is that the trees are unstable and are often outperformed in predictive performance by other regression or classification algorithms.
Regression:
Regression is the statistical problem of learning a prediction algorithm where the predicted variable is a metric variable. In machine learning, regression is a form of supervised learning. Linear regression or recursive partitioning algorithms are examples of regression algorithms. Regression is used as the segment-specific model in model-based clustering using a mixture of regression models.
Reproducible segmentation:
The concept of reproducible market segmentation is used when natural, distinct, and well-separated market segments do not exist, yet the segmentation variables underlying the analysis are not entirely unstructured. The existing (unknown) structure of the data can be harvested to extract relatively stable segments. Stable segments are segments which re-emerge in similar form across repeated calculations. In reproducible market segmentation, it is essential to conduct a thorough data structure analysis to gain as much insight as possible about the data before extracting segments. Reproducible market segmentation is the most common case when extracting segments from consumer data (Ernst and Dolnicar 2018).
Sample size:
The number of people whose information is contained in the data set which forms the basis of the market segmentation analysis. Sample size requirements for market segmentation analysis increase with the number of segmentation variables used. As a rule of thumb, the sample size should be at least 100 times the number of variables (Dolnicar et al. 2016).
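The rule of thumb amounts to simple arithmetic, sketched here with an illustrative helper name; the factor of 100 is the one quoted above.

```python
def minimum_sample_size(n_segmentation_variables, per_variable=100):
    """Smallest sample size suggested by the rule of thumb above."""
    return per_variable * n_segmentation_variables
```

For example, a segmentation based on 20 travel motives would call for at least `minimum_sample_size(20)`, that is, 2000 respondents.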
Segment attractiveness criteria:
Once market segments have been extracted, they have to be assessed in terms of their attractiveness as target markets for an organisation. Segment attractiveness criteria have to be selected and weighted by the users of the market segmentation solution (the managers considering pursuing a market segmentation strategy). Optimally, this occurs before data is collected. After segments have been extracted from the data, segment attractiveness criteria are used to develop a segment evaluation plot that assists users in choosing one or a small number of target segments.
Segment evaluation:
After market segments have been extracted from consumer data, profiled and described, users – typically managers in the organisation considering adopting a segmentation strategy – have to select one or a small number of market segments for targeting. To do this, market segments have to be evaluated. This is achieved by agreeing on desirable segment characteristics, assigning weights to them, and using the summated values to create a segment evaluation plot. The plot guides the discussion of users as they select one or a small number of target segments.
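The weighting and summation step can be sketched as a weighted sum per segment. The data layout and the name `segment_attractiveness` are assumptions made for illustration; criteria, ratings and weights would be agreed by the users of the segmentation solution.

```python
def segment_attractiveness(ratings, weights):
    """Weighted sum of criterion ratings for each segment.

    ratings: {segment: {criterion: rating}}
    weights: {criterion: weight}, typically summing to 1
    """
    return {
        segment: sum(weights[c] * segment_ratings[c] for c in weights)
        for segment, segment_ratings in ratings.items()
    }
```

With hypothetical weights `{"size": 0.5, "growth": 0.25, "fit": 0.25}`, a segment rated 8, 4 and 4 on these criteria receives a summated value of 6.0, which becomes its position on one axis of the segment evaluation plot.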
Segment evaluation plot:
The segment evaluation plot visualises the decision matrix assisting users of market segmentation solutions (managers) to compare market segments before selecting one or a small number of target segments. The segment evaluation plot depicts the attractiveness of each segment to the organisation on one axis, and the attractiveness of the organisation’s product or service to each of the segments on the other axis. The values for both of these axes result from the segment extraction stage and from managers’ evaluation of which segment attractiveness criteria matter most to them. The bubble size of the segment evaluation plot can be used to visualise another key feature of each segment, such as an indicator of their profitability.
Segmentation criterion:
This is a general term for the nature of the segmentation variables chosen; it describes the construct used as the basis for grouping consumers. Travel motives or expenditure patterns, for example, are segmentation criteria.
Segmentation variable:
Segmentation variables are used to extract segments. Market segments can be based on one single segmentation variable (such as age or gender) or on many segmentation variables (such as a set of travel motives, or patterns of expenditure for a range of different products). The approach using one single (or a small number of) segmentation variable(s) inducing a segmentation solution which is known in advance is referred to as commonsense segmentation. The approach using many segmentation variables where segments need to be extracted is referred to as data-driven segmentation.
Segment level stability across solutions (SLSA):
Segment level stability across solutions (SLSA) indicates how stable one market segment is across repeated calculations of market segmentation solutions containing different numbers of segments. It can best be understood as the stubbornness with which a market segment reappears across repeated calculations with different numbers of segments. Segment level stability across solutions (SLSA) is visualised using a segment level stability across solutions (SLSA) plot.
Segment level stability within solutions (SLSW):
Segment level stability within solutions (SLSW) indicates how stable a market segment is across repeated calculations of market segmentation solutions containing the same number of segments. Very high levels of segment level stability within solutions (SLSW) for a market segment point to this market segment being a natural market segment. Very low levels for a market segment indicate that this segment is likely to be artificially constructed. Segment level stability within solutions (SLSW) is visualised using a segment level stability within solutions (SLSW) plot.
Segment profile plot:
The segment profile plot is a refined bar chart visualising a market segmentation solution. The segment profile plot requires less cognitive effort to process than a table containing the same information. As a consequence, the segment profile plot makes it easier for users of market segmentation solutions (managers) to gain insight into the key characteristics of market segments. Segment profile plots portray market segments using segmentation variables only.
Segment separation plot:
The segment separation plot allows a segmentation solution to be assessed. The plot consists of a projection of the data into two dimensions (using, for example, principal components analysis); colouring the data points according to segment memberships; and indicating segment shapes using cluster hulls. The plot is overlaid with a neighbourhood graph indicating the segment representatives (cluster centres) as nodes, and their similarity through the inclusion of edges and adapting edge widths. For simplicity, data points can be omitted.
Socio-demographic segmentation:
Socio-demographic segmentation is the result of using socio-demographic information about consumers as segmentation variable(s). Examples include age, gender, income, and education level.
Stability analysis:
Stability analysis provides insight into how reproducible market segmentation analyses are. Stability can be assessed at the overall level for the entire market segmentation solution (global stability), or at the segment level (segment level stability within solutions (SLSW), segment level stability across solutions (SLSA)). Stability information points to the most appropriate market segmentation concept (natural segmentation, reproducible segmentation or constructive segmentation); assists in choosing the number of segments to extract; and identifies stable segments.
Target segment:
The target segment is the market segment that has been selected by an organisation for targeting.
Validity:
See data structure analysis.