Target High-Value Audience Using Look-Alike Model

Eva Wang
May 26, 2021
6 min read

Updated: Jul 18, 2021

In marketing, the business constantly wants to find the audiences who behave the like the "best audience/customer" who spends most, purchase often, and never return. Does that sound hard to achieve? Today I want to share the like-alike models that can help find such audiences.

A look-alike model uses a data mining process to find the target audience through similarity scores. A similarity score describes the similarity between an instance and the known good or bad instance. The model itself is a formula and can be repeatedly applied to new datasets.

Let's assume that we want to identify the postal codes within which customers have the potential to spend like the customers in the best postal code. The best postal code is the one in which customers spend most on average among all postal codes. I choose average spending rather than total spending here. The reason is that the advertisement fees often are paid clicks based rather than the size or population of the geographic area where the marketing campaigns cover.

The datasets are from the database SQLbook, provided with Data Analysis Using SQL and Excel, and written by Gordon S. Linoff.

Find the best postal code

The below SQL code finds the best postal code among "all" the postal codes. There are some orphaned records in the database, so we need to make sure that the best postal code appears in both Orders and ZipCenses table, because the table ZipCenses will be used to identify corrected attributes later.

Notice: the format of the SQL code is pretty messy due to the limitation of the website editor function. You can copy and paste the code to Notepad++. It will help you format it nicely.

USE SQLBook;
GO

WITH Seed AS(
    SELECT o.ZipCode,
        o.TotalPrice
    FROM   Orders o INNER JOIN ZipCensus z
    ON z.zcta5=o.ZipCode
),


aggregation AS(
    SELECT  ZipCode,
        AVG (TotalPrice) as AvgSpending,
        ROW_NUMBER() OVER (ORDER BY AVG (TotalPrice)) as RowNumber          
    FROM Seed
    GROUP BY ZipCode    
)
SELECT ZipCode
FROM aggregation
WHERE RowNumber=1

The best postal code is found to be 91746 after running the above code.

Identify facts related to the average spending

The similarity scores are calculated based on the distances between attributes of the best postal code and attributes of the other postal codes. Here the distance is not physical. It is more like a similarity level. The smaller the distance is, the more similar a postal is to the best postal code. The attribute could be mean income, average age, or gender, but they must be related to the average spending in postal code level. The attribute needs to be identified.

There are many attributes in the ZipCensus table. We choose eight attributes only to test. They are college percentage, child percentage, median age, male percentage, work_at_home percentage, Asian percentage, female full-time worker percentage, and Median Earning.

There are many postal codes in the database. If we use a scatter plot to find the relationship between attributes, it would be tough to read. However, we can group the postal codes using NTILE. Then, we calculate the average of the selected attribute. After reusing the code from above and add more, we get below:

USE SQLBook;
GO


WITH Seed AS(
        SELECT  o.ZipCode,
		z.pctInCollege,
		z.pctChild,
		z.MedianAge,
		z.pctMales,
		z.pctWorkAtHome,
		z.pctAsian1,
		z.pctEmployedFemales,
		z.MedianEarnings,
		o.OrderDate,
		o.TotalPrice
        FROM Orders o INNER JOIN ZipCensus z
	ON z.zcta5=o.ZipCode
),
aggregation AS(
       SELECT   ZipCode,
		AVG (TotalPrice) as AvgSpending,
		NTILE (10) OVER (ORDER BY AVG (TotalPrice) ) as Decile,	
		ROW_NUMBER() OVER (ORDER BY AVG (TotalPrice) ) as RowNumber		
       FROM Seed
	   GROUP BY ZipCode	   
),
Ntile AS(
       SELECT 
	        a.Decile,
	        ROUND(AVG(s.pctInCollege),2) as CollegePcnt,
	        AVG(s.pctChild) as pctChild,
		AVG(s.MedianAge) as MedianAge,
		AVG(s.pctMales) as pctMales,
		AVG(s.pctWorkAtHome) as WorkatHomePcnt,
		AVG(s.pctAsian1) as pctAsian1,
		AVG(s.pctEmployedFemales) as EmployedFemalesPcnt,
		AVG(s.MedianEarnings) as MedianEarnings
       FROM Seed s INNER JOIN aggregation a
       ON s.ZipCode=a.ZipCode
       GROUP BY a.Decile
)
SELECT *
FROM Ntile
ORDER BY Decile

The Decile column was created using NTILE and ranked based on the total cost.

Based on the above SQL result, it is hard to identify which attribute/attributes are mostly related to the average spending. We can plot the results in Excel. For each attribute and the Decile column, we plot one line chart. We get 8 line charts as below. The 1 to 10 on the x-axis are the deciles. The bigger the decile is, the higher the average spending is.

Using the above charts, I find that college percentage, Asian percentage, and median earnings seem mostly corrected with average spending. These three attributes will be used to calculate the distance for similarity scores.

Standardize variances using Z-score

As median income has a different scale as the other two, using them to calculate the square root of the total variance ( as below) would give too much weight to median earning. To solve this, we can calculate the z-score of each attribute. The z-scores will be used to calculate the standard variance later.

/* z-score calculation as an example only */
SQRT(SQUARE (pctAsian1- p.pctAsian1)+SQUARE (MedianEarnings- p. MedianEarnings)+SQUARE (pctInCollege- p. pctInCollege))

Below is to transfer the average Asian percentage, median earning, and college percentage into z-scores, which is a statistical concept.

WITH Seed AS ( < the major code from Identify facts related with average spending part>),

AggragationRelated AS(
       SELECT   o.ZipCode,			
		AVG(z.pctAsian1) as AsianPcnt,			
		AVG(z.MedianEarnings) as MedianEarnings,	
		AVG(z.pctInCollege) as CollegePcnt			
       FROM Orders o INNER JOIN ZipCensus z
       ON z.zcta5=o.ZipCode
       GROUP BY o.ZipCode
),
ozm AS (
      SELECT    AVG(AsianPcnt) as AvgAsianPcnt,
	        STDEV(AsianPcnt) as StdAsianPcnt,
		AVG(MedianEarnings) as AvgMedianEarnings,
	        STDEV(MedianEarnings) as StdMedianEarnings,
		AVG(CollegePcnt) as AvgCollegePcnt,
	        STDEV(CollegePcnt) as StdCollegePcnt
      FROM  AggragationRelated
),
ozs AS(
      SELECT    a.ZipCode,
	        (a.AsianPcnt-ozm.AvgAsianPcnt)/ozm.StdAsianPcnt as  Zscore_AsianPcnt,
		(a.MedianEarnings-ozm.AvgMedianEarnings)/ozm.StdMedianEarnings as Zscore_MedianEarnings,
		(a.CollegePcnt	-ozm.AvgCollegePcnt	)/ozm.StdCollegePcnt  as  Zscore_CollegePcnt	
     FROM AggragationRelated a CROSS JOIN ozm
)
SELECT * FROM ozs

This is the z-score results running the above script.

Calculate standardized distances

After getting the z-scores, we can calculate the square root of the aggregated variance, which is the standardized distance. Here in below, ozs.Zscore_AsianPcnt and p.Zscore_AsianPcnt are a pair. Both are the result after standarization. ozs.Zscore_AsianPcnt represents the z-score of the average Asian percentage in postal code level. p.Zscore_AsianPcnt represents the z-score of the best postal code. Same for the other two pairs.

SQRT(SQUARE (ozs.Zscore_AsianPcnt- p.Zscore_AsianPcnt)+SQUARE (ozs.Zscore_MedianEarnings- p.Zscore_MedianEarnings)+SQUARE (ozs.Zscore_CollegePcnt- p.Zscore_CollegePcnt)).

With the above stopes, we are ready to calculate the standardized distance.

USE SQLBook;
GO


WITH Seed AS(
        SELECT   o.ZipCode,
	         z.pctInCollege,
	         z.pctChild,
		 z.MedianAge,
		 z.pctMales,
		 z.pctWorkAtHome,
		 z.pctAsian1,
		 z.pctEmployedFemales,
		 z.MedianEarnings,
		 o.OrderDate,
		 o.TotalPrice
        FROM Orders o INNER JOIN ZipCensus z
	ON z.zcta5=o.ZipCode
),
aggregation AS(
       SELECT    ZipCode,
		 AVG (TotalPrice) as AvgSpending,
		 NTILE (10) OVER (ORDER BY AVG (TotalPrice)) as Decile,	
		 ROW_NUMBER() OVER (ORDER BY AVG (TotalPrice)) as RowNumber		
       FROM Seed
       GROUP BY ZipCode	   
),
Ntile AS(
       SELECT    a.Decile,
		 AVG(s.pctInCollege) as CollegePcnt,
	         AVG(s.pctChild) as pctChild,
		 AVG(s.MedianAge) as MedianAge,
		 AVG(s.pctMales) as pctMales,
	         AVG(s.pctWorkAtHome) as WorkatHomePcnt,
		 AVG(s.pctAsian1) as pctAsian1,
		 AVG(s.pctEmployedFemales) as EmployedFemalesPcnt,
		 AVG(s.MedianEarnings) as MedianEarnings
       FROM Seed s INNER JOIN aggregation a
       ON s.ZipCode=a.ZipCode
       GROUP BY  a.Decile
),
AggragationRelated AS(
       SELECT   o.ZipCode,			
		AVG(z.pctAsian1) as AsianPcnt,			
		AVG(z.MedianEarnings) as MedianEarnings,	
		AVG(z.pctInCollege) as CollegePcnt			
        FROM Orders o INNER JOIN ZipCensus z
		ON z.zcta5=o.ZipCode
		GROUP BY o.ZipCode
),
ozm AS (
      SELECT    AVG(AsianPcnt) as AvgAsianPcnt,
		STDEV(AsianPcnt) as StdAsianPcnt,
		AVG(MedianEarnings) as AvgMedianEarnings,
	        STDEV(MedianEarnings) as StdMedianEarnings,
		AVG(CollegePcnt) as AvgCollegePcnt,
	        STDEV(CollegePcnt) as StdCollegePcnt
      FROM AggragationRelated
),
ozs AS(
       SELECT  a.ZipCode,
	        (a.AsianPcnt-ozm.AvgAsianPcnt)/ozm.StdAsianPcnt as  Zscore_AsianPcnt,
		(a.MedianEarnings-ozm.AvgMedianEarnings)/ozm.StdMedianEarnings as Zscore_MedianEarnings,
		(a.CollegePcnt	-ozm.AvgCollegePcnt	)/ozm.StdCollegePcnt  as  Zscore_CollegePcnt	
      FROM AggragationRelated a CROSS JOIN ozm
),
distance AS (
       SELECT   ozs.ZipCode AS PostalCode,
	        p.ZipCode AS TopPostalCode,
                SQRT(SQUARE (ozs.Zscore_AsianPcnt- p.Zscore_AsianPcnt)+SQUARE (ozs.Zscore_MedianEarnings- p.Zscore_MedianEarnings)+SQUARE (ozs.Zscore_CollegePcnt- p.Zscore_CollegePcnt)) as Distance
       FROM ozs CROSS JOIN (
	         SELECT ozs. * FROM ozs
		 WHERE ozs.ZipCode=(SELECT ZipCode FROM aggregation WHERE RowNumber=1)) p  
)
SELECT  d.Distance,d.PostalCode,d.TopPostalCode,FORMAT(a.AsianPcnt,'P2') as Asican_Percentage,a.MedianEarnings,	FORMAT(a.CollegePcnt,'P2') as CollegePercentage	
FROM distance d
INNER JOIN AggragationRelated a
ON d.PostalCode=a.ZipCode
WHERE d.Distance IS NOT NULL
ORDER BY d.Distance ASC;
GO

After all the steps, we get the postal codes that are most similar as the best postal code in term of average spending. These postal code should be the target ones we need to put advertisement to.

Present the target postal code on map

If you want to present the postal codes on map, you can replace the code inside the red square from the above paragraph with the below one.

Here, let's assume that we only want to select the top 10 records based on distance.

SELECT TOP 10 PERCENT d.Distance as Distance,d.PostalCode, z.Latitude, z.Longitude
FROM distance d
INNER JOIN ZipCensus z
ON z.zcta5=d.PostalCode
WHERE d.Distance IS NOT NULL
ORDER BY d.Distance ASC;

Here is the result:

Now we can use Excel 3-D mapping to plot the postal codes as in below.

The height of the graph used is average of distances. By tilting the map and adjusting down the size of thickness of the bars, we can get below. From it, we can see, the most desirable customers locate in Florida, New York, and California.

Conclusion

Look-alike models are very logical models that can be used to identify the larger audience from a small sample set audience. It can be used to effectively target high value customers. As businesses grow, it can become increasingly urgent to find a substantial pool of high-quality prospects to target. Look-alike models help target new and unique prospects that behave and purchase like the best customer.