Using Machine Learning and Big Data to Understand Micro Markets

Vernon H. Budinger, CFA

November 3, 2023

Overview

This paper focuses on a data/AI toolkit that marketing managers can use to understand the demographics in their market. While many AI projects rely on enormous data sets and intimidating new neural network algorithms, Machine Learning can provide informative, detailed assessments of consumer markets in economic micro-regions using large pools of free data and free software.  Such “humble” AI efforts can dramatically improve the reach and efficiency of marketing campaigns.

Targeted Market Segments

There are many avenues for creating region-specific content and delivering that message to a specific census tract.

·        Facebook, Instagram, Twitter, and LinkedIn allow you to target a small region with posts.

·        Direct mail companies can create mail campaigns by census tract.

·        Advertisements can be placed in community calendars and local news media.

·        Billboards reach specific market segments.

The key is to segment your audience based on demographics, interests, and needs.  The Census Bureau now provides APIs (Application Programming Interfaces) to pull census data by tract or census block (a subdivision smaller than a tract).  In addition, the American Community Survey (ACS) is a U.S. Census Bureau program that updates the regional information yearly.  This data is free and is tagged with geolocation information.
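As an illustration, the sketch below (in R, using the httr package) shows the general shape of a Census Bureau API call. The endpoint pattern and the variable code B19013_001E (median household income) reflect the 2021 5-year ACS release, and the FIPS codes shown (state 06, county 097 for Sonoma County) are only an example; adjust the year, variables, and geography as needed.

library(httr)

resp <- GET(
  "https://api.census.gov/data/2021/acs/acs5",
  query = list(
    get   = "NAME,B19013_001E",            # variable(s) to pull: median household income
    `for` = "tract:*",                     # every census tract...
    `in`  = "state:06 county:097",         # ...in Sonoma County, California (FIPS 06/097)
    key   = Sys.getenv("CENSUS_API_KEY")   # free key from api.census.gov
  )
)
head(content(resp, as = "parsed"))         # rows of NAME, estimate, state, county, tract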


This paper demonstrates how the U.S. Census/ACS data can be used with freeware statistical/graphing packages to explore the survey results and develop profiles of the consumers in the area.  The paper shows how statistical/Machine Learning analysis of the data provides deep insights into the demographic characteristics of each census tract.

Using Facebook to send content to target markets

Section II Market Analysis of North Bay California Counties

These tools were chosen because the data and analytics are accessible to small and medium-sized companies. They offer a better understanding of geo-demographic trends, improve customer experience, and build stronger relationships with the client base.  The analysis produces detailed quantitative measures of economic and demographic status as well as consumer behavior for micro-regions.

This analysis was adapted from Chapter 8 of Kyle Walker’s book “Analyzing U.S. Census Data.” This study focuses on median home value from the U.S. Census Bureau’s 5-year American Community Survey (5-year ACS).  While this data depends on estimates, it is more current than the Decennial Survey and has more data than the 1-year ACS, which only covers areas of population equal to or greater than 65,000. The more current 1-Year ACS would not cover any of the cities except Santa Rosa in Sonoma. 
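For readers who want to reproduce the data pull, a minimal sketch using the tidycensus R package (the package behind Walker's book) might look like the following; the variable code B25077_001 (median home value) and the 2021 endpoint year are assumptions to adjust as needed, and a Census API key must be registered first.

library(tidycensus)

north_bay <- get_acs(
  geography = "tract",
  variables = "B25077_001",      # median value of owner-occupied housing units
  state     = "CA",
  county    = c("Sonoma", "Lake", "Marin", "Mendocino", "Napa"),
  year      = 2021,              # 2017-2021 5-year ACS
  survey    = "acs5",
  geometry  = TRUE               # attach tract boundaries for mapping
)
head(north_bay)                  # GEOID, tract name, estimate, margin of error, geometry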

Five County Demographic Comparison

Sonoma, Lake, Marin, Mendocino, & Napa Counties 

The data table for the five counties illustrates the challenges that marketing operations face in this complex region. Sonoma and Marin Counties in the southern part of the North Bay Region are wealthier and more densely populated than Mendocino and Lake Counties to the north. 

The income disparity in the table is shocking: the median household income for Sonoma County is about 64% higher than that of neighboring Mendocino County.  Marin County, with its proximity to San Francisco, is the wealthiest county in terms of median household income and per capita income.  The 5-county average ratio of 1.8 between median household income and per capita income suggests that most households have two sources of income. Napa’s land area is slightly bigger than Marin County’s, but its economy is roughly half the size. Mendocino and Lake Counties are clearly rural, with population densities that are a fraction of the other counties’.  Mendocino County has twice as much land as the next-largest county, but its population density is a fraction of Marin’s and Sonoma’s.

Moreover, Mendocino and Lake Counties are poor by California North Bay standards, with 16.1% and 16.5% of their residents, respectively, living under the poverty level, compared to Napa (9%), Marin (7.8%), and Sonoma (9.1%).   The United States Department of Agriculture classifies any county with a poverty rate of more than 20% as “high poverty.”   The Dissimilarity Index and the GINI Index paint a similar picture: the Dissimilarity Index measures segregation in the counties, and the GINI Index measures income disparity.

Dissimilarity Index for White vs Hispanic and GINI Index for Income

The initial reaction of a marketing professional might be to classify Marin, Sonoma, and Napa Counties as rich and Mendocino and Lake Counties as poor.   However, we will see that each of the North Bay counties has pockets of wealth and poverty.

This example focuses on median home value as a proxy for the wealth of the region.  Home value is often the single largest family investment and a measure of wealth. However, we learn about more than wealth; this data set is rich with other variables for study and provides detailed data to understand the inferences from Machine Learning tools.

 

Consumer Market Analysis Using Unsupervised Learning

Unsupervised Learning, which includes Principal Components Analysis, provides techniques for reducing the number of “dimensions” (variables) in the data. The method is especially adept at identifying common factors in datasets with thousands of variables without using labels, which is why it is considered unsupervised.

County-level data does not really give us a refined picture of the population and smaller regional economies.  Are there common factors for each county or are the counties completely different?  The American Community Survey provides a detailed breakdown of the social and economic microclimates in the counties. We can see from the two maps of Aggregate Income that the picture is complex.  Small businesses can take advantage of this knowledge by marketing to specific microclimates through targeted social media and other marketing channels.

The two maps below provide some insight into data available in Mendocino and Sonoma Counties.  The county subdivisions are the U.S. Census Bureau’s Census Tracts for organizing the Decennial Census.  These tracts can be further divided into Census Blocks for additional micro-region detail.

Note that the legend for Sonoma tops out at $400 million, whereas the maximum for Mendocino County’s legend is $200 million.  One of the poorest regions in Sonoma County borders Mendocino County but has twice the aggregate income of Mendocino’s neighboring tract.

When we combine the two counties, we see there is an abrupt change in income levels on the borders of the counties, but there are also many areas of the counties that are similar. This section solves many of these puzzles using unsupervised machine learning to provide detailed insights about microeconomic climates that astute marketing managers use to tailor specific messages that connect with local populations.

Principal Component Analysis (PCA)

Benefit: PCA provides a tool to reduce the number of features (variables) that we need to consider while maintaining most of the information from those features.  As will be discussed in the next few pages, the component information provides deep insights into the key items that unite or separate populations.

This Principal Components Analysis identifies the factors that drive the demographics of the area. Principal Components are vectors of numbers used to reduce the number of features (variables) in analysis but still describe a census tract with great mathematical detail. Each component has factor loadings that further break down the variables associated with each factor. This can be useful for micro-economic research, as it can help to identify the key factors that drive economic activity in different tracts.

The first 8 principal components of this dataset (PC1 to PC8) explain roughly 87.5 percent of the variance, as shown in the table below. As explained above, each principal component provides the factor loadings for the variables.

Ranking of the most important Principal Components by contribution (the first 10 provide 92.42% of the information):

               Contribution        Cumulative Contribution

PC1:           34.82%              34.82%
PC2:           23.41%              58.23%
PC3:           10.23%              68.46%
PC4:            5.58%              74.03%
PC5:            4.06%              78.09%
PC6:            3.42%              81.51%
PC7:            3.06%              84.57%
PC8:            2.96%              87.52%
PC9:            2.64%              90.17%
PC10:           2.25%              92.42%
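A minimal R sketch of how these components and the contribution table above might be produced, assuming a data frame acs_vars (a hypothetical name) with one row per census tract and one column per ACS predictor:

pca <- prcomp(acs_vars, scale. = TRUE)    # standardize the variables, then extract components
summary(pca)$importance[, 1:10]           # proportion and cumulative proportion of variance
round(pca$rotation[, 1:3], 2)             # factor loadings for PC1-PC3 (the dot plots below)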

When we look at the map of the factor loadings for each principal component, we begin to understand how they reduce the dimensions without losing the ability to model volatility.

 Principal Components for the North Bay Region

Each principal component has several factor loadings.  The factor loading is positive if the green bar juts to the right and negative if it juts to the left.  Each component is composed of various combinations of factor loadings or exposure to the variables – examples are:

               College Education

               Foreign Born

               Renter Occupied Housing

               Population Density

               Median Age of the Structures

               Median Age of the Population

               Hispanic

               Asian

 

Principal Component 1, which explained 34.82% of the volatility in the data, is heavily positively loaded for the following key factors:

               White

               Total Population

               Living in the same house last year

               Owner occupied

               Median Income

               Higher Aggregate Income for the tract and by household

 

 

Principal Component 2 explains 23.41% of the volatility and contrasts strongly with #1:

               Populated areas (same as #1)

               Renting housing

               Low percent white

               High foreign-born

               Low owner-occupied housing

               Low income

               Highest Hispanic loading

 

Principal Component 3 explains 10.23% of the volatility:

               Negative exposure to White

              Most positive loading for wages relative to Social Security income

               Foreign Born

               High Percent College

               Negative weighting on Owning House

               Negative on Living in Same House Last Year

               Positive loading for Aggregate Income per Person

 

The Principal Components can then be used to construct a mathematical model of the census tract.

Weights for the respective principal components

With Principal Components, marketing can develop very precise mathematical descriptions of target neighborhoods. The weights for each Principal Component are assigned to each tract and serve to characterize the location mathematically in detail.  For instance, the weight for PC1 for Covelo is -4.381 because it is the site of the Round Valley Indian Reservation and the white population is a relatively small share of the total.  However, Covelo’s weighting for PC7, with its strong factor loading for Native Americans, is 6.850.  West Novato in Marin County, on the other hand, is a largely white neighborhood; its weight for PC1 is 7.105, while its weight for PC7 is 0.295.  These two tracts contrast with East San Rafael, which has one of the highest exposures to PC2, the component heavily loaded for renters and Hispanics with very few college graduates. This is only the beginning of the insight into these tracts; the possible combinations provide deep insight into the demographics of each tract.
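In R, these tract-level weights are simply the principal component scores. A short sketch, reusing the hypothetical pca object and a vector of tract identifiers (tract_geoid, also a placeholder name) from the earlier steps:

scores <- as.data.frame(pca$x[, 1:8])     # one row per tract, one column per component weight
scores$GEOID <- tract_geoid               # tract identifiers in the same row order
head(scores[order(-scores$PC1), c("GEOID", "PC1", "PC2", "PC7")])   # e.g., tracts with the largest PC1 weights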

We can now map the importance of the component to each census tract. NOTE: This paper will only look at the top 3 principal components.

PC1 loads heavily on factors associated with the white population (see the yellow and light green tracts):

PC2 loads on Hispanic and associated variables (note that, once again, tracts with a high Hispanic contribution are yellow-green):

               Positive:

Percent Foreign Born

                              Renter Occupied

                              Population Density

                              Hispanic

               Negative:

                              White

                              Percent College Graduate

PC3 loads on wealth, foreign-born status, and college education, and is negative for receipt of Social Security. PC3 is heavily influenced by income from wages, is the only component where race is not a major loading, and tends to be more important in the south.

Principal Components Regression: Supervised Learning Applied to Unsupervised Learning Results

Benefit: Principal Components Regression provides another view of the data, like looking at a house from the front and then walking around to view it from the side.

The previous PCA focused on component-by-component analysis.  The PCA regression gives a tool to incorporate all the components in one equation to evaluate a tract.  Note: The PCs can be used as indices and equations can be used to develop a score for each tract.

There is more to the Principal Components story. Principal Components can be used for principal components regression, in which the derived components themselves are used as model predictors. Generally, components should be chosen that account for at least 90 percent of the original variance in the predictors, though this will often be up to the discretion of the analyst. In the example below, we fit a model using the first six principal components, which represent roughly 80% of the variance; the outcome variable is once again the log of the median home value.
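A hedged sketch of this principal components regression in R, assuming the pca object from the earlier step and a vector log_value (a placeholder name) holding the log of median home value for the same tracts:

pcr_data <- data.frame(log_value = log_value, pca$x[, 1:6])          # outcome plus the first six components
pcr_fit  <- lm(log_value ~ PC1 + PC2 + PC3 + PC4 + PC5 + PC6, data = pcr_data)
summary(pcr_fit)                                                     # estimates, t values, and R-squared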

Principal Components Regression

Principal Components Regression Analysis

With an R-squared value of 71.46%, the model fit is close to the first regression model fit of 75.05% earlier in this paper. The PCA model is also statistically significant. We can think of principal components as indices that measure the economic activity in the region.  The advantage of this analysis is that we can also examine the contributions of the factors to each census tract based on the factor loadings. 

Table of Selected Observations from Map

This regression provides an economic index of the well-being of a census tract.  The regional average is the intercept, 13.42.  One of the higher scores (a tract’s score is the intercept plus the sum of each coefficient estimate multiplied by the tract’s weight on that Principal Component) is 14.81 for Tiburon in Marin County, with a Poverty Rate of 5%.  One of the lowest scores is 12.35 for Kelseyville in Lake County, with a Poverty Rate of 21%.

The regression scores are based on the following combination of variables.

PC1:  A strong positive contribution to the median value of housing for the entire region.

PC2: Negative factor in housing valuation and the second most significant variable.

PC3: As noted above, this component measures the factors associated with high income in a region. It makes sense that this component would have the highest estimate (0.1615) and would be the most statistically significant (t value of 15.864).

PC4: This component loads heavily for a high percentage of owner-occupied houses, low number of renters, low percentage White, high percentage Asian, large number of rooms in the house, and occupied by the same person last year.

PC 5: Not significant (small estimate and low t value).

PC 6: This factor loads heavily for residents of Pacific Island Descent and most of the locations are in Napa Valley.  It also has a heavy factor loading for the age of the structure and a negative loading for Other Races and Hispanics.

PC 7: There are two positive main loadings for this component: percentage Native American and percentage Black.

PC 8: There are two positive main loadings for this component: median structure age and other race.

PC 9: There are three main loadings for this component: negative exposure to Pacific Islanders, other races, and Native Americans.

Supervised Learning: Geographically Weighted Regression

Benefit:  The linear regressions estimate global relationships between the dependent variable (variable being predicted) and the independent variables (used to predict).  Per Walker, “This lends itself to conclusions like ‘In the Dallas-Fort Worth metropolitan area, higher levels of educational attainment are associated with higher median home values.’ However, metropolitan regions like Dallas-Fort Worth are diverse and multifaceted. It is possible that a relationship between a predictor and the outcome variable that is observed for the entire region on average may vary significantly from neighborhood to neighborhood. This type of phenomenon is called spatial non-stationarity, and can be explored with geographically weighted regression, or GWR (Brunsdon, Fotheringham, and Charlton 1996).”

In the following analysis, we map the Median Home Value and then compare that to the local R-squared to find the local variations from the global conclusions that we reached using PCA and PC Regression.
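A sketch of a geographically weighted regression using the GWmodel R package, in the style of Walker's chapter; the data object and predictor names here (model_data, log_value, median_income, pct_owner_occupied, pop_density) are placeholders, and output column names may vary by package version.

library(GWmodel)
library(sf)

model_sp <- as_Spatial(st_centroid(model_data))     # GWmodel works with sp objects; one point per tract
fm <- log_value ~ median_income + pct_owner_occupied + pop_density   # illustrative predictors only
bw <- bw.gwr(fm, data = model_sp, kernel = "bisquare", adaptive = TRUE)        # choose a nearest-neighbor bandwidth
gwr_fit <- gwr.basic(fm, data = model_sp, bw = bw, kernel = "bisquare", adaptive = TRUE)
summary(gwr_fit$SDF$Local_R2)                       # local R-squared values to join back to the tracts and map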

Geographically Weighted Regression Map

The map below shows the predicted values for the log of the Median Home Value, and the results are very similar to the Principal Component Analysis.

With the base R-squared of 75% shown on the legend, this map shows how the R-squared deviates by census tract. The model performs well across the region but is better in some areas, especially the southernmost, eastern, and northern census tracts, where the R-squared ranges from 80% to above 90%. Note: the model deviates most in the rural regions of Mendocino and Lake Counties.

The map below shows the relationship of the Percent of Owner-Occupied Housing in local tracts to the overall model. Recall that the relationship between the percentage of Owner-Occupied Housing (OOH) and home value is negative for the region. The dark purple areas on the map are those where the global relationship in the model reflects the local relationship, as the local parameter estimates are negative. The areas that stand out include the high-density area of lower Marin County, where median home values are very high. However, in the mostly northern rural tracts of the region, the estimate is zero, indicating that the local percentage of owner-occupied housing does not affect home value.

The population density parameter estimate was positive for the entire equation. The tracts in Marin County in the south, in southern Mendocino County in the center, and Lake County in the central east have no local beta. Once again, the key wine-growing regions in Napa and Sonoma go against the overall trend — the property values increase with lower population density.

Cluster Analysis: Unsupervised Learning

Benefit: Cluster Analysis identifies economic characteristics that explain the spatial distribution of economic activity and groups the tracts into data sets or clusters with similar characteristics versus clusters that differ significantly. Cluster analysis on PCs provides insights into economic opportunities in micro-regions. Does the tract present qualities for economic growth or is it characterized by a low-income or impoverished economy?

While Cluster Analysis can be run on raw data, many data scientists apply PCA to the data before analyzing it with clustering algorithms.  This two-step procedure reduces the noise in the clustering results. While PCA and Cluster Analysis are similar, the two techniques have different goals.  If a study has 100 features (variables), PCA tries to condense that information into a smaller number of features that really matter.  Cluster Analysis, on the other hand, seeks to group the observations described by those 100 features into clusters that are internally similar but significantly different from the other clusters.

After several iterations, I found that 6 clusters provided the best fit and separated the PCs into distinct groups.
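A short sketch of this clustering step in R, run on the component scores rather than the raw variables (pca is the hypothetical object from the earlier steps):

set.seed(1983)                                 # k-means is sensitive to its random start
km <- kmeans(pca$x[, 1:8], centers = 6, nstart = 25)
table(km$cluster)                              # how many tracts fall in each of the six clusters
# km$cluster can be joined back to the tract geometries for mapping, or plotted
# against PC1 and PC2 to reproduce the scatter described below.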

Reviewing the factor loadings in the dot plots from previous sections, PC1 is the component that represents the White population with higher income, college graduates, and owner-occupied housing, while PC2 represents a mostly Hispanic population dominated by renters.  In the scatter plot, the dots represent the tracts and the color identifies the assigned cluster.  Tracts to the right of zero on the horizontal axis are weighted positively toward PC1, as with Clusters 2, 3, and 5.   Tracts above zero on the vertical axis are weighted toward PC2, as with Clusters 1 and 4.

Plotting PC1 against PC3 (income - no race component) shows that there are several distinct income groups. Cluster 1 (Hispanic, renter in the denser south region) ranks positively in PC3 as does Cluster 3 (Rural, White, Small Towns) and Cluster 5 (Wealthier denser populations in the south). Cluster 2 (Rural population centers with mixed PC1 and PC2), Cluster 4, and Cluster 6 do not rank as high in wealth.

Cluster #1(Red): Rural areas, firmly Hispanic (PC2) with few Whites (PC1). In general, these regions are in pockets that lie between bigger regions, like the 101 corridor from Cloverdale to Healdsburg, a small tract on the southeast side of Santa Rosa, and in Napa Valley.

Cluster #2(Blue): Cluster 2 includes heavily commercial areas in the south that have positive exposure to both PC1 and PC2, where both White and Hispanic residents are strong.  In the south, this represents the Highway 101 corridor.  Most of these areas have the densest populations in the region and are significantly more populated than the surrounding census tracts in the region.  The big blue tracts in the north are Willits, Ukiah, Kelseyville, and Cloverdale.

Cluster #3(Green): Represents agricultural and wine-growing regions with positive exposure to PC1 (White, owner-occupied).  The green areas are northern Santa Rosa to Healdsburg regions and include some of the Russian River wine region and key wine regions of Napa.  Like Cluster 5, this cluster has no exposure to PC2 – the Hispanic-dominated Principal Component.

Cluster #4(Purple): Represents poorer, mixed-race populations in rural and agricultural regions.

Cluster #5(Orange): This cluster has a positive weighting for PC1 and a negative weighting for PC2, meaning the residents are predominantly white and in the higher-income areas of Marin and Sonoma Counties.

Cluster #6(Yellow): Roughly equal exposure to PC1 and PC2, but low exposure to PC3 (the wealth component). This cluster covers the Covelo tract in the northeastern corner, some of the poorer neighborhoods around Clear Lake in Lake County, and the Point Arena and Navarro/Boonville regions of western Mendocino County. These areas are sparsely populated and either agricultural or heavily forested.

Summary

While this study delivered some deep insights into the demographic breakdown of the North Bay Region, it is a preliminary case study or a first step that small and medium-sized companies can take to understand customers and improve customer experience.

AI and Machine Learning Tools can spearhead an effective defense against bigger companies and competition from new, disruptive technology.  Despite the length of this paper, it only addressed a small group of customer preferences, and, in many ways, it raises as many questions as it answers.

Some of the solutions that the analysis highlighted:

1.      A financial company might want to advertise the highest interest rates on deposits to the wealthy, older communities in Marin.

2.      Send out advertisements in Spanish to the heavily Hispanic Communities.

3.      Promote community programs in the poorer tracts of Lake County and Mendocino

4.      A finance company might want to promote home equity loans in regions with high home ownership

5.      On the other hand, the same finance company would promote affordable home loan programs in Spanish to tracts with a large percentage of the Hispanic population who rent.

As the paper shows, AI and Machine Learning give Small and Medium-Sized businesses the tools to counter disruptive market developments with a deep understanding of their market and an intense commitment to improving the customer experience, from a seamless delivery of products to attention to customers’ specific needs.

In the bigger picture, the ability to use AI also depends on a company’s corporate structure. The modern firm needs to transform itself into a digital, agile organization that can share AI throughout the firm to survive the coming market disruptions.

 

Neural Profit Engines provides a suite of chief financial officer services under the brand name Neural Financial Officer. 

·       Big Data studies to aid planning and financial analysis

·       Business strategy based on AI and Machine Learning

·       Data cleaning - labeling and preparation

·       AI and Machine Learning analysis of big data and trends

·       Planning and financial analysis

·       Bookkeeping and accounting services

·       Company training for ChatGPT and Bard

 

 

Vernon H. Budinger, CFA

Chief Executive Officer

vernon@neuralprofitengines.com

www.neuralprofitengines.com

+1(707) 513-0880

President Biden's Executive Order on Artificial Intelligence

President Biden and the White House published an Executive Order on AI that highlighted the potential risks and benefits of the technology and outlined the administration's plans to ensure its safe, secure, and trustworthy development and use.

Many AI professionals had expected controls centered on the large AI leaders in the U.S. that could have crippled the ability of smaller players to compete.  However, the document was more benign and was mostly aimed at developing safety standards and establishing security protocols.

The statement notes that AI has the potential to transform many aspects of our lives, from the way we work and learn to the way we interact with the world around us. However, it also warns that AI can pose risks, such as discrimination, bias, and job displacement.

The Fact Sheet explains that the Executive Order directs the following actions:

·       Require that developers of the most powerful AI systems share their safety test results and other critical information with the U.S. government.

·       Develop standards, tools, and tests to help ensure that AI systems are safe, secure, and trustworthy. 

·       Protect against the risks of using AI to engineer dangerous biological materials.

·       Protect Americans from AI-enabled fraud and deception by establishing standards and best practices for detecting AI-generated content and authenticating official content.

·       Establish an advanced cybersecurity program to develop AI tools to find and fix vulnerabilities in critical software.

·       Order the development of a National Security Memorandum that directs further actions on AI and security, to be developed by the National Security Council and White House Chief of Staff. 

To address these risks, the administration is taking a number of steps, including (italics are direct quotes from the Fact Sheet):

  • Issuing an executive order that establishes new standards for AI safety and security, protects Americans' privacy, advances equity and civil rights, stands up for consumers and workers, promotes innovation and competition, and advances American leadership around the world.

  • Securing voluntary commitments from leading AI companies to help move toward safe, secure, and transparent development of AI technology.

  • Investing in research and development to develop new AI technologies that are safe, secure, and trustworthy.

  • Working with international partners to develop global norms and standards for the responsible development and use of AI.

To Promote Innovation and Competition – the Executive Order provides the following instructions:

  • Catalyze AI research across the United States through a pilot of the National AI Research Resource—a tool that will provide AI researchers and students access to key AI resources and data—and expanded grants for AI research in vital areas like healthcare and climate change.

  • Promote a fair, open, and competitive AI ecosystem by providing small developers and entrepreneurs access to technical assistance and resources, helping small businesses commercialize AI breakthroughs, and encouraging the Federal Trade Commission to exercise its authority.

  • Use existing authorities to expand the ability of highly skilled immigrants and nonimmigrants with expertise in critical areas to study, stay, and work in the United States by modernizing and streamlining visa criteria, interviews, and reviews.

The Biden-Harris Administration will continue working with other nations to support safe, secure, and trustworthy deployment and use of AI worldwide. To that end, the President directs the following actions:

·       Expand bilateral, multilateral, and multistakeholder engagements to collaborate on AI. The State Department, in collaboration with the Commerce Department, will lead an effort to establish robust international frameworks for harnessing AI’s benefits, managing its risks, and ensuring safety. 

  • Accelerate development and implementation of vital AI standards with international partners and in standards organizations, ensuring that the technology is safe, secure, trustworthy, and interoperable.

  • Promote the safe, responsible, and rights-affirming development and deployment of AI abroad to solve global challenges, such as advancing sustainable development and mitigating dangers to critical infrastructure.

The statement concludes by emphasizing the administration's commitment to working with the private sector, civil society, and academia to ensure that AI is used for good.

Finally, the Executive Order issued the following instructions to U.S. Government agencies:

  • Issue guidance for agencies’ use of AI, including clear standards to protect rights and safety, improve AI procurement, and strengthen AI deployment.  

  • Help agencies acquire specified AI products and services faster, more cheaply, and more effectively through more rapid and efficient contracting.

  • Accelerate the rapid hiring of AI professionals as part of a government-wide AI talent surge led by the Office of Personnel Management, U.S. Digital Service, U.S. Digital Corps, and Presidential Innovation Fellowship. Agencies will provide AI training for employees at all levels in relevant fields.

 

In short, the Biden administration is taking a comprehensive approach to address the risks and opportunities of AI. The administration's goal is to ensure that AI is used to benefit all Americans while protecting them from potential harm.  Companies that work with AI should consider some of the same measures.

Source:  White House Fact Sheet, October 30, 2023

The Recession Revealed: It is on schedule and there will not be a soft landing!


Who dares to mention the word recession anymore? Now, we have another robust jobs report just when the economy seemed to be cooling. Fortunately, there is a consistent explanation: the velocity of money. Velocity fell to almost 1.0 during the pandemic, the absolute floor, and it was never going to stay at one in a fractional banking system. This paper shows the strong relationship between velocity and both real interest rates and the loan-to-deposit ratio.


Natural Toxicity in LLMs: Left-Brain Antics Require Right-Brain Control

By Vernon Budinger, CFA

New studies reveal that the performance of ChatGPT is declining; some results favor ChatGPT 3.5 over 4. Many AI pundits are speculating that ChatGPT 4.0 has become the victim of the constraints of human-supervised learning. The developers of Large Language Models claim that supervision is necessary because, left unsupervised, LLMs tend to gravitate toward toxic language and behavior.

As I study and use LLMs and Generative AI, I am struck by assertions from industry professionals that unsupervised Generative LLMs tend to respond with “toxic” information without Reinforcement Learning from Human Feedback, or RLHF (the AI crowd’s term for Supervised Learning used to correct and direct the responses from AI models).

This paper explores the toxicity in LLMs and then compares this to the behavior of humans that rely on the left side of their brain because of damage to the right brain. This paper aims to suggest a new path that improves the Generative LLM experience for users with less censorship and more common sense in controlling toxicity.

How Do LLMs Function?

First, an overview of the mechanics of ChatGPT and other Generative LLMs.  A human provides a “prompt” to ask for information or request some other action. Generative LLMs first “tokenize” the prompt; this is the process of converting each word or sentence fragment into packets of numbers called tokens.  The tokens are then fed to ChatGPT, which analyzes (decodes) the prompt and uses an autoregressive process to generate the response one word at a time.

Autoregressive means that the first word generated is used as the seed for the second word; the first and second words are then combined and fed back to the model to provide the seed for the next word. This is observable as ChatGPT or Bard answer questions. The autoregressive process repeats until ChatGPT has answered the prompt.
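To make the loop concrete, here is a toy sketch in R. The next_token() function is a purely hypothetical stand-in that samples from a tiny vocabulary; a real transformer scores every token in its vocabulary given the sequence generated so far.

vocab <- c("home", "values", "rise", "in", "Marin", ".")
next_token <- function(tokens_so_far) sample(vocab, 1)   # stand-in for the decoder step

generate <- function(prompt_tokens, max_new = 10) {
  out <- prompt_tokens
  for (i in seq_len(max_new)) {
    out <- c(out, next_token(out))   # each new token is appended and fed back in as context
  }
  paste(out, collapse = " ")
}

generate(c("In", "the", "North", "Bay"))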

Another statistical process named “attention” is used to keep the autoregressive process moving in the direction provided in the prompt.   Attention, in this case, is a statistical formula that applies weights to the keywords to keep ChatGPT from “drifting” away from the main topic.

Source: “Attention Is All You Need” (Vaswani et al., 2017)

 

Toxicity

I was curious why a class of AI models would tend to generate “toxic” content. This does not seem, at first pass, like something that would be expected from any model.  I use quotes around toxic because this seems to be a term without a clear definition in the AI dictionary. Furthermore, wouldn’t the model’s tendencies and responses to questions be a function of finding balanced training data and employing a balanced training process?

Since commercial Generative LLMs are “self-aware” in the sense that they can explain how they work, I asked the model (Bard, which in my opinion is more honest than ChatGPT) to explain this phenomenon, and this was the response:

“There are a number of reasons why LLMs can generate toxic responses. One reason is that the training data that they are exposed to may contain toxic language. Another reason is that LLMs are trained to predict the next word in a sequence, and they may not be able to distinguish between toxic and non-toxic language. Finally, LLMs may be able to generate toxic responses if they are prompted to do so.

Some examples of toxic responses that LLMs can generate include:

·        Hate speech: This includes language intended to denigrate or attack a person or group based on their race, religion, ethnicity, sexual orientation, or other protected characteristics.

·        Discriminatory language: This includes language that is intended to promote or justify discrimination against a person or group based on their race, religion, ethnicity, sexual orientation, or other protected characteristics.

·        Violent language: This includes language that is intended to incite violence or harm against a person or group.

·        Sexually explicit language: This includes language that is intended to be sexually suggestive or offensive.”

Transformer-based LLMs also have a couple of other undesirable traits:

·       They hallucinate and fantasize

·       They sometimes miss the big picture

Model trainers should be able to control parts of the toxic tendency with “clean” training data and control over prompts to prevent users from tricking LLMs into generating toxic responses.   However, Bard’s focus on word-by-word text generation stands out because this is a process employed by transformers and it is also a tendency of left-brain dominant individuals.  Even more stunning, the results are similar in that left-brain humans also face problems with an attraction to fantasy and bizarre thinking.

Left Brain vs Right Brain: Lessons in Behavior

What light can studies of the left and right brain shed on this behavior? To clarify, the term left-brain dominant does not mean someone who is more logical versus artistic and emotional.  It refers to humans with real physical problems with the right brain, such as lesions or physical trauma that damage the right brain and force the individual to depend almost exclusively on the left brain. These true left-brain dominants may provide some key insights into the root cause of toxicity, hallucinations, and other unwanted characteristics in transformer-based LLMs.

The basis of the split-brain theory stems from outdated research by Roger Sperry that found that people are either left-brained or right-brained, meaning that one side of their brain is dominant. If you're mostly analytical and methodical in your thinking, the theory says that you're left-brained. If you tend to be more creative or artistic, then you're right-brained.

The most recent research offers a more nuanced version of Sperry’s observations. Iain McGilchrist has compiled an impressive review of recent research on the left and right brain in his book “The Master and His Emissary.”

McGilchrist writes:

“The conventional neuropsychology literature distinguishes five types of attention: vigilance, sustained attention, alertness, focused attention, and divided attention.”

“The right hemisphere is responsible for every type of attention except focused attention.”

The Master and His Emissary, Iain McGilchrist, Chapter 2 pages 38- 39.

McGilchrist goes on to say:

“There is evidence of left-hemisphere dominance for local, narrowly focused attention and right-hemisphere dominance for broad, global and flexible attention.48  The scope of the right hemisphere world is broad.49 Patients with a right hemisphere lesion (therefore relying on their intact left hemisphere) start with pieces and put them together to get to the overall picture, whereas those with a left-hemisphere lesion (relying on their right hemisphere) prefer a global approach.”

The Master and His Emissary, Iain McGilchrist, Chapter 2 pages 39- 40.

McGilchrist then examines the behavior of individuals with damage to one side of the brain or the other.

“Patients with right hemisphere damage don’t seem to be able to adjust the ‘spotlight’ of their attention: they suffer ‘an excessive and more or less permanent narrowing of their attentional window.’51 That’s what happens when we rely on the left-hemisphere attention on its own.”

The Master and His Emissary, Iain McGilchrist, Chapter 2 page 40

“The right hemisphere prioritizes whatever actually is, and what concerns us. It prefers existing things, real scenes, and stimuli that can be made sense of in terms of the lived world.”

“At the same time, the left hemisphere is more at home dealing with distorted, non-realistic, fantastic - ultimately artificial - images. This may be because they invite analysis by parts rather than as a whole.”

The Master and His Emissary, Iain McGilchrist, Chapter 2 page 56.

Now we can see why it is important to understand the inner workings of AI and especially LLM models.  As I explained above, the “transformer” in Generative Pre-trained Transformer (GPT) generates the material word by word just like the workings of the left hemisphere of the brain.  

The parallel is stunning; even Bard points to word-by-word thought processing as a possible source of “toxic behavior.” Both left-brain-dominated humans and transformer-driven LLMs generate responses in pieces, and both seem attracted to distorted reality and fantastic images. There is too much similarity between procedure and results for serious AI professionals to let this go unexamined. 

Is RLHF Degrading ChatGPT?

Even more alarming is the possibility that we may be compounding the problem with the use of RLHF to solve the “toxicity” problem. Could it be that RLHF is more left-brain medicine that doesn’t work?

ChatGPT was expected to learn and become smarter. However, UC Berkeley scientists’ studies indicate that ChatGPT performance is not improving as expected and may be deteriorating in some areas. The paper makes a compelling case for using ChatGPT 3.5 (the free version) rather than paying a monthly fee for ChatGPT 4.0. 

The scientists reached the following conclusions in the paper “How Is ChatGPT’s Behavior Changing over Time?” 

·       LLM drift explains how a model can change over time: “Perhaps surprisingly, substantial LLM drifts emerge on this simple task. As shown in Figure 2(a), GPT-4’s accuracy dropped from 84.0% in March to 51.1% in June, and there was a large improvement of GPT-3.5’s accuracy, from 49.6% to 76.2%.” The researchers also found that “chain-of-thought” behaviors used for mathematics were less effective.

·       Less Adherence to Formatting Instructions: ChatGPT failed to follow formatting instructions in the prompt.

·       Poor prompt stability: ‘GPT-4 in March was actually able to find the correct answer to the question of the political affiliation of two test subjects: they both were Democrats. However, the LangChain agent expected a specific format: the generation from LLM must be “[action]+text”, which was encoded in its prompts. Unfortunately, GPT-4 in March failed to follow this format, and thus the LangChain agent simply generated an error message “could not parse LLM Output”.’

·       There were small improvements in visual reasoning for both models

Source: ChatGPT’s performance is slipping

 

Many AI professionals wonder if the widespread use of RLHF to continually implement stricter controls on ChatGPT has impeded the pace at which ChatGPT learns and improves.

Conclusion

McGilchrist observes that the West, and especially Europe, believes that the left brain rules our thought process because the West values logic and methodical thinking. However, his studies show that the right brain is the master, and the left brain is the emissary. 

Maybe this is the answer to the problem of toxicity in LLMs: instead of using RLHF to modify weights calculated during training, perhaps the model should be constructed with two opposing “minds.”  One mind would be an autoregressive model that pieces concepts together one at a time with focused attention. The other would use a right-brain-style process with more general attention instead of RLHF; this second model would be a more overarching model (such as an RNN) that guides the first, using two forms of attention in much the same format as the human brain.

 

Unleashing the Power of Data Science and Financial Expertise: Fueling Growth and Profitability


Reach out to me today to discuss how I can assist your business in harnessing the full potential of data science, AI, R programming, and the rich resources of the Census Bureau. Together, let's unlock a new era of profitable growth.


Question for ChatGPT: How do ChatGPT tokens work?

Tokens are a fundamental concept in ChatGPT and other language models. A token can be as short as a single character or as long as a word, but in general, it represents a discrete unit of text. The model processes text in chunks or batches, and each chunk is made up of a certain number of tokens.

In ChatGPT, both input and output consume tokens. The total number of tokens affects the cost, response time, and even whether a conversation fits within the model's maximum limit. The specific details regarding tokens in ChatGPT are as follows:

  1. Token count: Each message, including user messages and assistant responses, is tokenized into a sequence of tokens. For example, the sentence "How are you?" would be tokenized into four tokens: ["How", "are", "you", "?"].

  2. Token limit: ChatGPT has a maximum token limit. If a conversation exceeds this limit, you will need to truncate or reduce the text to fit. Note that very long conversations are more likely to receive incomplete responses due to the token limit.

  3. Cost: When using the OpenAI API, you are billed based on the total number of tokens in both input and output. Both messages sent to the API and received from it contribute to the token count for billing purposes.

  4. Response time: Generating a response takes more time as the number of tokens increases. Longer responses generally require more computational resources and may take more time to receive a reply.

  5. Interaction constraints: The total token count affects the ability to have back-and-forth conversations. Longer conversations can consume a significant portion of the available tokens, leaving less room for context and meaningful responses.

To manage tokens effectively, it's important to be mindful of the token limit, prioritize essential information, and be concise in your communication to ensure optimal performance and cost efficiency.

From ChatGPT: ATTENTION is Key to Good Prompts

Question for ChatGPT: How can I use knowledge of AI transformer attention to improve the effectiveness of my prompts for ChatGPT?

1.     Provide Sufficient Context: Transformer models use attention to weigh the importance of each word in the input when generating a response. (This is the purpose of role-playing.) Therefore, providing clear and sufficient context in your prompts will help the model generate more relevant and accurate responses. Keep in mind that if a conversation is very long, some of the older parts may be ignored due to the model's token limit (e.g., 2048 tokens for GPT-3), so the most relevant information should be closer to your current prompt.

2.     The Sequence of Information: Since attention mechanisms weigh all words in the input but give more importance to recent ones, the sequence in which you present information can impact the response. Important information is often better placed toward the end of your prompt.

3.     Avoid Ambiguity: Transformer models can struggle with ambiguous prompts because they cannot ask clarifying questions. They will make their best guess based on the attention scores, which may not align with what you intended. So, strive for clarity in your prompts.

4.     Use Instructive Prompts: Instructive prompts, where you tell the model what kind of answer you're looking for, can help get the desired output. For example, if you want a summarized answer, you can say, "In a short summary, explain...".

5.     Experiment with Different Prompts: Different prompt styles can lead to different responses because of how attention weights are assigned. If you're not getting the results you want, try rephrasing your prompt or presenting the information in a different order.

6.     Temperature & Max Tokens: You can adjust parameters like 'temperature' and 'max tokens’ to influence the response. Higher temperature values make the output more random, while lower values make it more deterministic.

7.     Token Limitations: The 'max tokens' parameter limits the length of the response, and there is a maximum limit on the number of tokens the model can handle. For GPT-3, this is 2048 tokens, but it may vary for later versions. If a conversation exceeds this limit, you might need to truncate your text, which could cause the model to lose context.

8.     Incremental Queries: If your query is complex, you can break it down into multiple simpler queries. This can often result in better responses, as the model can handle each part separately.

9.     These suggestions are based on general principles and might need some experimentation and adjustment to get the best results for your specific case. The model doesn't genuinely understand the context or have real-world knowledge beyond its training data, so manage your expectations.

Sources:

ChatGPT sources: "Attention is All You Need" by Vaswani et al., 2017: This paper introduced the concept of the Transformer architecture, which is the foundation of models like GPT-3 and GPT-4. Understanding how attention works in these models can help you craft effective prompts.

OpenAI's GPT-3 paper: This paper presents the model's capabilities and suggests some strategies for generating effective prompts.

"The Illustrated Transformer" by Jay Alammar is a helpful resource for understanding how the Transformer model works, with visualizations to assist comprehension.

Various blog posts, articles, and tutorials by the AI community: Many practitioners share their experiences and insights online, providing practical tips on using AI language models effectively.

 

The temperature parameter is used in the sampling process of a language model's output. It determines how focused or diverse the output will be. Here's how you adjust it: When you use an API like OpenAI's to interact with a model like GPT-4, you can specify the temperature as a parameter in your API call.

The temperature parameter controls the randomness of the model's responses:

  • High temperature values (close to 1.0) make the output more diverse and creative, but also more likely to be inconsistent or make errors.

  • Low temperature values (close to 0.0) make the output more focused and deterministic, often sticking closely to the style and content of the input. However, the responses can also be more repetitive and less creative.

You can adjust the temperature value according to the needs of your application. If you want more creativity and are okay with a bit of randomness, use a higher temperature. If you want more consistency and focus, use a lower temperature.
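As a hedged illustration, an API call from R with the httr package might look like the sketch below; the endpoint and field names reflect the OpenAI Chat Completions API as of late 2023, and the model name and prompt are placeholders.

library(httr)

resp <- POST(
  "https://api.openai.com/v1/chat/completions",
  add_headers(Authorization = paste("Bearer", Sys.getenv("OPENAI_API_KEY"))),
  body = list(
    model       = "gpt-4",
    messages    = list(list(role = "user", content = "Summarize the North Bay housing market.")),
    temperature = 0.2,     # low temperature: more focused, deterministic replies
    max_tokens  = 200      # cap the length of the response
  ),
  encode = "json"
)
content(resp)$choices[[1]]$message$content   # the generated text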

Remember, as of my knowledge cut-off in September 2021, the actual effect of the temperature parameter can depend on various factors, including the specific model version and the complexity of the input prompt. For up-to-date information, always refer to the latest documentation from the API provider.

 

Gathering Financial Information with ChatGPT

ChatGPT does more than answer questions; it is closer to a search engine than a thinking app. However, ChatGPT does have uses when gathering Financial Data.

To gather financial planning and analysis information with ChatGPT, you can follow these steps:

  1. Define the objectives: Determine the specific information you need for financial planning and analysis. It could include financial statements, budgeting, forecasting, performance analysis, variance analysis, key performance indicators (KPIs), or any other relevant metrics.

  2. Prepare questions: Create a set of questions or prompts that you can ask ChatGPT to obtain the desired financial information. Structure your questions to be specific, concise, and clear. For example, you could ask:

    • "What were the total revenues for the previous quarter?"

    • "Can you provide a breakdown of operating expenses by category?"

    • "What is the projected net income for the next fiscal year?"

    • "What are the key drivers affecting profitability?"

  3. Engage in a conversation: Interact with ChatGPT using natural language to ask your questions. Many AI platforms, including ChatGPT, provide an API that allows you to integrate the model into your application or system for seamless communication.

  4. Interpret responses: Analyze the responses provided by ChatGPT and extract the relevant financial information. Ensure that the information aligns with your expectations and matches the context of your financial analysis requirements. Exercise critical judgment and verify the accuracy of the information obtained.

  5. Refine and iterate: If the initial responses from ChatGPT are not satisfactory or require further detail, refine your questions and prompts to seek the desired information. Experiment with different phrasing and approaches to improve the effectiveness of your communication with the model.

  6. Provide context and constraints: When seeking financial planning and analysis information, it can be helpful to provide additional context or constraints to guide the AI model. For example, you can specify a particular time period, assumptions, or specific data sources to narrow down the response.

  7. Cross-validate with other sources: While ChatGPT can provide insights, it's advisable to cross-validate the information obtained from the model with other trusted sources or financial data. Use the AI-generated information as a reference point and validate it against established financial reports, internal data, or inputs from domain experts.

  8. Iteratively improve the model: Provide feedback on the responses given by ChatGPT, particularly when it misinterprets or misunderstands your questions. This feedback can help improve the model's performance over time.

Remember that ChatGPT is a language model that relies on the data it was trained on, and it may have limitations or biases. Therefore, it's important to exercise critical thinking, validate the information, and use the AI-generated insights as a supplement to human judgment and expertise.

Using NEAT To Scan Documents, Receipts, and Transactions Into QuickBooks

The advent of innovative technology has revolutionized the way we manage our finances. Data input is a critical part of the financial management process and apps like NEAT (Navigating Expenses and Accounting Tasks) allow QuickBooks users to scan and categorize transactions and send the image and the data to QuickBooks.

This paper explores the integration of the NEAT application with QuickBooks Online. I use NEAT because it does more than take an image of your receipt. NEAT combines the power of TWAIN Optical scanning technology with AI character recognition to automate expense tracking in QuickBooks. 

NEAT utilizes advanced image processing technology and optical character recognition (OCR) to scan and extract information from receipts. While NEAT's exact technology stack and algorithms are proprietary, it likely incorporates a combination of computer vision and OCR techniques to process receipt images and extract relevant data.

Here's a general overview of the technology NEAT employs for scanning receipts:

  1. Image Capture: NEAT uses the built-in cameras of mobile devices (such as smartphones or tablets) to capture images of receipts. Users can simply take a picture of a receipt using the NEAT mobile application.

  2. Image Processing: NEAT applies image processing techniques to enhance the quality and clarity of the receipt image. This may include adjusting brightness, contrast, and resolution to optimize the image for OCR processing.

  3. Optical Character Recognition (OCR): NEAT employs OCR technology to recognize and extract text from the receipt image. OCR algorithms analyze the image, identify characters, and convert them into machine-readable text. This allows NEAT to extract information such as vendor name, date, amount, and other relevant details from the receipt.

  4. Data Extraction and Categorization: Once the text is extracted from the receipt image, NEAT's algorithms interpret and categorize the extracted data. NEAT likely employs pattern recognition and data analysis techniques to identify specific fields within the extracted text and assign them to relevant categories, such as expense type or vendor.

  5. Data Storage and Integration: NEAT stores the extracted receipt data in a structured format, associating it with relevant metadata. The application may also provide options to integrate with other financial management tools or accounting software, allowing users to seamlessly transfer the receipt data to their preferred platforms.

Overall, NEAT's technology stack likely combines image processing, OCR, and data analysis techniques to automate the scanning and extraction of information from receipts. This allows users to streamline their expense tracking processes and maintain accurate financial records.

How Does A Neural Network Think?

A neural network is a computational model inspired by the structure and functioning of the human brain. It consists of interconnected artificial neurons organized in layers, and it learns to perform tasks by adjusting the weights for the connections between these neurons based on input data.

When we say "how a neural network thinks," it's important to note that neural networks don't have consciousness or subjective experiences. They are purely mathematical models designed to process information and make predictions based on patterns in the data they are trained on.

Here's a simplified explanation of how a neural network processes information:

1.     Input Layer: The neural network receives input data, which could be anything from images to text or numerical values. Each input is represented as a numerical feature vector.

2.     Weighted Connections: Each input feature is multiplied by a corresponding weight and passed through an activation function. The weights determine the strength of the connections between neurons and are initially assigned randomly.

3.     Hidden Layers: The output of the activation function in the input layer is passed to one or more hidden layers. Hidden layers contain additional neurons that perform similar calculations with their inputs, applying weights and activation functions to produce output.

4.     Activation Functions: Activation functions introduce non-linearities into the neural network, allowing it to learn and represent complex relationships in the data. Common activation functions include the sigmoid, tanh, and ReLU functions.

5.     Output Layer: The final hidden layer feeds into the output layer, which produces the network's prediction or output. The number of neurons in the output layer depends on the nature of the task. For example, a neural network for image classification might have a neuron for each possible class label.

6.     Loss Function: The network's output is compared to the desired output (labels) using a loss function. The loss function quantifies the difference between the predicted output and the actual output, providing a measure of how well the network is performing.

7.     Backpropagation: The neural network uses an optimization algorithm called backpropagation to adjust the weights based on the calculated loss. The goal is to minimize the loss function, thereby improving the network's accuracy. Backpropagation involves propagating the error back through the network, updating the weights using gradient descent or similar techniques.

8.     Training: The neural network repeats the process of forward propagation (input to output) and backpropagation (error calculation and weight adjustment) iteratively over a large dataset. This training process allows the network to learn the patterns and relationships present in the data.

9.     Inference: Once the neural network has been trained, it can be used for inference on new, unseen data. The input is passed through the network, and the output is generated based on the learned weights and connections. The network's prediction is typically the output with the highest activation or highest probability, depending on the specific architecture and task.

In summary, a neural network "thinks" by performing a series of mathematical computations on input data, adjusting the connections between neurons through training to learn patterns, and generating predictions based on the learned information. However, it's important to note that this thinking process is purely computational and does not involve consciousness, awareness, or understanding in the same way as a human brain.
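A minimal, self-contained R sketch of one forward pass through a tiny network (two inputs, three hidden units, one output); the weights here are random illustrative values, not a trained model.

sigmoid <- function(z) 1 / (1 + exp(-z))    # activation function

set.seed(42)
x  <- c(0.5, -1.2)                          # input feature vector
W1 <- matrix(runif(6, -1, 1), nrow = 3)     # hidden-layer weights (3 x 2)
b1 <- rep(0, 3)                             # hidden-layer biases
W2 <- matrix(runif(3, -1, 1), nrow = 1)     # output-layer weights (1 x 3)
b2 <- 0                                     # output bias

h <- sigmoid(W1 %*% x + b1)                 # hidden activations
y <- sigmoid(W2 %*% h + b2)                 # network output (a probability-like score)
y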

Interest Rate Outlook Nov 2022: Implications for the US Economy

Interest rate analysis 11/6/2022:

  1. Interest rates are not high; they are returning to averages seen over the last 30 to 60 years.

  2. Interest rates have been much too low for too long over the past 5 years.

  3. The real interest rate story is the inversion of the yield curve: U.S. Treasury bonds maturing in 2 years now yield more than Treasuries maturing in 10 years.

  4. A recession has occurred every time the yield curve has inverted and yields have risen more than 2% over the last 50 years.

  5. The average time lapse between a yield curve inversion and a recession is 13 months. The yield curve inverted in July of this year, which means we should see the economy enter a recession by August 2023.