The Climate of Climate Change

Introduction

The political scene of the United States of America is a jungle of diverse, opposing and often controversial claims and opinions. One especially sensitive topic is climate change. A lot is said about it, but how much, and by whom? We hope to follow the breadcrumb trail of data to find the true bellwethers that lead public opinion in the US.

So why should we care?

The short yet illustrious history of the United States of America has left its mark not only on the human world, but on the natural one as well. Using the data from CAIT v2.0, we can see the historic impact this country's actions have had on the world's climate.

With almost a quarter of all emissions, the US represents a major player. It is because of this that public opinion on this problem is of international importance. In order to shed more light on the delicate topic, we need to gain a better understanding of the mechanisms which are at work.

Where does the trail of breadcrumbs begin?

To answer the questions of interest we will take a look inside Quotebank, a dataset of 178 million quotations. When we keep only US politicians, the set is reduced to around 5.7 million quotes, which is still an ocean of claims. In order to find the needed answers, we must define what it means for a quote to be related to climate change. At first sight the problem might seem trivial, but it is inherently difficult, and solving it is necessary to narrow the data down to our problem space.

How do we know if a quote is about climate change?

To solve this problem we will use the power of data to define the data. We can do this in two ways. Firstly, we can let the data speak for itself by using an unsupervised technique. Secondly, we can base our definition on an external dataset that has been labeled by a human and rely on their definition of the problem. Let us look at both techniques.

FastText (unsupervised)

This model trains on a given corpus of text and learns sentence representations in vector form. The quotes first had to be cleaned by removing stop words, lower-casing and removing punctuation. After training the model, we need to select what is known as a "query". This is a sentence which is embedded and compared with every quote using cosine similarity to find the most similar quotes. The query selected for extracting climate change related quotes was "climate change", based on the results on the validation set.
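
The ranking step described above can be sketched in a few lines. This is a minimal illustration, not the project's code: the three-dimensional vectors are made-up stand-ins for real FastText embeddings, and the helper names `cosine_similarity` and `rank_by_similarity` are hypothetical.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # convention for zero vectors, which have no direction
    return dot / (norm_u * norm_v)


def rank_by_similarity(query_vec, quote_vecs):
    """Return (index, similarity) pairs, most similar quote first."""
    scored = [(i, cosine_similarity(query_vec, v)) for i, v in enumerate(quote_vecs)]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Toy embeddings, purely illustrative (real sentence vectors have many more dimensions).
query = [1.0, 0.0, 1.0]
quotes = [[0.9, 0.1, 1.1], [0.0, 1.0, 0.0], [-1.0, 0.0, -1.0]]
ranking = rank_by_similarity(query, quotes)
```

Keeping only the quotes whose similarity exceeds a chosen threshold then yields the candidate climate change quotes.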

BERT (supervised)

As previously stated, we must have a labeled dataset for a supervised approach. Luckily, there is such a dataset: ClimaText. As the paper did not provide a trained model for this task, we need to train one. The model used for this is the Hugging Face implementation of distilled BERT for sequence classification. After training, the model returns the probability of a quote being related to climate change.

Comparison

Qualitative

FastText is more flexible: the query can be changed freely, so the model is not limited to detecting climate change

BERT infers from whole sentences and not from a bag of words

The precision/recall trade-off is more continuous in FastText and can therefore be tuned for a given problem by setting a suitable threshold

BERT returns probabilities that are clearly separated between 0 and 1, i.e. very few outputs fall in between, which gives a natural way of choosing a threshold

Quantitative

In order to compare the difference in performance, we run both models on the "claims" test set provided in ClimaText. This dataset consists of claims extracted from online articles and was chosen because of its convenient resemblance to the Quotebank dataset. Moreover, it is balanced with 500 claims labeled as related to climate change and equally as many labeled as unrelated.

Two precision/recall curves

The general performance of the supervised technique is higher. For this reason, we decided to use this model to estimate the probability that a quote is related to climate change.
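
To make the comparison concrete, here is a small sketch of how one point of a precision/recall curve is obtained from model scores. The scores and labels below are invented for illustration and are not taken from ClimaText; the function name `precision_recall` is likewise hypothetical.

```python
def precision_recall(scores, labels, threshold):
    """Precision and recall when predicting 'climate' for score >= threshold."""
    tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 1)
    fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 0)
    fn = sum(1 for s, y in zip(scores, labels) if s < threshold and y == 1)
    # Convention: with no positive predictions, precision is reported as 1.0.
    precision = tp / (tp + fp) if (tp + fp) else 1.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Hypothetical scores (similarities or probabilities) and ground-truth labels.
scores = [0.95, 0.80, 0.60, 0.40, 0.20, 0.05]
labels = [1, 1, 0, 1, 0, 0]

# Sweeping the threshold traces out the curve, one (precision, recall) point at a time.
curve = [(t, *precision_recall(scores, labels, t)) for t in (0.1, 0.5, 0.9)]
```

Running the same sweep for both models on the same test set gives the two curves that are compared.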

We continue by narrowing the dataset of 5.7 million quotes down to a set of around 47,000 by selecting only those classified as being about climate change.
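
The selection itself amounts to a simple filter on the classifier output. The sketch below assumes a cut-off of 0.5, which is our assumption for illustration and is not stated in the text; the example quotes and probabilities are invented.

```python
def select_climate_quotes(quotes, probabilities, threshold=0.5):
    """Keep quotes whose predicted probability reaches the threshold.

    The default threshold of 0.5 is an assumed value, not the project's.
    """
    if len(quotes) != len(probabilities):
        raise ValueError("quotes and probabilities must have equal length")
    return [q for q, p in zip(quotes, probabilities) if p >= threshold]


# Invented quotes and classifier outputs.
quotes = [
    "We must cut carbon emissions now.",
    "The budget vote is on Tuesday.",
    "Sea levels keep rising.",
]
probabilities = [0.98, 0.02, 0.91]
climate_quotes = select_climate_quotes(quotes, probabilities)
```

Because the BERT outputs cluster near 0 and 1, the exact threshold should matter little for the size of the resulting set.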

WOW! That was a loooong introduction, so congratulations on getting this far. Now, let's cut to the chase and get our hands dirty!
In the words of the former US president: What the hell is going on with Global Waming?

How much is said?

Let's take a closer look at our dataset!

First of all, we should know the size of our data.

Total number of politicians

Total number of quotes

Total number of climate change quotes

Let's warm up with a simple question - demographic analysis

Gender analysis

Total number of quotes by gender

Number of climate change related quotes by gender

          ============================================================================
                                   total_num     climate_change     ratio
          ----------------------------------------------------------------------------
          Male                       4796931              38173     0.79%
          Female                      942930               8977     0.95%
          ============================================================================

At first glance, men have more climate change related quotes and more quotes in total than women. However, when comparing the ratio climate_change_quotes / all_quotes for men (ratio = 0.79%) and women (ratio = 0.95%), it seems as if women are more aware of this global problem.
Interesting...
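
The ratios in the table can be reproduced directly from the two count columns. The snippet only uses the numbers shown above.

```python
# Counts copied from the table above.
counts = {
    "Male": {"total": 4796931, "climate": 38173},
    "Female": {"total": 942930, "climate": 8977},
}

# Share of climate change quotes among all quotes, in percent.
shares = {
    gender: 100.0 * c["climate"] / c["total"]
    for gender, c in counts.items()
}

for gender, share in shares.items():
    print(f"{gender}: {share:.3f}%")
```

The male share comes out at roughly 0.796%, which the table reports as 0.79%, and the female share at roughly 0.952%.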

Age analysis

Secondly, we will analyse the data with respect to age. Is awareness more present in certain age groups?

Number of climate change quotations by age

So, we can see that the number of quotes as a function of age follows a Gaussian-like distribution, with small anomalies.
Now, let's see whether the age structure changes when we add the gender dimension.

Age analysis with added gender dimensionality for politicians

Age analysis with added gender dimensionality for climate change quotes

In the second plot (age analysis with added gender dimensionality for climate change quotes), the distributions for the two genders look alike and their mean values are similar, so we could ask ourselves whether the two samples come from the same underlying distribution. Since the standard deviations are similar, we can apply a t-test.

After getting a p-value of 1.79e-05 (which is below the conventional 0.01 threshold), we can conclude that the difference in the mean values is significant. The two samples therefore most likely come from different distributions, where the mean for females is a bit lower than the one for males.
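
For readers who want to see what the test computes, here is a minimal sketch of the equal-variance (pooled) two-sample t statistic. The age lists are invented, not the project's data, and the sketch stops at the statistic and its degrees of freedom; turning it into a p-value requires the Student t distribution, which a statistics library would normally provide.

```python
import math
import statistics


def pooled_t_statistic(sample_a, sample_b):
    """Two-sample t statistic assuming equal variances, plus degrees of freedom."""
    n_a, n_b = len(sample_a), len(sample_b)
    mean_a, mean_b = statistics.mean(sample_a), statistics.mean(sample_b)
    var_a, var_b = statistics.variance(sample_a), statistics.variance(sample_b)
    dof = n_a + n_b - 2
    # Pooled variance: weighted average of the two sample variances.
    pooled_var = ((n_a - 1) * var_a + (n_b - 1) * var_b) / dof
    std_err = math.sqrt(pooled_var * (1.0 / n_a + 1.0 / n_b))
    return (mean_a - mean_b) / std_err, dof


# Hypothetical ages of speakers, for illustration only.
male_ages = [52, 61, 58, 66, 70, 55, 63]
female_ages = [49, 57, 54, 60, 62, 51, 58]
t_stat, dof = pooled_t_statistic(male_ages, female_ages)
```

A positive statistic here means the first sample has the higher mean; the larger its magnitude relative to the degrees of freedom, the smaller the p-value.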

This could lead us to the conclusion that older female politicians talk less about climate change than their male colleagues. However, taking the first plot into account, we can observe a much bigger number of older male politicians than female ones, which makes older female politicians, to some extent, underrepresented.

Political affiliation analysis

The two major political parties in the US are the Republican Party and the Democratic Party. Let's explore the distribution of quotes with regard to these two parties.

Quotes grouped by party affiliation

Climate change quotes grouped by party affiliation

Inspecting these pie charts reveals some peculiar things. The Democratic Party, although it has fewer quotes in total, has a bigger number of quotes related to climate change. Does that mean that Democrats are more conscious of the issue?
On the other hand, the ratio values (i.e. climate_change_quotes / all_quotes) are very small, so should we conclude that political affiliation is irrelevant?

Genuinely peculiar...

But what about the individuals from those parties? Let's observe the most cited ones.

Top Climate Change Speakers

As expected, the majority of climate change top speakers are indeed affiliated with the Democratic party.

All together now (Linear Regression)

Okay, that's enough of analysing the isolated parameters. We have observed some interesting statistics, but is it relevant to our problem?
Recall that our analysis should be based on different political and demographic parameters of speakers. So, let's see which of the aforementioned parameters really matter. As Robert West would say, 'Linear regression gives us free p-values'. So why shouldn't we take advantage of this present?

We are trying to predict the number of climate change related quotes attributed to a single person. The features related to a person used for this regression analysis can be divided into 4 categories:

Demographic features

gender_name: categorical, the gender of the politician
age: integer, the age of the politician

Media features

totalNumOccurrences: integer, total number of citations by the media
total_num_quotes: integer, total number of quotes in general

Political features

party_name: categorical, the politician's party affiliation
was_candidate: boolean, presidential candidate
was_in_congres: boolean, congress membership
party_count: integer, number of parties the politician was in

Educational features

degree_num: ordinal, degree of education (bachelor, master, phd)

Voilà!
Feel free to take a sneak peek at the result of linear regression:

                                                        OLS Regression Results
                            ==============================================================================
                            Dep. Variable:              quotation   R-squared:                       0.949
                            Model:                            OLS   Adj. R-squared:                  0.949
                            Method:                 Least Squares   F-statistic:                     4929.
                            Date:                Thu, 16 Dec 2021   Prob (F-statistic):               0.00
                            Time:                        03:04:16   Log-Likelihood:                -13979.
                            No. Observations:                3167   AIC:                         2.798e+04
                            Df Residuals:                    3154   BIC:                         2.806e+04
                            Df Model:                          12
                            Covariance Type:            nonrobust
                  =====================================================================================================
                                                          coef    std err          t      P>|t|      [0.025      0.975]
                  -----------------------------------------------------------------------------------------------------
                  Intercept                           -20.2614      3.060     -6.622      0.000     -26.261     -14.262
                  party_name[T.Republican Party]       -3.9679      0.734     -5.406      0.000      -5.407      -2.529
                  gender_name[T.Male]                   1.2639      0.885      1.429      0.153      -0.471       2.998
                  gender_name[T.Transgender female]     3.7035     11.620      0.319      0.750     -19.080      26.487
                  was_candidate[T.True]                 5.5953      2.323      2.409      0.016       1.041      10.149
                  was_in_congres[T.True]                6.8878      0.835      8.251      0.000       5.251       8.525
                  was_president[T.True]              -882.1337     35.320    -24.976      0.000    -951.386    -812.881
                  degree_num                           14.8775      1.616      9.208      0.000      11.710      18.045
                  age                                  -0.0306      0.027     -1.151      0.250      -0.083       0.021
                  party_count                          25.9043      2.569     10.082      0.000      20.866      30.942
                  numOccurrences                        0.1794      0.001    135.605      0.000       0.177       0.182
                  totalNumOccurrences                   0.0004   8.72e-05      4.377      0.000       0.000       0.001
                  total_num_quotes                     -0.0010   4.77e-05    -21.944      0.000      -0.001      -0.001
                            ==============================================================================
                            Omnibus:                     2529.379   Durbin-Watson:                   2.026
                            Prob(Omnibus):                  0.000   Jarque-Bera (JB):          2639467.474
                            Skew:                           2.466   Prob(JB):                         0.00
                            Kurtosis:                     144.343   Cond. No.                     1.49e+06
                            ==============================================================================

The model achieved an astonishing R² value of 0.949 implying that we explained almost all of the variance. But are the residuals centered?

Yes! We can conclude that the R² measure is relevant.
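
As a side note on what these two quantities are, the sketch below fits a one-feature least-squares line on invented numbers and computes R² and the mean residual. It is a simplified stand-in for the multi-feature model above, and the function names are hypothetical. When an intercept is fitted, the residuals of ordinary least squares average to zero by construction, so a residual plot is mostly useful for spotting skew or heavy tails.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


def r_squared_and_residual_mean(xs, ys):
    """Return (R squared, mean residual) of the fitted line."""
    intercept, slope = fit_line(xs, ys)
    residuals = [y - (intercept + slope * x) for x, y in zip(xs, ys)]
    mean_y = sum(ys) / len(ys)
    ss_res = sum(r * r for r in residuals)
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return 1.0 - ss_res / ss_tot, sum(residuals) / len(residuals)


# Invented data: media occurrences (x) against climate change quotes (y).
occurrences = [10, 50, 120, 300, 800]
climate_quotes = [1, 9, 20, 55, 140]
r2, residual_mean = r_squared_and_residual_mean(occurrences, climate_quotes)
```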

Moreover, the p-values of all features except the demographic ones indicate significance. This is unexpected: in the regression, gender has no significant effect. As we have seen before, women are more inclined to talk about climate change, so how is this possible?

This suggests that another feature is the influential one, and that the gender feature was merely standing in for it. We have a strong hunch that this mysterious feature is none other than the party name (which indeed is a significant feature). Let's take a closer look at some more charts.

Gender distribution in political parties

Clearly, the proportion of women is much bigger in the Democratic Party - there are almost twice as many women in the Democratic Party as in the Republican Party. Hence, it is not the gender that dominates, but simply the party.

What are they saying?

It's all about emotion...

Sentiment analysis

It is common for people to talk about the disastrous effects of climate change, and just as common to claim that it is nothing to worry about. For this reason it might be interesting to study the patterns of sentiment in what people say. Note, however, that the sentiment of a quote reveals very little about someone's opinions on climate change and the message they convey. To illustrate, a person can support the fight against climate change and still make positive as well as negative quotes. For example, talking about how it is a great threat to humanity will return a negative sentiment, but a supporter can also claim that there is hope if we act immediately. Still, sentiment analysis can help us inquire into the overall atmosphere of one's statements and possibly see if there are any patterns that may have a subliminal dimension.

To determine the sentiment we can use a pretrained model from Hugging Face. After running this model, we attach a value between -1 and 1 to each quote, where negative values imply a negative sentiment and positive values a positive one. This model achieves a state-of-the-art accuracy of 92.7%.
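
The text does not say how the model output is turned into a value between -1 and 1. One plausible mapping, shown here purely as an assumption, is to take a classifier's label together with its confidence and sign the confidence by the label. The function name `signed_sentiment` and the label strings are hypothetical.

```python
def signed_sentiment(label, confidence):
    """Map a (label, confidence) pair to a score in [-1, 1].

    Assumed convention: POSITIVE keeps the confidence, NEGATIVE negates it.
    """
    if not 0.0 <= confidence <= 1.0:
        raise ValueError("confidence must lie in [0, 1]")
    normalized = label.upper()
    if normalized == "POSITIVE":
        return confidence
    if normalized == "NEGATIVE":
        return -confidence
    raise ValueError(f"unexpected label: {label!r}")


# Invented classifier outputs for three quotes.
predictions = [("POSITIVE", 0.97), ("NEGATIVE", 0.88), ("negative", 0.51)]
sentiment_scores = [signed_sentiment(label, conf) for label, conf in predictions]
```

Under this convention, scores close to zero correspond to quotes the classifier is unsure about.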

Let us inspect the positive and negative sentiment of climate change quotes with respect to political parties.

Sentiment climate change quotes for Democratic Party

Sentiment climate change quotes for Republican Party

Here we can see that Democrats have a slightly higher tendency than Republicans to make positive claims about climate change. Nevertheless, the difference is not prominent enough to support a firm statement.

Misinformation analysis

It is a well known fact that a lot of controversy is built around climate change. The non-scientific popular media is filled with misconceptions and manipulative claims on the topic. It would be useful to look into how this phenomenon influenced our dataset. In order to achieve this we can yet again rely on contemporary models.

The key feature for this process has been extracted using the CARDS (Computer-assisted detection and classification of misinformation about climate change) model. This feature takes integer values from the range [0, 17]. The value 0 means that the quote was not classified as misinformation, while the integer values in the range [1, 17] refer to different classes of misinformation. For our analysis, we will focus only on whether misinformation is present, and will therefore consider only 2 classes: true quote (i.e. value of 0) and misinformation (i.e. any integer value in the range [1, 17]).
So let's start.
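
The collapse from 18 classes to 2 described above can be sketched as follows. The `(party, class)` records are invented, and the helper names are hypothetical; only the class convention (0 versus 1 to 17) comes from the text.

```python
from collections import Counter


def is_misinformation(cards_class):
    """True for any CARDS class in 1..17, False for class 0."""
    if not 0 <= cards_class <= 17:
        raise ValueError("CARDS class must lie in [0, 17]")
    return cards_class != 0


def count_by_party(records):
    """Count quotes per (party, is_misinformation) pair."""
    counts = Counter()
    for party, cards_class in records:
        counts[(party, is_misinformation(cards_class))] += 1
    return counts


# Invented (party, CARDS class) records.
records = [
    ("Democratic Party", 0),
    ("Democratic Party", 0),
    ("Republican Party", 5),
    ("Republican Party", 0),
    ("Republican Party", 12),
]
party_counts = count_by_party(records)
```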

Total number of quotes per party

Number of true climate change quotes

Number of misinformative climate change quotes

There is a noticeable distinction in the way the parties make claims. Even though most claims on the topic are made by Democrats, most of the misinformation comes from Republicans.

So far, based on the sentiment and misinformation analyses, we have noticed that Republicans appear more negative and more misleading.
Not lovely characteristics.

But wait! Do those sentiments truly describe the politicians' attitudes? Or do the broadcasting services (i.e. the media) have an influence as well?

The Media Dimension

Let's analyse the media dimension of sentiment:

From the plot above, we can conclude that climate change related quotes said by Democrats are more likely to be presented by the media in a positive atmosphere, while those said by Republicans are more likely to be broadcast negatively. We can see a slight polarisation of the media, so we may conclude that the overall sentiment of a quote contains a media-related component in addition to the speaker-related one.

So what about misconceptions?

As with the previous conclusion, we can see polarisation in this context too. Since Republicans' quotes are more likely than Democrats' quotes to be broadcast as misinformation, we may also conclude that the overall degree of misinformation contains both speaker-related and media-related aspects.
Since we are not able to determine which of the components, speaker-related or media-related, prevails, we cannot draw conclusions about the sentiment or misinformation of quotes with respect to political parties.

When are they talking?

Some more analysis...

Timeline Analysis

Another interesting question we wanted to answer is which politicians are prone to quickly forgetting issues related to our topic.
For this type of analysis, we will consider only the politicians with the most quotes about climate change (Barack Obama, Bernie Sanders, Donald Trump, etc.), since they are the most representative individuals.

Firstly, let's consider the former presidents, Barack Obama and Donald Trump.

Timeline for Barack Obama

Timeline for Donald Trump

Note one interesting thing - most of Barack Obama's climate change related quotes date from before the year 2017, while Donald Trump became more cited from 2017 onwards.
As you probably know, 2016-2017 was a period of change: Barack Obama finished his term as US president and Donald Trump began his. Could this be influencing the results, i.e. did Barack Obama become less relevant to the media after the end of his term (and vice versa for Donald Trump)? Recall that in the linear regression analysis we noted the importance of the media-related features.

Let's take some other speakers, important figures who were not elected US president during their careers. Ideal candidates are Bernie Sanders and Hillary Clinton.

Timeline for Bernie Sanders

Bernie Sanders mostly talked about this issue in two periods: end of the year 2015 and end of the year 2019. Those are the periods of his presidential campaigns. But did he magically become more aware of climate change during these periods? Or is there another explanation?
Let's look at the case of Hillary Clinton.

Timeline for Hillary Clinton

Interesting: here too, awareness of climate change problems peaks during the periods of political campaigns.

So, we are now more suspicious that politicians exploit this global issue to gain the popularity they need in the election process.

Matching

Our assumption is that during the period of political campaigns (treatment), politicians will focus on the problem of climate change and on how they would address it during their mandate, resulting in more climate change quotes in that period (target).

How can we test this? We will simply observe the number of quotes a speaker had in the periods when campaigns were active and in the periods when they were not. And, the most interesting part, we will match speakers with themselves. So, we will observe how many quotes one had in the period without campaigns (i.e. without treatment) and in the period during the campaigns (i.e. with treatment). And what about confounders? Since we are matching speakers with themselves, the only confounders are events that coincide with the election campaigns (e.g. climate disasters that happened during the campaign periods). Since the campaign period is long, we do not expect disasters to occur any more frequently during campaigns than outside of them.

Based on the information provided in the article, we will use as the campaign period the timespan of one and a half years before the elections. The treated group represents politicians during the campaign period, and the control group consists of the same politicians outside of campaigning. Since the target is a real number, we will show the distribution of its values.
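
The self-matching idea can be sketched as a split of one speaker's quote dates into a treated and a control count. Several things here are assumptions for illustration: the window is approximated as 548 days, only the 2016 and 2020 presidential elections are used, the quote dates are invented, and the counts are left raw, whereas a real comparison would also have to account for the different lengths of the two periods.

```python
from datetime import date, timedelta

# Roughly one and a half years; the exact window length is an assumption.
CAMPAIGN_DAYS = 548


def in_campaign(quote_date, election_dates):
    """True if the quote falls within the window before any election."""
    window = timedelta(days=CAMPAIGN_DAYS)
    return any(e - window <= quote_date <= e for e in election_dates)


def split_counts(quote_dates, election_dates):
    """Return (treated, control) quote counts for one speaker."""
    treated = sum(1 for d in quote_dates if in_campaign(d, election_dates))
    return treated, len(quote_dates) - treated


elections = [date(2016, 11, 8), date(2020, 11, 3)]

# Invented quote dates for a single hypothetical speaker.
quote_dates = [
    date(2015, 12, 1),   # inside the 2016 campaign window
    date(2017, 6, 1),    # between campaigns
    date(2018, 1, 20),   # between campaigns
    date(2019, 11, 15),  # inside the 2020 campaign window
]
treated, control = split_counts(quote_dates, elections)
```

Repeating this for every speaker gives one treated and one control value per person, and these are the two distributions that are compared.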

The graph shows the complementary cumulative distribution function (CCDF) on a log-log scale, which suggests that the distributions follow a power law.
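
For reference, an empirical CCDF can be computed as in the sketch below, which uses the convention P(X >= x) and invented quote counts. Plotting the resulting points with both axes on a logarithmic scale gives the kind of graph described above, where a roughly straight line is the usual visual hint of a power law.

```python
def empirical_ccdf(values):
    """Return (x, P(X >= x)) for each distinct value x, in increasing order."""
    n = len(values)
    ordered = sorted(values)
    points = []
    for i, x in enumerate(ordered):
        # At the first occurrence of x, exactly n - i values are >= x.
        if i == 0 or x != ordered[i - 1]:
            points.append((x, (n - i) / n))
    return points


# Invented per-speaker quote counts.
quote_counts = [1, 1, 2, 5, 10]
ccdf_points = empirical_ccdf(quote_counts)
```
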
Now we shall run this analysis separately for Democrats and Republicans.

As we can see, for Democrats the distribution of quotes is shifted to higher values during the election period. That is, Democrats tend to focus on climate change more during elections. Conversely, Republicans tend to shy away from the topic during the campaign.

Conclusion

Quotes truly are powerful. Given the right context, of course. So, what is the context of climate change in a country which is one of the biggest, if not the biggest, exporters of both CO2 and media citations?

How much are they saying?

A naive analysis of the demographic features (gender and age) could have deceived us into believing that those features are significant, but linear regression showed the opposite. Once again, linear regression has expressed its true power! In the end, we have shown that awareness (i.e. the number of climate change related quotes) depends mostly on 3 groups of characteristics: political, educational and media-related ones.

So, what are they saying?

As we have seen, sentiment tells us very little about people's opinions. However, it reveals the existence of media bias in the data. Because of this, we cannot be sure whether certain outcomes are the result of a politician's stance on a topic or just of how the media represents them. Moreover, we have observed a tendency of Republicans to spread misinformation.

So, when are they talking?

The media dimension has a profound impact on time-series data, especially during election campaigns. Finally, we found that campaigns also influence how often politicians and media houses talk about climate change.

Meet the team

Natalija Mitic

Computer Science

Edvin Maid

Computer Science

Radenko Pejic

Computer Science

Filip Carevic

Computer Science