What the hell is going on with Global Warming?
First of all, we should know the size of our data.
Let's warm up with a simple question - demographic analysis
==============================================
          total_num   climate_change   ratio
----------------------------------------------
Male        4796931            38173   0.79%
Female       942930             8977   0.95%
==============================================
At first glance, men have both more climate change related quotes and a bigger total number of quotes than women. However, when comparing the ratio climate_change_quotes / all_quotes for men (ratio = 0.79%) and women (ratio = 0.95%), it seems as if women are more aware of this global problem.
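The ratios can be recomputed directly from the counts in the table. The sketch below does that and, as an extra check that is not part of the original analysis, runs a two-proportion z-test. The test assumes that quotes are independent observations, which is a strong assumption here because many quotes share a speaker. The names `counts` and `two_proportion_z` are illustrative.

```python
import math

# Counts copied from the table: (all quotes, climate change quotes).
counts = {"Male": (4796931, 38173), "Female": (942930, 8977)}

# Share of climate change quotes per gender, in percent.
ratios = {g: 100 * cc / total for g, (total, cc) in counts.items()}


def two_proportion_z(n1, k1, n2, k2):
    """z statistic for H0: both groups share the same underlying proportion."""
    p_pool = (k1 + k2) / (n1 + n2)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n1 + 1 / n2))
    return (k2 / n2 - k1 / n1) / se


# Assumption: quotes are treated as independent draws.
z = two_proportion_z(*counts["Male"], *counts["Female"])
for gender, ratio in ratios.items():
    print(f"{gender}: {ratio:.3f}%")
print(f"z = {z:.1f}")
```

Under these assumptions, a large absolute value of z would suggest that the gap between the two ratios is unlikely to be due to chance alone.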
Interesting...
Secondly, we will analyse the data with respect to age. Is awareness more present in certain age groups?
So, we can see that the number of quotes with respect to age follows a Gaussian-like distribution, with small anomalies.
Now, let's see if the age structure changes when we add the aforementioned gender dimension.
The age distributions of climate change quotes split by gender (i.e. the second plot) look similar, and so do their mean values, so we could ask ourselves whether the two samples come from the same underlying distribution. Since the std values are similar, we can apply the t-test.
After getting a p-value = 1.79e-05 (which is less than the usual 0.01 threshold), we can conclude that the difference in the mean values is significant. The two samples therefore most likely come from two different distributions, where the mean for females is a bit lower than the one for males.
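The kind of test described above can be sketched as follows. The text does not say which implementation was used (`scipy.stats.ttest_ind` is a common choice), so this is a minimal standard-library version of the pooled-variance t-test. It assumes large samples, so the p-value is approximated with a normal distribution instead of the exact t distribution. The ages are synthetic and `pooled_t_test` is a hypothetical helper.

```python
import math
import random
from statistics import NormalDist, mean, variance


def pooled_t_test(a, b):
    """Two-sample Student t-test with pooled variance; returns (t, approx. p)."""
    na, nb = len(a), len(b)
    # Pooling assumes both groups share one underlying variance.
    sp2 = ((na - 1) * variance(a) + (nb - 1) * variance(b)) / (na + nb - 2)
    t = (mean(a) - mean(b)) / math.sqrt(sp2 * (1 / na + 1 / nb))
    # Large-sample approximation: t is treated as standard normal.
    p = 2 * (1 - NormalDist().cdf(abs(t)))
    return t, p


# Synthetic ages, for illustration only (not the real speaker data).
random.seed(0)
male_ages = [random.gauss(57, 12) for _ in range(2000)]
female_ages = [random.gauss(53, 12) for _ in range(2000)]
t_stat, p_value = pooled_t_test(male_ages, female_ages)
print(f"t = {t_stat:.2f}, p = {p_value:.2e}")
```

A p-value below the chosen threshold (0.01 in the text) leads to rejecting the hypothesis that both groups have the same mean age.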
This could lead us to the conclusion that older female politicians talk less about climate change than their male colleagues. However, taking the first plot into account, we can observe a much bigger number of older male politicians than female ones, which makes the older female politicians, to some extent, underrepresented.
The two major political parties in the US are the Republican Party and the Democratic Party. Let's explore the distribution of quotes with regard to these two political parties.
The inspection of these pie charts reveals some peculiar things. The Democratic Party, although having fewer quotes in total, has a bigger number of quotes related to climate change. Does it mean that Democrats are more conscious?
On the other hand, the ratio values (i.e. climate_change_quotes / all_quotes) are very small, so could this lead us to believe that political affiliation is irrelevant?
Genuinely peculiar...
But what about the individuals from those parties? Let's observe the most cited ones.
As expected, the majority of climate change top speakers are indeed affiliated with the Democratic party.
Okay, that's enough of analysing the isolated parameters. We have observed some interesting statistics, but are they relevant to our problem?
Recall that our analysis should be based on different political and demographic parameters of speakers. So,
let's see which of the aforementioned parameters really matter. As Robert West would say, 'Linear regression
gives us free p-values'. So why shouldn't we take advantage of this present?
We are trying to predict the number of climate change related quotes attributed to a single person. The features related to a person used for this regression analysis can be divided into 5 categories:
Voilà !
Feel free to take a sneak peek at the result of linear regression:
                            OLS Regression Results
==============================================================================
Dep. Variable:              quotation   R-squared:                       0.949
Model:                            OLS   Adj. R-squared:                  0.949
Method:                 Least Squares   F-statistic:                     4929.
Date:                Thu, 16 Dec 2021   Prob (F-statistic):               0.00
Time:                        03:04:16   Log-Likelihood:                -13979.
No. Observations:                3167   AIC:                         2.798e+04
Df Residuals:                    3154   BIC:                         2.806e+04
Df Model:                          12
Covariance Type:            nonrobust
=====================================================================================================
                                        coef    std err          t      P>|t|      [0.025      0.975]
-----------------------------------------------------------------------------------------------------
Intercept                           -20.2614      3.060     -6.622      0.000     -26.261     -14.262
party_name[T.Republican Party]       -3.9679      0.734     -5.406      0.000      -5.407      -2.529
gender_name[T.Male]                   1.2639      0.885      1.429      0.153      -0.471       2.998
gender_name[T.Transgender female]     3.7035     11.620      0.319      0.750     -19.080      26.487
was_candidate[T.True]                 5.5953      2.323      2.409      0.016       1.041      10.149
was_in_congres[T.True]                6.8878      0.835      8.251      0.000       5.251       8.525
was_president[T.True]              -882.1337     35.320    -24.976      0.000    -951.386    -812.881
degree_num                           14.8775      1.616      9.208      0.000      11.710      18.045
age                                  -0.0306      0.027     -1.151      0.250      -0.083       0.021
party_count                          25.9043      2.569     10.082      0.000      20.866      30.942
numOccurrences                        0.1794      0.001    135.605      0.000       0.177       0.182
totalNumOccurrences                   0.0004   8.72e-05      4.377      0.000       0.000       0.001
total_num_quotes                     -0.0010   4.77e-05    -21.944      0.000      -0.001      -0.001
==============================================================================
Omnibus:                     2529.379   Durbin-Watson:                   2.026
Prob(Omnibus):                  0.000   Jarque-Bera (JB):          2639467.474
Skew:                           2.466   Prob(JB):                         0.00
Kurtosis:                     144.343   Cond. No.                     1.49e+06
==============================================================================
The model achieved an astonishing R² value of 0.949, implying that it explains almost all of the variance. But are the residuals centered?
Yes! We can conclude that the R² measure is relevant.
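To make the two quantities above concrete, here is a minimal sketch of how R² and the mean residual are computed. It is not the model from the summary: it fits a single feature with plain Python, the data are made up, and `fit_line` and `r_squared` are hypothetical helpers.

```python
from statistics import mean


def fit_line(x, y):
    """Ordinary least squares for y = intercept + slope * x."""
    x_bar, y_bar = mean(x), mean(y)
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope


def r_squared(y, y_hat):
    """Share of the variance of y explained by the predictions y_hat."""
    y_bar = mean(y)
    ss_res = sum((yi - fi) ** 2 for yi, fi in zip(y, y_hat))
    ss_tot = sum((yi - y_bar) ** 2 for yi in y)
    return 1 - ss_res / ss_tot


# Made-up data: media occurrences (x) and climate change quotes (y) per speaker.
x = [10, 50, 120, 300, 800]
y = [3, 9, 25, 52, 140]
intercept, slope = fit_line(x, y)
y_hat = [intercept + slope * xi for xi in x]
residuals = [yi - fi for yi, fi in zip(y, y_hat)]
print(f"R^2 = {r_squared(y, y_hat):.3f}, mean residual = {mean(residuals):.2e}")
```

When the model contains an intercept, the mean residual is zero up to rounding by construction, so the shape of the residual distribution (for example its histogram) is usually the more informative thing to inspect.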
Moreover, the p-values of all features except the demographic ones (gender and age) indicate significance. This is unexpected: according to the linear regression, gender is not a significant predictor. As we have seen before, women are more inclined to talk about climate change, so how is this possible?
This must mean that another feature is the influential one, and that the gender feature is just masking it. We have a strong hunch that this mystical feature is none other than the party name (which indeed is a significant feature). Let's take a closer look at some more charts.
Clearly, the proportion of women is much bigger in the Democratic Party - there are almost twice as many women in the Democratic Party as in the Republican Party. Hence, it is not the gender that's dominating, but simply the party.
Let us inspect the positive and negative sentiment of climate change quotes with respect to political parties.
Here we can see that Democrats have a slightly higher tendency than Republicans to make positive claims about climate change. Nevertheless, the difference is not prominent enough to make a valid statement.
The key feature for this process has been extracted using the CARDS (Computer-assisted detection and classification of misinformation about climate change) model. This feature takes integer values in the range [0, 17]. The value 0 means that the quote was not classified as misinformation, while the integer values in the range [1, 17] refer to different classes of misinformation. For our analysis, we will focus only on the question of whether misinformation exists and, therefore, we will consider only 2 classes: true quote (i.e. value of 0) and misinformation (i.e. any integer value from the range [1, 17]).
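The collapse from 18 CARDS classes to 2 can be sketched as follows. The helper names and the sample data are hypothetical. Only the rule itself (0 means no misinformation, 1 to 17 means misinformation) comes from the description above.

```python
from collections import Counter


def is_misinformation(cards_label):
    """Collapse a CARDS class in 0..17 into a binary flag (0 = not misinformation)."""
    if not 0 <= cards_label <= 17:
        raise ValueError(f"unexpected CARDS label: {cards_label}")
    return cards_label != 0


def misinformation_share(quotes):
    """Fraction of quotes flagged as misinformation, per party."""
    totals, flagged = Counter(), Counter()
    for party, label in quotes:
        totals[party] += 1
        flagged[party] += is_misinformation(label)  # True counts as 1
    return {party: flagged[party] / totals[party] for party in totals}


# Hypothetical (party, CARDS label) pairs, for illustration only.
sample = [
    ("Democratic Party", 0), ("Democratic Party", 0), ("Democratic Party", 4),
    ("Republican Party", 0), ("Republican Party", 2), ("Republican Party", 17),
]
shares = misinformation_share(sample)
print(shares)
```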
So let's start.
There is a noticeable distinction in the way the parties make claims. Even though most claims on the topic are made by the Democrats, most of the misinformation was pushed by the Republicans.
So far, regarding sophisticated sentiment analysis, we have noticed that Republicans are more negative and more deceiving.
Not lovely characteristics.
But wait! Do those sentiments truly describe the politicians' attitude? Do the broadcasting services (i.e. the media) also exert an influence?
Let's analyse the media dimension of sentiment:
From the plot above, we can conclude that the climate change related quotes said by Democrats are more likely to be presented by the media in a positive light, while those said by Republicans are more likely to be broadcast negatively. We can see a little bit of polarisation of the media, so we may conclude that the overall sentiment of a quote contains, in addition to a speaker-related component, a media-related component as well.
So what about misconceptions?
Similarly to the previous conclusion, we can see that there is a polarization in this context also. Since Republicans' quotes, compared to Democrats' quotes, are more likely to be broadcast as misinformation, we may also conclude that the overall degree of misinformation contains speaker-related and media-related aspects.
Since we are not able to determine which of the components, i.e. speaker-related or media-related, is
prevailing, we cannot make conclusions regarding the sentiment/misinformation part of quotes with respect to
political parties.
Another interesting question we wanted to answer is which politicians are prone to quickly forgetting issues related to our topic.
For this type of analysis, we will consider only the politicians with the most quotes about climate change (Barack Obama, Bernie Sanders, Donald Trump, etc.), since they are the most representative individuals.
Firstly, let's consider former presidents Barack Obama and Donald Trump.
Note one interesting thing - Barack Obama said the most climate change related quotes in the period
before the year 2017, while Donald Trump became more cited in the year 2017 and onwards.
As you probably know, 2016-2017 is the period of change: Barack Obama finished his term as US president and Donald Trump began his. Could this be influencing the results, i.e. did Barack Obama become less relevant after the end of his term (and vice versa for Donald Trump)? Recall that in the linear regression analysis, we noted the importance of media-related features.
Let's take other speakers, important figures who were nevertheless not elected as US presidents during their careers. Ideal ones are Bernie Sanders and Hillary Clinton.
Bernie Sanders mostly talked about this issue in two periods: end of the year 2015 and end of the
year 2019. Those are the periods of his presidential campaigns. But did he magically become more
aware of climate change during these periods? Or is there another explanation?
Let's see the situation in the case of Hillary Clinton.
Interestingly, increased attention to climate change problems is also visible here during the periods of political campaigns.
So, now we have become more suspicious that politicians exploit this global issue to gain the popularity needed in the election process.
Our assumption is that during the period of the political campaigns (treatment), politicians will focus on the problem of climate change and how they would address it during their mandate, resulting in more climate change quotes in that period (target).
How can we test this? Simply, we will observe the number of quotes a speaker had in the periods when campaigns were active and when they were not. And, the most interesting part, we will match speakers with themselves. So, we will observe how many quotes one had in the period without campaigns (i.e. without treatment) and in the period during the campaigns (i.e. with treatment). And what about confounders? Well, since we are matching speakers with themselves, the only possible confounders are events that coincide with the election campaigns (e.g. climate disasters that spanned the campaign periods). Since the campaign period is long, we do not expect disasters to occur any more frequently during it than outside of it. Based on the information provided in the article, we will use as the campaign period the timespan of one year and a half before the elections.
The treated group consists of politicians during the campaign period, and the control group consists of the same politicians outside of campaigning.
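The self-matching described above can be sketched like this. Several details are assumptions made for illustration: only US presidential elections are considered, "one year and a half" is approximated as 548 days, and `in_campaign`, `split_counts` and the example dates are hypothetical.

```python
from datetime import date, timedelta

# Assumption: "one year and a half" is approximated as 548 days.
CAMPAIGN_LENGTH = timedelta(days=548)
# Assumption: only US presidential election days are used.
ELECTION_DAYS = [date(2016, 11, 8), date(2020, 11, 3)]


def in_campaign(day):
    """True if `day` falls inside the campaign window before any election."""
    return any(e - CAMPAIGN_LENGTH <= day <= e for e in ELECTION_DAYS)


def split_counts(quote_days_by_speaker):
    """Per speaker: (quotes outside campaigns, quotes during campaigns)."""
    result = {}
    for speaker, days in quote_days_by_speaker.items():
        treated = sum(in_campaign(d) for d in days)
        result[speaker] = (len(days) - treated, treated)  # (control, treated)
    return result


# Hypothetical quote dates for a single speaker.
example = {
    "speaker_a": [date(2015, 3, 1), date(2015, 12, 1),
                  date(2016, 6, 15), date(2018, 1, 10)],
}
print(split_counts(example))
```

Because each speaker appears in both groups, speaker-level traits such as party or gender cancel out in the comparison. If the campaign and non-campaign periods differ in length, dividing each count by the number of days in its period may give a fairer comparison.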
Since the target is a real number we will show the distribution of its values.
The graph shows the CCDF (complementary cumulative distribution function) on a log-log scale, showing that the distributions are power-law distributions.
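For reference, an empirical CCDF can be computed as in the sketch below. It uses the convention P(X >= x), which keeps every probability positive so that the points can be drawn on a logarithmic axis. The function name and the sample values are illustrative.

```python
from collections import Counter


def empirical_ccdf(values):
    """Return sorted (x, P(X >= x)) pairs for the distinct values in `values`."""
    n = len(values)
    counts = Counter(values)
    points, remaining = [], n
    for x in sorted(counts):
        points.append((x, remaining / n))  # share of observations >= x
        remaining -= counts[x]
    return points


# Hypothetical numbers of quotes per speaker.
print(empirical_ccdf([1, 2, 2, 3]))  # [(1, 1.0), (2, 0.75), (3, 0.25)]
```

On log-log axes, a power-law distribution appears as an approximately straight line, which is what the plot is used to check.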
Now we shall run this analysis separately for Democrats and Republicans.
As we can see, for Democrats the election period is one in which the distribution of quotes is shifted to higher values. That is, Democrats tend to focus on climate change more during elections. Conversely, Republicans tend to shy away from that topic during the campaign.
Naive analysis of the demographic features (gender and age) could have deceived us into believing that those features are significant, but linear regression showed the opposite. Once again, linear regression has expressed its true power! In the end, we have shown that the awareness (i.e. the number of climate change related quotes) depends mostly on 3 groups of characteristics: political, educational and media-related ones.
As we have seen, the sentiment tells us very little about people's opinions. However, it reveals the existence of media bias in the data. Because of this, we cannot be sure whether certain outcomes are the result of a politician's stance on a topic or just of how the media represents them. Moreover, we have exposed a tendency of Republicans to spread misinformation.
The media dimension has a profound impact on time-series data, especially during election campaigns. Lastly, we found that campaigns also influence the tendencies of politicians and media houses when talking about climate change.
Computer Science