Exploratory Analysis

Categories: statistics, quantitative data analysis, final project, IT Army of Ukraine

DACSS 603 Final Project Work: “Analyzing Data”

Kristina Becvar
2022-04-24

Data Sources

DDOS User Observations

The primary data is a set of observations of users of a novice “hacking tool” used to carry out DDOS (distributed denial of service) attacks against Russian targets in March 2022. The data contains cumulative user counts for each day from March 2 through March 11, and the users represent participants from 98 countries.

WVS/EVS

I will also be using a data set of observations from the World Values Survey conducted from 2017-2021 as a joint project between the World Values Survey and the European Values Study. This data was released in July 2021 and contains responses from ~135,000 respondents across 95 countries.

Spike/Newswhip

The third is a data set of media coverage (media articles and social media mentions) of the Ukrainian minister’s call for volunteers for the “IT Army of Ukraine” to help fight the Russian invasion on the digital front.

Data Analysis

DDOS Users

I reshaped the data into several forms to explore the best ways to analyze it.

DDOS Daily Observations

Show code
#load packages (read_csv and ggplot2 come via the tidyverse)
library(tidyverse)
#load the data
ddos_daily <- read_csv("ddos_observations.csv")
#assign column names to represent variables accurately
colnames(ddos_daily) <- c("Country", "Population", "Region", "March2", "March3", "March4", "March5", "March6", "March7", "March8", "March9", "March10", "March11")
#summarize the data
options(scipen = 999)
head(ddos_daily)
# A tibble: 6 x 13
  Country  Population Region March2 March3 March4 March5 March6 March7
  <chr>         <dbl> <chr>   <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>
1 Aland         29789 Europe      1      1      1      1      1      1
2 Albania     3088385 Europe     19     22     22     23     32     44
3 Algeria    43576691 Africa      0      8      8      8      8      8
4 Andorra       85645 Europe      2      6      6      6      6      6
5 Argenti~   45864941 South~      9      9      9     11     11     11
6 Armenia     3011609 Asia        1      7      7      9      9     13
# ... with 4 more variables: March8 <dbl>, March9 <dbl>,
#   March10 <dbl>, March11 <dbl>

The total number of DDOS users began at 7,850 on the first day of observations, March 2, 2022, and grew to 48,879 by the last day available for observation, March 11, 2022.

Show code
sum(ddos_daily$March2)
[1] 7850
Show code
sum(ddos_daily$March11)
[1] 48879

DDOS Cumulative Observations

However, I am not going to examine the panel data; I am only going to look at the cumulative data - that is, the count of users as of the last day of observations, March 11. This is what that data looks like:

Show code
#load the data
ddos_cumulative <- read_csv("ddos_cumulative.csv")
#summarize the data
options(scipen = 999)
head(ddos_cumulative)
# A tibble: 6 x 4
  country   population region        users
  <chr>          <dbl> <chr>         <dbl>
1 Aland          29789 Europe            1
2 Albania      3088385 Europe           57
3 Algeria     43576691 Africa           10
4 Andorra        85645 Europe            6
5 Argentina   45864941 South America    11
6 Armenia      3011609 Asia             16

DDOS Regional Observations

It is still important to be able to visualize the dramatic change in user count over time, even if I am not analyzing the time series in this analysis. I experimented with displaying the increase as a whole and the increase by region. This is what that looks like:

Show code
ddos_regions <- read_csv("ddos_by_region.csv", 
    col_types = cols(Date = col_date(format = "%m/%d/%Y")))
ddos_regions <- as_tibble(ddos_regions) 
ddos_regions
# A tibble: 10 x 10
   Date       Africa  Asia Europe Middle_East North_America Oceania
   <date>      <dbl> <dbl>  <dbl>       <dbl>         <dbl>   <dbl>
 1 2022-03-02     15   180   4863          72          1208      90
 2 2022-03-03     32   419   6994         115          1723     119
 3 2022-03-04     39   467   9069         137          1905     135
 4 2022-03-05     59   604  17392         163          2416     177
 5 2022-03-06     77   694  18447         184          2653     195
 6 2022-03-07     88   867  20999         206          3057     245
 7 2022-03-08    129  1143  27081         306          4028     363
 8 2022-03-09    137  1171  27996         320          4245     580
 9 2022-03-10    156  1308  30141         353          4548     623
10 2022-03-11    164  1443  34439         390          5245     718
# ... with 3 more variables: South_America <dbl>,
#   Southeast_Asia <dbl>, Ukraine <dbl>
Show code
ggplot(ddos_regions, aes(x = Date)) +
  geom_line(aes(y = Africa, colour = "Africa")) +
  geom_line(aes(y = Asia, colour = "Asia")) +
  geom_line(aes(y = Europe, colour = "Europe")) +
  geom_line(aes(y = Middle_East, colour = "Middle East")) +
  geom_line(aes(y = North_America, colour = "North America")) +
  geom_line(aes(y = Oceania, colour = "Oceania")) +
  geom_line(aes(y = South_America, colour = "South America")) +
  geom_line(aes(y = Southeast_Asia, colour = "Southeast Asia")) + 
  geom_line(aes(y = Ukraine, colour = "Ukraine")) +
  scale_colour_discrete(name = "Region") +
  xlab("Dates") +
  ylab("Users") +
  ggtitle("Increase in Regional Users by Date") +
  theme_minimal()
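
The plot above adds one geom_line() layer per region. An equivalent approach (a sketch, assuming tidyr is loaded, e.g. via the tidyverse) reshapes the regional columns to long format so a single layer can map region to colour:

#sketch: reshape the wide regional columns to long format and plot with one layer
ddos_regions_long <- ddos_regions %>%
  pivot_longer(cols = -Date, names_to = "Region", values_to = "Users")
ggplot(ddos_regions_long, aes(x = Date, y = Users, colour = Region)) +
  geom_line() +
  xlab("Dates") +
  ylab("Users") +
  ggtitle("Increase in Regional Users by Date") +
  theme_minimal()

With the long data, the non-European view below is just a matter of filtering out the Europe rows (e.g. filter(Region != "Europe")) before plotting.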

If we remove the region with by far the most users (Europe), it is easier to see how the users from the remaining regions increased over time.

Show code
ggplot(ddos_regions, aes(x = Date)) +
  geom_line(aes(y = Africa, colour = "Africa")) +
  geom_line(aes(y = Asia, colour = "Asia")) +
  geom_line(aes(y = Middle_East, colour = "Middle East")) +
  geom_line(aes(y = North_America, colour = "North America")) +
  geom_line(aes(y = Oceania, colour = "Oceania")) +
  geom_line(aes(y = South_America, colour = "South America")) +
  geom_line(aes(y = Southeast_Asia, colour = "Southeast Asia")) + 
  geom_line(aes(y = Ukraine, colour = "Ukraine")) +
  scale_colour_discrete(name = "Region") +
  xlab("Dates") +
  ylab("Users") +
  ggtitle("Increase in Non-European Users by Date") +
  theme_minimal()

And the total users over time.

Show code
ddos_time <- read_csv("daily_observations.csv", 
    col_types = cols(Date = col_date(format = "%m/%d/%Y")))
ddos_time <- as_tibble(ddos_time) 
gg <- ggplot(ddos_time, aes(x = Date)) +
  geom_line(aes(y = Total)) +
  xlab("Dates") +
  ylab("Users") +
  ggtitle("Increase in Total Users by Date") +
  theme_minimal()
gg

Population & User Data Only

I’ll start with a basic visualization of the relationship between the population of the countries and the number of users of DDOS attacks from the corresponding countries:

Show code
#create plot
ggplot(ddos_cumulative, aes(x = log(population), y = log(users), color = region)) +
  geom_point () +
  facet_wrap("region")

Linear Model of Population and Users

What I want to look at is a linear model of the relationship between each participating country’s population and the number of users observed from that country.

I’ll first simplify my data set to only contain the columns I am looking at here.

Show code
#theme_minimal_hgrid() comes from the cowplot package
library(cowplot)
pop_users <- ddos_cumulative %>% 
  select(c(population, users))
gg1 <- ggplot(pop_users, aes(x=population, y=users)) +
   geom_point() +
   geom_smooth(method=lm,se=TRUE,fullrange=TRUE,color="cornflowerblue") +
   labs(title= "Population and Users",
        x= "Population",
        y = "Users") +
    theme_minimal_hgrid()
gg1

That’s a mess. I want to take the log() of the data to get a better look at the model.

Show code
gg1b <- ggplot(pop_users, aes(x=log(population), y=log(users))) +
  geom_point() +
  geom_smooth(method=lm,se=TRUE,fullrange=TRUE,color="cornflowerblue") +
   labs(title= "Log: Population and Users",
        x= "Population (log)",
        y = "Users (log)") +
   theme_minimal_hgrid()

gg1b

On first look at this relationship, there appears to be no correlation between a country’s population and the number of users of the DDOS tool.

Show code
pop_users_lm <- lm(population~users, data = pop_users)
summary(pop_users_lm)

Call:
lm(formula = population ~ users, data = pop_users)

Residuals:
       Min         1Q     Median         3Q        Max 
 -68122950  -59609854  -52417541  -18720617 1334465479 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)   
(Intercept) 63129462   21142912   2.986  0.00359 **
users           4092      12789   0.320  0.74972   
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 199600000 on 96 degrees of freedom
Multiple R-squared:  0.001065,  Adjusted R-squared:  -0.009341 
F-statistic: 0.1024 on 1 and 96 DF,  p-value: 0.7497
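
Since the scatterplot suggested that logs give a clearer picture, a natural follow-up (a sketch, not part of the original output) is to fit the model on the logged variables, this time treating users as the response to match the research question; adding 1 to users guards against taking the log of zero:

#sketch: log-log regression with users as the response
pop_users_log_lm <- lm(log(users + 1) ~ log(population), data = pop_users)
summary(pop_users_log_lm)

The coefficient on log(population) can then be read roughly as an elasticity: the approximate percent change in users associated with a one percent change in population.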

IVS Data

The next data source I want to explore is the IVS data set (the joint WVS/EVS data described above).

Reading in Data

This brings in an overwhelming 135,000 observations of 231 variables. I selected the columns I am interested in working with and saved them as a .csv file, which I will read in for the rest of the analysis.

A full accounting of the variables and their descriptions is in the “About” tab of this GitHub Page.

To make matching easier, I used the “countrycode” package to assign proper country names based on the ISO-3 numeric codes in the data set.

Show code
#read in .dta file
#library(haven)
#ivs_data <- read_dta("data/ivs/ZA7505_v2-0-0.dta")
#head(ivs_data[33:42])
#write.csv(ivs_data, file = "./data/ivs/ivs_data.csv", row.names = TRUE)
#select relevant data
#ivs_subset <- select(ivs_data,10,34,35,40:50,106,109:114,119:138,150:162,166,188:196,199,201,210:214,222,224,225,230,231)
#ivs_df <- as.data.frame(ivs_subset)
#load package for converting country codes
#library(countrycode)
#ivs_df$country.name <- countrycode(ivs_df$cntry, origin = "iso3n", destination = "country.name")

ivs_clean <- read.csv("ivs-df-clean.csv")
ivs_clean <- as_tibble(ivs_clean)
names(ivs_clean)[1] <- 'country'
head(ivs_clean)
# A tibble: 6 x 72
  country weight imp_family imp_friends imp_leisure imp_politics
  <chr>    <dbl>      <int>       <int>       <int>        <int>
1 Albania  0.697          2           1           2            3
2 Albania  0.697          1           1           4            4
3 Albania  0.697          1           2           2            4
4 Albania  0.697          1           2           2            4
5 Albania  0.697          1           1           2            4
6 Albania  0.697          1           3           3            4
# ... with 66 more variables: imp_work <int>, imp_religion <int>,
#   sat_happiness <int>, sat_health <int>, sat_life <int>,
#   sat_control <int>, willingness_fight <int>,
#   interest_politics <int>, prop_petition <int>,
#   prop_boycotts <int>, prop_demonstrations <int>,
#   prop_strikes <int>, self_position <int>, conf_churches <int>,
#   conf_armed <int>, conf_press <int>, conf_unions <int>, ...

Transforming IVS Data

Preprocessing

In the original IVS data, some value labels are meaningless for analysis, such as “Not asked,” “NA,” and “DK,” and some responses are recorded with negative codes. I also excluded variables whose response structure does not follow the structure shared by the majority of the items.
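
A minimal sketch of how such codes can be cleaned, assuming dplyr is loaded and using the ivs_clean columns read in above (the actual recoding was done before the offline files were saved):

#sketch: recode negative "no observation" codes to NA across the numeric survey items
ivs_recoded <- ivs_clean %>%
  mutate(across(where(is.numeric), ~ replace(., . < 0, NA)))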

Grouping by Mean

There are some changes needed to make the data more manageable. I cleaned up the data by recoding all negative values, which represent the various codes for no available observation, to NA where applicable. I then took means where applicable and saved the resulting country-level means offline as a series of data sets that I will import below.

Example of how I manipulated the data before saving:

Show code
#select relevant data
#ivs_important <- select(ivs_clean,1:8)

#find mean of each column
#important <- ivs_important %>%
  #group_by(country) %>%
  #summarise(
    #family = mean(imp_family, na.rm = TRUE),
    #friends = mean(imp_friends, na.rm = TRUE),
    #leisure = mean(imp_leisure, na.rm = TRUE),
    #politics = mean(imp_politics, na.rm = TRUE),
    #work = mean(imp_work, na.rm = TRUE),
    #religion = mean(imp_religion, na.rm = TRUE)
    #)

Looking at data frames representing country means:

Show code
important <- read_csv("important.csv")
head(important)
# A tibble: 6 x 8
     X1 country   family friends leisure politics  work religion
  <dbl> <chr>      <dbl>   <dbl>   <dbl>    <dbl> <dbl>    <dbl>
1     1 Albania     1.02    1.73    2.01     3.30  1.20     2.15
2     2 Andorra     1.12    1.54    1.42     2.94  1.53     2.97
3     3 Argentina   1.09    1.54    1.81     2.81  1.47     2.39
4     4 Armenia     1.11    1.74    1.99     2.79  1.47     1.83
5     5 Australia   1.11    1.48    1.65     2.41  2.00     2.90
6     6 Austria     1.20    1.45    1.63     2.51  1.67     2.64

Matching Data

When eliminating the countries that did not have a profile in the IVS dataset from my observation data, I lost approximately 2,000 observations and was left with 67 countries to compare. I created a data frame of this information to use going forward.
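
As a rough sketch of how that merge could be reproduced, assuming the country-level mean data frames (like important above) were first combined into a single table, here called ivs_means purely for illustration, an inner join on the country name keeps only the countries present in both sources:

#sketch: keep only countries present in both the DDOS data and the IVS country means
#ivs_means is a hypothetical table combining the country-level mean data frames
all_data_sketch <- ddos_cumulative %>%
  inner_join(ivs_means, by = "country")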

Show code
all_data <- read_csv("integrated_data.csv")
head(all_data)
# A tibble: 6 x 23
  country   population region    users family friends leisure politics
  <chr>          <dbl> <chr>     <dbl>  <dbl>   <dbl>   <dbl>    <dbl>
1 Albania      3088385 Southern~    57   1.02    1.73    2.01     3.30
2 Andorra        85645 Southern~     6   1.12    1.54    1.42     2.94
3 Argentina   45864941 South Am~    11   1.09    1.54    1.81     2.81
4 Armenia      3011609 Western ~    16   1.11    1.74    1.99     2.79
5 Australia   25809973 Oceania     717   1.11    1.48    1.65     2.41
6 Austria      8884864 Western ~  3276   1.20    1.45    1.63     2.51
# ... with 15 more variables: work <dbl>, religion <dbl>,
#   willingness <dbl>, petition <dbl>, boycott <dbl>,
#   demonstration <dbl>, strikes <dbl>, identity <dbl>,
#   marital <dbl>, parents <dbl>, children <dbl>, household <dbl>,
#   education <dbl>, income <dbl>, scaled_weights <dbl>

Using Scaled Data

Normalization

Some of the variables have different value labels and maximum values, even within the same family of topics. For example, when looking at the first set of variables, which have responses on a scale of 1 to 4, I may want to rescale the user counts to the same 1-to-4 range.

Show code
#rescale() comes from the scales package
library(scales)
all_data <- read.csv("integrated_data.csv")
scale_4 <- rescale(all_data$users, to=c(1,4))
summary(scale_4)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  1.000   1.004   1.019   1.159   1.109   4.000 
Show code
scale_users_4 <- as.data.frame(scale_4)
head(scale_users_4)
   scale_4
1 1.012742
2 1.001138
3 1.002275
4 1.003413
5 1.162912
6 1.745165
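
To make the scaling transparent (a quick check rather than part of the original workflow), the same values can be reproduced by hand with the min-max formula mapped onto the interval 1 to 4:

#sketch: reproduce scales::rescale(..., to = c(1, 4)) manually
u <- all_data$users
scale_4_manual <- (u - min(u)) / (max(u) - min(u)) * (4 - 1) + 1
all.equal(scale_4, scale_4_manual)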

Linear Regression: Scaled Data

Show code
#Join scaled values of users to summary data
all_data$scaled_users <- scale_4
#Linear regression of "importance" variables + scaled user variable  
lm_imp <- lm(scaled_users ~ family + friends + leisure + politics + work + religion, data = all_data, na.action = na.exclude)
summary(lm_imp)

Call:
lm(formula = scaled_users ~ family + friends + leisure + politics + 
    work + religion, data = all_data, na.action = na.exclude)

Residuals:
     Min       1Q   Median       3Q      Max 
-0.38183 -0.13888 -0.07790  0.02971  2.60239 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  0.90748    1.14576   0.792    0.431
family       0.76647    1.01582   0.755    0.453
friends     -0.16610    0.33585  -0.495    0.623
leisure      0.03493    0.28318   0.123    0.902
politics    -0.31986    0.19966  -1.602    0.114
work         0.23951    0.32225   0.743    0.460
religion     0.03842    0.11651   0.330    0.743

Residual standard error: 0.4202 on 60 degrees of freedom
Multiple R-squared:  0.1275,    Adjusted R-squared:  0.04025 
F-statistic: 1.461 on 6 and 60 DF,  p-value: 0.2069

Linear Regression: Unscaled Data

Compare that to the unscaled user data. I’m not sure that scaling will make a meaningful difference for the regression analysis going forward.

However, as a novice user of linear models, I find it very informative to see that rescaling the outcome changes the coefficient estimates and residual standard error, but not the degrees of freedom, the R-squared, or the p-values.

Show code
#Linear regression of "importance" variables + unscaled user variable  
lm_imp2 <- lm(users ~ family + friends + leisure + politics + work + religion, data = all_data, na.action = na.exclude)
summary(lm_imp2)

Call:
lm(formula = users ~ family + friends + leisure + politics + 
    work + religion, data = all_data, na.action = na.exclude)

Residuals:
    Min      1Q  Median      3Q     Max 
-1678.1  -610.4  -342.4   130.6 11437.5 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)   -405.6     5035.6  -0.081    0.936
family        3368.6     4464.5   0.755    0.453
friends       -730.0     1476.1  -0.495    0.623
leisure        153.5     1244.6   0.123    0.902
politics     -1405.8      877.5  -1.602    0.114
work          1052.6     1416.3   0.743    0.460
religion       168.8      512.1   0.330    0.743

Residual standard error: 1847 on 60 degrees of freedom
Multiple R-squared:  0.1275,    Adjusted R-squared:  0.04025 
F-statistic: 1.461 on 6 and 60 DF,  p-value: 0.2069
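
This is expected behaviour rather than a quirk of this data set: scale_4 is a linear transformation of users, so the slope t-statistics, R-squared, F-statistic, and p-values are identical across the two fits, while the coefficient estimates, the intercept, and the residual standard error rescale. A quick sketch of that check:

#sketch: the slope t-statistics and R-squared are unchanged by the linear rescaling
all.equal(summary(lm_imp)$coefficients[-1, "t value"],
          summary(lm_imp2)$coefficients[-1, "t value"])
all.equal(summary(lm_imp)$r.squared, summary(lm_imp2)$r.squared)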

Citation

For attribution, please cite this work as

Becvar (2022, April 24). IT Army: Exploratory Analysis. Retrieved from https://kbec19.github.io/it-army/posts/exploratory-analysis/

BibTeX citation

@misc{becvar2022exploratory,
  author = {Becvar, Kristina},
  title = {IT Army: Exploratory Analysis},
  url = {https://kbec19.github.io/it-army/posts/exploratory-analysis/},
  year = {2022}
}