DACSS 603 Final Project Work: “Analyzing Data”
The primary data set is a set of observations of users of a novice “hacking tool” used to conduct DDoS (distributed denial-of-service) attacks against Russian targets in March 2022. The data contains a cumulative user count for each day of the series from March 2 through March 11, and the users represent participants from 98 countries.
I will also be using a data set of observations from the World Values Survey, conducted from 2017 to 2021 as a joint project between the World Values Survey and the European Values Study. This data was released in July 2021 and contains responses from approximately 135,000 respondents across 95 countries.
The third is a data set of media coverage (media articles and social media mentions) of the Ukrainian minister’s call for volunteers for the “IT Army of Ukraine” to help fight the Russian invasion on the digital front.
I reshaped the data into several forms to explore the best ways to analyze it.
#load packages and data
library(tidyverse)
ddos_daily <- read_csv("ddos_observations.csv")
#assign column names to represent variables accurately
colnames(ddos_daily) <- c("Country", "Population", "Region", "March2", "March3", "March4", "March5", "March6", "March7", "March8", "March9", "March10", "March11")
#summarize the data
options(scipen = 999)
head(ddos_daily)
# A tibble: 6 x 13
Country Population Region March2 March3 March4 March5 March6 March7
<chr> <dbl> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Aland 29789 Europe 1 1 1 1 1 1
2 Albania 3088385 Europe 19 22 22 23 32 44
3 Algeria 43576691 Africa 0 8 8 8 8 8
4 Andorra 85645 Europe 2 6 6 6 6 6
5 Argenti~ 45864941 South~ 9 9 9 11 11 11
6 Armenia 3011609 Asia 1 7 7 9 9 13
# ... with 4 more variables: March8 <dbl>, March9 <dbl>,
# March10 <dbl>, March11 <dbl>
The total number of DDoS users began at 7,850 on the first day of observation, March 2, 2022, and grew to 48,879 by the last day available for observation, March 11, 2022.
However, I am not going to examine the panel data; I am only going to look at the cumulative data, that is, the count of users on the last day of observations, March 11. So this looks at:
# A tibble: 6 x 4
country population region users
<chr> <dbl> <chr> <dbl>
1 Aland 29789 Europe 1
2 Albania 3088385 Europe 57
3 Algeria 43576691 Africa 10
4 Andorra 85645 Europe 6
5 Argentina 45864941 South America 11
6 Armenia 3011609 Asia 16
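The code that produced this cumulative view is not shown; a minimal sketch of how it could be derived from ddos_daily (assuming the March11 column holds the final cumulative count):
#derive the cumulative view: keep country, population, and region, and take March 11 as the final user count
ddos_cumulative <- ddos_daily %>%
  select(country = Country, population = Population, region = Region, users = March11)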
It is still important to be able to visualize the dramatic change in user count over time, even though I am not analyzing the time series in this analysis. I experimented with displaying the increase as a whole and the increase by region, starting with the regional data:
ddos_regions <- read_csv("ddos_by_region.csv",
                         col_types = cols(Date = col_date(format = "%m/%d/%Y")))
ddos_regions <- as_tibble(ddos_regions)
ddos_regions
# A tibble: 10 x 10
Date Africa Asia Europe Middle_East North_America Oceania
<date> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 2022-03-02 15 180 4863 72 1208 90
2 2022-03-03 32 419 6994 115 1723 119
3 2022-03-04 39 467 9069 137 1905 135
4 2022-03-05 59 604 17392 163 2416 177
5 2022-03-06 77 694 18447 184 2653 195
6 2022-03-07 88 867 20999 206 3057 245
7 2022-03-08 129 1143 27081 306 4028 363
8 2022-03-09 137 1171 27996 320 4245 580
9 2022-03-10 156 1308 30141 353 4548 623
10 2022-03-11 164 1443 34439 390 5245 718
# ... with 3 more variables: South_America <dbl>,
# Southeast_Asia <dbl>, Ukraine <dbl>
ggplot(ddos_regions, aes(x = Date)) +
geom_line(aes(y = Africa, colour = "Africa")) +
geom_line(aes(y = Asia, colour = "Asia")) +
geom_line(aes(y = Europe, colour = "Europe")) +
geom_line(aes(y = Middle_East, colour = "Middle East")) +
geom_line(aes(y = North_America, colour = "North America")) +
geom_line(aes(y = Oceania, colour = "Oceania")) +
geom_line(aes(y = South_America, colour = "South America")) +
geom_line(aes(y = Southeast_Asia, colour = "Southeast Asia")) +
geom_line(aes(y = Ukraine, colour = "Ukraine")) +
scale_colour_discrete(name = "Region") +
xlab("Dates") +
ylab("Users") +
ggtitle("Increase in Regional Users by Date") +
theme_minimal()
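As an aside, the same figure can be built more concisely by pivoting the regional columns to long format instead of adding one geom_line() per region; a sketch (the legend would then show the underscored column names):
#pivot the region columns to long format and draw one line per region
ddos_regions %>%
  pivot_longer(cols = -Date, names_to = "Region", values_to = "Users") %>%
  ggplot(aes(x = Date, y = Users, colour = Region)) +
  geom_line() +
  labs(title = "Increase in Regional Users by Date", x = "Dates", y = "Users") +
  theme_minimal()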
If we eliminate the region with the most users (Europe), it is easier to get an idea of how the users from the remaining regions increased over time.
ggplot(ddos_regions, aes(x = Date)) +
geom_line(aes(y = Africa, colour = "Africa")) +
geom_line(aes(y = Asia, colour = "Asia")) +
geom_line(aes(y = Middle_East, colour = "Middle East")) +
geom_line(aes(y = North_America, colour = "North America")) +
geom_line(aes(y = Oceania, colour = "Oceania")) +
geom_line(aes(y = South_America, colour = "South America")) +
geom_line(aes(y = Southeast_Asia, colour = "Southeast Asia")) +
geom_line(aes(y = Ukraine, colour = "Ukraine")) +
scale_colour_discrete(name = "Region") +
xlab("Dates") +
ylab("Users") +
ggtitle("Increase in Non-European Users by Date") +
theme_minimal()
And the total users over time.
ddos_time <- read_csv("daily_observations.csv",
                      col_types = cols(Date = col_date(format = "%m/%d/%Y")))
ddos_time <- as_tibble(ddos_time)
gg <- ggplot(ddos_time, aes(x = Date)) +
geom_line(aes(y = Total)) +
xlab("Dates") +
ylab("Users") +
ggtitle("Increase in Total Users by Date") +
theme_minimal()
gg
I’ll start with a basic visualization of the relationship between each country’s population and the number of DDoS tool users from that country:
#create plot
ggplot(ddos_cumulative, aes(x = log(population), y = log(users), color = region)) +
  geom_point() +
  facet_wrap(~region)
What I want to look at is a linear model of the relationship between the population of each country with participating users and the corresponding number of users from that country.
I’ll first simplify my data set to only contain the columns I am looking at here.
pop_users <- ddos_cumulative %>%
  select(population, users)
#theme_minimal_hgrid() comes from the cowplot package
library(cowplot)
gg1 <- ggplot(pop_users, aes(x = population, y = users)) +
  geom_point() +
  geom_smooth(method = lm, se = TRUE, fullrange = TRUE, color = "cornflowerblue") +
  labs(title = "Population and Users",
       x = "Population",
       y = "Users") +
  theme_minimal_hgrid()
gg1
That’s a mess. I want to take the log() of the data to get a better look at the relationship.
gg1b <- ggplot(pop_users, aes(x = log(population), y = log(users))) +
  geom_point() +
  geom_smooth(method = lm, se = TRUE, fullrange = TRUE, color = "cornflowerblue") +
  labs(title = "Log: Population and Users",
       x = "Population (log)",
       y = "Users (log)") +
  theme_minimal_hgrid()
gg1b
On first look at this relationship, there appears to be no correlation between a country’s population and the number of users of the DDoS tool.
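The summary below comes from a simple linear model of these two variables; a minimal sketch of the call that matches the output (the object name fit_pop is illustrative):
#fit and summarize the simple linear model reported below
fit_pop <- lm(population ~ users, data = pop_users)
summary(fit_pop)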
Call:
lm(formula = population ~ users, data = pop_users)
Residuals:
Min 1Q Median 3Q Max
-68122950 -59609854 -52417541 -18720617 1334465479
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 63129462 21142912 2.986 0.00359 **
users 4092 12789 0.320 0.74972
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 199600000 on 96 degrees of freedom
Multiple R-squared: 0.001065, Adjusted R-squared: -0.009341
F-statistic: 0.1024 on 1 and 96 DF, p-value: 0.7497
The next data source I want to explore is the IVS data set.
This brings in an overwhelming 135,000 observations of 231 variables. I selected the columns I am interested in working with and saved them as a .csv file, which I will read in for the rest of the analysis.
A full accounting of the variables and their descriptions is in the “About” tab of this GitHub Page.
To make matching easier, I used the “countrycode” package to assign proper country names to the ISO numeric country codes in the data set.
#read in .dta file
#library(haven)
#ivs_data <- read_dta("data/ivs/ZA7505_v2-0-0.dta")
#head(ivs_data[33:42])
#write.csv(ivs_data, file = "./data/ivs/ivs_data.csv", row.names = TRUE)
#select relevant data
#ivs_subset <- select(ivs_data,10,34,35,40:50,106,109:114,119:138,150:162,166,188:196,199,201,210:214,222,224,225,230,231)
#ivs_df <- as.data.frame(ivs_subset)
#load package for converting country codes
#library(countrycode)
#ivs_df$country.name <- countrycode(ivs_df$cntry, origin = "iso3n", destination = "country.name")
ivs_clean <- read.csv("ivs-df-clean.csv")
ivs_clean <- as_tibble(ivs_clean)
names(ivs_clean)[1] <- 'country'
head(ivs_clean)
# A tibble: 6 x 72
country weight imp_family imp_friends imp_leisure imp_politics
<chr> <dbl> <int> <int> <int> <int>
1 Albania 0.697 2 1 2 3
2 Albania 0.697 1 1 4 4
3 Albania 0.697 1 2 2 4
4 Albania 0.697 1 2 2 4
5 Albania 0.697 1 1 2 4
6 Albania 0.697 1 3 3 4
# ... with 66 more variables: imp_work <int>, imp_religion <int>,
# sat_happiness <int>, sat_health <int>, sat_life <int>,
# sat_control <int>, willingness_fight <int>,
# interest_politics <int>, prop_petition <int>,
# prop_boycotts <int>, prop_demonstrations <int>,
# prop_strikes <int>, self_position <int>, conf_churches <int>,
# conf_armed <int>, conf_press <int>, conf_unions <int>, ...
In the original IVS data, some value labels represent non-substantive responses such as “Not asked,” “NA,” and “DK,” and these responses are coded with negative values. I also excluded variables whose response structure did not follow the structure shared by the majority of the responses.
There are some changes needed to make the data more manageable. I cleaned up the data by recoding all of the negative values, which represent the various codes for missing observations, as NA where applicable. I then took means where applicable and saved the resulting means by country as a series of data sets offline, which I will import below.
#select relevant data
#ivs_important <- select(ivs_clean,1:8)
#find mean of each column
#important <- ivs_important %>%
#group_by(country) %>%
#summarise(
#family = mean(imp_family, na.rm = TRUE),
#friends = mean(imp_friends, na.rm = TRUE),
#leisure = mean(imp_leisure, na.rm = TRUE),
#politics = mean(imp_politics, na.rm = TRUE),
#work = mean(imp_work, na.rm = TRUE),
#religion = mean(imp_religion, na.rm = TRUE)
#)
important <- read_csv("important.csv")
head(important)
# A tibble: 6 x 8
X1 country family friends leisure politics work religion
<dbl> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1 Albania 1.02 1.73 2.01 3.30 1.20 2.15
2 2 Andorra 1.12 1.54 1.42 2.94 1.53 2.97
3 3 Argentina 1.09 1.54 1.81 2.81 1.47 2.39
4 4 Armenia 1.11 1.74 1.99 2.79 1.47 1.83
5 5 Australia 1.11 1.48 1.65 2.41 2.00 2.90
6 6 Austria 1.20 1.45 1.63 2.51 1.67 2.64
When I eliminated from my observation data the countries that did not have a profile in the IVS dataset, I lost approximately 2,000 observations, leaving 67 countries to compare. I created a data frame of this information to use going forward.
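The merge itself was done offline and the result is read back in below; a minimal sketch of the kind of join involved, assuming country names match across tables and using only the “important” summary table (the full integrated_data.csv combines several such tables and uses finer region labels):
#join the cumulative DDoS counts with one of the country-level IVS summary tables
all_data_sketch <- ddos_cumulative %>%
  inner_join(select(important, -X1), by = "country")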
all_data <- read_csv("integrated_data.csv")
head(all_data)
# A tibble: 6 x 23
country population region users family friends leisure politics
<chr> <dbl> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Albania 3088385 Southern~ 57 1.02 1.73 2.01 3.30
2 Andorra 85645 Southern~ 6 1.12 1.54 1.42 2.94
3 Argentina 45864941 South Am~ 11 1.09 1.54 1.81 2.81
4 Armenia 3011609 Western ~ 16 1.11 1.74 1.99 2.79
5 Australia 25809973 Oceania 717 1.11 1.48 1.65 2.41
6 Austria 8884864 Western ~ 3276 1.20 1.45 1.63 2.51
# ... with 15 more variables: work <dbl>, religion <dbl>,
# willingness <dbl>, petition <dbl>, boycott <dbl>,
# demonstration <dbl>, strikes <dbl>, identity <dbl>,
# marital <dbl>, parents <dbl>, children <dbl>, household <dbl>,
# education <dbl>, income <dbl>, scaled_weights <dbl>
Some of the variables have different value labels and maximum values, even within the same family of topics. For example, when looking at the first set of variables, which have responses on a scale of 1 to 4, I may want to rescale the user counts to the same 1-to-4 range.
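The rescaling step is not shown in full; a minimal sketch of one way to map the user counts onto a 1-to-4 range with a min-max transformation (the name scale_4 matches the output below; the exact formula is an assumption, though it is consistent with the summary shown):
#min-max rescale users to the 1-4 range and keep a copy on all_data for the regression below
scale_4 <- 1 + 3 * (all_data$users - min(all_data$users)) /
  (max(all_data$users) - min(all_data$users))
all_data$scaled_users <- scale_4
summary(scale_4)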
Min. 1st Qu. Median Mean 3rd Qu. Max.
1.000 1.004 1.019 1.159 1.109 4.000
scale_users_4 <- as.data.frame(scale_4)
head(scale_users_4)
scale_4
1 1.012742
2 1.001138
3 1.002275
4 1.003413
5 1.162912
6 1.745165
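The first regression fits the scaled user counts against the six “importance” means; a sketch of the call echoed in the output below (the object name fit_scaled is illustrative):
#regress scaled user counts on the six importance means
fit_scaled <- lm(scaled_users ~ family + friends + leisure + politics + work + religion,
                 data = all_data, na.action = na.exclude)
summary(fit_scaled)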
Call:
lm(formula = scaled_users ~ family + friends + leisure + politics +
work + religion, data = all_data, na.action = na.exclude)
Residuals:
Min 1Q Median 3Q Max
-0.38183 -0.13888 -0.07790 0.02971 2.60239
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.90748 1.14576 0.792 0.431
family 0.76647 1.01582 0.755 0.453
friends -0.16610 0.33585 -0.495 0.623
leisure 0.03493 0.28318 0.123 0.902
politics -0.31986 0.19966 -1.602 0.114
work 0.23951 0.32225 0.743 0.460
religion 0.03842 0.11651 0.330 0.743
Residual standard error: 0.4202 on 60 degrees of freedom
Multiple R-squared: 0.1275, Adjusted R-squared: 0.04025
F-statistic: 1.461 on 6 and 60 DF, p-value: 0.2069
Compare that to the un-scaled user data. I’m not sure that scaling will make a difference for the regression analysis going forward. However, this is very informative to me as a novice user of linear models: comparing the two outputs shows that rescaling the response changes the coefficient estimates and residual standard error, but not the degrees of freedom, adjusted R-squared, or p-values.
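The un-scaled comparison can be reproduced with the call echoed in its output (the object name fit_users is illustrative):
#regress raw user counts on the same predictors for comparison
fit_users <- lm(users ~ family + friends + leisure + politics + work + religion,
                data = all_data, na.action = na.exclude)
summary(fit_users)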
Call:
lm(formula = users ~ family + friends + leisure + politics +
work + religion, data = all_data, na.action = na.exclude)
Residuals:
Min 1Q Median 3Q Max
-1678.1 -610.4 -342.4 130.6 11437.5
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -405.6 5035.6 -0.081 0.936
family 3368.6 4464.5 0.755 0.453
friends -730.0 1476.1 -0.495 0.623
leisure 153.5 1244.6 0.123 0.902
politics -1405.8 877.5 -1.602 0.114
work 1052.6 1416.3 0.743 0.460
religion 168.8 512.1 0.330 0.743
Residual standard error: 1847 on 60 degrees of freedom
Multiple R-squared: 0.1275, Adjusted R-squared: 0.04025
F-statistic: 1.461 on 6 and 60 DF, p-value: 0.2069
For attribution, please cite this work as
Becvar (2022, April 24). IT Army: Exploratory Analysis. Retrieved from https://kbec19.github.io/it-army/posts/exploratory-analysis/
BibTeX citation
@misc{becvar2022exploratory,
  author = {Becvar, Kristina},
  title = {IT Army: Exploratory Analysis},
  url = {https://kbec19.github.io/it-army/posts/exploratory-analysis/},
  year = {2022}
}