How do you install a package in R? | install.packages("name_of_package") |
True or False: You need to "library" a package each time you start a new R session. | True |
True or False: You need to install a package every time you start a new R session. | False |
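The two cards above can be illustrated with a short sketch: install once, load in every session. The package name used here (stats) is only a stand-in chosen because it ships with R, so the install branch is never actually reached; substitute the package you need.

```r
# Stand-in package name (an assumption for illustration); "stats" ships with R
pkg <- "stats"

# Install only if the package is not already on this machine (a one-time step)
if (!requireNamespace(pkg, quietly = TRUE)) {
  install.packages(pkg)
}

# Load the package; this step is needed in every new R session
library(pkg, character.only = TRUE)
```

Checking with requireNamespace() first avoids reinstalling a package that is already available.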
Name an R package which can be used for data imputation. | imputeTS (for time series data); imputeR |
What packages can be used for data mining in R? | ggplot2: visualization
tm: text mining
stats: linear regression with the lm() function (lm is a function in the built-in stats package, not a package itself)
arules: association rules mining
caret: machine learning |
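What are scatterplots useful for and how can you create one in R? | Scatterplots show the relationship between two numeric variables, e.g. whether they tend to increase or decrease together.
Method 1, using base graphics: plot(airquality$Ozone, airquality$Wind)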
Method 2, using ggplot2: ggplot(airquality, aes(x=Ozone, y=Wind)) + geom_point() |
How do you create a boxplot in R? | boxplot(mpg ~ am, data=mtcars) |
How can the mean of a df column be calculated? | mean(mtcars$mpg) |
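One caveat worth sketching: mean() returns NA if the column contains missing values. The example below uses the built-in airquality dataset, whose Ozone column has missing values; the variable name ozone_mean is just illustrative.

```r
# airquality ships with R; its Ozone column contains missing values (NA)
mean(airquality$Ozone)   # NA, because the missing values propagate

# na.rm = TRUE drops the missing values before averaging
ozone_mean <- mean(airquality$Ozone, na.rm = TRUE)
ozone_mean
```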
What is the difference between a histogram and bar chart? | Histograms are used for continuous variables; bar graphs are used for discrete/categorical variables.
ggplot(data = mtcars,aes(x=mpg))+geom_histogram()
ggplot(data = mtcars,aes(x=gear))+geom_bar() |
What is a factor variable and how can you create one in R? | A factor variable is a variable that can take on a limited number of discrete values, i.e. a categorical variable.
mtcars$gear_factor<-as.factor(mtcars$gear) |
How do you declare a variable in R? | my_value <- 5
my_str <- "Hello world"
my_vector <- c(5,65,23,1)
names <- c("Ann", "Bob", "Clyde", "Lu")
my_df <- data.frame(names, my_vector)
my_df$names <- as.character(my_df$names)  # only needed before R 4.0, where data.frame() turned strings into factors by default |
How would you check the distribution of a categorical variable? | table(mtcars$gear_factor) |
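As a small self-contained sketch of the card above, the following builds the factor and tabulates it in one go. It stores the factor in a separate variable (gear_factor, an illustrative name) instead of adding a column to mtcars, and also shows proportions with prop.table().

```r
# Convert the numeric gear column of the built-in mtcars data to a factor
gear_factor <- as.factor(mtcars$gear)

# Counts per category: 15 cars with 3 gears, 12 with 4, 5 with 5
counts <- table(gear_factor)
counts

# The same distribution expressed as proportions that sum to 1
shares <- prop.table(counts)
round(shares, 2)
```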
What is R? | R is an open-source language for statistical computing and data science. It can be used interactively from the command line or by running R scripts, either on its own (base R) or through an integrated development environment (IDE) such as RStudio. RStudio is also available in the cloud as RStudio Cloud (since renamed Posit Cloud). |
What is the basic syntax in R? | <- is the "assignment operator," used to create variables and assign values to them (technically, = can be used for assignment too)
# at the beginning of a line of code marks that line as a comment (adding it to an existing line of code is known as "commenting it out")
name_of_function() - you can identify functions in R by the parentheses following them. The function arguments, i.e. what you want to apply the function to, go inside the parentheses. For example, mean(name_of_df_column) applies the mean() function to all numbers in a dataframe column and returns a single value, the average of those numbers
new_df <- df[df$likelihood_to_recommend == 8, ] - this is a typical way of "subsetting" a dataframe called df. Inside the square brackets, the comma separates the rows we want (before the comma) from the columns we want (after the comma). Because nothing follows the comma, new_df contains all of df's columns, but only the rows for which the likelihood_to_recommend column in df has a value of exactly 8. You can modify this condition: e.g. if you change == to >, only rows with likelihood_to_recommend values greater than 8 are included in the new dataframe.
$ - this operator is used for "getting inside" a dataframe. E.g. df$likelihood_to_recommend accesses the likelihood_to_recommend column in the df dataframe, and df$text accesses another column in that dataframe, the column called "text." |
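The syntax elements described above can be tried out together in a minimal sketch. The data frame below is made up for illustration; only the names df, new_df, likelihood_to_recommend and text come from the card, and higher_df and avg_score are hypothetical.

```r
# Made-up survey data (an assumption, not from any real dataset)
df <- data.frame(
  likelihood_to_recommend = c(8, 10, 8, 3, 9),
  text = c("good", "great", "fine", "poor", "very good")
)

# Rows where the score is exactly 8; nothing after the comma keeps all columns
new_df <- df[df$likelihood_to_recommend == 8, ]

# Changing == to > keeps only rows with scores above 8
higher_df <- df[df$likelihood_to_recommend > 8, ]

# $ pulls out one column; mean() reduces it to a single value
avg_score <- mean(df$likelihood_to_recommend)
avg_score   # 7.6
```

If the column contained missing values, the logical condition would produce NA rows in the subset, so which() or subset() may be preferable in that case.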
What are some of the advantages and disadvantages of R? | +
Open-source
Runs on all major platforms
Large and active R user community = ample online resources
Developed by statisticians specifically for data analysis
One of the top programming languages for data science
-
R holds the data it works on in memory, so its performance depends on your machine's memory resources (in particular, your RAM)
Because of that, it may be slower than Python for data-intensive operations
Some of us had difficulty loading certain packages: compatibility issues and conflicts between packages are a drawback (the example we ran into was tidyverse and ggplot2, although ggplot2 is itself one of the tidyverse packages) |
What are some common data types in R? | Logical (TRUE or FALSE)
Numeric (e.g. 5, 0.643, 1.e+9)
Character (e.g. "a", "abc", "Hello", "This is my code") |
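A quick way to check which of these types a value has is class(). The sketch below also shows the integer type (written with an L suffix), which is not listed on the card but counts as numeric.

```r
class(TRUE)       # "logical"
class(0.643)      # "numeric"
class("abc")      # "character"

# Whole numbers written with an L suffix are stored as integers
class(5L)         # "integer"
is.numeric(5L)    # TRUE: integers still count as numeric
```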
What are some common data objects in R? | Single data values (e.g. 6, 23455, "What is this?", y)
Vectors
Data frames
Matrices |
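The objects listed above can be created and inspected as in the sketch below. All variable names are illustrative. Note that in R a single value is simply a vector of length 1.

```r
# A single value is a vector of length 1
single_value <- 6
is.vector(single_value)   # TRUE
length(single_value)      # 1

# A vector holds several values of the same type
my_vector <- c(5, 65, 23, 1)

# A matrix is two-dimensional and holds a single data type
my_matrix <- matrix(1:6, nrow = 2)
dim(my_matrix)            # 2 rows, 3 columns

# A data frame is a table whose columns may have different types
my_df <- data.frame(id = 1:3, label = c("a", "b", "c"))
str(my_df)
```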
Why is R useful for data science? | R was created specifically for statistical analysis, which makes it well suited to data science work: it offers strong functionality for data cleaning, model building and evaluation, and data visualization. Some R packages, such as caret, are geared specifically towards data science. |
How do you get the name of the current working directory in R? | The working directory is the folder where R looks for a file whenever you import data. For example, if you set your Downloads folder as your working directory, you only need to supply the name of the file you want to import instead of the full path to that file (read_csv() comes from the readr package; the base R equivalent is read.csv()):
df <- read_csv("myFile.csv")
instead of:
df <- read_csv("C:\\User\\Downloads\\myFile.csv")
To see what your current working directory is, type:
getwd()
And to change it:
setwd("path\\to\\new\\working\\directory") |
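The working-directory workflow above can be tried safely with the sketch below. It is written under a few assumptions: it uses a temporary folder instead of Downloads, writes a tiny made-up file there, and reads it with base R's read.csv() so that no extra package is needed.

```r
# Remember the current working directory so it can be restored afterwards
old_wd <- getwd()

# Switch to a temporary folder (stand-in for e.g. your Downloads folder)
setwd(tempdir())

# Create a small example file in the new working directory
write.csv(data.frame(x = 1:3), "myFile.csv", row.names = FALSE)

# The file name alone is enough, because R looks in the working directory
df <- read.csv("myFile.csv")

# Restore the original working directory
setwd(old_wd)
```

Restoring the old directory at the end keeps the rest of a script from silently reading and writing in the wrong place.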
How do you access the element in the 2nd column and 4th row of a dataframe named D? | D[4,2] |
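As a sketch of the row-then-column rule, the example below builds a small made-up data frame named D (its contents are an assumption for illustration) and shows a few equivalent ways of reaching the same element.

```r
# Made-up data frame with 5 rows and 2 columns
D <- data.frame(a = 1:5, b = c(10, 20, 30, 40, 50))

D[4, 2]      # row 4, column 2 -> 40
D[4, "b"]    # same element, column selected by name
D$b[4]       # same element, via the $ operator
```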