In recent years, non-relational data have attracted increasing attention. Roughly speaking, any dataset that is hard to fit into a table of rows and columns is non-relational, or non-tabular.
R has strong support for tabular data. The built-in class data.frame is powerful enough to represent a wide variety of relational data stored in rectangular tables. Packages such as data.table and dplyr make it easier and faster to work with data frames and with derived classes that offer more features. For example, a tabular dataset may look like this:
Name | Gender | Age | Major |
---|---|---|---|
Ken | Male | 24 | Finance |
Ashley | Female | 25 | Statistics |
Jennifer | Female | 23 | Computer Science |
Such data is easy to work with because of its regularity: every column holds values of a single type, and all columns have the same length.
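As a minimal sketch of that regularity, the table above can be built as a data frame in base R. The variable name students is only an illustrative choice; the values are copied from the table.

```r
# Build the table above as a data frame (values copied from the text).
# `students` is a hypothetical name used only for this illustration.
students <- data.frame(
  Name   = c("Ken", "Ashley", "Jennifer"),
  Gender = c("Male", "Female", "Female"),
  Age    = c(24, 25, 23),
  Major  = c("Finance", "Statistics", "Computer Science"),
  stringsAsFactors = FALSE
)

# Every column is a vector of a single type ...
col_types <- sapply(students, class)

# ... and every column has the same number of values.
col_lengths <- sapply(students, length)

print(col_types)
print(col_lengths)
```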
Although dealing with data frames has been made much easier and faster than ever, we still lack native tools in R to easily deal with non-relational data like the following:
Name | Age | Interests | Expertise |
---|---|---|---|
Ken | 24 | reading, music, movies | R:2, C#:4, Python:3 |
James | 25 | sports, music | R:3, Java:2, C++:5 |
Penny | 24 | movies, reading | R:1, C++:4, Python:2 |
Clearly, the records have sets of interests of different lengths, and their sets of expertise differ as well.
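One natural way to hold such records in R is a list of lists, where each record carries fields of whatever length it needs. The sketch below uses only base R; the name people_small is a hypothetical choice, and the values are copied from the table above.

```r
# Each record is a list; Interests and Expertise vary in length and names.
# `people_small` is a hypothetical name used only for this illustration.
people_small <- list(
  list(Name = "Ken", Age = 24,
       Interests = c("reading", "music", "movies"),
       Expertise = list(R = 2, `C#` = 4, Python = 3)),
  list(Name = "James", Age = 25,
       Interests = c("sports", "music"),
       Expertise = list(R = 3, Java = 2, `C++` = 5)),
  list(Name = "Penny", Age = 24,
       Interests = c("movies", "reading"),
       Expertise = list(R = 1, `C++` = 4, Python = 2))
)

# Number of interests differs from record to record.
n_interests <- sapply(people_small, function(p) length(p$Interests))

# So do the skills each person lists.
skills <- lapply(people_small, function(p) names(p$Expertise))

print(n_interests)
print(skills)
```

A data frame has no direct place for these variable-length fields, which is why a nested list is a better fit here.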
If we are forced to squeeze the data into tabular form, we need multiple tables and a number of relationships between them. Suppose we have a longer list of people (the people.json dataset loaded below) and we want to answer the following question:
What is the age distribution for the most popular 3 interest classes of those who use both R and Python for at least one year?
Translating such a question into a SQL query for a database would take some effort. With rlist, it is easy to answer:
```r
library(rlist)
library(pipeR)

url <- "https://renkun-ken.github.io/rlist-tutorial/data/people.json"
people <- list.load(url)

people %>>%
  list.filter(Expertise$R >= 1 & Expertise$Python >= 1) %>>%
  list.class(Interests) %>>%
  list.sort(-length(.)) %>>%
  list.take(3) %>>%
  list.map(. %>>% list.table(Age))
# $music
# Age
# 21 22 23 24 25 26 27 28 29 30 31 32 33 35 36
#  1  2  1  2  5  9  6  5  9  6  4  4  1  1  1
#
# $hiking
# Age
# 21 22 23 24 25 26 27 28 29 30 31 32 33
#  2  1  1  3  4  4  6  4  7 13  5  4  2
#
# $reading
# Age
# 19 22 23 24 25 26 27 28 29 30 31 32 33 35
#  1  1  2  1  4  6  2  3  9 11  3  3  3  1
```
The code uses pipeR's %>>% operator to organize the code in a fluent style. Even if you are not familiar with the functions and operators, you can probably guess what the code does.
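Let's break it down:

1. Filter the people list by the prerequisites: using both R and Python for at least one year (list.filter).
2. Classify the remaining people by their interests (list.class).
3. Sort the interest classes by the number of people in each, in descending order (list.sort).
4. Take the top 3 classes (list.take).
5. For each class, tabulate the ages of its members (list.map with list.table).

The output is exactly the answer to the question.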
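To make the individual steps concrete, here is a rough base R sketch of the same logic, run on the three sample records from the table earlier rather than on the full dataset. It is not how rlist implements these functions, and the names people_small, selected, classes, top and age_tables are hypothetical; the order of classes of equal size may also differ from what rlist returns.

```r
# Three sample records copied from the table in the text.
people_small <- list(
  list(Name = "Ken", Age = 24,
       Interests = c("reading", "music", "movies"),
       Expertise = list(R = 2, `C#` = 4, Python = 3)),
  list(Name = "James", Age = 25,
       Interests = c("sports", "music"),
       Expertise = list(R = 3, Java = 2, `C++` = 5)),
  list(Name = "Penny", Age = 24,
       Interests = c("movies", "reading"),
       Expertise = list(R = 1, `C++` = 4, Python = 2))
)

# Step 1: keep people with at least one year of both R and Python.
selected <- Filter(function(p) {
  e <- p$Expertise
  !is.null(e$R) && !is.null(e$Python) && e$R >= 1 && e$Python >= 1
}, people_small)

# Step 2: group by interest; a person may fall into several classes.
interests <- unique(unlist(lapply(selected, function(p) p$Interests)))
classes <- lapply(setNames(interests, interests), function(i) {
  Filter(function(p) i %in% p$Interests, selected)
})

# Step 3: sort classes by size, largest first.
classes <- classes[order(-lengths(classes))]

# Step 4: take at most the top 3 classes.
top <- head(classes, 3)

# Step 5: tabulate the ages within each class.
age_tables <- lapply(top, function(cls) {
  table(Age = sapply(cls, function(p) p$Age))
})

print(age_tables)
```

Each rlist function in the pipeline replaces one of these hand-written steps, which is why the piped version stays so short.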
It should be clear now that rlist defines a collection of functions to manipulate list objects. Although each function only does a simple job, the combination of them can be very powerful. This tutorial will cover most functionality in detail and provide example solutions to commonly encountered data processing problems.