rlist Tutorial

File formats

In the previous pages, we have pointed out that rlist is designed to deal with non-tabular data, that is, data that does not well fit a tabular form. To stress the difference between them, we recall the examples we used.

The following table represents a tabular data:

Name	Gender	Age	Major
Ken	Male	24	Finance
Ashley	Female	25	Statistics
Jennifer	Female	23	Computer Science

The table can be easily stored into either text-based file or relational database. The most commonly used text-based file format to store such type of data is CSV which often uses comma (,) to divide columns. In this format, the data can be written in the following form:

Name,Gender,Age,Major
Ken,Male,24,Finance
Ashley,Female,25,Statistics
Jennifer,Female,23,Computer Science

It is obvious that each line represents a record and columns can be distinguished by comma. In R reading a csv file is simple: read.csv() can handle it easily.

However, when it comes to non-tabular data, standard CSV format and the reader functions do not handle it so well as it does with tabular data. Recall the following table representing a non-tabular dataset:

Name	Age	Interests	Expertise
Ken	24	reading, music, movies	R:2, C#:4, Python:3
James	25	sports, music	R:3, Java:2, C++:5
Penny	24	movies, reading	R:1, C++:4, Python:2

You may have to try to write a CSV file to represent the data, but the outcome would not be satisfactory: the number of values of Interests column is not fixed, and the values of Expertise column are also different in names.

Alternatively, you may also try to build a relational database to contain the data. The structure of the database, however, would be a bit tricky: More than one tables are to be created, each is restricted by one type of structure. To query the data with flexibility, one has to work with multiple tables by joining them.

JSON is a powerful format to represent such flexible data. It certainly has more notations but does not make the representation too complex. The following text is the JSON format of the table above.

[
  {
        "Name" : "Ken",
        "Age" : 24,
        "Interests" : [
            "reading",
            "music",
            "movies"
        ],
        "Expertise" : {
            "R": 2,
            "CSharp": 4,
            "Python" : 3
        }
    },
    {
        "Name" : "James",
        "Age" : 25,
        "Interests" : [
            "sports",
            "music"
        ],
        "Expertise" : {
            "R" : 3,
            "Java" : 2,
            "Cpp" : 5
        }
    },
    {
        "Name" : "Penny",
        "Age" : 24,
        "Interests" : [
            "movies",
            "reading"
        ],
        "Expertise" : {
            "R" : 1,
            "Cpp" : 4,
            "Python" : 2
        }
    }
]

You may find that the JSON text above fully replicates the information in the table but using notations such as [], {} and "key" : value. Here is a simplified introduction to these notations:

[] creates a unnamed node array.
{} creates a named node list.
"key" : value creates a key-value pair where value can be a number, a string, a [] array, or a {} list.

These notations allow the use of nested lists or arrays, just like how list object in R can be nested. Therefore, this similarity briges the use of JSON and R. rlist package imports jsonlite package to read/write JSON data.

Another file format that is also widely used is YAML. The following text is a YAML format representation (stored here) of the non-tabular data:

- Name: Ken
  Age: 24
  Interests:
  - reading
  - music
  - movies
  Expertise:
    R: 2
    CSharp: 4
    Python: 3
- Name: James
  Age: 25
  Interests:
  - sports
  - music
  Expertise:
    R: 3
    Java: 2
    Cpp: 5
- Name: Penny
  Age: 24
  Interests:
  - movies
  - reading
  Expertise:
    R: 1
    Cpp: 4
    Python: 2

Note that YAML representation is much cleaner than JSON format. rlist also imports yaml package to read/write YAML data.

In the coming tutorial pages, we will mainly use JSON data to demonstrate the features and examples of rlist package.