
How to convert a Python Jupyter notebook into an RMarkdown file


1. Introduction

NOTE: If you want to jump right into my code, go straight to section 3 (“From Jupyter to the RMarkdown world”).

Python was my first love when I started my journey in the programming world a couple of years ago, and it is still my favorite language. However, for the last few months, I’ve been more and more into R, due to work and academic reasons. And I must admit it: R is super fun too! The more I study both languages, the more certainty I have that the polyglot path in Data Analytics and Data Science is what I want for me.

As a matter of fact, programming languages should be treated simply as tools and be used depending on the task and the specific context of the problem you need to solve. Indeed, there are some nice resources available in Python that might not be present in R yet, and vice versa. I believe that the RMarkdown ecosystem is one of the resources that Python users should consider getting to know, in order to build reports like this one or this other one using only Markdown and Python (and just five lines of R code in a Google Colab).

2. Program panoramic view

I wrote a short program that aims to help a Python user with no prior R knowledge to join the RMarkdown party. So let’s start the party! The main goal here is to make this passage from one programming world to the other as easy as possible.

This program is written in Python and converts an entire Python Jupyter notebook into an RMarkdown file. Then we will open a simple Google Colaboratory notebook with an R kernel, run five lines of R code, and generate, from our RMarkdown file, documents in as many formats as we want: HTML output pages, PDF, EPUB, Microsoft Word, PowerPoint presentations, xaringan presentations, bookdown documents, and so on. All of that comes from the same single source file, which dramatically increases both reproducibility and maintainability.

Our Python code will focus on converting a Jupyter notebook to an RMarkdown file that, when rendered with the rmarkdown::render() function from R, generates an HTML output file with a nice sidebar menu and layout design. If you are interested in other types of output, I strongly recommend that you read the first chapters of R Markdown: The Definitive Guide, written with bookdown by Yihui Xie, J. J. Allaire, and Garrett Grolemund.

I must confess that I haven’t been able to find a resource in Python similar to what RMarkdown has to offer; that was the main reason behind my effort to make RMarkdown more accessible to the Python community by publishing this article. If you know any Python package that gets results similar to RMarkdown, please let me know. I would like to explore it as well.

3. From Jupyter to the RMarkdown world

Some of its users might not know it, but a Jupyter notebook is simply a JSON file that is rendered by some engine and then presented to us with that familiar interactive layout of markdown cells and code cells. All the notebook information is organized in a JSON structure, where a dictionary with four keys (cells, metadata, nbformat, and nbformat_minor) stores all the notebook data.

The cells key is definitely the most important one, since it holds the code a user has written in the notebook. The value to the cells key is a list of dictionaries, where each dictionary inside this list corresponds to one cell in the notebook.
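To make that structure concrete, here is a minimal sketch of a hand-written notebook parsed with the json module. The JSON text below is an illustrative stand-in that I made up for this example, not the content of any real notebook; it only assumes the four top-level keys and the cell layout described above.

```python
import json

# Hand-written stand-in for the raw text of a very small .ipynb file
# (illustrative only; real notebooks usually carry more metadata).
raw_notebook = """
{
 "cells": [
  {"cell_type": "markdown", "metadata": {}, "source": ["# A title\\n", "Some text."]},
  {"cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [],
   "source": ["x = 1\\n", "print(x)"]}
 ],
 "metadata": {"kernelspec": {"display_name": "Python 3", "language": "python", "name": "python3"}},
 "nbformat": 4,
 "nbformat_minor": 5
}
"""

notebook = json.loads(raw_notebook)

# The four top-level keys of the notebook dictionary.
top_level_keys = sorted(notebook)
print(top_level_keys)

# The value of "cells" is a list with one dictionary per notebook cell.
cell_types = [cell["cell_type"] for cell in notebook["cells"]]
print(cell_types)
```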

For didactic purposes only, I created a short Python Jupyter notebook called test_notebook.ipynb and uploaded it to my GitHub page. If you check its raw version here, you will notice how the JSON file is structured, and hopefully you will understand a little better what I meant in the previous two paragraphs. The test_notebook.ipynb file is the Jupyter notebook we will convert into RMarkdown later. Of course, you can use one of your own notebooks for this first test. I only advise you to make a copy of the original file beforehand, just in case.

The code to do that transformation (Jupyter to RMarkdown) is located in another notebook (ConvertToRMarkdown.ipynb), available in this GitHub repository and discussed below.

We will start by importing the json module, opening the test_notebook.ipynb file with open(), and using the json.load() function to parse the JSON content into a Python dictionary. We will save this dictionary in the data variable.

https://towardsdatascience.com/media/efaddcd5e968e8984b8fc2f5346458bb
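As a rough sketch of this loading step: the snippet below first writes a tiny stand-in notebook to a temporary folder, only so that it can run anywhere, and then reads it back with open() and json.load(). The stand-in content and the temporary path are my assumptions for illustration; with a real notebook you would point open() at your own .ipynb file.

```python
import json
import tempfile
from pathlib import Path

# Stand-in notebook content (assumption: just enough structure to be loadable).
stand_in = {
    "cells": [{"cell_type": "markdown", "metadata": {}, "source": ["# Hello"]}],
    "metadata": {"kernelspec": {"display_name": "Python 3", "language": "python", "name": "python3"}},
    "nbformat": 4,
    "nbformat_minor": 5,
}

# Write the stand-in to a temporary folder so there is a file to open.
notebook_path = Path(tempfile.mkdtemp()) / "test_notebook.ipynb"
notebook_path.write_text(json.dumps(stand_in), encoding="utf-8")

# The actual loading step: parse the JSON file into a Python dictionary.
with open(notebook_path, encoding="utf-8") as notebook_file:
    data = json.load(notebook_file)

print(type(data).__name__)
print(list(data))
```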

The only information that we will need from the metadata key is the programming language name used in the notebook kernel. We can do that with just one line of code:

https://towardsdatascience.com/media/bc64ace0b188b7a1c24a7aa1d60683e2
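A possible way to read the kernel language is sketched below. It assumes the notebook stores it under metadata → kernelspec → language, which is common in nbformat 4 notebooks; the fallback to language_info and the default value "python" are my own additions for safety and may differ from the original one-liner.

```python
# Stand-in for the dictionary returned by json.load() (illustrative only).
data = {
    "metadata": {
        "kernelspec": {"display_name": "Python 3", "language": "python", "name": "python3"},
        "language_info": {"name": "python"},
    }
}

# Direct lookup: works when the kernelspec entry carries a "language" field.
language = data["metadata"]["kernelspec"]["language"]

# More defensive variant (assumption): fall back to language_info, then to "python".
kernelspec = data["metadata"].get("kernelspec", {})
language_info = data["metadata"].get("language_info", {})
language_safe = kernelspec.get("language") or language_info.get("name", "python")

print(language, language_safe)
```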

Now we turn our attention to the cells key of the main JSON dictionary, accessed with data["cells"]. Its value is a list, and each item in this list is a dictionary representing an individual cell from the notebook.

Moreover, each cell dictionary has two keys that will be very important for us: cell_type, whose value is, in the case we are analysing, either "markdown" or "code"; and source, which has a list of strings as its value. These strings are the actual content written in the notebook cell, line by line, which we will need to assemble in the final RMarkdown file later.

I used a list comprehension to save this information into a structure of nested lists, as shown below:

https://towardsdatascience.com/media/a98cb22cc02f95e996421b4149ba8df3

During a later for loop in our code, we will save the value from x["cell_type"] in a temporary variable called cell_type, and the value from x["source"] in a temporary variable called simply source.
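A minimal sketch of that list comprehension, and of how the nested lists can be unpacked afterwards, might look like this. The data dictionary is a small stand-in I wrote for the example, and the exact expression in the original gist may differ.

```python
# Stand-in for the loaded notebook dictionary (illustrative only).
data = {
    "cells": [
        {"cell_type": "markdown", "metadata": {}, "source": ["# Title\n", "Some text."]},
        {"cell_type": "code", "metadata": {}, "outputs": [], "source": ["x = 1\n", "print(x)"]},
    ]
}

# One inner list per cell: [cell_type, source].
cells = [[x["cell_type"], x["source"]] for x in data["cells"]]

# Each inner list unpacks directly into the two temporary variables.
for cell_type, source in cells:
    print(cell_type, len(source))
```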

After defining the cells variable as the result of the list comprehension above, we will now create the mandatory RMarkdown file header. It consists of text formatted as YAML and contains all the RMarkdown configurations, such as the main title, author, date, and output types for the file (in our case, there will be only the html_document one).

In this code part, you should notice two important aspects:

  • Indentation matters a lot in YAML code, similarly to what we have in Python. So, if you miss just one space in the YAML indentation structure at the top of an RMarkdown file, you might face RMarkdown rendering problems at some point ahead. Should that happen, go back to the code, find where the mistake is, and fix it. You can always open the RMarkdown file in your chosen code editor and modify it as needed;
  • In the title, author, and date options, I inserted some HTML tags to give a personal touch to the final layout. Note that the pipe symbol is necessary in these situations.

This is not the place to give a lecture on all the different configurations one can add to this part of an RMarkdown file. If you want to know more about this subject, I recommend consulting the aforementioned R Markdown book. Most of the answers you will need on these and other topics can be found in either that book or the RMarkdown package documentation.

Below is the code to this YAML part. You can change your own author and date information directly in the variables created to store those data. We will save this long f-string into the file_text variable.

https://towardsdatascience.com/media/df17d29f183f9d41a1f5398f1b09534d
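For reference, a header built with an f-string could be sketched as follows. The placeholder values, the center tags, and the toc and toc_float options (used here to get a floating sidebar menu) are my assumptions for illustration; the actual header in the gist may use different tags and options.

```python
# Placeholder values: replace them with your own information.
title = "Converted notebook"
author = "Your Name"
date = "2020-09-15"

# YAML header as one f-string. The pipe symbol after each key starts a
# block of literal text, which lets us embed HTML tags on the next line.
# Indentation uses two spaces per level and must be kept consistent.
file_text = f"""---
title: |
  <center>{title}</center>
author: |
  <center>{author}</center>
date: |
  <center>{date}</center>
output:
  html_document:
    toc: true
    toc_float: true
---

"""

print(file_text)
```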

We barely started discussing our code and one can already see its end. Now we only need to loop through the notebook cells content and append them to the file_text variable. Then we will write this file_text string to a brand new RMarkdown file, which must have the .Rmd extension (note the capital R in it).

https://towardsdatascience.com/media/84c32b8c5d7b952e8741e4ccf2ae3e9f

Inside the for loop above, notice that the temporary variables cell_type and source we mentioned before are finally created. The cell_type value will be used to check if the current notebook cell is a markdown or a code one. If it is the former, we only need to join the strings together, using the "".join(source) method; if it is a code cell, we also need to add the chunk block structure in RMarkdown (for more details, check this part from the book mentioned earlier). The and source part in the conditional statements will make the program skip markdown and code cells with blank values (in other words, with no source information).
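The logic just described can be sketched roughly as below. The cells list, the one-line header, and the temporary output folder are stand-ins I created so that the snippet runs on its own; the file name converted_notebook.Rmd follows the one used in this article.

```python
import tempfile
from pathlib import Path

language = "python"
fence = "`" * 3  # the three backticks that open and close a code chunk

# Stand-ins for the values built in the previous steps (illustrative only).
cells = [
    ["markdown", ["# Title\n", "Some text."]],
    ["code", ["x = 1\n", "print(x)"]],
    ["code", []],  # blank cell: skipped by the "and source" checks below
]
file_text = "---\ntitle: Demo\noutput: html_document\n---\n\n"

for cell_type, source in cells:
    if cell_type == "markdown" and source:
        # Markdown cells only need their lines joined together.
        file_text += "".join(source) + "\n\n"
    elif cell_type == "code" and source:
        # Code cells are wrapped in a chunk such as {python}.
        file_text += f"{fence}{{{language}}}\n" + "".join(source) + f"\n{fence}\n\n"

# Write the result to a file with the .Rmd extension (capital R).
output_path = Path(tempfile.mkdtemp()) / "converted_notebook.Rmd"
output_path.write_text(file_text, encoding="utf-8")

print(file_text)
```

Building the fence with "`" * 3 is only a convenience to keep the backticks readable inside the string; writing them literally works just as well.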

The code chunk structure is what allows RMarkdown to run code, not only from R and Python, but also from many other languages. When we insert the word python inside the curly brackets, right after the ``` symbols that open the code chunk, we are telling RMarkdown that this is Python code and that R should use specific resources to run it (in this case, the reticulate package from R). Many other programming languages are supported by RMarkdown code chunks too, such as Julia, SQL, C, JavaScript, etc. This system also allows easy reuse of code chunks, which is a nice resource you should definitely look into someday.

Now that we have created the converted_notebook.Rmd file, we are ready to move to the Google Colaboratory environment. However, let’s make an important observation here: this next step in Google Colaboratory probably can’t be done locally with ease (just by running a magic statement to make a cell in a Python Jupyter notebook run R code, for example). You would need Pandoc installed on your machine, besides doing further configuration in R to render the RMarkdown file properly using your local Pandoc copy.

A much easier approach to rendering an RMarkdown file locally would be to install R and use the RStudio IDE, which has Pandoc integrated in it. You should definitely test this option later, since RStudio is a very interesting IDE to get to know too. Nevertheless, the quickest, best option for people who don’t have prior R knowledge is definitely moving briefly to Google Colab, as I will explain next.

4. Generating the HTML output file

Now we need to open a new Google Colab notebook with an R kernel. In order to do that, paste one of the following links into your browser. You need to log in to your Google account (or create one) to use Google Colaboratory:

Inside the Google Colaboratory notebook, upload the converted_notebook.Rmd file to the /content directory, which will be the current working directory once Colab starts. When the .Rmd file is in the right place, execute the following R code in one of the Google Colab code cells. It might take a while to finish running, since R Colab notebooks seem to be slower than the Python ones:

https://towardsdatascience.com/media/97d5e3aa261abfdef23502cdeebf4fce

install.packages() is the R equivalent of Python’s pip install. In Google Colab with R, you will need to install these packages every time you start a new session. Next, you load both packages with the library() commands, which are similar to Python import statements. Finally, the rmarkdown::render() function will convert the .Rmd file into an .html one. You will see this new file after you refresh the Google Colab files list.

5. Final thoughts

The last step will be downloading the .html file from Google Colab and opening it in your favorite browser. By doing that, you can check how different an HTML report generated from RMarkdown can look when compared to a Jupyter notebook layout, and keep learning about the RMarkdown ecosystem, if you like what you see.

The bookdown library is an especially interesting one if you want to convert long reports with code into HTML documents with navigation between pages. You can also use CSS, HTML, and JavaScript code to customize the report layout and behavior even further.

If you would like to know more about the history, evolution, and future of RMarkdown, I invite you to watch a conference talk from Yihui Xie himself, presented just a few days ago, on September 9th. Although some of the organizing professors speak a little Portuguese at the beginning and the end of the video, all of Yihui Xie’s presentation and the Q&A part are in English. I mention this video here because it was a major inspiration for me to write this article. So, I want to thank Yihui and the organizers from UFPR (Paraná Federal University / Universidade Federal do Paraná), in Brazil, for putting together their 3rd R Day and sharing their knowledge with us.

And I thank you so much, dear reader, for having honored my text with your time and attention.

Happy coding!


MBA on Data Science and Analytics at Universidade de São Paulo — USP (in progress) | Data Analyst | Python and R Programmer | Power BI Consultant
