Data visualization, i.e., the graphical representation of data, is a vital skill for every scientist to develop – aiding the interpretation of data and providing an accessible way to communicate these data to others. In the scientific world, data visualization is used to produce eye-catching figures to share results with peers and the wider community. While these visualizations are achievable without coding, they can be restricted by dataset size, plotting style and the overall cost of the software. Learning to code solves many of these issues and, while the learning curve remains a barrier to use, programming is becoming a must-have skill in many fields. Python is one of the world’s most popular programming languages and is at the forefront of data analysis and visualization, producing clear, engaging and reproducible figures in all manner of styles. As biological datasets increase in size and number, the reproducibility and flexibility of Python make it an invaluable tool for scientific data visualization. This article will introduce the use of Python for data visualization and provide some guidance on how to get started.
Data visualization is crucial to aid in understanding and interpreting data. Most readers will be familiar with the basic graphs that Excel and similar software generate, but these often aren’t enough to combine and thoroughly investigate large, complex datasets. While platforms such as Tableau allow advanced data visualization, the cost involved means they often aren’t accessible to everyone.
An alternative to using software is generating visualizations programmatically. Going this route provides many benefits, if you don’t mind learning a new skill to get there.
Most popular programming languages, such as Java and Python, are open-source, i.e., free to use, modify and distribute, so there’s no cost for you or anyone you want to share the code with. The open-source nature has also resulted in a large online community of learners and developers, meaning it is possible to learn how to code independently.
When programming, the biggest limitation for visualization is imagination as it is possible to generate all manner of styles in a couple of lines of code. Some methods will even enable you to animate, explore and/or zoom a plot. This has the added benefit of making you think deeply about your data to harness the visualization methods appropriately.
As experimental methods have increased in throughput and can generate gigabytes and even terabytes of information in one go, data visualization methods need to keep up. Luckily, dealing with large datasets is something that programming languages are built for; see Further Reading and Viewing for a live example visualizing 1 billion data points in just 30 lines of Python code.
Using programming to visualize data injects much-needed reproducibility into the process, as code can be re-run by you or others to check results and to visualize other datasets (massively cutting down analysis time). It’s possible to go one step further and integrate the visualization into an analysis pipeline, thereby automating the whole process.
Despite the many advantages, there’s no getting around it: learning how to program is challenging. However, it is a lifelong skill that is becoming increasingly necessary in many fields, not least data analysis and visualization. For those who enjoy problem-solving, programming may even end up a fulfilling exercise.
Python is one of the world’s most popular programming languages, topping the IEEE Spectrum 2020 programming languages ranking and currently third on the TIOBE index (July 2021), only 0.67% away from first place as its popularity continues to increase.
You will find the Python language underpins much of the digital world, including YouTube, Dropbox, Reddit and much of NASA’s space exploration program. Much of Python’s popularity can be attributed to the simplicity and easy readability/writability of the language. Its open-source nature has also led to a large, active community of all skill levels, with a goldmine of examples, tutorials and answers freely available on the internet.
Part of Python’s strength comes from the external libraries that can be included as needed to extend the language. Not only does this keep the core language small and easier to learn, but it also allows for the streamlining of code. This has resulted in libraries for pretty much everything, the most useful for the biological sciences being Biopython. Using Biopython’s SeqIO module means it is unnecessary to write custom code to process a FASTA nucleotide sequence every time. Data visualization is no exception: popular libraries such as Pandas, Seaborn and Matplotlib extend Python’s data analysis and visualization capabilities.
Aside from Python, two popular languages for data visualization are R and MATLAB (6th and 10th on the IEEE Spectrum 2020 ranking, respectively). Those who are familiar with them might be asking, why Python? Each language has its advantages, so perhaps unsurprisingly, this is not an easy question to answer.
The biggest difference between R and Python stems from the other uses of the languages. R was built primarily as a statistical language and Python as a general programming language. Because of this, Python is everywhere in the digital world while R is concentrated in the statistical/data analysis world, which does make Python a more transferable skill. However, when it comes to data visualization, there is now more that unites Python and R than divides them. With the libraries ggplot2 (R) and Seaborn (Python), amongst others, you should be able to achieve any data visualization method in either language. For those more familiar with MATLAB, Python’s biggest draw will be the cost. Python is open-source, i.e., free. This has led to many people transitioning to the language, sharing code and building libraries for specialist methods as they go. As an example, packages such as scikit-learn and TensorFlow have cemented Python as the lingua franca for machine learning.
Comparisons between languages could fill an article of their own, but the important point is that learning to program will benefit your (scientific) work, and you should choose the best language for you. However, you don't have to stick with one! Once you understand the thought process behind programming, it’s easier to pick up another language. For this reason, Python is a good language to learn first. Similar to how learning to drive a manual car in the UK enables you to drive an automatic but not vice versa, learning Python will give a good introduction to the programming fundamentals while being freely accessible and retaining easy readability.
What can Python do?
Due to Python’s popularity, there are many different libraries available to produce all manner of visualizations. With only a couple of lines of code needed to generate beautiful plots, your imagination is the limit.
Pandas, Matplotlib and Seaborn make up the trifecta of Python libraries vital for publication-ready data visualization. Pandas is a library for data analysis that introduces beneficial data structures like DataFrames into Python. It will also allow you to do quick visualizations to initially explore the data. The two other libraries, Matplotlib and Seaborn, are data visualization specific. Matplotlib is a library that allows basic data visualization using MATLAB-style syntax. Seaborn builds on top of Matplotlib and includes new visualization styles that improve statistical data exploration. As seen in Figure 1, Seaborn also introduces plotting themes that make the generated plots aesthetically pleasing and cohesive, and it makes it easy to encode more information into a single plot.
Because Seaborn is built on Matplotlib, it has abstracted some coding complexities, making the Seaborn syntax easier to write and understand. This simpler syntax, combined with the expanded visualization options and themes, makes Seaborn the first port of call to produce paper-worthy figures. However, these three libraries are highly interoperable, and you will use all three in your Python data visualization adventures (sometimes without realizing).
As seen in Figure 2, it is possible to generate more advanced figures using Seaborn with only one line of code or produce even more complex figures with a bit of initial data manipulation. Python Graph Gallery, linked in the further reading, is a great place to explore a whole range of Python data visualization examples.
The programmatic nature of building plots with Python allows for easy integration into analysis pipelines. The popular bioinformatics workflow manager Snakemake uses the Python language (see Figure 3), which makes it easy to slot visualization code into an existing analysis pipeline, where it could give a quick overview of the analysis results or produce paper-ready figures automatically. As a bonus, because Snakemake uses Python syntax, by learning Python you'll be well on your way with Snakemake too.
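One way to prepare plotting code for a pipeline is to wrap it in a function that takes an input table and an output path, so that any workflow step can call it. The sketch below assumes a results table with `sample` and `coverage` columns; the function name, column names and file names are all hypothetical, and the small table written at the bottom is only a stand-in for the output of an upstream analysis step.

```python
import tempfile
from pathlib import Path

import matplotlib
matplotlib.use("Agg")  # pipelines usually run without a display
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns


def plot_summary(table_path, figure_path):
    """Read a results table, write a box plot, and return the number of rows plotted."""
    results = pd.read_csv(table_path)
    fig, ax = plt.subplots(figsize=(5, 4))
    sns.boxplot(data=results, x="sample", y="coverage", ax=ax)
    ax.set_ylabel("Coverage (x)")
    fig.tight_layout()
    fig.savefig(figure_path, dpi=300)
    plt.close(fig)
    return len(results)


# Stand-in for an upstream pipeline step: write a small, made-up results table.
work_dir = Path(tempfile.mkdtemp())
table_path = work_dir / "coverage.csv"
pd.DataFrame({
    "sample": ["digester_1"] * 3 + ["digester_2"] * 3,
    "coverage": [12.1, 15.3, 14.8, 30.2, 27.9, 33.4],
}).to_csv(table_path, index=False)

# The plotting step itself: the same call works for any table with these columns.
figure_path = work_dir / "coverage.png"
n_rows = plot_summary(table_path, figure_path)
```

In a Snakemake workflow, a rule could run a script like this through its `script:` directive, with the input and output paths taken from the `snakemake.input` and `snakemake.output` objects that Snakemake provides to the script, instead of the temporary files used here.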
Another library, Plotly, allows for interactive visualizations enabling any viewer to select, filter, zoom and explore the data to their heart’s content. Along with producing them locally, these interactive plots can be deployed to the web as a dashboard using Plotly’s co-library Dash, which is a great way to share data with collaborators or the public. See further reading for an excellent example of one such dashboard.
Where to start?
Hopefully, by this point in the article, I've convinced some of you to give Python a try. There are many online courses available that will guide you through the basics and it’s worth finding an appropriate one for you. But once you grasp the basics, the best way of learning is getting stuck in, perhaps by recreating a plot you previously made.
For those just wanting to get started, some useful tools will help you on your way. Anaconda is a free, open-source distribution of Python that includes most of the software and libraries (including Pandas, Matplotlib and Seaborn – see Table 1 for a comparison of data visualization libraries) needed for data analysis and visualization in one installation with a good user interface. Software that allows you to run and edit code, known as an Integrated Development Environment (IDE), is also needed. The two most appropriate IDEs for data visualization are Jupyter and Spyder, which are included with Anaconda. Special mention goes to Google Colaboratory, which will run Python code from a browser, with no local installation or computing power needed.
The most important skill you will develop alongside programming is knowing how to ask the right questions. It is a running (albeit true) joke that most programmers would be lost without Google and StackOverflow (a popular programming forum). Googling doesn’t make someone a ‘bad’ programmer; it is just another tool that can enhance and speed up your programming – otherwise, it’s an awful lot to remember.
Using programming to produce your data visualization can be a powerful tool and it is well worth putting in the effort needed to harness it. Python is one such language that is well suited for data visualization on account of its easy readability, open community and strengths in other aspects of data analysis.■
Further Reading and Viewing
O'Donoghue, S., Gavin, A.C., Gehlenborg, N. et al. (2010) Visualizing biological data—now and in the future. Nat. Methods 7, S2–S4. DOI: 10.1038/nmeth.f.301
Thinking about errors in your code differently – https://www.codecademy.com/resources/blog/errors-in-code-think-differently/
An example visualization dashboard for drug discovery, highlighting the power of Plot.ly/Dash – https://dash-gallery.plotly.host/dash-drug-discovery/
A session from AnacondaCon 2018, “Dashboards for Visualizing 1 Billion Datapoints in 30 Lines of Python” – https://www.youtube.com/watch?v=k27MJJLJNT4
Annabel Cansdale is a Bioinformatician in Professor James Chong’s group in the Biology Department at the University of York. She specializes in microbiology, anaerobic digestion and metagenome assembly and analysis. In her day-to-day work, she develops analytical pipelines and methods to process and integrate large datasets. Email: email@example.com; Twitter: @ac_1513