DataSpell 2023.2 EAP 2 Is Out! https://blog.jetbrains.com/dataspell/2023/06/dataspell-2023-2-eap-2/ Tue, 20 Jun 2023

In the second EAP build for DataSpell 2023.2, we improved the synchronization of Jupyter notebooks with external applications.

To catch up on all of the new features DataSpell 2023.2 will bring, check out our previous EAP blog posts.

The Toolbox App is the easiest way to get the EAP builds and keep both your stable and EAP versions up to date. You can also manually download the EAP builds from our website.

Download DataSpell 2023.2 EAP

Improved Jupyter Notebooks Synchronization with External Applications

In this release, we have made significant improvements to the synchronization of Jupyter notebook changes between DataSpell and external applications such as Git or the browser version of Jupyter. You can now effortlessly switch between DataSpell and external applications, and any changes made in either will be perfectly synchronized.

We encourage you to share your feedback on the new features on Twitter or in our issue tracker, where you can also report any bugs you find in the EAP versions. 

We’re excited to hear what you think!

DataSpell 2023.2 EAP 1 Is Out! https://blog.jetbrains.com/dataspell/2023/05/2023-2-eap-1/ Tue, 16 May 2023

The Early Access Program (EAP) for DataSpell 2023.2 is now open! The EAP gives you access to pre-release versions of DataSpell, allowing you to evaluate new features, test issues that have been resolved, and provide feedback. EAP builds are free and don’t require a license. All you need is a JetBrains Account. You can learn more about how the program works in this blog post.

The first EAP build for DataSpell 2023.2 brings code completion and interactive data frames for Polars, easier access to column type information, and several bug fixes.

Download DataSpell 2023.2 EAP

Polars library support

The Polars DataFrame library has recently become more popular because of its impressive high-performance capabilities. We have taken several steps towards making Polars a first-class citizen for DataSpell.

As an initial step, we’ve added support for column-name completion in Polars functions, making it easier for you to work with the library and data in DataSpell.
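For example, in a snippet like the following (a minimal sketch with a made-up DataFrame, not taken from the release notes), DataSpell can now suggest column names inside pl.col() calls and square-bracket selections:

import polars as pl

# Hypothetical data, used only to show where column-name completion helps.
df = pl.DataFrame({"item": ["a", "b", "c"], "price": [1.5, 2.0, 3.25]})

# Column names such as "price" and "item" are now offered as completions here.
expensive = df.filter(pl.col("price") > 2.0).select(["item", "price"])
print(expensive)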

Additionally, DataSpell now offers interactive tables for Polars DataFrames, allowing you to sort, export, and view data with just a few clicks. The tables are supported both in Jupyter notebooks and Python consoles. It’s also possible to access tables via the Python and Jupyter debuggers and variable viewers, as well as with Data Vision.

You can expect more features for Polars support to be delivered in the future. Be sure to try them out!

Easier access to column type information

Accurate data analysis requires a thorough understanding of the types of data in your dataset. Fortunately, DataSpell makes this process much easier: You can quickly tell the data type of a column by simply hovering the mouse over it. This works for Pandas, Polars data frames and series, and NumPy arrays.
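Hovering is the quickest route, but if you prefer to check the same information in code, the libraries expose it directly; a minimal sketch (the variable names here are ours, not from the release):

import numpy as np
import pandas as pd
import polars as pl

pdf = pd.DataFrame({"a": [1, 2], "b": ["x", "y"]})
print(pdf.dtypes)    # pandas: the dtype of every column

pldf = pl.DataFrame({"a": [1, 2], "b": ["x", "y"]})
print(pldf.schema)   # Polars: mapping of column names to data types

arr = np.array([1.0, 2.0])
print(arr.dtype)     # NumPy: the dtype of the array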

These are the most important updates for DataSpell 2023.2 EAP 1. We encourage you to share your feedback on the new features on Twitter or in our issue tracker, where you can also report any bugs you find in the EAP versions. 

We’re excited to hear what you think!

DataSpell 2023.1.1 Is Out! https://blog.jetbrains.com/dataspell/2023/05/2023-1-1/ Thu, 04 May 2023

DataSpell 2023.1.1 provides more precise measurement of cell execution time, fixes for missing DataFrame table data, and more.

Download the new version from our website, update directly from the IDE or via the free Toolbox App, or use snaps for Ubuntu.

Download DataSpell 2023.1.1

Execution time updates

Since some Jupyter Notebook cells run for a long time, it can be useful to know the execution time of a cell. DataSpell 2023.1 displays both the last time a code cell was executed and the execution time (duration) directly below every cell. DataSpell 2023.1.1 provides more precise measurement of execution time, displaying the number of days, hours, minutes, seconds, and milliseconds it took to execute a cell, instead of a single unit like minutes.

We’ve also fixed a number of bugs related to execution time, including a bug that caused the execution time to disappear when a Jupyter Notebook file was closed and reopened [DS-4510] and one that prevented the execution time from clearing in Jupyter Notebook metadata [DS-4769].

Missing DataFrame table data

DataSpell displays pandas DataFrames in tabular form. In DataSpell 2022.3.2 and later, the table for a DataFrame was sometimes not displayed [DS-4570], or a static or truncated table was displayed. Users often encountered the error “Table data could not be loaded”.

While these issues have not been completely eliminated, the probability that users will encounter them has been greatly reduced in DataSpell 2023.1.1. 

Other notable fixes

  • When you select the output of a Jupyter Notebook cell, Copy Output, Save as, and Clear Output items are now available in the context menu that appears. Previously, you could only copy the output of a cell by manually selecting the text and using Edit | Copy from the main menu. [DS-2981]
  • The Jupyter Notebook debugger now works correctly when using a Python interpreter with WSL (Windows Subsystem for Linux) or SSH. [DS-3566]
  • DataSpell can now connect to a JupyterHub whose URL contains a prefix. [DS-4773]

Want to be the first to learn about new features and get DataSpell and data science tips? Subscribe to our blog and follow us on Twitter! If you encounter a bug or have a feature suggestion, please share it in our issue tracker.

The DataSpell Team

DataSpell 2023.1: Support for Multiple Projects, Notebook Productivity Boosters, and DataFrame Enhancements https://blog.jetbrains.com/dataspell/2023/03/2023-1/ Thu, 30 Mar 2023

DataSpell 2023.1 brings you productivity-boosting features for Jupyter Notebooks and pandas DataFrames, as well as a batch of user experience improvements.

Many DataSpell users requested the ability to organize their work into multiple, separate projects in line with other JetBrains IDEs, and in this release we have delivered! Speed up tedious tasks by automatically converting a Jupyter Notebook into a Python script and vice versa, drag and drop a CSV file to create a pandas DataFrame, view cell execution start time and duration, and more. Debugging and package management just got easier with an interactive debug console in the Jupyter Notebook debugger and a fully functional Python packages tool window.

Download the new version from our website, update directly from the IDE or via the free Toolbox App, or use snaps for Ubuntu.

Download DataSpell 2023.1

Use multiple projects or a single workspace

DataSpell 2022.3 has a single workspace, to which you can attach notebooks and other files, directories, and projects. Existing projects are attached to the workspace as directories. By default, all directories and projects in the workspace share its environment or Python interpreter. In version 2022.3, DataSpell’s workspace is, in essence, the default project. 

By popular demand, DataSpell 2023.1 enables you to organize your work into multiple, completely separate projects, each of which has its own environment or Python interpreter. This model is more in line with other JetBrains IDEs like PyCharm. Alternatively, you can continue to use a single workspace with attached directories.

Jupyter Notebook to Python script

Switching back and forth between Jupyter Notebooks and Python scripts is a common workflow in data science. You can speed up this repetitive task with a new feature to convert a Jupyter Notebook (.ipynb file) to a Python script (.py file) and vice versa in just a few clicks.

CSV file to pandas DataFrame

Another common data science task is creating and populating a pandas DataFrame with the data in a CSV file. In DataSpell 2023.1, if you drag and drop a CSV file (.csv) into a Jupyter Notebook, a pandas DataFrame will be automatically created from the contents of the file.
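The generated cell is, in effect, the usual pandas CSV read; a rough, hypothetical equivalent looks like this (the file and variable names are placeholders, and the exact code DataSpell generates may differ):

import pandas as pd

# Roughly what the drag-and-drop produces: read the CSV into a DataFrame.
df = pd.read_csv("sales.csv")
df.head()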

Set default rows displayed for DataFrames

DataSpell displays the contents of pandas DataFrames in tabular form. To browse large DataFrame tables more comfortably, set the default number of rows displayed per page to your preferred page size using the Change Default dialog. This default is then used for all new DataFrames.

Cell execution start time and duration

Since Jupyter Notebook cells are often executed out of order and some run for a long time, it can be useful to know when a cell was last run and its execution time. In DataSpell 2023.1, both the last time a code cell was executed and the duration of the execution are displayed directly below every cell.

Better Jupyter Notebook code completion

The code completion provided by Jupyter Notebook is often ineffective, failing to complete pandas DataFrame column names in many cases. To improve the DataSpell user experience, Jupyter Notebook code completion has been disabled and we will gradually implement new and improved auto-completion for the most common cases. DataSpell 2023.1 jumpstarts this effort with DataFrame column name completion, autocompletion for dynamic classes, path completion for remote Jupyter servers, and more.

Interactive debug console

In previous versions of DataSpell, the debug console was disabled when debugging Jupyter Notebooks. In DataSpell 2023.1 the debug console is interactive and can be used to send commands to the Jupyter debugger and view outputs and error messages while debugging Jupyter Notebook cells.

Manage Python packages and interpreters

The Python Packages tool window is the quickest way to manage packages and preview package documentation for a particular environment or Python interpreter. In DataSpell 2023.1, it’s fully functional. You can also now add a new Python interpreter directly from the interpreter widget in DataSpell’s status bar.

New UI

In 2022, JetBrains introduced a new UI for its IDEs that is designed to reduce visual complexity, provide easy access to essential features, and progressively disclose complex functionality. The new UI has a simplified main toolbar, a new tool window layout, an updated icon set, new light and dark color themes, and more. Enable the new UI via Settings/Preferences | Appearance & Behavior | New UI.

We hope you enjoy the new features! Want to be the first to know about new features and get DataSpell and data science tips? Subscribe to our blog and follow us on Twitter! If you encounter a bug or have a feature suggestion, please share it in our issue tracker.

DataSpell 2023.1 EAP 2 Is Out! https://blog.jetbrains.com/dataspell/2023/03/2023-1-eap-2/ Thu, 16 Mar 2023

Many DataSpell users requested the ability to organize their work into multiple projects with completely separate environments, and in DataSpell 2023.1 EAP 2 we have delivered!

This second EAP build also contains new features to convert a Jupyter Notebook into a Python script and vice versa, to drag and drop a CSV to create a Pandas DataFrame, and to change the default number of rows displayed for a DataFrame. Finally, debugging and package management just got easier with an interactive debug console in the Jupyter Notebook Debugger and a fully functional Python Packages Tool Window.

The Toolbox App is the easiest way to get EAP builds and keep your stable and EAP versions up to date. You can also manually download the EAP builds from our website.

Download DataSpell 2023.1 EAP

Support Multiple Projects With Separate Environments

In DataSpell 2022.3, the IDE has a single workspace, to which you can attach notebooks and other files, directories, and projects. Existing projects are attached to the workspace as directories. By default, all directories and projects in the workspace share the same virtual environment or Python interpreter as the workspace. DataSpell’s workspace is, in essence, the default project.

However, many users requested the ability to manage multiple projects, each of which has its own separate virtual environment or interpreter rather than inheriting it from the workspace. This model is more in line with other JetBrains IDEs like PyCharm.

In DataSpell 2023.1 EAP 2, we introduce the option to either continue using the single workspace model with attached directories or to work with multiple, completely separate projects, each of which has its own virtual environment or Python interpreter and, potentially, its own Git repository.

From DataSpell’s Welcome screen, select the Projects option in the left pane to see a list of existing projects, open projects or create new R or Python projects. You can also create and manage projects from the main File menu. 

To continue to use DataSpell with a workspace, select the Quick Start option in the left pane of the Welcome screen, configure a default environment, and click Launch DataSpell.

Convert a Jupyter Notebook to a Python Script and Vice Versa

In DataSpell 2023.1 EAP 2 you can convert a Python script (.py file) to a Jupyter Notebook (.ipynb file) and vice versa in a few clicks. 

Convert a notebook to a Python script from the file’s context menu via Refactor | Convert To Python File. Convert a Python script to a notebook via Refactor | Convert To Jupyter Notebook.

Drag and Drop a CSV to Create a Pandas DataFrame

Creating a Pandas DataFrame from the data in a CSV file is a common task in data science. In DataSpell 2023.1 EAP 2, you can drag and drop a CSV (.csv file type) into a Jupyter Notebook and a Pandas DataFrame will be automatically created from the contents of the file. 

Change the Default Number of Rows Displayed for a DataFrame

In DataSpell 2022.3, when a DataFrame is displayed in tabular form, only the first 10 rows are shown and this default cannot be changed. You can now reset the default number of rows displayed per page by clicking on the number of rows at the top of the table to get a list of page sizes and then setting the desired number of rows using the Change Default dialog.

Interactive Debug Console in the Jupyter Notebook Debugger 

In previous versions of DataSpell, the debug console was disabled when debugging Jupyter Notebooks. In DataSpell 2023.1 EAP 2, you can use the console to send commands to the Jupyter Debugger, and view outputs and error messages, while debugging Jupyter Notebook cells.

Replace Absent Kernel for a Jupyter Notebook

When the kernel listed in a Jupyter Notebook’s metadata is absent, e.g. when the Notebook has been copied from another machine, DataSpell 2023.1 EAP 2 ignores the metadata kernel and switches to using the default DataSpell kernel with the Notebook.

Notable Fixes

Cell Execution Start Time and Duration

In DataSpell 2023.1 EAP 1 we introduced a feature to show when a code cell was last executed and the duration of the cell execution. This information is displayed directly below Jupyter Notebook cells. DataSpell 2023.1 EAP 2 includes fixes which ensure that the correct cell execution duration is shown when cells are executed using Run All and when a file is closed and reopened.

  • DS-4597: Cell execution duration is wrong when execute cells using Run all
  • DS-4510: Execution time is dropped on editor reopening

Python Packages Tool Window

The Python Packages tool window is the quickest way to manage packages and preview package documentation for a particular environment or Python interpreter. DataSpell 2023.1 EAP 2 fixes an issue which caused incorrect packages to be shown for the Python interpreter of a project attached to the workspace.

The Python Packages tool window is enabled by default. You can find it in the lower group of the tool windows or open it from the main menu: Window | Tool Windows | Python Packages. You can preview package documentation in the documentation area.

  • PY-54800: “Python Packages” tool window doesn’t show packages for attached projects

These are the most important updates for DataSpell 2023.1 EAP 2. For the full list of improvements, check out the release notes. Share your feedback on the new features on Twitter or in our issue tracker, where you can also report any bugs you find in the EAP. We’re excited to hear what you think!

The DataSpell Team

Picking the Perfect Data Visualization: Barplots https://blog.jetbrains.com/dataspell/2023/03/picking-the-perfect-data-visualization-barplots/ Tue, 14 Mar 2023

This blog post is the second in a series where we explain the most common data visualization types and how you can best use them to explore your data and tell its story. In this post, we’ll cover barplots, which can give us great insight into how different groups behave relative to each other.

In this blog post, we will use the “Airline Delays from 2003–2016” dataset by Priank Ravichandar, licensed under CC0 1.0. This dataset contains information on flight delays and cancellations in US airports from 2003 to 2016. All of the code for this blog post can be found in this repo.

Barplot with a single group

Barplots (or barcharts) are ideal for contrasting how the average value of some variable varies between groups. The strength of this chart type is its simplicity. While using the average value leaves out a lot of detail, it also makes it very easy to spot differences and similarities between groups.

Let’s start by creating a barplot that shows the proportion of flights delayed for each airport over the entire data period. We’ll be using the lets-plot plotting library in Python to create each chart, which is a port of the popular ggplot2 library in R. We start, as in the last blog post, by reading the data and removing the first and last years, as they only contain partial data.

import pandas as pd
from lets_plot import *

LetsPlot.setup_html()

airlines = pd.read_csv("data/airlines.csv")
airlines = airlines[~(airlines["TimeYear"].isin([2003, 2016]))]

We now need to create a summary DataFrame that gives us the total proportion of flights that were delayed per airport over the whole time period.

flights_delayed_by_airport = (
    airlines[["AirportCode", "FlightsDelayed", "FlightsTotal"]]
    .groupby(["AirportCode"])
    .sum()
    .assign(PropFlightsDelayed=lambda x:
            x["FlightsDelayed"] / x["FlightsTotal"])
    .reset_index()
    .sort_values("PropFlightsDelayed", ascending=False)
)

We’re now ready to make our barplot. As we want to compare delays between airports, we use AirportCode for our x-axis, and the proportion of all flights that were delayed, PropFlightsDelayed, for the y-axis. In order to make the pattern a little easier to see, we’re going to turn this plot into a horizontal barplot using coord_flip().

(
    ggplot(flights_delayed_by_airport,
           aes(x="AirportCode", y="PropFlightsDelayed"))
    + geom_bar(stat="identity", fill="#b3cde3")
    + coord_flip()
    + xlab("Airport Code")
    + ylab("Flights delayed (proportion)")
    + ggtitle("Proportion of flights delayed in US airports, 2004-2015")
)

This plot allows us to get a really good sense of how airports compare in terms of how many of their flights are delayed. Most airports have less than 20% of their flights delayed over the whole data period, but there are some clear outliers. Salt Lake City (SLC) has only 15% of flights delayed, while a whopping 29% of flights at Newark (EWR) were delayed.

Barplot with multiple groups

If you want to explore your data a bit more deeply, you can also group your barplots by an additional categorical variable. This can allow you to get further insight into why groups differ from each other. For instance, we might want to know why flights are getting delayed at each airport. To make this plot, we first need to create another summary DataFrame, this time finding the proportion of flights that were delayed, broken down by airport and by the cause of the delay.

delays_by_airport_and_cause = (
    airlines[["AirportCode", "NumDelaysLateAircraft",
              "NumDelaysWeather", "NumDelaysSecurity",
              "NumDelaysCarrier", "FlightsTotal"]]
    .groupby("AirportCode")
    .sum()
    .reset_index()
)

delays_by_airport_and_cause = (
    pd.melt(delays_by_airport_and_cause,
            id_vars=["AirportCode", "FlightsTotal"],
            value_vars=["NumDelaysLateAircraft", "NumDelaysWeather",
                        "NumDelaysSecurity", "NumDelaysCarrier"],
            var_name="TypeOfDelay",
            value_name="NumberDelays")
    .assign(TypeOfDelay=lambda x: x["TypeOfDelay"].str.replace("NumDelays", ""))
    .assign(PropFlightsDelayed=lambda x: x["NumberDelays"] / x["FlightsTotal"])
    .assign(PropTypeOfDelay=lambda x: x["NumberDelays"] / x.groupby("AirportCode")["NumberDelays"].transform("sum"))
)

We can now make our plot. Since there won’t be room to fit every airport on the chart, we’ll pick five airports: Salt Lake City, Newark, Denver (DEN), New York (JFK), and San Francisco (SFO). As with the previous barplot, we include AirportCode as the x-axis variable and PropFlightsDelayed as the y-axis variable, but this time we include TypeOfDelay under the argument fill, which tells lets-plot that we want to show separate bars for each of the delay reasons.

(
    ggplot(
        delays_by_airport_and_cause[(delays_by_airport_and_cause["AirportCode"].isin(
            ["EWR", "SLC", "DEN", "JFK", "SFO"]))],
        aes(x="AirportCode", y="PropFlightsDelayed", fill="TypeOfDelay")
    )
    + geom_bar(stat="identity", position="dodge")
    + xlab("Airport Code")
    + ylab("Flights delayed (proportion)")
    + ggtitle("Proportion of flights delayed by cause in US airports, 2004-2015")
    + scale_fill_brewer(type="qual", palette="Pastel1", name="Cause of delay",
                        labels=["Late aircraft", "Weather", "Security", "Carrier"])
    + ggsize(1400, 900)
)

For all airports, late aircraft are the biggest contributor to flight delays, and security-related delays are the least common. Weather-related delays are more common in northeastern airports (EWR and JFK) compared to ones located in the western part of the country. Interestingly, although EWR has the highest overall rate of delays, it has the lowest rate of carrier-related delays compared to other airports.

Stacked barplots

The grouped barplot gives us some interesting insights, but what if we want to directly compare the proportion of delayed flights by cause between airports? It’s a bit hard to see this on the grouped barplot, but we can use another type of barplot called a stacked barplot to see this better. In this barplot, we can break down the proportion of total delays for each airport by their cause, with the understanding that each airport’s bar adds up to 100%. Let’s see how this works.

(
    ggplot(delays_by_airport_and_cause,
           aes(x="AirportCode", y="PropTypeOfDelay", fill="TypeOfDelay"))
    + geom_bar(stat="identity")
    + xlab("Airport Code")
    + ylab("Proportion of delayed flights")
    + ggtitle("Division of delayed US flights by cause, 2004-2015")
    + scale_fill_brewer(type="qual", palette="Pastel1", name="Year",
                        labels=["Late aircraft", "Weather", "Security", "Carrier"])
    + ggsize(1400, 800)
)

Now it’s much easier to spot the causes of delays between the airports. We can see that Chicago O’Hare (ORD) and Chicago Midway (MDW) are particularly affected by delays from late carriers. It also appears that airports in warmer, drier areas tend to be less affected by weather delays.

With that, we’ve covered barplots! This chart type can really help you gain insight into the relative differences between groups, and it forms the launchpad for more detailed exploration of your data. In the next blog post, we’ll look at boxplots, which allow us to capture a lot of the detail left out by barplots.

DataSpell 2022.3.3 Is Out! https://blog.jetbrains.com/dataspell/2023/03/2022-3-3/ Mon, 13 Mar 2023

DataSpell 2022.3.3 gets GitHub Copilot back on board and includes fixes for remote Jupyter issues, overenthusiastic Notebook updates and the DataSpell onboarding tour.

Download the new version from our website, update directly from the IDE, via the free Toolbox App, or use snaps for Ubuntu.

Download DataSpell 2022.3.3

Your Copilot Is Back On Board!

In DataSpell 2022.3, GitHub Copilot worked in Python script files (.py), but not with Jupyter Notebooks. The issue was caused by changes in the Jupyter editor-to-file relationship made in DataSpell 2022.2. The DataSpell team worked with the authors of the GitHub Copilot plugin for IntelliJ-based IDEs to resolve this issue in both DataSpell and the plugin. So your Copilot is back on board in DataSpell 2022.3.3!

Please upgrade to version 1.2.3.2385 or later of IntelliJ’s GitHub Copilot plugin to get this fix. Updating the plugin also fixes this issue for DataSpell 2022.3.2 and 2022.2.4. [DS-3756]

GitHub Copilot in DataSpell

Other Notable Fixes

Remote Jupyter Issues

In DataSpell 2022.3.2, Jupyter Notebook files on a remote Jupyter server were not always executable because the IDE used a local file path to the remote Notebook instead of the remote Jupyter server URL. This issue is fixed in DataSpell 2022.3.3. [DS-4351]

Overenthusiastic Notebook Updates

Jupyter Notebooks (files with the .ipynb extension) were sometimes modified by DataSpell 2022.3.2 even when they had not been opened in the IDE, e.g. by replacing \u001b with \u001B or adding lines to the end of the .ipynb file. This problem also occurred for .ipynb files in PyCharm and IntelliJ IDEA, but is fixed in DataSpell 2022.3.3. [DS-4540]

An Uninterrupted Onboarding Tour

In DataSpell 2022.3, the onboarding tour stalled for some users at the first step. The issue is fixed in DataSpell 2022.3.3. Select Help > Learn IDE Features from the main menu to start the tour. [DS-4502]

Want to be the first to know about new features and get DataSpell and data science tips? Subscribe to our blog and follow us on Twitter now! If you encounter a bug or have a feature suggestion, please share it in our issue tracker.

The DataSpell Team

Picking the Perfect Data Visualization: Line Plots https://blog.jetbrains.com/dataspell/2023/02/picking-the-perfect-data-visualization-line-plots/ Tue, 28 Feb 2023

Data visualizations are one of the most powerful tools when exploring and presenting data. However, when you first start using visualizations, it’s easy to get overwhelmed by the huge number of plots you can make. In this series of blog posts, we’ll go over five of the most commonly used visualizations, and how they can help you tell your data’s story. First up, we’ll cover line plots.

In this blog post, we’ll use the “Airline Delays from 2003–2016” dataset by Priank Ravichandar, licensed under CC0 1.0. This dataset contains information on flight delays and cancellations in US airports from 2003 to 2016. The code for this blog post can be found in this repo.

Line plots with a single group

Line plots are designed to demonstrate a trend over time. This means that on the x-axis, you’ll use some sort of datetime variable – anything from milliseconds to years. In order to show a trend, the y-axis then needs to contain a continuous variable, like the number of goods in stock, the price of an item, or a volume of water.

Let’s use a line plot to take a look at the total number of flight delays due to late aircraft across the whole period of the dataset. First, we’ll read our raw data:

import pandas as pd

airlines = pd.read_csv("data/airlines.csv")
airlines["Time"] = pd.to_datetime(airlines["TimeLabel"], infer_datetime_format=True)
airlines = airlines[~(airlines["TimeYear"].isin([2003, 2016]))]

As 2003 and 2016 only have partial data for the year, we’ve removed them from the dataset. We’ve also created an explicit datetime variable from the TimeLabel variable, which contains both the month and year.

To make our plots, we need to create a summary DataFrame that contains the number of delays by cause (late aircraft, weather, security, or carrier issues) over time.

delays_by_time_and_cause = (
   airlines[["Time", "NumDelaysLateAircraft", "NumDelaysWeather", 
             "NumDelaysSecurity", "NumDelaysCarrier"]]
   .groupby("Time")
   .sum()
   .reset_index()
)

delays_by_time_and_cause = (
   pd.melt(delays_by_time_and_cause, id_vars="Time",
           value_vars=["NumDelaysLateAircraft", "NumDelaysWeather", 
                       "NumDelaysSecurity", "NumDelaysCarrier"])
   .rename(columns={
       "variable": "TypeOfDelay",
       "value": "NumberDelays"
   })
   .assign(TypeOfDelay=lambda x: x["TypeOfDelay"].str.replace("NumDelays", ""))
)

We’re now ready to make our plot. We’ll be using the lets-plot plotting library in Python to create each chart, which is a port of the popular ggplot2 R library. We use our Time variable for the x-axis, which is in months, and on the y-axis we use NumberDelays, the total number of delays for each month.

from lets_plot import *
LetsPlot.setup_html()

(
       ggplot(
           delays_by_time_and_cause[
               delays_by_time_and_cause["TypeOfDelay"] == "LateAircraft"
           ],
           aes(x="Time", y="NumberDelays"))
       + geom_line(color="#fbb4ae", size=1)
       + scale_x_datetime()
       + xlab("Time")
       + ylab("Number of delays")
       + ggtitle("Total delays due to late aircrafts in US airports, 2004-2015")
)

From this chart, we see that the number of flight delays increased from 2004 to 2008, decreased until 2012, peaked again in 2013, and then decreased again. We can also see that there is significant seasonal variation in the delays, possibly due to inclement weather or holiday peaks putting pressure on airports.

Line plots with multiple groups

Line plots are also a great way to compare trends of two continuous variables over time. In the chart below, we’ve compared the number of delays due to late aircraft versus those due to carrier issues. We’ve used Time on the x-axis, and NumberDelays on the y-axis. However, this time, we pass TypeOfDelay to the color argument, indicating that we want to plot the delays due to late aircraft and carrier issues separately.

(
       ggplot(delays_by_time_and_cause[
                  delays_by_time_and_cause["TypeOfDelay"].isin(["LateAircraft", "Carrier"])
              ],
              aes(x="Time", y="NumberDelays", color="TypeOfDelay"))
       + geom_line(size=1)
       + scale_x_datetime()
       + xlab("Time")
       + ylab("Number of delays")
       + ggtitle("Total delays in US airport, 2004-2015")
       + scale_color_brewer(type="qual",
                            palette="Pastel1",
                            name="Type of delay",
                            labels=["Late aircraft", "Carrier"])
)

The trends of delayed flights are quite similar for both late aircraft and carrier issues over the data period, which suggests that these delay types may be linked or have a common cause.

Area plots

Area plots are related to line plots. In area plots, the space under the line is filled in, so these types of graphs are ideal when you really want to emphasize the volume or amount you’re plotting on the y-axis. They are particularly effective for contrasting differences in quantity between groups. You can see this in the area plot below, where we’ve compared how many delays occur for weather-related reasons versus those caused by late aircraft.

(
   ggplot(
       delays_by_time_and_cause[
           delays_by_time_and_cause["TypeOfDelay"].isin(
               ["LateAircraft", "Weather"])
       ].sort_values("TypeOfDelay", ascending=False),
       aes(x="Time", y="NumberDelays", fill="TypeOfDelay"))
   + geom_area(color="white")
   + scale_x_datetime()
   + xlab("Time")
   + ylab("Number of delays")
   + ggtitle("Total delays in US airports, 2004-2015")
   + scale_fill_brewer(type="qual", palette="Pastel1", name="Type of delay", 
                       labels=["Weather", "Late aircraft"])
)

The number of delays from weather is dwarfed by the number occurring because of late aircraft. This shows that, while weather-related delays tend to get the most attention, they are nowhere near as big an issue as routine delays from late aircraft.

That concludes our introduction to line plots! We’ve covered how these elegant plots can be used to show how things like price and volume change over time, and how they can be used to spot relationships between variables when divided by subgroups. In the next post, we’ll have a look at barplots.

DataSpell 2023.1: Early Access Program Is Open! https://blog.jetbrains.com/dataspell/2023/02/2023-1-eap-1/ Tue, 07 Feb 2023

The Early Access Program (EAP) for DataSpell 2023.1 is now open! The EAP gives you access to pre-release versions of DataSpell, allowing you to evaluate new features, test issues that have been resolved, and provide feedback. All EAP builds are free and don’t require a license. All you need is a JetBrains account. Learn more about how the program works in this blog post.

The first EAP build for DataSpell 2023.1 brings you Jupyter Notebook cell execution time and duration, improved code completion for Jupyter Notebooks, better Data Vision and an enhanced interpreter widget.

The Toolbox App is the easiest way to get EAP builds and keep your stable and EAP versions up to date. You can also manually download the EAP builds from our website.

Download DataSpell 2023.1 EAP

View Cell Execution Time and Duration

Since Jupyter Notebook cells are often executed out of order and some run for a long time, it can be useful to know how recently a cell was run and its execution time. DataSpell now shows when a code cell was last executed and the duration of the execution. This information is displayed directly below the cell.

Improved Code Completion in Jupyter Notebooks

The code completion provided by Jupyter Notebooks is often ineffective, e.g. it fails to complete Pandas DataFrame column names in many cases. To improve the experience of DataSpell users, Jupyter Notebook code completion has been disabled and we will gradually implement new and improved auto-completion for the most useful cases. 

DataSpell 2023.1 EAP 1 provides DataFrame column name completion, auto-completion for dynamic classes and path completion for remote Jupyter servers. 

Pandas DataFrame Column Name Completion

Better Data Vision 

Data Vision allows you to view useful information next to variables in your Jupyter notebooks. You can quickly check important metadata, such as the size of NumPy arrays or the contents of Pandas DataFrames. 

In previous releases, inline information was only available when the Jupyter variables tool window was open. In DataSpell 2023.1 EAP 1, Data Vision can be used independently of the variable view. DataFrames viewed with Data Vision also have all the enhanced interactivity features introduced in DataSpell 2022.3.

Data Vision is an optional feature. Enable it by going to Settings/Preferences | Languages & Frameworks | Jupyter and selecting Show inline values in editor.

Enhanced Interpreter Widget

It’s now possible to add a new Python interpreter directly from the interpreter widget in DataSpell’s status bar. Open the widget, select the relevant directory and then a popup will open with an option to add a new interpreter. In previous DataSpell releases, a new interpreter could only be added by selecting Project: workspace | Python Interpreter in Settings. 

Notable Fixes

Reformat Code Fixes

DataSpell lets you reformat your code according to the requirements you’ve specified in the Code Style settings. Several new fixes in DataSpell 2023.1 EAP 1 ensure that executing Reformat Code doesn’t change IPython magic commands or shell commands in Jupyter cells, thereby breaking the code.

  • DS-1584: Reformat breaks IPython magic commands
  • DS-2583: Reformat affects shell commands
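To illustrate the kind of cell these fixes protect, here is a hypothetical Jupyter cell mixing IPython magics, a shell command, and ordinary Python; after the fixes, Reformat Code leaves the %- and !-prefixed lines untouched:

# A hypothetical notebook cell: magics and a shell command mixed with Python.
%matplotlib inline
!pip install lets-plot --quiet

import pandas as pd

airlines = pd.read_csv("data/airlines.csv")
airlines.head()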

Remote Jupyter File Fixes

DataSpell 2023.1 EAP 1 contains several bug fixes related to copying files on, to, and from remote Jupyter servers. Files and directories are now correctly copied between local and remote Jupyter servers, within the same remote Jupyter server, and from one remote Jupyter server to another.

  • DS-3358: Only one file is pasted on the remote Jupyter when multiple files are copied
  • DS-318: Show copy-paste dialog when copying remote Jupyter files
  • DS-3800: No copy dialog when copying directories from/to remote Jupyter
  • DS-2734: Can’t duplicate file on a remote Jupyter server

These are the most important updates for DataSpell 2023.1 EAP 1. For the full list of improvements, check out the release notes. Share your feedback on the new features on Twitter or in our issue tracker, where you can also report any bugs you find in the EAP. We’re excited to hear what you think!

The DataSpell Team

DataSpell 2022.3.2 Is Out! https://blog.jetbrains.com/dataspell/2023/01/2022-3-2/ Fri, 27 Jan 2023

DataSpell 2022.3.2 brings you fixes to ensure that Jupyter Notebooks are consistently reloaded from disk and that your typing doesn’t get stuck in reverse. Download the new version from our website, update directly from the IDE, via the free Toolbox App, or use snaps for Ubuntu.

Download DataSpell 2022.3.2

Notebooks Reloaded  

In previous releases, Jupyter Notebook files opened in DataSpell were not always updated when the file changed outside DataSpell, for example when it was edited in another editor or IDE or updated by pulling a new version from Git. Using Reload from Disk (either from the file context menu or via a popup that opens when the file changes on disk) in these circumstances did not reliably update notebooks. In DataSpell 2022.3.2, Jupyter Notebook files update correctly to reflect the latest version on disk.

Don’t Flip it and Reverse it

In DataSpell 2022.3, characters typed in a Jupyter Notebook sometimes appeared in reverse, or from right to left, in the IDE. For example, when typing “print” in a cell, “tnirp” would appear in the cell. This issue is fixed in DataSpell 2022.3.2, so your typing will no longer get stuck in reverse.

Want to be the first to know about new features and get DataSpell and data science tips? Subscribe to our blog and follow us on Twitter now! If you encounter a bug or have a feature suggestion, please share it in our issue tracker.

The DataSpell Team

Hit the Ground Running With Pandas https://blog.jetbrains.com/dataspell/2023/01/hit-the-ground-running-with-pandas/ Mon, 23 Jan 2023

If you’re doing any work with data in Python, it’s only a matter of time before you come across pandas. This popular package provides you with many options for reading in, processing, and writing data; however, it’s not the most intuitive package to use and beginners might find it a bit overwhelming at first. In this blog post, we’ll cover the basics of what you need to know to get up and running with the powerful pandas library, including some of my favorite methods for these operations. So, let’s get you on your way!

Reading in and writing data

In order to start working with data, you obviously need to read some in. Pandas is extremely flexible in terms of the data types it can read in and write to file. While CSV (comma-separated values) is one of the most commonly used file formats, pandas is able to directly read and write many other formats, from JSON, Excel, and Parquet files to reading directly from SQL databases. Reading files is done using the “read” family of methods, while writing is done using the “to” family. For example, reading a CSV file is completed using read_csv, and the data can be written back to CSV using to_csv. Let’s take a look at a couple of examples.

Let’s say we want to read in a table in parquet format. We can use pandas’ read_parquet method and specify the engine we want to use with the engine argument:

my_data = pd.read_parquet("my_parquet_table.parquet", engine="pyarrow")

Let’s say we want to write this same data to a JSON-formatted file. We simply use the to_json method like so:

my_data.to_json("my_json_file.json")
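And for the CSV case mentioned at the start of this section, the round trip is just as short (the file names here are placeholders):

import pandas as pd

# Read a CSV file into a DataFrame...
df = pd.read_csv("my_data.csv")

# ...and write it back out, without the index column.
df.to_csv("my_data_copy.csv", index=False)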

As you can see, the basics are straightforward, and each method offers a range of arguments to customize reading and writing data to suit your needs. This excellent blog post from Real Python goes into more detail about the specifics of using these methods with a range of different file formats.

Series and DataFrames

Once you’ve read in your data, it’s time to understand how pandas stores it. Pandas has two data structures: Series and DataFrames. Pandas DataFrames can be thought of as a table containing many columns, with each column being represented as a Series.

To explore this a little further, let’s create a new DataFrame using pandas’ DataFrame constructor, as follows:

import random

import numpy as np
import pandas as pd

df = pd.DataFrame(
   {
       "col1": [random.randint(0, 100) for i in np.arange(0, 10)],
       "col2": list("aaabbbcccc"),
       "col3": [random.random() for i in np.arange(0, 10)],
       "col4": list("jjjjjkkkkl"),
   }
)

df
   col1 col2      col3 col4
0    46    a  0.701735    j
1    98    a  0.387270    j
2    88    a  0.230487    j
3    53    b  0.492995    j
4    51    b  0.711341    j
5    28    b  0.833130    k
6    87    c  0.619907    k
7    58    c  0.992311    k
8    36    c  0.892960    k
9    49    c  0.174617    l

We can see that this DataFrame is a table containing four columns with different data types: integer in column 1, text in columns 2 and 4, and float in column 3. If we select one of the columns in this DataFrame, we can see that it is a Series.

type(df["col1"])
pandas.core.series.Series

You can see we’ve selected a column using the square bracket notation, in which you put a column name as a string inside brackets next to the DataFrame name.

As you can probably guess, due to the fact that most data we work with contains multiple columns, you’ll usually be working with pandas DataFrames. However, there are occasions when you’ll need to work with a single pandas Series, so it is useful to know the distinction.
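If you do need a standalone Series, you can also construct one directly rather than pulling it out of a DataFrame; a minimal sketch:

import pandas as pd

# A standalone Series: a single column of values with an index and an optional name.
scores = pd.Series([10, 20, 30], name="score")
print(scores.mean())  # Series have their own methods, such as aggregations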

Getting an overview of your data

One of the first steps in working with data is to explore the variables, and pandas offers a number of useful ways of doing this. One of the first things I like to do when I am working with a new dataset is to check the first few rows, in order to get a feel for what the data contains. In pandas, you can do this using the head method. The n argument will allow you to specify how many rows you want to display.

df.head(n=3)
   col1 col2      col3 col4
0    46    a  0.701735    j
1    98    a  0.387270    j
2    88    a  0.230487    j

Pandas also has some nice methods for checking whether the values in your columns make sense or whether something weird might be happening in your data. The describe method gives summary statistics for all continuous variables, including the count, mean, standard deviation, minimum, maximum, and quartiles. This method automatically filters its results to continuous variables and produces a summary table containing the metrics for these columns.

df.describe()
            col1       col3
count  10.000000  10.000000
mean   59.400000   0.603675
std    23.580595   0.277145
min    28.000000   0.174617
25%    46.750000   0.413701
50%    52.000000   0.660821
75%    79.750000   0.802683
max    98.000000   0.992311

For categorical variables, we can use the value_counts method to get the frequency of each level. However, unlike describe, it can only be applied to one column at a time. Here you can see we’ve gotten the frequencies of each level of col2:

df["col2"].value_counts()
col2
c    4
a    3
b    3

Column names

You can also see the names of all of your DataFrame’s columns using the columns attribute, which outputs the column names in a list-like format:

df.columns
Index(['col1', 'col2', 'col3', 'col4'], dtype='object')

If you want to rename a column, you can use the rename method, along with the columns argument. This argument takes a dictionary, where the key indicates the old column name, and the value indicates the new column name, as in the example below where we rename col1 to column1:

df.rename(columns={"col1": "column1"})
   column1 col2      col3 col4
0       46    a  0.701735    j
1       98    a  0.387270    j
2       88    a  0.230487    j
3       53    b  0.492995    j
4       51    b  0.711341    j
5       28    b  0.833130    k
6       87    c  0.619907    k
7       58    c  0.992311    k
8       36    c  0.892960    k
9       49    c  0.174617    l

Changing values inside existing columns

You may also want to change the values of an existing column. This can easily be done by applying the Series method map to the column of choice. As with the rename method above, you pass a dictionary to this method, with the old column values being represented by keys and the updated values by the values:

df["col2"].map({"a": "d", "b": "e", "c": "f"})
  col2
0    d
1    d
2    d
3    e
4    e
5    e
6    f
7    f
8    f
9    f

Creating new columns

In order to create a new column, we use the same square bracket notation that we’ve been using so far to select existing columns. We then assign our desired value to this new column. A simple example is shown below, where we create a new column where every row has the value “1”:

df["col5"] = 1
df
   col1 col2      col3 col4  col5
0    46    a  0.701735    j     1
1    98    a  0.387270    j     1
2    88    a  0.230487    j     1
3    53    b  0.492995    j     1
4    51    b  0.711341    j     1
5    28    b  0.833130    k     1
6    87    c  0.619907    k     1
7    58    c  0.992311    k     1
8    36    c  0.892960    k     1
9    49    c  0.174617    l     1

New columns can also be created by using the values of existing columns. For example, you can combine string columns to create concatenated values, such as combining two columns containing the first and last name of a customer to create a new column containing their full name. There are other string manipulation tricks you can perform, such as creating new columns containing only substrings or lowercase versions of existing columns. You also have the option to carry out arithmetic operations between numeric columns, as seen below, where we multiply the values of col1 and col3 to get a new column containing their product:

df["col6"] = df["col1"] * df["col3"]
df
   col1 col2      col3 col4  col5       col6
0    46    a  0.701735    j     1  32.279790
1    98    a  0.387270    j     1  37.952489
2    88    a  0.230487    j     1  20.282842
3    53    b  0.492995    j     1  26.128719
4    51    b  0.711341    j     1  36.278398
5    28    b  0.833130    k     1  23.327639
6    87    c  0.619907    k     1  53.931952
7    58    c  0.992311    k     1  57.554063
8    36    c  0.892960    k     1  32.146561
9    49    c  0.174617    l     1   8.556212
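Returning to the string manipulations mentioned above, here is a minimal sketch; the first_name and last_name columns are hypothetical, built in a separate DataFrame purely for illustration:

# Hypothetical string columns, kept in a separate DataFrame so df stays unchanged.
people = pd.DataFrame({"first_name": list("ABCDEFGHIJ"),
                       "last_name": list("KLMNOPQRST")})

# Concatenate two string columns into a new one...
people["full_name"] = people["first_name"] + " " + people["last_name"]

# ...or build new columns from substrings or case-converted versions.
people["initial"] = people["first_name"].str[:1]
people["last_name_lower"] = people["last_name"].str.lower()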

Finally, you can create a new column which is conditional on the values of one or more existing columns, using the NumPy method where. Below we create a new column that is equal to 1 when col1 is more than 50, and 0 otherwise.

df["col7"] = np.where(df["col1"] > 50, 1, 0)
df
   col1 col2      col3 col4  col5       col6  col7
0    46    a  0.701735    j     1  32.279790     0
1    98    a  0.387270    j     1  37.952489     1
2    88    a  0.230487    j     1  20.282842     1
3    53    b  0.492995    j     1  26.128719     1
4    51    b  0.711341    j     1  36.278398     1
5    28    b  0.833130    k     1  23.327639     0
6    87    c  0.619907    k     1  53.931952     1
7    58    c  0.992311    k     1  57.554063     1
8    36    c  0.892960    k     1  32.146561     0
9    49    c  0.174617    l     1   8.556212     0

Filtering data

There are a huge number of ways to filter DataFrames in pandas, and for beginners it can feel quite overwhelming. While you may at times need more specific filtering options, I’ll list the methods that I use for 90% of my filtering needs.

We’re already familiar with using the square bracket notation to extract one specific column. You can extend this by passing a list of columns to filter your DataFrame to multiple columns at a time. For example, let’s say that we just want to view col1 and col4 – we just need to include them both in a list that we place inside of the square brackets:

df[["col1", "col4"]]
   col1 col4
0    46    j
1    98    j
2    88    j
3    53    j
4    51    j
5    28    k
6    87    k
7    58    k
8    36    k
9    49    l

The most common way of filtering rows in a DataFrame is by condition. We can do this by nesting a conditional statement inside of our trusty square brackets notation. Here, we filter our DataFrame to only those rows where col2 equals “a”:

df[df["col2"] == "a"]
   col1 col2      col3 col4  col5       col6  col7
0    46    a  0.701735    j     1  32.279790     0
1    98    a  0.387270    j     1  37.952489     1
2    88    a  0.230487    j     1  20.282842     1

It’s also possible to filter based on multiple conditions. In this case, each condition needs to be enclosed within brackets, as below, where we’re checking for rows where col2 equals “a” and col6 is more than 1:

df[(df["col2"] == "a") & (df["col6"] > 1)]
   col1 col2      col3 col4  col5       col6  col7
0    46    a  0.701735    j     1  32.279790     0
1    98    a  0.387270    j     1  37.952489     1
2    88    a  0.230487    j     1  20.282842     1

We can also combine column and row filtering using pandas’ loc method. Here we’ll have a look at the values of col1 and col7 where col1 is more than 50:

df.loc[df["col1"] > 50, ["col1", "col7"]]
   col1  col7
1    98     1
2    88     1
3    53     1
4    51     1
6    87     1
7    58     1

While there will be times when your task will need more specific or complex filtering, these basic filtering methods will give you a strong start to explore and manipulate your data.

Merging two DataFrames

Finally, there will be times when you need to merge two DataFrames. There are a variety of methods of doing this in pandas, but in my experience, the most flexible and intuitive to use is the merge method. To see how it works, let’s first create a second DataFrame:

df2 = pd.DataFrame({
   "field1": list("aabb"),
   "field2": df["col1"][5:9].to_list(),
})
df2
  field1  field2
0      a      28
1      a      87
2      b      58
3      b      36

Let’s say we want to merge these two DataFrames based on col1 in df and field2 in df2. We simply need to define a left table (df), a right table (df2), and the columns to use to join these tables (using the left_on and right_on arguments) in the merge method:

pd.merge(df, df2, left_on="col1", right_on="field2")
   col1 col2      col3 col4  col5       col6  col7 field1  field2
0    28    b  0.833130    k     1  23.327639     0      a      28
1    87    c  0.619907    k     1  53.931952     1      a      87
2    58    c  0.992311    k     1  57.554063     1      b      58
3    36    c  0.892960    k     1  32.146561     0      b      36

As you can see, because the fields we chose to use to join the two DataFrames have unique values in either DataFrame, we ended up with a one-to-one join. However, the merge method can also handle one-to-many, many-to-one, and many-to-many joins automatically, depending on the fields you use for the join. We can see this if we try to merge the two DataFrames on col2 and field1, which both have duplicate values in their respective DataFrames:

pd.merge(df, df2, left_on="col2", right_on="field1")
    col1 col2      col3 col4  col5       col6  col7 field1  field2
0     46    a  0.701735    j     1  32.279790     0      a      28
1     46    a  0.701735    j     1  32.279790     0      a      87
2     98    a  0.387270    j     1  37.952489     1      a      28
3     98    a  0.387270    j     1  37.952489     1      a      87
4     88    a  0.230487    j     1  20.282842     1      a      28
5     88    a  0.230487    j     1  20.282842     1      a      87
6     53    b  0.492995    j     1  26.128719     1      b      58
7     53    b  0.492995    j     1  26.128719     1      b      36
8     51    b  0.711341    j     1  36.278398     1      b      58
9     51    b  0.711341    j     1  36.278398     1      b      36
10    28    b  0.833130    k     1  23.327639     0      b      58
11    28    b  0.833130    k     1  23.327639     0      b      36

You can see that this automatically results in a many-to-many join.

The merge method is also capable of doing SQL join types. While the default join type is inner, left, right, and outer joins are all possible using the how argument. Here we’ll change the first merge we did, on col1 and field2, to a left join:

pd.merge(df, df2, left_on="col1", right_on="field2", how="left")
   col1 col2      col3 col4  col5       col6  col7 field1  field2
0    46    a  0.701735    j     1  32.279790     0    NaN     NaN
1    98    a  0.387270    j     1  37.952489     1    NaN     NaN
2    88    a  0.230487    j     1  20.282842     1    NaN     NaN
3    53    b  0.492995    j     1  26.128719     1    NaN     NaN
4    51    b  0.711341    j     1  36.278398     1    NaN     NaN
5    28    b  0.833130    k     1  23.327639     0      a    28.0
6    87    c  0.619907    k     1  53.931952     1      a    87.0
7    58    c  0.992311    k     1  57.554063     1      b    58.0
8    36    c  0.892960    k     1  32.146561     0      b    36.0
9    49    c  0.174617    l     1   8.556212     0    NaN     NaN

You can see that we’ve kept all values from our left table, df, and have missing values where those values don’t exist in our right table, df2, as expected.

Finally, as well as being able to merge on columns, pandas’ merge method also allows you to merge DataFrames using their indices. To do so, you simply need to use the left_index and right_index arguments instead of left_on and right_on, passing True as the value:

pd.merge(df, df2, left_index=True, right_index=True)
   col1 col2      col3 col4  col5       col6  col7 field1  field2
0    46    a  0.701735    j     1  32.279790     0      a      28
1    98    a  0.387270    j     1  37.952489     1      a      87
2    88    a  0.230487    j     1  20.282842     1      b      58
3    53    b  0.492995    j     1  26.128719     1      b      36

Here, we’ve merged the first 4 rows of each table using their index values.

As you can see from this tutorial, you can complete a large variety of data exploration and manipulation tasks with only a handful of commands in pandas. I hope this has helped show you that getting up and running with pandas is relatively straightforward, and will boost your confidence when working with data in Python.

Get some hands-on practice with pandas

Continue your journey of learning pandas in DataSpell, where we’ll guide you through how to read in and analyze a dataset of airline delays and cancellations in pandas.

Get started!

DataSpell 2022.3.1 Is Out! https://blog.jetbrains.com/dataspell/2022/12/2022-3-1/ Thu, 22 Dec 2022

DataSpell 2022.3.1 brings you a fix for interpreter widget woes, an uninterrupted onboarding tour, simplified settings sync and code completion for column names in Python scripts.

Download the new version from our website, update directly from the IDE, via the free Toolbox App, or use snaps for Ubuntu.

Download DataSpell 2022.3.1

Simplified Settings Sync 

Want to use the same settings in all your JetBrains IDEs? In DataSpell 2022.3.1, the Settings Sync plugin synchronizes UI settings, keymaps, code style, color schemes, and bundled plugins across all IDE instances connected to your JetBrains account. The Settings Sync plugin replaces both the IDE Settings Sync and Settings Repository plugins.

Guess That Column Name (in Python Scripts)

Code completion will now guess column names for Pandas DataFrames in Python scripts. When running scripts against interactive Python consoles, column names will be suggested when using period or bracket notations.
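For instance, in a hypothetical script like the one below, run against an interactive Python console, DataSpell can suggest the column name in both notations once the DataFrame exists (the file and column names are placeholders):

import pandas as pd

# Hypothetical file and column names, used only to show both notations.
df = pd.read_csv("sales.csv")

total = df["revenue"].sum()   # bracket notation
average = df.revenue.mean()   # period (attribute) notation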

No More Widget Woes 

The interpreter widget shows the interpreter configured for the attached folder. The correct interpreter is now displayed even when the user works in the terminal or with a Python console. Previously, the widget sometimes showed the wrong interpreter since it reacted only to the editor in focus or workspace file/folder selection.

Take a Magical DataSpell Onboarding Tour

Our onboarding tour stalled at step five in DataSpell 2022.3. You can now complete the full DataSpell onboarding tour. Select Help > Learn IDE Features from the main menu to start the tour. Bon Voyage!

Want to be the first to know about new features and get DataSpell and data science tips? Subscribe to our blog and follow us on Twitter now!

If you encounter a bug or have a feature suggestion, please share it in our issue tracker.

The DataSpell Team
