The Big Data Tools Blog: A data engineering plugin | The JetBrains Blog
https://blog.jetbrains.com

Big Data Tools 2023.1 Is Out!
https://blog.jetbrains.com/big-data-tools/2023/04/04/big-data-tools-2023-1-is-out/ (Tue, 04 Apr 2023)

Our latest release includes several new features and improvements based on feedback from our users. In this release, we've added integration with the Kafka Schema Registry, Kerberos authentication, and extended support for all cloud storages in Big Data Tools. Read on to learn about the most important changes in the Big Data Tools plugin, or try it right now by installing it in IntelliJ IDEA Ultimate, PyCharm Professional, DataSpell, or DataGrip 2023.1.

New: Kafka Schema Registry connection

In response to numerous requests, we have integrated the Kafka Schema Registry connection into the Big Data Tools plugin 2023.1. The Schema Registry defines the data structure and helps keep your Kafka applications synchronized with data changes. With this latest update, you can explore Kafka topics serialized in Avro or Protobuf directly from the IDE.

Kafka Schema Registry support

We also enabled a connection to Schema Registry via a secure SSH tunnel to help you consume and produce messages from your local machine.
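In case you haven't used the Schema Registry before, here is a minimal sketch of what a registry-aware Avro consumer typically looks like with the confluent-kafka Python client (a recent client version is assumed). The broker address, registry URL, topic, and group ID below are placeholders, not values from this release.

```python
from confluent_kafka import DeserializingConsumer
from confluent_kafka.schema_registry import SchemaRegistryClient
from confluent_kafka.schema_registry.avro import AvroDeserializer

# Hypothetical endpoints -- substitute your own broker and registry.
schema_registry = SchemaRegistryClient({"url": "http://localhost:8081"})

# The deserializer fetches the writer schema by the ID embedded in each message.
avro_deserializer = AvroDeserializer(schema_registry)

consumer = DeserializingConsumer({
    "bootstrap.servers": "localhost:9092",
    "group.id": "bdt-demo",
    "auto.offset.reset": "earliest",
    "value.deserializer": avro_deserializer,
})
consumer.subscribe(["orders"])  # hypothetical topic

msg = consumer.poll(10.0)
if msg is not None and msg.error() is None:
    print(msg.value())  # a dict built from the registered Avro schema
consumer.close()
```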

More convenient work with cloud storages

We have aligned the feature set across all supported remote file storages (such as Amazon S3, Google Cloud Storage, and Microsoft Azure Blob Storage). Here are the most notable improvements:

  • You can now view and manage versions of bucket objects right in your IDE. 
  • If you use object or bucket tagging to categorize storage, you can now view and modify tags without leaving your editor window.
  • The new contextual search feature lets you quickly locate a specific bucket in your cloud storage by typing relevant keywords (the sketch after this list shows the equivalent API calls).
Cloud storages support
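As a rough illustration of what the IDE is surfacing here, the boto3 snippet below lists object versions and reads and writes tags on an S3 object. The bucket and key names are made up for the example.

```python
import boto3

s3 = boto3.client("s3")
bucket, key = "my-data-bucket", "raw/events.parquet"  # hypothetical names

# Object versions (requires versioning to be enabled on the bucket).
for v in s3.list_object_versions(Bucket=bucket, Prefix=key).get("Versions", []):
    print(v["VersionId"], v["LastModified"], v["IsLatest"])

# Read the existing tags, then add one and write the set back.
tags = s3.get_object_tagging(Bucket=bucket, Key=key)["TagSet"]
tags.append({"Key": "team", "Value": "analytics"})
s3.put_object_tagging(Bucket=bucket, Key=key, Tagging={"TagSet": tags})
```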

Extended Hive Metastore integration

The plugin now supports both Hive 3 and Hive 2, allowing users to preview and understand their data within the IDE. You can open the whole Hive Metastore, or individual catalogs, databases, and tables, in a separate editor tab.

Hive Metastore integration
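If you'd rather inspect the same metadata programmatically, a Hive-enabled Spark session exposes it through the catalog API. This is a minimal sketch; the database and table names are assumptions.

```python
from pyspark.sql import SparkSession

# Requires a Spark build with Hive support and a reachable metastore.
spark = SparkSession.builder.enableHiveSupport().getOrCreate()

# Databases known to the metastore.
for db in spark.catalog.listDatabases():
    print(db.name)

# Tables in a hypothetical "sales" database, with their type (MANAGED/EXTERNAL).
for t in spark.catalog.listTables("sales"):
    print(t.name, t.tableType)

# Column names and types as recorded in the metastore.
spark.table("sales.orders").printSchema()
```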

Improvements to Apache Zeppelin integration

If you are working with a Zeppelin notebook in your IDE, you can extract Spark job code into a Scala file to continue working on it in an IDE project. In this release, we have added options to extract both selected Scala code and code from paragraphs.

Extract Zeppelin paragraph

We’ve added code completion for PySpark paragraphs in Zeppelin, which completes column names by providing a list of the inferred columns from the DataFrame.
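To illustrate, this is the kind of PySpark paragraph the completion targets: once the DataFrame's schema has been inferred, column names (here the hypothetical `pickup_ts` and `fare`) become completion candidates. The file path and column names are placeholders.

```python
# %pyspark  -- a typical Zeppelin paragraph; `spark` is predefined by the interpreter.
df = (spark.read
      .option("header", "true")
      .option("inferSchema", "true")
      .csv("s3a://my-bucket/trips.csv"))  # hypothetical path

# After schema inference, column names like "pickup_ts" and "fare"
# can be suggested in expressions such as these:
daily = df.groupBy("pickup_ts").sum("fare")
daily.show()
```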

We’ve also consolidated Dependencies and Interpreter Settings into a single window, making it easier to find the necessary settings.

Kerberos authentication

It is now possible to connect to Kafka, HDFS, and Hive Metastore by using Kerberos authentication.

Kerberos authentication
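For context, here is a hedged sketch of what Kerberos (GSSAPI) authentication looks like on a plain Kafka client, using librdkafka property names through the confluent-kafka Python client. The principal, keytab path, and broker address are assumptions; the plugin configures the equivalent settings through its connection UI.

```python
from confluent_kafka import Consumer

# Requires librdkafka built with SASL/GSSAPI support and a reachable KDC.
consumer = Consumer({
    "bootstrap.servers": "broker.example.com:9092",      # hypothetical broker
    "group.id": "bdt-kerberos-demo",
    "security.protocol": "SASL_PLAINTEXT",
    "sasl.mechanisms": "GSSAPI",
    "sasl.kerberos.service.name": "kafka",
    "sasl.kerberos.principal": "analyst@EXAMPLE.COM",     # hypothetical principal
    "sasl.kerberos.keytab": "/etc/security/keytabs/analyst.keytab",
})
consumer.subscribe(["events"])
```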

Single authorization in AWS services

We’ve added the ability to share AWS authorizations across AWS S3, AWS Glue, and AWS EMR connections. This eliminates the need to repeatedly enter keys or perform MFA authentication for each connection.
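Conceptually, this mirrors how a single boto3 session (and therefore a single credential or MFA prompt) can back several service clients. The profile name below is an assumption.

```python
import boto3

# One session -> one set of credentials (and one MFA prompt, if required).
session = boto3.Session(profile_name="data-eng")  # hypothetical profile

s3 = session.client("s3")
glue = session.client("glue")
emr = session.client("emr")

print(s3.list_buckets()["Buckets"][:3])
print(glue.get_databases()["DatabaseList"][:3])
print(emr.list_clusters()["Clusters"][:3])
```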

Viewing big data binary files

With the Big Data Tools plugin, you can conveniently preview the content of big data file formats without leaving your IDE. The latest release enables the opening of Parquet files that use compression methods such as zstd, Brotli, and LZ4.
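To produce a file like the ones the viewer can now open, you can write Parquet with these codecs using pyarrow; a minimal sketch with made-up data:

```python
import pyarrow as pa
import pyarrow.parquet as pq

table = pa.table({"id": [1, 2, 3], "value": [0.1, 0.2, 0.3]})

# Any of "zstd", "brotli", or "lz4" works as the compression codec here.
pq.write_table(table, "events.zstd.parquet", compression="zstd")

print(pq.read_table("events.zstd.parquet").to_pydict())
```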

For the full list of new features and enhancements, see the changelog on the Big Data Tools plugin page. Please share your feedback with us and report any issues you encounter to our issue tracker.

The Big Data Tools team

View the Current State of Variables in Zeppelin Notebooks
https://blog.jetbrains.com/big-data-tools/2023/01/05/view-the-current-state-of-variables-in-zeppelin-notebooks/ (Thu, 05 Jan 2023)

Do you remember when, a long time ago, we introduced a feature in the Big Data Tools plugin called ZTools that allows you to view the current state of variables in a Zeppelin notebook? And do you remember that we made significant changes to the ZTools implementation almost a year ago?

It’s our tradition to make big announcements about this part of Big Data Tools yearly, and today we’re announcing multiple exciting changes. If you’d rather just check them out  yourself, here is the link to the plugin:

New name

Most importantly, this feature is now called “State Viewer”. We decided to change the name because State Viewer can implement variable viewing for other notebooks too, not just for Zeppelin. There were a few other names being considered, and the choice wasn’t easy. 

Here is how it looks:

In the same picture, you can see that the State Viewer is now enabled by default!

Thick server to thick client

The first significant change we made was the move from a “thick server” to a “thick client” model. This means that the IDE no longer needs to install anything into Zeppelin’s interpreter, which is beneficial when you do not fully control the Zeppelin instance.

How does it work? Well, these changes are evolutionary rather than revolutionary. Previously, State Viewer used background cell execution only to call a method from a library we installed into the Zeppelin interpreter. Now the plugin works differently: it sends the data-gathering logic to Zeppelin, runs it there, and displays the results in the State Viewer window.

The code of the hidden (disappearing) cell is quite impressive too. It starts like this:

Don’t take our word for it – check out the code yourself! To do so, enable “Debug mode” in the settings:

This leads us to the following significant change: the State Viewer settings have been revamped.

State Viewer settings

The fully expanded view of State Viewer settings looks like this:

It’s absolutely massive, right?

As we can see in the previous screenshot, it consists of 3 main parts:

1. Common Interpreter Settings

2. Variable Introspector settings

3. SQL Introspector settings

Hopefully the name “Common Interpreter Settings” is self-explanatory. You can enable or disable the collection of info about variables from Spark SQL, and you can also enable debug mode. Admittedly, debug mode is most useful for us, the plugin developers, but you can use it to get a deeper understanding of the plugin’s inner workings.

Variable Introspector settings

Valuable feedback from our customers has given us an understanding of exciting corner cases when introspecting variables. Variable Introspector Settings addresses all of the potential issues you might encounter. Need to introspect longer texts? Increase the limit of the String size. Now it takes more time to extract them all? Increase the timeout while waiting for Zeppelin to answer. Have to work with deeply nested structures? You can fine-tune the maximum dig depth.

SQL Introspector settings

We didn’t even realize how useful the SQL Introspector would be for some of our customers. Their feedback showed us how complex usage scenarios can get across lots of schemas and tables, which led us to the next exciting solution:

Depending on the size of your Spark database, you can use different strategies for pulling changes from it. Sometimes databases are so extensive that you may decide not to pull changes automatically at all.

All of these changes led us to switch on State Viewer by default in this release.

We also learned that when our customers work with Spark, they work with more than just the default catalog. Sometimes, the “default” catalog is absolutely massive, and they need to filter data from it somehow. That’s why we introduced the following filter in the SQL Introspector:

Here you can filter and add more catalogs to search for autocompletion data, as well as limit the subset of tables queried inside a catalog.

If you use AWS Glue or Hive Metastore, you might also find this checkbox useful:

Your company’s Glue is likely to be incredibly large, and you only need the data you put there during your session.

State Viewer is now enabled by default

We’ve introduced many improvements since our last blog post on this feature. We are very grateful to our customers for their ongoing support and continuous feedback – it was possible to implement many changes because of them! We are now confident that Big Data Tools is stable and flexible enough for our general audience to use it day to day, for these reasons:

1. It no longer requires changes in Zeppelin.

2. It allows you to fine-tune variable introspection to your use case.

3. It allows you to tune SQL introspection according to the context and complexity of your tasks.

4. The mechanics of hidden cells have been improved.

5. We give our users a way to check and deeply understand the plugin’s functionality. We took an extra step to allow our customers to “check up on us” and make sure we’re doing only what we say we’re doing, and no more.

If you’re interested in trying Big Data Tools, you can easily find the plugin here.

Data Engineers Are Like Plumbers Who Install Pipes for Big Data
https://blog.jetbrains.com/team/2022/12/20/data-engineers-are-like-plumbers-who-install-pipes-for-big-data/ (Tue, 20 Dec 2022)

Roman Poborchiy interviewed Pasha Finkelshteyn, a Big Data IDE developer advocate. Pasha loves talking to people about big data and has broad experience across the IT sphere. He also has a degree in psychology and is a speaker, author, and host of several podcasts.

Pasha Finkelshteyn – Big Data IDE Developer Advocate

Roman: Good morning!

Pasha: Morning! Though 10:30 AM is not actually morning for me. I’m generally an early bird, but in December I have one more reason to get up early. Every day, I solve programming puzzles in Advent of Code. Do you know what that is?

Roman: There’s a hint in the name, but you’d better tell me.

Pasha: This year, more than a hundred thousand people are participating. Every day, new programming challenges are posted on the website, primarily in mathematics and algorithms. You can solve them in any programming language and submit your solutions. The task for the day is the same for everyone, but the input data, against which your solution is checked, is different for each participant.

Roman: And why wake up early?

Pasha: In my time zone (GMT+1), a new task is published at 6 AM, and ideally it has to be dealt with around the same time. To be honest, I’m not enough of an early bird to start it at 6 AM, but I’ll do my best to get it done as early as possible. And since many of my peers are in Europe too — the competition is still tight!

Not to mention that I enjoy solving problems before getting down to work. It sort of gets me in a working mood.

Roman: Great! So there you are, in the right mood, and your workday begins. What do you do? 

Pasha: I work as a developer advocate on the Big Data Tools team.

My job is about making users awesome. There is an amazing book called Badass: Making Users Awesome about how we should work to make people productive and happy at work. 

I try to educate people on various topics in the hope that someday they will talk to me, and I will be able to listen to them and say, “Yes, we can implement this for you in Big Data Tools.”

The phenomenon of developer advocacy appeared in response to the fact that IT people don’t buy products that are advertised in traditional ways. They want to understand why the product exists. Internally, advocacy is often a kind of sales – people come to their managers and tell them which tool they want to use. 

I’d like to believe that someday our Big Data Tools will become the default tool. 

Roman: The default tool for whom?

Pasha: Big Data Tools is a plugin for data engineers…

Roman: Let me stop you right there. From your point of view, who is a data engineer? What kind of profession is it?

Pasha: I really like to make an analogy with plumbers when I talk about data engineers. 

We always have sources and receivers, we can collect data, transfer it, pour it back and forth, and perhaps transform it with every operation. It can be a complex system, like plumbing or sewage. And along the way, we can say how to lay pipes correctly and in which facilities we should store data.

In other words, data engineers are people who know how to handle data. At the same time, they normally don’t extract any business value from the data, neither analytical nor scientific. For them, data is just a raw material that needs to be prepared and handed over to other experts.

Roman: OK, let’s assume this is clear. What’s the problem with tools for data engineers?

Pasha: I think that Big Data is still too young a field. That’s why we are seeing a growing number of specializations, deepening knowledge in particular sub-fields, and that’s why we have data scientists, data analysts, and data engineers.

Sometimes, they distinguish between a lot of different kinds of data engineers. Often people are engaged in one very specific task which eats up all eight hours of their working time.

This isn’t cool. I believe that people should be pentagon-shaped: all of your knowledge is deep, and only one point of the figure (in my case it’s Java) goes a little deeper than everything else.

Roman: Why doesn’t it work that way in Big Data?

Pasha: The field is not mature enough.

The tooling is still poor, and we need a lot of people who know how to work with different tools. A lot of things simply haven’t been built yet. For example, there is Apache Spark, a very popular tool. And, let’s say, we stumble upon a bug.

Roman: Our bug or a Spark bug?

Pasha: Our bug. In our code. If we have a bug in the backend of a standard enterprise application, we usually have an understandable stacktrace. We can try debugging or using debug prints – these are working methods. 

But there’s no way you can debug Spark. And you can’t do debug prints – you have too much data and won’t find what you’re looking for in the debug output. You can’t even download this output. So you have to look for the bug analytically, which isn’t always a trivial task.

Roman: You mean the only thing you can do is read the code?

Pasha: Sort of. I really hope that one day JetBrains will make a debugger for Spark. I have an idea how to approach the task, and we’re considering it.

And I’ve only mentioned Spark, which is the most advanced of all the tools. There is this picture, State of BigData Ecosystem, with more than a thousand technologies. There are so many because none of them solves the problem perfectly.

Roman: But, as a business, we don’t want to wait until the ecosystem evolves. We want to be able to act right now, even though the field is immature. How do you develop Big Data Tools?

Pasha: Actually, the fact that the industry is immature is to our advantage. There are so many tools on the market, so diverse and aimed at such different goals, that we can (ultimately) integrate all of them and solve all the problems. When the industry matures, the number of tools will shrink, but we will still support them, and thus we’ll support most data engineers.

Roman: Or we could make our tool the perfect tool?

Pasha: That’s possible, but there’s a problem. At the moment, the no-code and low-code paradigms are becoming increasingly popular in data engineering. 

If by chance the perfect tool happens to be a no-code tool, there will simply be no place for us. We make great IDEs, but visual programming has never been our strong suit. We have a different focus.

Therefore, beating a tool originally designed as no-code on its home field seems unrealistic.

For now, we’re doing well in the absence of a universal tool, where we can integrate a million different technologies into our solution. We can say, “No matter what you’ve got under the hood, you can work with it in our IDEs, and we will provide a consistent user experience.”

Roman: Could you give an example?

Pasha: For example, we’ve recently added an integration with Amazon EMR. It is not only MapReduce, but a cluster with a Spark-like UI. This work has been finished and rolled out, along with many other exciting things like the Tencent Cloud and Alibaba Cloud integrations.

Our next step could be to support Google Dataproc to achieve feature parity across the major cloud vendors.

Then there is integration with Zeppelin, which allows you to work with notebooks without having to leave the IDE. A lot of work has been done here. With a bit of luck, we might be able to reuse part of this work to support other notebooks.

And the work on massively requested features — dbt support and Avro Schema Registry — is in progress too.

Roman: What made this integration so hard to implement? You have a Zeppelin notebook, which can be located anywhere, and you need to connect to it?

Pasha: Connection is no big deal. The hardest part was building the Zeppelin editor, because it’s not just about showing a web interface in the IDE. It is a full-featured editor with lots of cells, where every cell is another editor. It renders HTML for you and supports all sorts of settings.

We’ve also made ZTools. It’s a tool that shows you the schema of your Spark DataFrames. Since we know this schema, we can offer autocompletion for columns in Spark SQL. The database doesn’t exist in reality, but we can create a synthetic Spark database, autocomplete column names in your code, and check whether the arguments are passed correctly.

Roman: And how do you choose and prioritize the directions for Big Data Tools development? How do you decide what needs to be done first?

Pasha: I do the research, but not the prioritization. I can tell you how I do my research, though.

I hang out on data engineer forums, like in the international data engineers Slack channel. There are not many members, around a thousand people, but it’s interesting to read the questions that they ask.

These questions can’t always be answered, but they help me understand the trends. Then I go to our team lead, Ekaterina Podkhaliuzina, and tell her, “There is this tool, and people have been showing a lot of interest in it; perhaps we could help them.” 

I have some kind of a threshold in mind which helps me judge whether a tool is popular. For example, a couple of days ago I realized this about dbt (Data Build Tool), which allows you to describe your data model in YAML. In a nutshell, it allows you to create data marts.

It is extremely popular now, and perhaps we’ll be able to come up with something to support it. Since we have connectors to everything, we could provide coding assistance for SQL inside the models. 

Roman: In order to follow what’s happening, you have to have some understanding of everything, right?

Pasha: I like the idea of bringing domains together. Anton Keks, a great speaker and person, in his talk “The world needs full-stack craftsmen” describes the need for artisans who can do any job. This resonates with me a lot because it fits well with my experience. I can do almost any job, except for something very science-intensive, like compilers or advanced mathematics. I can develop frontend with React or write databases – these are the things I’ve learned by doing.

Roman: How did you get all this experience? I know that you have a degree in organizational psychology. What is it all about?

Pasha: The aspect of psychology that has been gaining popularity deals with the issues of individuals. Psychotherapy, and all things related, can be called personality psychology.

But psychologists are needed not only to help individuals; sometimes they can also help solve issues for organizations. A popular request is formulating company values.

Roman: What brought you from psychology to software development? You once said that it was out of despair. Why despair? I’ve come across many psychological startups – it seems that a lot is going on in the sphere. I mean, there’s life there.

Pasha: It’s definitely not dead. A lot of interesting things are going on. 

As for me, it is worth starting with my connection to IT. My parents were and still are mathematicians and developers. We got our first PC when I was 6 years old, back in 1992.

But I never did anything seriously with it and did not think I could. When I was 8 years old, I tried learning programming together with a girl who was my mother’s student. It didn’t work out, and my mother said that programming was probably not for me.

After university, I didn’t have that many options. I could do an internship in a consulting firm and work as a psychologist, but at that time it was very little money, and I was already married and wanted to earn a little more. 

The other option was to try something different. Between the ages of 6 and 22, I managed to learn a thing or two about computers. I enjoyed exploring software and got a job in technical support at a company called Poligor. I’m grateful to them, although they fired me on the last day of my probation, saying that IT was probably not for me.

It wasn’t the first time I heard this, as you can imagine.

Roman: But that didn’t stop you.

Pasha: Right. I started working for Philips in technical support, but it was too bureaucratic for me, and to be honest, they also fired me.

Then I got a job at an insurance company, where it turned out that during my time at Philips and Poligor I had become a good administrator. But the company started going bankrupt, and I got downsized. This time they didn’t say that IT was not for me.

Then a classmate invited me to a scientific institute to write code for them as an intern. I told him right away that programming wasn’t my thing. He replied that I didn’t have many options, but had my head screwed on right and had to try.

So I started programming with smart forms. This development paradigm doesn’t exist anymore.

Roman: Was it the visual programming that wasn’t our strong suit at JetBrains?

Pasha: Probably, although the term is from Delphi. And you know, I did it. And then it turned out that you can still write code on top of the forms, and I did that too.

I worked for 5 years in a different team at the same place. We developed secondary monitoring systems in Java for nuclear power plants all over Russia. It was an interesting and rewarding job.

You know what was the best part? I had a chance to explore a gazillion technologies. I had an amazing team leader who allowed me to experiment. After that, job interviews were a piece of cake, because I had gained some experience with pretty much everything. I used to include a huge list of technologies in my CV. Of course, I don’t do that anymore, but back then I could.

And then things were off and rolling. I became a developer, then a team lead, then worked as a CTO for a while with 35 people under me. Next, I returned to being a team lead and then moved to a linear data engineer position.

Roman: Do you sometimes feel that you lack fundamental education in programming? If so, how do you fill the gaps?

Pasha: Honestly, working at JetBrains, it’s hard not to have this feeling. Too many people around you are very well versed in programming.

Roman: Looking closer, everyone at JetBrains, including marketers and UX-researchers, writes code. They are often very good at programming, regardless of their job title.

Pasha: Sure thing. How do I fill the gaps? The honest answer is that I don’t. I usually do research when I need to solve a particular problem. 

Going back to the talk that I’ve mentioned about full-stack craftsmen, Anton said something cool: “The more you have learned, the easier it is for you to learn new things.” Polyglots can confirm this: they say that the more languages you speak, the easier it is for you to master a new one.

But not everyone can master languages easily. For some people, it requires a huge effort. It’s worth learning all sorts of things. I learned how to make clay pots at one point. It is also fun, and it increases your brain’s plasticity too. Honestly, I don’t really think that the ability to learn declines with age. Perhaps we become slower learners, but if you keep training this muscle, so to speak, it will be easier to learn new things.

So I keep trying new technologies. At my current job, I can afford to try anything.

Roman: Are you happy with your current role?

Pasha: Totally, I do what I enjoy doing. I enjoy talking to people and I’m always on the lookout for curious things that are going on in the industry. I’m really looking forward to the return of offline events (and multiple conferences are not in-person yet) because it is so much easier to meet new people there. 

Roman: As a final thought, let’s hope that we and our readers can get back to enjoying in-person events as soon as possible.

Pasha: Right – and without having to worry so much.

Roman Poborchiy
Big Data Tools 2022.3: Integration with AWS Glue Data Catalog, Code Completion for SQL Expressions in Zeppelin Notebooks
https://blog.jetbrains.com/big-data-tools/2022/12/01/big-data-tools-2022-3/ (Thu, 01 Dec 2022)

In this update we’ve added integrations with AWS Glue and Tencent Cloud Object Storage, enhanced Zeppelin notebook support, and delivered important fixes. Read on to learn about the most important changes in the Big Data Tools plugin, or try it right away by installing it in the 2022.3 versions of IntelliJ IDEA Ultimate, PyCharm Professional, DataSpell, or DataGrip.

Tencent Cloud Object Storage (COS) Support

We’ve added support for a new distributed storage service: you can now work with Tencent COS right from the IDE. Supported features include file versioning, with the ability to compare two versions of the same file (Show Diff) and restore the desired version.

AWS Glue Monitoring

Integration with AWS Glue is now supported, letting you monitor your databases, view schemas and partitions, filter data, and customize database views. Learn how to configure a connection to your AWS Glue.
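Under the hood, this is the same metadata you can reach through the Glue API. A minimal boto3 sketch follows; the database and table names are assumptions.

```python
import boto3

glue = boto3.client("glue")

# Databases registered in the Glue Data Catalog.
for db in glue.get_databases()["DatabaseList"]:
    print(db["Name"])

# Columns and partition keys of a hypothetical table.
table = glue.get_table(DatabaseName="sales", Name="orders")["Table"]
print([c["Name"] for c in table["StorageDescriptor"]["Columns"]])
print([p["Name"] for p in table.get("PartitionKeys", [])])
```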

Zeppelin State Viewer Feature

Fine-tune variable introspection for your use case

The State Viewer feature, which allows you to preview local variables and SQL schemas for the current Zeppelin session, is now enabled by default and is available in local Zeppelin notebooks. It is more customizable, with many new configuration parameters:

Enjoy code assistance for SQL expressions

Code highlighting and completion are now supported for SQL expressions within Spark. We have also provided completion for tables based on data from Hive Metastore and AWS Glue.

Remote File Systems Updates

  • A search bar for RFS connections lets you quickly navigate to a directory or a file within a selected storage.
  • You can now provide the sudo password while connecting to an SFTP server to enable access to files with root permissions.
  • In the connection settings, you can now disable SSL certificate validation if you trust the server.
  • You can now connect to a custom Google Cloud Storage host.

Bulk actions for Kafka topics

You can now select multiple topics and delete or copy them at once. We have also improved the visibility of long values and implemented display of `null` values.

For the full list of new features and enhancements, see the changelog on the Big Data Tools plugin page. Please share your feedback with us and report any issues you encounter to our issue tracker.
For more updates, follow us on Twitter!

The Big Data Tools team

Data Engineering Annotated Monthly – October 2022
https://blog.jetbrains.com/big-data-tools/2022/11/09/data-engineering-annotated-monthly-october-2022/ (Wed, 09 Nov 2022)

Greetings from sunny Berlin! Yes, it’s still 20+ °C here – perfect conditions for sitting down on your balcony with the latest issue of your favorite Annotated! I’m Pasha Finkelshteyn, and I’ll be your guide through this month’s news. I’ll offer my impressions of recent developments in the data engineering space and highlight new ideas from the wider community. If you think I missed something worthwhile, hit me up on Twitter and suggest a topic, link, or anything else you want to see. By the way, if you would prefer to receive this information as an email, you can subscribe to the newsletter here.

News

A lot of engineering is about learning new things and keeping a finger on the pulse of new technologies. Here’s what’s happening in the world of data engineering right now.

Apache Doris 1.1.3 – Here’s another interesting database for you. We aren’t aware of many MPP databases, and few of them are under the motley umbrella of the Apache Software Foundation. It is built specifically for ad-hoc queries, report analysis, and other similar tasks. For example, take a look at this picture from Doris’ site:

Typical usage pattern of Apache Doris

This looks like an excellent candidate for use as your next DWH, doesn’t it?

One of the great things about ASF projects is that they usually work nicely together, and this is no exception. For example, the current 1.1.3 release supports Apache Parquet as an output file format. 

Apache Age 1.1.0 – Sometimes, we data engineers do work that doesn’t deal directly with big data. Sometimes our job is just to ensure things are designed correctly, which can require us to use tools we are familiar with in a non-typical manner. Take, for example, Postgres. It’s one of the most popular databases; it is extensible, has a good enough planner, and is tunable. Some extensions for it are fairly well-known. For example, `ltree` is a popular extension from postgres-contrib that facilitates the representation of tree structures in a friendly way with a special type. But today, I want to highlight Apache Age, an extension that makes it possible to use Postgres as a graph database. The query language is some kind of mix of traditional SQL and Cypher, which is, as far as I’m concerned, the most popular graph query language today.

RocketMQ 5.0.0 – I’ve already mentioned Apache RocketMQ, a high-performance queue based on ActiveMQ, in previous installments of this series. This new major release is notable, however, because it introduces a handy new concept: logic queues. Currently, MessageQueue is coupled to the broker name, and the broker name is coupled to the number of presently active brokers. When there’s a change, planned or unplanned, to the number of brokers, queue rebalancing begins. This rebalancing can take minutes, and the queue will not always be available during that time. That, in turn, can lead to significant degradation of the overall quality of service. Logic queues remove this relation between queues and the number of nodes, so now, when the number of nodes changes, the rebalance will take significantly less time, if any.

ScyllaDB 5.0.5 – I’ve had my eye on ScyllaDB for a long time now, but I still haven’t had a chance to share its progress. ScyllaDB is interesting because it’s a drop-in replacement for Apache Cassandra. Many years ago, when Java seemed slow, and its JIT compiler was not as cool as it is today, some of the people working on the OSv operating system recognized that they could make many more optimizations in user space than they could in kernel space. One example of an application they targeted for improvement was Apache Cassandra, as it was powerful but slow… Fast forward seven years, and it looks like they achieved their goal and built a sustainable business as well! Among the most notable changes in ScyllaDB 5.0 is the implementation of (experimental) support for the strongly consistent DDL, which is very important, especially when your data changes, as it is likely to do.

Future changes

Data engineering tools are evolving every day. This section is about updates that are in the works for technologies and that you may want to keep an eye on.

Docker Official Image for Spark – A proposal to make Spark a Docker Official Image was recently approved. The Docker Official Image status is a way to indicate that an image is top-level and very important for the community. This type of image will usually be located not in some directory in Docker Hub but rather in the root (in this case, it should be https://hub.docker.com/_/spark). But of course, making Spark a Docker Official Image would not just entail small cosmetic changes. Docker Official Images are rebuilt proactively, so if you find yourself vulnerable to a new security breach that’s already fixed in the master, you can just download the latest version of the DOI and get back to safety. Additionally, DOIs are maintained by the Docker community, which usually means best practices are adhered to. Of course, some work still needs to be done to achieve this goal, but the proposal has already been approved and is halfway implemented.

Flink: Support Customized Kubernetes Schedulers – This proposal is fascinating. On the one hand, the authors say that the current integration of Flink with k8s is already excellent. On the other hand, they say that resource scheduling is implemented only with a very narrow set of techniques, which is not enough. They postulate that different Flink workflows require different kinds of resource scheduling and describe four strategies for resource allocation and scheduling that could significantly improve the performance of Flink jobs.

Kafka: The Next Generation of the Consumer Rebalance Protocol – The current rebalance protocol in Kafka has existed for a long time. It’s superior to what was introduced at first with ZooKeeper, but nevertheless, it’s already a legacy protocol. It relies on intelligent clients that know everything about other consumers in their consumer group and can act accordingly when the number of consumers in the group changes. The authors of the proposal state that the majority of bugs they have encountered in the protocol this year required fixes on the client side, which is indeed bad because we don’t have control over a consumer’s code, for a variety of reasons. The coming change is absolutely massive and has lots of goals, but for me, the most crucial difference is that now the broker will decide how clients should be rebalanced, allowing it to dictate changes that are as small as possible. And for clients, the process should be completely transparent.

Articles

This section is all about inspiration. Here are some great articles and posts that can help inspire us all to learn from the experience of other people, teams, and companies who work in data engineering.

Data Engineers Aren’t Plumbers – When people ask me what a data engineer is, I usually use the metaphor of a plumber. We work with pipes. We make sure they work correctly, and that they aren’t clogged, and so on, right? Well, it looks like Luís Oliveira disagrees with me. He says that data engineering work is indeed related to pipes, but in a different way, and he uses a different pipe-related metaphor. Will I use it in my future explanations? Maybe. But I suspect his explanation will raise even more questions. Nevertheless, the analogy is beautiful.

How to create a dbt package – I like dbt and even wrote a couple of posts about it. But this post brought something to my attention that I hadn’t given a lot of thought to: dbt packages. Dbt packages allow one project to depend on others, making the usage of dbt much more manageable in the event your warehouses are massive. Different teams can, for example, reuse the “common” package that contains all the basic models from your anchor-organized Data Warehouse.

How to run your data team as a product team – Sometimes, it can be tempting to conceive of data engineering as purely a technical enterprise, like plumbing. But the truth is all the data engineers are (or should be) working on a data product – a product that will solve particular problems confronting management, customers, or another party. And that’s why we should think about our tasks as engineers and as part of a product team. This post offers insight into that dual mindset.

That wraps up October’s Data Engineering Annotated. Follow JetBrains Big Data Tools on Twitter and subscribe to our blog for more news! You can always reach me, Pasha Finkelshteyn, at asm0dey@jetbrains.com or send a DM to my personal Twitter account. You can also get in touch with our team at big-data-tools@jetbrains.com. We’d love to know about any other exciting data engineering articles you come across!

Data Engineering Annotated Monthly – September 2022
https://blog.jetbrains.com/big-data-tools/2022/10/10/data-engineering-annotated-monthly-september-2022/ (Mon, 10 Oct 2022)

It’s been a very bustling two months in Berlin. Indeed, it’s been so busy that I had to skip the digests. I am now delighted to have the privilege of returning to the task of collecting for you the most exciting news from the world of data engineering. Greetings from sunny Berlin! I’m Pasha Finkelshteyn, and I’ll be your guide through this month’s news. I’ll offer my impressions of recent developments in the data engineering space and highlight new ideas from the wider community. If you think I missed something worthwhile, hit me up on Twitter and suggest a topic, link, or anything else you want to see. By the way, if you would prefer to receive this information as an email, you can subscribe to the newsletter here.

News

A lot of engineering is about learning new things and keeping a finger on the pulse of new technologies. Here’s what’s happening in the world of data engineering right now.

Brooklin 4.1.0 – Once again, I learned something new while preparing this article. This time I learned about Brooklin, a LinkedIn service for streaming data in a heterogeneous environment. The official GitHub for the project says that it is characterized by high reliability and throughput, claiming that Brooklin can run hundreds of streaming pipelines simultaneously. This is no doubt very interesting. One of the use cases from the product page that stood out to me in particular was the effort to mirror multiple Kafka clusters in one Brooklin cluster!

Ambry v0.3.870 – It turns out that last month was rich in releases from LinkedIn, all of them related in one way or another to data engineering. The authors of Ambry, another new product, release so often that they must already be getting tired of writing changelogs. Nevertheless, the project looks very interesting. It is a distributed on-premise object repository that targets trillions of small immutable objects or billions of large ones. I often hear that MinIO does not perform well in large installations. Perhaps it’s time to try an alternative!

Apache Pegasus 2.3.0 – Have you ever been in a situation where you were designing a storage architecture and all the solutions in some areas just seemed wrong, leaving you to choose between an unsuitable option and an even less suitable one? Key-value storage is a fairly typical context in which this problem arises. HBase is too slow and brittle, and Redis is fast but can lose data. Apache Pegasus might be the alternative you are looking for, if not now, then in your next project. On the one hand, it is written in C++, which probably makes it faster, but on the other, Pegasus is very concerned about data persistence on disk, which is critical when migrating data, for example.

Cloudstack 4.17.1.0 (LTS) – While it may sometimes seem like Kubernetes is on top of the pile, the truth is that even the mighty k8s needs hardware to run on. If we use cloud providers, we don’t have to think about that. For your own hardware, it can be easy too: take your favorite hypervisor and go with it. But what if there are many hardware clusters? Apache Cloudstack provides a free IaaS stack that is compatible with all the most popular hypervisors, both paid and free. Why should data engineers care about this tool? It’s very simple: at some point, while talking to the Ops team, it will turn out that they’re tired of creating virtual machines with different characteristics all over the world for us. This is where Cloudstack comes in handy: it allows data engineers to manage the hardware they need themselves, instead of going to the Ops team every time they need a slight change in resources to run their workloads. So both sides, the Ops and Data teams, can benefit from this product.

DuaLip 2.4.1 – Sometimes the job of a data engineer is not just to build pipelines but also to help data science professionals optimize their solutions. Imagine, for example, that your colleagues are working on a sales or scheduling task that requires a constraint-based solution. They have their algorithm. They have their data. And they know what they need to do. They have a problem, however. Their solution is too slow and isn’t scalable. While OptaPlanner, the most famous product for completing these tasks, does not scale, LinkedIn’s DuaLip runs on a tool that many of us know and love – Apache Spark, which allows us to scale tasks and thus run them faster.

Druid 24.0.0 – Apache Druid has made the leap from 0.23.0 to 24.0.0. I think it’s about time! Druid seems to have really made the transition from an immature project to a production-ready solution. But maturity does not signal a halt in development! On the contrary, new features appear in every release, including noticeable ones. This release, for example, introduced a multi-stage query task engine that promises significant changes in the execution speed for batch queries. Unfortunately, it is not clear yet whether all your current queries will work as before, and the developers recommend testing your queries on the staging environment first. As they say, you can’t make an omelet without breaking any eggs.

Future changes

Data engineering tools are evolving every day. This section is about updates that are in the works for technologies and that you may want to keep an eye on.

Potentially breaking change in Spark 3.3.1 – While reading Apache Spark’s mailing list, I stumbled upon a concern expressed by one of the project’s maintainers: the SPARK-40218 fix seems to have introduced a change in the framework’s behavior. If your code happens to work correctly now, before updating to version 3.3.1, it might not behave the same way after this release. Don’t forget to check your tests! At the time of writing, the release has not yet shipped; voting is still in progress. You can find more info on the voting process here.

Support for Standalone mode in Flink’s Kubernetes operator – As you may have noticed, I write about Kubernetes all the time. Indeed, I believe that it’s the future of data engineering, at least for on-premises solutions. And that’s why I get especially excited when popular products add or extend support for k8s. Over the past few months, Flink’s developers realized that the current support for its Kubernetes operator is insufficient and may also lead to security risks. In addition to solving this problem, the Flink developers also addressed the issue of how to run an older version of Flink in a cluster that does not yet know anything about Kubernetes-native features. This will be possible after the release thanks to support for running Flink in standalone mode instead of cluster mode.

Multicasting record results with Kafka – I can’t say it any better than the author of this KIP did:

[Often] in Kafka Streams users want to send a record to more than one partition on the sink topic. Currently, if a user wants to replicate a message into N partitions, the only way of doing that is to replicate the message N times and then plug-in a new custom partitioner to write the message N times into N different partitions.

Articles

This section is all about inspiration. Here are some great articles and posts that can help inspire us all to learn from the experience of other people, teams, and companies who work in data engineering.

Upgrading Data Warehouse Infrastructure at Airbnb – The value of data lakes – or quasi-mutable storages, as I like to call them – is difficult to overstate, as it is becoming increasingly difficult for companies to fit all their data into completely immutable structures. GDPR and DMCA compliance alone is already an arduous enough process! These conditions have led to a wave of changes at Airbnb. They’ve replaced their simple Parquet files with Apache Iceberg, as well as migrated to Spark 3 for query optimization. From Iceberg they need not only create, read, and update operations, but also its metastore, which reduces the number of S3 bucket listings their clients have to perform.

Is DataOps the Future Of the Modern Data Stack? – DevOps, a process built around the interaction of developers (Dev) with the Operations (Ops) team, has been with us for a long time. This article discusses the need to establish a similar framework around data. Some people maintain data while some people use data, and it may be useful to have a discrete field that brings them together. From my perspective, the introduction of DataOps is long overdue. In fact, the term is not entirely new, which may be an indication that we are about to experience a sea change.

Spanner on a modern columnar storage engine – From time to time, industry giants share some insight into how they solve their problems, even if they don’t reveal the specific details of the tools they create. This practice can be extremely helpful, and in fact, famous, industry-changing open-source tools like Hadoop have been born out of it. In early September, Google decided to share how columnar storage works in Spanner and how it was migrated from the old engine to the new one under a load of two billion requests per second. Who knows? Maybe that’s precisely the information you need to create the next industry-changing product.

That wraps up September’s Data Engineering Annotated. Follow JetBrains Big Data Tools on Twitter and subscribe to our blog for more news! You can always reach me, Pasha Finkelshteyn, at asm0dey@jetbrains.com or send a DM to my personal Twitter account. You can also get in touch with our team at big-data-tools@jetbrains.com. We’d love to know about any other interesting data engineering articles you come across!

Big Data Tools 2022.2 is here!
https://blog.jetbrains.com/big-data-tools/2022/08/01/big-data-tools-2022-2/ (Mon, 01 Aug 2022)

The highlights of this release include integration with Hive Metastore, the ability to monitor Flink jobs right inside your IDE, and SSO authentication for Amazon S3. The new version provides many other noteworthy changes, which are covered below.

Get the latest version by installing it in the 2022.2 version of your IDE.

Hive Metastore Integration

We’ve added the ability to create a Hive Metastore connection from the IDE and browse Hive catalogs, tables, and columns.

Big Data Tools now also provides code completion for Spark SQL based on Hive Metastore data.
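The completion targets plain Spark SQL over metastore-backed tables, along the lines of this sketch (the database, table, and column names are assumptions):

```python
from pyspark.sql import SparkSession

# Hive support makes metastore tables visible to Spark SQL.
spark = SparkSession.builder.enableHiveSupport().getOrCreate()

# Table and column names come straight from the Hive Metastore,
# which is the same metadata the IDE uses to drive completion.
spark.sql("""
    SELECT customer_id, SUM(amount) AS total
    FROM sales.orders
    GROUP BY customer_id
""").show()
```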

Apache Flink Monitoring in the IDE

You can now monitor Apache Flink applications right in your IDE with the ability to submit jobs.

Improved AWS EMR Integration

The Amazon EMR integration allows you to monitor clusters and nodes. As of v2022.2, you can view EMR clusters in a tree view and filter them by state, such as clusters terminated in the last hour, day, or week. We also added the ability to open logs as .txt files.

Enhanced RFS Storages Support

Amazon S3 connections now support SSO authentication, and you can filter buckets by region.


Microsoft Azure connections now support multiple containers.


We also added the ability to create an SFTP connection using the project's SSH configuration, and introduced multiple improvements to the RFS file editor.

Kafka Monitoring Refinements

Monitoring your Kafka event streaming processes has become even more convenient. Now you can filter topics in the topic list. Support for SASL_PLAINTEXT authentication has also been added.

Improved Usability of the Big Data Tools Panel

We’ve reworked keyboard interactions inside the Big Data Tools panel and added multiple shortcuts, such as opening Kafka, Hadoop, and Spark Monitoring from the panel by simply pressing Enter. We’ve also added convenient drag-and-drop functionality.

For the full list of new features and enhancements, see the changelog on the Big Data Tools plugin page. Please share your feedback with us and report any issues you encounter to our issue tracker.
For more updates, follow us on Twitter!

The Big Data Tools team

Data Engineering Annotated Monthly – June 2022
https://blog.jetbrains.com/big-data-tools/2022/07/14/data-engineering-annotated-monthly-june-2022/ (Thu, 14 Jul 2022)

Hi, I’m Pasha Finkelshteyn, and I’ll be your guide today through this month’s news. I’ll offer my impressions of recent developments in the data engineering space and highlight new ideas from the wider community. If you think I missed something worthwhile, catch me on Twitter and suggest a topic, link, or anything else you want to see. By the way, if you would prefer to get this monthly source of data engineering information delivered straight to your inbox each month, you can subscribe to the newsletter here.

News

A lot of engineering is about learning new things and keeping a finger on the pulse of new technologies. Here’s what’s happening in the world of data engineering right now.

Apache Ambari: Resurrected – In February, Apache Ambari was moved to the Apache Attic. It made me think that the era of on-premises free Hadoop installations had come to an end. However, a miracle happened! This is actually the first instance I remember of something being revived after it was already in the attic. The process of returning to active maintenance is not even described in the docs. I’m actually happy that this has happened – Hadoop was there for me at the very beginning of my career and I have very positive feelings associated with it.

ShardingSphere – One more thing I learned while preparing this installment is that there is an entire top-level project to convert traditional databases into distributed ones. To be honest, I’m a little skeptical. How is it possible to support distributed transactions and solve the other complex problems of distributed systems? Amazon spent tremendous amounts of money developing on top of Postgres in Redshift. Greenplum is in some sense the successor of PostgresXL and PostgresXC. Combined, it took them many years to develop something both distributed and functional. And yet, there is a publication on the topic, and who am I to argue with them, anyway? The creators of ShardingSphere promise that it is SQL-aware and can transparently proxy SQL traffic, while also being pluggable, meaning you can extend the whole sphere with custom plugins.

Druid 0.23.0 – Druid — not the tree-folk kind — recently increased their development speed tremendously. In this release, the authors have implemented dozens of features, and some of them are very significant. For example, grouping on arrays without exploding the arrays significantly improves the readability of queries. This is crucial because, as we know, code is only written once but is read a potentially infinite number of times. There are also multiple improvements for streaming support (for Kafka and Kinesis), along with many other changes.

InLong 1.2.0 – This is one of the more interesting projects I hadn’t already heard of before preparing this installment. Apache InLong was formerly named TubeMQ and was initially created by Tencent, a huge multimedia company with roots in China. When I say “huge”, I mean it’s one of the highest grossing multimedia companies in the world. Just like any multimedia company, they handle very large amounts of data. And, of course, they have created a solution that will suit their ingestion needs. In a nutshell, InLong is a SaaS-based streaming platform that scales. It wouldn’t be quite right to call it “Kafka on steroids” because it includes lots of batteries. It integrates with different Message Queues out of the box, provides a real-time ETL experience, and offers built-in alerting and monitoring. On top of that, on the main page of its documentation you can find an impressive list of integrations.

Future improvements

Data engineering tools are evolving every day. This section is about updates that are in the works for technologies and that you may want to keep an eye on.

Kafka: Monitor KRaft Controller Quorum Health – In the previous installment I wrote about KRaft, the new consensus algorithm in Kafka. However, when you implement such a major feature, you need to provide customers with the ability to monitor it. KIP-835 has already been accepted and is waiting to be implemented. Once it has been, we’ll have the ability to understand not only the health of the Kafka cluster, but specifically the state of quorum as well.

Flink: Add Retry Support For Async I/O In DataStream API – There is no such thing as a “reliable data source”. Everything we connect remotely is inherently faulty. Networks are unreliable, slow, and error-prone. They can drop packets and even lie on occasion, depending on the communication protocol. There are, of course, different strategies for dealing with these issues. We can just ignore the absence of information and accept that it will be lost in the haystack of data (and that can be perfectly fine), but sometimes we need to obtain it at any cost. In this case, the usual solution is to retry until it succeeds. In Flink, customers have to write the whole retry logic themselves. But with the implementation of FLIP-232, this might be about to change!

Spark: Spark Connect – A client and server interface for Apache Spark – This proposed improvement has a lot of potential! The authors claim that Spark, with its current architecture, lacks 4 important traits: built-in remote connectivity, a rich developer experience, stability, and upgradability. They say that they are aiming to introduce a new API that will make it possible to work with Spark in a client-server manner. This means the client would connect to a running Spark cluster through an API, and this API would make it much easier to perform exploratory data analysis (which is a common task for both data engineers and data scientists). And who knows? Maybe the Kotlin API for Apache Spark can benefit from it too!

Articles

This section is all about inspiration. Here are some great articles and posts that can help inspire us all to learn from the experience of other people, teams, and companies who work in data engineering.

Recap of Databricks Machine Learning announcements from Data & AI Summit – This year’s Data & AI Summit was huge, and it was full of interesting announcements. A month ago it might have seemed like Databricks just provided notebooks, but that’s not the case anymore. The platform’s new features include MLflow 2.0, serverless model endpoints, model monitoring, and many other features aimed at MLOps and production-ready data science models and experiments.

The State of Data Engineering 2022 – I like this kind of content. Somebody looks at what’s going on in the world of data engineering today, classifies it, and puts it all together into one nice image. I’ve already shared a similar piece by Matt Turck, who does this every year for the whole data landscape. I hope the folks at lakeFS continue their good work and update this yearly. Keep it up!

Cache in Distributed Systems – There are two hard problems in programming: variable naming and cache invalidation. Maybe you’ve already found your own solution to the first, but the second is likely still an issue for you. This article not only describes invalidation, but also addresses matters like eviction, hits, and misses. Of course, it won’t solve the problem of cache invalidation, but it might just help you understand caches a little better.

Events

Current 2022: The Next Generation of Kafka Summit – This is the most popular conference dedicated to Kafka, and it is hosted by one of Kafka’s main maintainers – Confluent. Of course, the main topic is data streaming, as always.

Big Data Event: London – This is going to be a huge data event in London. It’s likely there will be thousands of attendees, and there are already dozens of speakers from a wide selection of companies, including the widely known Aerospike, Stack Overflow, and Snowflake.

That wraps up June’s Data Engineering Annotated. Follow JetBrains Big Data Tools on Twitter and subscribe to our blog for more news! You can always reach me, Pasha Finkelshteyn, at asm0dey@jetbrains.com or send a DM to my personal Twitter account. You can also get in touch with our team at big-data-tools@jetbrains.com. We’d love to know about any other interesting data engineering articles you come across!

Why We Need Hive Metastore https://blog.jetbrains.com/big-data-tools/2022/07/01/why-we-need-hive-metastore/ Fri, 01 Jul 2022 10:00:00 +0000 https://blog.jetbrains.com/wp-content/uploads/2022/06/DSGN-14047_HiveMetastore_Featured_image_1280x600.png https://blog.jetbrains.com/?post_type=big-data-tools&p=259472 Everybody in IT works with data, including frontend and backend developers, analysts, QA engineers, product managers, and people in many other roles. The data used and the data processing methods vary with the role, but data itself is more often than not the key.

Keychain from the movie “The Matrix Reloaded”

— “It’s a very special key, meant only for The One”
— “What does it unlock?”
— “The future”

The Matrix Reloaded

In the data engineering world, data is more than “just data” – it’s the lifeblood of our work. It’s all we work with, most of the time. Our code is data-centric, and we use the only real 5th-generation language there is – SQL. (5th-generation languages are those that let you define what you want to achieve and the language itself solves the problem for you.)

There is a huge “but”, though. Data is cool, working with it is exciting, but accessing it is often cumbersome. Data is stored in a multitude of different formats, in different locations, and under different access restrictions, and it is structured in very different ways. We must be aware of them all, query them all, and sometimes even join them in our queries.

So, we need one place where we can manage all the information we have about our data stores. And this place is Hive Metastore.

Hive Metastore

Hive Metastore was developed as a part of Apache Hive, “a distributed, fault-tolerant data warehouse system that enables analytics at a massive scale”, per the Amazon website. This basically means you can query everything from one place.

Hive achieves this goal by being the storage point for all the meta-information about your data storages. With its HiveQL dialect (which has some limitations compared to regular SQL – but also some advantages), Hive allows you to project any data structure onto a structure that is suitable for querying with SQL.

One slightly confusing thing about Hive Metastore is that, even though it has “Hive” in its name, in reality it is separate from and completely independent of Hive.

Since we’re talking about components, let’s explore Hive Metastore’s architecture.

Architecture

The actual architecture of Hive Metastore is quite simple:

Architecture of the metastore: the metastore database, the Metastore Thrift server connected to that database, and the clients connected to the Thrift server

Since data is projected to SQL, information about it is very easy to map to a simple relational structure, almost in an entity-attribute-value kind of representation. For example, entity “table” – attribute “name” – value “clickstream”.
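
As a toy illustration of that shape (the class and values below are made up for the example and aren’t the metastore’s actual schema):

    // A made-up entity-attribute-value record, only to illustrate the shape of the metadata;
    // the real metastore schema is more elaborate and changes between versions
    data class MetaEntry(val entity: String, val attribute: String, val value: String)

    val entries = listOf(
        MetaEntry(entity = "table", attribute = "name", value = "clickstream"),
        MetaEntry(entity = "table", attribute = "location", value = "s3://example-bucket/clickstream/"),
    )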

Hive Metastore projects types from an underlying store to supported HiveQL types, and it stores information about the location of the underlying data. This metadata is stored in the metastore database, which is usually MySQL, Postgres, or Derby.

But the database itself is only an implementation detail. Its schema is more or less fluid, and it changes over time and without any prior notice. (Well, you could keep track of the changes if you followed the hive-dev mailing list, but few people do.) This database serves only one purpose: to provide the Thrift server with data.

The Metastore Thrift server is the main entry point for metastore clients. Let’s focus on Thrift for a moment. Thrift is an RPC framework initially created by Facebook and now maintained by the Apache Software Foundation. I would choose Thrift over the very popular gRPC any day, for several reasons:

  1. It has typed exceptions. So you don’t just get a random exception from inside your RPC, but rather you can actually understand what’s going wrong.
  2. It has a rich standard library (if a set of predefined types can be called that).
  3. Like gRPC, it supports many languages, but in my opinion, Thrift’s generator produces much nicer code than gRPC’s generator does.

Thrift was developed by Facebook to meet the needs of its big data ecosystem, which also makes it a perfect fit for Hive and, in my opinion, for other ecosystems as well.

So, getting back to the Thrift server, it’s a fairly simple application with an API that lets you obtain the necessary information about the data sources that Hive Metastore is aware of. It’s typed, but you can still use it with dynamically typed languages like Python, which Thrift’s code generator also supports.
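
From a JVM language, for example, you can talk to it with the stock metastore client that ships with Hive. Here’s a minimal sketch, assuming a metastore Thrift server on the default port 9083 (the exact client class and dependency coordinates vary between Hive versions):

    import org.apache.hadoop.hive.conf.HiveConf
    import org.apache.hadoop.hive.metastore.HiveMetaStoreClient

    fun main() {
        // Point the client at a running metastore Thrift server (9083 is the default port)
        val conf = HiveConf().apply {
            set("hive.metastore.uris", "thrift://localhost:9083")
        }

        val client = HiveMetaStoreClient(conf)
        try {
            // List what the metastore knows about – no Hive query engine involved
            for (db in client.allDatabases) {
                println("$db: ${client.getAllTables(db)}")
            }
        } finally {
            client.close()
        }
    }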

And the next part of the architecture is… there are no more parts! All the clients, including Hive itself, communicate only with the Hive Metastore Thrift server. The Thrift server is so simple that we can spin it up in a single Docker container (assuming that we use Derby as the metastore database). Of course, that’s a rare setup for a production environment, but it comes in very handy for experiments.

Usage by third-party systems

Here comes the best part: Many of the new systems only need to know about the Thrift server and communicate with it. They don’t need Hive or any other query engine to access the data. 

One example of such a system is Trino – a spin-off from PrestoDB being developed by a separate company called Starburst. When using Trino, you don’t need to have Hive installed. Having just Hive Metastore is enough. Trino is very simple to spin up in a Docker container, too – just one command is all it takes.

The same is true for lakeFS, a system that lets you work with a data lake using a Git-like interface. This can be very useful when you need to switch between different data sources quickly, as well as in many other situations. And with Hive Metastore’s help, integration is quite straightforward.

Criticism

Like most things, Hive Metastore isn’t perfect. Oz Katz from lakeFS wrote a great post on the limitations of Metastore. He sees three issues with Metastore:

  1. “Scaling Thrift.” While Thrift is not as widespread as HTTP, it runs over plain TCP (and can even use an HTTP transport), so I would argue that many popular tools, such as HAProxy, will work with it just fine. But it’s true that you can’t just grab a random message out of Thrift traffic and understand what it’s talking about. I agree this is a slight drawback.
  2. “Metastore is just a thin layer over RDBMS.” If I understand the argument correctly, very big Hive tables will create headaches while working with Metastore, owing to Hive’s partitioning scheme and the downsides of relational databases. Again, that’s a valid criticism, but here I should note that we actually don’t have to use Metastore with Hive. We can use it with different tools, too, and we don’t have to use partitioning either if we have other solutions that meet our needs.
  3. “Leaky abstractions.” This is a very valid criticism that’s hard to argue with. Still, I’m not aware of any abstractions that don’t leak at all. Yes, Metastore might be leakier than some others, but sometimes you might be able to turn this problem into an opportunity to fine-tune things when you need to. Granted, this is only possible when you know exactly what you’re doing, but I’d say that applies to any tool out there.

Summary

Today we talked about what Hive Metastore is, how it works, and what it’s used for. We got a brief overview of several products that make use of Hive Metastore, and we discussed some of the technology’s pros and cons.

So, why do we need Hive Metastore at the end of the day? Because it stores all the information about the structure of our data and its location. This is the reason why many big companies are using it, to good effect.

We’re well aware that many of our customers are working with Hive Metastore or its Amazon implementation, Glue Data Catalog. They are both great tools, and users deserve to have a tool that will help them work with Metastore in a more efficient way than just querying things with Hive.

Data Engineering Annotated Monthly – May 2022 https://blog.jetbrains.com/big-data-tools/2022/06/08/data-engineering-annotated-monthly-may-2022/ Wed, 08 Jun 2022 09:00:00 +0000 https://blog.jetbrains.com/wp-content/uploads/2022/06/Blog_Featured_image_1280x600-3.png https://blog.jetbrains.com/?post_type=big-data-tools&p=254394 It’s the start of June. That means it’s time to start taking summer vacations and enjoying some fresh juice alongside your fresh news! Hi, I’m Pasha Finkelshteyn, and I’ll be your guide through this month’s news. I’ll offer my impressions of recent developments in the data engineering space and highlight new ideas from the wider community. If you think I missed something worthwhile, catch me on Twitter and suggest a topic, link, or anything else you want to see. By the way, if you would prefer to receive this information as an email, you can subscribe to the newsletter here.

News

A lot of engineering is about learning new things and keeping a finger on the pulse of new technologies. Here’s what’s happening in the world of data engineering right now.

DataHub 0.8.36 – Metadata management is a big and complicated topic. There are several solutions: some are free, some are paid, but none of them are particularly easy to use. I’ve had some experience with Apache Atlas, and even with the help of my colleagues, I wasn’t able to make it do what I wanted it to. On top of that, it’s part of the Hadoop platform, which created additional work that we otherwise would not have had to do. DataHub, created at LinkedIn, is a completely independent product, and the folks there definitely know what metadata is and how important it is. If you haven’t found your perfect metadata management system just yet, maybe it’s time to try DataHub! This new release brings exciting features like support for Apache Iceberg!

Feathr 0.4.0 – This feature store by LinkedIn is developing quickly. I know that many companies have not been able to find a suitable feature store on the market and have had to write their own. This task is not easy, and it takes a very long time and significant engineering resources to do properly. Meanwhile, it looks like LinkedIn has the necessary resources and is even ready to open up its solution to external contributors! The most notable change in the latest release is support for streaming, which means you can now ingest data from streaming sources.

Pulsar Manager 0.3.0 – Lots of enterprise systems lack a nice management interface and have to be configured with configuration files or via the command line. I am an old-school guy. I adore the command line, vim, and so on, but I also understand that sometimes configuration is such a complex task that it’s easier to do it once in a UI and then never have to think about it again. Apache Pulsar takes a step in this direction with its official management UI, Pulsar Manager. This release brings some improvements to the dashboard, as well as several bug fixes.

BookKeeper 4.15.0 – And while we’re on the subject of Pulsar, we should not forget to mention the engine behind it: BookKeeper. BookKeeper is usually perceived as nothing more than a backend for Pulsar, but the truth is that nothing stops you from using it in your own systems. The BookKeeper team presents it as a “fault-tolerant and low-latency storage service optimized for append-only workloads”, so if you need to store something in a distributed manner, you may not need a traditional database – perhaps BookKeeper would suit your needs better! In the latest version, BP-46 (running without a journal) has been implemented, along with several other features.

Impala 4.1.0 – While almost all data engineering SQL query engines are written in JVM languages, Impala is written in C++. This means that the Impala authors had to go above and beyond to integrate it with different Java- and Python-oriented systems. And yet it is still compatible with different clouds, storage formats (including Kudu, Ozone, and many others), and storage engines. It shouldn’t come as a surprise that Cloudera managed to achieve this, as they know how to create on-premises data engineering products. I don’t know how it happened, but at the time of writing there isn’t even an official changelog yet. However, you can find a diff against version 4.0.0 on GitHub.

RocksDB 7.2.2 – We often forget that certain data engineering products only work so well because they have other powerful tools under the hood. For proof of this, look no further than systems like Flink and Camunda, which rely on RocksDB. RocksDB is a storage engine written as a C++ library, with a key/value interface in which keys and values are arbitrary byte streams. It can store data virtually anywhere, for example in memory or on any kind of permanent storage device. And yes, it pays attention to correctness and efficiency when storing data.
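
To get a feel for that interface, here’s a minimal sketch using the Java binding (rocksdbjni) from Kotlin – the path and keys are just examples:

    import org.rocksdb.Options
    import org.rocksdb.RocksDB

    fun main() {
        RocksDB.loadLibrary()
        // Keys and values are arbitrary byte arrays; here we simply use UTF-8 strings
        Options().setCreateIfMissing(true).use { options ->
            RocksDB.open(options, "/tmp/rocksdb-demo").use { db ->
                db.put("user:42".toByteArray(), "Pasha".toByteArray())
                println(String(db.get("user:42".toByteArray())))  // prints "Pasha"
            }
        }
    }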

Future improvements

Data engineering tools are evolving every day. This section covers updates that are in the works for various technologies and that you may want to keep an eye on.

Kafka: Mark KRaft as Production Ready – One of the most interesting changes to Kafka in recent years is that it can now work without ZooKeeper. This is possible thanks to KRaft, a Raft-based consensus protocol designed specifically for the needs of Kafka. The goal of this Kafka Improvement Proposal is to declare KRaft production-ready and to make supporting and operating Kafka clusters much easier.

Flink: Support Advanced Function DDL – SQL query engines like Hive and Spark have supported external functions in SQL for quite some time. This allows developers and data engineers to enrich traditional SQL with their own extensions, which can be useful when you need to perform business-specific operations inside a regular query. Hopefully with the implementation of this Flink Improvement Proposal, Flink will support them too.

Spark: Use Parquet in predicate for Spark In filter – Though it is usually hidden behind the scenes, one of the most popular storage formats – Parquet – is evolving too. At this point in time, filters have been implemented on the storage level in Parquet, and Spark needs to catch up by adding support for native filtering. This improvement can make our queries dramatically faster in some cases!

Articles

This section is all about inspiration. Here are some great articles and posts that can help inspire us all to learn from the experience of other people, teams, and companies who work in data engineering.

RocksDB Is Eating the Database World – Continuing on the topic of RocksDB, here is an older, but still very interesting, article on what RocksDB is and how it works. It also provides some insight into why its popularity is growing rapidly.

Replicated Log – Here’s a relatively long and detailed article about replicated logs. A replicated log is a way to synchronize data among nodes in a distributed system. There are multiple ways to implement a replicated log, and most of them are somehow related to what are called consensus protocols, for example, Paxos and Raft.

Events

Current 2022: The Next Generation of Kafka Summit – This most popular conference related to Kafka is organized by one of its main maintainers, Confluent. Of course, the main topic is data streaming.

Big Data Event: London – Thousands of attendees are expected to participate in this big data event in London. They’ve already booked a large number of speakers from a wide range of companies, including the widely known Aerospike, Stack Overflow, and Snowflake.

That wraps up May’s Data Engineering Annotated. Follow JetBrains Big Data Tools on Twitter and subscribe to our blog for more news! You can always reach me, Pasha Finkelshteyn, at asm0dey@jetbrains.com or send a DM to my personal Twitter account. You can also get in touch with our team at big-data-tools@jetbrains.com. We’d love to know about any other interesting data engineering articles you come across!

Kotlin API for Apache Spark: Streaming, Jupyter, and More https://blog.jetbrains.com/big-data-tools/2022/05/26/kotlin-api-for-apache-spark-streaming-jupyter-and-more/ Thu, 26 May 2022 11:15:15 +0000 https://blog.jetbrains.com/wp-content/uploads/2022/05/Kotlin-API-for-Apache-Spark-1.1_Featured-Image.png https://blog.jetbrains.com/?post_type=big-data-tools&p=249904 Hello, fellow data engineers! It’s Pasha here, and today I’m going to introduce you to the new release of Kotlin API for Apache Spark. It’s been a long time since the last major release announcements, mainly because we wanted to avoid bothering you with minor improvements. But today’s announcement is huge!

First, let me remind you what the Kotlin API for Apache Spark is and why it was created. Apache Spark is a framework for distributed computation. It is usually used by data engineers to solve various tasks, for example ETL processes. It supports multiple languages straight out of the box: Java, Scala, Python, and R. We at JetBrains are committed to supporting one more language for Apache Spark – Kotlin – as we believe it can combine the pros of the other language APIs while avoiding their cons.

If you don’t want to read and just want to try it out – here is the link to the repo:

Repository

Otherwise, let’s begin our overview.

Spark Streaming

For a long time, we’d supported only one API from Apache Spark: Dataset API. While it’s widely popular, we can’t just ignore the fact that there is at least one more trendy extension to Apache Spark: Spark Streaming. Its name is self-explanatory, but just to be sure that we’re on the same page, allow me to elaborate a little.

Spark Streaming is a solution for building stream processing systems on top of Spark. Unlike some other stream processing solutions, Spark Streaming works with micro-batches: when reading data from a source, instead of processing one element at a time, it reads all the data that arrives within defined time frames, or “batches” (for example, it might read everything available every 100 ms).

There are multiple core entities in Spark Streaming:

  1. A discretized stream (DStream) represents a continuous stream of data. It can be created from an input source (a socket, Kafka, or even a text file) or from another DStream. A DStream is represented as a sequence of RDDs (Resilient Distributed Datasets).
  2. The Spark streaming context (StreamingContext) is the main entry point for working with DStreams. Its main goal is to provide us with methods for creating streams from different sources.

As you might already know, we have a special withSpark function in our core API (you can view it here, for example). Of course, for Streaming, we have something similar: withSparkStreaming. It has some defaults that we think are reasonable. You can take a look at them here if you want.

The very basic sample usage will look like this:
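
A minimal sketch, assuming a text source on a local socket and the signatures described in the project’s wiki (the batch duration, timeout, host, and port are only illustrative):

    import org.apache.spark.streaming.Durations
    import org.jetbrains.kotlinx.spark.api.*

    fun main() = withSparkStreaming(batchDuration = Durations.seconds(1), timeout = 10_000) {
        // ssc is the streaming context the block gives us
        val lines = ssc.socketTextStream("localhost", 9999)
        val words = lines.flatMap { it.split(" ").iterator() }

        words.foreachRDD { rdd, _ ->
            // withSpark(rdd) obtains the Spark session associated with the RDD of this micro-batch
            withSpark(rdd) {
                println("Words in this micro-batch: ${rdd.count()}")
            }
        }
    }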

What can we see here? We create a Spark Streaming context with the call withSparkStreaming. Many useful things are available inside it, for example, a withSpark function that will obtain or create a Spark Session. It also provides access to the ssc variable (which stands for – you guessed it – “spark streaming context”).

As you can see, no other non-obvious abstractions are involved. We can work with the RDDs, JavaDStreams, etc. that we are familiar with.

The withSpark function inside withSparkStreaming is slightly different from the one you’re familiar with. It can find the right Spark Session from the SparkConf of the ssc variable or (as seen in the example) from an RDD. You still get a KSparkSession context, which gives you the ability to create Datasets, broadcast variables, and so on. But in contrast to its batching counterpart, the Spark Session won’t be closed at the end of the withSpark block. Lastly, it behaves similarly to Kotlin’s run function in that it returns the value of the last expression in its body.

You can find more examples on our GitHub, and more detailed documentation is available on our wiki.

Jupyter support

A Kotlin kernel for Jupyter has existed for some time already. It lets you experiment with different Kotlin for Data Science tools, such as multik (a library for multi-dimensional arrays in Kotlin), KotlinDL (a deep learning API written in Kotlin, inspired by Keras and running on top of TensorFlow), and others described in the Kotlin documentation. But we are aware that Jupyter is quite popular among data engineers too, so we’ve added support for the Kotlin API for Apache Spark as well. In fact, all you need to do to start using it is put %use spark in a notebook cell. You can see an example on our GitHub.
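
For instance, a couple of notebook cells might look like the sketch below (dsOf is one of the session helpers the API provides; the implicit session and the full set of helpers are described on the wiki):

    // Cell 1: load the Kotlin API for Apache Spark integration
    %use spark

    // Cell 2: the cell body is implicitly wrapped in a withSpark { ... } block,
    // so the API helpers (such as dsOf) are available directly
    val ds = dsOf(1, 2, 3, 4, 5)
    ds.show()  // rendered as a table in the notebook output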

The main features you will find in this support are autocompletion and table rendering. When you use %use spark, all of the notebook cells are automagically wrapped into one implicit withSpark block, which gives you access to all the sugar we provide.

The aforementioned Spark Streaming is supported too. To use it, all you need to do is add %use spark-streaming. Of course, all the features of dynamic execution are supported – for example, tables will update automatically.

Please be aware that Jupyter support for the Kotlin API for Apache Spark is experimental and may have some limitations. One limitation we are aware of: you cannot mix batching and streaming in the same notebook – withSparkStreaming doesn’t work inside a withSpark block. Don’t hesitate to send us examples if something doesn’t behave as it should for you. We’re always happy to help!

You can find an example of a notebook with streaming here.

Of course, it works in Datalore too. Here is an example notebook, as well as an example notebook for streaming. In case you’re not aware of what Datalore is: It’s an online environment for Jupyter notebooks, developed by JetBrains.

A bit more information on Jupyter integration can be found on our wiki.

Deprecating c in favor of t

While preparing this release, we noticed that our own reinvented tuples (called ArityN) aren’t actually that efficient when used extensively, and that it’s better to reuse Scala Tuples. So, we’ve deprecated the factory c method in favor of the factory t method. The semantics are exactly the same, but the method creates a native Scala Tuple instead of a Kotlin one.

In Kotlin-esque fashion, here are some useful extension methods for tuples:
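
To give you a flavor, here’s a small sketch (the package and extension names are as I recall them from the wiki, so treat them as approximate and check the wiki for the authoritative list):

    import org.jetbrains.kotlinx.spark.api.tuples.*  // package name as I recall it; see the wiki
    import scala.Tuple2

    // t() builds native Scala tuples instead of the deprecated ArityN/c() ones
    val pair: Tuple2<Int, String> = t(1, "one")

    // Destructuring works thanks to componentN() extensions on Scala tuples
    val (number, word) = pair
    println("$number -> $word")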

There is much more information about the new Tuples API available in the wiki.

We have also changed the structure of the documentation.

The readme grew too large to easily digest, so we’ve split it up and put its contents into the wiki on GitHub.

Conclusion

These are the most important changes to the Kotlin API for Apache Spark version 1.1. As usual, you can find the release on Maven Central. If you have any ideas or if you need help or support, please contact us on Slack or GitHub issues.

Also, please give a warm welcome to Jolan Rensen, who contributed to the project for some time and now works at JetBrains. He’s the main person in charge of maintaining the project, while yours truly only tries to help him out wherever he can. Jolan, we’re thrilled to have you with us!

Big Data Tools 2022.2 EAP: What’s New? https://blog.jetbrains.com/big-data-tools/2022/05/20/big-data-tools-1-6-eap/ Fri, 20 May 2022 13:01:08 +0000 https://blog.jetbrains.com/wp-content/uploads/2022/05/bdt-22-2-EAP.png https://blog.jetbrains.com/?post_type=big-data-tools&p=246633 Big Data Tools 2022.2 EAP is now available. You can try the newly added features right away by installing the latest plugin version to the 2022.2 EAP of your IDE.

Please note this is an Early Access Program build, meaning it’s not fully tested.

Hive Metastore support

You can now create a Hive Metastore connection from the EMR cluster window and browse Hive catalogs, tables, and columns.

Apache Flink Monitoring

You can now monitor Flink applications right in your IDE. Just like in the Flink Dashboard, you can launch and stop jobs, all without leaving your editor.


The dedicated Flink tool window allows you to preview:

  • A list of jobs with their details: exceptions, checkpoints, and configurations.
  • Task managers, with the ability to view logs, stdout, the log list, and thread dumps.
  • The job manager, with information on its configuration, logs, stdout, and the log list.

From each of these tabs, you can go directly to the Flink web interface by clicking the ‘Open in Browser’ icon.


Try Big Data Tools 2022.2 EAP and let us know what you think! We’re looking forward to your feedback in the corresponding tasks in our issue tracker.

Stay tuned and follow us on Twitter!

The Big Data Tools team
