In the run-up to the event, Gazeta.Ru interviewed Alexey Natekin. Alexey is the founder and coordinator of the global Open Data Science community and a speaker at the forum's Who Owns Personal Data workshop. What is the future of the data-related professions? How can you protect your privacy online? How can your digital footprint be used? These were our questions for Alexey.
We talk more and more about data journalists and data scientists. Why do so many professions get a data-related offshoot these days?
That tendency is a very peculiar one. To understand whether these new professions are here to stay and who will replace data scientists when the time comes, let's dive into the situation at hand and look at where some of these faces in the industry are coming from. The trends we see in data science have a direct Darwinist parallel.
First, the very technique of data analysis is becoming more and more mainstream; it is the new programming of sorts. Twenty years ago every industry experienced a surge of software engineering, and software developers were needed to meet this demand.
A similar demand for data analysis now comes both from the scientific community, with people needing econometrics, psychometrics and so on, and from the business community, where thousands of different kinds of analytics are needed. Product, client, risk, data analysis… You name it! And some trades evolve and upgrade, putting forward new products and creating added value thanks to data analysis.
Even in journalism?
For instance, data journalism gives you an opportunity not only to visualise data so that it looks good, but also to tell a data-based story, which is more believable. And then you can give your readers a way to delve deeper into the data and create a more personal experience by reshaping the narrative the way they want without losing sight of the article.
How universal are the data-analysis tools?
The basic tools are the same for most of the approaches I've just named and are easily applicable to other fields. Data analysis specialists can change their area of interest with ease. You can switch from finance to telecommunications to oil and gas to the chemical industry to what have you. It is also worth noting that these incredibly diverse specialisations can easily be divided even further to focus on the analysis of text, images, processes, graphs and relations.
We might even say that the Darwinist parallel here is that data science is experiencing a lot of speciation. The main reason behind it is that data analysis is spreading into all areas of life that are not too heavily regulated, isolated or conservative, as if they were new biomes like prairies, tropical forests and tundra.
What else is important in terms of the tools that data analysis employs?
As the industry grows and players exchange more experience and best practices, data-related roles become more specific. For example, junior engineers specialise in the engineering side of implementing machine learning models in the workplace. Data engineers, on the other hand, are responsible for the infrastructure side of things and for processing the data itself.
How can you tell one role from another then?
At first, as soon as new roles emerge, there is a lot of ambiguity as to what they are for. MLOps, which has gained a lot of traction over the last year or two, is a perfect example of that. But best practices and specific goals will appear when the time comes, and with them the uniqueness of a particular role. Don't forget that we already have a much more visible and formalised set of data-related specialisations, where a lot of work has already been done.
If we take a closer look at the business community, there are data owners who take full responsibility for their data as a fully fledged business product. By their side you see data governance departments that keep an eye on where the different kinds of data get stored. Closely related to them are the data quality assurance professionals.
And there are more in-house data-related processes going on, resulting in new roles and objectives popping up. In the West, there is a surge in privacy-related positions that ensure compliance with privacy laws. Companies can go as far as creating a separate Chief Data Protection Officer position. They may have a whole bunch of those data chiefs now. There are your Chief Data Officers and Chief Data Scientists coexisting peacefully with more traditional Chief Analytics Officers and Chief Scientists.
Back to this Darwinist parallel of yours. It seems that there is not only speciation going on in the field of data analysis, by means of interspersion and new roles coming into existence in various industries that can be considered biomes, but also natural selection.
Absolutely. What we see is very Darwinist, because this selection makes new roles more specific and efficient. We might presume that if these Chief Officers appear in companies that employ thousands or even hundreds of thousands of people, then the role in question has made it through natural selection. And these roles seldom stop developing further.
Hence some roles become redundant…
As a result of natural selection some roles can either change or even disappear as the industry matures. Back to the programming parallel I made: 25 years ago you could hire computer scientists who often came into the profession on the back of their PhD research. Generally speaking, these were highly qualified specialists with an academic background, and they were very sought-after in the rapidly growing world of software engineering. The Internet getting big at the same time also helped.
The role the data-scientists have today is quite similar to that, but the picture wouldn't be complete without another parallel.
10-15 years ago there were also webmasters, jacks-of-all-trades of sorts who could both code all the web components and create content. They were even capable of promoting the websites. Today you would need a dozen specialists to do the job: front-end and back-end developers, admins, UI and UX designers, mobile web developers, SEO professionals, marketing experts and advertising copywriters. Keep in mind that you also need other team members to help you manage this army of specialists.
It might be tempting for some to turn back the clock and hire one webmaster instead of that army of specialised professionals. He wouldn't be as good as them, but he would get everything done on his own. Hopefully I don't have to explain the irony here, and it is clear that webmasters going extinct is a good thing.
The same process, happening even faster, can be seen if we look at data scientists. I've already said that nowadays more than ten data specialists can be working on a product that requires data analysis. And each of them might be important and useful depending on how complicated product development is. Memorising all of these roles and demanding that the specialists you seek keep it all in mind is not a very realistic scenario.
Moreover, companies that have a better understanding of the subject and of who they are looking for to join their data teams specify the role in question beforehand.
Framing it further in terms of the Darwinist parallel that we drew, there will be branches of data science that lead nowhere. There will be those that, like the Neanderthals, partly die out and partly hybridise with those that survive. And there will be ancestor species that are common to a variety of newer ones. One example of that is the data scientist, who is the father of many newer roles.
It's okay that computer scientists have become extinct, but computer science is still there, developing extensively. The same goes for more specific roles, like that of the data scientist and probably a whole bunch of others that will become extinct while the tree of data science roles keeps on growing. These new species won't appear, get stronger and come of age without natural selection taking place in the labour market. The fate of the less competitive roles is, then, to merge with more successful ones or to step aside.
So you believe that everything will be OK?
It's just that a new generation of more competitive, efficient and developed roles will come to replace data-scientists.
Let's talk about maintaining anonymity on the Internet. Is it true that all data there becomes public by default? Are there people you can't find on the Web, or has everyone's data long been available?
In terms of the digital footprint we leave online, what we post and create is the most visible part of it. Example: what we share on social media. The videos, presentations, graphics, code, podcasts, all sorts of products of our creativity, as well as the reactions and comments we share all belong to this category.
Historically speaking, there was an urban legend about a site that holds information about every human being. It's funny that it has turned out to be a self-fulfilling prophecy based on our desire to share. It comes as no surprise that, in this new global communication and IT environment, we were the ones who realised that we have that desire. This sort of digital footprint, the one we leave consciously, is called active.
A trait of our time is that the biggest websites produce zero content of their own; instead, they give their users all the tools needed to create and share content.
It makes zero difference whether we talk about social media, video hosting platforms or rental services. Humanity as a whole has become engaged in this broad public network where everyone shares everything and everything is available. Of course, some data can be relatively hidden or available only through a club membership or subscription, but the rule of thumb is that everything that has been shared can be accessed. The road is where we go, and the modern Internet society is where we leave our digital footprints. And the rapid way in which today's open and public information transfer influences not only society but science as well is a very positive and uplifting thing to talk about in its own right.
But this publicly available information is only the tip of the iceberg. The bigger part is the kind of data about you that is not as public as you might think. We call it a passive digital footprint or a digital shadow. For instance, billions of back-end interactions are taking place and data feeds are flowing right now, as someone reads this very interview.
There is a lot of talk right now about smartphones collecting our data even when we are not using them. Many people tape over their webcams and don't talk about anything important near the speakers. Is there a foundation for such precautions? Or are we dealing with conspiracy thinking?
Yes, we leave more digital footprints of various kinds when we use smartphones. For example, our geographical position can now be revealed when we consume data. Another example is granting apps access to contacts, calls, mic and camera, often unknowingly, when we install them. Obviously voice assistants can't work without access to your mic, but which data will be analysed, and in what way, is left for the apps to decide.
All sorts of paranoia regarding surveillance and de-anonymization are a bit late to the party: there is hardly anything we can do about these things now. Half of the time a user can be identified simply by analysing their behaviour on the site; not even their login is needed for that.
It gets even more dramatic when we talk about de-anonymization. For one thing, if enough data gets collected, developed and mapped in a diligent way, a person in a video can be identified by their manner of walking alone, even if they are dressed in a full-body Halloween costume with a mask. This isn't mainstream yet, but it may be in three to five years.
Going back in time, we will see that graphometry was used before biometrics. It made it possible to identify you by your handwriting and the manner in which you write texts. So even if you work hard to create an alter ego, your anonymous comments can still be traced to you with a certain degree of certainty. The only way to hide in the modern world is to leave digitalised civilisation for the woods.
Literally prolonging your life by using the digital footprint you left to create a bot or a digital copy of you is already a reality. Wouldn't that be the elixir of immortality, albeit digital, that humanity has always searched for?
It's really simple. This copy is as much of an elixir of immortality as portraits and personal diaries were at different points of human history. Now we get an opportunity to recreate the experience of engaging with a person in much more detail, because we have much more data about them. For instance, whereas back in the day we only had portraits and photos, now we can generate a custom depiction, which might be animated and say some specific phrases out loud. The depiction might even mirror the preferences and behaviour of that person.
With detailed digital models of people, it is possible to generate a video with a pre-determined plot. The people in the example I'm going to give didn't leave a huge digital footprint, but it illustrates my point perfectly. We could, for one thing, generate a video that would come up if you searched for "Albert Einstein meeting Hermann Hesse at breakfast in Cologne on the 26th of March, 1931". However, that would be nothing more than pre-generated multimedia material that can be generated again on demand, with a different result.
Creating a model of a person based on their digital footprint is nothing but an improvement in technology, and it can't replace living, breathing humans with their identities. This is something that those using these models should keep in mind. The models won't write the way these people would if they were alive now. But they help to pass on the experience of these people, as well as any other information about them, so that it doesn't disappear with their biological death. Still, a human is something bigger than the sum of their data.