Video Transcription


MODERATOR:  In two sentences, what is data science and machine learning?

JEFF: Yeah, I'll start, keeping it brief. I think, to me, data science is using data to inform decision making. And then, machine learning is, I think, a particular tool in data science that allows you to sort of make automated decisions based on the data. So when you have some data, it builds something that makes the decisions for you, rather than spending a lot of time interpreting the data yourself.

MICHAEL: All right. So, I think data science is taking data and making it useful, right? Looking at, finding applications, and relevant features in the data for whatever you're trying to do. Machine learning, the best explanation of it I've heard is “a search problem”. So you have background knowledge, and you're searching through possible explanations for that data.

MINA: I'm from academia, and you're telling me two sentences? I'll talk for two days. So to me, data science is storing the data, cleaning it, and cleaning it, and cleaning it. That's the hardest part of working with data. And then, analyzing it and presenting it. That's also a very important part of it. And then, that can improve the decision-making process. And machine learning is just one part of that, which helps you get from the clean data to the end result.

MODERATOR: So, in order to ground this and make it applicable to Very, this first question will be directed a little bit towards Michael and Jeff. How have we used data science on past and current client projects?

MICHAEL: I have a list. I don't know how deeply I will go. This is the one I’m a little prepared for. So we did social network analysis, we've detected fake news, we've done facial recognition, we did recommendation systems. Jeff worked on that frame.

JEFF: So we took Slack data and analyzed the way that people were communicating in Slack, and tried to find out which people were going to stay at an organization and which people were happy. We also found, like, informal hierarchies in the data, which was interesting. So you have the formal hierarchy, but then we were able to look at the Slack data and find the actual hierarchies in the data.

MODERATOR:  So looking at the relationships in a network of people and the relationships amongst each other?


JEFF: Yeah, and just really quick to that point, like Dr. Sartipi was talking about: cleaning, cleaning, cleaning. Before that, we were dealing with natural language Slack data. So how do we use the proper techniques to transform that into a valuable social network graph or a lot of the other things that we did? And so, yeah, a lot of data cleaning with natural language, messy data was a big part of that.

MICHAEL: And so, we've used natural language processing (NLP) and similar approaches to do fake news detection. Taking text, this unstructured, informal data, and turning it into something that's more structured and that we're able to analyze. So we did that for our fake news detection, and for a project we're working on now, annotating legal documents.

MICHAEL: That is it. And then I was going to let Jeff talk about the recommendation systems and the facial recognition, since that work was his.

JEFF: Yeah, I worked with Brian Zambrano for a little bit on a project where we were analyzing bike-ride routes. Given a set of bike rides that people have gone on, how do we recommend new routes that they might enjoy? And it's interesting there, it was GIS data, right? GPS coordinates and elevation. And so how do you turn a time series through space, a space-and-time data set, into something that can be compared with other space-time datasets for similarity, for recommendation purposes? So, yeah, that was a really fun project. And then, the facial recognition? Yeah. So when we did facial recognition, the big thing there, and it's important, I think, in our industry and for our clients, is that we're always cognizant of the tools that are already out there at our disposal. And in that case, facial recognition, there's an API for that, right? That Amazon runs. It's called Amazon Rekognition; there may be others. We leveraged that, but we found that it had very high precision and low recall. I guess I'll go into that a little bit. But basically, it was very unlikely to falsely admit someone that was not you, but it was highly likely to reject you even though it was you, right? So we did a pretty simple Bayesian technique around that, where we took multiple observations and multiple pictures of people and used that to basically provide a better likelihood of whether or not this was the person that was querying the system with their face. So, yeah, that was a really neat one, and I'll talk a lot about this. But the thing that I've learned, coming here, is how do we take tools that are out there and the knowledge we have and build something in a lean manner quickly, just like everything else that we always do for our clients.
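
The Bayesian idea Jeff describes, combining several imperfect match results into one likelihood, can be sketched roughly as follows. This is only an illustration under stated assumptions, not the project's actual code: the function name `posterior_match`, the per-image rates (`recall=0.6`, `false_accept=0.01`), the 50/50 prior, and the assumption that each picture is an independent observation are all hypothetical.

```python
def posterior_match(observations, recall=0.6, false_accept=0.01, prior=0.5):
    """Posterior probability that the person is who they claim to be.

    observations: iterable of booleans, one per picture
                  (True = the face API reported a match).
    recall:       assumed chance a genuine user is matched on one picture.
    false_accept: assumed chance an impostor is matched on one picture.
    prior:        assumed probability of a genuine user before any picture.
    """
    p_genuine = prior
    p_impostor = 1.0 - prior
    for matched in observations:
        if matched:
            p_genuine *= recall
            p_impostor *= false_accept
        else:
            p_genuine *= 1.0 - recall
            p_impostor *= 1.0 - false_accept
    # Bayes' rule: normalize over the two hypotheses.
    return p_genuine / (p_genuine + p_impostor)


# One rejection alone lowers confidence, but two later matches outweigh it,
# because a match is very unlikely for an impostor (high precision).
single_reject = posterior_match([False])
reject_then_two_matches = posterior_match([False, True, True])
```

With these assumed rates, a single rejection drops the posterior below the prior, while one rejection followed by two matches pushes it well above 0.99, which is the effect of "multiple observations" in the description above.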

MICHAEL: I did want to add one more thing. So Daniel actually did something very interesting on a project with data science, where he was trying to decide whether to build a feature. And so he looked at the database and looked at who was actually using features related to that, and then he was able to make an informed decision about how to do development on that project. And so I think that's one way that we can use data science in a symbiotic relationship with the rest of our projects.

MODERATOR:  What's the most challenging project, in general, that you have worked on?

MINA: So I think when I mentioned cleaning that many times, I really meant it, because a lot of times, when we get contracts, especially from industry, they want progress. And you're cleaning the data. Nobody believes you, until they have actually worked with the data, that the data is messy, the data is missing points, there are mistakes in the data. And how can you actually go into the project without having that expertise? Because we work in the energy, transportation, and health areas, and I don't have any of those backgrounds. I'm a computer scientist. I do data analysis, I do wireless communication, but I'm working with people that are from health. The first time that I was working on a health project, I had my Google open. I was really translating the words that they were saying. And believe it or not, we test the same thing but in very different languages. So going into it when you have people from different fields, that is one of the main challenges. Because when you get to clean data and you know exactly what you want to do, that's the easy part. Getting there is the hard part. The other thing that was an issue: we got the data from the Tennessee Department of Health, we ran a deep learning algorithm on it, and it was an LSTM, and it was a beautiful result, 90-some percent accuracy. The MD in the room says, "So what?" And I was like, "What do you mean by 'so what'? This can be published." And she was like, "No doctor would use it. This is too complex. This requires a lot of computation. Nobody would be using it. Reduce it to 80 percent, less accuracy. We want that." And I mean, they were saying that no doctor would use this as a single data point. This is going to be one data point in addition to everything else that they have. So even if it's not as accurate as you wanted it to be, it would still be very valuable for them.
So I think going into a lot of these projects, and I'm hearing you talk about social networks, facial recognition, a lot of these things, you approach it from your domain of expertise and then it's for something completely different. Just adjusting for that is, I think, to me, always the challenge for all the projects that we have.

MODERATOR:  Are there any processes or tools, in particular, that you've found that help with that cleaning process?

MINA: I don't think there's any tool. From the very beginning now, when we start working on a project, I'll let them know that 70 percent of my time would be on this part, to understand the data, to clean the data. As long as we have that agreement, we're good to go.

MODERATOR: Okay, so the data’s key in data science?

MINA: It’s very key.

MICHAEL: So the issue that I've always had on every project that I've had at Very is labels. We have a lot of trouble collecting examples of the classifications that we want from the data. So they might have data, and it's usually messy and requires a lot of cleaning. But then beyond that, people think that you can just pull answers out of thin air. And what you really need is examples of "this is the data, and this is the class that it belongs to, and here's more data and it belongs to this other class." In that way, you can separate it. And without that, it becomes a lot trickier.

JEFF: Yeah, to add a little bit of color there, with supervised learning in particular, which is where we have some input and we want to predict an output, we have to have many, many examples of ground truth, input and output. And it's extremely common for clients to think that they are ready to have a machine learning solution because they have a lot of data, but they don't have the ground truth outputs that they want you to predict. So that involves trying to be clever in a lot of ways in dealing with that. Another thing, adding to what Dr. Sartipi was saying, I think is important. You heard me mention precision and recall, which are two accuracy metrics that are used to evaluate algorithms. It's important that you have a wide and deep knowledge of these different error calculations because, depending on the application, certain ones will be more important to the domain expert or the user. So just raw accuracy, or the exact number of predictions that you got right, may be meaningless, depending on the application. I mean, it will have some meaning, but relative to other accuracy metrics it may not be as important. It may be okay to have a lot of false positives in order to prevent missing a positive example, things like that.
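
To make the difference between raw accuracy, precision, and recall concrete, here is a small sketch using the standard definitions. The helper name `classification_metrics` and the ten example labels are made up for illustration; they do not come from any project discussed here.

```python
def classification_metrics(y_true, y_pred):
    """Accuracy, precision, and recall for binary labels (1 = positive)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives
    return {
        "accuracy": (tp + tn) / len(pairs),
        # Of everything flagged positive, how much really was positive?
        "precision": tp / (tp + fp) if (tp + fp) else 0.0,
        # Of everything really positive, how much did we catch?
        "recall": tp / (tp + fn) if (tp + fn) else 0.0,
    }


# Hypothetical data: three real positives, but the model only flags one.
y_true = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
y_pred = [1, 0, 0, 0, 0, 0, 0, 0, 0, 0]
metrics = classification_metrics(y_true, y_pred)
```

In this made-up case the model is right on 8 of 10 examples and never raises a false alarm, so accuracy and precision look fine, yet it misses two of the three positives. Whether that is acceptable depends on the application, which is the point being made above.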

MICHAEL: And I think that that kind of gets to an issue that we have, which you have touched on, which is translation. Because we come to these people who are experts, and we really have just a handful of techniques that we can apply to a lot of different problems. And so, there's always that process of taking what they know and figuring out how to work it into this framework that we know. So, there's usually some miscommunications there, and we have to figure that out over some time.

JEFF: So the people part?


JEFF: The worst part for all the scientists.

MINA: I don't mean it to sound cold, but I was telling a lot of people that data, to me, is a number. It can be the blood pressure of a person, it can be traffic accidents, it can be the number of rainy days. It is just a number. And what we are trying to do is add a lot of collaboration between computer scientists, social scientists, and psychologists. Now, I actually have a psychology student and a social science student on my team, because we want to add that "people sense" to the data, because now, data science is not just a cold hard science that you just do. It's just like you solve a theory to get a problem. Now, we're using it for businesses. We are using it for real applications. So we need that part.

MICHAEL: That was actually something we dealt with on Foresight, where we were trying to figure out what these things meant. So we were able to cluster the data. We were able to figure out what communication was important and what direction communication was flowing in. But we talked to psychologists about what that meant and how these structures existed in the real world.

MODERATOR: So to summarize, and you guys tell me if I'm wrong: it sounds like data is going to be a big part of the job. Actually starting with the right data and labels: is this actually an application? Are they ready for us to apply data science? And then, understanding the general problem, right? You have to have your error calculations right. If I'm measuring accuracy for solving a completely different problem, then that's not going to help the people in the end. Does that sound correct?

JEFF: Yeah.


MODERATOR:  What is the biggest hurdle for Very to become one of the top machine learning consultancies? 

JEFF: I would say I've got three things. We need more team members with expertise. That's a big one, big number one, the biggest one in my opinion. We need content. Emily is driving hard on that but it's been preached a bunch that the more content we get out there, the more people will recognize us as experts. And then when it comes time when we get the contracts, we have to execute. And that will generate more content and hopefully, attract experts to want to work with us, for us.

MICHAEL: I think the biggest hurdle that we have is that traditionally, we worked with very early-stage startups that don't have much data. And so before you can even worry about cleaning the data, you need to have the data. And that's something that's been a challenge. I mean, I think we're moving in the right direction there, but that's definitely been a challenge so far. And the other big challenge I see with consulting in data science, in general, is that places are basically outsourcing their R&D to us. And so, we have to build trust with these companies that we're working with.

JEFF: Yeah. I resonate with that a lot. People, especially entrepreneurs, if this is going to be the core of their business, to get somebody to outsource the core of what they're doing, the core algorithm that's going to make them their millions or whatever they think as an entrepreneur, it's tricky. And also a lot of times, R&D, this stuff borders between R&D and software. And I think we should push to make it lean and as close to our regular software build as possible, but it's going to be expensive and time-consuming for a lot of our clients. And we have to figure out, like I said in my number three, how to execute that well.

MODERATOR: So we'll validate the clients that have data. And Dr. Sartipi, I know that you deal a lot with industry. Maybe not knowing much about Very, what would you guess could be a large hurdle?

MINA: From what I'm hearing, and I think I heard it earlier, Michael mentioned it, too, I think one of the things is how many experts you have on your team. And a lot of times it would be much easier if you bring in interns and you train them. I do the same thing at the university level. I bring students into my research group as sophomores and juniors. I train them, they stay for their Master's and Ph.D., and then you can see the result. And one of the things I would say, and I don't know that about Very because I'm not familiar with it, but a lot of companies I know and see, like in Chattanooga, when they are hiring, they want someone coming in ready to go. Such a thing does not exist. We hear a lot of bad from industry, from our industry advisory board, either startups or big corporations, that "They don't know this." Well, we can't change our curriculum based on everybody's need. What we do is we teach them the basics that they need to know. The rest of it is really, you need to invest, especially if you are recruiting our students right out of college. And I think internships and things like that can be a way that you can really train them, to take them to a place that you want them to be, because they will come with what they need to start with, and the rest would be, I think, the companies' responsibility.

MODERATOR:  How could data science minimize our risk as a consultancy, in general, not just in terms of us wanting to get data science contracts but the consultancy as a whole?

MICHAEL: So we were talking about lead qualification earlier. I think the fact that we can use a gut feeling for that, it seems like we could augment that with some analysis, right? So we have the data available, and this is beyond machine learning, just building tools where we can take what we know about the business and we can mitigate some of the risks. So we can look at what are high-risk clients and how much do they pay off versus what are lower-risk clients and how much do they pay off. We can do some analysis and basically treat it like a financial problem, where we are able to diversify our risks and invest in different areas. Also, the other thing that we have issues with is actually like resource allocation. And I think that if we could stack the pipeline, and I know we don't have a whole lot of control over this, but if we could control what's coming into the pipeline and what people are going to be available, then we could maybe better design projects that are coming up based on our team. 

JEFF: Yeah, we have a whole slew of conversations around resource allocation. And I think we do a pretty good job, but that's a very well-defined and well-understood area of research. There are pretty good algorithms out there to solve these problems, and I'm sort of of the belief that maybe we don't treat the algorithms as ground truth or as the final say, but they could at least be a tool. So that's a big thing. And then another thing that's always made me cringe, I'm sorry, Gabe and Tyler, is that the revenue models are always sort of optimistically trending upward, or they're just highly sensitive to whatever the last few months were or something, without much probability built in. That's something that I would really like to explore. Of course, it's a resource allocation problem. Do we have the resources to put towards solving that problem? But in terms of helping our consultancy, I think those are some big areas.

MODERATOR:  How could data science create smarter IoT technologies? Dr. Sartipi, this might fit well with smart cities and urban development.

MINA: To me, these two are inseparable. We do a lot of projects with IoT, and I see IoT as a data resource or data source. So that is the one that's generating it. I don't know exactly if that's what your question is asking, but that's how I see it. IoT is what we are applying data science to, to make a smarter decision, a better decision, a more intelligent decision. It can be cameras on the street, which is what we are doing now, to make drivers more alert about their surroundings. It can be any physiological sensors for patients, so we can have real-time data and we can do personalized medicine versus generalized medicine. It can be sensors on the smart grid. They do resource allocation, we can do maintenance prediction. So to me, that's where data science is needed. If you have the IoT and you're not using data science on the data you are generating, I don't know what that IoT would be.

JEFF: I think dumb sensors and smart algorithms are what can really help the IoT space. Can you make something cheap and disposable, that maybe isn't even super reliable but have a smart algorithm that can make value from that, right? So that's where I think there's a huge opportunity there. It's one of the things that I've been lucky enough to work with a little bit. Yeah, there's tons of potential there. There's a lot of instruments out there right now that are very expensive, very precise. Like I mentioned with the facial recognition, there's value in having a lot of crappy measurements rather than a very expensive, perfect one. So that's something that's really interesting to me.

MICHAEL: One thing I'm really interested in is something that Dan brought up a while back, which is edge computing. So pulling back the data to our centralized place to actually do the training and then pushing out a model to do inference on these devices. I think it is interesting.

MINA: I really like that, Jeff, "dumb sensors and smart algorithms."

MODERATOR: So to tie into another one of our pillars, with Blockchain, how could distributed ledgers further data science? Or could they be used?

JEFF: I think the main thing that I see, as a data scientist and also not being a blockchain expert, is just the immutable record factor. If you know that certain events are basically not going to change out from under you, you can have maybe a little more trust, or build systems like the one in Tyler's talk about the supply chain for champagne. You could do things like let anybody record a transaction on the blockchain without any validation beforehand, and then use data science after the fact to decide if that was likely real, if this really was champagne, if these grapes came from wherever, based on the information we know, maybe about the weather or that sort of thing. And it would be very hard to fake something like that. So if there's a bottle of champagne, we have this trace of the things that went into this bottle, and maybe data science could look at that and say, "You know what, it's extremely unlikely that that's really champagne." And then also, because of the whole distributed ledger thing and the transparency of it, you know what transactions might have been faulty and maybe who participated in them.

DANIEL: So a way to improve the quality of what is stored on the blockchain, so to speak, or just verify the quality.

JEFF: I guess it would be almost like fraud detection now, except that you can't really turn back. If you've committed fraud, you can't go back and cover it up, right? So you would always have a record of the fraud, because of the immutability.

MICHAEL: So for me, I think any use would not be application-specific to what data you're storing on the distributed ledger but I just see it as another data source, right? It’s something where you can easily pull in the holistic data so rather than getting a little piece of it you could get everything and then you can do whatever sort of analysis is relevant to that domain.

MODERATOR: Can you guys think of data that would benefit from being stored in a distributed and immutable fashion, where those qualities actually would benefit the project?

MICHAEL: So think about times where you have a large number of people who all want to arrive at some shared ground truth. I don't have a particular example in mind, but I think that's helpful, and then you can analyze flow, and you can look at populations and make predictions. So, if you're talking about wine, you can predict future production based on what's available. I mean, that wasn't the original intent of that blockchain, but it's data that's available on that blockchain, so you can use it in other ways.

MODERATOR: Can data science tools or techniques help me do my job if I don’t want to be a data scientist? And if so, then how? 

JEFF: Pandas.

MODERATOR:  Pandas. Cute, cute, black and white bears?

JEFF: So, one of the tools that I preach about a lot, talking about cleaning data like Dr. Sartipi was saying, Pandas is a tool in what’s called the Scientific Python stack or SciPy stack. And really, it lets you quickly clean and work with data. It’s useful. There’s a quick way to hook it into your database, make a few queries, and then get local data in memory that you can manipulate super fast. Even if you don’t want to be a machine learning expert or whatever, just having that tool under your belt can be pretty useful.
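
As a rough illustration of the workflow Jeff describes (hook into a database, run a query, then clean the result in memory), here is a minimal sketch. The in-memory SQLite database, the `rides` table, and its columns are hypothetical stand-ins for whatever database a real project would use.

```python
import sqlite3

import pandas as pd

# Hypothetical in-memory database standing in for a project's real one.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE rides (rider TEXT, distance_km REAL)")
conn.executemany(
    "INSERT INTO rides VALUES (?, ?)",
    [("ann", 12.5), ("ann", None), ("bob", 30.0), ("bob", 30.0), (" Cy ", 8.0)],
)
conn.commit()

# Pull the query result straight into a DataFrame held in local memory.
rides = pd.read_sql_query("SELECT rider, distance_km FROM rides", conn)
conn.close()

# A few typical cleaning steps: normalize names, drop missing values,
# and remove exact duplicate rows.
cleaned = (
    rides
    .assign(rider=rides["rider"].str.strip().str.lower())
    .dropna(subset=["distance_km"])
    .drop_duplicates()
)

# Quick summary once the data is clean.
totals = cleaned.groupby("rider")["distance_km"].sum()
```

Which cleaning steps are appropriate depends on the data. Dropping duplicates, for example, is only safe when repeated rows really are errors rather than legitimate repeat events.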

MICHAEL: Yeah, exactly. So anytime we have data, it's a very quick way to pull it in and clean it. So I think that tool can be invaluable. And data science is pretty general, because everything that we're dealing with is data. If we have something in a database, then that's data that we can pull insights out of to help our clients. So I think that we need to keep an awareness of the possible ways we can visualize that data or display it in a way that's helpful.

JEFF: Yeah, that’s what I was going to say. I think even if you don’t want to be a data scientist, especially if you’re a UI/UX guy, really, if you spend some time in some of the data science tools for data visualization, of course they’re not going to make visualizations that are as beautiful and interactive as I know you guys would make on your own. But it sort of gives you some ideas on the way that data is typically visualized, how to gain insights from a graph, what useful information can you glean from this graph? And then as a designer, if you have that in your tool belt, then we can make products that resonate better with the users whoever those may be.

MICHAEL: I think that there's a large part of data science that overlaps with design, because design is all about conveying the use of the thing, and that's what we're trying to do with data science. We're trying to take this regularized data that we have and make it useful, make it meaningful, and that's really the concern of design.

MINA: I totally agree. One of the things I mentioned about data science was presenting it. I think the presentation of the result is very, very important, because we tend to... Again, when I say "we", I mean the computer scientists. I know we have a lot of designers here, but it's just different, because we draw graphs, we draw ROCs: look at the area under this curve and look how beautiful it is. Nobody understands what we're talking about.

AUDIENCE MEMBER 1: How much overlap is there between having the competency to be a great data analyst and being someone who can clean data and build data warehouses? Is it hard to think that one person could do all of that?

MINA: From my experience, with data management, it depends on who you are approaching, right? Data management is one main topic in data science, as well as data analysis, because we're talking about a lot of applications where, depending on the company, the data is real time, it's coming in. Some of it is coming in at thousands of samples a second. And if you are taking on that job, you want to make sure you store it correctly from the beginning, so that you can actually use it later on, because you don't want to generate a mess on top of the messy data that it's collecting, right? So, can one person do it all? Probably not, depending on the project. If it's a project where the data is ready to go, or at least structured, then yes, probably one person can do it. But data science is way more than just the data analysis part.

MINA: And honestly, to tell you the truth, a lot of times, people don't even know what kind of data they have. Labeling is a very serious issue with what they're giving us. Another problem is balance. You want to do anomaly detection. They give you a million rows, a million samples of data. There are fewer than a thousand anomalies in there. So there's really a huge imbalance. But there are techniques, and a lot of research is going on in every single one of these. I think they need to be open-minded about this. This is a new field. It's not that anybody would know it all, and we adjust ourselves to each project. But I think what they care about is whether you have shown some expertise in some things in your portfolio. Most companies would go for it, because they can't expect that you go there and know exactly everything, because problems are very different from one to the next. And I would suggest actually looking at the data yourself, because a lot of times, they think the data has those things. But that's not what it is, really.
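
One common way to handle the kind of imbalance Mina describes is to weight the rare class more heavily during training. The sketch below computes inverse-frequency class weights, the same heuristic scikit-learn applies for `class_weight="balanced"`. It is only one of many possible techniques and not necessarily what her group uses. The counts (999,000 normal rows, exactly 1,000 anomalies) and the name `balanced_class_weights` are illustrative assumptions.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Inverse-frequency weights: n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in counts.items()}


# Hypothetical label column: 0 = normal, 1 = anomaly.
labels = [0] * 999_000 + [1] * 1_000
weights = balanced_class_weights(labels)

# A model that always predicts "normal" looks excellent on raw accuracy
# while catching no anomalies at all.
baseline_accuracy = labels.count(0) / len(labels)
```

With these assumed counts, each anomaly ends up weighted roughly a thousand times more than a normal row, while the do-nothing baseline still scores 99.9 percent accuracy, which is why raw accuracy says little on data like this.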

JEFF: I will say that, because we have such a talented team who's already built careers in the web development world, a lot of this idea of clean and scalable platforms for collecting data, those are problems that we're already pretty good at, I think, as a team. And so, coming in, the more academic idea of writing code for analysis rather than writing production code is basically where I was when I came here. At my last job, I mainly made predictions and made models, and then handed them off to the software team. And so, I think working together, it's not just that we want to teach you things, but you have also already taught me a lot. Michael has a crazy deep software background already. So he's taught me a lot, because we already speak the machine learning language together. I think if we bring people in with the right skills, and like Dr. Sartipi was saying with regards to interns, I think we can train up some of the more specific skills in people, honestly, in either direction, which is why we're having this panel. So, we have a bunch of great software engineers. I think we can train you, if you have interest, to be an effective machine learning practitioner, building products with machine learning.

AUDIENCE MEMBER 2: This is for Dr. Sartipi. With the kind of smart city data science work that you're doing, what type of production systems are you encountering in data acquisition and providing value for these companies or cities? What does your initial data acquisition look like for a new contract for the university?

MINA: We have a contract with TVA, and we can only work on their computers. We have access to the data. We see it all, but we don't have a copy of the data. So we have that kind of data. I have health data, the type we have purchased from Medicare, that is on a locked computer in a locked room. Only one student and I can see the data. But at the same time, I have data from a camera, where we have a live stream of data, and all of my students can see it. Basically, anybody can see that data. So we have different states of data. We also work with different departments in the city. We have all the 911 calls. And I think it's actually on an open data portal for Chattanooga. We have a little bit more detail than what is there. So we have data that is not as classified. I mean, health data is not classified, but these are not as sensitive. So we do work with different kinds of data. And we have to follow all the regulations for them. So we have a HIPAA compliance service and center, and we can only store our data there.

AUDIENCE MEMBER 2: So you have standard processes that you follow when starting these new engagements? Or is it going to be a pretty custom one-off, according to who you’re working with?

MINA: We do have kind of a standard, but it has to be flexible for the application, because in some cases, all we have is historical data. There is nothing new coming into the system, so that's it. We have waste data that the Tennessee Department of Health has collected for a long period of time. So that's it. We don't have any real-time access to that. Does that make sense? So there is the standard thing that we have, and we follow that. But then, we also have the data that comes in every second. So we have to have flexibility on that.

AUDIENCE MEMBER 2: Are you building real-time systems to work with the incoming data for when you do have your models up and running? 

MINA: We do have systems; it's near real-time. But yes, we do have systems that would be getting the new data. It's a combination of Python and SQL.

AUDIENCE MEMBER 3: So, we didn’t talk very much about reinforcement learning but with Google's DeepMind starting to have increasingly sensational accomplishments with AlphaGo and AlphaZero, essentially being able to take these closed systems and simulate things many thousands of times faster than human beings can process them, are we reaching the edges very quickly of what’s possible there? And do you believe that once we hit those edges, it would be very confined or will these accomplishments start to spill into the real world? With computers, and robotics, and things like that being able to do increasingly human things or superhuman things.

JEFF: I feel like, at this point, there's really a lack of expertise to interpret what they have done and bring it into the real world. It's growing in academia but then, it's slow to train up a new generation of students on this research that's happening so fast. I'm sure, and Dr. Sartipi will know more about it than I will, for sure, but my impression of some of the DeepMind stuff is it's amazing. But when you think about the problems that we encounter in the real world, the machine learning problems that we encounter, the problems that they are choosing, like games, lend themselves super easily to simulation. And building a system that simulates a game millions and millions of times is sort of trivial. They're doing way better machine learning things than I can ever do on that problem, but the problems that we encounter are a lot more complex. And people aren’t building generalized models yet that can just, like, look at all the constraints and say, “Oh, yeah. I can make a prediction now.” I mean, I'm pretty far removed from cutting-edge research so I can't say for sure, but it seems a long way off to me, especially regarding the lack of people. It's hard to encounter experts to talk to about this stuff in industry, so it seems like it may be a while and there may be a lot of space for us to do some stuff there.

MICHAEL: I think reinforcement learning is definitely the area of machine learning that I am most excited about. The issue that I see with it is that it takes so much data to train a model. So there's this trade-off of exploration and exploitation, right? You explore and you find what works on the problem, and then you have to, like, at some point, start trading off to doing the right thing. And so, I think what Jeff was saying, reinforcement learning lends itself really well to simulators, where you can train a lot and you can basically train faster than things could happen in the real world. But, when you're dealing with real-world data... I mean, IoT might be an exception to this, just because you can get so much data from so many different sensors. But I think that reinforcement learning is a little too slow for the amount of data that it takes at this point. I mean, I think that it's moving quickly, but I don’t know.
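
The exploration/exploitation trade-off Michael describes can be sketched with an epsilon-greedy agent on a simulated multi-armed bandit. This is a standard textbook illustration, not something the panel built: the function name `epsilon_greedy_bandit`, the three reward means, the noise level, and the value of `epsilon` are all assumptions chosen for the example. Because the rewards are simulated, the agent can take thousands of steps in a fraction of a second, which is the advantage of simulators the panel points out.

```python
import random


def epsilon_greedy_bandit(true_means, steps=5000, epsilon=0.1, noise=0.1, seed=0):
    """Run epsilon-greedy on a simulated bandit with Gaussian rewards.

    With probability epsilon the agent explores (random arm); otherwise it
    exploits the arm with the highest estimated reward so far.
    """
    rng = random.Random(seed)
    n_arms = len(true_means)
    counts = [0] * n_arms        # how many times each arm was pulled
    estimates = [0.0] * n_arms   # running mean reward per arm

    for _ in range(steps):
        if rng.random() < epsilon:
            arm = rng.randrange(n_arms)  # explore
        else:
            arm = max(range(n_arms), key=lambda a: estimates[a])  # exploit

        # Simulated reward: the arm's true mean plus Gaussian noise.
        reward = rng.gauss(true_means[arm], noise)

        # Incremental update of the running mean for the chosen arm.
        counts[arm] += 1
        estimates[arm] += (reward - estimates[arm]) / counts[arm]

    return counts, estimates


# Hypothetical reward means; the third arm is the best one.
counts, estimates = epsilon_greedy_bandit([0.2, 0.5, 0.9])
best_arm = counts.index(max(counts))
```

Raising `epsilon` spends more steps exploring; lowering it commits earlier to whatever currently looks best. In a real-world setting each step would be a real interaction, which is why the data cost adds up so quickly.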

MINA: I think a lot of the applications that we see are simplified versions of the real world. Everybody is talking about autonomous driverless cars, but see where we are now. Probably, all of us thought that by now, all the cars would be that great. But there are a lot of cases. I was at SAE World Congress for this, so you would expect to see autonomous vehicles there. But as an example, they had a shuttle there and this was also on the shuttle. So we were driving and a person comes in. So they were showing us that if there is a pedestrian, it would stop. So my question was, what if that person keeps standing there? Would the bus stay stopped, too? I know that it’s a simple thing, like go to the other lane or back up, or something. But there are so many things that... And their way was like, okay, the person moves, so the bus starts moving again. I think a lot of these things are still not there. And I think part of it is because the data is not there.

JEFF: I also think there's a lot of low-hanging fruit with simple machine learning models, and the biggest thing that's standing in the way of proliferation of this stuff is looking at the problem the right way and understanding that machine learning can be applied. I think simple models can go a long way on a lot of problems that just haven’t been tried yet.

AUDIENCE MEMBER 4: What are the fundamentals that would lead us or prepare us to start moving and become apprentices, or eventually to move into Master’s and be generalized, and all that?

MINA: If I were you, I would actually get my hands dirty with data. I would get any data, and I would start playing with it. And I think that's the best way because there is no, “Go to this block, then this block, then this block,” because it’s just so different. And the best way to learn is actually, like, playing with it. See where you go with it, because you have an open-ended question. You don’t even know what you’re getting to. To tell you the truth, a lot of times, our case is the same thing. With you guys, it would be more that, because this is a business, people come to you with a problem that they want you to solve. But for us, as researchers, we just get this data. We want to know what we can do with it. And then when we learn what the possibilities are, we will have a more focused direction. But I think I would just go; I wouldn't even be scared of data. I would just go and start from scratch. I would start with clean data and just play with that, and then I would go backward. One of the other things is, and again, I don't know, because it's a business, if you can use open source? I would use open source code, because why are we going to be reinventing the wheel? Somebody has written object detection. I mean, YOLO is perfect. We are using YOLO for a lot of our connected vehicle projects, for object detection. We have this camera on the campus. Sorry, I’m smiling because I remember that. It's facing one side of the crosswalk. On the crosswalk that it's facing, we kept getting messages that there's a keyboard on the road, because we were not visualizing it. It was just messages coming to me. And then we realized that it was a chip: the white paint had chipped off. YOLO was detecting that as a keyboard. But this is a computer. It doesn't realize that it's a car. This is the size of the car, this cannot be a keyboard. So, a simple brain can tell you that, but a computer cannot tell you that. We had to actually take keyboard out of that library. But what I'm saying is, I would make myself comfortable with knowing what open source code is out there that I can actually use, because it makes your life a lot easier. And then, of course, with any of those, you have to go in and fit it more to what you want to do. But I would definitely look into those as well.
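
One way to picture "taking keyboard out" is a post-processing step that drops detections whose class label makes no sense for a road scene. This is only a sketch under assumptions: it does not use YOLO's actual API, and the panel did not say how the team removed the class. The `Detection` record, the `IMPLAUSIBLE_ON_ROAD` set, the `filter_detections` helper, and the sample detections are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Detection:
    """A single detector output (hypothetical shape, not YOLO's own format)."""
    label: str
    confidence: float
    box: tuple  # (x, y, width, height) in pixels


# Labels that should never be reported for a crosswalk camera (assumption).
IMPLAUSIBLE_ON_ROAD = frozenset({"keyboard"})


def filter_detections(detections, excluded_labels=IMPLAUSIBLE_ON_ROAD):
    """Keep only detections whose label is not in the excluded set."""
    return [d for d in detections if d.label not in excluded_labels]


# Made-up detections for illustration only.
sample = [
    Detection("car", 0.91, (120, 80, 200, 110)),
    Detection("keyboard", 0.62, (300, 210, 90, 30)),
    Detection("person", 0.77, (40, 60, 30, 80)),
]

kept = filter_detections(sample)
kept_labels = [d.label for d in kept]
```

Filtering after detection leaves the open source model untouched, so it is a cheap first adjustment before considering retraining on a narrower set of classes.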

MICHAEL: And there are data sets that you can grab that are basically the Hello World of data science. So there's the Iris dataset for simple classification, and there’s ImageNet, and there's a bunch of different places you can pull data from.


MINA: Images for handwriting.

JEFF: Yeah. I think UC Irvine has one. One of the oldest machine learning.

MICHAEL: Repository of data science. You can punch in what you want to do with the data, and what sort of data you're interested in, and how many samples you want. And it will return data sets.

JEFF: And if you Google a lot of those data sets that are famous, then you’ll find people who’ve done analyses. Like toy analyses and stuff. I would say though, echoing Dr. Sartipi, pick a problem that you're interested in and see if you can find some data and do some stuff with it. In my opinion, the biggest skill that you need to learn, above all else, is what tools you have at your disposal, and when you see a problem, which tool you need, just like software in general. Honestly, if I could train somebody from scratch to do my job, that's what I would want from them. I think some people get intimidated by the Math and stuff, and it is important. But if you have a Computer Science degree, you’re probably in a good position, or any kind of scientific degree where you did a lot of Math, you’re probably in a good position. You can probably learn most of the stuff that you need to learn. It's really getting enough applications of machine learning under your belt so you'll start to realize, like, "Okay, I see a problem. I see that machine learning could be a solution here. And I know that I can grab this model and grab that model, and make something that can be useful." That’s, to me, the most important thing above anything else for the work that we do, anyway.