Source – https://www.analyticsinsight.net/
The new algorithm is an effective machine learning tool that is capable of extracting the desired information
Big data, evidently, is too large to be processed using conventional data processing tools and techniques. Most information systems produce data in quantities so large that they are difficult to measure. The complex big data that organizations have to deal with is characterized by huge volume, high value, great variability, high velocity, wide variety, and low veracity.
Scientific experiments are another area that generates huge amounts of data. Over time, researchers have come up with highly efficient ways to plan, conduct, and assess research, drawing on a combination of computational, algorithmic, statistical, and mathematical techniques. Whenever a scientific experiment is conducted, the results are usually transformed into numbers, and all this ultimately produces huge datasets. Such big data is not easy to handle, and extracting meaningful insights from it is trickier still. This is why every possible method of reducing the size of the data is being employed and tested. Today, different types of algorithms are used to reduce data size and to pave the way for extracting the principal features and insights, which ultimately throws light on the most critical part of the data: its statistical properties. On the downside, some of these algorithms cannot be applied directly to such large volumes of data.
With many researchers and programmers looking for ways to deal with this enormous volume of big data as efficiently as possible, Reza Oftadeh, a doctoral student in the Department of Computer Science and Engineering at Texas A&M University, has taken a step in this direction. Reza developed an algorithm which, according to him, is an effective machine learning tool because it is capable of extracting the desired information. Reza and his team, which comprises a couple of other doctoral students and some assistant professors, have published their research in the proceedings of the 2020 International Conference on Machine Learning. The research was funded by the National Science Foundation and a U.S. Army Research Office Young Investigator Award.
There is a fair chance that the dataset under consideration has high dimensionality, meaning that it has a lot of features. The problem with this is that it makes it harder to generalize from the data. This is why considerable effort goes into reducing the dimensionality of the data. Once the parts of the data that need dimensionality reduction have been identified, annotated samples of them are prepared to make further analysis easier. Beyond this, tasks such as classification, visualization, and modelling also run more smoothly.
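As a rough illustration of what reducing dimensionality means in practice, the sketch below uses principal component analysis (PCA), a standard technique that the article does not name and that is not claimed to be the team's method. The function name `pca_reduce` and the synthetic data are assumptions made for the example.

```python
import numpy as np

def pca_reduce(X, k):
    """Project the rows of X onto the top-k principal directions (illustrative helper)."""
    X_centered = X - X.mean(axis=0)
    # Rows of Vt are principal directions, sorted by decreasing singular value.
    _, singular_values, Vt = np.linalg.svd(X_centered, full_matrices=False)
    components = Vt[:k]
    # Fraction of the total variance kept by the first k directions.
    explained = (singular_values[:k] ** 2).sum() / (singular_values ** 2).sum()
    return X_centered @ components.T, explained

rng = np.random.default_rng(0)
# Hypothetical data: 200 samples with 10 features, driven by only 2 hidden factors.
latent = rng.normal(size=(200, 2))
mixing = rng.normal(size=(2, 10))
X = latent @ mixing + 0.01 * rng.normal(size=(200, 10))

Z, explained = pca_reduce(X, k=2)
print(Z.shape, round(float(explained), 4))
```

Here ten features are compressed to two while keeping almost all of the variance, because the synthetic data was built from two underlying factors. Real datasets rarely compress this cleanly.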
This is not the first time that such algorithms and methodologies have been put in place. They have been around for quite some time, but with big data growing exponentially, analysing it is not just time-consuming but also complicated. This is where artificial neural networks (ANNs) come in. ANNs are widely regarded as one of the most significant innovations on the technical front. They are made up of large numbers of artificial neurons, and their task is to extract meaningful information from the dataset provided. In simple terms, ANNs are models with a well-defined architecture of many interconnected artificial neurons, designed to simulate how the human brain analyses and processes data. ANNs have seen numerous applications so far, and the one that stands out is their ability to classify big data into different categories based on its features.
Asked for his views on this, Reza began by pointing out how much we rely on ANNs in day-to-day life. He cited Alexa, Siri, and Google Translate as examples of systems that are trained to understand what a person is saying. However, he also noted that not all of the features a network extracts are equally significant. He illustrated this with a specific type of ANN called an "autoencoder", which, he said, cannot tell where the features are located or which features are more critical than the rest. Running the model repeatedly does not solve the problem either, as this too is time-consuming.
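The limitation described above can be illustrated with a minimal linear autoencoder. This is a sketch under stated assumptions, not the team's model: the weights `W_enc` and `W_dec` are random rather than trained, and the data is synthetic. It shows that mixing the latent features with a rotation leaves the usual reconstruction loss unchanged, so that loss alone cannot say which feature is which or which matters most.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 6))      # hypothetical data: 100 samples, 6 features
W_enc = rng.normal(size=(3, 6))    # encoder: 6 inputs -> 3 latent features
W_dec = rng.normal(size=(6, 3))    # decoder: 3 latent features -> 6 outputs

def reconstruction_loss(X, W_enc, W_dec):
    """Mean squared error between X and its encode-then-decode reconstruction."""
    X_hat = X @ W_enc.T @ W_dec.T
    return np.mean((X - X_hat) ** 2)

# Build a random orthogonal matrix Q and use it to mix the latent features.
Q, _ = np.linalg.qr(rng.normal(size=(3, 3)))
W_enc_mixed = Q @ W_enc            # the latent features are now scrambled
W_dec_mixed = W_dec @ Q.T          # the decoder undoes the scrambling (Q.T is Q's inverse)

loss_a = reconstruction_loss(X, W_enc, W_dec)
loss_b = reconstruction_loss(X, W_enc_mixed, W_dec_mixed)
print(np.isclose(loss_a, loss_b))
```

The two sets of weights are different, yet the loss treats them as equally good, which is one way to see why the learned features come out in no particular order.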
Reza and his team aim to take their algorithm to the next level. They plan to add a new cost function to the network, which makes it possible to provide the exact location of the features. To test this, they ran an optical character recognition (OCR) experiment, training their machine learning model to convert images of both typed and handwritten text into machine-encoded text, using digitized physical documents. Once trained for OCR, the model can tell which features are important and should be given priority. The researchers claim that their machine learning tool would also cater to bigger datasets, resulting in improved data analysis.
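The article does not spell out the new cost function, so the following is only a hypothetical sketch of how a cost function could pin down the order of the features. It is not the team's formulation. The assumed idea is a "nested" loss that adds up the reconstruction error when using only the first feature, then the first two, and so on. The names `plain_loss` and `nested_loss` and the synthetic data are illustrative.

```python
import numpy as np

def plain_loss(X, W_enc, W_dec):
    """Ordinary reconstruction error using all latent features."""
    return np.sum((X - X @ W_enc.T @ W_dec.T) ** 2)

def nested_loss(X, W_enc, W_dec):
    """Sum of reconstruction errors using the first 1, 2, ..., k latent features."""
    total = 0.0
    for i in range(1, W_enc.shape[0] + 1):
        X_hat = X @ W_enc[:i].T @ W_dec[:, :i].T
        total += np.sum((X - X_hat) ** 2)
    return total

rng = np.random.default_rng(0)
# Hypothetical data whose 5 features have very different spreads.
X = rng.normal(size=(200, 5)) * np.array([5.0, 3.0, 1.0, 0.5, 0.1])
X = X - X.mean(axis=0)

# Top-3 principal directions, most important first.
_, _, Vt = np.linalg.svd(X, full_matrices=False)
W_enc_sorted, W_dec_sorted = Vt[:3], Vt[:3].T
# The same three features, but listed least important first.
W_enc_reversed, W_dec_reversed = W_enc_sorted[::-1], W_dec_sorted[:, ::-1]

print(np.isclose(plain_loss(X, W_enc_sorted, W_dec_sorted),
                 plain_loss(X, W_enc_reversed, W_dec_reversed)))
print(nested_loss(X, W_enc_sorted, W_dec_sorted)
      < nested_loss(X, W_enc_reversed, W_dec_reversed))
```

Under these assumptions the plain loss cannot distinguish the two orderings, while the nested loss is lower when the most important feature comes first, so minimizing it would favour an importance-ordered set of features.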
As of now, the algorithm that this group of researchers has come up with can deal with one-dimensional data samples only. However, the team intends to extend its capabilities so that it can handle even more complex unstructured data. The team is ready to face the challenges that come its way and to explore the algorithm as far as possible. The researchers will also work on generalizing their method, with the aim of providing a unified framework from which other machine learning methods can be derived. Ultimately, the objective remains to extract features while dealing with a smaller set of specifications.