<?xml version="1.0" encoding="UTF-8"?><rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>MLlib Archives - Artificial Intelligence</title>
	<atom:link href="https://www.aiuniverse.xyz/category/mllib/feed/" rel="self" type="application/rss+xml" />
	<link>https://www.aiuniverse.xyz/category/mllib/</link>
	<description>Exploring the universe of Intelligence</description>
	<lastBuildDate>Fri, 07 Jun 2019 05:06:38 +0000</lastBuildDate>
	<language>en-US</language>
	<sy:updatePeriod>
	hourly	</sy:updatePeriod>
	<sy:updateFrequency>
	1	</sy:updateFrequency>
	<generator>https://wordpress.org/?v=6.9.1</generator>
	<item>
		<title>Apache Spark MLlib Tutorial</title>
		<link>https://www.aiuniverse.xyz/apache-spark-mllib-tutorial/</link>
					<comments>https://www.aiuniverse.xyz/apache-spark-mllib-tutorial/#respond</comments>
		
		<dc:creator><![CDATA[aiuniverse]]></dc:creator>
		<pubDate>Fri, 07 Jun 2019 05:06:38 +0000</pubDate>
				<category><![CDATA[MLlib]]></category>
		<category><![CDATA[Apache Spark]]></category>
		<category><![CDATA[Big data]]></category>
		<category><![CDATA[Classification Algorithms]]></category>
		<category><![CDATA[Machine learning]]></category>
		<category><![CDATA[Spark Mllib]]></category>
		<guid isPermaLink="false">http://www.aiuniverse.xyz/?p=3559</guid>

					<description><![CDATA[<p>Source:- towardsdatascience.com Introduction In this part of the series, we will put together everything we have learned to train a classification model. The objective is to learn how to build a complete classification workflow from the beginning to the end. Problem Definition The problem we are going to solve is the infamous Titanic Survival Problem. We are <a class="read-more-link" href="https://www.aiuniverse.xyz/apache-spark-mllib-tutorial/">Read More</a></p>
<p>The post <a href="https://www.aiuniverse.xyz/apache-spark-mllib-tutorial/">Apache Spark MLlib Tutorial</a> appeared first on <a href="https://www.aiuniverse.xyz">Artificial Intelligence</a>.</p>
]]></description>
										<content:encoded><![CDATA[<p>Source:- towardsdatascience.com</p>
<h3 id="5287" class="graf graf--h3 graf-after--figure">Introduction</h3>
<p id="9ea9" class="graf graf--p graf-after--h3">In this part of the series, we will put together everything we have learned to train a <strong class="markup--strong markup--p-strong">classification model</strong>. The objective is to learn how to build a complete classification workflow from the beginning to the end.</p>
<h3 id="cbef" class="graf graf--h3 graf-after--p">Problem Definition</h3>
<p id="53c7" class="graf graf--p graf-after--h3">The problem we are going to solve is the infamous <em class="markup--em markup--p-em">Titanic Survival Problem</em>. We are asked to build a machine learning model that takes passenger information and predicts whether the passenger survived.</p>
<h3 id="1ece" class="graf graf--h3 graf-after--figure">Preparing the Development Environment</h3>
<p id="a010" class="graf graf--p graf-after--h3">You should be familiar with this step now. We will open a new <em class="markup--em markup--p-em">Jupyter notebook</em>, import and initialize <em class="markup--em markup--p-em">findspark</em>, create a <em class="markup--em markup--p-em">Spark session</em> and finally <em class="markup--em markup--p-em">load </em>the data.</p>
<p id="01fa" class="graf graf--p graf-after--figure">Here is an example of how one might select or update features by analyzing the above tables:</p>
<ul class="postList">
<li id="3e18" class="graf graf--li graf-after--p">It does not make sense to include some features such as: <em class="markup--em markup--li-em">PassengerID</em>, <em class="markup--em markup--li-em">Name </em>and <em class="markup--em markup--li-em">Ticket </em>→ we will drop them</li>
<li id="ef34" class="graf graf--li graf-after--li"><em class="markup--em markup--li-em">Cabin </em>has a lot of null values → we will drop it as well</li>
<li id="9581" class="graf graf--li graf-after--li">Maybe the <em class="markup--em markup--li-em">Embarked </em>column has nothing to do with survival → let us remove it</li>
<li id="e255" class="graf graf--li graf-after--li">We are missing 177 values from the <em class="markup--em markup--li-em">Age </em>column → <em class="markup--em markup--li-em">Age</em> is important, so we need to find a way to deal with the missing values</li>
<li id="172c" class="graf graf--li graf-after--li"><em class="markup--em markup--li-em">Gender </em>has nominal values → we need to encode them.</li>
<li>
<h3 id="04b0" class="graf graf--h3 graf-after--pre">Feature Transformation</h3>
<p id="61b2" class="graf graf--p graf-after--h3">We will deal with the transformations one by one. In a future article, I will discuss how to improve the process using <strong class="markup--strong markup--p-strong">pipelines.</strong> But let us do it the boring way first.</p>
<h4 id="77dd" class="graf graf--h4 graf-after--p">Calculating Missing Age Values</h4>
<p id="be40" class="graf graf--p graf-after--h4"><em class="markup--em markup--p-em">Age </em>is an important feature; it is not wise to drop it because of some missing values. What we could do is to fill missing values with the help of existing ones. This process is called <strong class="markup--strong markup--p-strong">Data Imputation</strong>. There are many available strategies, but we will follow a simple one that fills missing values with the <em class="markup--em markup--p-em">mean value</em> calculated from the sample.</p>
<p id="dc43" class="graf graf--p graf-after--p"><strong class="markup--strong markup--p-strong">MLlib </strong>makes the job easy using the <strong class="markup--strong markup--p-strong">Imputer </strong>class. First, we define the estimator and fit it to the data, which produces a model; then we use that model as a transformer and apply it to the data.</p>
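<p>To make the mean strategy concrete, here is a minimal plain-Python sketch of what it computes. It is not MLlib code: the function name <code>impute_mean</code> and the sample ages are made up for illustration, and missing values are assumed to be represented as <code>None</code>.</p>

```python
from statistics import mean


def impute_mean(values):
    """Replace None entries with the mean of the observed (non-None) entries."""
    observed = [v for v in values if v is not None]
    if not observed:
        raise ValueError("cannot impute: every value is missing")
    fill = mean(observed)  # the "fit" step: learn the fill value
    # The "transform" step: fill the gaps, leave observed values untouched.
    return [fill if v is None else v for v in values]


# Hypothetical ages; None marks a missing value.
ages = [22.0, None, 26.0, 35.0, None]
imputed_ages = impute_mean(ages)
```

<p>Computing the mean plays the role of fitting, and filling the gaps plays the role of transforming.</p>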
</li>
<li>
<p id="a53b" class="graf graf--p graf-after--figure">No more missing values! Let us continue to the next step…</p>
<h4 id="dab7" class="graf graf--h4 graf-after--p">Encoding Gender Values</h4>
<p id="f00f" class="graf graf--p graf-after--h4">We learned that machine learning algorithms cannot work directly with categorical string values. So, we need to convert the <em class="markup--em markup--p-em">Gender </em>values to numeric indices:</p>
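<p>As a rough illustration of what indexing means, the plain-Python sketch below gives the most frequent label index 0, the next one index 1, and so on, which is also how MLlib's <code>StringIndexer</code> orders labels by default. The function name <code>fit_index</code>, the sample values and the alphabetical tie-breaking are assumptions made for this example.</p>

```python
from collections import Counter


def fit_index(labels):
    """Map each distinct label to an index, most frequent label first."""
    counts = Counter(labels)
    # Sort by descending frequency; break ties alphabetically (an assumption).
    ordered = sorted(counts, key=lambda label: (-counts[label], label))
    return {label: float(i) for i, label in enumerate(ordered)}


# Hypothetical Gender values.
genders = ["male", "female", "male", "male", "female"]
mapping = fit_index(genders)
indexed = [mapping[g] for g in genders]
```

<p>The mapping is learned once from the data and then applied to every row, the same estimator-then-transformer pattern we used for imputation.</p>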
</li>
<li>
<h4 id="d407" class="graf graf--h4 graf-after--p">Creating the Features Vector</h4>
<p id="f67a" class="graf graf--p graf-after--h4">We learned previously that MLlib expects data to be represented in two columns: a <em class="markup--em markup--p-em">features vector </em>and a <em class="markup--em markup--p-em">label column</em>. We have the <em class="markup--em markup--p-em">label </em>column ready (<em class="markup--em markup--p-em">Survived</em>), so let us prepare the <em class="markup--em markup--p-em">features vector</em>.</p>
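<p>Conceptually, building the features vector just means collecting the chosen numeric columns of each row into one ordered list and keeping the label next to it. Here is a small plain-Python sketch of that idea. The helper <code>assemble</code>, the column names <code>Pclass</code> and <code>GenderIndexed</code>, and the sample row are hypothetical.</p>

```python
def assemble(row, feature_names, label_name):
    """Return (features, label) for one row given as a dict."""
    # The order of feature_names fixes the order of the vector entries.
    features = [float(row[name]) for name in feature_names]
    return features, float(row[label_name])


# Hypothetical column names and values, after imputation and encoding.
feature_names = ["Pclass", "Age", "GenderIndexed"]
row = {"Pclass": 3, "Age": 22.0, "GenderIndexed": 0.0, "Survived": 0}
features, label = assemble(row, feature_names, "Survived")
```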
</li>
<li>
<h3 id="08f8" class="graf graf--h3 graf-after--p">Training the Model</h3>
<p id="9726" class="graf graf--p graf-after--h3">We will use a <strong class="markup--strong markup--p-strong">Random Forest Classifier</strong> for this problem. You are free to choose any other classifier you see fit.</p>
<p id="96f8" class="graf graf--p graf-after--p">Steps:</p>
<ol class="postList">
<li id="28b7" class="graf graf--li graf-after--p">Create an estimator</li>
<li id="0d25" class="graf graf--li graf-after--li">Specify the name of the features column and the label column</li>
<li id="8d55" class="graf graf--li graf-after--li">Fit the model<br />
<h3 id="f8b0" class="graf graf--h3 graf-after--p">Generating Predictions</h3>
<p id="9a59" class="graf graf--p graf-after--figure">We need to calculate some metrics to get the overall performance of the model.<strong class="markup--strong markup--p-strong"> Evaluation time…</strong></p>
<h3 id="0f82" class="graf graf--h3 graf-after--p">Model Evaluation</h3>
<p id="36d8" class="graf graf--p graf-after--h3">We will use a <strong class="markup--strong markup--p-strong">BinaryClassificationEvaluator</strong> to evaluate our model. It needs to know the name of the <em class="markup--em markup--p-em">label column</em> and the <em class="markup--em markup--p-em">metric name</em>.</p>
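<p>One metric this evaluator supports is the area under the ROC curve (<code>areaUnderROC</code>). To show what that number means, here is a plain-Python sketch that computes it as the fraction of (positive, negative) pairs in which the positive example gets the higher score. The function name, labels and scores are made up, and MLlib's own computation may differ in detail.</p>

```python
def area_under_roc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half a win
    return wins / (len(positives) * len(negatives))


# Hypothetical true labels and predicted survival scores.
labels = [1, 0, 1, 0, 1]
scores = [0.9, 0.4, 0.35, 0.2, 0.8]
auc = area_under_roc(labels, scores)
```

<p>A value of 1.0 means every survivor is scored above every non-survivor, and 0.5 is what random scores would give on average.</p>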
</li>
<li>
<p id="0dfe" class="graf graf--p graf-after--p">Given that we did nothing to configure the <em class="markup--em markup--p-em">hyperparameters</em>, the initial results are promising. I know that I did not evaluate it on a test set, but I trust you can do it.</p>
<h3 id="3f42" class="graf graf--h3 graf-after--p">Model Evaluation with scikit-learn</h3>
<p id="9721" class="graf graf--p graf-after--h3">If you want to generate other evaluations such as a confusion matrix or a classification report, you could always use the scikit-learn library.</p>
<p id="ee4a" class="graf graf--p graf-after--p">You only need to extract <em class="markup--em markup--p-em">y_true </em>and <em class="markup--em markup--p-em">y_pred </em>from your DataFrame.</p>
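<p>To see what a confusion matrix holds once you have <em>y_true</em> and <em>y_pred</em> as plain lists, here is a small sketch in plain Python, with no scikit-learn involved. The helper <code>confusion_counts</code> and the sample labels are hypothetical, and labels are assumed to be 0 or 1.</p>

```python
def confusion_counts(y_true, y_pred):
    """Count true/false positives and negatives for binary labels 0 and 1."""
    pairs = list(zip(y_true, y_pred))
    return {
        "tp": sum(1 for t, p in pairs if t == 1 and p == 1),
        "fp": sum(1 for t, p in pairs if t == 0 and p == 1),
        "fn": sum(1 for t, p in pairs if t == 1 and p == 0),
        "tn": sum(1 for t, p in pairs if t == 0 and p == 0),
    }


# Hypothetical labels: 1 = survived, 0 = did not survive.
y_true = [1, 0, 1, 1, 0, 0]
y_pred = [1, 0, 0, 1, 1, 0]
counts = confusion_counts(y_true, y_pred)
precision = counts["tp"] / (counts["tp"] + counts["fp"])
recall = counts["tp"] / (counts["tp"] + counts["fn"])
```

<p>Precision and recall, the main numbers in a classification report, follow directly from these four counts.</p>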
<h3 id="b5b7" class="graf graf--h3 graf-after--figure">Final Thoughts</h3>
<p id="8727" class="graf graf--p graf-after--h3 graf--trailing">Congrats! You have successfully completed another tutorial. You should be more confident with your MLlib skills now. In future tutorials, we are going to improve the preprocessing phase by using <strong class="markup--strong markup--p-strong"><em class="markup--em markup--p-em">pipelines</em></strong>, and I will show you more exciting MLlib features. Stay tuned…</p>
</li>
</ol>
</li>
</ul>
]]></content:encoded>
					
					<wfw:commentRss>https://www.aiuniverse.xyz/apache-spark-mllib-tutorial/feed/</wfw:commentRss>
			<slash:comments>0</slash:comments>
		
		
			</item>
	</channel>
</rss>
