<?xml version="1.0" encoding="UTF-8"?><rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Dropping data Archives - Artificial Intelligence</title>
	<atom:link href="https://www.aiuniverse.xyz/tag/dropping-data/feed/" rel="self" type="application/rss+xml" />
	<link>https://www.aiuniverse.xyz/tag/dropping-data/</link>
	<description>Exploring the universe of Intelligence</description>
	<lastBuildDate>Thu, 16 May 2019 06:11:24 +0000</lastBuildDate>
	<language>en-US</language>
	<sy:updatePeriod>
	hourly	</sy:updatePeriod>
	<sy:updateFrequency>
	1	</sy:updateFrequency>
	<generator>https://wordpress.org/?v=6.9.4</generator>
	<item>
		<title>3 simple ways to handle large data with Pandas</title>
		<link>https://www.aiuniverse.xyz/3-simple-ways-to-handle-large-data-with-pandas/</link>
					<comments>https://www.aiuniverse.xyz/3-simple-ways-to-handle-large-data-with-pandas/#comments</comments>
		
		<dc:creator><![CDATA[aiuniverse]]></dc:creator>
		<pubDate>Thu, 16 May 2019 06:11:24 +0000</pubDate>
				<category><![CDATA[Data Science]]></category>
		<category><![CDATA[chunk]]></category>
		<category><![CDATA[CSV]]></category>
		<category><![CDATA[data]]></category>
		<category><![CDATA[Dropping data]]></category>
		<category><![CDATA[RAM]]></category>
		<category><![CDATA[Science]]></category>
		<category><![CDATA[scientists]]></category>
		<category><![CDATA[techniques]]></category>
		<guid isPermaLink="false">http://www.aiuniverse.xyz/?p=3497</guid>

					<description><![CDATA[<p>Source:- towardsdatascience.com Pandas has become one of the most popular Data Science libraries out there. It’s easy to use, the documentation is fantastic, and its capabilities are powerful. <a class="read-more-link" href="https://www.aiuniverse.xyz/3-simple-ways-to-handle-large-data-with-pandas/">Read More</a></p>
<p>The post <a href="https://www.aiuniverse.xyz/3-simple-ways-to-handle-large-data-with-pandas/">3 simple ways to handle large data with Pandas</a> appeared first on <a href="https://www.aiuniverse.xyz">Artificial Intelligence</a>.</p>
]]></description>
										<content:encoded><![CDATA[<p>Source:- towardsdatascience.com</p>
<p id="3068" class="graf graf--p graf-after--figure">Pandas has become one of the most popular Data Science libraries out there. It’s easy to use, the documentation is fantastic, and its capabilities are powerful.</p>
<p id="c874" class="graf graf--p graf-after--p">Yet regardless of what library one uses, large datasets always present an extra challenge that needs to be handled with care.</p>
<p id="ac92" class="graf graf--p graf-after--p">You start to run into hardware roadblocks since you don’t have enough RAM to hold all the data in memory. Enterprise companies store datasets that run into the hundreds or even thousands of gigabytes.</p>
<p id="8f2b" class="graf graf--p graf-after--p">Even if you do happen to buy a machine that has enough RAM to store all that data, just reading it into memory is very slow.</p>
<p id="1fa4" class="graf graf--p graf-after--p">But once again the Pandas library is going to help us out. This article will talk about 3 techniques you can use to reduce the memory footprint and read-in time for your large dataset. I’ve used these techniques for datasets of over 100GB in size, squeezing them onto machines with 64 and sometimes 32GB of RAM.</p>
<p id="9701" class="graf graf--p graf-after--p">Let’s check them out!</p>
<h3 id="013d" class="graf graf--h3 graf-after--p">Chunking your data</h3>
<p id="9bb7" class="graf graf--p graf-after--h3">CSV format is a very convenient way to store data, being both easy to write to and human readable. Plus, there’s a nice pandas function <code class="markup--code markup--p-code">read_csv()</code> for loading up data that’s stored as CSV.</p>
<p id="d66f" class="graf graf--p graf-after--p">But what happens when your CSV is so big that you run out of memory?</p>
<p id="7269" class="graf graf--p graf-after--p">There’s a very simple pandas trick to handle that! Instead of trying to handle our data all at once, we’re going to do it in pieces. Typically, these pieces are referred to as <em class="markup--em markup--p-em">chunks</em>.</p>
<p id="2b50" class="graf graf--p graf-after--p">A chunk is just a part of our dataset. We can make that chunk as big or as small as we want. It just depends on how much RAM we have.</p>
<p id="19d9" class="graf graf--p graf-after--p">The process then works as follows:</p>
<ol class="postList">
<li id="fe50" class="graf graf--li graf-after--p">Read in a chunk</li>
<li id="9093" class="graf graf--li graf-after--li">Process the chunk</li>
<li id="8461" class="graf graf--li graf-after--li">Save the results of the chunk</li>
<li id="a45f" class="graf graf--li graf-after--li">Repeat steps 1 to 3 until we have all chunk results</li>
<li id="0e13" class="graf graf--li graf-after--li">Combine the chunk results</li>
</ol>
<p id="0445" class="graf graf--p graf-after--li">We can perform all of the above steps using a handy parameter of the <code class="markup--code markup--p-code">read_csv()</code> function called <strong class="markup--strong markup--p-strong">chunksize</strong>. The chunksize refers to how many CSV rows pandas will read at a time. This will of course depend on how much RAM you have and how big each row is.</p>
<p id="ae10" class="graf graf--p graf-after--p">If we think that our data follows a fairly simple distribution, such as a Gaussian, then we can perform our desired processing and visualisations one chunk at a time without too much loss of accuracy.</p>
<p id="5e4f" class="graf graf--p graf-after--p">If our distribution is a bit more complex, like a Poisson, then it’s best to filter each chunk and put all of the small pieces together before processing. Most of the time, you’ll end up dropping many irrelevant columns or removing rows that have missing values. We can do that for each chunk to make them smaller, then put them all together and perform our data analysis on the final dataframe.</p>
<p id="5724" class="graf graf--p graf-after--p">The code below performs all of these steps.</p>
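<p class="graf graf--p">A minimal sketch of that chunked workflow, using a small in-memory CSV and hypothetical column names in place of a real multi-gigabyte file:</p>

```python
import io

import pandas as pd

# A small in-memory CSV stands in for a real multi-gigabyte file
# (the "price" and "volume" columns are hypothetical).
csv_data = io.StringIO(
    "price,volume\n1.5,100\n2.0,\n3.25,300\n,400\n"
)

chunk_results = []
# chunksize=2 -> pandas yields the file two rows at a time.
for chunk in pd.read_csv(csv_data, chunksize=2):
    # Steps 2-3: process the chunk (here: drop rows with missing
    # values) and save the per-chunk result.
    chunk_results.append(chunk.dropna())

# Step 5: combine the chunk results into a single dataframe.
df = pd.concat(chunk_results, ignore_index=True)
print(len(df))  # 2 -- only the two complete rows survive
```

<p class="graf graf--p">On a real dataset you would pass a filename instead of the <code class="markup--code markup--p-code">StringIO</code> buffer and pick a chunksize in the tens or hundreds of thousands of rows, sized to fit your RAM.</p>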
<h3 id="1285" class="graf graf--h3 graf-after--figure">Dropping data</h3>
<p id="ccc2" class="graf graf--p graf-after--h3">Sometimes, we’ll know right off the bat which columns of our dataset we want to analyse. In fact, it’s often the case that there are several columns we don’t care about, like names, account numbers, etc.</p>
<p id="23d2" class="graf graf--p graf-after--p">Skipping those columns at read time, before the data ever lands in memory, can save tons of RAM. Pandas allows us to specify the columns we would like to read in:</p>
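<p class="graf graf--p">A short sketch using the <code class="markup--code markup--p-code">usecols</code> parameter of <code class="markup--code markup--p-code">read_csv()</code> (the column names are hypothetical):</p>

```python
import io

import pandas as pd

# Hypothetical dataset with columns we never plan to analyse.
csv_data = io.StringIO(
    "name,account_number,balance\nalice,12345,99.50\nbob,67890,12.25\n"
)

# usecols: read only the "balance" column; "name" and
# "account_number" are skipped entirely and never use any memory.
df = pd.read_csv(csv_data, usecols=["balance"])
print(df.columns.tolist())  # ['balance']
```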
<p id="c6c0" class="graf graf--p graf-after--figure">Throwing away the columns containing that useless miscellaneous information is going to be one of your biggest memory savings.</p>
<p id="8696" class="graf graf--p graf-after--p">The other thing we can do is filter out any rows with missing or NA values. This is easiest with the <code class="markup--code markup--p-code">dropna()</code> function:</p>
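<p class="graf graf--p">In its simplest form, <code class="markup--code markup--p-code">dropna()</code> with no arguments removes every row containing at least one missing value (toy data for illustration):</p>

```python
import pandas as pd

# A toy dataframe with some missing values.
df = pd.DataFrame({"a": [1, None, 3], "b": [4, 5, None]})

# Drop every row that contains at least one NA value.
clean = df.dropna()
print(len(clean))  # 1 -- only the first row is fully populated
```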
<p id="c9cb" class="graf graf--p graf-after--figure">There are a few really useful arguments that we can pass to <code class="markup--code markup--p-code">dropna()</code>:</p>
<ul class="postList">
<li id="5c1f" class="graf graf--li graf-after--p"><strong class="markup--strong markup--li-strong">how:</strong> this will let you specify either “any” (drop a row if any of its columns are NA) or “all” (drop a row only if all its columns are NA)</li>
<li id="9287" class="graf graf--li graf-after--li"><strong class="markup--strong markup--li-strong">thresh: </strong>keep only the rows that have at least that many non-NA values; rows with fewer are dropped</li>
<li id="a6a5" class="graf graf--li graf-after--li"><strong class="markup--strong markup--li-strong">subset:</strong> specify the subset of columns to consider when checking for NA values</li>
</ul>
<p id="cd46" class="graf graf--p graf-after--li">You can use those arguments, especially the <em class="markup--em markup--p-em">thresh</em> and <em class="markup--em markup--p-em">subset</em> to get really specific about which rows will be dropped.</p>
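<p class="graf graf--p">A quick sketch of all three arguments on a toy dataframe:</p>

```python
import pandas as pd

df = pd.DataFrame({
    "a": [1.0, None, None],
    "b": [4.0, 5.0, None],
    "c": [7.0, 8.0, 9.0],
})

# how="all": drop a row only when every column is NA -- no row here
# is entirely empty, so all three survive.
print(len(df.dropna(how="all")))     # 3

# thresh=2: keep only rows with at least two non-NA values.
print(len(df.dropna(thresh=2)))      # 2

# subset: only consider column "a" when checking for NAs.
print(len(df.dropna(subset=["a"])))  # 1
```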
<p id="f0b6" class="graf graf--p graf-after--p">Pandas doesn’t come with a way to do this at read time like with the columns, but we can always do it on each chunk as we did above.</p>
<h3 id="dddd" class="graf graf--h3 graf-after--p">Set specific data types for each column</h3>
<p id="2dc9" class="graf graf--p graf-after--h3">For many beginner Data Scientists, data types aren’t given much thought. But once you start dealing with very large datasets, dealing with data types becomes essential.</p>
<p id="c00a" class="graf graf--p graf-after--p">The standard practice tends to be to read in the dataframe and then convert the data type of a column as needed. But with a big dataset, we really have to be memory-space conscious.</p>
<p id="3627" class="graf graf--p graf-after--p">There may be columns in our CSV, such as floating point numbers, which will take up way more space than they need to. For example, if we downloaded a dataset for predicting stock prices, our prices might be saved as 32-bit floating point numbers!</p>
<p id="3696" class="graf graf--p graf-after--p">But do we <em class="markup--em markup--p-em">really</em> need 32-bit floats? Most of the time, stocks are bought at prices specified to two decimal places. Even if we wanted to be <em class="markup--em markup--p-em">really</em> accurate, float16 is more than enough.</p>
<p id="b610" class="graf graf--p graf-after--p">So instead of reading in our dataset with the columns’ original data types, we’re going to specify the data types we want pandas to use when reading in our columns. That way, we never use up more memory than we actually need.</p>
<p id="fea4" class="graf graf--p graf-after--p">This is easily done using the <strong class="markup--strong markup--p-strong">dtype</strong> parameter in the <code class="markup--code markup--p-code">read_csv()</code> function. We can specify a dictionary where each key is a column in our dataset and each value is the data type we want to use for that column.</p>
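<p class="graf graf--p">A minimal sketch with hypothetical column names, downsizing from the 64-bit defaults to 32-bit types:</p>

```python
import io

import pandas as pd

csv_data = io.StringIO("price,shares\n10.25,100\n99.99,250\n")

# dtype: a dict mapping each column name to the type pandas should
# use at read time, instead of the default float64/int64. Each of
# these 32-bit columns takes half the memory of its 64-bit default.
df = pd.read_csv(csv_data, dtype={"price": "float32", "shares": "int32"})
print(df.dtypes["price"])   # float32
print(df.dtypes["shares"])  # int32
```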
<p>The post <a href="https://www.aiuniverse.xyz/3-simple-ways-to-handle-large-data-with-pandas/">3 simple ways to handle large data with Pandas</a> appeared first on <a href="https://www.aiuniverse.xyz">Artificial Intelligence</a>.</p>
]]></content:encoded>
					
					<wfw:commentRss>https://www.aiuniverse.xyz/3-simple-ways-to-handle-large-data-with-pandas/feed/</wfw:commentRss>
			<slash:comments>3</slash:comments>
		
		
			</item>
	</channel>
</rss>
