<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Markus Breitenbach</title>
	<atom:link href="http://blog.markus-breitenbach.com/feed/" rel="self" type="application/rss+xml" />
	<link>http://blog.markus-breitenbach.com</link>
	<description>AI, Data Mining, Machine Learning and other things</description>
	<lastBuildDate>Sat, 11 May 2013 23:56:57 +0000</lastBuildDate>
	<language>en-US</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.5.1</generator>
		<item>
		<title>Removing AVG Secure Search from Chrome</title>
		<link>http://blog.markus-breitenbach.com/2013/05/11/removing-avg-secure-search-from-chrome/</link>
		<comments>http://blog.markus-breitenbach.com/2013/05/11/removing-avg-secure-search-from-chrome/#comments</comments>
		<pubDate>Sat, 11 May 2013 23:56:57 +0000</pubDate>
		<dc:creator>Markus</dc:creator>
				<category><![CDATA[Fixing Stuff]]></category>

		<guid isPermaLink="false">http://blog.markus-breitenbach.com/?p=126</guid>
		<description><![CDATA[I use AVG as my anti-virus and today it installed the whole safe-search toolbar thing in the background while I was playing a video game. I&#8217;m pretty sure I didn&#8217;t consent to that, but whatever&#8230; Undoing the damage to my settings took quite a bit of work and I had trouble removing it from Chrome. [...]]]></description>
				<content:encoded><![CDATA[<p>I use AVG as my anti-virus and today it installed the whole safe-search toolbar thing in the background while I was playing a video game. I&#8217;m pretty sure I didn&#8217;t consent to that, but whatever&#8230; Undoing the damage to my settings took quite a bit of work and I had trouble removing it from Chrome. First I followed all the <a href="https://productforums.google.com/forum/?fromgroups=#!topic/chrome/f7BRB1WBunY" target="_blank">steps outlined here</a>, but pressing the home button in Chrome would still bring up the  &#8220;mysearch.avg.com&#8221; website no matter what settings I changed.</p>
<p><a href="http://www.avg.com/ww-en/faq.num-5200">Uninstalling the AVG toolbar component finally solved the Chrome start-page problem</a> for me.</p>
<p>Man, what a piece of work. I&#8217;m not alone in thinking that <a href="http://www.zdnet.com/avg-security-toolbar-is-the-worst-foistware-ive-ever-seen-7000001055/" target="_blank">AVG did something bad for the user</a> here.</p>
]]></content:encoded>
			<wfw:commentRss>http://blog.markus-breitenbach.com/2013/05/11/removing-avg-secure-search-from-chrome/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>On Statistical Significance Testing</title>
		<link>http://blog.markus-breitenbach.com/2013/03/26/on-statistical-significance-testing/</link>
		<comments>http://blog.markus-breitenbach.com/2013/03/26/on-statistical-significance-testing/#comments</comments>
		<pubDate>Tue, 26 Mar 2013 06:31:03 +0000</pubDate>
		<dc:creator>Markus</dc:creator>
				<category><![CDATA[Statistics]]></category>

		<guid isPermaLink="false">http://blog.markus-breitenbach.com/?p=122</guid>
		<description><![CDATA[I found a good read about the fallacies of statistical significance testing. See also here.]]></description>
				<content:encoded><![CDATA[<p>I found a good read about <a href="https://plus.google.com/u/0/111308405120672978476/posts">the fallacies of statistical significance testing</a>. See also <a href="http://andrewgelman.com/2013/03/25/the-harm-done-by-tests-of-significance/">here</a>.</p>
]]></content:encoded>
			<wfw:commentRss>http://blog.markus-breitenbach.com/2013/03/26/on-statistical-significance-testing/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Frequentist vs. Bayesian Inference</title>
		<link>http://blog.markus-breitenbach.com/2012/11/19/frequentist-vs-bayesian-inference/</link>
		<comments>http://blog.markus-breitenbach.com/2012/11/19/frequentist-vs-bayesian-inference/#comments</comments>
		<pubDate>Mon, 19 Nov 2012 07:37:10 +0000</pubDate>
		<dc:creator>Markus</dc:creator>
				<category><![CDATA[Statistics]]></category>

		<guid isPermaLink="false">http://blog.markus-breitenbach.com/2012/11/19/frequentist-vs-bayesian-inference/</guid>
		<description><![CDATA[I found a good article discussing the difference between the Frequentist and Bayesian approach to Inference.]]></description>
				<content:encoded><![CDATA[<p>I found a good article discussing the difference between the <a href="http://normaldeviate.wordpress.com/2012/11/17/what-is-bayesianfrequentist-inference/" target="_blank">Frequentist and Bayesian approach to Inference</a>.</p>
]]></content:encoded>
			<wfw:commentRss>http://blog.markus-breitenbach.com/2012/11/19/frequentist-vs-bayesian-inference/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Cross-VM Side Channels and Their Use to Extract Private Keys</title>
		<link>http://blog.markus-breitenbach.com/2012/10/28/cross-vm-side-channels-and-their-use-to-extract-private-keys/</link>
		<comments>http://blog.markus-breitenbach.com/2012/10/28/cross-vm-side-channels-and-their-use-to-extract-private-keys/#comments</comments>
		<pubDate>Mon, 29 Oct 2012 02:55:39 +0000</pubDate>
		<dc:creator>Markus</dc:creator>
				<category><![CDATA[Machine Learning]]></category>
		<category><![CDATA[Security]]></category>

		<guid isPermaLink="false">http://blog.markus-breitenbach.com/2012/10/28/cross-vm-side-channels-and-their-use-to-extract-private-keys/</guid>
		<description><![CDATA[Cool application of machine learning in the security field: extracting private keys from virtual machines running on shared hardware by training a Support-Vector-Machine model to classify data bits collected. http://www.cs.unc.edu/~reiter/papers/2012/CCS.pdf]]></description>
				<content:encoded><![CDATA[<p>Cool application of machine learning in the security field: extracting private keys from virtual machines running on shared hardware by training a Support-Vector-Machine model to classify data bits collected.</p>
<p><a href="http://www.cs.unc.edu/~reiter/papers/2012/CCS.pdf">http://www.cs.unc.edu/~reiter/papers/2012/CCS.pdf</a></p>
]]></content:encoded>
			<wfw:commentRss>http://blog.markus-breitenbach.com/2012/10/28/cross-vm-side-channels-and-their-use-to-extract-private-keys/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Classification with inputs that change over time &#8211; P2P Loan Data</title>
		<link>http://blog.markus-breitenbach.com/2012/10/06/classification-with-inputs-that-change-over-time-p2p-loan-data/</link>
		<comments>http://blog.markus-breitenbach.com/2012/10/06/classification-with-inputs-that-change-over-time-p2p-loan-data/#comments</comments>
		<pubDate>Sat, 06 Oct 2012 17:30:52 +0000</pubDate>
		<dc:creator>Markus</dc:creator>
				<category><![CDATA[Classification]]></category>
		<category><![CDATA[Machine Learning]]></category>
		<category><![CDATA[Statistics]]></category>

		<guid isPermaLink="false">http://blog.markus-breitenbach.com/2012/10/06/classification-with-inputs-that-change-over-time-p2p-loan-data/</guid>
		<description><![CDATA[Predicting whether a loan will default or not is a tricky task. It may involve many variables, incomplete information and is a task that involves time as a component. Loans may also perform for a while before they default. Some loans may even be late, but recover back to the regular payment schedule. It&#8217;s an [...]]]></description>
				<content:encoded><![CDATA[<p>Predicting whether a loan will default or not is a tricky task. It may involve many variables, incomplete information and is a task that involves time as a component. Loans may also perform for a while before they default. Some loans may even be late, but recover back to the regular payment schedule. It&#8217;s an interesting application for statistics.</p>
<p>The <a href="http://www.lendingclub.com/" target="_blank">LendingClub </a>website, a service offering <a href="https://www.lendingclub.com/public/how-peer-lending-works.action" target="_blank">peer-to-peer lending</a>, offers an interesting data set: historical data of loan performance as well as data for new loans. I&#8217;ve been playing around a bit with the data and built a model to predict whether a loan is a good investment. The <a href="https://www.lendingclub.com/info/download-data.action" target="_blank">LendingClub data is available for download</a>. A <a href="http://www.lendingclub.com/kb/index.php?View=entry&amp;EntryID=253" target="_blank">data dictionary</a> can be found on the website also.</p>
<p>First we need to define the outcome we want to predict. A loan can be in several states, some being &#8220;current&#8221;, others being &#8220;defaulted&#8221;, &#8220;late&#8221; or even on a &#8220;performing payment plan&#8221;. Conservatively, I defined all loans that were not &#8220;paid off&#8221; as bad. Loans that are &#8220;current&#8221; were excluded as they still can default in the future. Loans that are &#8220;late&#8221; are considered bad, because the borrower ran into problems. The model I&#8217;m trying to build is basically for a conservative investor looking for loans that will simply be paid back without a hitch. With the usual statistical techniques a model can be built and the performance can be measured by 10-fold cross-validation or by evaluating the model on a hold-out set. The real result of a prediction will of course only be available after about 3 years when a loan is fully paid off. As the measure to optimize I chose the <a href="http://en.wikipedia.org/wiki/Receiver_operating_characteristic" target="_blank">AUC metric</a>. A 10-fold cross-validation estimates the AUC of my model at 0.698, which is not too bad. The predictions implicitly make a few assumptions. The first one is that future performance of loans will be similar to historical performance of similar loans. I&#8217;m assuming a stationary distribution and the <a href="http://en.wikipedia.org/wiki/Independent_and_identically_distributed_random_variables" target="_blank">IID assumption</a> &#8211; which is not completely true in reality, but hopefully close enough <img src='http://blog.markus-breitenbach.com/wp-includes/images/smilies/icon_smile.gif' alt=':-)' class='wp-smiley' />  Also, inflation expectations were not taken into account, but I&#8217;m limiting my model to 36-month loans to make that more manageable.</p>
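<p>A minimal sketch of the labeling rule, the fold split and the AUC computation described above, using only the Python standard library. The status strings, function names and the simple interleaved fold split are assumptions made for illustration; this is not the code behind the 0.698 figure, and the real LendingClub export may spell the states differently.</p>

```python
from typing import List, Optional, Sequence

# Hypothetical status strings, chosen for illustration only.
EXCLUDED = {"current"}   # may still default later, so left out of training
GOOD = {"fully paid"}    # the only state treated as a good outcome


def label_loan(status: str) -> Optional[int]:
    """Return 1 for a bad loan, 0 for a paid-off loan, None if excluded."""
    s = status.strip().lower()
    if s in EXCLUDED:
        return None
    return 0 if s in GOOD else 1


def kfold_indices(n: int, k: int = 10) -> List[List[int]]:
    """Split row indices 0..n-1 into k interleaved test folds."""
    return [list(range(i, n, k)) for i in range(k)]


def auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """Chance that a random bad loan outscores a random good one.

    Ties count as half a win, which matches the area under the ROC curve.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one loan of each class")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))
```

<p>In a cross-validation loop, each fold from <code>kfold_indices</code> would serve once as the test set, the model would be fit on the remaining rows, and <code>auc</code> would be averaged over the folds. The pairwise AUC is quadratic in the number of loans, which is acceptable for a sketch but not for large data sets.</p>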
<p>I won&#8217;t go into the details of how I encoded the variables and what variables I&#8217;m using. I discovered that I can extract information out of the textual variables in the loans. The &#8220;Loan Description&#8221;, a free text field where potential borrowers can leave comments or answer questions, is quite predictive. The difficult part is using that information in practice. A loan is in &#8220;funding state&#8221; for two weeks, during which investors can ask questions and invest in the loan. Many loans get fully funded before the two-week period is over, some without any question or comment on the loan. New information may become available in the Loan Description field that may change the classification. That means, however, that the prediction may change over time &#8211; positively or negatively &#8211; after an investment decision has been made. Not ideal, but the variables are quite powerful so I&#8217;m still looking for a good solution.</p>
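<p>To make the drift problem concrete, here is a small sketch that scores successive snapshots of a loan description with a bag-of-words model. The word weights, the example texts and the function names are all made up for illustration; the post does not say how the text is actually encoded.</p>

```python
import re
from collections import Counter
from typing import Dict, List


def bag_of_words(text: str) -> Counter:
    """Lower-case the text and count its word tokens."""
    return Counter(re.findall(r"[a-z']+", text.lower()))


def text_score(text: str, weights: Dict[str, float]) -> float:
    """Weighted word count; words without a weight contribute nothing."""
    counts = bag_of_words(text)
    return sum(weights.get(word, 0.0) * n for word, n in counts.items())


def score_over_time(snapshots: List[str], weights: Dict[str, float]) -> List[float]:
    """Score each snapshot of the description in the order it was captured."""
    return [text_score(s, weights) for s in snapshots]


# Hypothetical weights: positive values push toward "bad loan".
weights = {"consolidate": -0.5, "medical": 0.75}
snapshots = [
    "I want to consolidate debt",
    "I want to consolidate debt. Update: medical bills too.",
]
scores = score_over_time(snapshots, weights)
```

<p>Here the score moves from -0.5 to 0.25 once the borrower adds a comment, which is exactly the situation where a prediction flips after the investment was made. One possible way to handle this is to record the score at decision time and track later scores separately.</p>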
<p>I made the <a href="https://cervisia.org/lc_credit/lc_credit_ratings.php">ratings for the LendingClub loans </a>my program produces public. I will update them occasionally (i.e., whenever I feel like it). If you have some suggestions on how to use the textual variables, leave a comment.</p>
]]></content:encoded>
			<wfw:commentRss>http://blog.markus-breitenbach.com/2012/10/06/classification-with-inputs-that-change-over-time-p2p-loan-data/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Preserving Privacy in Big Data</title>
		<link>http://blog.markus-breitenbach.com/2012/08/01/preserving-privacy-in-big-data/</link>
		<comments>http://blog.markus-breitenbach.com/2012/08/01/preserving-privacy-in-big-data/#comments</comments>
		<pubDate>Wed, 01 Aug 2012 15:58:09 +0000</pubDate>
		<dc:creator>Markus</dc:creator>
				<category><![CDATA[Data Mining]]></category>
		<category><![CDATA[Statistics]]></category>

		<guid isPermaLink="false">http://blog.markus-breitenbach.com/2012/08/01/preserving-privacy-in-big-data/</guid>
		<description><![CDATA[Interesting blog post on Differential Privacy. I wasn&#8217;t aware of this specific privacy model.]]></description>
				<content:encoded><![CDATA[<p>Interesting blog post on <a href="https://normaldeviate.wordpress.com/2012/07/31/differential-privacy/" target="_blank">Differential Privacy</a>. I wasn&#8217;t aware of this specific privacy model.</p>
]]></content:encoded>
			<wfw:commentRss>http://blog.markus-breitenbach.com/2012/08/01/preserving-privacy-in-big-data/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Puzzling Outcomes In Controlled Experiments</title>
		<link>http://blog.markus-breitenbach.com/2012/07/06/puzzling-outcomes-in-controlled-experiments/</link>
		<comments>http://blog.markus-breitenbach.com/2012/07/06/puzzling-outcomes-in-controlled-experiments/#comments</comments>
		<pubDate>Fri, 06 Jul 2012 14:07:30 +0000</pubDate>
		<dc:creator>Markus</dc:creator>
				<category><![CDATA[Statistics]]></category>

		<guid isPermaLink="false">http://blog.markus-breitenbach.com/2012/07/06/puzzling-outcomes-in-controlled-experiments/</guid>
		<description><![CDATA[A really interesting paper on A/B testing and experiments in online environments just got accepted to KDD 2012: puzzlingOutcomesInControlledExperiments.pdf Summary: Don&#8217;t make changes to your application if your average customer&#8217;s lifetime value will decline. Understand the change, consider alternative hypotheses, watch several metrics. Ensure that your findings align with the long-term strategy so that [...]]]></description>
				<content:encoded><![CDATA[<p>A really interesting paper on A/B testing and experiments in online environments just got accepted to KDD 2012:</p>
<p><a href="http://www.exp-platform.com/Documents/puzzlingOutcomesInControlledExperiments.pdf" title="Puzzling Outcomes in Controlled Experiments" target="_blank">puzzlingOutcomesInControlledExperiments.pdf</a></p>
<p>Summary:</p>
<ul>
<li><span class="comment"><font color="#000000">Don&#8217;t make changes to your application if your average customer&#8217;s lifetime value will decline. Understand the change, consider alternative hypotheses, watch several metrics. Ensure that your findings align with the long-term strategy so that long-term growth is not sacrificed for short-term financial gain. Example: Bing once had a bug that served poor search results, so distinct queries went up 10% and CTR on advertisements went up 30%. </font></span></li>
<li><span class="comment"><font color="#000000">Ensure that your statistical results are trustworthy. Incorrect results may cause bad ideas to be deployed; good ideas may be ruled out by mistake. </font></span></li>
<li><span class="comment"><font color="#000000">An upward trend in a newly launched feature does not imply that users like the feature more (delayed effect &amp; primacy effect). </font></span></li>
<li><span class="comment"><font color="#000000">Often running an experiment longer does not provide extra statistical power. Pick a duration and stick to it. Do not stop tests early (unless you use algorithms that tell you when you have enough statistical confidence to stop your test). </font></span></li>
<li><span class="comment"><font color="#000000">Re-run your experiment if you get surprising results. Investigating the underlying reasons is often worth it. </font></span></li>
<li><span class="comment"><font color="#000000">Watch for the carryover effect&#8230; Run A/A experiments. If you use bucketing techniques to assign participants to experiments, rerun the experiment with a larger test group and with local randomization. </font></span></li>
</ul>
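<p>The advice on trustworthy statistics and A/A experiments can be sketched with a two-proportion z-test. This is a generic textbook test picked for illustration, not the method used in the paper; the function names, sample sizes and conversion rate are assumptions.</p>

```python
import math
import random
from typing import Tuple


def two_proportion_z(conv_a: int, n_a: int, conv_b: int, n_b: int) -> Tuple[float, float]:
    """Pooled two-proportion z-test; returns (z, two-sided p-value)."""
    pooled = (conv_a + conv_b) / (n_a + n_b)
    if pooled in (0.0, 1.0):
        raise ValueError("no variation in outcomes, test is undefined")
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (conv_b / n_b - conv_a / n_a) / se
    p_value = math.erfc(abs(z) / math.sqrt(2))  # equals 2 * (1 - Phi(|z|))
    return z, p_value


def aa_false_positive_rate(runs: int = 200, n: int = 2000, rate: float = 0.1,
                           alpha: float = 0.05, seed: int = 0) -> float:
    """Simulate A/A tests: both arms share one true rate, so every
    'significant' result is a false positive."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(runs):
        a = sum(rng.random() < rate for _ in range(n))
        b = sum(rng.random() < rate for _ in range(n))
        _, p = two_proportion_z(a, n, b, n)
        hits += p < alpha
    return hits / runs
```

<p>With a sound setup the A/A false positive rate should land near <code>alpha</code>. A rate well above that hints at a problem in assignment or instrumentation, such as the carryover effect mentioned above. Note that the test assumes the sample size was fixed in advance; checking the p-value repeatedly and stopping at the first significant result inflates the false positive rate.</p>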
]]></content:encoded>
			<wfw:commentRss>http://blog.markus-breitenbach.com/2012/07/06/puzzling-outcomes-in-controlled-experiments/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Will 2012 be the year of Big Data?</title>
		<link>http://blog.markus-breitenbach.com/2012/01/28/will-2012-be-the-year-of-big-data/</link>
		<comments>http://blog.markus-breitenbach.com/2012/01/28/will-2012-be-the-year-of-big-data/#comments</comments>
		<pubDate>Sat, 28 Jan 2012 20:56:10 +0000</pubDate>
		<dc:creator>Markus</dc:creator>
				<category><![CDATA[Data Mining]]></category>

		<guid isPermaLink="false">http://blog.markus-breitenbach.com/2012/01/28/will-2012-be-the-year-of-big-data/</guid>
		<description><![CDATA[Interesting view on that here.]]></description>
				<content:encoded><![CDATA[<p>Interesting view on that <a href="http://www.fastcompany.com/1811441/why-big-data-won-t-make-you-smart-rich-or-pretty">here</a>.</p>
]]></content:encoded>
			<wfw:commentRss>http://blog.markus-breitenbach.com/2012/01/28/will-2012-be-the-year-of-big-data/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>UK plans to exempt data mining from copyright laws</title>
		<link>http://blog.markus-breitenbach.com/2011/08/14/uk-plans-to-exempt-data-mining-from-copyright-laws/</link>
		<comments>http://blog.markus-breitenbach.com/2011/08/14/uk-plans-to-exempt-data-mining-from-copyright-laws/#comments</comments>
		<pubDate>Mon, 15 Aug 2011 02:41:04 +0000</pubDate>
		<dc:creator>Markus</dc:creator>
				<category><![CDATA[Data Mining]]></category>

		<guid isPermaLink="false">http://blog.markus-breitenbach.com/2011/08/14/uk-plans-to-exempt-data-mining-from-copyright-laws/</guid>
		<description><![CDATA[The UK is in the process of overhauling their overly stringent copyright laws. That&#8217;s an interesting development (see the Nature blog entry on the topic). One idea being discussed is to generally allow data and text mining without the copyright holder&#8217;s permission, which would usually be required for any kind of electronic processing.]]></description>
				<content:encoded><![CDATA[<p>The UK is in the process of overhauling their overly stringent copyright laws. That&#8217;s an interesting development (see the <a href="http://blogs.nature.com/news/2011/08/data_mining_given_the_go_ahead.html" target="_blank">Nature blog entry on the topic</a>). One idea being discussed is to generally allow data and text mining without the copyright holder&#8217;s permission, which would usually be required for any kind of electronic processing.</p>
]]></content:encoded>
			<wfw:commentRss>http://blog.markus-breitenbach.com/2011/08/14/uk-plans-to-exempt-data-mining-from-copyright-laws/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Risk Assessment of Rare Events in adversarial Scenarios</title>
		<link>http://blog.markus-breitenbach.com/2011/06/21/risk-assessment-of-rare-events-in-adversarial-scenarios/</link>
		<comments>http://blog.markus-breitenbach.com/2011/06/21/risk-assessment-of-rare-events-in-adversarial-scenarios/#comments</comments>
		<pubDate>Tue, 21 Jun 2011 07:26:53 +0000</pubDate>
		<dc:creator>Markus</dc:creator>
				<category><![CDATA[Predictive Modeling]]></category>
		<category><![CDATA[Security]]></category>
		<category><![CDATA[Society]]></category>
		<category><![CDATA[Statistics]]></category>

		<guid isPermaLink="false">http://blog.markus-breitenbach.com/2011/06/21/risk-assessment-of-rare-events-in-adversarial-scenarios/</guid>
		<description><![CDATA[The RAND corporation just published an interesting paper exploring the benefits of using risk prediction to reduce the screening required at airports. You might have noticed various attempts to establish some kind of fast-lane or trusted traveler program. Obviously, this is a very sensitive topic and probably hard to get right. Screening certain groups of [...]]]></description>
				<content:encoded><![CDATA[<p>The RAND corporation just published an interesting paper exploring the benefits of using risk prediction to reduce the screening required at airports. You might have noticed various attempts to establish some kind of fast-lane or trusted traveler program. Obviously, this is a very sensitive topic and probably hard to get right. Screening certain groups of the population more than others (&#8220;profiling&#8221;) is generally frowned upon and also not a good idea in general (see &#8220;<a href="http://blog.markus-breitenbach.com/2010/01/10/strong-profiling-is-not-mathematically-optimal-for-discovering-rare-malfeasors-on-rare-event-detection/">Strong profiling is not mathematically optimal for discovering rare malfeasors on rare event detection</a>&#8221;), but what hasn&#8217;t been examined much is identifying people that can be considered more &#8220;safe&#8221; than others. The paper explores that idea and shows that, even under the assumption that the bad guys will try to subvert this program, there can be benefits to implementing this solution. The paper is a bit sparse on mathematical details. Certainly an interesting idea, though.</p>
<p><strong>Paper</strong>: <a href="http://www.rand.org/content/dam/rand/pubs/working_papers/2011/RAND_WR855.pdf">Assessing the Security Benefits of a Trusted Traveler Program in the Presence of Attempted Attacker Exploitation and Compromise</a></p>
]]></content:encoded>
			<wfw:commentRss>http://blog.markus-breitenbach.com/2011/06/21/risk-assessment-of-rare-events-in-adversarial-scenarios/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
	</channel>
</rss>
