{"id":17460,"date":"2020-11-19T15:00:53","date_gmt":"2020-11-19T15:00:53","guid":{"rendered":"http:\/\/www.wjst.de\/blog\/?p=17460"},"modified":"2021-02-09T14:17:45","modified_gmt":"2021-02-09T14:17:45","slug":"how-to-scrape-a-website-with-r-i-using-a-browser-generated-cookie","status":"publish","type":"post","link":"https:\/\/www.wjst.de\/blog\/sciencesurf\/2020\/11\/how-to-scrape-a-website-with-r-i-using-a-browser-generated-cookie\/","title":{"rendered":"How to scrape a website with R I: Using a browser generated cookie"},"content":{"rendered":"<p>While there are quite a few <a href=\"https:\/\/stackoverflow.com\/questions\/6224120\/login-to-php-website-using-rcurl\">SO examples<\/a> out there on how to manage the login, here are the necessary steps whenever you need to log in manually and have to start with a browser cookie. First install the &#8220;<a href=\"https:\/\/chrome.google.com\/webstore\/detail\/editthiscookie\/fngmhnnpilhplaeedifhccceomclgfbg?hl=de\">EditThisCookie<\/a>&#8221; plugin in Chrome and export the cookie<!--more--><\/p>\n<p>to the clipboard.<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" class=\"wp-image-17461 alignnone\" src=\"https:\/\/www.wjst.de\/blog\/wp-content\/uploads\/2020\/11\/screen-1.jpg\" alt=\"\" width=\"188\" height=\"186\" srcset=\"https:\/\/www.wjst.de\/blog\/wp-content\/uploads\/2020\/11\/screen-1.jpg 545w, https:\/\/www.wjst.de\/blog\/wp-content\/uploads\/2020\/11\/screen-1-506x500.jpg 506w\" sizes=\"auto, (max-width: 188px) 100vw, 188px\" \/><\/p>\n<p>Then paste the clipboard content into the R editor, starting at the 3rd line:<\/p>\n<pre class=\"brush: php; title: ; notranslate\" title=\"\">\r\ninstall.packages(c(&quot;rvest&quot;,&quot;stringr&quot;,&quot;rjson&quot;,&quot;RCurl&quot;))\r\nc.json &lt;- c('\r\n&#x5B;\r\n{\r\n    &quot;domain&quot;: &quot;www.domain.de&quot;,\r\n    &quot;hostOnly&quot;: true,\r\n    &quot;httpOnly&quot;: true,\r\n    &quot;name&quot;: &quot;MX_SID&quot;,\r\n    &quot;path&quot;: 
&quot;\/&quot;,\r\n    &quot;sameSite&quot;: &quot;unspecified&quot;,\r\n    &quot;secure&quot;: true,\r\n    &quot;session&quot;: true,\r\n    &quot;storeId&quot;: &quot;0&quot;,\r\n    &quot;value&quot;: &quot;4e1799f010efa4387874399253695521&quot;,\r\n    &quot;id&quot;: 1\r\n}\r\n]\r\n')\r\nlibrary(rjson)  # fromJSON()\r\nlibrary(RCurl)  # getCurlHandle(), getURL()\r\nresult &lt;- fromJSON(c.json)\r\n# expand &quot;~&quot; in R, since the path is handed on to libcurl as-is\r\nfn &lt;- path.expand(&quot;~\/cookies.txt&quot;)\r\nunlink(fn)\r\n# write one tab-separated line per cookie (Netscape cookie file format)\r\nfor (i in result) {\r\n  # session cookies carry no expirationDate: fall back to the largest 32-bit timestamp\r\n  e &lt;- if (is.null(i$expirationDate)) 2147483647 else floor(i$expirationDate)\r\n  cat( paste0(i$domain,&quot;\\t&quot;,\r\n    &quot;FALSE&quot;,&quot;\\t&quot;,\r\n    i$path,&quot;\\t&quot;,\r\n    &quot;TRUE&quot;,&quot;\\t&quot;,\r\n    e,&quot;\\t&quot;,\r\n    i$name,&quot;\\t&quot;,\r\n    i$value,&quot;\\n&quot; ), file=fn, append=TRUE )\r\n}\r\n# read cookies from fn and store updated cookies back to the same file\r\ncurl &lt;- getCurlHandle(cookiefile = fn,\r\n    cookiejar = fn,\r\n    useragent = &quot;Mozilla\/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.6) Gecko\/20070725 Firefox\/2.0.0.6&quot;,\r\n    verbose = TRUE)\r\ncontent &lt;- getURL(&quot;https:\/\/www.domain.de\/whatever&quot;,curl = curl)\r\nrm(curl)\r\n<\/pre>\n\n<p>&nbsp;<\/p>\n<div class=\"bottom-note\">\n  <span class=\"mod1\">CC-BY-NC Science Surf , accessed 24.04.2026<\/span>\n <\/div>","protected":false},"excerpt":{"rendered":"<p>While there are quite a few SO examples out there on how to manage the login, here are the necessary steps whenever you need to log in manually and have to start with a browser cookie. 
First install the &#8220;EditThisCookie&#8221; plugin in Chrome and export the cookie &nbsp; CC-BY-NC Science Surf , accessed 24.04.2026<\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"closed","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[9],"tags":[2893,3573,2847,3572,3570,3568,3569,3571],"class_list":["post-17460","post","type-post","status-publish","format-standard","hentry","category-computer-software","tag-r","tag-rcurl","tag-cookie","tag-rjson","tag-rvest","tag-scraper","tag-ssl","tag-stringr"],"_links":{"self":[{"href":"https:\/\/www.wjst.de\/blog\/wp-json\/wp\/v2\/posts\/17460","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.wjst.de\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.wjst.de\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.wjst.de\/blog\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/www.wjst.de\/blog\/wp-json\/wp\/v2\/comments?post=17460"}],"version-history":[{"count":8,"href":"https:\/\/www.wjst.de\/blog\/wp-json\/wp\/v2\/posts\/17460\/revisions"}],"predecessor-version":[{"id":18060,"href":"https:\/\/www.wjst.de\/blog\/wp-json\/wp\/v2\/posts\/17460\/revisions\/18060"}],"wp:attachment":[{"href":"https:\/\/www.wjst.de\/blog\/wp-json\/wp\/v2\/media?parent=17460"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.wjst.de\/blog\/wp-json\/wp\/v2\/categories?post=17460"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.wjst.de\/blog\/wp-json\/wp\/v2\/tags?post=17460"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}