{"id":68,"date":"2025-06-12T10:00:00","date_gmt":"2025-06-12T10:00:00","guid":{"rendered":"https:\/\/vicfolio.com\/blog\/?p=68"},"modified":"2025-06-17T08:18:55","modified_gmt":"2025-06-17T08:18:55","slug":"web-scraping-etico-con-python-extrae-datos-de-sitios-web-usando-beautiful-soup-y-scrapy","status":"publish","type":"post","link":"https:\/\/vicfolio.com\/blog\/?p=68","title":{"rendered":"Ethical Web Scraping with Python: Extract Data from Websites Using Beautiful Soup and Scrapy"},"content":{"rendered":"\n<p><strong>Web scraping<\/strong> is a technique for automatically collecting information from websites. It is useful for data analysis, trend monitoring, and building intelligent applications. However, it must be done <strong>ethically<\/strong>, respecting usage policies and copyright.<\/p>\n\n\n\n<p>In this tutorial, you will learn to extract data from websites using <strong>Python<\/strong>, <strong>Beautiful Soup<\/strong>, and <strong>Scrapy<\/strong>, with hands-on projects and good practices.<\/p>\n\n\n\n<hr class=\"wp-block-separator has-alpha-channel-opacity\"\/>\n\n\n\n<h2 class=\"wp-block-heading\"><strong>Project 1: Basic Scraper with Requests + BeautifulSoup (Public News)<\/strong><\/h2>\n\n\n\n<h3 class=\"wp-block-heading\"><strong>Installing the Packages<\/strong><\/h3>\n\n\n\n<p>First, install the required libraries:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>pip install requests beautifulsoup4<\/code><\/pre>\n\n\n\n<h3 class=\"wp-block-heading\"><strong>Scraper Code<\/strong><\/h3>\n\n\n\n<p>This example extracts news headlines from a public website:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>import requests\nfrom bs4 import BeautifulSoup\n\nurl = \"https:\/\/www.example.com\/news\"\n# A timeout keeps the script from hanging on a slow server\nresponse = requests.get(url, timeout=10)\n# Stop early on HTTP errors such as 404 or 500\nresponse.raise_for_status()\nsoup = BeautifulSoup(response.content, 'html.parser')\n\nnews_titles = 
soup.find_all('h2', class_='news-title')\nfor title in news_titles:\n    print(title.get_text(strip=True))<\/code><\/pre>\n\n\n\n<p><strong>Explanation:<\/strong><\/p>\n\n\n\n<ol class=\"wp-block-list\">\n<li>A <code>GET<\/code> request is sent to the site URL.<\/li>\n\n\n\n<li>The HTML content is parsed with <strong>BeautifulSoup<\/strong>.<\/li>\n\n\n\n<li>Every <code>h2<\/code> element with the class <code>news-title<\/code> is extracted and its text is printed.<\/li>\n<\/ol>\n\n\n\n<hr class=\"wp-block-separator has-alpha-channel-opacity\"\/>\n\n\n\n<h2 class=\"wp-block-heading\"><strong>Project 2: Advanced Scraper with Scrapy (Product Marketplace)<\/strong><\/h2>\n\n\n\n<h3 class=\"wp-block-heading\"><strong>Installing Scrapy<\/strong><\/h3>\n\n\n\n<pre class=\"wp-block-code\"><code>pip install scrapy<\/code><\/pre>\n\n\n\n<h3 class=\"wp-block-heading\"><strong>Creating the Spider<\/strong><\/h3>\n\n\n\n<p>Define a spider to extract products from a marketplace:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>import scrapy\n\nclass ProductSpider(scrapy.Spider):\n    name = \"product_spider\"\n    # Scrapy requests each of these URLs and passes the response to parse()\n    start_urls = &#91;\n        'https:\/\/www.example.com\/products',\n    ]\n\n    def parse(self, response):\n        # Each div.product block is treated as one product\n        products = response.css('div.product')\n        for product in products:\n            # .get() returns the first match, or None if nothing matches\n            yield {\n                'name': product.css('span.name::text').get(),\n                'price': product.css('span.price::text').get(),\n            }<\/code><\/pre>\n\n\n\n<p><strong>How it works:<\/strong><\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>The spider starts at the specified URL.<\/li>\n\n\n\n<li>It uses <strong>CSS selectors<\/strong> to extract product names and prices.<\/li>\n\n\n\n<li>It yields each product as a dictionary.<\/li>\n<\/ul>\n\n\n\n<hr class=\"wp-block-separator has-alpha-channel-opacity\"\/>\n\n\n\n<h2 class=\"wp-block-heading\"><strong>Project 3: Handling JavaScript with Selenium<\/strong><\/h2>\n\n\n\n<p>Some sites load dynamic content with JavaScript, which makes traditional scraping difficult because the data is not in the initial HTML response. <strong>Selenium<\/strong> automates a real browser so this content can be extracted after it renders.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\"><strong>Installing Selenium<\/strong><\/h3>\n\n\n\n<pre class=\"wp-block-code\"><code>pip install selenium<\/code><\/pre>\n\n\n\n<h3 class=\"wp-block-heading\"><strong>Example Code<\/strong><\/h3>\n\n\n\n<pre class=\"wp-block-code\"><code>from selenium import webdriver\nfrom selenium.webdriver.common.by import By\n\ndriver = webdriver.Chrome()\nurl = 'https:\/\/www.example.com\/dynamic-content'\ntry:\n    driver.get(url)\n    # Selenium 4 removed find_element_by_css_selector; pass a By locator instead\n    content = driver.find_element(By.CSS_SELECTOR, 'div.content').text\n    print(content)\nfinally:\n    # Close the browser even if the element is not found\n    driver.quit()<\/code><\/pre>\n\n\n\n<p><strong>Note:<\/strong> This requires Chrome and a matching Chrome driver (<code>chromedriver<\/code>). Selenium 4.6 and later can download the driver automatically through Selenium Manager.<\/p>\n\n\n\n<hr class=\"wp-block-separator has-alpha-channel-opacity\"\/>\n\n\n\n<h2 class=\"wp-block-heading\"><strong>Project 4: Scraping APIs vs. Direct HTML<\/strong><\/h2>\n\n\n\n<p>Many sites offer APIs for accessing their data. 
Using an API is generally more efficient than parsing HTML.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\"><strong>Example with Requests<\/strong><\/h3>\n\n\n\n<pre class=\"wp-block-code\"><code>import requests\n\nurl = 'https:\/\/api.example.com\/products'\nresponse = requests.get(url, timeout=10)\n# Stop early on HTTP errors instead of parsing an error page\nresponse.raise_for_status()\n# Assumes the endpoint returns a JSON list of objects with name and price keys\nproducts = response.json()\n\nfor product in products:\n    print(product&#91;'name'], product&#91;'price'])<\/code><\/pre>\n\n\n\n<p><strong>Advantages:<\/strong><\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Structured response (JSON\/XML).<\/li>\n\n\n\n<li>Lower load on the server.<\/li>\n<\/ul>\n\n\n\n<hr class=\"wp-block-separator has-alpha-channel-opacity\"\/>\n\n\n\n<h2 class=\"wp-block-heading\"><strong>Important Technical Components<\/strong><\/h2>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>Parsing complex HTML\/XML<\/strong>: Use <code>BeautifulSoup<\/code> or <code>Scrapy<\/code>.<\/li>\n\n\n\n<li><strong>Handling forms and cookies<\/strong>: Scrapy keeps cookies between requests and can submit login forms.<\/li>\n\n\n\n<li><strong>Proxy rotation<\/strong>: Avoids blocks by changing IP addresses.<\/li>\n\n\n\n<li><strong>Storage<\/strong>: Save data to CSV, JSON, or a database.<\/li>\n<\/ul>\n\n\n\n<hr class=\"wp-block-separator has-alpha-channel-opacity\"\/>\n\n\n\n<h2 class=\"wp-block-heading\"><strong>Good Practices and Warnings<\/strong><\/h2>\n\n\n\n<p>&#x2705; <strong>Review the terms of service<\/strong>: Do not violate the site's policies.<br>&#x2705; <strong>Check whether an API is available<\/strong>: Use tools such as Postman.<br>&#x2705; <strong>Handle changes in page structure<\/strong>: Update your scraper periodically.<br>&#x2705; <strong>Monitor performance<\/strong>: Avoid overloading servers.<br>&#x2705; <strong>Keep the code modular<\/strong>: This makes future updates easier.<\/p>\n\n\n\n<hr class=\"wp-block-separator has-alpha-channel-opacity\"\/>\n\n\n\n<h2 
class=\"wp-block-heading\"><strong>Conclusion<\/strong><\/h2>\n\n\n\n<p><strong>Ethical web scraping<\/strong> is a powerful tool when used correctly. With <strong>Python<\/strong>, <strong>BeautifulSoup<\/strong>, <strong>Scrapy<\/strong>, and <strong>Selenium<\/strong>, you can extract data efficiently and responsibly.<\/p>\n\n\n\n<p>Start experimenting and apply these techniques in your projects! &#x1f680;<\/p>\n","protected":false},"excerpt":{"rendered":"<p>Web scraping is a technique for automatically collecting information from websites. It is useful for data analysis, trend monitoring, and building intelligent applications. However, it must be done ethically, respecting usage policies and copyright. In this tutorial, you will learn to extract data from websites using Python, [&hellip;]<\/p>\n","protected":false},"author":1,"featured_media":69,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[11],"tags":[20,5,22],"class_list":["post-68","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-programacion","tag-programacion","tag-python","tag-scraping"],"_links":{"self":[{"href":"https:\/\/vicfolio.com\/blog\/index.php?rest_route=\/wp\/v2\/posts\/68","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/vicfolio.com\/blog\/index.php?rest_route=\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/vicfolio.com\/blog\/index.php?rest_route=\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/vicfolio.com\/blog\/index.php?rest_route=\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/vicfolio.com\/blog\/index.php?rest_route=%2Fwp%2Fv2%2Fcomments&post=68"}],"version-history":[{"count":2,"href":"https:\/\/vicfolio.com\/blog\/index.php?rest_route=\/wp\/v2\/posts\/68\/revisions"}],"predecessor-version
":[{"id":72,"href":"https:\/\/vicfolio.com\/blog\/index.php?rest_route=\/wp\/v2\/posts\/68\/revisions\/72"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/vicfolio.com\/blog\/index.php?rest_route=\/wp\/v2\/media\/69"}],"wp:attachment":[{"href":"https:\/\/vicfolio.com\/blog\/index.php?rest_route=%2Fwp%2Fv2%2Fmedia&parent=68"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/vicfolio.com\/blog\/index.php?rest_route=%2Fwp%2Fv2%2Fcategories&post=68"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/vicfolio.com\/blog\/index.php?rest_route=%2Fwp%2Fv2%2Ftags&post=68"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}