Making Free Proxies with Tor and Ansible
2024-09-20
Est. 6m read
Was reading through Hacker News when I saw “We accidentally burned through 200GB of proxy bandwidth in 6 hours”. Brutal! 😅
I remember getting into Skyvern. Really interesting tech! Too bad the open-source models aren’t quite there yet. I’m not VC enough to spend AI credits on web scraping either.
The post was honest, and I certainly could’ve made the very same mistake! But recently I’ve been feeling more like the Chad pictured below:

My favorite trick at the moment for getting free proxies is to just use Tor. It doesn’t work with every website, but for the ones where it does, that’s ~2k+ proxies free of charge!
Getting Tor to work well as a web scraper took some work, and that’s what this post will be about.
Avoiding Detection
For the PoC (proof of concept), I started with webfp/tor-browser-selenium. It seemed like the natural place to start. Real quickly though, it became apparent that somehow… sites were detecting Selenium and rejecting my requests as bot-like.
I dove deep into the Firefox about:* pages looking for what could be the issue. I spent quite a while on it, trying things like setting privacy.resistFingerprinting to false, excluding domains, etc. In the end, I believe it was a combination of Selenium and the browser telling the website that it was being automated. “Marionette”, as they called it.
The Solution
This library is the solution: kaliiiiiiiiii/Selenium-Driverless. It hides the fact that Selenium is “driving” the browser and comes with some other patches that Selenium is (intentionally?) missing.
Since we’re using Selenium-Driverless, we can’t use the Tor browser that was included in the previous library. Now might be a good time to pick a better browser than Tor’s browser anyhow since it’s probably not what the average user is using.
Blending in
The trick to bypassing detection is to look as average as possible.
So we combine this sneaky Selenium client with a regular Chrome browser, but how do we connect that to Tor? Well, there’s a launch argument for that! It looks like:
./google-chrome --proxy-server=socks5://<HOST>:<PORT>
But How to Tor?
Tor is often used entirely through the “Tor Bundle” which includes the browser. But alternatively, you can install the tor service on a standard Linux machine and get a new connection on each instance.
This is not an "Exit Node"
Exit Nodes are a lot more involved in their setup. Installing tor just gives you a connection to the network, so you won’t have to worry about other people using your IP to surf the web.
Within my /etc/tor/torrc file [1]:
SocksPort 0.0.0.0:9050
ControlPort 0.0.0.0:9051
Log notice stdout
DataDirectory /var/lib/tor
HashedControlPassword <HASHED_PASSWORD_HERE>
These are mostly defaults. The 0.0.0.0 is to ensure we can connect from machines on the local network.
The SocksPort is our proxy and the ControlPort is used to control the tor service (e.g. renewing the IP).
The password is generated with tor --hash-password password_here and is used to authenticate on :9051.
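To make the :9051 side a bit more concrete, here’s a rough sketch of the exchange using only Python’s standard library. The control protocol is line-based text: you send AUTHENTICATE with the quoted plain-text password (the one you hashed for the torrc), then a command such as SIGNAL NEWNYM, and each one gets a status line back that starts with 250 on success. The function names here are my own invention, and this ignores multi-line replies and other edge cases, so treat it as an illustration rather than a real client.

```python
import socket


def quote_password(password: str) -> str:
    """Wrap the password in double quotes, escaping backslashes and quotes."""
    escaped = password.replace("\\", "\\\\").replace('"', '\\"')
    return f'"{escaped}"'


def build_control_commands(password: str) -> list[bytes]:
    """Commands that authenticate and then ask Tor for fresh circuits."""
    return [
        f"AUTHENTICATE {quote_password(password)}\r\n".encode(),
        b"SIGNAL NEWNYM\r\n",
    ]


def send_control_commands(host: str, port: int, password: str,
                          timeout: float = 10.0) -> list[str]:
    """Send each command in turn and collect the one-line reply to each."""
    replies = []
    with socket.create_connection((host, port), timeout=timeout) as sock:
        reader = sock.makefile("rb")
        for command in build_control_commands(password):
            sock.sendall(command)
            reply = reader.readline().decode().strip()
            replies.append(reply)
            if not reply.startswith("250"):
                break  # anything other than 250 means the command failed
    return replies
```

Calling send_control_commands("10.0.0.201", 9051, "password_here") against a running tor service should come back with two replies starting with 250 if the password matches the hash in the torrc.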
Creating Unlimited Proxies
In hindsight
I should’ve checked to see if there’s a Dockerized way of creating a Tor connection. Guess I needed an excuse to finally automate Proxmox. Feel free to deviate from what I did here, but the concepts will still apply.
To create a bunch of these tor services, I used Ansible and Proxmox to create 4 LXC containers, each one with Alpine and a static IP. I wanted them to be as lightweight as possible just in case I need more.
For your benefit, here’s the Ansible playbook:
- name: Create LXC containers for Tor proxies on Proxmox
  hosts: proxmox
  gather_facts: no
  vars:
    proxmox_api_host: "10.0.0.69"
    proxmox_api_user: "root@pam"
    proxmox_api_password: "hunter2"
    proxmox_node: "akon"
    container_password: "hunter2"
    containers:
      - { name: "torproxy1", id: 201, ip: "10.0.0.201" }
      - { name: "torproxy2", id: 202, ip: "10.0.0.202" }
      - { name: "torproxy3", id: 203, ip: "10.0.0.203" }
      - { name: "torproxy4", id: 204, ip: "10.0.0.204" }

  tasks:
    - name: Create LXC containers
      community.general.proxmox:
        api_host: "{{ proxmox_api_host }}"
        api_user: "{{ proxmox_api_user }}"
        api_password: "{{ proxmox_api_password }}"
        node: "{{ proxmox_node }}"
        vmid: "{{ item.id }}"
        hostname: "{{ item.name }}"
        ostemplate: 'local:vztmpl/alpine-3.19-default_20240207_amd64.tar.xz'
        password: "{{ container_password }}"
        netif: '{"net0":"name=eth0,ip={{ item.ip }}/24,gw=10.0.0.1,bridge=vmbr0"}' # gateway may need to change
        storage: local-lvm
        unprivileged: no
        onboot: yes
        features:
          - nesting=1
      loop: "{{ containers }}"

    - name: Start LXC containers
      community.general.proxmox:
        api_host: "{{ proxmox_api_host }}"
        api_user: "{{ proxmox_api_user }}"
        api_password: "{{ proxmox_api_password }}"
        node: "{{ proxmox_node }}"
        vmid: "{{ item.id }}"
        state: started
      loop: "{{ containers }}"

    - name: Configure SSH and install packages in containers
      ansible.builtin.command:
        cmd: >
          pct exec {{ item.id }} -- /bin/sh -c "
          apk update &&
          apk add openssh &&
          rc-update add sshd &&
          echo 'PermitRootLogin yes' >> /etc/ssh/sshd_config &&
          echo 'root:{{ container_password }}' | chpasswd &&
          rc-service sshd start &&
          apk add tor python3 &&
          echo 'SocksPort 0.0.0.0:9050' > /etc/tor/torrc &&
          echo 'ControlPort 0.0.0.0:9051' >> /etc/tor/torrc &&
          echo 'Log notice stdout' >> /etc/tor/torrc &&
          echo 'DataDirectory /var/lib/tor' >> /etc/tor/torrc &&
          echo 'HashedControlPassword <HASHED_PASSWORD_HERE>' >> /etc/tor/torrc &&
          rc-update add tor default &&
          rc-service tor start
          "
      loop: "{{ containers }}"

    - name: Wait for LXC containers to be ready
      ansible.builtin.wait_for:
        host: "{{ item.ip }}"
        port: 22
        timeout: 300
      loop: "{{ containers }}"

    - name: Verify Tor is running in containers
      ansible.builtin.command:
        cmd: pct exec {{ item.id }} -- rc-service tor status
      loop: "{{ containers }}"
      register: tor_status

    - name: Display Tor status
      ansible.builtin.debug:
        var: tor_status
Running the playbook:

Scaling this up to 8x, 16x, 32x is no problem. We can add each IP as a proxy and round robin through all of them to distribute the load. After each request, we can use the ControlPort to renew the IP and essentially get a new proxy.
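As a rough illustration of the round-robin idea on its own, here’s a small standard-library sketch. It assumes the containers keep following the same pattern as above (consecutive IPs starting at 10.0.0.201, SOCKS on 9050, control on 9051). The helper names and the example URLs are made up for the sketch and aren’t from my actual scraper.

```python
from itertools import cycle


def build_proxies(count, subnet="10.0.0", first_host=201,
                  socks_port=9050, control_port=9051):
    """Describe `count` Tor containers on consecutive IPs in one /24."""
    if count < 1 or first_host + count - 1 > 254:
        raise ValueError("count does not fit in the remaining /24 host range")
    return [
        {
            "host": f"{subnet}.{first_host + i}",
            "port": socks_port,
            "control_port": control_port,
        }
        for i in range(count)
    ]


def socks_url(proxy):
    """Value to hand to Chrome's --proxy-server flag."""
    return f"socks5://{proxy['host']}:{proxy['port']}"


def assign_round_robin(urls, proxies):
    """Pair each URL with the next proxy, wrapping around at the end."""
    rotation = cycle(proxies)
    return [(url, next(rotation)) for url in urls]


if __name__ == "__main__":
    # Hypothetical URLs, just to show the rotation.
    urls = [f"https://example.com/page/{n}" for n in range(5)]
    for url, proxy in assign_round_robin(urls, build_proxies(4)):
        print(url, "via", socks_url(proxy))
```

Going from 4 to 8 or 16 proxies is then just a matter of changing the count, as long as matching containers actually exist on those IPs.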
In Python there’s a library called stem for controlling Tor over the ControlPort.
Here’s most of the code I’m using to round robin:
import asyncio

from stem import Signal
from stem.control import Controller

PROXIES = [
    {"host": "10.0.0.201", "port": 9050, "control_port": 9051},
    {"host": "10.0.0.202", "port": 9050, "control_port": 9051},
    {"host": "10.0.0.203", "port": 9050, "control_port": 9051},
    {"host": "10.0.0.204", "port": 9050, "control_port": 9051},
]


async def renew_tor_ip(proxy):
    # Ask this tor service for fresh circuits, which normally means a new exit IP.
    with Controller.from_port(address=proxy["host"], port=proxy["control_port"]) as controller:
        controller.authenticate("password_here")
        controller.signal(Signal.NEWNYM)


async def run_session(proxy):
    # launch_chrome() and do_scraping() are the parts of my scraper not shown here.
    while True:
        await launch_chrome(proxy)
        await do_scraping()
        await renew_tor_ip(proxy)
        await asyncio.sleep(3)


async def main():
    # One long-running session per proxy, all running concurrently.
    sessions = [run_session(proxy) for proxy in PROXIES]
    await asyncio.gather(*sessions)


asyncio.run(main())
Wrapping Up
That’s it. Hopefully you got something out of this. There are still Cloudflare checks, IP blacklists, etc. that may cause trouble when scraping a web page, but I don’t think you’ll ever be hit with a $500 bill for using the Tor proxies!
One Final Trick
If you want to use a specific exit node’s IP, you can specify it with ExitNodes IP in the torrc [2]. You can even see the cached list of exit nodes with:
sudo grep -B3 "^s.*Exit" /var/lib/tor/cached-microdesc-consensus | grep "^r" | awk '{print $6 ":" $7}'
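If you’d rather do that filtering in Python, here’s a minimal sketch of the same idea. It assumes the layout the one-liner relies on: each relay has an "r" line whose 6th and 7th fields are the IP and ORPort, followed a few lines later by an "s" line listing its flags. The function names are my own. One small difference from the grep: this matches the exact Exit flag, so relays flagged only as BadExit are skipped.

```python
def parse_exit_relays(consensus_text):
    """Return "IP:ORPort" for every relay whose flag line includes Exit."""
    exits = []
    current = None  # address from the most recent "r" line, if any
    for line in consensus_text.splitlines():
        parts = line.split()
        if not parts:
            continue
        if parts[0] == "r":
            # Fields: r nickname identity date time IP ORPort DirPort
            current = f"{parts[5]}:{parts[6]}" if len(parts) >= 7 else None
        elif parts[0] == "s" and current is not None:
            if "Exit" in parts[1:]:
                exits.append(current)
            current = None  # the flag line closes out this relay's entry
    return exits


def load_exit_relays(path="/var/lib/tor/cached-microdesc-consensus"):
    """Read the cached consensus (usually needs root) and list exit relays."""
    with open(path, encoding="utf-8", errors="replace") as handle:
        return parse_exit_relays(handle.read())
```

Each entry comes out as IP:ORPort, same as the awk output.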