added connector folder and HF file #313
base: main
Conversation
❌ Changes requested. Reviewed everything up to 68df99a in 53 seconds
More details
- Looked at 53 lines of code in 1 file
- Skipped 0 files when reviewing
- Skipped posting 0 drafted comments based on config settings

Workflow ID: wflow_g9aSZ6jnsjudJEhx
Want Ellipsis to fix these issues? Tag @ellipsis-dev in a comment. You can customize Ellipsis with 👍 / 👎 feedback, review rules, user-specific overrides, quiet mode, and more.
❌ Changes requested. Incremental review on 152a99e in 36 seconds
More details
- Looked at 29 lines of code in 2 files
- Skipped 0 files when reviewing
- Skipped posting 0 drafted comments based on config settings

Workflow ID: wflow_SGhnothmBoTYYbbD
❌ Changes requested. Incremental review on 2f783ca in 1 minute and 33 seconds
More details
- Looked at 54 lines of code in 3 files
- Skipped 0 files when reviewing
- Skipped posting 0 drafted comments based on config settings

Workflow ID: wflow_qrqWYnwFpCtSGGBI
connectors/huggingface_connecter.py
Outdated
    # Function to fetch data from a Hugging Face dataset
    def fetch_data_from_huggingface(dataset_identifier, dataset_split=None):
Consider adding error handling for network-related exceptions such as requests.exceptions.RequestException to ensure robustness when fetching data from external APIs.
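A minimal sketch of that suggestion might look like the following. The `loader` parameter is a hypothetical injection point standing in for `datasets.load_dataset`, used here only so the error path is easy to demonstrate:

```python
import requests

def fetch_data_from_huggingface(dataset_identifier, dataset_split=None, loader=None):
    """Fetch a dataset, returning None instead of crashing on network failure."""
    try:
        # `loader` stands in for datasets.load_dataset in this sketch
        return loader(dataset_identifier, split=dataset_split)
    except requests.exceptions.RequestException as e:
        # Network-level failures (timeouts, connection errors, HTTP errors)
        print(f"Network error while fetching {dataset_identifier!r}: {e}")
        return None
```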
connectors/huggingface_connecter.py
Outdated
    raise e  # Re-raise other ValueErrors

    # Load function to be used as a connector
    def load(dataset_identifier, dataset_split=None):
Consider also stripping the dataset_split argument to handle potential spaces:
    def load(dataset_identifier, dataset_split=None):
        data = fetch_data_from_huggingface(dataset_identifier.strip(), dataset_split.strip() if dataset_split else None)
examples/HF_example_usage.py
Outdated
    # Takes last two parts of url to get allenai/quartz
    atlas_dataset = huggingface_connecter.load('allenai/quartz')

    atlas_dataset.create_index(topic_model=True, embedding_model='NomicEmbed')
Add error handling around the create_index method to manage potential failures gracefully:
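One possible shape for that handling, sketched with a hypothetical `build_map` helper (the `create_index` arguments are taken from the snippet above; everything else is illustrative):

```python
import logging

def build_map(atlas_dataset):
    """Attempt to build a map, logging rather than crashing on failure."""
    try:
        return atlas_dataset.create_index(topic_model=True, embedding_model='NomicEmbed')
    except Exception as e:
        # Index creation can fail server-side; surface the error and continue
        logging.error("create_index failed: %s", e)
        return None
```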
RLesser
left a comment
some comments, maybe some things to change
connectors/huggingface_connecter.py
Outdated
    entry[key] = str(value)

    dataset.add_data(data=data)
are we making an index here?
Not quite; that part mainly just converts anything that isn't a string, like booleans or lists, into strings, because it will error out if I don't. I didn't really like that it is just a bunch of conditional statements, but I couldn't find a better way to resolve it.
connectors/huggingface_connecter.py
Outdated
    # Convert all booleans and lists to strings
    for entry in data:
two potential issues here:
- This seems like it would be extremely slow (and possibly crash) for large huggingface datasets.
- I'm a bit worried about assuming this is how people would want to handle these fields, but I guess they can edit it themselves if they want it done differently...
connectors/huggingface_connecter.py
Outdated
    # Processes dataset entries
    data = []
    for split in dataset.keys():
same question here about large datasets - are we sure this will not break?
Could I solve this with batch processing and by passing streaming=True through load_dataset? Aaron mentioned that, so it seems like that could prevent the issue.
The primary reason to avoid pure-random IDs or hash-based IDs is that they cause worst-case performance when used as keys in ordered data structures (such as b-tree indexes in a database). ULIDs improve on this by making the beginning of the ID a timestamp, so that IDs created around the same time have some locality to each other, but, like UUIDs, they are still kinda big.

Big (semi)random IDs like ULID are best used when you need uniqueness while also avoiding coordination, e.g. you have multiple processes inserting data into something and it would add a lot of complexity to make them cooperate to assign non-overlapping IDs - but in situations where you can use purely sequential IDs it is usually better to, as smaller IDs are cheaper to store and look up.

When using (Line 77 in 1f042be)
I believe this will not currently work when the dataset size exceeds available RAM on the machine running this - HF datasets understands slice syntax when specifying a split so you can test with portions of a very large dataset.

Making it work should be possible by working in chunks and using IterableDatasets: https://huggingface.co/docs/datasets/v2.20.0/en/about_mapstyle_vs_iterable#downloading-and-streaming

Here is a notebook where I'm uploading from an IterableDataset in chunks (note, though, that because I call load_dataset and then to_iterable_dataset this still downloads the entire dataset - you can also pass
Going off @apage43's comment, I feel strongly that we should be taking advantage of Hugging Face datasets' use of Arrow to pass data to Atlas, which also speaks fluent Arrow. We should also be taking advantage of batching or chunking for arbitrarily large datasets. Using base Python iterators means this will break for larger datasets.
…ssing and getting config without parsing through error message
connectors/huggingface_connector.py
Outdated
    try:
        # Loads dataset without specifying config
        dataset = load_dataset(dataset_identifier)
    except ValueError as e:
        # Grabs available configs and loads dataset using it
        configs = get_dataset_split_names(dataset_identifier)
        config = configs[0]
        dataset = load_dataset(dataset_identifier, config, trust_remote_code=True, streaming=True, split=config + "[:100000]")
I would, instead of this, just make the split (and max length) optional arguments to this function (and in turn, to the top level hf_atlasdataset as well) instead of silently checking for splits and grabbing the first one which may not be the one that a user intends - for example, on wikimedia/wikipedia which is split by language, that would be Abkhazian wikipedia, since ab is alphabetically first
also note that streaming=True makes load_dataset return an IterableDataset, which has slightly different behaviors than a normal Dataset - you should probably always use streaming if you're going to use it; otherwise you'd need to make sure to test both cases.
Also, the row limit should probably be an optional argument as well rather than hardcoded - someone may want to upload a dataset of more or fewer than 100k rows. While using split='splitname[:1000]' does work, it's also not the only way - the .take method on a dataset will return a new dataset with only that many rows (and on IterableDatasets will do the right thing and only fetch that many): dataset = dataset.take(1000) - this is probably more sensible for exposing the limit as an argument.
Another issue is that slicing or using .take will get the beginning of the dataset. Often, if you want to map a sample of a dataset (because you want to quickly get a picture of what's in it without spending the time/compute to map the whole thing), you want a random sample. For example, the wikipedia dataset is also in alphabetical order by title, so articles near the beginning will just be ones with titles starting with A - which probably won't give a very representative map of the whole dataset.
Datasets also have a .shuffle method which works similarly to .take and should be applied before .take. E.g. to get 1000 random rows from a dataset you want dataset = dataset.shuffle().take(1000) - it probably makes sense to use this any time a limit is specified, but it's not needed if the whole dataset will be uploaded.
examples/HF_example_usage.py
Outdated
    import logging

    if __name__ == "__main__":
        dataset_identifier = input("Enter Hugging Face dataset identifier: ").strip()
how about instead of making this an interactive script, use argparse: https://docs.python.org/3.10/library/argparse.html?highlight=argparse#module-argparse
that way it's easier to handle optional args like split and limit
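A minimal argparse version of the entry point might look like this sketch (the optional `--split` and `--limit` flags are the arguments discussed elsewhere in this thread, not the script's current interface):

```python
import argparse

def parse_args(argv=None):
    """Parse CLI arguments; argv is injectable for testing."""
    parser = argparse.ArgumentParser(
        description='Create an AtlasDataset from a Hugging Face dataset.')
    parser.add_argument('--dataset_identifier', required=True,
                        help='Hugging Face dataset identifier, e.g. allenai/quartz')
    parser.add_argument('--split', default='train',
                        help='dataset split to load')
    parser.add_argument('--limit', type=int, default=None,
                        help='optional cap on the number of rows to upload')
    return parser.parse_args(argv)
```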
connectors/huggingface_connector.py
Outdated
    # Convert the data list to an Arrow table
    table = pa.Table.from_pandas(pd.DataFrame(data))
you could probably do pa.Table.from_pylist(data) instead and avoid roundtripping through pandas here
connectors/huggingface_connector.py
Outdated
    def process_table(table):
        # Converts columns with complex types to strings
        for col in table.schema.names:
            column = table[col].to_pandas()
pyarrow.compute.cast may be able to handle some of this without having to go through pandas/pure python
https://arrow.apache.org/docs/python/generated/pyarrow.compute.cast.html
My code seems to throw an AttributeError without it. There are some cases where it works, but with some, like this one, it doesn't.
https://huggingface.co/datasets/Anthropic/hh-rlhf
can leave this as is for now then
connectors/huggingface_connector.py
Outdated
    # Adds data to the AtlasDataset
    dataset.add_data(data=processed_table.to_pandas().to_dict(orient='records'))
add_data accepts arrow tables directly, no need for this conversion (add_data will have to convert it back into an arrow table before uploading if you do this)
The current version of this has no create_index calls, so it'll only create an AtlasDataset with data in it but no map - is that intended?
connectors/huggingface_connector.py
Outdated
    # Gets data from HF dataset
    def get_hfdata(dataset_identifier, split="train", limit=100000):
        try:
Add docstring to function definition
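For instance (the wording, and the described return type, are just a suggestion based on the surrounding snippets):

```python
def get_hfdata(dataset_identifier, split="train", limit=100000):
    """Stream records from a Hugging Face dataset.

    Args:
        dataset_identifier: Hugging Face dataset id, e.g. 'allenai/quartz'.
        split: Dataset split to load (default 'train').
        limit: Maximum number of rows to fetch.

    Returns:
        A list of row dicts ready to be converted to an Arrow table.
    """
    ...
```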
connectors/huggingface_connector.py
Outdated
    # Load the dataset
    dataset = load_dataset(dataset_identifier, split=split, streaming=True)
Move load_dataset outside of try/except
connectors/huggingface_connector.py
Outdated
        # Load the dataset
        dataset = load_dataset(dataset_identifier, split=split, streaming=True)
    except ValueError as e:
We don't need to handle this error, seems like handling config through the error message is too risky
So you can take out the try/except here
connectors/huggingface_connector.py
Outdated
    # Processes dataset entries using Arrow
    id_counter = 0
    data = []
    if dataset:
if no dataset exists, function should fail
    column = table[col]
    if pa.types.is_boolean(column.type):
        table = table.set_column(table.schema.get_field_index(col), col, pc.cast(column, pa.string()))
    elif pa.types.is_list(column.type):
I think if you flatten the column as a list then cast as a string, the column will be too long and not match up with the other columns of the table. We may need to refactor this slightly by making sure column length is the same and structs are handled on a row by row basis
examples/HF_example_usage.py
Outdated
    if __name__ == "__main__":
        parser = argparse.ArgumentParser(description='Create an AtlasDataset from a Hugging Face dataset.')
        parser.add_argument('--dataset_identifier', type=str, required=True, help='The Hugging Face dataset identifier')
you may want to explicitly specify in the arg that it's a hf_dataset_identifier because atlas has its own datasets
❌ Changes requested. Incremental review on 9ae14f4 in 1 minute and 4 seconds
More details
- Looked at 140 lines of code in 2 files
- Skipped 0 files when reviewing
- Skipped posting 0 drafted comments based on config settings

Workflow ID: wflow_5mSlb7zHjU1nAKlz
    # Gets data from HF dataset
    def get_hfdata(dataset_identifier, split="train", limit=100000):
        splits = get_dataset_split_names(dataset_identifier)
        dataset = load_dataset(dataset_identifier, split=split, streaming=True)
The current implementation does not handle the case where the specified split is not available in the dataset. Previously, there was a mechanism to check available splits and use an alternative if the specified one was not found. Consider reintroducing this functionality to avoid runtime errors.
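A lightweight way to reintroduce that check might be a pure helper like this sketch; in the connector, `available` would come from the `get_dataset_split_names(dataset_identifier)` call already shown above:

```python
def resolve_split(requested, available):
    """Validate a requested split against the splits a dataset actually has."""
    if requested not in available:
        # Fail early with the options listed instead of a cryptic runtime error
        raise ValueError(
            f"Split {requested!r} not found; available splits: {available}")
    return requested
```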
HF file takes in any Huggingface identifier and then returns an AtlasDataset
Updates:
Testing:
Limitations:
Summary:
Introduced a new connector for Hugging Face datasets, processed data using Apache Arrow, and provided an example usage script.
Key points:
- connectors/huggingface_connector.py
- connectors/huggingface_connector.get_hfdata to load datasets and handle configuration issues
- connectors/huggingface_connector.hf_atlasdataset to create an AtlasDataset
- connectors/huggingface_connector.convert_to_string and connectors/huggingface_connector.process_table
- connectors/__init__.py and examples/HF_example_usage.py
- add_data accepts arrow tables directly

Generated with ❤️ by ellipsis.dev